  • A Python novel scraper that beginners can follow at a glance

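    Back home this evening, I'm studying web scraping. Keep in mind that many sites are too hard for a beginner to scrape, so here is a simple one. Read on: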


    import urllib.request
    from bs4 import BeautifulSoup  # In PyCharm I had to install this package manually
    import lxml  # Same as above; BeautifulSoup uses it as the "lxml" parser



    def getHtml(url, headers):
        # Send a GET request with the given headers and return the raw response bytes.
        req = urllib.request.Request(url=url, headers=headers)
        with urllib.request.urlopen(req) as res:
            html = res.read()
        return html
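
To see which headers a request will carry before anything is sent, the Request object can be inspected on its own. This is a minimal sketch: the example.com URL and the shortened User-Agent are placeholders, and `build_request` is a hypothetical helper that is not part of the original script.

```python
import urllib.request


def build_request(url, headers):
    # Same construction as getHtml, but stops before urlopen, so nothing is sent.
    return urllib.request.Request(url=url, headers=headers)


# Placeholder values, for illustration only.
req = build_request("https://example.com/chapter/1", {"User-Agent": "Mozilla/5.0"})

print(req.full_url)
print(req.get_method())              # GET, because no request body was attached
print(req.get_header("User-agent"))  # urllib stores header names via str.capitalize()
```

Note that the header has to be looked up as "User-agent": urllib normalizes the name when it stores it.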

    def saveTxt(path, html):
        # Write the raw bytes to disk; "with" closes the file even if writing fails.
        with open(path, 'wb') as f:
            f.write(html)
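
saveTxt opens its file in 'wb' mode because the downloaded page is raw bytes, while the chapter text written later is a str and goes to a file opened in "w" mode. The difference can be sketched as follows, assuming the page is UTF-8 encoded; the sample string is a stand-in, not real page content.

```python
import os
import tempfile

# Stand-in for the bytes that res.read() returns.
raw = "第一章 Chapter One".encode("utf-8")

with tempfile.TemporaryDirectory() as folder:
    bin_path = os.path.join(folder, "page.html")
    txt_path = os.path.join(folder, "page.txt")

    # Bytes go to a file opened in binary mode, as saveTxt does.
    with open(bin_path, "wb") as f:
        f.write(raw)

    # Text is decoded first, then written in text mode with an explicit encoding.
    text = raw.decode("utf-8")
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read both files back to confirm nothing was lost.
    with open(bin_path, "rb") as f:
        same_bytes = f.read() == raw
    with open(txt_path, encoding="utf-8") as f:
        same_text = f.read() == text

print(same_bytes, same_text)
```

Passing the encoding explicitly avoids depending on the platform default, which on Windows is often not UTF-8.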

    def praseHtml(currentURL, headers, path):
        # html = html.decode('utf-8')
        chapter = 0
        flag = 1
        while flag:
            chapter = chapter + 1
            if chapter >= 30:  # Limit how many chapters are downloaded; too much data would swamp the computer.
                flag = 0  # Stop after this chapter
            html = getHtml(currentURL, headers)
            savePath = path + "\\" + str(chapter) + ".txt"  # The backslash must be escaped
            soup = BeautifulSoup(html, "lxml")  # The parser name is "lxml"; I wrote "html" the first time, an easy mistake to make
            nameText = soup.find('h3', attrs={'class': 'j_chapterName'})
            contentText = soup.find('div', attrs={'class': 'read-content j_readContent'})
            result = nameText.getText() + ' ' + contentText.getText()
            # Assumption: the first argument was a non-breaking space that got lost when the page was copied.
            result = result.replace('\xa0', ' ')
            with open(savePath, "w", encoding="utf-8") as f:
                f.write(result)

            nextpage = soup.find('a', attrs={'id': 'j_chapterNext'})
            if nextpage:  # Test the tag that was found, not the built-in next()
                currentURL = "http:" + nextpage['href']
            else:
                currentURL = None
                flag = 0
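
The loop prepends "http:" to the next-chapter link, which suggests the page uses protocol-relative links that start with "//". As an alternative, `urllib.parse.urljoin` from the standard library resolves such a link against the current page URL and keeps its scheme. This is a sketch with made-up chapter paths; `next_url` is a hypothetical helper, not part of the original script.

```python
from urllib.parse import urljoin


def next_url(current_url, href):
    # Return None when there is no link, otherwise an absolute URL.
    if not href:
        return None
    return urljoin(current_url, href)


current = "https://www.readnovel.com/chapter/100/1"  # made-up chapter path

print(next_url(current, "//www.readnovel.com/chapter/100/2"))  # protocol-relative link
print(next_url(current, "/chapter/100/3"))                     # site-relative link
print(next_url(current, None))                                 # no next chapter
```

Returning None for a missing link gives the loop a single value to test when deciding whether to stop.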

    def main():
        url = "https://www.readnovel.com/chapter/22160402000540402/107513768840595159"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}  # You can look up the request headers in your own browser (F12 -> Network -> refresh)
        path = r"D:\novel"  # Raw string, so "\n" is not read as a newline; the folder must already exist
        praseHtml(url, headers, path)

    main()
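
Backslashes in Windows paths need care in Python string literals: in a normal literal, "\n" is read as a newline, so a path such as D:\novel is safer as a raw string, with file names joined by a path function rather than by hand. The sketch below uses `ntpath` (the Windows flavour of `os.path`) only so that it behaves the same on any operating system; in the script itself `os.path.join` would be the usual choice. `chapter_path` is a hypothetical helper.

```python
import ntpath  # Windows flavour of os.path, importable on any OS


def chapter_path(folder, chapter):
    # Join folder and file name without hand-written backslashes.
    return ntpath.join(folder, str(chapter) + ".txt")


broken = "D:\novel"   # "\n" becomes a single newline character
fixed = r"D:\novel"   # the raw string keeps the backslash

print("\n" in broken)
print(len(broken), len(fixed))
print(chapter_path(fixed, 3))
```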
    Learning never ends!
  • Original article: https://www.cnblogs.com/litinghappy/p/9180434.html