How to Implement a Vertical Crawler System in Python

2023-06-29 09:06:36 · 154 views · 安东尼

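This article shows how to implement a vertical crawler system in Python, one module at a time: an HTML downloader, an HTML outputer, an HTML parser, a URL manager, and a main loop that ties them together.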

html_downloader

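from urllib import request


def download(url):
    """Fetch a page and return its raw bytes, or None on failure."""
    if url is None:
        return None
    response = request.urlopen(url)
    # Anything other than HTTP 200 is treated as a failed download.
    if response.getcode() != 200:
        return None
    return response.read()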
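The downloader above lets network errors propagate and waits indefinitely on a slow server. A minimal sketch of a more defensive variant follows; the `timeout` parameter and the decision to swallow errors and return `None` are assumptions for illustration, not part of the original module.

```python
from urllib import request


def download(url, timeout=10):
    """Return the response body as bytes, or None if the fetch fails."""
    if url is None:
        return None
    try:
        # The timeout keeps one slow server from stalling the whole crawl.
        with request.urlopen(url, timeout=timeout) as response:
            if response.getcode() != 200:
                return None
            return response.read()
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError and socket timeouts;
        # ValueError is raised for a malformed URL such as 'not-a-url'.
        return None
```

Returning `None` on every failure means the caller needs only one check before handing the content to the parser.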

html_outputer

data_list = []


def collect_data(data):
    # Store one parsed record (a dict) for later output.
    data_list.append(data)


def output_html():
    # Write as UTF-8 so non-ASCII titles survive on any platform.
    with open('output.html', 'w', encoding='utf-8') as fout:
        fout.write('<html>')
        fout.write('<body>')
        fout.write('<table>')
        for dataitem in data_list:
            fout.write('<tr>')
            fout.write('<td>%s</td>' % dataitem['url'])
            fout.write('<td>%s</td>' % dataitem['title'])
            fout.write('<td>%s</td>' % dataitem['datetime'])
            fout.write('<td>%s</td>' % dataitem['visitcount'])
            fout.write('</tr>')
        fout.write('</table>')
        fout.write('</body>')
        fout.write('</html>')
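The outputer writes scraped text straight into the table, so a title containing `<` or `&` would break the generated page. One possible fix, sketched below, is to escape each cell with the standard library's `html.escape`. The helper name `render_row` is hypothetical and does not appear in the original module.

```python
from html import escape


def render_row(item):
    """Build one <tr> for a record, escaping every cell value."""
    cells = (item['url'], item['title'], item['datetime'], item['visitcount'])
    tds = ''.join('<td>%s</td>' % escape(str(cell)) for cell in cells)
    return '<tr>' + tds + '</tr>'


row = render_row({
    'url': 'http://example.com/?a=1&b=2',
    'title': '<b>News</b>',
    'datetime': '2023-06-29',
    'visitcount': '154',
})
print(row)
```

Inside `output_html`, the per-field `fout.write` calls in the loop could then be replaced by a single `fout.write(render_row(dataitem))`.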

html_parser

import re
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def get_new_urls(page_url, soup):
    new_urls = set()
    # Article links look like /<digits>/<digits>/<word>/page.htm
    links = soup.find_all('a', href=re.compile(r"/\d+/\d+/\w+/page\.htm"))
    for link in links:
        new_url = link['href']
        # Resolve relative links against the page they were found on.
        new_full_url = urljoin(page_url, new_url)
        new_urls.add(new_full_url)
    return new_urls


def get_new_data(page_url, soup):
    res_data = {}
    title_node = soup.find('h2', class_='arti-title')
    # Pages without an article title (e.g. list pages) yield no data.
    if title_node is None:
        return res_data
    res_data['title'] = title_node.get_text()
    datetime_node = soup.find('span', class_='arti-update')
    res_data['datetime'] = datetime_node.get_text() if datetime_node else ''
    visitcount_node = soup.find('span', class_='WP_VisitCount')
    res_data['visitcount'] = visitcount_node.get_text() if visitcount_node else ''
    res_data['url'] = page_url
    return res_data


def parse(page_url, html_cont):
    if page_url is None or html_cont is None:
        # Return an empty pair so callers can always unpack two values.
        return set(), {}
    soup = BeautifulSoup(html_cont, 'html.parser', from_encoding='utf-8')
    new_urls = get_new_urls(page_url, soup)
    new_data = get_new_data(page_url, soup)
    return new_urls, new_data
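The link filter in `get_new_urls` is what makes the crawler "vertical": only hrefs that match the article pattern are followed. The sketch below isolates that step using only the standard library, so it can be tried without BeautifulSoup. The helper name `select_article_urls` and the sample paths are made up for illustration; only the regular expression comes from the parser above.

```python
import re
from urllib.parse import urljoin

# Same pattern as in the parser: /<digits>/<digits>/<word>/page.htm
ARTICLE_LINK = re.compile(r"/\d+/\d+/\w+/page\.htm")


def select_article_urls(page_url, hrefs):
    """Keep hrefs that match the article pattern, resolved to absolute URLs."""
    return {urljoin(page_url, href) for href in hrefs if ARTICLE_LINK.search(href)}


# Hypothetical hrefs, as they might appear on a list page.
hrefs = [
    '/2023/0629/c100a200/page.htm',                          # relative article link
    '/about/index.htm',                                      # not an article
    'http://news.zzuli.edu.cn/2023/0701/c100a202/page.htm',  # absolute article link
]
urls = select_article_urls('http://news.zzuli.edu.cn/', hrefs)
print(sorted(urls))
```

`urljoin` leaves an already absolute URL unchanged and resolves a relative one against the page URL, so both forms end up in the same normalized set.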

spider_main

import urls_manager, html_downloader, \
    html_parser, html_outputer


def craw(root_url):
    count = 1
    urls_manager.add_new_url(root_url)
    # Start the crawl loop
    while urls_manager.has_new_url():
        new_url = urls_manager.get_new_url()
        print('craw %d : %s' % (count, new_url))
        html_cont = html_downloader.download(new_url)
        new_urls, new_data = html_parser.parse(new_url, html_cont)
        urls_manager.add_new_urls(new_urls)
        if new_data:
            html_outputer.collect_data(new_data)
        # Stop after 10 pages
        if count == 10:
            break
        count = count + 1
    html_outputer.output_html()


if __name__ == '__main__':
    root_url = 'http://news.zzuli.edu.cn/'
    craw(root_url)
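The control flow of the main loop can be exercised without any network access by replacing the downloader and parser with a dictionary lookup. The sketch below is such a simulation, not the article's code: `FAKE_SITE`, `crawl`, and `max_pages` are hypothetical names, and it uses a FIFO `deque` in place of the `set.pop()` of `urls_manager`, so the visit order is deterministic.

```python
from collections import deque

# Hypothetical in-memory "site": each URL maps to the links found on that page.
FAKE_SITE = {
    'http://example.com/': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b': ['http://example.com/'],
    'http://example.com/c': [],
}


def crawl(root_url, max_pages=10):
    """Visit pages breadth-first, never twice, stopping after max_pages."""
    pending = deque([root_url])   # URLs waiting to be crawled
    seen = {root_url}             # every URL ever queued
    visited = []                  # URLs in the order they were crawled
    while pending and len(visited) < max_pages:
        url = pending.popleft()
        visited.append(url)
        for link in FAKE_SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                pending.append(link)
    return visited


print(crawl('http://example.com/'))
```

As in the real loop, the page limit bounds the crawl even when the site's link graph contains cycles.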

test_64

import re

from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print('All links')
links = soup.find_all('a')
for link in links:
    print(link.name, link['href'], link.get_text())

print('The lacie link')
link_node = soup.find('a', href='http://example.com/lacie')
print(link_node.name, link_node['href'], link_node.get_text())

print('Regex match')
link_node = soup.find('a', href=re.compile(r'ill'))
print(link_node.name, link_node['href'], link_node.get_text())

print('Text of the title paragraph')
p_node = soup.find('p', class_='title')
print(p_node.name, p_node.get_text())

urls_manager

new_urls = set()  # URLs waiting to be crawled
old_urls = set()  # URLs already crawled


def add_new_url(url):
    if url is None:
        return
    # Skip URLs that are already queued or already crawled.
    if url not in new_urls and url not in old_urls:
        new_urls.add(url)


def add_new_urls(urls):
    if urls is None or len(urls) == 0:
        return
    for url in urls:
        add_new_url(url)


def get_new_url():
    # Take an arbitrary pending URL and mark it as crawled.
    new_url = new_urls.pop()
    old_urls.add(new_url)
    return new_url


def has_new_url():
    return len(new_urls) != 0
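Because `urls_manager` keeps its state in module-level sets, only one crawl can run per process. A possible alternative, sketched here under that assumption, wraps the same logic in a class so each crawl owns its own queue. The class name `UrlManager` is hypothetical; the method names mirror the functions above.

```python
class UrlManager:
    """Track pending and already-crawled URLs for a single crawl."""

    def __init__(self):
        self.new_urls = set()  # waiting to be crawled
        self.old_urls = set()  # already crawled

    def add_new_url(self, url):
        if url is None:
            return
        # Ignore URLs that are already queued or already crawled.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def add_new_urls(self, urls):
        for url in urls or ():
            self.add_new_url(url)

    def has_new_url(self):
        return len(self.new_urls) != 0

    def get_new_url(self):
        # set.pop() returns an arbitrary element, so crawl order is unspecified.
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


manager = UrlManager()
manager.add_new_urls(['http://example.com/a', 'http://example.com/b',
                      'http://example.com/a'])
first = manager.get_new_url()
manager.add_new_url(first)  # already crawled, so it is not queued again
```

After these calls one URL remains pending and one is recorded as crawled, which shows the deduplication working in both directions.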

After reading this article, you should have a basic understanding of how to implement a vertical crawler system in Python. Thanks for reading!
