VII Python (7): Web Crawlers

Tags: crawler, VII, Python | 2023-01-31 06:01:34 | 薄情痞子

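Web crawlers (also called web spiders):

Accessing the Internet from Python:

The urllib and urllib2 modules (Python 2.* has two separate modules, urllib and urllib2; Python 3, e.g. 3.4.1, merges them into a single package named urllib. Note that in version 3 urllib is a package, not a module);

The json module (JSON is a lightweight data-interchange format; here it is used to carry Python data structures as strings);

General format of a URL:

protocol://hostname[:port]/path/to/file

Common protocols: http, https, ftp, file, ed2k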
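To make the URL format concrete, here is a minimal sketch that splits a URL into the parts named above. It assumes Python 3 (where the parsing helpers live in urllib.parse) rather than the Python 2.7 used in the rest of this article; the function name describe_url and the example.com address are illustrative only.

```python
from urllib.parse import urlsplit

def describe_url(url):
    """Split a URL into the pieces of protocol://hostname[:port]/path/to/file."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,      # None when the URL does not name a port
        "path": parts.path,
    }

print(describe_url("http://www.example.com:8080/path/to/file"))
```

When the port is omitted, the client falls back to the protocol's default (80 for http, 443 for https), which is why the format shows it in square brackets.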

In [1]: import urllib

In [2]: dir(urllib)

……

'urlopen',

'urlretrieve']

In [6]: help(urllib.urlopen)

urlopen(url, data=None, proxies=None)

Create a file-like object for the specified URL to read from.

In [18]: help(urllib.urlretrieve)

urlretrieve(url, filename=None, reporthook=None, data=None)

In [19]: help(urllib.urlencode)

urlencode(query, doseq=0)

Encode a sequence of two-element tuples or dictionary into a URL querystring.
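As a rough illustration of what urlencode produces, the sketch below encodes a small dictionary into a query string. It assumes Python 3, where the function is urllib.parse.urlencode (in Python 2 it is urllib.urlencode, as shown above); the field values are made up for the example.

```python
from urllib.parse import urlencode, parse_qs

# a few form fields; the values are illustrative
form = {"type": "AUTO", "i": "hello world", "doctype": "json"}

query = urlencode(form)             # spaces become '+', pairs are joined with '&'
post_body = query.encode("utf-8")   # Python 3 expects bytes when this is sent as POST data

print(query)
print(parse_qs(query))              # decoding the string gives the fields back as lists
```

The same string can be appended to a URL after `?` for a GET request or sent as the body of a POST request.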

In [1]: import urllib2

In [2]: help(urllib2.urlopen)

urlopen(url, data=None, timeout=<object object>)

In [8]: help(urllib2.Request)

__init__(self, url, data=None, headers={}, origin_req_host=None, unverifiable=False)

add_header(self, key, val)

In [19]: help(urllib2.ProxyHandler)

__init__(self, proxies=None)

proxy_open(self, req, proxy, type)

In [23]: import json

In [24]: json.<TAB>

json.JSONDecoder json.decoder json.dumps json.load json.scanner

json.JSONEncoder json.dump json.encoder json.loads

In [24]: help(json.loads)

loads(s, encoding=None, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Deserialize ``s`` (a ``str`` or ``unicode`` instance containing a JSON

document) to a Python object.
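A small sketch of json.loads in action may help here. The sample string is a hypothetical payload, shaped only to match the way the translation script in this article indexes its result (target['translateResult'][0][0]['tgt']); the "src" field is an assumption and the real service's response may differ.

```python
import json

# hypothetical response text; only the nesting around 'tgt' is taken from the article
sample = '{"translateResult": [[{"src": "girl", "tgt": "女孩"}]]}'

target = json.loads(sample)                   # JSON text -> dict / list / str
tgt = target["translateResult"][0][0]["tgt"]  # dict -> list -> list -> dict
round_trip = json.dumps(target, ensure_ascii=False)  # back to JSON text

print(tgt)
print(round_trip)
```

json.loads maps JSON objects to dict and JSON arrays to list, so the two [0] indexes step through two nested arrays before reaching the inner object.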

In [10]: import time

In [11]: time.<TAB>

time.accept2dyear time.clock time.gmtime time.sleep time.struct_time time.tzname

time.altzone time.ctime time.localtime time.strftime time.time time.tzset

time.asctime time.daylight time.mktime time.strptime time.timezone

In [11]: help(time.sleep)

sleep(...)

sleep(seconds)

Example 1:

In [13]: response=urllib.urlopen('http://www.FishC.com')

In [14]: html=response.read()

In [15]: print html   # (if the printed content, i.e. the source you see via "Inspect Element" in the browser, looks garbled, decode it according to the site's encoding: html=html.decode('utf-8'))

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

<!--

-->

<html xmlns="http://www.w3.org/1999/xhtml">

<head>

<meta http-equiv="content-type" content="text/html; charset=utf-8" />

……

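In [19]: response.<TAB>   # (methods and attributes available on the opened page: geturl() returns the URL that was fetched, info() returns the response headers as an httplib.HTTPMessage object, getcode() returns the HTTP status code)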

response.close response.fp response.headers response.read response.url

response.code response.getcode response.info response.readline

response.fileno response.geturl response.next response.readlines

In [19]: response.geturl()

Out[19]: 'http://www.FishC.com'

In [20]: response.info()

Out[20]: <httplib.HTTPMessage instance at 0x16a7b48>

In [21]: print response.info()

In [22]: response.getcode()

Out[22]: 200
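The remark about decoding the page can be sketched as a small helper that picks the charset from a Content-Type header value instead of hard-coding 'utf-8'. This is an illustration under assumptions: it is written for Python 3, decode_body is a made-up name, and the header strings are examples rather than output from the site above.

```python
from email.message import Message

def decode_body(raw, content_type, default="utf-8"):
    """Decode response bytes using the charset named in a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset() or default   # None when no charset is given
    return raw.decode(charset, errors="replace")

print(decode_body("网页".encode("gbk"), "text/html; charset=GBK"))
print(decode_body("网页".encode("utf-8"), "text/html"))
```

With a real Python 3 response object the same lookup is available directly as response.headers.get_content_charset(), because the headers object is a subclass of email.message.Message.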

Example 2 (saving an image from placekitten.com):

[root@localhost ~]# vim download_cat.py

-----------------------script start-----------------------

#!/usr/bin/python2.7
#filename:download_cat.py

import urllib

# urlopen returns a file-like object; read() yields the raw image bytes
response=urllib.urlopen('http://placekitten.com/g/500/600')
cat_img=response.read()

# write in binary mode so the JPEG data is not altered
with open('cat_500_600.jpg','wb') as f:
    f.write(cat_img)

----------------------script end--------------------------

[root@localhost ~]# chmod 755 download_cat.py

[root@localhost ~]# python2.7 download_cat.py

[root@localhost ~]# ll cat_500_600.jpg

-rw-r--r--. 1 root root 26590 Jun 19 22:10 cat_500_600.jpg
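Reading the whole response into memory is fine for a small picture. For larger files, one possible variation is to copy the file-like response to disk in chunks. The sketch below assumes Python 3 and uses an in-memory io.BytesIO object as a stand-in for the object returned by urlopen, so it runs without network access; save_stream is an illustrative name.

```python
import io
import os
import shutil
import tempfile

def save_stream(response, path, chunk_size=64 * 1024):
    """Copy a file-like response to disk chunk by chunk; return the file size."""
    with open(path, "wb") as f:
        shutil.copyfileobj(response, f, chunk_size)
    return os.path.getsize(path)

# stand-in for urlopen(...): any object with a read() method works
fake_response = io.BytesIO(b"\xff\xd8" + b"\x00" * 1000)
target_path = os.path.join(tempfile.mkdtemp(), "cat.jpg")
print(save_stream(fake_response, target_path))
```

Because the helper only relies on read(), the same function would accept a real response object in place of the BytesIO stand-in.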

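Example 3 (imitating the browser's online translation):

In the web page, right-click and choose Inspect Element --> Network --> find the translate request; the content under Headers is what we need

Under Headers: the Request URL in the General section (translation only works with this address), the User-Agent in the Request Headers section (the server uses it to judge whether the visitor is a human or a program, but its value can be customized), and Form Data (the main content submitted by POST)

Note: GET requests data from the server; POST submits data to the specified server for processing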
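The GET/POST distinction can be seen without sending anything: urllib decides the method from whether a request carries data. The following sketch assumes Python 3's urllib.request (the article's scripts use Python 2's urllib2, which follows the same rule); the URL is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

url = "http://example.com/translate"   # placeholder address, never contacted here

get_req = Request(url)                                     # no data -> GET
post_req = Request(url, data=urlencode({"i": "girl"}).encode("utf-8"))  # data -> POST

print(get_req.get_method())
print(post_req.get_method())
```

Building a Request object does not open a connection, so this is a safe way to check what would be sent before calling urlopen.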

[root@localhost ~]# vim translation.py

---------------------------script start------------------------

#!/usr/bin/python2.7
#filename:translation.py

import urllib
import json

content=raw_input('please input translate content: ')

url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

# form fields copied from the Form Data shown in the browser
data={}
data['type']='AUTO'
data['i']=content
data['doctype']='json'
data['xmlVersion']='1.8'
data['keyfrom']='fanyi.web'
data['ue']='UTF-8'
data['action']='FY_BY_CLICKBUTTON'
data['typoResult']='true'
data=urllib.urlencode(data)

# passing data makes urlopen send a POST request
response=urllib.urlopen(url,data)
html=response.read()

# the response is JSON; parse it and pull out the translated text
target=json.loads(html)
print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])

-----------------------------script end---------------------------

[root@localhost ~]# python2.7 translation.py

please input translate content: girl

Translate the result: 女孩

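Notes:

Improvements to this script:

The code can be placed in a while loop that exits when the user enters quit or q;

This script is not suitable for production use: the server uses the User-Agent to tell human visitors from programs, and will block a program that sends too many requests. The fix is to override the User-Agent, in one of two ways: (1) define head={'User-Agent':'……'} in advance and pass it to urllib2.Request(url,data,head); (2) create req=urllib2.Request(url,data) first and then add the header with req.add_header();

Changing the User-Agent works, but the server also counts requests per IP address. Once the count exceeds a threshold, the server treats the client as a crawler and may demand a CAPTCHA. A human user can solve the CAPTCHA, but the script above cannot and will be blocked. Two fixes: (1) use the time module to delay each submission, e.g. time.sleep(3), so that the crawler looks like a human browsing normally; (2) use proxy IPs (the recommended method)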
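The two ways of setting the User-Agent can be compared side by side without contacting any server. This sketch assumes Python 3's urllib.request in place of urllib2; the URL is a placeholder and the browser string is the one used in this article's scripts.

```python
from urllib.request import Request

UA = ("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36")
url = "http://example.com/"   # placeholder address, never contacted here

# method 1: pass a headers dict when the Request is created
req1 = Request(url, headers={"User-Agent": UA})

# method 2: create the Request first, then add the header
req2 = Request(url)
req2.add_header("User-Agent", UA)

# urllib stores header names via str.capitalize(), hence the key "User-agent"
print(req1.get_header("User-agent") == req2.get_header("User-agent"))

# a misspelled name is kept as a separate, unrelated header
typo = Request(url, headers={"User-Agend": UA})
print(typo.has_header("User-agent"))
```

The last two lines show why the spelling matters: a header named User-Agend does not replace the default User-Agent, so the server still sees the library's own identifier.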

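Note:

Using a proxy IP takes three steps:

1) proxy_support=urllib2.ProxyHandler({'http':'112.111.53.173:8888'}); note that the argument must be a dictionary, in the form urllib2.ProxyHandler({'type':'proxy_ip:port'});

2) customize and create an opener (think of it as a made-to-order opener): opener=urllib2.build_opener(proxy_support);

3) install the opener with urllib2.install_opener(opener) so that urllib2.urlopen() uses it, or call opener.open(url) directly;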
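The three steps above can be sketched as follows. This version assumes Python 3, where ProxyHandler, build_opener and install_opener live in urllib.request; the proxy address is a hypothetical placeholder from a documentation-only range, and no request is sent.

```python
from urllib.request import ProxyHandler, build_opener, install_opener

proxy_address = "192.0.2.10:8888"   # hypothetical placeholder, not a working proxy

# step 1: the handler takes a dict mapping protocol type to 'ip:port'
proxy_support = ProxyHandler({"http": proxy_address})

# step 2: build an opener that routes http requests through the handler
opener = build_opener(proxy_support)
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

# step 3: install it globally so that urlopen() uses it from now on
install_opener(opener)

print(proxy_support in opener.handlers)
```

Installing the opener changes the behaviour of every later urlopen() call in the process; calling opener.open(url) instead keeps the proxy limited to that one opener.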

Example 4 (improving Example 3: changing the User-Agent, using method 1):

[root@localhost ~]# vim translation.py

----------------------script start--------------------

#!/usr/bin/python2.7
#filename:translation.py

import urllib
import urllib2
import json

while True:
    content=raw_input('please input translate content: ')
    # enter q to leave the loop
    if content=='q':
        break

    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

    # method 1: build the headers dict first and pass it to Request
    head={}
    head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'

    data={}
    data['type']='AUTO'
    data['i']=content
    data['doctype']='json'
    data['xmlVersion']='1.8'
    data['keyfrom']='fanyi.web'
    data['ue']='UTF-8'
    data['action']='FY_BY_CLICKBUTTON'
    data['typoResult']='true'
    data=urllib.urlencode(data)

    req=urllib2.Request(url,data,head)
    response=urllib2.urlopen(req)
    html=response.read()

    target=json.loads(html)
    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])

------------------------------script end----------------------

[root@localhost ~]# python2.7 translation.py

please input translate content: ladies

Translate the result: 女士们

please input translate content: gentleman

Translate the result: 绅士

please input translate content: q

Example 5 (improving Example 3: changing the User-Agent, using method 2):

[root@localhost ~]# vim translation.py

------------------------script start---------------------

#!/usr/bin/python2.7
#filename:translation.py

import urllib
import urllib2
import json

while True:
    content=raw_input('please input translate content: ')
    # enter q to leave the loop
    if content=='q':
        break

    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

    #head={}
    #head['User-Agent']='Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36'

    data={}
    data['type']='AUTO'
    data['i']=content
    data['doctype']='json'
    data['xmlVersion']='1.8'
    data['keyfrom']='fanyi.web'
    data['ue']='UTF-8'
    data['action']='FY_BY_CLICKBUTTON'
    data['typoResult']='true'
    data=urllib.urlencode(data)

    # method 2: create the Request first, then add the header to it
    req=urllib2.Request(url,data)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')
    response=urllib2.urlopen(req)
    html=response.read()

    target=json.loads(html)
    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])

----------------------------script end---------------------------

[root@localhost ~]# python2.7 translation.py

please input translate content: cat

Translate the result: 猫

please input translate content: dog

Translate the result: 狗

please input translate content: q

Example 6 (improving Example 3: when code accesses the translation server frequently, keep our IP from being blocked; method 1, delay the submissions, so that after each translated entry there is a 3 s pause before the next one is allowed):

[root@localhost ~]# vim translation.py

----------------script start----------------

#!/usr/bin/python2.7
#filename:translation.py

import urllib
import urllib2
import json
import time

while True:
    ……
    # pause 3 seconds before accepting the next entry
    time.sleep(3)

-----------------script end----------------

[root@localhost ~]# python2.7 translation.py

please input translate content: chinese

Translate the result: 中国

please input translate content: japanese

Translate the result: 日本

please input translate content: q
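A fixed time.sleep(3) always waits the full three seconds, even when the user took longer than that to type the next entry. One possible refinement, sketched here under assumptions (Python 3; Throttle is a made-up helper, not part of the article's script), is to wait only for the part of the interval that has not yet elapsed.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)   # sleep only for the time still owed
                now = self._clock()
        self._last = now

throttle = Throttle(0.2)   # 0.2 s only to keep the demo short; the article uses 3 s
for word in ["chinese", "japanese"]:
    throttle.wait()
    print("would translate:", word)
```

Calling throttle.wait() just before each request gives the same protection as the fixed sleep while adding no delay when requests are already spaced far enough apart.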

Example 7 (accessing a web page through a proxy):

Preparation (use http://www.whatismyip.com.tw/ to see the IP address currently in use, and http://www.xicidaili.com/ to obtain proxy IPs)

[root@localhost ~]# vim proxy_egg.py

---------------------script start--------------------

#!/usr/bin/python2.7
#filename:proxy_egg.py

import urllib2
import random

url='http://www.whatismyip.com.tw'

# pick one proxy at random for this run
ip_list=['110.6.35.181:8888','122.193.55.64:81']
proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})
opener=urllib2.build_opener(proxy_support)
#opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]

# from here on urllib2.urlopen() goes through the proxy
urllib2.install_opener(opener)
response=urllib2.urlopen(url)
html=response.read()
print html

-------------------------script end------------------------

[root@localhost ~]# python2.7 proxy_egg.py

<html>

<head>

<title>我的IP位址查詢</title>

</head>

<body>

<script type="text/javascript">

var sc_project=6392240;

var sc_invisible=1;

var sc_security="65d86b9d";

var scJsHost = (("https:" == document.location.protocol) ? "https://secure." : "http://www.");

document.write("<sc"+"ript type='text/javascript' src='" + scJsHost + "statcounter.com/counter/counter.js'></"+"script>");

</script>

</body>

</html>

Example 8 (improving Example 3: when script code accesses the translation server frequently, keep the server from blocking our IP; method 2, use proxy IPs):

Note: free proxy IPs are very unreliable, so put as many proxy IPs as possible in ip_list

[root@localhost ~]# vim translation.py

-----------------------script start-------------------

#!/usr/bin/python2.7
#filename:translation.py

import urllib
import urllib2
import json
import random

while True:
    content=raw_input('please input translate content: ')
    # enter q to leave the loop
    if content=='q':
        break

    url='http://fanyi.youdao.com/translate?smartresult=dict&smartresult=rule&smartresult=ugc&sessionFrom=dict2.index'

    # choose a proxy at random for each request and install an opener that uses it
    ip_list=['123.185.109.86:8888','124.235.47.141:8888']
    proxy_support=urllib2.ProxyHandler({'http':random.choice(ip_list)})
    opener=urllib2.build_opener(proxy_support)
    opener.addheaders=[('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.155 Safari/537.36')]
    urllib2.install_opener(opener)

    data={}
    data['type']='AUTO'
    data['i']=content
    data['doctype']='json'
    data['xmlVersion']='1.8'
    data['keyfrom']='fanyi.web'
    data['ue']='UTF-8'
    data['action']='FY_BY_CLICKBUTTON'
    data['typoResult']='true'
    data=urllib.urlencode(data)

    req=urllib2.Request(url,data)
    response=urllib2.urlopen(req)
    html=response.read()

    target=json.loads(html)
    print 'Translate the result: %s' %(target['translateResult'][0][0]['tgt'])

----------------script end----------------

[root@localhost ~]# python2.7 translation.py

please input translate content: boy

Translate the result: 男孩

please input translate content: girl

Translate the result: 女孩

please input translate content: man

Traceback (most recent call last):
  File "translation.py", line 32, in <module>
    response=urllib2.urlopen(req)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 127, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 404, in open
    response = self._open(req, data)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 422, in _open
    '_open', req)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 382, in _call_chain
    result = func(*args)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1214, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/local/python2.7/lib/python2.7/urllib2.py", line 1184, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno 111] Connection refused>
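The traceback shows what happens when the randomly chosen proxy is dead: URLError goes uncaught and the script stops. One way to cope, sketched here under assumptions (Python 3; first_working, fetch_via_proxy and the addresses are illustrative and not part of the original script), is to catch URLError and move on to the next proxy in the list.

```python
from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

def first_working(proxies, fetch):
    """Call fetch(proxy) for each proxy in turn; return (proxy, result) of the first success."""
    last_error = None
    for proxy in proxies:
        try:
            return proxy, fetch(proxy)
        except URLError as err:
            last_error = err          # dead proxy: remember why, then try the next one
    raise URLError("all proxies failed: %s" % last_error)

def fetch_via_proxy(proxy, url="http://example.com/", timeout=5):
    """A real fetch function might look like this (defined but not called in this sketch)."""
    opener = build_opener(ProxyHandler({"http": proxy}))
    return opener.open(url, timeout=timeout).read()

def fake_fetch(proxy):
    """Stand-in used for the demo: the first address pretends to refuse connections."""
    if proxy.endswith(":1"):
        raise URLError("connection refused")
    return "ok via " + proxy

print(first_working(["192.0.2.1:1", "192.0.2.2:8888"], fake_fetch))
```

Passing the fetch function in as a parameter keeps the retry logic separate from the network call, which also makes it possible to exercise the loop without a real proxy.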

Example (download the images on a given web page, by default into the current directory, using urllib.urlretrieve() to save each file locally):

Limitation of this script: it only downloads the images on the specified page and does not follow the site to its newest images

[root@localhost ~]# vim download_pic.py

------------------script start-------------------

#!/usr/bin/python2.7
#filename:download_pic.py

import urllib
import urllib2
import re

url='http://jandan.net/ooxx'

def getHtml(url):
    # fetch the page with a browser-like User-Agent
    req=urllib2.Request(url)
    req.add_header('User-Agent','Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36')
    response=urllib2.urlopen(req)
    html=response.read()
    return html

def getImg(html):
    # collect every src="….jpg" address in the page
    imglist=re.findall(r'src="(.*?\.jpg)"',html)
    #print imglist
    x=1
    for imgurl in imglist:
        # save as 1.jpg, 2.jpg, … in the current directory
        urllib.urlretrieve(imgurl,'%s.jpg' % x)
        x+=1

html=getHtml(url)
#print html
getImg(html)

--------------------script end------------------

[root@localhost ~]# python2.7 download_pic.py

[root@localhost ~]# ll

total 31664

-rw-r--r--. 1 root root 174584 Jun 21 23:18 10.jpg

-rw-r--r--. 1 root root 153359 Jun 21 23:18 11.jpg

-rw-r--r--. 1 root root 125877 Jun 21 23:18 12.jpg

-rw-r--r--. 1 root root 152194 Jun 21 23:18 13.jpg

-rw-r--r--. 1 root root 91847 Jun 21 23:18 14.jpg

-rw-r--r--. 1 root root 78389 Jun 21 23:18 15.jpg

-rw-r--r--. 1 root root 68577 Jun 21 23:18 16.jpg

-rw-r--r--. 1 root root 99573 Jun 21 23:18 17.jpg

-rw-r--r--. 1 root root 32444 Jun 21 23:18 18.jpg

-rw-r--r--. 1 root root 79730 Jun 21 23:18 19.jpg

-rw-r--r--. 1 root root 144334 Jun 21 23:18 1.jpg

……
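The image-extraction step can be tried on a small HTML fragment without touching the network. The sketch below assumes Python 3 and makes two changes that are not in the original script: the pattern uses [^"]* instead of .*? so a match cannot run past the closing quote of one attribute into the next tag, and urljoin turns relative or protocol-relative addresses into absolute ones. The sample HTML and the example.com hosts are made up.

```python
import re
from urllib.parse import urljoin

# like src="(.*?\.jpg)", but [^"]* keeps the match inside a single attribute value
IMG_RE = re.compile(r'src="([^"]*\.jpg)"')

def extract_jpg_urls(html, base_url):
    """Return absolute URLs of all .jpg images referenced by src attributes."""
    return [urljoin(base_url, src) for src in IMG_RE.findall(html)]

sample_html = '''
<img src="http://img.example.com/a.jpg" />
<img src="//img.example.com/b.jpg" />
<img src="/static/c.jpg" />
<img src="d.png" /><img src="e.jpg" />
'''

for link in extract_jpg_urls(sample_html, "http://jandan.net/ooxx"):
    print(link)
```

On the last sample line the .*? version would return d.png" /><img src="e.jpg as a single "address", because the lazy match keeps going until it finds .jpg"; restricting the match to non-quote characters avoids that.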
