Basic usage of the urllib library in Python 3

Urllib · 2023-01-31 01:01:07 · 708 views · 安东尼

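I. What is urllib

urllib is an HTTP request library that ships with Python's standard library. It consists of the following modules:

urllib.request        request module
urllib.error          exception-handling module
urllib.parse          URL-parsing module
urllib.robotparser    robots.txt-parsing module

The first three are used most often; the fourth only needs a passing acquaintance.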

II. urllib methods

The walkthrough follows the official urllib documentation. First, the urllib.request module:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Example 1:


import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))

Only the first parameter, the URL, is used here. Because no data is attached, this is a GET request. read() returns the response body as bytes, so decode() is called with the utf-8 encoding to turn it into readable page source.
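The example above hard-codes utf-8 when decoding. A minimal sketch, assuming we would rather use the charset the response declares and fall back to utf-8 only when none is given. The helper name read_text and the data: URL used to keep the demo offline are illustrative and not part of the original article.

```python
import urllib.request


def read_text(url, default_charset='utf-8'):
    """Fetch a URL and decode the body with the charset the response declares."""
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # the body arrives as bytes
        # response.headers is an email.message.Message; get_content_charset()
        # returns None when Content-Type declares no charset.
        charset = response.headers.get_content_charset() or default_charset
    return raw.decode(charset)


# A data: URL keeps this demo offline; an http:// URL is passed the same way.
text = read_text('data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD')
print(text)
```

Using the response as a context manager also closes the connection as soon as the body has been read.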

Example 2:


import urllib.parse
import urllib.request
d = bytes(urllib.parse.urlencode({'name': 'zhangsan'}), encoding='utf8')
response = urllib.request.urlopen('http://httpbin.org/post', data=d)
print(response.read().decode('utf-8'))

Output:


{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "zhangsan"
  },
  "headers": {
    "Accept-Encoding": "identity",
    "Connection": "close",
    "Content-Length": "13",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7"
  },
  "json": null,
  "origin": "183.209.153.56",
  "url": "http://httpbin.org/post"
}

This time the second parameter, data, is used, which turns the call into a POST request; the URL is an HTTP testing site. urlopen() requires data to be bytes, so the dictionary is first URL-encoded into a string and then encoded to bytes.
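The two conversion steps (dictionary to query string, string to bytes) can be inspected without sending anything. The sketch below is only an illustration: it builds a Request object offline to show that the presence of data is what switches the method to POST.

```python
import urllib.parse
import urllib.request

form = {'name': 'zhangsan'}

query = urllib.parse.urlencode(form)  # str: 'name=zhangsan'
body = query.encode('utf-8')          # bytes, the type urlopen() expects

# Nothing is sent here; Request only records what would be sent.
with_data = urllib.request.Request('http://httpbin.org/post', data=body)
without_data = urllib.request.Request('http://httpbin.org/get')

print(query, len(body))
print(with_data.get_method(), without_data.get_method())
```

The body is 13 bytes long, which matches the Content-Length header echoed back in the output above.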

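Example 3:


# Set a timeout for the request
import socket
import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    # e.reason holds the underlying exception
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

The timeout parameter is set to a value so short that no response can arrive in time, so the call raises an exception. Checking the exception type before printing a message is a common technique; here 'Time Out' is printed when the underlying reason is a timeout.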
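The type check can be pulled into a small helper and exercised without any network access. This is a sketch under assumptions: describe_failure is a hypothetical name, and the exceptions are constructed by hand rather than raised by a real request. It also covers the case where a timeout surfaces without a URLError wrapper, which may happen while the response is being read.

```python
import socket
import urllib.error


def describe_failure(exc):
    """Return a short label for an exception raised while fetching a URL."""
    # A timeout while connecting arrives wrapped in URLError, with the
    # original exception stored in .reason.
    if isinstance(exc, urllib.error.URLError) and isinstance(exc.reason, socket.timeout):
        return 'Time Out'
    # A timeout while waiting for or reading the response may be unwrapped.
    if isinstance(exc, socket.timeout):
        return 'Time Out'
    return 'Other error: %r' % (exc,)


# Hand-built exceptions stand in for real failures.
wrapped = describe_failure(urllib.error.URLError(socket.timeout('timed out')))
bare = describe_failure(socket.timeout('timed out'))
other = describe_failure(urllib.error.URLError('connection refused'))
print(wrapped, bare, other, sep='\n')
```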

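Example 4:


# Useful attributes and methods of the response
import urllib.request

response = urllib.request.urlopen('http://www.python.org')
print(response.status)
print(response.getheaders())  # list of (name, value) tuples
print(response.getheader('Server'))

status is the status code, and getheaders() returns the response headers. urlopen() called with a plain URL string, however, offers no way to send custom request headers, so another tool is needed: the Request class.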

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Example 1:


from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36',
    'Host': 'httpbin.org'
}
# Named form rather than dict, to avoid shadowing the built-in
form = {
    'name': 'zhangsan'
}

data = bytes(parse.urlencode(form), encoding='utf8')
# method must be a string
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

This sends a POST request through a Request object with custom request headers attached.
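A Request object can be inspected before it is sent, which helps when checking what headers and method a crawler will actually use. A minimal sketch, assuming a made-up User-Agent value (demo-client/0.1); nothing is sent over the network.

```python
from urllib import parse, request

req = request.Request(
    url='http://httpbin.org/post',
    data=parse.urlencode({'name': 'zhangsan'}).encode('utf-8'),
    headers={'User-Agent': 'demo-client/0.1'},  # hypothetical value
    method='POST',
)
# Headers can also be added after construction.
req.add_header('Accept', 'application/json')

print(req.get_method())
print(req.full_url)
# Header names are stored via str.capitalize(), so the lookup key
# is 'User-agent', not 'User-Agent'.
print(req.get_header('User-agent'))
print(sorted(req.header_items()))
```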

urllib.request.build_opener([handler, ...])

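Handlers are among the most useful tools in urllib. When requests must go through an IP proxy, or a crawler must keep a session alive (cookies), the corresponding handler does the work. The cookie handler serves as the example.

Example 2:


import http.cookiejar, urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

# Print every cookie the server set
for item in cookie:
    print(item.name, item.value)

CookieJar() constructs a cookie container, urllib.request.HTTPCookieProcessor() wraps it in a handler, and build_opener() turns that handler into an opener that can send HTTP requests. The response carries cookie information, which the handler extracts and stores in the cookie object. Cookies matter because, when a crawler needs to maintain a session, they can be attached to later requests.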
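To watch the session being maintained without touching an external site, the sketch below starts a throwaway HTTP server on localhost that sets a cookie and echoes back whatever Cookie header it receives. The server, the EchoCookieHandler class, and the cookie value session=abc123 are all assumptions made for the demo. ProxyHandler({}) is passed so that proxy settings from the environment do not interfere with the local request.

```python
import http.cookiejar
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoCookieHandler(BaseHTTPRequestHandler):
    """Demo server: sets one cookie and echoes the Cookie header it received."""

    def do_GET(self):
        body = (self.headers.get('Cookie') or '').encode('utf-8')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = HTTPServer(('127.0.0.1', 0), EchoCookieHandler)  # port 0: any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = 'http://127.0.0.1:%d/' % server.server_port

jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),  # ignore proxies from the environment
    urllib.request.HTTPCookieProcessor(jar),
)

with opener.open(url) as resp:
    first_body = resp.read().decode('utf-8')   # no cookie sent yet
with opener.open(url) as resp:
    second_body = resp.read().decode('utf-8')  # cookie sent back automatically

server.shutdown()
server.server_close()

print(repr(first_body), repr(second_body))
print([(c.name, c.value) for c in jar])
```

The second request carries the cookie without any extra code, because the same opener (and therefore the same jar) is reused.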

Example 3:


import http.cookiejar, urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
# Also write session cookies and cookies that have already expired
cookie.save(ignore_discard=True, ignore_expires=True)

MozillaCookieJar is a subclass of CookieJar that can write cookies to a local file.

Example 4:


import http.cookiejar, urllib.request

cookie = http.cookiejar.MozillaCookieJar()
# Keyword names are lowercase
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))

The load() method of the cookie object reads cookies back from a local file, so the session state can be carried into later requests.
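The save and load round trip can be tried offline by putting a hand-built cookie into the jar. This is a sketch under assumptions: the make_session_cookie helper, the domain example.com, and the cookie value are invented for the demo. It also shows why ignore_discard=True matters: a session cookie (no expiry time) is skipped on load without it.

```python
import http.cookiejar
import os
import tempfile


def make_session_cookie(name, value, domain):
    """Build a session cookie by hand (normally the server sets these)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path='/', path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cookie.txt')

    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_session_cookie('session', 'abc123', 'example.com'))
    jar.save(ignore_discard=True, ignore_expires=True)

    # Default load: session cookies are dropped.
    strict = http.cookiejar.MozillaCookieJar()
    strict.load(path)

    # Lenient load: session cookies are kept.
    lenient = http.cookiejar.MozillaCookieJar()
    lenient.load(path, ignore_discard=True, ignore_expires=True)

print(len(strict))
print([(c.name, c.value) for c in lenient])
```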

Next is the urllib.error module.

urllib.error

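Example 1:


from urllib import request, error

try:
    response = request.urlopen('http://bucunzai.com/index.html')
except error.HTTPError as e:
    # HTTPError is the subclass, so it is caught first
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

The official documentation shows that HTTPError is a subclass of URLError, so the subclass must be caught first. Running the example confirms that HTTPError is the one caught. According to the documentation, HTTPError exposes three attributes: reason, code and headers. In this run, code is 404. The next example shows a common pattern: checking which kind of error occurred.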
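The catch order can be verified offline by raising hand-built exceptions. A minimal sketch: the classify helper and the example.com URL are hypothetical, and the HTTPError is constructed directly instead of coming from a real server.

```python
import email.message
import io
from urllib import error


def classify(exc):
    """Raise exc and report which except clause handles it."""
    try:
        raise exc
    except error.HTTPError as e:   # subclass first
        return 'HTTP %d %s' % (e.code, e.reason)
    except error.URLError as e:    # then the base class
        return 'URL error: %s' % e.reason


# HTTPError(url, code, msg, hdrs, fp), built by hand for the demo.
not_found = error.HTTPError(
    'http://example.com/missing', 404, 'Not Found',
    email.message.Message(), io.BytesIO(b''),
)
unreachable = error.URLError('name resolution failed')

print(issubclass(error.HTTPError, error.URLError))
print(classify(not_found))
print(classify(unreachable))
```

If the two except clauses were swapped, the URLError clause would also swallow every HTTPError and the status code would never be reported.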

Example 2:


from urllib import request, error
import socket

try:
    response = request.urlopen('http://www.baidu.com', timeout=0.01)
except error.URLError as e:
    # e.reason holds the underlying exception
    if isinstance(e.reason, socket.timeout):
        print('Time Out')

Finally, a look at the commonly used functions in urllib.parse.

urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)

Example 1:


from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')

The last line is the output. urlparse analyzes the structure of the URL passed in and splits it into the corresponding tuple. The scheme parameter supplies a default: when the URL carries no scheme, the result's scheme takes the default value; when it does, the default is ignored.
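A short illustration of when the scheme default applies. It also shows a detail that is easy to miss: without a leading //, the host name is parsed as part of the path, not as netloc. The three input URLs are chosen for the demo.

```python
from urllib.parse import urlparse

urls = (
    'www.baidu.com/index.html',         # no scheme, no //
    '//www.baidu.com/index.html',       # no scheme, but // marks the netloc
    'http://www.baidu.com/index.html',  # explicit scheme
)

results = [urlparse(u, scheme='https') for u in urls]
for r in results:
    print(r.scheme, repr(r.netloc), r.path)

# https '' www.baidu.com/index.html
# https 'www.baidu.com' /index.html
# http 'www.baidu.com' /index.html
```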

Example 2:


from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user#comment', allow_fragments=False)
print(result)
# ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user#comment', query='', fragment='')

As shown, when allow_fragments is set to False, the fragment in the URL is folded into the preceding non-empty component. If the meaning of each URL component is unclear, see the notes to this article.
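The same effect with a query string present, as a small sketch: here the fragment text ends up in query rather than params. The URL is chosen for the demo. The result is a named tuple, so components can be read by attribute or by index.

```python
from urllib.parse import urlparse

url = 'http://www.baidu.com/index.html?id=5#comment'

with_fragment = urlparse(url)
without_fragment = urlparse(url, allow_fragments=False)

print(with_fragment.query, repr(with_fragment.fragment))
print(without_fragment.query, repr(without_fragment.fragment))
# id=5 'comment'
# id=5#comment ''

# Attribute access and index access refer to the same fields.
print(with_fragment[1] == with_fragment.netloc)
```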

urllib.parse.urlunparse(parts)

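Joins URL components back into a URL. The argument is a list here, though any iterable of six items works.

Example 1:


from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

# http://www.baidu.com/index.html;user?a=6#comment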
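urlunparse pairs naturally with urlparse when only one component of an existing URL needs to change. A minimal sketch, assuming the goal is to switch the scheme and replace the query; _replace comes from the named tuple that urlparse returns.

```python
from urllib.parse import urlparse, urlunparse

original = 'http://www.baidu.com/index.html;user?a=6#comment'
parts = urlparse(original)

# Parsing and unparsing an already well-formed URL gives it back unchanged.
round_trip = urlunparse(parts)

# Swap individual components, then rebuild the URL.
changed = urlunparse(parts._replace(scheme='https', query='a=7'))

print(round_trip)
print(changed)
# http://www.baidu.com/index.html;user?a=6#comment
# https://www.baidu.com/index.html;user?a=7#comment
```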

urllib.parse.urljoin(base, url, allow_fragments=True)

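Example 1:


from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'index.html'))
print(urljoin('http://www.baidu.com#comment', '?username="zhangsan"'))
print(urljoin('http://www.baidu.com', 'www.sohu.com'))

# http://www.baidu.com/index.html
# http://www.baidu.com?username="zhangsan"
# http://www.baidu.com/www.sohu.com

The joining rules deserve attention: a component in the second argument that the first lacks is added; otherwise it overrides the first's. The second print shows something to avoid: joining this way discards the lower-level component (here the fragment #comment). The third print is a counterexample: many people assume parsing starts from the domain name, but it actually starts from '//'. The official documentation states this clearly: If url is an absolute URL (that is, starting with // or scheme://), the url's host name and/or scheme will be present in the result. So once again: the official documentation is the best learning tool.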
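The rule quoted from the documentation can be seen by contrasting relative and absolute second arguments against one base. The base URL and paths are chosen for the demo.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html'

relative = urljoin(base, 'c.html')                     # replaces the last path segment
rooted = urljoin(base, '/c.html')                      # replaces the whole path
host_only = urljoin(base, '//www.sohu.com/c')          # // replaces the host, keeps the scheme
absolute = urljoin(base, 'https://www.sohu.com/c')     # a full URL replaces everything

print(relative, rooted, host_only, absolute, sep='\n')
# http://www.baidu.com/a/c.html
# http://www.baidu.com/c.html
# http://www.sohu.com/c
# https://www.sohu.com/c
```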

urllib.parse.urlencode()

urlencode() converts a dictionary into a string in URL query-parameter form.

Example 1:


from urllib.parse import urlencode

params = {
    'name': 'zhangsan',
    'age': 22
}

base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

# http://www.baidu.com?name=zhangsan&age=22
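urlencode() also takes care of percent-encoding, which matters once values contain non-ASCII characters, spaces, or lists. A short sketch with made-up parameter values; doseq=True expands a list into repeated keys, and parse_qs reverses the encoding.

```python
from urllib.parse import parse_qs, urlencode

params = {
    'name': '张三',        # non-ASCII text is percent-encoded as UTF-8
    'city': 'New York',    # a space becomes '+'
    'tag': ['a', 'b'],     # with doseq=True each item gets its own key
}

query = urlencode(params, doseq=True)
decoded = parse_qs(query)

print(query)
print(decoded)
# name=%E5%BC%A0%E4%B8%89&city=New+York&tag=a&tag=b
# {'name': ['张三'], 'city': ['New York'], 'tag': ['a', 'b']}
```

parse_qs always returns lists as values, because a key may legitimately appear more than once in a query string.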

Article link: https://www.lsjlt.com/news/184475.html (please cite this link when reposting)
