How to process HTML files with regular expressions in Python

html Python 2023-05-17 21:05:04 547 views 八月长安


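The finditer method finds every match in a string. You may already have used findall, which returns a list of the matching strings. finditer instead returns an iterator that yields one match object per match, in the order the matches occur. These match objects can be visited with a for loop, which is how the code below prints group 1 of each match.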
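The difference between findall and finditer can be seen in a minimal sketch. The sample line and the names `tag`, `labels` and `spans` are made up for illustration and are not taken from the data file used later.

```python
import re

# Hypothetical sample line, not taken from RGX_DATA.html.
line = "<p>Some <b>bold</b> text</p>"
tag = re.compile(r"</?(\w+)[^>]*>")

# findall returns plain strings (here, the contents of group 1).
labels = tag.findall(line)

# finditer yields match objects, which also expose positions.
spans = [(m.group(1), m.start()) for m in tag.finditer(line)]

print(labels)  # ['p', 'b', 'b', 'p']
print(spans)   # [('p', 0), ('b', 8), ('b', 15), ('p', 24)]
```

Match objects are the better choice when more than the matched text is needed, for example several groups or the offset of each match.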

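You need to write Python regular expressions that recognise specific patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningful variable names) and applies them to each line of the file, printing out the matches found.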

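1. Write a pattern that recognises HTML tags and print each one as "TAG: TAG string" (e.g. "TAG: b" for the tag <b>). To keep things simple, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character class symbol that matches any character. Try this out and work out why it is not a good solution. Then write a better solution that fixes the problem.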

2. Modify the code so that it distinguishes opening and closing tags (e.g. p versus /p), printing OPENTAG and CLOSETAG.
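The problem with "<.*>" shows up as soon as a line holds more than one tag. The following sketch uses a made-up sample line; the names `greedy`, `lazy` and `negated` are illustrative only.

```python
import re

line = "<p><b>bold</b></p>"  # hypothetical sample line

greedy = re.compile(r"<.*>")      # '.*' grabs as much as it can
lazy = re.compile(r"<.*?>")       # '.*?' stops at the first '>'
negated = re.compile(r"<[^>]+>")  # '>' is excluded from the tag body

greedy_hits = greedy.findall(line)
lazy_hits = lazy.findall(line)
negated_hits = negated.findall(line)

print(greedy_hits)   # ['<p><b>bold</b></p>']
print(lazy_hits)     # ['<p>', '<b>', '</b>', '</p>']
print(negated_hits)  # ['<p>', '<b>', '</b>', '</p>']
```

The greedy pattern runs from the first "<" to the last ">" on the line, so the whole line comes back as a single "tag". Either non-greedy matching or a negated character class avoids this.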

import re

#------------------------------

testRE = re.compile(r'(logic|sicstus)', re.I)
# Any tag, open or close; group 1 is the tag label.
testI = re.compile(r'</?(\w+)[^>]*>')
# Opening tags: the first character after '<' must not be '/'.
testO = re.compile(r'<[^/](\S*?)[^>]*>')
# Closing tags: '<' must be followed directly by '/'.
testC = re.compile(r'</(\S*?)[^>]*>')

with open('RGX_DATA.html') as infs:
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')

        # finditer yields one match object per match in the line.
        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))

        index = testI.finditer(line)
        for i in index:
            print('TAG:', i.group(1))

        open1 = testO.finditer(line)
        for m in open1:
            print('OPENTAG:', m.group().replace('<', '').replace('>', ''))

        close1 = testC.finditer(line)
        for n in close1:
            print('CLOSETAG:', n.group().replace('<', '').replace('>', ''))

Note that some HTML tags have parameters, for example:

<table border=1 cellspacing=0 cellpadding=8>

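Make sure your pattern finds and prints the tag label correctly for tags both with and without parameters. Now extend your code so that, for opening tags, it prints both the tag label and the parameters, e.g.: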

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

        open1 = testO.finditer(line)
        for m in open1:
            # Split the tag body on whitespace: the first item is the
            # tag label, the remaining items are its parameters.
            firstm = m.group().replace('<', '').replace('>', '').split()
            for num, otherm in enumerate(firstm):
                if num == 0:
                    print('OPENTAG:', otherm)
                else:
                    print('PARAM:', otherm)
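An alternative to stripping the angle brackets by hand is to let capture groups separate the label from the parameter string. This is only a sketch: `open_tag` and `describe_open_tags` are hypothetical names, and it assumes parameters contain no whitespace (so a quoted value with spaces would be split incorrectly).

```python
import re

# Group 1: tag label; group 2: everything up to the closing '>'.
# Requiring a word character after '<' excludes closing tags.
open_tag = re.compile(r"<(\w+)([^>]*)>")

def describe_open_tags(line):
    """Return a (label, [params]) pair for each opening tag in line."""
    return [(m.group(1), m.group(2).split()) for m in open_tag.finditer(line)]

result = describe_open_tags("<table border=1 cellspacing=0 cellpadding=8><tr></tr>")
for label, params in result:
    print("OPENTAG:", label)
    for param in params:
        print("PARAM:", param)
```

For the sample line this prints the table label with its three parameters, followed by the tr label with none.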

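In a regular expression, a backreference indicates that a substring matched by an earlier part of the expression should occur again. Its format is \N (where N is a positive integer), and it refers back to the text matched by the Nth group of the regex. For example, a regex such as r"(\w+) \1" matches only if the exact string matched by the group (\w+) appears again at the position of the backreference \1. This would match the string "the the", for example, where "the" occurs twice. Use a backreference to write a pattern that matches when a line contains a paired opening and closing tag, e.g. <b>in bold</b>.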
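Both uses of a backreference can be tried on small made-up strings. The names `doubled` and `pair` below are illustrative, and the sample strings are not from the data file.

```python
import re

# A word followed by a space and the very same word again.
doubled = re.compile(r"\b(\w+) \1\b")

# An opening tag, some text, and a closing tag with the same label.
pair = re.compile(r"<(\w+)[^>]*>(.*?)</\1\s*>")

m1 = doubled.search("Paris in the the spring")
m2 = pair.search("some <b>bold</b> and <i>italic</b> text")

print(m1.group(1))               # the
print(m2.group(1), m2.group(2))  # b bold
print(pair.search("<i>italic</b>"))  # None: the labels differ
```

Note that \1 repeats the text that was matched, not the pattern, which is why the mismatched pair <i>...</b> is rejected.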

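Consider that we might want to create a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain text file from which all HTML markup has been removed. We are not going to do that here. Instead we consider a simpler case: removing the HTML tags we find in any line of the input data file.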

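If you have already defined an RE that recognises HTML tags, you should be able to produce the text with the tags removed and print it on screen, labelled STRIPPED.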
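Stripping tags from a single line comes down to one call to the pattern's sub method. The following is a minimal sketch; `strip_tags` and the sample line (including the file name x.html) are hypothetical.

```python
import re

# Matches an opening or closing tag, with or without parameters.
tag = re.compile(r"</?\w+[^>]*>")

def strip_tags(line):
    """Return line with every HTML tag replaced by the empty string."""
    return tag.sub("", line)

stripped = strip_tags('<p>See <a href="x.html">this page</a>.</p>')
print("** STRIPPED:", stripped)  # ** STRIPPED: See this page.
```

This works line by line only, under the same assumption as before that a tag never spans two lines. For real-world HTML a proper parser is a safer choice than a regex.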

import re

#------------------------------
# PART 1: 

   # Key thing is to avoid matching strings that include
   # multiple tags, e.g. treating '<p><b>' as a single
   # tag. Can do this in several ways. Firstly, use
   # non-greedy matching, so get shortest possible match
   # including the two angle brackets:

tag = re.compile('</?(.*?)>') 

   # The above treats the '/' of a close tag as a separate
   # optional component - so that this doesn't turn up as
   # part of the match '.group(1)', which is meant to return
   # the tag label. 
   # Following alternative solution uses a negated character
   # class to explicitly prevent this including '>': 

tag = re.compile('</?([^>]+)>') 

   # Finally, following version separates finding the tag
   # label string from any (optional) parameters that might
   # also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>') 

   # Note that use of '\b' (as Word boundary anchor) here means
   # we must mark the regex string as a 'raw' string (r'..'). 

#------------------------------
# PART 2: 

   # The following closeTag definition requires the first char
   # after the open angle bracket to be '/', while the openTag
   # definition excludes this by requiring the first char to be
   # a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

   # The following revised definitions are more carefully stated
   # for correct extraction of the tag label (separately from
   # any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3: 

   # Above openTag definition will already get the string
   # encompassing any parameters, and return it as
   # m.group(2), i.e. defn: 

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

   # If we assume that parameters are continuous non-whitespace
   # chars separated by whitespace chars, then we can divide
   # them up using split - and that's how we handle them
   # here. (In reality, parameter strings can be a lot more
   # messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4: 

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

   # Note the use of non-greedy matching for the text falling
   # *between* the open/close tag pair - to avoid false
   # results where we have two similar tag pairs on the same line.

#------------------------------
# PART 5: URLS

   # This is quite tricky. The URL expressions in the file
   # are of two kinds, of which the first is a string
   # between double quotes ("..") which may include
   # whitespace. For this case we might have a regex: 

url = re.compile('href=("[^">]+")', re.I)

   # The second case does not have quotes, and does not
   # allow whitespace, consisting of a continuous sequence
   # of non-whitespace material (that ends when you reach a
   # space or close bracket '>'). This might be: 

url = re.compile(r'href=([^">\s]+)', re.I)

   # We can combine these two cases as follows, and still
   # get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

   # Note that I've done nothing here to exclude 'mailto:'
   # links as being accepted as URLS. 

#------------------------------

with open('RGX_DATA.html') as infs: 
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')
    
        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)
    
        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))
    
        # PART 2,3: find open/close tags (+ params of open tags)
    
        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)
    
        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))
    
        # PART 4: find open/close tag pairs appearing on same line
    
        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))
    
        # PART 5: find URLs:
    
        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end = '')
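The combined URL pattern from PART 5 can be checked in isolation. In this sketch `extract_urls` and the sample line are made up; the pattern itself is the one defined in the script above.

```python
import re

# Quoted form (may contain spaces) or unquoted form (no whitespace).
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

def extract_urls(line):
    """Return every href value in line, with surrounding quotes removed."""
    return [m.group(1).strip('"') for m in url.finditer(line)]

sample = '<a href="http://example.org/a b.html">x</a> <A HREF=page2.html>y</A>'
urls = extract_urls(sample)
print(urls)  # ['http://example.org/a b.html', 'page2.html']
```

The re.I flag is what lets the upper-case HREF match. As the script's own comment says, nothing here excludes mailto: links.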

Article link: https://www.lsjlt.com/news/211548.html
