How to process HTML files with regular expressions in Python

Python HTML processing, Python HTML file handling, Python regular expressions  2023-05-18 08:05:23  367 views  八月长安


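The finditer method finds every match in a string. You may already have used findall, which returns a list of the matched strings (or a list of tuples when the pattern has more than one group). finditer instead returns an iterator that yields a match object for each match in turn. In the code below these match objects are visited in a for loop so that group 1 of each can be printed.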
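The difference between the two methods can be seen in a small sketch. The sample line below is made up for illustration; the pattern is the test pattern used in the script further down.

```python
import re

# Hypothetical sample line, used only for illustration.
line = "Logic programming in SICStus and more logic"
pattern = re.compile(r"(logic|sicstus)", re.I)

# findall returns the group-1 strings directly, as a list.
found = pattern.findall(line)

# finditer yields match objects, which also know where each match starts.
matches = [(m.group(1), m.start()) for m in pattern.finditer(line)]

print(found)
print(matches)
```

Match objects are the better choice when you need more than the matched text, such as its position or several groups at once.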

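Your task is to write Python REs that recognise certain patterns in an HTML text file. Add code to the STARTER script that compiles REs for these patterns (assigning them to meaningfully named variables) and applies them to each line of the file, printing the matches found.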

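1. Write a pattern that recognises HTML tags and prints each one as "TAG: tag string" (e.g. "TAG: b" for the tag <b>). For simplicity, assume that the opening and closing angle brackets (<, >) of each tag always appear on the same line of text. A first attempt might be the regex "<.*>", where "." is the predefined character-class symbol that matches any character (other than a newline, by default). Try it out and work out why it is not a good solution. Then write a better solution that fixes the problem.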

2. Modify the code so that it distinguishes opening from closing tags (e.g. p versus /p), printing OPENTAG and CLOSETAG respectively.
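For the first task, the weakness of "<.*>" is that "*" is greedy: it takes as many characters as it can, so a line holding several tags comes back as one long match. The sketch below compares it with two common fixes, on a made-up sample line.

```python
import re

# Hypothetical sample line containing four tags.
line = "<p><b>bold</b> text</p>"

greedy = re.compile(r"<.*>")       # '.*' runs on to the LAST '>' on the line
lazy = re.compile(r"<.*?>")        # '.*?' stops at the FIRST '>' it can
negated = re.compile(r"<[^>]+>")   # '[^>]' can never cross a '>'

greedy_hits = greedy.findall(line)
lazy_hits = lazy.findall(line)
negated_hits = negated.findall(line)

print(greedy_hits)
print(lazy_hits)
print(negated_hits)
```

The non-greedy and negated-class versions give the same result here. The negated class states the intent more directly, because the pattern itself says that a tag cannot contain ">".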

import sys, re

#------------------------------

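testRE = re.compile('(logic|sicstus)', re.I)
# Any tag: '<', then one or more non-'>' characters, then '>'.
# (A pattern like '<[A-Za-z]>' would only match single-letter tags such as <b>.)
testI = re.compile(r'<[^>]+>')
# Raw strings keep '\S' from being read as a string escape.
testO = re.compile(r'<[^/](\S*?)[^>]*>')
testC = re.compile(r'</(\S*?)[^>]*>')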

with open('RGX_DATA.html') as infs: 
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')
    
        m = testRE.search(line)
        if m:
            print('** TEST-RE:', m.group(1))

        mm = testRE.finditer(line)
        for m in mm:
            print('** TEST-RE:', m.group(1))
        
        index = testI.finditer(line)
        for i in index:
            print('Tag:', i.group().replace('<', '').replace('>', ''))

        open1 = testO.finditer(line)
        for m in open1:
            print('opening:', m.group().replace('<', '').replace('>', ''))

        close1 = testC.finditer(line)
        for n in close1:
            print('closing:', n.group().replace('<', '').replace('>', ''))

Note that some HTML tags take parameters, for example:

<table border=1 cellspacing=0 cellpadding=8>

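Make sure your open-tag pattern works for tags both with and without parameters, i.e. that it still finds and prints the tag label. Now extend your code so that it prints both the open tag's label and its parameters, e.g.: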

OPENTAG: table
PARAM: border=1
PARAM: cellspacing=0
PARAM: cellpadding=8

        open1 = testO.finditer(line)
        for m in open1:
            # Drop the angle brackets, then split on whitespace: the first
            # piece is the tag label and the rest are its parameters.
            parts = m.group().replace('<', '').replace('>', '').split()
            for num, part in enumerate(parts):
                if num == 0:
                    print('opening:', part)
                else:
                    print('param:', part)
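An alternative to stripping the brackets by hand is to let capture groups separate the label from the parameters. This is only a sketch: the names open_tag and describe_open_tags are made up here, and it assumes that parameters never contain whitespace or ">".

```python
import re

# Assumed pattern: group 1 is the tag label, group 2 is everything
# between the label and the closing '>'.
open_tag = re.compile(r"<(\w+)\b([^>]*)>")

def describe_open_tags(line):
    """Return a (label, [params]) pair for each open tag found in line."""
    return [(m.group(1), m.group(2).split()) for m in open_tag.finditer(line)]

sample = "<table border=1 cellspacing=0 cellpadding=8><tr>"
for label, params in describe_open_tags(sample):
    print("OPENTAG:", label)
    for param in params:
        print("PARAM:", param)
```

Because the label must start with a word character, close tags such as </table> are skipped by this pattern.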

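In a regular expression, a backreference indicates that a substring matched by an earlier part of the regex must appear again. It is written \N (where N is a positive integer) and refers back to the text matched by the Nth group of the regex. For example, the regex r"(\w+) \1" matches only when the exact string matched by the group (\w+) appears again at the position of the backreference \1. It would match, for instance, within a string in which "the" occurs twice in a row. Using a backreference, write a pattern that matches when a line contains a paired open and close tag, e.g. the bold pair <b> ... </b>.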
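Both uses of a backreference can be tried on small strings. The sample strings and the names doubled and pair below are invented for this sketch.

```python
import re

# A word followed by a space and the same word again.
doubled = re.compile(r"\b(\w+) \1\b")

# An open tag, some text, and a close tag with the SAME label (\1).
pair = re.compile(r"<(\w+)\b[^>]*>(.*?)</\1\s*>")

repeated_word = doubled.search("kick the the ball").group(1)

line = "<b>bold</b> and <i>italic</i> but <b>mismatched</i>"
pairs = [(m.group(1), m.group(2)) for m in pair.finditer(line)]

print(repeated_word)
print(pairs)
```

The last tag pair on the sample line is not reported, because the label captured from the open tag (b) does not reappear in the close tag (i).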

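Suppose we wanted to write a script that performs HTML stripping, i.e. one that takes an HTML file and returns a plain-text file from which all HTML markup has been removed. We will not do that here. Instead we consider a simpler case: deleting any HTML tags found in each line of the input data file.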

You should be able to do this with the RE you have already defined for recognising HTML tags. Print the resulting text to the screen as STRIPPED: ...
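As a rough sketch of the idea, re.sub can replace every tag match with the empty string. The helper name strip_tags and the sample line are made up, and the pattern assumes that no attribute value contains a ">" character.

```python
import re

# An open or close tag: optional '/', a label, then anything up to '>'.
tag = re.compile(r"</?\w+\b[^>]*>")

def strip_tags(line):
    """Return line with every substring matched by `tag` removed."""
    return tag.sub("", line)

sample = '<p>Some <b>bold</b> text and a <a href="x.html">link</a>.</p>'
print("STRIPPED:", strip_tags(sample))
```

Line-by-line regex stripping like this is fine for an exercise. It does not handle comments, scripts, or tags that span several lines, for which a real HTML parser is the safer tool.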

import sys, re

#------------------------------
# PART 1: 

   # Key thing is to avoid matching strings that include
   # multiple tags, e.g. treating '<p><b>' as a single
   # tag. Can do this in several ways. Firstly, use
   # non-greedy matching, so get shortest possible match
   # including the two angle brackets:

tag = re.compile('</?(.*?)>') 

   # The above treats the '/' of a close tag as a separate
   # optional component - so that this doesn't turn up as
   # part of the match '.group(1)', which is meant to return
   # the tag label. 
   # Following alternative solution uses a negated character
   # class to explicitly prevent this including '>': 

tag = re.compile('</?([^>]+)>') 

   # Finally, following version separates finding the tag
   # label string from any (optional) parameters that might
   # also appear before the close angle bracket:

tag = re.compile(r'</?(\w+\b)([^>]+)?>') 

   # Note that use of '\b' (as Word boundary anchor) here means
   # we must mark the regex string as a 'raw' string (r'..'). 

#------------------------------
# PART 2: 

   # Following closeTag definition requires the first char
   # after the open angle bracket to be '/', while openTag
   # definition excludes this by requiring first char to be
   # a 'word char' (\w):

openTag  = re.compile(r'<(\w[^>]*)>')
closeTag = re.compile(r'</([^>]*)>')

   # Following revised definitions are more carefully stated
   # for correct extraction of tag label (separately from
   # any parameters):

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')
closeTag = re.compile(r'</(\w+\b)\s*>')

#------------------------------
# PART 3: 

   # Above openTag definition will already get the string
   # encompassing any parameters, and return it as
   # m.group(2), i.e. defn: 

openTag  = re.compile(r'<(\w+\b)([^>]+)?>')

   # If assume that parameters are continuous non-whitespace
   # chars separated by whitespace chars, then we can divide
   # them up using split - and that's how we handle them
   # here. (In reality, parameter strings can be a lot more
   # messy than this, but we won't try to deal with that.)

#------------------------------
# PART 4: 

openCloseTagPair = re.compile(r'<(\w+\b)([^>]+)?>(.*?)</\1\s*>')

   # Note use of non-greedy matching for the text falling
   # *between* the open/close tag pair - to avoid false
   # results where have two similar tag pairs on same line.

#------------------------------
# PART 5: URLS

   # This is quite tricky. The URL expressions in the file
   # are of two kinds, of which the first is a string
   # between double quotes ("..") which may include
   # whitespace. For this case we might have a regex: 

url = re.compile('href=("[^">]+")', re.I)

   # The second case does not have quotes, and does not
   # allow whitespace, consisting of a continuous sequence
   # of non-whitespace material (that ends when you reach a
   # space or close bracket '>'). This might be: 

url = re.compile(r'href=([^">\s]+)', re.I)

   # We can combine these two cases as follows, and still
   # get the expression back as group(1):

url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

   # Note that I've done nothing here to exclude 'mailto:'
   # links as being accepted as URLS. 

#------------------------------

with open('RGX_DATA.html') as infs: 
    linenum = 0
    for line in infs:
        linenum += 1
        if line.strip() == '':
            continue
        print('  ', '-' * 100, '[%d]' % linenum, '\n   TEXT:', line, end='')
    
        # PART 1: find HTML tags
        # (The following uses 'finditer' to find ALL matches
        # within the line)
    
        mm = tag.finditer(line)
        for m in mm:
            print('** TAG:', m.group(1), ' + [%s]' % m.group(2))
    
        # PART 2,3: find open/close tags (+ params of open tags)
    
        mm = openTag.finditer(line)
        for m in mm:
            print('** OPENTAG:', m.group(1))
            if m.group(2):
                for param in m.group(2).split():
                    print('    PARAM:', param)
    
        mm = closeTag.finditer(line)
        for m in mm:
            print('** CLOSETAG:', m.group(1))
    
        # PART 4: find open/close tag pairs appearing on same line
    
        mm = openCloseTagPair.finditer(line)
        for m in mm:
            print("** PAIR [%s]: \"%s\"" % (m.group(1), m.group(3)))
    
        # PART 5: find URLs:
    
        mm = url.finditer(line)
        for m in mm:
            print('** URL:', m.group(1))

        # PART 6: Strip out HTML tags (note that .sub will do all
        # possible substitutions, unless number is limited by count
        # keyword arg - which is fortunately what we want here)

        stripped = tag.sub('', line)
        print('** STRIPPED:', stripped, end = '')
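The script above needs the data file RGX_DATA.html. To try the combined URL pattern without that file, a small stand-alone sketch can run it on sample lines. The helper name extract_urls and the example.org addresses are invented for illustration.

```python
import re

# Same combined pattern as in PART 5: a quoted value (which may contain
# spaces) or an unquoted run of non-space characters.
url = re.compile(r'href=("[^">]+"|[^">\s]+)', re.I)

def extract_urls(line):
    """Return the href values found in line, without surrounding quotes."""
    return [m.group(1).strip('"') for m in url.finditer(line)]

samples = [
    '<a href="http://example.org/my page.html">quoted</a>',
    '<A HREF=http://example.org/index.html>unquoted</A>',
]
for s in samples:
    print(extract_urls(s))
```

The quoted alternative is listed first so that a quoted value is captured whole, spaces included. The re.I flag lets the pattern accept HREF as well as href.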

Article link: https://www.lsjlt.com/news/211872.html (please cite the source link when reposting)
