over - matching multiple line in python regular expression




regular expression search (4)

As other have suggested the specific problem you are having can be resolved by allowing multi-line matching using re.MULTILINE

However you are going down a treacherous patch parsing HTML with regular expressions. Use an XML/HTML parser instead, BeautifulSoup works great for this!

doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(doc)
all_trs = soup.findAll("tr")

I want to extract the data between <tr> tags from an html page. I used the following code.But i didn't get any result. The html between the <tr> tags is in multiple lines

category =re.findall('<tr>(.*?)</tr>',data);

Please suggest a fix for this problem.


Do not use regular expressions to parse HTML. Use an HTML parser such as lxml or BeautifulSoup.


just to clear up the issue. Despite all those links to re.M it wouldn't work here as simple skimming of the its explanation would reveal. You'd need re.S, if you wouldn't try to parse html, of course:

>>> doc = """<table border="1">
    <tr>
        <td>row 1, cell 1</td>
        <td>row 1, cell 2</td>
    </tr>
    <tr>
        <td>row 2, cell 1</td>
        <td>row 2, cell 2</td>
    </tr>
</table>"""

>>> re.findall('<tr>(.*?)</tr>', doc, re.S)
['\n        <td>row 1, cell 1</td>\n        <td>row 1, cell 2</td>\n    ', 
 '\n        <td>row 2, cell 1</td>\n        <td>row 2, cell 2</td>\n    ']
>>> re.findall('<tr>(.*?)</tr>', doc, re.M)
[]

pat=re.compile('<tr>(.*?)</tr>',re.DOTALL|re.M)
print pat.findall(data)

Or non regex way,

for item in data.split("</tr>"):
    if "<tr>" in item:
       print item[item.find("<tr>")+len("<tr>"):]




python