Parsing HTML with lxml in Python

- April 15, 2014

I have the following HTML code:

... <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p> ...

I am trying to extract "[[footer]] - feed if want." from this code, including the spaces (the general task is to find all strings on the page that contain the text "[[footer]]").

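import lxml.etree as et

html = """
<p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>
"""

elem = et.fromstring(html)

# the root element is <p>, so '/p' selects it
infos = elem.xpath('/p')
for info in infos:
    # .text holds only the text before the first child element
    print(1, info.text)
# encoding="unicode" returns a str instead of bytes
print(2, et.tostring(elem, encoding="unicode"))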

Results:

1 [[footer]] - 
2 <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>

Desired result:

[[footer]] - <a href="/rss">feed</a> if want.

Question

It is humbling to have to ask this question, since it doesn't seem like it should be hard.

How can I extract all strings on the page containing the text "[[footer]]" using lxml?

You can't get the exact original string, since lxml converts the HTML into its own internal data structure, and you'll have to use the tostring() method to convert it back into a string (meaning attributes, nesting, etc. may come out in a different order or format, and whitespace may not be preserved). An example of that would be something like:

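for info in infos:
    # check whether the string is in the displayed text;
    # info.text is None when the element has no leading text
    if "search string" in (info.text or ""):
        print(et.tostring(info, encoding="unicode"))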

Since it sounds like the text could be mentioned anywhere on the page, you'll want to turn the check on info into a function and call it recursively to walk all the elements.
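The recursive walk described above could look roughly like this. It is only a sketch under assumptions: the function name find_elements_containing and the sample page are hypothetical, and it checks only each element's own leading text (.text), not the tail text that follows child elements.

```python
import lxml.etree as et


def find_elements_containing(elem, needle):
    """Return elem and any descendants whose own leading text contains needle."""
    matches = []
    # elem.text is None when the element has no leading text
    if needle in (elem.text or ""):
        matches.append(elem)
    # iterating over an element yields its direct children
    for child in elem:
        matches.extend(find_elements_containing(child, needle))
    return matches


# Hypothetical sample page, for illustration only
page = et.fromstring(
    '<div><p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>'
    "<p>unrelated</p></div>"
)
found = find_elements_containing(page, "[[footer]]")
for match in found:
    # with_tail=False leaves out any text that follows the element itself
    print(et.tostring(match, encoding="unicode", with_tail=False))
```

lxml elements also provide an iter() method that visits every descendant, which may be simpler than hand-written recursion if you only need a flat walk.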

Edit in response to a comment:

You could do something like this:

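for info in infos:
    # check whether the string is in the displayed text
    if "search string" in (info.text or ""):
        # start with the text before the first child element
        output_str = info.text
        # append each child's markup; tostring() also includes the child's tail text
        for child in info:
            output_str += et.tostring(child, encoding="unicode")
        print(output_str)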
