Parsing HTML with lxml in Python


I have the following HTML code:

... <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p> ... 

I am trying to extract "[[footer]] - feed if want." from this code, including the spaces (the general task is to find all strings on the page containing the text "[[footer]]").

    import lxml.etree as et

    html = """ <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p> """

    elem = et.fromstring(html)

    # the root element is the <p> itself, so '/p' selects it
    infos = elem.xpath('/p')
    for info in infos:
        print("1,", info.text)
    print("2,", et.tostring(elem, encoding="unicode"))

Results:

    1, [[footer]] - 
    2, <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>

Desired result:

[[footer]] - <a href="/rss">feed</a> if want. 

Question

It is humbling to have to ask this question, since it doesn't seem like it should be hard.

How can I extract all strings on the page containing the text "[[footer]]" using lxml?

You can't get the exact original string, since lxml converts the HTML into its own internal data structure, so you'll want to use the tostring() method to convert it back to a string (meaning that attributes, nesting, etc. may come out in a different order or format, and whitespace may not be preserved). An example of what you'd like:

    for info in infos:
        # check whether the string is in the element's leading text
        if info.text and "search string" in info.text:
            print(et.tostring(info, encoding="unicode"))

Since it sounds like the text could be mentioned anywhere on the page, you'll want to turn the check on info into a function and call it recursively to walk all the elements.
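A minimal sketch of that recursive walk, assuming the page parses as well-formed XML. The function name `find_containing` and the sample `<div>` wrapper are made up for illustration; only `et.fromstring` and `et.tostring` come from lxml. Note that it checks each element's leading `.text` only, not the tail text that follows a child element.

```python
import lxml.etree as et


def find_containing(elem, needle):
    """Return elem and any descendants whose leading text contains needle."""
    matches = []
    # .text is None for elements with no leading text, so guard against that
    if elem.text and needle in elem.text:
        matches.append(elem)
    # recurse into each child element
    for child in elem:
        matches.extend(find_containing(child, needle))
    return matches


# hypothetical sample page wrapping the paragraph from the question
page = et.fromstring(
    '<div><p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>'
    '<p>nothing here</p></div>'
)

hits = find_containing(page, "[[footer]]")
for hit in hits:
    # with_tail=False leaves out any text that follows the element
    print(et.tostring(hit, encoding="unicode", with_tail=False))
```

For real-world HTML that is not well-formed XML, parsing with `lxml.html.fromstring` instead is probably the safer choice; the walk itself would stay the same.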

Edit in response to comment:

You could do this:

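    for info in infos:
        # check whether the string is in the element's leading text
        if info.text and "search string" in info.text:
            output_str = info.text
            # serialize each child; tostring() also appends the child's tail text
            for child in info:
                output_str += et.tostring(child, encoding="unicode")
            print(output_str)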
