Parsing HTML with lxml in Python
I have the following HTML code:

    ... <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p> ...

I am trying to extract "[[footer]] - feed if want." from this code, including the spaces. (The general task is to find all strings on a page that contain the text "[[footer]]".)
    import lxml.etree as et

    html = """
    <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>
    """

    elem = et.fromstring(html)

    infos = elem.xpath('/p')
    for info in infos:
        print(1, info.text)
    print(2, et.tostring(elem, encoding="unicode"))

Results:

    1 [[footer]] - 
    2 <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>

Desired result:

    [[footer]] - <a href="/rss">feed</a> if want.

Question:
It is humbling to have to ask this question, since it doesn't seem like it should be hard.
How can I extract all strings on a page that contain the text "[[footer]]" using lxml?
You can't get the exact original string back, since lxml converts the HTML into its own internal data structure, and you'll have to use the tostring() method to convert it back to a string (meaning attributes, nesting, etc. may come out in a different order or format, and whitespace inside tags may not be preserved). An example of something like that:
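    for info in infos:
        # check whether the string is in the displayed text
        # (info.text is None when the element has no leading text)
        if info.text and "search string" in info.text:
            print(et.tostring(info, encoding="unicode"))

Since it sounds like the text could be mentioned anywhere on the page, you'll want to turn the check on info into a function and call it recursively to walk all the elements.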
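The walk over all elements suggested above could be sketched roughly like this. It is only a sketch under a few assumptions: the helper name find_elements_with_text and the sample page are made up for illustration, and it uses lxml's iter() to visit every element instead of hand-written recursion. It only looks at each element's own leading text, not at tail text.

```python
import lxml.etree as et


def find_elements_with_text(root, needle):
    """Return every element under root whose own text contains needle.

    Hypothetical helper for illustration, not part of lxml.
    """
    matches = []
    # iter(et.Element) yields root and all descendant elements in
    # document order, skipping comments and processing instructions
    for node in root.iter(et.Element):
        if node.text and needle in node.text:
            matches.append(node)
    return matches


# Made-up sample page wrapping the footer paragraph from the question
page = et.fromstring(
    '<div><p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>'
    "<p>other text</p></div>"
)

found = find_elements_with_text(page, "[[footer]]")
for node in found:
    print(et.tostring(node, encoding="unicode"))
```

With this sample page, only the footer paragraph should be reported; the second paragraph is skipped because its text does not contain the marker.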
Edit in response to a comment:
You could do this:
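    for info in infos:
        # check whether the string is in the displayed text
        if info.text and "search string" in info.text:
            output_str = info.text
            for child in info:
                # tostring() also emits the text that follows each child (its tail)
                output_str += et.tostring(child, encoding="unicode")
            print(output_str)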