Parsing HTML with lxml in Python
I have the following HTML code:
... <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p> ...
I am trying to extract "[[footer]] - feed if want." from this code, including the spaces (the general task is to find strings on the page containing the text "[[footer]]").
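    import lxml.etree as et

    html = """
    <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>
    """
    elem = et.fromstring(html)
    infos = elem.xpath('/p')
    for info in infos:
        # .text holds only the text before the first child element
        print(1, info.text)
        # tostring() serializes the whole element, tags included
        print(2, et.tostring(elem, encoding="unicode"))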
Results:
    1 [[footer]] -
    2 <p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>
Desired result:
[[footer]] - <a href="/rss">feed</a> if want.
Question:
It is humbling to have to ask this question, since it doesn't seem like it should be hard.
How can I extract strings on the page containing the text "[[footer]]" using lxml?
You can't get the exact string, since lxml converts the HTML into its own internal data structure, and you'll want to use the tostring() method to convert it back to a string (meaning attributes, nesting, etc. may come out in a different order/format, and whitespace may not be preserved). An example of what you'd like:
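    for info in infos:
        # check whether the string is in the displayed text
        # (info.text is None when the element has no leading text)
        if info.text and "search string" in info.text:
            print(et.tostring(info, encoding="unicode"))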
Since it sounds like it could be mentioned anywhere on the page, you'll want to make the check for info a function and call it recursively to walk all the elements . . .
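A minimal sketch of that walk, under some assumptions: the helper name find_elements_containing and the sample page are made up for illustration, and instead of hand-written recursion it uses lxml's Element.iter(), which visits the element and all of its descendants in document order. It only checks each element's own leading text (.text), as the loop above does.

```python
import lxml.etree as et


def find_elements_containing(root, needle):
    """Return elements whose own leading text contains needle (hypothetical helper)."""
    matches = []
    for el in root.iter():
        # Comments and processing instructions have a non-string tag; skip them
        if not isinstance(el.tag, str):
            continue
        # el.text is None when the element has no text before its first child
        if el.text and needle in el.text:
            matches.append(el)
    return matches


# Sample page invented for this example
page = et.fromstring(
    '<div><p class="footer">[[footer]] - <a href="/rss">feed</a> if want.</p>'
    '<p>nothing here</p></div>'
)
found = find_elements_containing(page, "[[footer]]")
for el in found:
    print(et.tostring(el, encoding="unicode", with_tail=False))
```

Text that follows a child element lives in that child's .tail, so a needle appearing only after an inner tag would not be caught by this check.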
Edit in response to comment:
You could do this:
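    for info in infos:
        # check whether the string is in the displayed text
        if info.text and "search string" in info.text:
            output_str = info.text
            # append each child's markup; tostring() includes the child's tail text
            for child in info:
                output_str += et.tostring(child, encoding="unicode")
            print(output_str)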