html - Regex to capture and store url
I'm new to Perl, and I'm trying to harvest the links and images from a website. I've been reading about regular expressions, and so far I've managed to find the lines of HTML that contain links or images (at least I believe so) with:
/<img src|<a href/i  # I'm comparing every line of the HTML against this
But how do I capture and store the actual URL? I want to turn this:
<img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg"
into this:
http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg
In general, I'd recommend using e.g. HTML::TreeBuilder rather than regular expressions to parse HTML.
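As a rough sketch of the parser-based route, assuming HTML::TreeBuilder is installed from CPAN: the helper name urls_from_tree and the sample markup are made up for illustration and are not from the question. It relies on look_down and attr, which the tree inherits from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical helper: collect the src of every <img> and the href of
# every <a> in a string of HTML.
sub urls_from_tree {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # look_down returns all matching elements in list context;
    # the grep drops elements that lack the attribute.
    my @images = grep { defined } map { $_->attr('src') }  $tree->look_down(_tag => 'img');
    my @links  = grep { defined } map { $_->attr('href') } $tree->look_down(_tag => 'a');

    $tree->delete;    # free the tree once we are done with it
    return (\@images, \@links);
}

# Made-up sample; note the single-quoted src, which the parser handles.
my ($tree_images, $tree_links) = urls_from_tree(
    q{<p><a href="http://example.com/story.html">story</a><img src='http://example.com/pic.jpg'></p>}
);
print "image: $_\n" for @$tree_images;
print "link: $_\n"  for @$tree_links;
```

Because the parser deals with quoting, attribute order and tags split across lines, this does not depend on the markup being laid out in any particular way.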
Having said that, you can of course try to use regexes to fetch what you're after; how well that works depends on the source material. To generically capture the img src or href part (this assumes things such as double quotes being used, for example, and is more brittle than a parsing solution):
/<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/i
Then, if it matches, the image URL is in $1, or the link is in $2.
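For illustration, here is a minimal sketch of that pattern in use with only core Perl. The helper name extract_urls and the example.com link are assumptions added for the example; the image URL is the one from the question. It matches against the whole document with /g rather than line by line, and it carries the same double-quote assumption as the pattern itself.

```perl
use strict;
use warnings;

# Hypothetical helper: return the image and link URLs found in a string of HTML.
# Assumes double-quoted attribute values, as the pattern does.
sub extract_urls {
    my ($html) = @_;
    my (@images, @links);

    # With /g in a while loop, each iteration picks up the next match.
    while ($html =~ /<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/gi) {
        if (defined $1) { push @images, $1 }   # the <img ... src="..."> branch matched
        else            { push @links,  $2 }   # the <a ... href="..."> branch matched
    }
    return (\@images, \@links);
}

my $html = <<'HTML';
<a href="http://example.com/story.html"><img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg"></a>
HTML

my ($images, $links) = extract_urls($html);
print "image: $_\n" for @$images;
print "link: $_\n"  for @$links;
```

Only one of the two capture groups is set for any given match, which is why the code checks whether $1 is defined before deciding where to store the URL.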