html - Regex to capture and store a URL


I'm new to Perl, and I'm trying to harvest links and images from a website. I'm reading up on regular expressions, and so far I've managed to get the lines of HTML that contain links or images (at least I believe so) with:

/<img src|<a href/i     # I'm comparing this against every line of the HTML

But how do I capture and store the actual URL? That is, turn this:

<img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg" 

into this:

http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg 

In general, I'd recommend using a parser such as HTML::TreeBuilder rather than regular expressions to parse HTML.
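As a rough illustration of the parser route, here is a minimal sketch. It assumes HTML::TreeBuilder is installed from CPAN (it is not a core module), and the sample markup and example.com URLs are made up for the demonstration:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; in practice this would be the fetched page.
my $html = '<p><a href="http://example.com/page">a link</a> '
         . '<img src="http://example.com/pic.jpg" alt=""></p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# look_down returns every element matching the criteria in list context;
# attr() returns undef when the attribute is absent, so filter those out.
my @images = grep { defined } map { $_->attr('src') }  $tree->look_down(_tag => 'img');
my @links  = grep { defined } map { $_->attr('href') } $tree->look_down(_tag => 'a');

$tree->delete;    # free the tree once we are done with it

print "image: $_\n" for @images;
print "link:  $_\n" for @links;
```

Because the parser deals with quoting, attribute order and line breaks inside tags, this tends to hold up on real pages where a line-by-line pattern would miss things.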

Having said that, you can of course try to use regexes to fetch what you're after, depending on the source material. To generically capture the img src or a href bit (this assumes, for example, that double quotes are used around the value, and it is more brittle than a parsing solution):

/<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/i 

Then, if it matches, the image URL is in $1, or the link is in $2.
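To actually store the URLs rather than just test for a match, one option is to run that pattern with the /g modifier in a loop and push each capture onto an array. The following is only a sketch: the sample HTML, the www.nytimes.com link and the @images / @links array names are assumptions for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; in practice this would be the fetched page.
my $html = <<'HTML';
<p><a class="story" href="http://www.nytimes.com/index.html">Home</a></p>
<img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg" alt="">
HTML

my (@images, @links);

# With /g, the while loop visits every match in turn. For each match only
# one of the two capture groups is defined, depending on which branch hit.
while ($html =~ /<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/gi) {
    if (defined $1) { push @images, $1 }
    else            { push @links,  $2 }
}

print "image: $_\n" for @images;
print "link:  $_\n" for @links;
```

Matching against the whole document in one string, rather than line by line, also lets the pattern find tags whose attributes are spread over several lines, since [^>] matches newlines too.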

