html - Regex to capture and store url -

- April 15, 2012

I'm new to Perl, and I'm trying to harvest links and images from a website. I'm reading up on regular expressions, and so far I've managed to get the lines of HTML that contain links or images (at least I believe so) with

/<img src|<a href/i     # I'm comparing this against every line of the HTML

But how do I capture and store the actual URL? I want to turn this:

<img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg"

into this:

http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg

In general, I'd recommend using e.g. HTML::TreeBuilder rather than regular expressions to parse HTML.
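A minimal sketch of the parser route, assuming the HTML::TreeBuilder module from CPAN is installed. The helper name extract_urls_with_parser and the sample HTML (including the example.com link) are made up for illustration and are not from the original post:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module, not part of core Perl

# Hypothetical helper: returns two array refs, image URLs and link URLs.
sub extract_urls_with_parser {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);

    # In list context, look_down returns every element matching the criteria.
    # attr() returns undef when the attribute is absent, so filter those out.
    my @images = grep { defined } map { $_->attr('src') }  $tree->look_down(_tag => 'img');
    my @links  = grep { defined } map { $_->attr('href') } $tree->look_down(_tag => 'a');

    $tree->delete;   # release the parse tree once we are done with it
    return (\@images, \@links);
}

# Sample input: note the single-quoted href, which the parser handles
# without any special casing.
my $html = <<'HTML';
<img alt="photo" src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg">
<a class="story" href='http://example.com/story.html'>Read more</a>
HTML

my ($images, $links) = extract_urls_with_parser($html);
print "image: $_\n" for @$images;
print "link:  $_\n" for @$links;
```

Because the parser works on the whole document rather than line by line, it should not matter whether a tag is split across lines or which quoting style the attributes use.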

Having said that, you can of course try to use regexes to fetch what you're after; how well that works depends on the source material. To generically capture the img src or a href bit (this assumes things such as double quotes being used, for example, and is more brittle than a parsing solution):

/<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/i

Then, if it matches, the image URL is in $1, or the link URL is in $2.
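As a rough sketch of how the captures might be stored, the same regex can be run with the /g modifier in a loop so that every match in a page is collected, not just the first. The helper name extract_urls, the two arrays, and the example.com link are illustrative assumptions; only the regex and the nyt.com image URL come from the post:

```perl
use strict;
use warnings;

# Hypothetical helper: returns two array refs, image URLs and link URLs.
sub extract_urls {
    my ($html) = @_;
    my (@images, @links);

    # /g resumes after the previous match on each iteration; /i ignores tag case.
    while ($html =~ /<img[^>]*?src="([^"]*)"|<a[^>]*?href="([^"]*)"/gi) {
        if (defined $1) {
            push @images, $1;   # the <img ... src="..."> alternative matched
        }
        else {
            push @links, $2;    # the <a ... href="..."> alternative matched
        }
    }
    return (\@images, \@links);
}

my $html = <<'HTML';
<img src="http://i1.nyt.com/images/2014/03/23/us/23marriage2/23marriage2-largehorizontal375.jpg" alt="photo">
<a class="story" href="http://example.com/story.html">Read more</a>
HTML

my ($images, $links) = extract_urls($html);
print "image: $_\n" for @$images;
print "link:  $_\n" for @$links;
```

Only one of the two capture groups takes part in any given match, so the other is undef; checking defined $1 is what tells an image from a link. Running this over the whole page as one string, rather than line by line, also copes with tags that wrap onto a second line.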
