Seemingly simple Python regex does not match

- February 15, 2010

I'm using BeautifulSoup's (Python) find_all function with a regex to scrape data off a webpage. Quite specifically, I'm scraping individual classified ads here. If you inspect each classified ad, you can see that it is typically encapsulated in either of the following divs:

<div class="item c-b-#">...</div>

<div class="item c-b-# premium">...</div>

where # is a number (typically 0 or 2).

My goal here is to tell these two apart using a regex. Here's what I've done:

regularads = soup.find_all('div', attrs={'class': re.compile('item.*')})

and

premiumads = soup.find_all('div', attrs={'class': re.compile('item.*premium')})

The former works as expected: it returns all classifieds (including premium). The latter returns nothing. What's wrong with it? Why doesn't 'item.*premium' match the second div class?

As a secondary question: how do I alter the first regex to say "I want to have the word 'item' but not the word 'premium'"?

Edit

For future reference: after a little trial and error, the answer to the secondary question became:

regularads = [tag for tag in soup.find_all('div', attrs={'class': re.compile('item')}) if 'premium' not in tag['class']]

which worked nicely.
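
A minimal sketch of why the membership test works, using plain Python lists as stand-ins for the tag['class'] values. The sample class lists are made up for illustration and no Beautiful Soup is involved; the point is only that `in` on a list compares whole class names rather than substrings:

```python
# Hypothetical class lists, standing in for tag['class'] on three ad divs.
ad_classes = [
    ['item', 'c-b-0'],
    ['item', 'c-b-2', 'premium'],
    ['item', 'c-b-2'],
]

# Keep ads that carry the 'item' class but not the 'premium' class.
regular = [c for c in ad_classes if 'item' in c and 'premium' not in c]

# Keep ads that carry both classes.
premium = [c for c in ad_classes if 'item' in c and 'premium' in c]

print(len(regular), len(premium))
```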

My quick guess is that the class in Beautiful Soup is the result of calling class.split(' ') on the actual text of the class attribute. If you do:

premiumads = soup.find_all('div', attrs={'class': 'premium'})
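
The guess above can be illustrated with the standard re module alone. This is a sketch of the hypothesis, not Beautiful Soup's actual implementation: it assumes the pattern is tried against each class name separately, and whether a given Beautiful Soup version also tries the full attribute string is not something this snippet shows.

```python
import re

# The class attribute of a premium ad, as it appears in the HTML.
class_attr = 'item c-b-2 premium'

# The guess: the attribute is split into individual class names.
tokens = class_attr.split(' ')

pattern = re.compile('item.*premium')

# Tested one class name at a time, the pattern never matches,
# because no single token contains both 'item' and 'premium'.
per_token = [bool(pattern.search(t)) for t in tokens]

# Tested against the whole attribute string, it does match.
whole = bool(pattern.search(class_attr))

print(tokens)     # ['item', 'c-b-2', 'premium']
print(per_token)  # [False, False, False]
print(whole)      # True
```

If the guess holds, a pattern that spans several class names can never succeed on a per-token basis, while a pattern or plain string that targets a single class name such as 'premium' can.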
