Seemingly simple Python regex does not match


I'm using BeautifulSoup's (Python) find_all function with a regex to scrape data off a webpage. Quite specifically, I'm scraping individual classified ads here. If you inspect each classified ad, you can see that it is typically encapsulated in either of the following divs:

<div class="item c-b-#">...</div> 

or

<div class="item c-b-# premium">...</div> 

where # is a number (typically 0 or 2).

My goal here is to tell these two apart using a regex. Here's what I've done:

regularads = soup.find_all('div', attrs={'class': re.compile('item.*')}) 

and

premiumads = soup.find_all('div', attrs={'class': re.compile('item.*premium')}) 

The former works as expected: it returns all classifieds (including premium ones). The latter returns nothing. What is wrong with it? Why doesn't 'item.*premium' match the second div class?

As a secondary question: how do I alter the first regex to say "I want to have the word 'item' but not the word 'premium'"?

Edit

For future reference: after a little trial and error, the answer to the secondary question became:

regularads = [tag for tag in soup.find_all('div', attrs={'class': re.compile('item')}) if 'premium' not in tag['class']]

which worked nicely.
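
The filter above relies on tag['class'] being a list of separate class names rather than one string, so the 'premium' check is a list-membership test and not a substring search. Here is a minimal sketch of that idea on plain lists, which stand in for tag['class']. The sample class lists and the helper name is_regular_ad are made up for illustration, and the helper checks for the exact name 'item' instead of using a regex.

```python
# Each inner list stands in for tag['class'] of one <div>; the values are
# made-up samples, not data from the real page.
sample_class_lists = [
    ["item", "c-b-0"],
    ["item", "c-b-2", "premium"],
    ["item", "c-b-2"],
    ["sidebar"],
]


def is_regular_ad(classes):
    """Return True for an 'item' div that is not marked 'premium'."""
    # Membership is tested per class name, not as a substring search.
    return "item" in classes and "premium" not in classes


regular = [classes for classes in sample_class_lists if is_regular_ad(classes)]
print(regular)
```

Only the two non-premium 'item' lists survive the filter.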

My quick guess is that class in Beautiful Soup is the result of calling class.split(' ') on the actual text of the class attribute. If you do:

premiumads = soup.find_all('div', attrs={'class': 'premium'})

it should return just the premium ads.
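
To illustrate that guess: if the pattern is tried against each class name on its own, 'item.*premium' can never match, because no single name contains both words. The sketch below mimics this with plain re and str.split(). The helper matches_any_class is hypothetical and is not Beautiful Soup's actual code. Some Beautiful Soup versions may additionally try the pattern against the whole attribute string, so results can differ by version.

```python
import re


def matches_any_class(pattern, class_attr):
    """Mimic per-class matching: try the regex on each class name separately."""
    regex = re.compile(pattern)
    return any(regex.search(name) for name in class_attr.split())


premium_div_class = "item c-b-2 premium"  # sample value from the question

# 'item.*' matches the single class name 'item'.
print(matches_any_class("item.*", premium_div_class))         # True

# No single class name contains both 'item' and 'premium'.
print(matches_any_class("item.*premium", premium_div_class))  # False

# A plain 'premium' pattern matches the class name 'premium'.
print(matches_any_class("premium", premium_div_class))        # True
```

Under this assumption, searching for 'premium' alone is enough to pick out the premium divs, which is why the simpler call works.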
