Seemingly simple Python regex does not match
I'm using the find_all function of BeautifulSoup (Python) with a regex to scrape data off a webpage. More specifically, I'm scraping individual classified ads here. If you inspect each classified ad, you can see that it is typically encapsulated in either of the following divs:
<div class="item c-b-#">...</div>
or
<div class="item c-b-# premium">...</div>
where # is a number (typically 0 or 2).
My goal here is to tell these two apart using a regex. Here's what I've done:
regularads = soup.find_all('div', attrs={'class': re.compile('item.*')})
and
premiumads = soup.find_all('div', attrs={'class': re.compile('item.*premium')})
The former works as expected: it returns all classifieds (including premium). The latter returns nothing. What's wrong with it? Why doesn't 'item.*premium' map to the second div class?
As a secondary question: how do I alter the first regex to say "I want it to have the word 'item' but not the word 'premium'"?
Edit
For future reference: after a little trial and error, the answer to my secondary question became:
regularads = [tag for tag in soup.find_all('div', attrs={'class': re.compile('item')}) if 'premium' not in tag['class']]
which worked nicely.
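For the regex-only form of the secondary question, a negative lookahead is one possible sketch. It assumes the pattern is applied to the whole, unsplit class string; the sample strings and the helper name is_regular_ad are made up for illustration and are not part of BeautifulSoup.

```python
import re

# Hypothetical raw class strings, modeled on the two div variants in the question.
regular = "item c-b-0"
premium = "item c-b-2 premium"

# Negative lookahead: reject any string containing the word 'premium',
# then require the word 'item' somewhere in it.
item_not_premium = re.compile(r'^(?!.*\bpremium\b).*\bitem\b')

def is_regular_ad(class_string):
    """Return True if the class string has 'item' but not 'premium' (illustrative helper)."""
    return item_not_premium.search(class_string) is not None

print(is_regular_ad(regular))
print(is_regular_ad(premium))
```

Passing this pattern straight to find_all would probably not help: if BeautifulSoup tests each class token on its own, the lone token 'item' satisfies the pattern even on premium ads. That is why filtering on tag['class'], as in the list comprehension above, is the simpler route.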
My quick guess is that the class in Beautiful Soup is the result of calling class.split(' ') on the actual text of the class attribute. If you do:
premiumads = soup.find_all('div', attrs={'class': 'premium'})
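The guess above can be illustrated without BeautifulSoup at all. The following is only a simulation with the standard library, assuming the class attribute is split on spaces and the regex is tried against each token separately; the sample attribute value is taken from the premium div in the question, and actual BeautifulSoup behaviour may differ between versions.

```python
import re

# Class attribute value modeled on the premium div in the question.
class_attr = "item c-b-2 premium"

# Assumption: the attribute is stored as a list of space-separated tokens.
tokens = class_attr.split(' ')

pattern = re.compile('item.*premium')

# No single token contains both 'item' and 'premium', so a per-token search finds nothing.
per_token_hits = [t for t in tokens if pattern.search(t)]

# The same pattern does match the unsplit attribute string.
whole_string_hit = pattern.search(class_attr) is not None

# A plain 'premium' pattern matches one token on its own.
premium_hits = [t for t in tokens if re.search('premium', t)]

print(tokens)
print(per_token_hits)
print(whole_string_hit)
print(premium_hits)
```

Under that assumption, 'item.*premium' can never succeed against an individual token, while matching on 'premium' alone can.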