python - Pin down exact content location in HTML for web scraping with urllib2 and BeautifulSoup
I'm new to web scraping and have little exposure to HTML file structures, and I wanted to know if there is a better, more efficient way to search for the required content in the HTML version of a web page. Currently, I want to scrape the reviews for the product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2

from bs4 import BeautifulSoup

url = "http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem"
review_url = url
#print review_url

#-------------------------------------------------------------------------
# Scrape ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True

while (more):
    # Open the url for review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()

    content = page.read()
    #print content
    soup = BeautifulSoup(content)

    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, I'm not able to read anything. I'm guessing I have to give it a more accurate destination. Any suggestions?
I understand the question is about BeautifulSoup, but since I haven't had success using it in this particular situation, I suggest taking a look at Selenium.
Selenium uses a real browser, so you don't have to deal with parsing the results of AJAX calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()
It prints:
Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
Clorox Wipes 5 out of 5
...
Also, take into account that you can use the headless PhantomJS browser (example); a sketch of that follows.
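Here's a minimal sketch of the same scrape with PhantomJS, assuming the phantomjs binary is installed and on your PATH; the class names are the same Bazaarvoice ones used above:

from selenium import webdriver

# PhantomJS runs headlessly, so no browser window is opened
driver = webdriver.PhantomJS()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

# Same element lookups as the Firefox version above
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.quit()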
Another option is to make use of the Walmart API.
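For instance, a rough sketch of an API call with urllib2 and json is below; note that the endpoint URL, the apiKey parameter, and the JSON field names here are illustrative assumptions, so check Walmart's developer documentation for the actual Reviews API contract:

import json
import urllib2

item_id = '29701960'
api_key = 'YOUR_API_KEY'  # hypothetical placeholder; obtain a real key from Walmart

# Illustrative endpoint shape -- verify against the official API docs
url = 'http://api.walmartlabs.com/v1/reviews/%s?apiKey=%s&format=json' % (item_id, api_key)

data = json.load(urllib2.urlopen(url))
for review in data.get('reviews', []):
    # Field names are assumptions for the sake of the sketch
    print review.get('title'), review.get('overallRating')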
Hope that helps.