python - Pin down exact content location in HTML for web scraping with urllib2 and BeautifulSoup
I'm new to web scraping and have little exposure to HTML file structures, and I wanted to know if there is a better, more efficient way to search for the required content in the HTML version of a web page. Currently, I want to scrape the reviews for the product here: http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem
For this, I have the following code:
import re
import sys
import urllib2

from bs4 import BeautifulSoup

url = "http://www.walmart.com/ip/29701960?wmlspartner=wlpa&adid=22222222227022069601&wl0=&wl1=g&wl2=c&wl3=34297254061&wl4=&wl5=pla&wl6=62272156621&veh=sem"
review_url = url
#print review_url

#-------------------------------------------------------------------------
# Scrape ratings
#-------------------------------------------------------------------------
page_no = 1
sum_total_reviews = 0
more = True

while (more):
    # Open the url for review data
    request = urllib2.Request(review_url)
    try:
        page = urllib2.urlopen(request)
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print 'Failed to reach url'
            print 'Reason: ', e.reason
            sys.exit()
        elif hasattr(e, 'code'):
            if e.code == 404:
                print 'Error: ', e.code
                sys.exit()

    content = page.read()
    #print content
    soup = BeautifulSoup(content)

    results = soup.find_all('span', {'class': re.compile(r's_star_\d_0')})
With this, I'm not able to read anything. I'm guessing I have to give it a more accurate destination. Any suggestions?
I understand the question is about BeautifulSoup, but since I haven't had success using it in this particular situation, I suggest taking a look at Selenium.
Selenium uses a real browser, so you don't have to deal with parsing the results of AJAX calls. For example, here's how you can get the list of review titles and ratings from the first reviews page:
from selenium.webdriver.firefox import webdriver

driver = webdriver.WebDriver()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.close()
It prints:
Renee Culver loves Clorox Wipes 5 out of 5
Men at work 5 out of 5
Clorox Wipes 5 out of 5
...
Also, take into account that you can use the headless PhantomJS browser (example); a sketch of that follows.
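Here's a minimal sketch of the same scrape with PhantomJS, assuming the phantomjs binary is installed and on your PATH; the class names are the same Bazaarvoice ones used above:

from selenium import webdriver

# PhantomJS runs headlessly, so no browser window is opened
driver = webdriver.PhantomJS()
driver.get('http://www.walmart.com/ip/29701960?page=seeAllReviews')

# Same element lookups as the Firefox version above
for review in driver.find_elements_by_class_name('BVRRReviewDisplayStyle3Main'):
    title = review.find_element_by_class_name('BVRRReviewTitle').text
    rating = review.find_element_by_xpath('.//div[@class="BVRRRatingNormalImage"]//img').get_attribute('title')
    print title, rating

driver.quit()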
Another option is to make use of the Walmart API.
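For instance, a rough sketch of an API call with urllib2 and json is below; note that the endpoint URL, the apiKey parameter, and the JSON field names here are illustrative assumptions, so check Walmart's developer documentation for the actual Reviews API contract:

import json
import urllib2

item_id = '29701960'
api_key = 'YOUR_API_KEY'  # hypothetical placeholder; obtain a real key from Walmart

# Illustrative endpoint shape -- verify against the official API docs
url = 'http://api.walmartlabs.com/v1/reviews/%s?apiKey=%s&format=json' % (item_id, api_key)

data = json.load(urllib2.urlopen(url))
for review in data.get('reviews', []):
    # Field names are assumptions for the sake of the sketch
    print review.get('title'), review.get('overallRating')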
Hope that helps.