Scraping the-numbers.com using BeautifulSoup4

In this post I want to detail how I went about scraping a specific website, and how you can use some of what I learned to grab data from a website of your choice. The site is the-numbers.com, and the scraped data is what the post in my portfolio on predicting worldwide movie gross relies on. The full Jupyter notebook will be linked at the bottom of this post, but I will explain snippets of the code along the way. Here’s a look at the section we want to pull data from:

[Image: film_scrape]

In addition to the Production Budget, Domestic Gross and Worldwide Gross, there are features about each movie I wanted to collect, which are contained within each individual movie sub page. First, let’s open the URL and read it into a soup object. The BeautifulSoup4 documentation covers the library in more detail.

# Import relevant libraries (Python 2: urllib2 and xrange do not exist in Python 3)
import urllib2
from bs4 import BeautifulSoup
import re
from pandas import DataFrame
import sys
sys.setrecursionlimit(2000)
import unidecode
import unicodedata

After inspecting the page using Ctrl+Shift+I in Chrome, we see that the data we want is contained in <td> tags. First we download the page:

response = urllib2.urlopen('http://www.the-numbers.com/movie/budgets/all')
main_doc = response.read()
  • Line 1: open the url using urllib2.
  • Line 2: read the html into the main_doc variable.
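
urllib2 exists only in Python 2. For readers on Python 3, a rough equivalent of the two lines above might look like the sketch below. The helper name fetch_html and its timeout default are illustrative choices, not taken from the notebook.

```python
from urllib.request import urlopen  # Python 3 home of urllib2.urlopen


def fetch_html(url, timeout=30):
    """Return the raw bytes served at url (hypothetical helper)."""
    # The context manager closes the connection once the body is read.
    with urlopen(url, timeout=timeout) as response:
        return response.read()


# Usage (performs a real network request, so it is left commented out):
# main_doc = fetch_html('http://www.the-numbers.com/movie/budgets/all')
```

The function returns bytes, just as response.read() does in the original code, so the result can be handed to BeautifulSoup unchanged.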

The batch variable (the list of <td> tags that soup.find_all('td') returns in the function further down) looks like this:

<td class="data">1</td>,
 <td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>,
 <td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>,
 <td class="data">$425,000,000</td>,
 <td class="data">$760,507,625</td>,
 <td class="data">$2,783,918,982</td>,
 <td class="data">2</td>,
 <td><a href="/box-office-chart/daily/2015/12/18">12/18/2015</a></td>,
 <td><b><a href="/movie/Star-Wars-Ep-VII-The-Force-Awakens#tab=summary">Star Wars Ep. VII: The Force Awakens</a></b></td>,
 <td class="data">$306,000,000</td>,
 <td class="data">$936,662,225</td>,
 <td class="data">$2,058,662,225</td>
...

It is like this for every movie on the page, and we can see that the link to the movie sub page is contained within an <a> tag inside a <td>, while the Production Budget, Domestic Gross and Worldwide Gross each sit in their own <td> tag.
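
To make the six-cells-per-movie layout concrete, here is a minimal Python 3 sketch that parses the Avatar row shown above from an inline string instead of the live site. The helper name rows_from_cells and the wrapping table markup are assumptions for illustration; only the <td> cells themselves come from the page.

```python
from bs4 import BeautifulSoup

# One movie's worth of cells, copied from the batch output above.
# The surrounding <table>/<tr> markup is assumed for illustration.
SAMPLE_HTML = """
<table><tr>
<td class="data">1</td>
<td><a href="/box-office-chart/daily/2009/12/18">12/18/2009</a></td>
<td><b><a href="/movie/Avatar#tab=summary">Avatar</a></b></td>
<td class="data">$425,000,000</td>
<td class="data">$760,507,625</td>
<td class="data">$2,783,918,982</td>
</tr></table>
"""


def rows_from_cells(html, cells_per_movie=6):
    """Group <td> cells into one list per movie and append the sub page link."""
    soup = BeautifulSoup(html, "html.parser")
    cells = soup.find_all("td")
    rows = []
    # Step through the flat list of cells one complete movie at a time.
    for start in range(0, len(cells) - cells_per_movie + 1, cells_per_movie):
        group = cells[start:start + cells_per_movie]
        row = [cell.get_text() for cell in group]
        # The third cell (index 2) holds the link to the movie sub page.
        link = group[2].find("a")
        row.append(link.get("href") if link is not None else None)
        rows.append(row)
    return rows


rows = rows_from_cells(SAMPLE_HTML)
print(rows[0])
```

Prefixing the last element of each row with 'http://www.the-numbers.com' gives the full sub page URL.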

Next we want to build a helper function which will be called later on to get the Rotten Tomatoes rating.

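def solver(list_of_variables, mpaaRating):
    # mpaaRating is the flat list of cell texts from the movie sub page;
    # each label in list_of_variables is looked up in that list.
    list_to_append = []
    for category in list_of_variables:
        try:
            if category == 'Rotten Tomatoes':
                # keep the two cells that start three positions after the label
                rating_index = mpaaRating.index(category)+3
                rating = mpaaRating[rating_index:rating_index+2]
                list_to_append.extend([rating])
            else:
                # for every other label the value is the next cell
                category = mpaaRating[mpaaRating.index(category)+1]
                list_to_append.append(category)
        except (ValueError, AttributeError):
            # label not found on this page: record a missing value
            list_to_append.append(None)
    return list_to_append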

Later on, we’ll see that the Rotten Tomatoes rating has to be dealt with separately and we’ll call this function on each entry to get the ratings and append them to our data list.
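
The core idea in solver is that each value sits at a fixed offset from its label in a flat list of cell texts. A stripped-down Python 3 sketch of that lookup is below. The helper name value_after and the sample cells are hypothetical, and unlike solver it also catches IndexError for a label that is the last cell.

```python
def value_after(label, cells, offset=1):
    """Return the cell found `offset` positions after `label`, or None."""
    try:
        return cells[cells.index(label) + offset]
    except (ValueError, IndexError):
        # ValueError: the label is absent; IndexError: nothing follows it.
        return None


# Hypothetical cell texts, in the order they might appear on a sub page.
cells = ["Genre:", "Action", "Running Time:", "120 minutes"]

print(value_after("Genre:", cells))
print(value_after("MPAA Rating:", cells))
```

Returning None for a missing label keeps every row the same length, which matters once the rows are turned into a DataFrame.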

def txt_link_downloader(html_link):    
    soup = BeautifulSoup(html_link, 'html.parser')
    batch = soup.find_all('td')
    list_df = []    
    counter = 0
    for index,i in enumerate(xrange(0,len(batch),6)):
        list_df.append(map(lambda x: x.get_text(), batch[i:i+6]))        
        url_end = BeautifulSoup(batch[i+2].encode('utf-8'),'html.parser').find('a').get('href') 
        url = 'http://www.the-numbers.com' + url_end
        list_df[index].append(url)    
    
        response = urllib2.urlopen(url)
        main_doc = response.read()
        soup = BeautifulSoup(main_doc,'html.parser')  
  
        mpaaRating = []
        for tr in soup.findAll('tr'): 
            for td in tr.findAll('td'): 
                mpaaRating.append(td.get_text())
        mpaaRating = [unidecode.unidecode(x).strip() for x in mpaaRating]   
        
        list_of_variables = ['Genre:','Running Time:','MPAA Rating:','Production Companies:',
                             'Domestic Releases:','Domestic DVD Sales','Domestic Blu-ray Sales',
                             'Total Domestic Video Sales','Rotten Tomatoes']
        
        second_page = solver(list_of_variables,mpaaRating)
        list_df[index].extend(second_page)
        
        response = urllib2.urlopen(url)
        main_doc = response.read()
        soup = BeautifulSoup(main_doc,'html.parser')
        try:
            soup = soup.find(text = re.compile('Weekend Box Office Performance')).parent.parent.find('div', attrs = {"id": "box_office_chart"})
            soup = soup.get_text()
            soup = unicodedata.normalize('NFKD', soup).encode('utf-8').split()[4:35]
            soup.insert(3,'None')
            list_df[index].extend(soup)
        except AttributeError:  # this movie has no weekend box office section
            pass
        
        counter += 1
        #sets upper limit, max is 5230 as of 10/9/2016
        if counter == 2000:
            return DataFrame(list_df)
    # fewer movies than the limit: return everything that was collected
    return DataFrame(list_df)

list_df = txt_link_downloader(main_doc)
  • Line 2 (counting from the def line): turn the HTML into a soup object so that we can search for tags and extract data.
  • Line 3: find all <td> tags, since we saw from inspecting the webpage that the necessary information is contained within these tags.
  • Line 6: we notice from the batch variable that there are 6 <td> tags per movie, so we loop through batch in steps of 6 to extract the data.
  • Line 7: append the text of the current group of 6 <td> tags to list_df as a new entry, creating a list of lists. The get_text() function is what pulls this information.
  • Lines 8-10: we can also see that the third <td> for each movie (index 2) contains the URL for the movie sub page, so we build the full URL in order to get the rest of the data.
  • Lines 12-14: read in the new URL.
  • Lines 16-20: we want to pull the MPAA rating data, which is in a <td> within a <tr>. This data was in Unicode and causing issues, so we use the unidecode function to convert it to ASCII.
  • Lines 22-27: use the solver function from above to return the list. One additional problem we see here is that both the critics and audience ratings are in the same entry.
  • Lines 29-39: try to get the text of the weekend box office section, then normalize and encode the values. Split to get the values we need.
  • Lines 41-44: I set a limit of 2000 movies to make sure I was getting the right data, but this isn’t necessary.
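
The unidecode package is a third-party dependency. If accented letters and non-breaking spaces are the only trouble, the standard library can get close. Below is a minimal Python 3 sketch with a hypothetical helper name, to_ascii. Unlike unidecode, it simply drops any character that has no ASCII decomposition, so it is not a drop-in replacement.

```python
import unicodedata


def to_ascii(text):
    """Approximate a Unicode string in ASCII (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks
    # and maps compatibility characters such as the non-breaking space.
    decomposed = unicodedata.normalize('NFKD', text)
    # Drop whatever still is not ASCII, then trim surrounding whitespace.
    return decomposed.encode('ascii', 'ignore').decode('ascii').strip()


print(to_ascii("Am\u00e9lie"))
print(to_ascii("PG\u00a013 "))
```

Applying to_ascii to each cell text plays the same role as the list comprehension with unidecode.unidecode(x).strip() in the function above.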

The full code can be seen on my GitHub; the prediction file takes this data and uses it to build models that predict worldwide gross. I will update the repo with a Jupyter notebook at some point so that it is easy to follow along.
