
Web Scraping Tutorial: Python and BeautifulSoup

8 September, 2020

Introduction

Web scraping lets us quickly scan through a webpage, extract information, and store it for later use. In this tutorial, you will (hopefully) learn how to access and parse HTML content using Python 3 and the requests and BeautifulSoup libraries. I will be using Sublime Text (https://www.sublimetext.com/3) and the Terminal, but you can use any text editor or environment (IDLE, Visual Studio, Jupyter Notebook).

From the IMDb list Top 25 Movies of All Time (www.imdb.com/list/ls024149810/), we will extract the movie titles, genres, ratings, and runtimes.


Prerequisites (If not already installed)

  1. Install Python version 3.7 (https://www.python.org/downloads/release/python-370/)
  2. Install BeautifulSoup4 and requests libraries

Installing Dependencies

The requests library sends an HTTP request to the web server so you can download the HTML content of a webpage. The BeautifulSoup library lets you parse that HTML. These libraries aren’t included in Python’s standard library, so install them using pip. Launch the Terminal and enter the following pip command:

pip install requests beautifulsoup4

Getting Started

Create a new Python file named top_movies.py and save it to your Desktop. To run the file while editing in Sublime, enter the following command into the Terminal:

python ~/Desktop/top_movies.py

At the top of top_movies.py, import the libraries needed to connect to the webpage.

import requests
from bs4 import BeautifulSoup

Pass the URL to requests.get(). The get() method downloads the page and returns a response object that holds the HTML.

url = 'https://www.imdb.com/list/ls024149810/'
r = requests.get(url)

*Tip: If the following command prints 200, you have successfully accessed the webpage.*

print(r.status_code)
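
Printing the status code is fine for a quick check. As a slightly more defensive variation, the sketch below wraps the request in a small helper that raises an error on a failed response instead of carrying on with a bad page. The helper name fetch_html, the 10-second timeout, and the get parameter (included only so the function can be tried without a network connection) are assumptions for illustration, not part of the original tutorial.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Download a page and return its HTML as text.

    The name, the timeout value, and the `get` parameter are
    illustrative choices, not part of the original tutorial.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so a failed download is not mistaken for a valid page.
    response.raise_for_status()
    return response.text
```

With this helper, fetch_html(url) could stand in for the requests.get(url) call above, and the returned text can be passed to BeautifulSoup in the same way as r.content.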

Creating a BeautifulSoup object allows us to parse the HTML content.

soup = BeautifulSoup(r.content, 'html.parser')
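
To get a feel for what a BeautifulSoup object can do before working on the live page, here is a minimal sketch that parses a small hand-written HTML string. The sample markup and the variable names sample_html and sample_soup are made up for this illustration.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page, used here instead of a live download.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h3 class="lister-item-header"><a href="/title/1/">The Godfather</a></h3>
  </body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns the first matching tag, or None if nothing matches.
page_title = sample_soup.find('title').get_text()
first_link = sample_soup.find('a', href=True)

print(page_title)             # Sample Page
print(first_link['href'])     # /title/1/
print(first_link.get_text())  # The Godfather
```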

Inspecting and extracting data

Understanding the basics of HTML will help you succeed at web scraping. If you are new to HTML, check out w3schools (https://www.w3schools.com/html/html_intro.asp).

To view the source code, right-click anywhere on the webpage and click on “Inspect Element” (Safari browser). This lets you see all the images, links, and CSS that make up the site.

You should see this console pop-up.

Image of the console

The red arrow above points to a button that lets you click an individual element on the page and highlights its corresponding code in the console.

Safari browser button
Google Chrome browser button

Movie Titles

When you click around the console, you will discover that each tag contains a specific string or further nested tags. Each movie item is contained in a <div> tag with class="lister-item-content".

The highlighted tag is the container for each movie item

The <h3> tag is the container that holds the movie title.

Highlighted shows the header tag

The <a> tag holds the string we want to extract, The Godfather.

Highlighted shows the tag that displays The Godfather

Create a for loop over soup.findAll('h3', {'class':'lister-item-header'}), which returns a list of every <h3> tag that stores a movie title. I highlighted the movie titles to make them easier to read.

Scraped data from <h3> tag

Next, inside the loop, call title.find('a', href=True).get_text() on each <h3> tag to extract the title string. Create an empty list before the loop and append each string to it.

movie_title = []
for title in soup.findAll('h3', {'class':'lister-item-header'}):
    # The <a> tag inside each <h3> holds the movie title
    titles = title.find('a', href=True).get_text()
    movie_title.append(titles)
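
If you want to try the loop without downloading the page, the sketch below runs the same logic against a hand-written HTML string. The sample markup only imitates the div, h3, and a structure described above and is not IMDb's actual source. It also uses find_all, which is the current name for findAll in BeautifulSoup 4, and skips any header that has no link; both are optional variations rather than the tutorial's own code.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described above.
sample_html = """
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <span>1.</span>
    <a href="/title/a/">The Godfather</a>
  </h3>
</div>
<div class="lister-item-content">
  <h3 class="lister-item-header">
    <span>2.</span>
    <a href="/title/b/">The Shawshank Redemption</a>
  </h3>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

movie_title = []
for header in sample_soup.find_all('h3', {'class': 'lister-item-header'}):
    link = header.find('a', href=True)
    if link is None:
        # Skip headers without a link instead of crashing on None.
        continue
    movie_title.append(link.get_text().strip())

print(movie_title)  # ['The Godfather', 'The Shawshank Redemption']
```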

Next, modify the code above to extract the ratings.

Ratings

In the screenshot, I have highlighted the tag and class for you. Since the ratings are decimals, convert each one to a float before storing it in a list named rating_list (the name used later when building the DataFrame).
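
As a hint for the float conversion, here is a minimal sketch on a hand-written HTML string. The span tag and the class name rating-placeholder are placeholders invented for this example; replace them with the tag and class you find in your browser's inspector.

```python
from bs4 import BeautifulSoup

# Hand-written sample. The tag and class are placeholders, not IMDb's
# real markup; use the ones highlighted in your inspector instead.
sample_html = """
<span class="rating-placeholder"> 9.2 </span>
<span class="rating-placeholder">9.3</span>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

rating_list = []
for rating in sample_soup.find_all('span', {'class': 'rating-placeholder'}):
    # get_text() returns a string, so convert it before storing.
    rating_list.append(float(rating.get_text().strip()))

print(rating_list)  # [9.2, 9.3]
```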

Genres

Genre is defined using <span> tag with class ‘genre’

There are multiple ways of writing code to extract information. Here is an example of extracting genres without using multiple tags. We can specify the exact tag and class that contains ‘Crime, Drama’ and extract the string.

genre_list = []
for genre in soup.findAll('span', attrs={'class':'genre'}):
    # strip() removes the whitespace around the genre text
    genre_text = genre.get_text()
    genre_list.append(genre_text.strip())

Next, modify the code above to extract the runtimes into a list named runtime_list.

Runtime

Runtime is defined using <span> with class ‘runtime’
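
One possible way to adapt the genre loop for runtimes is sketched below on a hand-written HTML string. The sample text format ("175 min") is an assumption about how the runtime appears on the page, and the integer conversion at the end is an optional extra that the tutorial does not require.

```python
from bs4 import BeautifulSoup

# Hand-written sample using the span tag and 'runtime' class noted above.
sample_html = """
<span class="runtime">175 min</span>
<span class="runtime"> 142 min </span>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

runtime_list = []
for runtime in sample_soup.find_all('span', attrs={'class': 'runtime'}):
    runtime_list.append(runtime.get_text().strip())

# Optional: keep only the number of minutes as an integer.
runtime_minutes = [int(text.split()[0]) for text in runtime_list]

print(runtime_list)     # ['175 min', '142 min']
print(runtime_minutes)  # [175, 142]
```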

Saving it as a DataFrame

A pandas DataFrame makes it easy to organize and analyze the scraped data. We will pass the lists to it as a dictionary.

Import the pandas library. It is not part of Python’s standard library, so install it first with pip install pandas if needed.

import pandas as pd

The dictionary keys become the column names, and each list holds the values for that column.

topmovies = pd.DataFrame({
    'Movie Title': movie_title,
    'Rating': rating_list,
    'Genre': genre_list,
    'Runtime': runtime_list
})
print(topmovies)

Once you have run your program, this is what the output should look like:

Scraped top movie data from IMDb
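
The DataFrame constructor raises a ValueError when the lists have different lengths, which can happen if one movie is missing a rating or runtime on the page. The sketch below builds the same kind of DataFrame from short sample lists and checks the lengths first so the mismatch is easier to spot. The two-row sample values and the columns and lengths names are illustrative only.

```python
import pandas as pd

# Short sample lists standing in for the scraped data.
movie_title = ['The Godfather', 'The Shawshank Redemption']
rating_list = [9.2, 9.3]
genre_list = ['Crime, Drama', 'Drama']
runtime_list = ['175 min', '142 min']

columns = {
    'Movie Title': movie_title,
    'Rating': rating_list,
    'Genre': genre_list,
    'Runtime': runtime_list,
}

# Report which list is short before pandas raises its own error.
lengths = {name: len(values) for name, values in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f'Column lengths differ: {lengths}')

topmovies = pd.DataFrame(columns)
print(topmovies.shape)  # (2, 4)
```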

Important Notes

1. Read through the website’s Terms and Conditions to check whether scraping data from the webpage is permitted.

2. Sending requests too quickly can overload the website’s server, so pause between downloads.
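
One simple way to slow a scraper down is to pause between requests. The sketch below is a minimal illustration of that idea; the helper name fetch_all, the fetch argument, and the one-second default delay are assumptions chosen for this example, and a suitable delay depends on the site.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive calls.

    `fetch` is any function that takes a URL and returns a result.
    The one-second default is an arbitrary illustrative value.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

For example, fetch_all(list_of_urls, lambda u: requests.get(u).content) would download each page in turn with a pause in between, where list_of_urls is a list you define yourself.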

python

tutorial

webscraping

beautifulsoup

Kim Dang

San Francisco, CA, USA

Fullstack Engineer | Javascript(React, Node, Express), MongoDB, SQL | Passionate about building innovative solutions that push boundaries and make a positive social impact
