Web-Scraping using Python and BeautifulSoup

How to scrape the data from web pages and save it to a CSV file.

ANKIT PRATAP SINGH
10 min read · Jun 30, 2021

“Data is the new oil.” — Clive Humby


We all live in a time when the internet is rich with enormous amounts of data. Data has become a new fuel. Now the question is, “how do you utilize this fuel?” You can use data for analysis, research, machine learning, artificial intelligence, etc.

Another critical question, if you are a data scientist or machine learning engineer, is how to collect and prepare the data for your projects. If you are working for a company, the company may already have a ready dataset, or you may be asked to build one from various sources. If you are working on your own projects, you have to collect the data from the different sources available on the internet. Usually, not all the data you need is available in one place, formatted for a direct download. Nor is there any guarantee that such readily available data will be suitable for your research.

Web scraping is the process of extracting information from websites. Whether you are a data scientist, an engineer, or anybody who analyzes large amounts of data, the ability to scrape data from the web is a valuable skill to have. Say you find data on the web and there is no direct way to download it; web scraping with Python lets you extract that data into a form that can then be imported and used in various ways. Web scraping is sometimes referred to as web harvesting or web data extraction. Several tools are available for the job, and Python comes with some nice and useful libraries that make it much easier.

“Where there is data smoke, there is business fire.” — Thomas Redman

Beautiful Soup is one of the most helpful Python libraries for this: it is beginner-friendly and easy to understand even at a beginner level of coding. By the end of this article, I hope you will be confident that anyone with beginner-level coding skills can scrape data.

An image showing how web scraping works

Can we scrape data from everywhere?

Before you dive deep into the process of scraping, you should be aware of one thing: scraping makes a website's traffic spike and may even bring down the website's server. Thus, not all websites allow people to scrape them. So how do you know which websites allow it and which do not? You can look at the website's 'robots.txt' file. Simply append "/robots.txt" to the URL you want to scrape, and you will see whether the website host allows you to scrape it.

Let’s take an example of “value.today/robots.txt”

You will find a page like this. Here you can read the site's rules about which parts of the website crawlers are allowed to access and which they are not.
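You can also fetch the robots.txt file from code using the Requests library; a minimal sketch (the rules you see will depend on the site):

import requests

# Fetch and print the site's robots.txt, which lists the paths
# crawlers are allowed or disallowed to visit
robots_url = 'https://www.value.today/robots.txt'
response = requests.get(robots_url)
print(response.text)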

Another way of finding out whether scraping is allowed is to look for a user agreement page; sometimes a website has a terms and conditions page that spells out its policy on web scraping. Some web pages do not allow scraping in general, but if you contact the admins by email and explain that you want to scrape the data for research or study purposes, they may allow you to do so. Otherwise, if none of these conditions holds and you still try to scrape the data, the admins may block your I.P. address on their server, and you will start seeing captcha pages instead of web pages.

How does web scraping work?

Well, it works like a bot: first you write some code and run it; the program sends a request to the web page's server, brings back the content in HTML form, and then you parse that content into CSV or whatever format you want.

Let’s take an example here:

We want to scrape data regarding the world’s top 500 companies by market capitalization.

For this, I have selected the web page value.today. This web page contains data regarding companies worldwide, and they regularly update the data on their page. They provide their data for study and research purposes. For more details, you can visit here.

There are many ways to scrape data. Python alone has dozens of libraries for web scraping, but the most important ones are Requests, Beautiful Soup, lxml, Selenium, and Scrapy. Everyone should learn Requests, because it is the library that lets you communicate with web servers; the rest depend on your use case:

  1. Beautiful Soup: Beautiful Soup is a vital addition to your data science toolkit. It is simple and easy to use, yet powerful enough that you can scrape data after just a few hours of practice. Its simplicity is definitely its greatest strength.
  2. lxml: lxml is a high-performance, production-quality HTML and XML parsing library. You can rely on it no matter what web page you are scraping. lxml is faster than Beautiful Soup and is widely adopted in the data science industry.
  3. Selenium: Some web pages use JavaScript to render their content; on those pages you may need to fill in forms or select options from a dropdown menu before the content appears. For such pages you need a more powerful tool than Requests: Selenium. Selenium is a tool that automates browsers and is sometimes called a web driver.
  4. Scrapy: Scrapy is broader than the options above; technically it is not even a library but a complete web scraping framework. You can use Scrapy to manage requests, preserve user sessions, follow redirects, and handle output pipelines.

For more information about these libraries, you can visit: 5 Tasty Python Web Scraping Libraries.

Well, let's start our journey with the example above:

We will use the Requests and Beautiful Soup libraries for our project, and we will finish the work in three steps:

  1. Downloading the web page using the Requests library,
  2. Inspecting and extracting the data from the web page using the Beautiful Soup library,
  3. Writing a parser function in Python that saves all the extracted data into a CSV file.

Step 1:

We will start our journey by installing and importing all the required libraries into our Jupyter notebook. You can install a library from within the notebook using the code below:
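In a Jupyter notebook the install cell looks roughly like this (the leading "!" runs a shell command inside the notebook):

# Install the libraries used in this project
!pip install requests beautifulsoup4 pandas --quiet

# Import everything we will need later on
import requests
from bs4 import BeautifulSoup
import pandas as pd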

Then we download the web page using the Requests library. Requests makes a connection between your notebook (or whatever machine you are writing the code on) and the web page's server: it sends a request to the server and brings back the page's content in HTML format, the format web pages are generally written in. We save all the HTML content in a variable so that we can reuse it whenever we need to. The code looks like this:

# Download the listing page and keep the server's response in a variable
url = 'https://www.value.today/world/world-top-500-companies'
response = requests.get(url)

There is also another way to check whether the web page is open to web scraping or not: we add a new code cell and check the response code for the web page.
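The check itself is a single line in a new cell:

print(response.status_code)   # the HTTP status code returned by the server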

If this status_code is between 200 and 299, the request succeeded and the page you have chosen is fine to work with; otherwise it is not. The status_code is a Hypertext Transfer Protocol (HTTP) response status code. Status codes are issued by a server in response to a client's request made to the server. You can see a complete guide to these status codes here.

Now that we have downloaded the web page into our notebook, it's time to move on to step 2.

Step 2:

Now we will inspect our web page. HTML uses tags for each element, and our downloaded web page is HTML code as well, so we will fetch the data from the page using the HTML tag of each attribute we care about. To find the HTML tag for any element on the page, go to that element, right-click on it, and click the Inspect button in the menu that appears; a new window pops up on the left, bottom, or right side of your current window, like the following one,

By now half of our work is finished: we have downloaded the web page and we know the HTML tags for each attribute. Now it's time to collect the data from the web page using these tags. Have a look at the code below:
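Here the tag and class names are placeholders; the real ones should be read off the inspect window for the live page:

# Parse the downloaded HTML and grab the repeating block that holds
# each company's details ('company-block' is a placeholder class name)
doc = BeautifulSoup(response.text, 'html.parser')
company_block_tags = doc.find_all('li', class_='company-block')
print(len(company_block_tags))   # number of company blocks found on this page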

In the above cell, that's how the data looks inside an HTML tag. I chose to fetch the company block tag because there are 500 companies on the web page and each company's data lies inside a single block; that's why I fetched the block tag for each company, and next I'll fetch all the required data for each attribute from its respective tag.
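The extraction itself looks roughly like this, again with placeholder class names standing in for the real ones on the page:

# One list per attribute; each loop iteration appends one company's values
ranks, headquarters, ceo_names, employee_counts, market_caps = [], [], [], [], []

for block in company_block_tags:
    # Each field is assumed to live in an element with the given
    # (placeholder) class; a missing field is recorded as None
    rank_tag = block.find('div', class_='rank')
    ranks.append(rank_tag.text.strip() if rank_tag else None)

    hq_tag = block.find('div', class_='headquarters')
    headquarters.append(hq_tag.text.strip() if hq_tag else None)

    ceo_tag = block.find('div', class_='ceo')
    ceo_names.append(ceo_tag.text.strip() if ceo_tag else None)

    emp_tag = block.find('div', class_='employees')
    employee_counts.append(emp_tag.text.strip() if emp_tag else None)

    cap_tag = block.find('div', class_='market-cap')
    market_caps.append(cap_tag.text.strip() if cap_tag else None)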

In the above code cell, I fetched the rank, headquarters location, CEO name, total number of employees, and market capitalization from each company's block tag.

That's great, we have collected the data from the web page, and now it's time to parse it into a CSV file using the Pandas library. First we save all the collected data into a Python dictionary, in the order in which we want it to appear in the CSV file.
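A sketch of that dictionary, built from the lists collected above (the key names are one possible choice of column headings):

# Keys become column names in the CSV; values are the lists built above
companies_dict = {
    'Rank': ranks,
    'Headquarters': headquarters,
    'CEO': ceo_names,
    'Employees': employee_counts,
    'Market Capitalization': market_caps
}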

This is how the dictionary is created. We now build a pandas data-frame from it and write the data to a CSV file using the df.to_csv() function; with that, we have collected the data from the web page we selected.
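In code, roughly (the file name is arbitrary):

# Build a data-frame from the dictionary and write it out as a CSV file
companies_df = pd.DataFrame(companies_dict)
companies_df.to_csv('top-500-companies.csv', index=False)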

Step 3:

So far we have only collected data from a single web page. Let's go deeper into web scraping and write some helper functions that automate the work: instead of writing code for a single web page, let's write code that extracts the data from several web pages of almost the same kind.
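A minimal sketch of such a helper:

def get_page(url):
    # Download a page and return it as a Beautiful Soup object,
    # raising an error if the request did not succeed
    response = requests.get(url)
    if not response.ok:
        raise Exception('Failed to load page {}: status code {}'.format(url, response.status_code))
    return BeautifulSoup(response.text, 'html.parser')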

When we call the function above with a URL as input, it checks the status_code, and if the status_code is valid it fetches the HTML content from the web page's server and returns a Beautiful Soup object. Next we write some functions to collect the data from each page.
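Two such functions might look like this (sketches; the class names are placeholders, and the remaining attributes get similar helpers):

def get_ceo_name(doc):
    # Pull the CEO name from one company's page
    # ('ceo' is a placeholder class name, read the real one from Inspect)
    tag = doc.find('div', class_='ceo')
    return tag.text.strip() if tag else None

def get_market_cap(doc):
    # Pull the market capitalization from one company's page
    tag = doc.find('div', class_='market-cap')
    return tag.text.strip() if tag else None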

Similar to the ones above, these functions are then used inside another function that parses all the data into a Python dictionary.
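A sketch of that parser function, assuming we already have a list of links to each company's page:

def scrape_companies(company_urls):
    # Visit each company's page, run the helpers on it, and collect
    # everything into one Python dictionary of lists
    companies_dict = {'CEO': [], 'Market Capitalization': []}
    for url in company_urls:
        doc = get_page(url)
        companies_dict['CEO'].append(get_ceo_name(doc))
        companies_dict['Market Capitalization'].append(get_market_cap(doc))
        # the other attributes (rank, headquarters, employees) follow the same pattern
    return companies_dict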

This is the function that collects the data from each company's page and automatically writes it all into a Python dictionary. We then use that dictionary to build a pandas data-frame, and from the data-frame we write all the data into a CSV file.
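Wired together, that looks roughly like this (how the company page links are gathered from the listing blocks is an assumption here, not the notebook's exact code):

# Links to each company's page, assumed to sit in an 'a' tag inside each block
company_urls = ['https://www.value.today' + block.find('a')['href']
                for block in company_block_tags if block.find('a')]

companies_df = pd.DataFrame(scrape_companies(company_urls))
companies_df.to_csv('top-500-companies.csv', index=False)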

Here is a snippet showing how the pandas data-frame looks:

Well, we have finally scraped the data from 500 different pages and created a CSV file with it; here is a link to download the entire dataset.

Summary:

Here is a very short description from beginning to end:

  1. First of all, we installed all the required libraries in our Jupyter notebook.
  2. We downloaded the web page into our notebook using the Requests library.
  3. We inspected the web page for the HTML tags of all the attributes we wanted to scrape.
  4. We then collected the data from each HTML tag and wrote it all into a Python dictionary.
  5. To collect data from different pages, we wrote some helper functions and then a parser function that extracts the data from each page and parses it into a Python dictionary.
  6. Finally, we wrote all the data into a CSV file using the Pandas library.

The dataset we have created can now be used for exploratory data analysis of the top 500 companies by market capitalization. A dataset we built ourselves is also one we can fully trust, because we know exactly what went into it. We can still add more variation to the dataset to create more options, for example by adding company data country-wise or by stock exchange; there are many ways to extend it.

That, in short, is what web scraping in Python is all about.

References:

  1. To see the complete notebook with the code and all the user guidance, please visit my notebook.
  2. A big thanks to Aakash N S, who taught me how to scrape web pages; here is a link to his tutorial: https://youtu.be/RKsLLG-bzEY
  3. The complete documentation for the Requests library is here.
  4. The documentation for the Beautiful Soup 4 library is here.
  5. A big thanks to Anushree K, Anubrata Das, the entire code-hunters team, and the jovian.ai team as well, who supported me in learning all of this as a group.

