Can you guess a simple way you can get data from a web page? It’s through a technique called web scraping.
In case you are not familiar with web scraping, here is an explanation:
“Web scraping is a computer software technique of extracting information from websites”
“Web scraping focuses on the transformation of unstructured data on the web, typically in HTML format, into structured data that can be stored and analyzed in a central local database or spreadsheet.”
Some web pages make your life easier by offering an API, an interface you can use to download data directly. Websites like Rotten Tomatoes and Twitter provide APIs to access their data. But if a web page doesn’t provide an API, you can use Python to scrape data from it.
I will be using two Python modules for scraping data.
So, are you ready to scrape a webpage? All you have to do to get started is follow the steps given below:
Understanding HTML Basics
Scraping is all about HTML tags, so you need to understand HTML in order to scrape data.
This is an example of a minimal webpage defined in HTML tags. The root tag is <html>, and inside it you have the <head> tag. The head contains the title of the page and may also hold other meta information such as keywords. The <body> tag contains the actual content of the page. <h1>, <h2>, <h3>, <h4>, <h5> and <h6> are the different header levels.
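A minimal page along these lines might look like the following sketch (the title, keywords and text content here are made up for illustration):

```html
<html>
  <head>
    <title>My first web page</title>
    <meta name="keywords" content="scraping, python"/>
  </head>
  <body>
    <h1>Top-level header</h1>
    <h2>Second-level header</h2>
    <p>The actual content of the page goes here.</p>
  </body>
</html>
```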
These are some useful HTML tags you need to know.
I encourage you to inspect a web page and view its source code to understand more about HTML.
Scraping A Web Page Using Beautiful Soup
I will be scraping data from bigdataexaminer.com. I am importing urllib2, Beautiful Soup (bs4), pandas and NumPy.
import urllib2  # Python 2 standard library; in Python 3 use urllib.request instead
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
The line beautiful = urllib2.urlopen(url).read() goes to bigdataexaminer.com, downloads the whole HTML text, and stores it in a variable called beautiful.
Now I have to parse and clean the HTML code. BeautifulSoup is a really useful Python module for parsing HTML and XML files. Beautiful Soup gives a BeautifulSoup object, which represents the document as a nested data structure.
You can use the prettify() function to display the HTML code indented to show its different levels.
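Since the live page changes over time, here is a sketch that parses a small inline HTML string instead of fetching bigdataexaminer.com; the BeautifulSoup calls are the same ones you would use on the downloaded page:

```python
from bs4 import BeautifulSoup

# A small stand-in for the HTML you would get from urllib2.urlopen(url).read()
html = "<html><head><title>Examiner</title></head><body><h1>Big Data</h1></body></html>"

soup = BeautifulSoup(html, "html.parser")  # parse the document into a nested tree
print(soup.prettify())                     # one tag per line, indented by nesting level
```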
The simplest way to navigate the parse tree is to say the name of the tag you want. If you want the <h1> tag, just say soup.h1.prettify().
soup.tag.contents will return the contents of a tag as a list.
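As a small self-contained sketch (again using an inline HTML string rather than the live page), .contents gives a tag's direct children as a list:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Examiner</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .contents returns a tag's direct children as a list
print(soup.head.contents)   # [<title>Examiner</title>]
```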
In : soup.head.contents
The following expression returns the <title> tag present inside the <head> tag.
In : soup.head.title
Out : <title></title>
.string will return the string present inside the <title> tag of bigdataexaminer.com. As the title on bigdataexaminer.com is empty, the value returned is None.
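You can reproduce this behaviour with a sketch that mimics an empty <title> tag, which is the situation the article found on bigdataexaminer.com:

```python
from bs4 import BeautifulSoup

# A page with an empty <title>, mirroring bigdataexaminer.com at the time of writing
soup = BeautifulSoup("<html><head><title></title></head><body></body></html>", "html.parser")

print(soup.head.title)          # <title></title>
print(soup.head.title.string)   # None -- the tag has no string inside it
```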
.descendants lets you iterate over all of a tag’s children, recursively.
You can also look at the strings using the .strings generator.
.get_text() extracts all the text from bigdataexaminer.com as a single string.
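The three ways of walking the tree can be sketched together on a small inline snippet (the tag contents here are invented for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><head><title>Examiner</title></head><body><h1>Big <b>Data</b></h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# .descendants walks every tag and string in the tree, recursively
names = [child.name for child in soup.body.descendants if child.name is not None]
print(names)                 # ['h1', 'b']

# .strings yields every piece of text in document order
print(list(soup.strings))    # ['Examiner', 'Big ', 'Data']

# .get_text() joins those strings into one block of text
print(soup.get_text())
```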
You can use find_all() to find all the ‘a’ tags on the page.
To get only the first four ‘a’ tags, you can use the limit argument.
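Here is a sketch of find_all() with and without limit, using made-up links in place of the live page:

```python
from bs4 import BeautifulSoup

html = ('<body><a href="/one">1</a><a href="/two">2</a>'
        '<a href="/three">3</a><a href="/four">4</a><a href="/five">5</a></body>')
soup = BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a")            # every 'a' tag on the page
first_four = soup.find_all("a", limit=4)  # stop after the first four matches
print(len(all_links), len(first_four))    # 5 4
```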
To find a particular text on a web page, you can pass the text argument to find_all(). Here I am searching for the term ‘data’ on bigdataexaminer.com.
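A self-contained sketch of the same search, with invented page text standing in for bigdataexaminer.com:

```python
import re

from bs4 import BeautifulSoup

html = "<body><p>big data examiner</p><p>data science</p><p>python</p></body>"
soup = BeautifulSoup(html, "html.parser")

# Find every string on the page containing the word 'data'
matches = soup.find_all(text=re.compile("data"))
print(matches)   # ['big data examiner', 'data science']
```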
Get the attributes of the second ‘a’ tag on bigdataexaminer.
You can also use a list comprehension to get the attributes of the first four ‘a’ tags on bigdataexaminer.
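Both ideas can be sketched on a small inline page (the link URLs below are made up for illustration):

```python
from bs4 import BeautifulSoup

html = ('<body><a href="/home">Home</a><a href="/about">About</a>'
        '<a href="/blog">Blog</a><a href="/contact">Contact</a></body>')
soup = BeautifulSoup(html, "html.parser")

links = soup.find_all("a")
print(links[1].attrs)                   # {'href': '/about'} -- the second 'a' tag

# List comprehension: the href attribute of the first four 'a' tags
hrefs = [a["href"] for a in links[:4]]
print(hrefs)                            # ['/home', '/about', '/blog', '/contact']
```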
A data scientist should know how to scrape data from websites, and I hope you have found this article useful as an introduction to web scraping with Python. Apart from Beautiful Soup, there is another useful Python library called Pattern for web scraping. I also found a good tutorial on web scraping using Python.