Scraping the web for data
Data Harvesting
Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.
If you are looking to collect data from the Internet for a personal database, your first stop is often a Google search. However, a search for mortgage rates can (in theory) return dozens of pages full of relevant data and images, as well as a lot of irrelevant content. You could visit every web page your search pulls up and copy and paste the relevant data into your database. Or you could use a web scraper to automatically download and extract raw data from the web pages and reformat it into a table, graph, or spreadsheet on your computer.
Not just big data professionals, but also small business owners, teachers, students, shoppers, and just plain curious people can use web scraping for all manner of tasks, from researching a PhD thesis, to creating a database of local doctors, to comparing prices for online shopping. Unless you need to do really complicated stuff with super-optimized performance, web scraping is relatively easy. In this article, I'll show you how web scraping works with some practical examples that use the open source tool Beautiful Soup.
Caveats
Web scraping does have its limits. First, you have to start with a well-crafted search engine query; web scraping can't replace the initial search. To protect their business and observe legal constraints, search engines deploy anti-scraping features, and overcoming them is not worth the time of the occasional web scraper. Instead, web scraping shines (and is irreplaceable) after you have completed your web search.
Web page complexity, which has increased over the past 10 to 15 years, also affects web scraping. Due to this complexity, determining the relevant parts of a web page's source code, and how to process them, has become more time-consuming despite the great progress made by web scraping tools.
Dynamic content poses another problem for web scraping. Today, most pages continuously refresh, changing layout from one moment to the next, and are customized for each visitor. This makes the scraping code more complicated, without always guaranteeing that the scraper will extract exactly what you would see in your browser. This is particularly problematic for online shopping. Not only does the price change frequently, but it also depends on many independent factors, from shipping costs to your buyer profile, shopping history, or preferred payment method – all of which are outside web scraping's radar. Consequently, the best deal found by an ordinary scraping script may be quite different from what you would be offered when clicking on the corresponding link. (Unless you spent so much time and effort tweaking the web scraper that you wouldn't have time left to enjoy your purchases!)
Additionally, many web pages only display properly after some JavaScript code has been downloaded and run or after some interaction with the user. Others, like Tumblr blogs, run an "infinite scroll" function. In all of these cases, a scraper may start parsing the code before it is ready for viewing, thus failing to deliver what you would see in your browser.
Changing HTML tags is yet another issue. Scraping works by recognizing certain HTML tags, with certain attribute values (labels), in the web page you want to scrape. Those labels may change after a software upgrade or the adoption of a different graphic theme, causing your scraper script to fail until you update its code accordingly.
Scraping can consume a lot of bandwidth, which may create real problems for the website you are scraping. To remedy this, make your scrapers as slow as possible and scrape only when there is no alternative (i.e., when webmasters don't provide direct access to the data you want), and everybody will be happy.
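In Python, a short pause between downloads is usually all it takes. The following minimal sketch assumes the requests library (installed below) and a hypothetical list of URLs:

import time
import requests

urls = [
    "https://www.example.com/page1",
    "https://www.example.com/page2",
]

for url in urls:
    response = requests.get(url)
    # ... extract the data you need from response.text here ...
    time.sleep(5)  # pause a few seconds to limit the load on the server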
Finally, a website's business needs, as well as copyright and other legal issues, stand in the way of web scraping. Many webmasters try their best to block automated scraping of their content. This article does not attempt to address the copyright issues related to screen scraping, which can vary with the site requirements and jurisdiction.
Web Scraping Steps
In spite of these caveats, web scraping remains an immensely useful activity (and if you ask me, a really fun one) in many real world cases. In practice, every web scraping project goes through the same five steps: discovery, page analysis, automatic download, data extraction, and data archival.
In the discovery phase, you search for the pages you want to scrape via a search engine or by simply looking at a website's home page.
During page analysis, you study the HTML code of your selected pages to determine the location of the desired elements and how a scraping script might recognize them. Most of the time, HTML tags' id and class attributes are the easiest to detect and use for web scraping, but that is not always the case. Only visual inspection can confirm this and give you the names of those attributes. You can get a web page's HTML code by saving the complete web page on your computer or by right-clicking in your browser and selecting View Page Source (or View Selection Source for a selected paragraph).
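As a sketch of what this analysis buys you, the following example uses Beautiful Soup (introduced below) to pick out elements by their id and class attributes; the HTML snippet and the attribute values in it are hypothetical:

from bs4 import BeautifulSoup

html = """
<div id="rates">
  <span class="rate">6.87%</span>
  <span class="rate">7.12%</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="rates")                # match by id
for span in container.find_all("span", class_="rate"):  # match by class
    print(span.get_text())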
The final three steps (automatic download, data extraction, and data archival) involve writing a script that will actually download the page(s), find the right data inside them, and write that data to an external file for later processing. The most common format, which I use in two of my examples, is CSV (comma-separated values), a plain text format with one record per line and fields separated by commas or another predefined character (I prefer pipes). JSON (JavaScript Object Notation) is another popular choice that is more efficient for certain applications.
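For the archival step, Python's standard csv module handles the formatting for you. This minimal sketch writes pipe-delimited records; the field names and values are hypothetical:

import csv

records = [
    {"lender": "Example Bank", "rate": "6.87%"},
    {"lender": "Sample Credit Union", "rate": "7.12%"},
]

with open("rates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["lender", "rate"], delimiter="|")
    writer.writeheader()
    writer.writerows(records)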
Remember to keep both your code and your output data as simple as possible. For most web scraping activities, you will only grab the data once, but may then spend days or months analyzing it. Consequently, it doesn't make sense to optimize the scraping code. For instance, a program that will run once while you sleep (even if it takes a whole night to finish) isn't worth spending two hours of your time to optimize. In terms of your output data, it's difficult to know in advance all the possible ways you may want to process it. Therefore, just make sure that the extracted data is correct and laid out in a structured way when you save it. You can always reformat the data later, if necessary.
Beautiful Soup
Currently, the most popular open source web scraping tool is Beautiful Soup [1], a Python library for extracting data out of HTML and XML files. It has a large user community and very good documentation. If Beautiful Soup 4 is not available as a binary package for your Linux distribution, you can install it with the Python package manager, pip. On Ubuntu and other Apt-based distributions, you can install pip and Beautiful Soup with these two commands:
sudo apt-get install python3-pip
pip install requests BeautifulSoup4
With Beautiful Soup, you can create scraping scripts for simple text and image scraping or for more complex projects.
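To give you a taste before the full examples, here is a minimal end-to-end sketch that downloads a page and prints selected elements; the URL and the tag and class names are hypothetical placeholders:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.example.com/listings")
soup = BeautifulSoup(response.text, "html.parser")

# Print the text of every <h2 class="title"> element on the page.
for heading in soup.find_all("h2", class_="title"):
    print(heading.get_text(strip=True))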