Scraping highly dynamic websites

Programming Snapshot – chromedp

Article from Issue 234/2020

Author(s): Mike Schilli

Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.

Gone are the days when hobbyists could simply download websites quickly with a curl command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
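To make the problem concrete, here is a minimal sketch in Go that uses only the standard library. It assumes a hypothetical page whose price is written into the DOM by a script; the page content, the `price` element ID, and the helper names `fetchStatic` and `priceDivIsEmpty` are made up for illustration. A plain HTTP fetch, which is what curl does, sees only the empty placeholder.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// page is a hypothetical dynamic page: the price is not part of the
// markup but is written into the DOM by a script after loading.
const page = `<html><body>
<div id="price"></div>
<script>document.getElementById("price").textContent = "42 USD";</script>
</body></html>`

// fetchStatic downloads a URL the way curl would: a single GET request,
// with no JavaScript engine to run the page's scripts.
func fetchStatic(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// priceDivIsEmpty reports whether the markup still contains the
// unfilled placeholder element.
func priceDivIsEmpty(html string) bool {
	return strings.Contains(html, `<div id="price"></div>`)
}

func main() {
	// Serve the page from a local test server so the sketch needs no network.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, page)
	}))
	defer srv.Close()

	html, err := fetchStatic(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("placeholder still empty:", priceDivIsEmpty(html))
}
```

The fetched markup contains the script's source text but not its effect. Only a browser that executes the script ends up with a filled-in element.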

For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.

For years, developers have been using the Java Selenium suite to run fully automated unit tests for web user interfaces (UIs). The tool speaks the Selenium protocol (now standardized as WebDriver), which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] provides a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have terms of use that prohibit screen scraping. Check the site's terms and consult the applicable laws for your jurisdiction.
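As a rough illustration of what travels over the wire, DevTools protocol commands are JSON messages with an `id`, a `method`, and optional `params`, sent to the browser over a WebSocket. The sketch below only builds two such messages with Go's standard library; it does not connect to a browser, and the `command` type and `encode` helper are hypothetical names chosen for this example, not part of chromedp. A library such as chromedp takes care of this layer so that scraper code does not have to.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// command mirrors the JSON envelope of a DevTools protocol request:
// a caller-chosen id, a "Domain.method" name, and optional parameters.
type command struct {
	ID     int                    `json:"id"`
	Method string                 `json:"method"`
	Params map[string]interface{} `json:"params,omitempty"`
}

// encode serializes one command to the JSON text that would be sent
// to the browser.
func encode(id int, method string, params map[string]interface{}) (string, error) {
	b, err := json.Marshal(command{ID: id, Method: method, Params: params})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	// Ask the browser to load a page.
	nav, err := encode(1, "Page.navigate", map[string]interface{}{
		"url": "https://example.com",
	})
	if err != nil {
		panic(err)
	}
	// Ask the browser to evaluate a JavaScript expression in that page.
	eval, err := encode(2, "Runtime.evaluate", map[string]interface{}{
		"expression": "document.title",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(nav)
	fmt.Println(eval)
}
```

The browser answers each command with a message carrying the same `id`, which lets a client match responses to requests.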

[...]


