Scraping highly dynamic websites
Programming Snapshot – chromedp
Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.
Gone are the days when hobbyists could simply download websites quickly with a curl command in order to machine-process their content. The problem is that state-of-the-art websites are teeming with reactive design and dynamic content that only appears when a bona fide, JavaScript-enabled web browser points to it.
For example, if you wanted to write a screen scraper for Gmail, you wouldn't even get through the login process with your script. In fact, even a scraping framework like Colly [1] would fail here, because it does not support JavaScript and does not know the browser's DOM (Document Object Model), upon which the web flow relies. One elegant workaround is for the scraper program to navigate a real browser to the desired web page and to inquire later about the content currently displayed.
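The gap between what the server sends and what a browser displays can be sketched in a few lines of standard-library Go. The page content, the inbox element, and the helper names fetchRaw() and inboxIsEmpty() below are made up for illustration; a local test server stands in for a real website. The point is only that a plain HTTP fetch never executes the script.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// dynamicPage is a hypothetical page whose visible text appears only
// after a browser has executed the embedded script.
const dynamicPage = `<html><body>
<div id="inbox"></div>
<script>
document.getElementById("inbox").textContent = "3 new messages";
</script>
</body></html>`

// fetchRaw downloads a URL the way curl would: it returns the bytes
// the server sent and never runs any JavaScript.
func fetchRaw(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// inboxIsEmpty reports whether the inbox element in the raw HTML is
// still empty, which means the script has not filled it in.
func inboxIsEmpty(html string) bool {
	return strings.Contains(html, `<div id="inbox"></div>`)
}

func main() {
	// A throwaway local server plays the role of the dynamic website.
	srv := httptest.NewServer(http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) {
			fmt.Fprint(w, dynamicPage)
		}))
	defer srv.Close()

	html, err := fetchRaw(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println("inbox element empty in raw HTML:", inboxIsEmpty(html))
}
```

A real browser pointed at the same page would show the message text, which is exactly the content a remote-controlled browser can hand back to the scraper.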
For years, developers have been using the Selenium suite for fully automated unit tests for web user interfaces (UIs). The tool speaks the WebDriver protocol, which is supported by all standard browsers, to get things moving. Google's Chrome browser additionally implements the DevTools protocol [2], which does similar things, and the chromedp project on GitHub [3] defines a Go library based on it. Go enthusiasts can now write their unit tests and scraper programs natively in their favorite language. I'll take a look at some screen-scraping techniques in this article, but keep in mind that many websites have licenses that prohibit screen scraping. Check the site's terms of use and consult the applicable laws for your jurisdiction.
Directing Chrome
The Go program in Listing 1 [4] launches the Chrome browser, points it at the Linux Magazine web page, and then takes a screenshot of the retrieved content. The whole thing runs at the command line if you type go build screenshot.go followed by ./screenshot. The user will not see a browser pop up, because chromedp normally runs in headless (i.e., invisible) mode, unless otherwise configured. The following command gets the required library code from GitHub and also compiles and installs it:
$ go get -u github.com/chromedp/chromedp
Listing 1
screenshot.go
01 package main
02
03 import (
04   "context"
05   emu "github.com/chromedp/cdproto/emulation"
06   "github.com/chromedp/cdproto/page"
07   cdp "github.com/chromedp/chromedp"
08   "io/ioutil"
09 )
10
11 func main() {
12   ctx, cancel :=
13     cdp.NewContext(context.Background())
14   defer cancel()
15
16   var buf []byte
17   tasks := cdp.Tasks{
18     cdp.Navigate(
19       "http://linux-magazine.com"),
20     cdp.ActionFunc(
21       func(ctx context.Context) error {
22         _, _, contentSize, err :=
23           page.GetLayoutMetrics().Do(ctx)
24         if err != nil {
25           panic(err)
26         }
27
28         w, h := contentSize.Width,
29           contentSize.Height
30
31         viewPortFix(ctx, int64(w), int64(h))
32
33         buf, err = page.CaptureScreenshot().
34           WithQuality(90).
35           WithClip(&page.Viewport{
36             X:      contentSize.X,
37             Y:      contentSize.Y,
38             Width:  w,
39             Height: h,
40             Scale:  1,
41           }).Do(ctx)
42         if err != nil {
43           panic(err)
44         }
45         return nil
46       })}
47
48   err := cdp.Run(ctx, tasks)
49   if err != nil {
50     panic(err)
51   }
52
53   err = ioutil.WriteFile("screenshot.png",
54     buf, 0644)
55   if err != nil {
56     panic(err)
57   }
58 }
59
60 func viewPortFix(
61   ctx context.Context, w, h int64) {
62   err := emu.SetDeviceMetricsOverride(
63     w, h, 1, false).
64     WithScreenOrientation(
65       &emu.ScreenOrientation{
66         Type:
67           emu.OrientationTypePortraitPrimary,
68         Angle: 0,
69       }).
70     Do(ctx)
71
72   if err != nil {
73     panic(err)
74   }
75 }
It takes the compiled program in Listing 1 a few seconds to retrieve the page, depending on your Internet connection and the current server speed; then it saves an image file in PNG format named screenshot.png to the hard disk as a result. Since the Linux Magazine homepage fills several browser pages in terms of length, giving users a reason to scroll down and explore, the screenshot in Figure 1 is almost 3000 pixels tall.
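If you want to confirm the dimensions of such a screenshot without opening an image viewer, the PNG header can be read with Go's standard library. The following is a minimal sketch: pngSize() and fakeScreenshot() are helper names invented here, and the blank 1200x3000 image is a stand-in for the real file, with the width chosen arbitrarily.

```go
package main

import (
	"bytes"
	"fmt"
	"image"
	"image/png"
)

// pngSize reads only the PNG header and returns the image's width
// and height in pixels.
func pngSize(data []byte) (int, int, error) {
	cfg, err := png.DecodeConfig(bytes.NewReader(data))
	if err != nil {
		return 0, 0, err
	}
	return cfg.Width, cfg.Height, nil
}

// fakeScreenshot encodes a blank PNG of the given size. It stands in
// for the bytes a real browser screenshot would contain.
func fakeScreenshot(w, h int) []byte {
	var buf bytes.Buffer
	img := image.NewRGBA(image.Rect(0, 0, w, h))
	if err := png.Encode(&buf, img); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

func main() {
	w, h, err := pngSize(fakeScreenshot(1200, 3000))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d x %d pixels\n", w, h)
}
```

To check the actual output of Listing 1, the bytes returned by os.ReadFile("screenshot.png") could be passed to pngSize() instead of the generated image.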
Listing 1 creates a new chromedp context in line 13 and gives the constructor a standard Go background context, which is an auxiliary construct for controlling goroutines and subroutines. A context constructor in Go returns a cancel() function. This function can be called by the main program later to signal to another (maybe deeply) nested part of the program that it is time to clean up, because doors are being closed.
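The cancel() mechanism is plain Go and can be tried without a browser. In the following sketch, worker() and runAndCancel() are names made up for illustration; the worker plays the part of the deeply nested code that waits for the signal to clean up.

```go
package main

import (
	"context"
	"fmt"
)

// worker stands in for a deeply nested part of a program. It blocks
// until the context is canceled and then reports why it stopped.
func worker(ctx context.Context, done chan<- string) {
	<-ctx.Done()
	done <- "worker cleaned up: " + ctx.Err().Error()
}

// runAndCancel starts the worker in a goroutine, calls cancel(), and
// returns the worker's final message.
func runAndCancel() string {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan string)
	go worker(ctx, done)
	cancel() // signal that doors are being closed
	return <-done
}

func main() {
	fmt.Println(runAndCancel())
	// prints "worker cleaned up: context canceled"
}
```

Listing 1 uses the same idea with defer cancel(), so the signal goes out automatically when main() returns.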
The Tasks structure starting on line 17 defines a set of actions that you want the connected Chrome browser to perform, using the DevTools protocol. The Navigate task starting on line 18 directs the browser to the Linux Magazine website. The second task starting in line 20 is created by the ActionFunc() function, a tool to structure new customized tasks in chromedp. In this case, the task creates a screenshot of the web page displayed in the remote browser using the function CaptureScreenshot() in line 33.
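How a list of tasks and a function wrapper fit together can be modeled in standard-library Go. The types below borrow the names Action, ActionFunc, and Tasks from chromedp, but they are a simplified stand-in written for this illustration, not the library's code, and the step names in runDemo() are invented. The sketch runs the actions in order and stops at the first error.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// Action is anything that can be executed with a context.
type Action interface {
	Do(ctx context.Context) error
}

// ActionFunc adapts an ordinary function to the Action interface.
type ActionFunc func(ctx context.Context) error

// Do calls the wrapped function.
func (f ActionFunc) Do(ctx context.Context) error { return f(ctx) }

// Tasks is an ordered list of actions.
type Tasks []Action

// Do runs the actions one after another and stops at the first error.
func (t Tasks) Do(ctx context.Context) error {
	for _, a := range t {
		if err := a.Do(ctx); err != nil {
			return err
		}
	}
	return nil
}

// runDemo executes three made-up steps, the second of which fails,
// and returns the names of the steps that actually ran.
func runDemo() ([]string, error) {
	var ran []string
	step := func(name string, err error) Action {
		return ActionFunc(func(ctx context.Context) error {
			ran = append(ran, name)
			return err
		})
	}
	tasks := Tasks{
		step("navigate", nil),
		step("screenshot", errors.New("capture failed")),
		step("never reached", nil),
	}
	err := tasks.Do(context.Background())
	return ran, err
}

func main() {
	ran, err := runDemo()
	fmt.Println(ran, err)
	// prints "[navigate screenshot] capture failed"
}
```

Because every step only needs a Do() method, a hand-written closure like the one in line 20 of Listing 1 can sit in the same list as a ready-made action such as Navigate.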
Wide Open Spaces
Now the question is how far to open the virtual browser, because this setting determines what you see in the screenshot. Is only a fraction of the web page visible or all of it, including the parts that can only be reached by scrolling? If it's the latter, the screenshot needs to capture everything that the user would see if they had an infinitely tall screen with the browser fully extended.
To capture it all, the GetLayoutMetrics() function calculates the layout dimensions of the displayed page, and the viewPortFix() function (called in line 31 and defined in line 60) uses SetDeviceMetricsOverride() to adjust the dimensions of the invisible browser. The buf image buffer returned by the CaptureScreenshot() function in line 33 is written to disk in PNG format by WriteFile() in line 53. The sequence of actions, starting with navigating to the page, followed by taking the screenshot, is processed by the Run() function in line 48.
The technique of creating screenshots of automatically fetched web pages opens up a number of unheard-of possibilities when testing newly developed web UIs. For example, image recognition can later determine whether the site's various graphic elements are in the right place with different browser sizes, without human test personnel having to click their way through the flow with every release. It could also be used to implement a neat system for archiving websites; in the next century, historians would surely be amused by the advertisements placed on the Linux Magazine homepage in 2020.
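Full image recognition is beyond the scope of a short example, but a much simpler technique, a pixel-by-pixel comparison of two screenshots, already catches many layout regressions. The sketch below is a swapped-in illustration of that simpler idea, not the approach the article has in mind; diffRatio() and demoImages() are hypothetical helpers, and the tiny generated images stand in for real screenshots decoded with image/png.

```go
package main

import (
	"fmt"
	"image"
	"image/color"
)

// diffRatio returns the fraction of pixels that differ between two
// images. It returns an error if the images are not the same size.
func diffRatio(a, b image.Image) (float64, error) {
	if a.Bounds() != b.Bounds() {
		return 0, fmt.Errorf("size mismatch: %v vs %v",
			a.Bounds(), b.Bounds())
	}
	r := a.Bounds()
	total := r.Dx() * r.Dy()
	if total == 0 {
		return 0, nil
	}
	changed := 0
	for y := r.Min.Y; y < r.Max.Y; y++ {
		for x := r.Min.X; x < r.Max.X; x++ {
			r1, g1, b1, a1 := a.At(x, y).RGBA()
			r2, g2, b2, a2 := b.At(x, y).RGBA()
			if r1 != r2 || g1 != g2 || b1 != b2 || a1 != a2 {
				changed++
			}
		}
	}
	return float64(changed) / float64(total), nil
}

// demoImages builds two 10x10 images; the second has a red 2x5 block
// in the corner, so 10 of its 100 pixels differ from the first.
func demoImages() (image.Image, image.Image) {
	base := image.NewRGBA(image.Rect(0, 0, 10, 10))
	moved := image.NewRGBA(image.Rect(0, 0, 10, 10))
	for y := 0; y < 5; y++ {
		for x := 0; x < 2; x++ {
			moved.Set(x, y, color.RGBA{R: 255, A: 255})
		}
	}
	return base, moved
}

func main() {
	a, b := demoImages()
	ratio, err := diffRatio(a, b)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%.2f of pixels changed\n", ratio)
}
```

A test run could flag a release whenever the ratio between the new screenshot and a stored reference image exceeds a threshold chosen by the team.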
Complicating Easy Things
For test purposes, it is sometimes useful to start the remote browser visibly in the foreground instead of hidden in the background. Developers of scraping applications can then watch whether the browser steps through the flow or gets stuck at some point. Paradoxically, however, setting up foreground mode has become quite complicated since chromedp introduced background mode as the default some time ago: Using NewContext() to create a new browser context configures the browser to run in background mode deep down in the library's engine compartment, which is inaccessible from outside.
This is why Listing 2 creates a new browser controller in the form of NewExecAllocator() and passes it only the NoFirstRun option. An allocator created this way uses just the options it is given, so the headless flag that chromedp otherwise sets by default is left out, and the browser starts in the foreground. Back comes a context, but, alas, not a context compatible with the context object that chromedp uses and gives to Run() in line 24 of the executing function. Therefore, line 12 creates a compatible context via NewContext() and passes it the previously created allocator context as a parent context. The new chromedp context also has a cancel() function, and the defer statements in lines 13 and 14 are both triggered at the end of the program to neatly collapse the remote-controlled browser.
Listing 2
foreground.go
01 package main
02
03 import (
04   "context"
05   cdp "github.com/chromedp/chromedp"
06   "time"
07 )
08
09 func main() {
10   pctx, pcancel := cdp.NewExecAllocator(
11     context.Background(), cdp.NoFirstRun)
12   ctx, cancel := cdp.NewContext(pctx)
13   defer cancel()
14   defer pcancel()
15
16   tasks := cdp.Tasks{
17     cdp.Navigate(
18       "https://linux-magazin.de"),
19     cdp.Navigate(
20       "http://linux-magazine.com"),
21     cdp.Sleep(5 * time.Second),
22   }
23
24   err := cdp.Run(ctx, tasks)
25   if err != nil {
26     panic(err)
27   }
28 }
Listing 2 only accesses the homepages of the German and English versions of Linux Magazine for this test; it then Sleep()s for five seconds and terminates.
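The parent-child relationship between the two contexts in Listing 2, and the order in which its two defer statements fire, follow general Go rules that can be checked with the standard library alone. The sketch below uses plain context.WithCancel() contexts in place of the chromedp ones, and the function names childFollowsParent() and deferOrder() are invented for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// childFollowsParent derives a child context from a parent, cancels
// only the parent, and reports whether the child was canceled too.
func childFollowsParent() bool {
	parent, pcancel := context.WithCancel(context.Background())
	child, cancel := context.WithCancel(parent)
	defer cancel()

	pcancel()
	<-child.Done() // closed once the parent's cancellation arrives
	return child.Err() == context.Canceled
}

// deferOrder registers two deferred calls in the same order as
// lines 13 and 14 of Listing 2 and records the order they run in.
func deferOrder() (order []string) {
	defer func() { order = append(order, "cancel") }()
	defer func() { order = append(order, "pcancel") }()
	return nil
}

func main() {
	fmt.Println(childFollowsParent(), deferOrder())
	// prints "true [pcancel cancel]"
}
```

Deferred calls run in last-in, first-out order, so with the arrangement in Listing 2, pcancel() fires before cancel(); canceling the parent is enough to cancel the child as well.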