Screen scraping with Colly in Go

Programming Snapshot – Colly

Lead Image © Hannu Viitanen, 123RF.com

Lead Image © Hannu Viitanen, 123RF.com

Article from Issue 223/2019
Author(s):

The Colly scraper helps developers who work with the Go programming language to collect data off the web. Mike Schilli illustrates the capabilities of this powerful tool with a few practical examples.

As long as there are websites to view for the masses of browser customers on the web, there will also be individuals on the consumer side who want the data in a different format and write scraper scripts to automatically extract the data to fit their needs.

Many sites do not like the idea of users scraping their data. Check the website's terms of service for more information, and be aware of the copyright laws for your jurisdiction. In general, as long as the scrapers do not republish or commercially exploit the data, or bombard the website too overtly with their requests, nobody is likely to get too upset about it.

Different languages offer different tools for this. Perl aficionados will probably appreciate the qualities of WWW::Mechanize as a scraping tool, while Python fans might prefer the selenium package [1]. In Go, there are several projects dedicated to scraping that attempt to woo developers.

[...]

Use Express-Checkout link below to read the full article (PDF).

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Plan Your Hike

    The hiking and cycling app komoot saves your traveled excursion routes. Mike Schilli shows you how to retrieve the data with Go.

  • At a Glance

    Using extensions in Go and Ruby, Mike Schilli adapts the WTF terminal dashboard tool to meet his personal needs.

  • Under the Hood

    Screen scrapers often fail when confronted with complex web pages. To keep his scraper on task, Mike Schilli remotely controls the Chrome browser using the DevTools protocol to extract data, even from highly dynamic web pages.

  • Web Scraping

    Web scraping lets you automatically download and extract data from websites to build your own database. With a simple scraping script, you can harvest information from the web.

  • Simile

    The Simile project jump starts the semantic web with a collection of tools for extending semantic information to existing websites.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More

News