Transform web pages into EPUB files
Read at Will
Instead of relying on a third-party read-it-later service, you can use this DIY tool to save articles from the Internet in a format that meets your specific needs.
Few of us have time to read long-form web articles during the day, which is why services that let you save interesting reads for later can come in handy. Popular services such as Pocket and Instapaper even offer apps you can use to read the saved content offline on your preferred device. Better still, the saved articles are reformatted for better readability and scrubbed of all ads, scripts, trackers, and other junk.
Hosted services are like restaurants, though. No matter how great the food and the service, you eventually start longing for home-cooked meals, not only because cooking at home is cheaper and more convenient, but because you can make any dish you wish just the way you like it and have fun in the process. In a similar vein, why settle for a ready-made, read-it-later service, when you can cook up your very own solution with a bit of creative thinking, the right mix of open source tools, and a dash of shell scripting magic? That's exactly what is on today's menu: a DIY read-it-later tool.
Instead of saving and serving slimmed down versions of web pages, this DIY read-it-later application is going to process pages and transform them into ePub files. This way, you can read the saved content on practically any device, and you can choose whatever ebook reading app you like. Because the DIY read-it-later tool is a simple shell script that relies on Linux tools, you don't need a server to host it. If necessary, you can run the tool on a remote Linux machine and serve ePub files via a dedicated Open Publication Distribution System (OPDS) server or simply publish the files on the web. In short, the DIY read-it-later tool gives you plenty of room for experimenting and setting up the solution that works best for your specific needs. Moreover, the fact that an ePub file is essentially a ZIP archive containing an XHTML file along with stylesheets, fonts, and so on makes the saved content future-proof and editable.
Preparatory Work
You don't have to code the DIY read-it-later tool from scratch, because I've already done the hard work for you and published the fruits of my labor, readiculous.sh, on GitHub [1]. All you need to do is download the source code as a ZIP archive and unpack it, or clone the project's Git repository using the command:
git clone https://github.com/dmpop/readiculous.git
Before getting down to the nitty-gritty, you need to do some preparatory work. The first order of business is to install the required software. The main readiculous.sh shell script relies on Pandoc, ImageMagick, jq, wget, and Go-Readability [2]. With the exception of Go-Readability, all of these dependencies are available in the official software repositories of most mainstream Linux distributions, so you can install them using the default package manager. To do this on Debian or an Ubuntu-based distribution, run the command:
sudo apt install pandoc imagemagick jq wget
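If you want to confirm that everything landed on your PATH before running the script, a small check along the following lines can help. This is only a sketch: missing_tools is a hypothetical helper of my own, not part of readiculous.sh, and it assumes that ImageMagick provides the convert command, as the script expects.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of readiculous.sh): print every tool from
# the argument list that cannot be found on PATH, one name per line.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# The tools named in the article. go-readability is left out because it
# lives in the project directory rather than on PATH.
missing=$(missing_tools pandoc convert jq wget)
if [ -n "$missing" ]; then
  # Unquoted on purpose so the names are printed on one line
  echo "Missing tools:" $missing
else
  echo "All required tools found"
fi
```

The helper only reports; it never exits, so you can safely paste it into an interactive shell.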
The source code on GitHub [1] includes a binary version of the Go-Readability tool compiled for the x86_64 architecture. If you plan to use the script on any other platform, or you want to have the very latest version of the tool, you will have to compile it yourself. Fortunately, it's a rather straightforward thing to do. Install the Go language package (use the sudo apt install golang command on Debian and Ubuntu), and then run the following command to compile the command-line version of Go-Readability:
go get -u -v github.com/go-shiori/go-readability/cmd/...
Once the compiling process is finished, you'll find the resulting binary in the ~/go/bin directory. Move the binary file into the readiculous directory, and you're done.
How It Works
The readiculous.sh script (Listing 1) starts working by fetching the desired page, scrubbing it clean, and reformatting it for better readability. To do all that, the script uses the nifty Go-Readability tool. Go-Readability also extracts the page title and passes it to ImageMagick, which creates a cover image with the obtained title. Finally, the Pandoc tool transforms the saved page into an ePub file complete with the generated cover.
Listing 1
readiculous.sh
01 #!/usr/bin/env bash
02 if [ ! -x "$(command -v convert)" ] || [ ! -x "$(command -v pandoc)" ] || [ ! -x "$(command -v jq)" ]; then
03   echo "Make sure that the required tools are installed"
04   exit 1
05 fi
06
07 # Usage prompt
08 usage() {
09   cat <<EOF
10 $0 [OPTIONS]
11 ------
12 $0 transforms web pages into readable EPUB files.
13
14 USAGE:
15 ------
16 $0 -u <URL> -d <dir> -m auto
17
18 OPTIONS:
19 --------
20 -u Source URL
21 -d Destination directory (optional)
22 -m Enable auto mode (optional)
23
24 EXAMPLES:
25 ---------
26 $0 -u https://psyche.co/guides/how-to-approach-the-lifelong-project-of-language-learning -d "Language"
27 $0 -m auto
28
29 EOF
30   exit 1
31 }
32
33 # Read the specified parameters
34 while getopts "u:d:m:" opt; do
35   case ${opt} in
36   u)
37     url=$OPTARG
38     ;;
39   d)
40     dir=$OPTARG
41     ;;
42   m)
43     mode=$OPTARG
44     ;;
45   \?)
46     usage
47     ;;
48   esac
49 done
50 shift $((OPTIND - 1))
51
52 if [ ! -z "$dir" ]; then
53   dir=Library/"$dir"
54 else
55   dir=Library
56 fi
57 mkdir -p "$dir"
58
59 readicule() {
60   # Extract title and image from the specified URL
61   title=$(./go-readability -m "$url" | jq '.title' | tr -d \")
62   # Generate a readable HTML file
63   ./go-readability "$url" >>"$dir/$title".html
64   # Generate a cover
65   wget -q https://picsum.photos/800/1024 -O cover.jpg
66   convert -background '#0008' -font Arvo -pointsize 35 -fill white -gravity center -size 800x150 caption:"$title" cover.jpg +swap -gravity south -composite cover.jpg
67   if [ -z "$title" ]; then
68     title="This is Readiculous!"
69   fi
70   # Convert HTML to EPUB
71   pandoc -f html -t epub --metadata title="$title" --metadata creator="Readiculous" --metadata publisher="$url" --css=stylesheet.css --epub-cover-image=cover.jpg -o "$dir/$title".epub "$dir/$title".html
72   rm cover.jpg "$dir/$title".html
73   echo
74   echo ">>> '$title' has been saved in '$dir'"
75   echo
76 }
77
78 # If "-m auto" is specified
79 if [ "$mode" = "auto" ]; then
80   file="links.txt"
81   if [ ! -f "$file" ]; then
82     echo "$file not found."
83     exit 1
84   fi
85   # Read the contents of the links.txt file line-by-line
86   while IFS="" read -r url || [ -n "$url" ]; do
87     readicule
88   done <"$file"
89   rm links.txt
90   exit 0
91 fi
92
93 if [ -z "$url" ]; then
94   usage
95 fi
96
97 readicule
The script accepts three parameters: -u, -d, and -m. The mandatory -u parameter specifies the URL of the target page, while the optional -d parameter determines in which subdirectory the resulting ePub file should be saved. If the -d parameter is omitted, the script saves ePub files in the default Library directory. By specifying the subfolder, you can automatically sort the created ePub files by topic (for example, Language, Travel, Long Reads, and so on), or any other criteria. The -m parameter allows you to convert several saved URLs at once, but I'll take a closer look at it later. The script uses a combination of the getopts tool, the do...done loop, and the case in control structure to read the values passed by the specified parameters and assign these values to variables (lines 34-50 in Listing 1). If the default Library directory doesn't exist, the script creates it (lines 52-57).
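To see the option handling in isolation, here is a minimal sketch that wraps the same getopts loop and directory logic in a function. The parse_args name and the example.com URLs are my own placeholders, and the function only prints what it resolved instead of creating directories.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: parse -u and -d the way the article describes and
# print the URL together with the resolved destination directory.
parse_args() {
  local OPTIND=1 opt url="" dir=""
  while getopts "u:d:m:" opt; do
    case ${opt} in
      u) url=$OPTARG ;;
      d) dir=$OPTARG ;;
      m) ;; # the mode value is ignored in this sketch
      \?) return 1 ;;
    esac
  done
  # With -d, files go to Library/<dir>; without it, to Library
  if [ -n "$dir" ]; then
    dir="Library/$dir"
  else
    dir="Library"
  fi
  printf 'url=%s dir=%s\n' "$url" "$dir"
}

parse_args -u https://example.com/article -d "Long Reads"
parse_args -u https://example.com/article
```

Resetting OPTIND inside the function is what allows it to be called more than once in the same shell session.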
Listing 1's readicule() function does the actual work. First, Go-Readability obtains the metadata of the specified page. The metadata is returned in the JSON format, and the jq tool extracts the title, while the tr tool strips double quotes (line 61). The same Go-Readability tool fetches the page using the specified URL and saves the processed version as an HTML file (line 63).
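Because the extracted title doubles as the file name, a title that contains a slash would be treated as a path. One possible safeguard is sketched below. The safe_filename helper is hypothetical and not part of readiculous.sh; it also applies the script's fallback title early, so the HTML and ePub files would share the same name.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of readiculous.sh): turn a page title into
# a string that is safe to use as a single file name.
safe_filename() {
  local title=$1
  # A slash would be read as a directory separator, so swap it for a dash
  title=${title//\//-}
  # Fall back to the script's default title when nothing was extracted
  if [ -z "$title" ]; then
    title="This is Readiculous!"
  fi
  printf '%s\n' "$title"
}

safe_filename "Linux/Unix tips"
safe_filename ""
```

As a side note, jq's -r option prints strings without the surrounding quotes, so jq -r '.title' could stand in for the jq and tr pair if you prefer.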
The next step is to create a cover for use with the ePub file. Strictly speaking, covers are not necessary, but they do make it easier to find the file you need in the library, and they make the ePub file look less bland. To generate a cover, the script uses the wget tool for fetching a random 800x1024 image from the Lorem Picsum service and saves the file as cover.jpg (line 65). Then, the convert tool superimposes the obtained title onto the cover image (line 66).
There are, of course, plenty of other ways to create covers if you don't want the script to rely on a third-party service. For example, you can create covers with random background colors. To do this, you need to tweak the script so that it generates three random numbers between 0 and 255. The convert tool can then use the numbers as red, green, and blue values for generating a cover:
r=$(shuf -i 0-255 -n 1)
g=$(shuf -i 0-255 -n 1)
b=$(shuf -i 0-255 -n 1)
convert -size 800x1024 xc:rgb\($r,$g,$b\) cover.jpg
If solid colors are not your cup of tea, you can use the convert tool to generate a random colorful fractal image and specify the -paint and -blur options for a more artistic effect:
convert -size 800x1024 plasma:fractal -paint 10 -blur 10x20 cover.png
Finally, Pandoc finishes the task. It assembles the saved HTML file, the generated cover, and the obtained data into an ePub file and saves it either in the default directory (line 71) or in the subdirectory specified by the -d parameter.
But that's not all. If you read a lot, running the script every time you want to save a page for later can quickly become a nuisance. That's why the script also features the -m parameter. When specified with the auto value, the script picks URLs from the links.txt file one by one and generates an ePub file for each one. The if...then...fi block that starts on line 79 checks whether the $mode value is set to auto. If so, the while...do loop (lines 86-88) reads URLs from the links.txt file and calls the readicule() function to generate ePub files. If the $mode value is not specified, the script simply calls the function to generate an ePub file using the URL passed by the -u parameter.
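The reading loop deserves a closer look, because the || [ -n "$url" ] part is easy to overlook: It keeps the last URL even when links.txt doesn't end with a newline. The sketch below isolates the idiom. The each_url function and the example.com addresses are hypothetical, and unlike the script, the function also skips blank lines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each URL from a links file, one per line,
# skipping blank lines. The '|| [ -n "$url" ]' test keeps the final line
# even if the file does not end with a newline.
each_url() {
  local file=$1 url
  [ -f "$file" ] || { echo "$file not found." >&2; return 1; }
  while IFS="" read -r url || [ -n "$url" ]; do
    [ -n "$url" ] || continue
    printf '%s\n' "$url"
  done <"$file"
}

# Demo with a temporary file that lacks a trailing newline
tmp=$(mktemp)
printf 'https://example.com/one\n\nhttps://example.com/two' >"$tmp"
each_url "$tmp"
rm -f "$tmp"
```

In the real script, the loop body calls readicule() instead of printing the URL.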
To speed up the process of transforming articles into ePub files, you can create a simple helper script:
#!/usr/bin/env bash
url=$(xclip -o)
echo "$url"
cd /path/to/readiculous || exit 1
./readiculous.sh -u "$url"
notify-send "Added to Readiculous"
Replace /path/to/readiculous with the actual path to the readiculous directory, and save the script under an appropriate name (for example, add-to-readiculous.sh). Install the xclip tool on your system, and assign a keyboard shortcut to the script.
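Since the helper passes along whatever happens to be in the clipboard, you might want to check that it actually looks like a web address first. The following is only a sketch of such a guard: is_url is a hypothetical function, the check is deliberately loose, and the candidate variable stands in for the output of xclip -o.

```shell
#!/usr/bin/env bash
# Hypothetical guard: succeed only if the argument starts with http:// or
# https:// and has at least one character after the scheme.
is_url() {
  case $1 in
    http://?* | https://?*) return 0 ;;
    *) return 1 ;;
  esac
}

# Stand-in for the clipboard content; the helper script would use "$(xclip -o)"
candidate="https://example.com/article"
if is_url "$candidate"; then
  echo "Looks like a URL: $candidate"
else
  echo "Clipboard does not contain a URL" >&2
fi
```

In the helper script, you could call readiculous.sh only when the guard succeeds and show a different notification otherwise.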
The Matter of Reading
Saving articles in the ePub format means that you can read them using practically any device on any platform. Better yet, if you use Apple Books or Google Books, you can take advantage of the features these apps offer, including synchronization across multiple devices, saving highlights, library management functionality, and more.
However, if you've gone to the trouble of rolling out your own read-it-later tool, it probably doesn't make much sense to use a third-party commercial platform for reading. Enter KOReader [3], an open source ebook reader application available for Linux, Android, and a slew of dedicated readers. Despite its deceptively simple interface, KOReader packs an impressive array of features, including syncing, highlights, gesture support, note-taking capabilities, extensions, and much, much more (Figure 1). So if you want to keep your entire read-it-later toolchain open source, you should use KOReader.