Rolling your own RSS aggregator
Tutorial – Homegrown RSS Aggregator
Create the perfect mix of news with an RSS aggregator. Linux supports several open source aggregators, or, if none of them quite fits your needs, you can even create your own.
Really Simple Syndication (RSS) [1] of website headlines is the most underrated tech on the Internet. This tutorial offers a flash introduction to RSS and shows how easy it is to build your own RSS aggregator to collect and process online news.
Understanding RSS
Technically speaking, RSS is an open standard that any website can use to provide small, plain-text files that contain titles, links, and excerpts. Programs called RSS aggregators download these files from many independent sources and display the headlines in one coherent interface.
These days too many people don't even know that RSS exists, and one reason might be the false but widespread belief that RSS died in 2013 when Google killed off Google Reader.
I have lost count of how many times I have read this false claim. RSS never died: It's just extremely underused, in no small part because those who would benefit from it the most – that is, web publishers – sometimes actively hide their own RSS feeds [2].
The truth is that RSS is a very efficient way to browse all the news you want in one place. Finding RSS feeds to follow is easy. You can even find free services such as WikiRSS [3] that extract RSS links matching a given keyword from Wikipedia (Figure 1). It is also possible to create and exchange whole lists of feeds in Outline Processor Markup Language (OPML) [4], a plain-text format that is easy to create and process in any programming language. You can even follow any Mastodon account via RSS by adding .rss to the end of the account's public profile URL. I could fill pages just listing other similar examples of RSS integration with other services.
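Because OPML is plain XML, even a shell one-liner can pull the feed URLs out of an exported subscription list. A minimal sketch (the file name and feeds are placeholders, and the grep/sed approach assumes a well-formed export; a real XML parser is more robust):

```shell
# Extract all feed URLs (the xmlUrl attributes) from an OPML export.
# subscriptions.opml is a hypothetical file name.
grep -o 'xmlUrl="[^"]*"' subscriptions.opml | sed 's/^xmlUrl="//; s/"$//'
```

The resulting list of URLs can then be fed to any aggregator or script.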
Thanks to its decentralization, interoperability, simplicity, and openness, RSS is the best way to follow scores of news sources without centralized surveillance, profiling, or a single point of failure [5]. With any RSS aggregator, and there are plenty of them, you curate your own experience of the web at your pace, without flame wars or other distractions.
Open Source RSS Readers
RSS apps for the Linux desktop include Akregator [6] (Figure 2) and Liferea [7] (Figure 3). Akregator is a KDE application often used in tandem with the Konqueror web browser and file manager, but it has its own tabbed interface for downloading and reading RSS content without opening a browser. Liferea can cache articles for offline reading, save their headlines in "news bins," perform searches on RSS posts in the news bins, and play podcasts linked in the feeds.
Tiny Tiny RSS (TT-RSS) [8] is a web-based aggregator that you can run from its official Docker image (only available for Linux on amd64 systems). Like Liferea and Akregator, TT-RSS can filter news by several criteria and display the full contents, but its main advantage is that it can be installed on any web server. From there it can support access by multiple users (Figure 4), even from its official Android client.
RSSHub [9] is another web-based RSS front end. According to the project homepage, RSSHub can generate feeds from "pretty much everything." The easiest way to see what RSSHub can do is to try the online demo linked from the homepage. As with TT-RSS, you can self-host an instance of RSSHub on your own server [10]. As a final example of web-based aggregators, take a look at FreeRSS [11] (Figure 5).
When it comes to the command line, a very simple client such as Rsstail [12] might be all you need, but I personally recommend the Newsboat program [13] shown in Figures 6, 7, and 8. Newsboat lets you read whole articles or play podcasts, filter or group feeds according to many criteria, and define macros to automate actions. Above all, Newsboat can directly send links or whole articles to other programs for post-processing and is generally a great foundation for RSS automation. I like Newsboat because it handles any kind of RSS file, it's actively developed, and it is easy to install on any major Linux distribution (personally, I'm using it on both Ubuntu and CentOS without problems).
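Newsboat's macro mechanism is what makes that kind of post-processing possible. As an illustrative sketch (the key binding and target file are made up; the macro syntax follows the Newsboat documentation):

```
# In ~/.newsboat/config: pressing ",s" appends the current article's
# URL to a text file, then restores the normal browser setting.
macro s set browser "echo %u >> ~/saved-links.txt" ; open-in-browser ; set browser "firefox %u"
```

Swap the `echo` command for any script of your own to build arbitrary pipelines on top of your feeds.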
Custom RSS Application
RSS makes it very easy to extend or customize a reader app to include third-party services – or even some of your own code. If you have a website and want to automatically share all updates with your Mastodon account, for example, you can just use MastoFeed [14].
Figure 9 shows what might be the most efficient and least distracting way to browse RSS headlines on a Linux desktop: Just place them in dynamic pop-up menus of your favorite window manager! To learn how, see my previous article on desktop newsfeeds [15].
I have been browsing almost 100 RSS feeds every day for almost 10 years with a really bare but terribly efficient web-based aggregator I coded myself in a few hours. My app works so well for me that in all those years I have patched or completely rewritten the code several times but have never changed its flow. The tool downloads all the feeds listed in a plain-text file with the format of Listing 1, extracts titles and links of all the articles they contain, and shows all the text published after the previous run in one super-simple HTML page, grouped by category (Figure 10).
Listing 1
List of RSS Feeds, Grouped by Category
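A sketch of the format, with placeholder URLs standing in for the real feeds (one category|URL pair per line):

```
020-World|https://example.com/world/feed.xml
020-World|https://example.org/global-news.rss
030-TechInnovation|https://example.net/innovation.rss
040-FOSS|https://example.com/foss/feed.xml
070-Trekking|https://example.org/trekking.rss
```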
The first version of my homemade aggregator was a patchwork of very ugly Python and Perl scripts that used the RSS modules available for those languages. Last year, however, I rewrote the whole thing from scratch as the two scripts from Listings 2 and 3, and, although they are still ugly, I am very happy with the result.
The main script, myrss.sh (Listing 2), runs on my server as a cron job of the root user, twice every hour, at 7 and 37 minutes past the hour. The corresponding crontab entry is:
7,37 * * * * root /usr/local/bin/myrss.sh
Listing 2
myrss.sh
The script uses Newsboat to do all the actual RSS work. The first 15 lines set all the variables that the script needs, create the $ARCHIVEDIR folder if it does not exist, and remove the results of the previous run (lines 14 and 15). The $FEEDLIST file, partially shown in Listing 1, contains one feed per line, with its category and URL separated by the pipe character. The numeric prefixes of the category names (e.g., 020 for "World") allow me to add, remove, or rename categories at will, while sorting them as I want, in the final HTML page shown in Figure 10.

Lines 17 and 18 extract the category names from the $FEEDLIST into the temporary rss-channels file and then save them, using the mapfile Bash builtin, in an array called $CHANNELS. Running the script on the $FEEDLIST shown in Listing 1 would fill that array with the four categories "020-World," "030-TechInnovation," "040-FOSS," and "070-Trekking."
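The technique of lines 17 and 18 can be sketched like this, assuming the category|URL format of Listing 1 (the file name is a placeholder and the exact commands may differ from the real script):

```shell
# Collect the unique, sorted category names (the first pipe-separated
# field) from the feed list into a Bash array.
FEEDLIST=myrss-feeds.txt                # placeholder file name
cut -d'|' -f1 "$FEEDLIST" | sort -u > rss-channels
mapfile -t CHANNELS < rss-channels
printf '%s\n' "${CHANNELS[@]}"          # one category per line
```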
The loop in lines 20 to 50 does two things: Lines 22 to 26 extract from $FEEDLIST all the URLs of the newsfeeds that belong to the current category $C and save them, using the same mapfile technique as line 18, in another array called $CURFEEDS. When $C is "020-World," for example, that array would contain the URLs of the first two lines of Listing 1.
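That inner extraction can be sketched in the same way (again with a placeholder feed list; the real script's commands may differ in detail):

```shell
# For one category $C, collect the URLs (the second pipe-separated
# field) of its feeds into an array.
FEEDLIST=myrss-feeds.txt                # placeholder file name
C='020-World'
mapfile -t CURFEEDS < <(grep "^$C|" "$FEEDLIST" | cut -d'|' -f2)
printf '%s\n' "${CURFEEDS[@]}"          # one feed URL per line
```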
The real engine of the script is the inner loop shown in lines 28 to 49, which works around what is, for my purposes, a limitation of Newsboat: As of this writing, there is no way to run Newsboat from a script so that it directly dumps all the titles and URLs it finds into a plain-text file. That is not a showstopper, however, because Newsboat caches all that data in an SQLite3 database. So the trick is to tell Newsboat what file to use as the cache and then query that file with the SQLite3 program.
Consequently, each iteration of that loop saves the current element of $CURFEEDS into the temporary file /root/single-feed (line 30) and then tells Newsboat (line 32) to download the feed contained in that URL (the -u option), save all its content in the cache file $SQLITECACHE (-c), and then quit (-x reload).
Once Newsboat has done its job, lines 41 to 48 launch SQLite3 to grab URLs and titles from the cache and save them into the file $ARCHIVEDIR/latest-news-unsorted. Line 41 is the most cryptic at first sight, but not so difficult to explain. The <<!EOF syntax is what Bash manuals call a here document. It means that all the lines that follow, up to the line consisting of the same delimiter string (that is, lines 42 to 46), must be passed as one file to the standard input of the program named right before the double < sign.

In other words, the first half of line 41 means "run the SQLite3 program to open the $SQLITECACHE database and, using the | character as a column separator, execute all the commands listed below, down to the closing delimiter."
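Here documents are worth internalizing because they appear everywhere in shell scripting. A minimal standalone example, unrelated to the script itself:

```shell
# Feed a multi-line literal to a program's standard input: everything
# between <<'EOF' and the line containing only EOF is passed to sort
# as if it were a file.
sort <<'EOF'
banana
apple
cherry
EOF
```

Run in a terminal, this prints the three fruit names in alphabetical order.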
The statements in the here document just tell SQLite3 to fetch all the titles and corresponding links from the Newsboat database (line 45), print them to standard output, one per line and without headers (line 43), and then quit.

In the second half of line 41, the sed utility adds to each line printed out by SQLite3 a prefix consisting of the name of the current channel followed by a pipe character. The result is appended to the file $ARCHIVEDIR/latest-news-unsorted.
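That sed step is easy to reproduce in isolation; here the SQLite3 output is simulated with printf and the channel name is hard-coded:

```shell
# Prefix each title|url line with the current channel name and a pipe.
C='040-FOSS'
printf 'Some Title|https://example.com/post\n' | sed "s/^/$C|/"
```

This prints `040-FOSS|Some Title|https://example.com/post`, which is exactly the three-field format the rest of the script expects.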
Please note that the actual instructions in the here document may depend on which exact version of SQLite3 is available on your system. As an example, check out the discussion [16] in which the Newsboat developers (thanks!) explained to me how to do what I needed.
The rest of myrss.sh is much easier to explain – or modify as you wish. After the main loop has ended, the file $ARCHIVEDIR/latest-news-unsorted will contain lines such as
040-FOSS|Alchemy Is A Free PDF File Converter|https://medevel.com/alchemy-file-converter/
plus some blank lines from SQLite3, and possibly duplicated lines, if two RSS feeds contained the same link with the same headline (yes, it happens regularly). That is the reason to take all – and only – the lines of the file that contain a URL (grep http), sort them, remove duplicates (uniq), and save what remains into $ARCHIVEDIR/latest-news. Line 55, instead, removes all the results of previous runs of myrss.sh that are older than 60 days. The reasons both for this action and for line 57, which compacts all the URLs found in the past 60 days into $ARCHIVEDIR/all-old-urls, will become clear later.
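The cleanup step just described can be sketched with ordinary shell tools (file names are placeholders; the pruning command is shown commented out so the sketch has no side effects):

```shell
# Keep only the lines that contain a URL, sort them, and drop
# duplicates, as myrss.sh does after the main loop.
grep http latest-news-unsorted | sort | uniq > latest-news

# Prune per-run result files older than 60 days, roughly what
# line 55 of myrss.sh does ($ARCHIVEDIR and the name pattern
# are assumptions):
# find "$ARCHIVEDIR" -name 'latest-news-*' -mtime +60 -delete
```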