Rolling your own RSS aggregator
Tutorial – Homegrown RSS Aggregator
Create the perfect mix of news with an RSS aggregator. Linux supports several open source aggregators, or, if you're looking for the perfect fit, you can even create your own.
Really Simple Syndication (RSS) [1] of website headlines is the most underrated tech on the Internet. This tutorial offers a flash introduction to RSS and shows how easy it is to build your own RSS aggregator to collect and process online news.
Understanding RSS
Technically speaking, RSS is an open standard that any website can use to provide small, plain-text files that contain titles, links, and excerpts. Programs called RSS aggregators download these files from many independent sources and display the headlines in one coherent interface.
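A feed is just a small XML file. The following minimal sketch (site name, URLs, and text are placeholders, not a real feed) shows roughly what an RSS 2.0 file contains: a channel with its own title and link, plus one item per article:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Site</title>
    <link>https://www.example.com/</link>
    <description>News from an example site</description>
    <item>
      <title>A headline</title>
      <link>https://www.example.com/a-headline</link>
      <description>A short excerpt of the article.</description>
    </item>
  </channel>
</rss>

An aggregator only has to fetch files like this one on a schedule and display the title and link of each item.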
These days too many people don't even know that RSS exists, and one reason might be the false but widespread belief that RSS died in 2013 when Google killed off Google Reader.
I have lost count of how many times I have read this false claim. RSS never died: It's just extremely underused, in no small part because those who would benefit from it the most – that is, web publishers – sometimes actively hide their own RSS feeds [2].
The truth is that RSS is a very efficient way to browse all the news you want in one place. Finding RSS feeds to follow is easy. You can even find free services such as WikiRSS [3] that extract RSS links matching a given keyword from Wikipedia (Figure 1). It is also possible to create and exchange whole lists of feeds in Outline Processor Markup Language (OPML) [4], a plain-text format that is easy to create and process in any programming language. You can even follow any Mastodon account via RSS by adding .rss to the end of the account's public profile URL. I could fill pages just listing other similar examples of RSS integration with other services.
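As a rough illustration (the group and titles below are arbitrary, and the feed URL is one of those from Listing 1 later in this article), an OPML file is nothing more than a nested list of outline elements, each pointing to one feed:

<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
  <head>
    <title>My feeds</title>
  </head>
  <body>
    <outline text="FOSS">
      <outline type="rss" text="Linux Magazine" xmlUrl="http://www.linux-magazine.com/rss/feed/lmi_news"/>
    </outline>
  </body>
</opml>

Most aggregators can import and export lists of subscriptions in exactly this format.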
Thanks to its decentralization, interoperability, simplicity, and openness, RSS is the best way to follow scores of news sources without centralized surveillance, profiling, or a single point of failure [5]. With any RSS aggregator, and there are plenty of them, you curate your own experience of the web at your pace, without flame wars or other distractions.
Open Source RSS Readers
RSS apps for the Linux desktop include Akregator [6] (Figure 2) and Liferea [7] (Figure 3). Akregator is a KDE application often used in tandem with the Konqueror web browser and file manager, but it has its own tabbed interface for downloading and reading RSS content without opening a browser. Liferea can cache articles for offline reading, save their headlines in "news bins," perform searches on RSS posts in the news bins, and play podcasts linked in the feeds.
Tiny Tiny RSS (TT-RSS) [8] is a web-based aggregator that you can run from its official Docker image (only available for Linux/amd64 systems). Like Liferea and Akregator, TT-RSS can filter news by several criteria and display the full contents, but its main advantage is the fact that it can be installed on any web server. From there it can support access by multiple users (Figure 4), even from its official Android client.
RSSHub [9] is another web-based RSS front end. According to the project homepage, RSSHub can generate feeds from "pretty much everything." The easiest way to see what RSSHub can do is to try the online demo linked from the homepage. As with TT-RSS, you can self-host an instance of RSSHub on your own server [10]. As a final example of web-based aggregators, take a look at FreeRSS [11] (Figure 5).
When it comes to the command line, a very simple client such as Rsstail [12] might be all you need, but I personally recommend the Newsboat program [13] shown in Figures 6, 7, and 8. Newsboat lets you read whole articles or play podcasts, filter or group feeds according to many criteria, and define macros to automate actions. Above all, Newsboat can directly send links or whole articles to other programs for post-processing and is generally a great foundation for RSS automation. I like Newsboat because it handles any kind of RSS file, it's actively developed, and it is easy to install on any major Linux distribution (personally, I'm using it on both Ubuntu and CentOS without problems).
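As a taste of that automation, a couple of lines like the following in Newsboat's config file are enough to hand links to external programs. The script path and the media player here are placeholders of mine, not part of any default setup:

# pass every bookmarked link (URL, title, description, feed title) to a custom script
bookmark-cmd "~/bin/save-link.sh"
# the macro ",p" temporarily switches the browser to mpv to play the current link
macro p set browser "mpv %u" ; open-in-browser ; set browser "firefox %u"

The same mechanism can just as easily feed articles to a note-taking tool, a read-it-later queue, or any script you care to write.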
Custom RSS Application
RSS makes it very easy to extend or customize a reader app to include third-party services – or even some of your own code. If you have a website and want to automatically share all updates with your Mastodon account, for example, you can just use MastoFeed [14].
Figure 9 shows what might be the most efficient and least distracting way to browse RSS headlines on a Linux desktop: Just place them in dynamic pop-up menus of your favorite window manager! To learn how, see my previous article on desktop newsfeeds [15].
I have been browsing almost 100 RSS feeds every day for almost 10 years with a really bare but terribly efficient web-based aggregator I coded myself in a few hours. My app works so well for me that in all those years I have patched or completely rewritten the code several times but have never changed its flow. The tool downloads all the feeds listed in a plain-text file with the format of Listing 1, extracts titles and links of all the articles they contain, and shows all the text published after the previous run in one super-simple HTML page, grouped by category (Figure 10).
Listing 1
List of RSS Feeds, Grouped by Category
020-World|https://brilliantmaps.com/rss
020-World|https://restofworld.org/feed
030-TechInnovation|http://rss.slashdot.org/Slashdot/slashdot
030-TechInnovation|https://news.ycombinator.com/rss
030-TechInnovation|https://www.datasciencecentral.com/feed/atom
040-FOSS|http://fossforce.com/newswire
040-FOSS|http://lxer.com/module/newswire/headlines.rdf
040-FOSS|http://www.linux-magazine.com/rss/feed/lmi_news
040-FOSS|http://www.tuxmachines.org/node/feed
070-Trekking|http://www.backpackinglight.com/rss
The first version of my homemade aggregator was a patchwork of very ugly Python and Perl scripts that used the RSS modules available for those languages. Last year, however, I rewrote the whole thing from scratch as the two scripts from Listings 2 and 3, and, although they are still ugly, I am very happy with the result.
The main script, myrss.sh (Listing 2), runs on my server as a cron job of the root user, twice every hour, at 7 and 37 minutes past the hour. The corresponding crontab entry is:
7,37 * * * * root /usr/local/bin/myrss.sh
Listing 2
myrss.sh
 1 #!/bin/bash
 2
 3 MOMENT=`date +%Y%m%d%H%M`
 4 ARCHIVEDIR="/tmp/myrss-archive"
 5 SQLITECACHE="$ARCHIVEDIR/tempcache.db"
 6 SQLITECACHE="/root/tempcache.db"
 7
 8 FEEDLIST="/usr/local/etc/feed-list.txt"
 9 HTMLGEN="/usr/local/bin/generate-rss-page.pl"
10 HTMLDIR="/var/www/html/main"
11 NEWSBOAT="/var/lib/snapd/snap/bin/newsboat"
12
13 mkdir -p $ARCHIVEDIR
14 rm -f $ARCHIVEDIR/latest-news
15 rm -f $ARCHIVEDIR/rss-channels
16
17 grep -v '^#' $FEEDLIST | cut '-d|' -f1 | sort | uniq > $ARCHIVEDIR/rss-channels
18 mapfile -t CHANNELS < $ARCHIVEDIR/rss-channels
19
20 for C in "${CHANNELS[@]}"
21 do
22   echo "CATEGORY $C"
23   rm -f $ARCHIVEDIR/current-feeds
24   grep $C $FEEDLIST | grep -v '^#' | cut '-d|' -f2 > $ARCHIVEDIR/current-feeds
25   mapfile -t CURFEEDS < $ARCHIVEDIR/current-feeds
26   rm $ARCHIVEDIR/current-feeds
27   for F in "${CURFEEDS[@]}"
28   do
29     # save the current feed with newsboat
30     echo $F > /root/single-feed
31     rm -f $SQLITECACHE
32     $NEWSBOAT -u /root/single-feed -c $SQLITECACHE -x reload
33     RESULT=$?
34     if [ $RESULT -eq 0 ]
35     then
36       echo -n "OK $C / $F : "
37     else
38       echo -n "ERROR: $RESULT $C / $F: "
39     fi
40
41     sqlite3 -separator '|' $SQLITECACHE <<!EOF | sed -e "s/^/$C|/" >> $ARCHIVEDIR/latest-news-unsorted
42 .headers off
43 .mode list
44 .output stdout
45 SELECT title, url FROM rss_item;
46 .quit
47 !EOF
48     rm -f $SQLITECACHE
49   done
50 done
51
52 sort $ARCHIVEDIR/latest-news-unsorted | grep http | sort | uniq > $ARCHIVEDIR/latest-news
53 rm -f $ARCHIVEDIR/latest-news-unsorted
54
55 find $ARCHIVEDIR -type f -name "loaded-urls-*" -mtime +60 -exec rm {} \;
56
57 cat $ARCHIVEDIR/loaded-urls-* | sort | uniq > $ARCHIVEDIR/all-old-urls
58
59 cd $HTMLDIR
60
61 mv main-5.html main-6.html
62 mv main-4.html main-5.html
63 ...
64 mv main.html main-1.html
65
66 $HTMLGEN $ARCHIVEDIR/all-old-urls > main.html
67
68 cut '-d|' -f3 $ARCHIVEDIR/latest-news | sort | uniq > $ARCHIVEDIR/loaded-urls-$MOMENT.txt
69
70 exit
The script uses Newsboat to do all the actual RSS work. The first 15 lines set all the variables that the script needs, create the $ARCHIVEDIR folder if it does not exist, and remove the results of the previous run (lines 14 and 15). The $FEEDLIST file, partially shown in Listing 1, contains one feed per line, with its category and URL separated by the pipe character. The numeric prefixes of the category names (e.g., 020 for "World") let me add, remove, or rename categories at will, while sorting them as I want in the final HTML page shown in Figure 10.
Lines 17 and 18 extract the category names from the $FEEDLIST into the temporary rss-channels file and then save them, using the mapfile Bash built-in, in an array called $CHANNELS. Running the script on the $FEEDLIST shown in Listing 1 would fill that array with the four categories "020-World," "030-TechInnovation," "040-FOSS," and "070-Trekking."
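If you have never used mapfile, this tiny sketch (the /tmp path is only an illustration) shows what it does: every line of the input file becomes one element of a Bash array, starting at index 0.

printf '020-World\n030-TechInnovation\n' > /tmp/rss-channels
mapfile -t CHANNELS < /tmp/rss-channels
echo "${CHANNELS[1]}"   # prints 030-TechInnovation; -t strips the trailing newlines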
The loop in lines 20 to 50 does two things: Lines 22 to 26 extract from $FEEDLIST all the URLs of the newsfeeds that belong to the current category $C and save them, using the same mapfile technique as in line 18, in another array called $CURFEEDS. When $C is "020-World," for example, that array would contain the URLs of the first two lines of Listing 1.
The real engine of the script is the inner loop shown in lines 28 to 49, which works around what is, for my purposes, a limitation of Newsboat: As of this writing, there is no way to run Newsboat from a script in a way that directly dumps all the titles and URLs it finds into a plain-text file. That is not a showstopper, however, because Newsboat caches all that data in an SQLite3 database. So the trick is to tell Newsboat which file to use as the cache and then query that file with the SQLite3 program.
Consequently, each iteration of that loop saves the current element of $CURFEEDS into the temporary file /root/single-feed (line 30) and then tells Newsboat (line 32) to download the feed at that URL (the -u option), save all of its content in the cache file $SQLITECACHE (-c), and then quit (-x reload).
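You can reproduce the whole trick by hand in a terminal to see what Newsboat and SQLite3 exchange. In this sketch the /tmp paths are only placeholders, and the feed URL is one of those from Listing 1:

echo 'http://www.linux-magazine.com/rss/feed/lmi_news' > /tmp/single-feed
newsboat -u /tmp/single-feed -c /tmp/testcache.db -x reload
sqlite3 -separator '|' /tmp/testcache.db 'SELECT title, url FROM rss_item;'

The last command prints one title|URL pair per line, which is exactly what the script collects in its main loop.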
Once Newsboat has done its job, lines 41 to 48 launch SQLite3 to grab URLs and titles from the cache and save them into the file $ARCHIVEDIR/latest-news-unsorted. Line 41 is the most cryptic at first sight, but it is not difficult to explain. The <<!EOF syntax is what Bash manuals call a here document. It means that all the lines that follow, up to (but not including) the line that contains only the delimiter string !EOF – that is, lines 42 to 46 – are passed as one file to the standard input of the program named right before the << operator.

In other words, the first half of line 41 means "run the SQLite3 program to open the $SQLITECACHE database and, using the | character as a column separator, execute all the commands listed below, until the !EOF string."
The statements in the here document just tell SQLite3 to fetch all the titles and corresponding links from the Newsboat database (line 45), print them to the standard output without headers, one per line (lines 42 to 44), and then quit.
In the second half of line 41, the sed utility adds to each line printed by SQLite3 a prefix consisting of the name of the current channel followed by a pipe character. The result is appended to the file $ARCHIVEDIR/latest-news-unsorted.
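To see the effect of that sed call in isolation, here is a one-liner with a made-up headline and URL, using one of the category names from Listing 1 in place of $C:

echo 'Some Headline|https://www.linux-magazine.com/some-headline' | sed -e 's/^/040-FOSS|/'

It prints 040-FOSS|Some Headline|https://www.linux-magazine.com/some-headline, which is the three-field format that the rest of the script expects.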
Please note that the actual instructions in the here document may depend on which exact version of SQLite3 is available on your system. As an example, check out the discussion [16] where the Newsboat developers (thanks!) explained to me how to do what I needed.
The rest of myrss.sh is much easier to explain – or to modify as you wish. After the main loop has ended, the file $ARCHIVEDIR/latest-news-unsorted will contain lines such as
040-FOSS|Alchemy Is A Free PDF File Converter|https://medevel.com/alchemy-file-converter/
plus some blank lines from SQLite3 and possibly duplicate lines, if two RSS feeds contained the same link with the same headline (yes, it happens regularly). That is the reason to take all – and only – the lines of the file that contain a URL (grep http), sort them, remove duplicates (uniq), and save what remains into $ARCHIVEDIR/latest-news (line 52). Line 55 then removes all the results of previous runs of myrss.sh that are older than 60 days. The reasons both for this action and for line 57, which compacts all the URLs found in the past 60 days into $ARCHIVEDIR/all-old-urls, will become clear later.
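A quick way to convince yourself of what that cleanup pipeline does is to feed it a duplicate, made-up entry (the headline and URL below are invented):

printf '040-FOSS|Same Headline|https://example.org/post\n\n040-FOSS|Same Headline|https://example.org/post\n' | grep http | sort | uniq

The blank line disappears because it contains no URL, and the two identical entries collapse into a single 040-FOSS|Same Headline|https://example.org/post line.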