Rolling your own RSS aggregator

Tutorial – Homegrown RSS Aggregator

Article from Issue 271/2023

Create the perfect mix of news with an RSS aggregator. Linux supports several open source aggregators, or, if you're looking for the perfect fit, you can even create your own.

Really Simple Syndication (RSS) [1] of website headlines is the most underrated tech on the Internet. This tutorial offers a flash introduction to RSS and shows how easy it is to build your own RSS aggregator to collect and process online news.

Understanding RSS

Technically speaking, RSS is an open standard that any website can use to provide small, plain-text files that contain titles, links, and excerpts. Programs called RSS aggregators download these files from many independent sources and display the headlines in one coherent interface.
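
For illustration, here is a minimal, hypothetical RSS 2.0 file (the feed name and URLs are invented); real feeds follow the same structure, just with more items and more metadata per item:

<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Linux News</title>
    <link>https://news.example.com/</link>
    <description>Headlines from a hypothetical Linux news site</description>
    <item>
      <title>Example headline</title>
      <link>https://news.example.com/example-headline</link>
      <description>A short excerpt of the article...</description>
    </item>
  </channel>
</rss>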

These days too many people don't even know that RSS exists, and one reason might be the false but widespread belief that RSS died in 2013 when Google killed off Google Reader.

I have lost count of how many times I have read this false claim. RSS never died: It's just extremely underused, in no small part because those who would benefit from it the most – that is, web publishers – sometimes actively hide their own RSS feeds [2].

The truth is that RSS is a very efficient way to browse all the news you want in one place. Finding RSS feeds to follow is easy. You can even find free services such as WikiRSS [3] that extract RSS links matching a given keyword from Wikipedia (Figure 1). It is also possible to create and exchange whole lists of feeds in Outline Processor Markup Language (OPML) [4], a plain-text format that is easy to create and process in any programming language. You can even follow any Mastodon account via RSS by adding .rss to the end of the account's public profile URL. I could fill pages just listing other similar examples of RSS integration with other services.
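
As a minimal sketch of such a list, the OPML file below groups two feeds (both of which appear later in this article) under one category; any aggregator that imports OPML should accept a file with this structure:

<?xml version="1.0" encoding="UTF-8"?>
<opml version="2.0">
  <head>
    <title>My FOSS feeds</title>
  </head>
  <body>
    <outline text="FOSS">
      <outline text="Linux Magazine" type="rss" xmlUrl="http://www.linux-magazine.com/rss/feed/lmi_news"/>
      <outline text="LXer" type="rss" xmlUrl="http://lxer.com/module/newswire/headlines.rdf"/>
    </outline>
  </body>
</opml>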

Figure 1: Wikipedia, and the web in general, is full of RSS feeds, about Linux or any other topic.

Thanks to its decentralization, interoperability, simplicity, and openness, RSS is the best way to follow scores of news sources without centralized surveillance, profiling, or a single point of failure [5]. With any RSS aggregator, and there are plenty of them, you curate your own experience of the web at your pace, without flame wars or other distractions.

Open Source RSS Readers

RSS apps for the Linux desktop include Akregator [6] (Figure 2) and Liferea [7] (Figure 3). Akregator is a KDE application often used in tandem with the Konqueror web browser and file manager, but it has its own tabbed interface for downloading and reading RSS content without opening a browser. Liferea can cache articles for offline reading, save their headlines in "news bins," perform searches on RSS posts in the news bins, and play podcasts linked in the feeds.

Figure 2: Akregator, the RSS component of the Kontact suite, has the same clean, efficient look and feel as all other major KDE applications.
Figure 3: Liferea is another very powerful and easy-to-use RSS aggregator for Linux.

Tiny Tiny RSS (TT-RSS) [8] is a web-based aggregator that you can run from its official Docker image (only available for Linux/AMD64 systems). Like Liferea and Akregator, TT-RSS can filter news by several criteria and display full article contents, but its main advantage is that it can be installed on any web server, where it can support access by multiple users (Figure 4), even through its official Android client.

Figure 4: TT-RSS brings the same basic interface as Akregator or Liferea to any web browser, for multiple users!

RSSHub [9] is another web-based RSS front end. According to the project homepage, RSSHub can generate feeds from "pretty much everything." The easiest way to see what RSSHub can do is to try the online demo linked from the homepage. As with TT-RSS, you can self-host an instance of RSSHub on your own server [10]. As a final example of web-based aggregators, take a look at FreeRSS [11] (Figure 5).

Figure 5: FreeRSS has a much more basic interface than TT-RSS, but may be even more efficient.

When it comes to the command line, a very simple client such as Rsstail [12] might be all you need, but I personally recommend the Newsboat program [13] shown in Figures 6, 7, and 8. Newsboat lets you read whole articles or play podcasts, filter or group feeds according to many criteria, and define macros to automate actions. Above all, Newsboat can directly send links or whole articles to other programs for post-processing and is generally a great foundation for RSS automation. I like Newsboat because it handles any kind of RSS file, it's actively developed, and it is easy to install on any major Linux distribution (personally, I'm using it on both Ubuntu and CentOS without problems).
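
If you want to try Newsboat before automating anything, a minimal session looks roughly like this (the feed URL is just an example, and depending on your installation the URL file may live in ~/.config/newsboat/urls instead):

# add one feed URL per line to Newsboat's URL file
echo 'http://www.linux-magazine.com/rss/feed/lmi_news' >> ~/.newsboat/urls
# fetch all feeds once and exit, without opening the interactive interface
newsboat -x reload
# browse the downloaded headlines interactively
newsboat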

Figure 6: Bare, but powerful, Newsboat is probably the best command-line RSS aggregator for Linux.
Figure 7: Newsboat showing the 20 newest headlines from the RSS feed of Linux Magazine.
Figure 8: An excerpt of one of the news items shown in Figure 7.

Custom RSS Application

RSS makes it very easy to extend or customize a reader app to include third-party services – or even some of your own code. If you have a website and want to automatically share all updates with your Mastodon account, for example, you can just use MastoFeed [14].

Figure 9 shows what might be the most efficient and least distracting way to browse RSS headlines on a Linux desktop: Just place them in dynamic pop-up menus of your favorite window manager! To learn how, see my previous article on desktop newsfeeds [15].

Figure 9: With RSS, the latest news can be always at your fingertips, inside the menus of your window manager!

I have been browsing almost 100 RSS feeds every day for almost 10 years with a really bare but terribly efficient web-based aggregator I coded myself in a few hours. My app works so well for me that in all those years I have patched or completely rewritten the code several times but have never changed its flow. The tool downloads all the feeds listed in a plain-text file in the format of Listing 1, extracts the titles and links of all the articles they contain, and shows everything published since the previous run in one super-simple HTML page, grouped by category (Figure 10).

Listing 1

List of RSS Feeds, Grouped by Category

020-World|https://brilliantmaps.com/rss
020-World|https://restofworld.org/feed
030-TechInnovation|http://rss.slashdot.org/Slashdot/slashdot
030-TechInnovation|https://news.ycombinator.com/rss
030-TechInnovation|https://www.datasciencecentral.com/feed/atom
040-FOSS|http://fossforce.com/newswire
040-FOSS|http://lxer.com/module/newswire/headlines.rdf
040-FOSS|http://www.linux-magazine.com/rss/feed/lmi_news
040-FOSS|http://www.tuxmachines.org/node/feed
070-Trekking|http://www.backpackinglight.com/rss

Figure 10: My very own, custom RSS aggregator – a page packed with headlines and nothing else!

The first version of my homemade aggregator was a patchwork of very ugly Python and Perl scripts that used the RSS modules available for those languages. Last year, however, I rewrote the whole thing from scratch as the two scripts from Listings 2 and 3, and, although they are still ugly, I am very happy with the result.

The main script, myrss.sh (Listing 2), runs on my server as a cron job of the root user twice every hour, at 7 and 37 minutes past the hour. The corresponding crontab entry is:

7,37  *  *  *  *  root /usr/local/bin/myrss.sh

Listing 2

myrss.sh

 1      #!/bin/bash
 2
 3      MOMENT=`date +%Y%m%d%H%M`
 4      ARCHIVEDIR="/tmp/myrss-archive"
 5      # SQLITECACHE="$ARCHIVEDIR/tempcache.db"  # unused: overridden by the next line
 6      SQLITECACHE="/root/tempcache.db"
 7
 8      FEEDLIST="/usr/local/etc/feed-list.txt"
 9      HTMLGEN="/usr/local/bin/generate-rss-page.pl"
10      HTMLDIR="/var/www/html/main"
11      NEWSBOAT="/var/lib/snapd/snap/bin/newsboat"
12
13      mkdir -p $ARCHIVEDIR
14      rm -f $ARCHIVEDIR/latest-news
 15      rm -f $ARCHIVEDIR/rss-channels
16
17      grep -v '^#' $FEEDLIST | cut '-d|' -f1 | sort | uniq  > $ARCHIVEDIR/rss-channels
18      mapfile -t CHANNELS < $ARCHIVEDIR/rss-channels
19
20      for C in "${CHANNELS[@]}"
21      do
22          echo "CATEGORY $C"
23          rm -f $ARCHIVEDIR/current-feeds
24          grep $C $FEEDLIST | grep -v '^#' | cut '-d|' -f2  > $ARCHIVEDIR/current-feeds
25          mapfile -t CURFEEDS < $ARCHIVEDIR/current-feeds
26          rm $ARCHIVEDIR/current-feeds
27          for F in "${CURFEEDS[@]}"
28          do
29          # save the current feed with newsboat
30              echo $F > /root/single-feed
31              rm -f $SQLITECACHE
32              $NEWSBOAT -u /root/single-feed -c $SQLITECACHE -x reload
33              RESULT=$?
 34              if [ $RESULT -eq 0 ]
35              then
36                  echo -n  "OK $C / $F  : "
37              else
38                  echo -n "ERROR: $RESULT $C / $F: "
39              fi
40
41              sqlite3 -separator '|' $SQLITECACHE <<!EOF  | sed -e "s/^/$C|/" >> $ARCHIVEDIR/latest-news-unsorted
42      .headers off
43      .mode list
44      .output stdout
45      SELECT title, url FROM rss_item;
46      .quit
47      !EOF
48              rm -f $SQLITECACHE
49              done
50      done
51
52      sort  $ARCHIVEDIR/latest-news-unsorted | grep http  | sort | uniq > $ARCHIVEDIR/latest-news
53      rm -f $ARCHIVEDIR/latest-news-unsorted
54
55      find $ARCHIVEDIR -type f -name "loaded-urls-*" -mtime +60 -exec rm {} \;
56
57      cat $ARCHIVEDIR/loaded-urls-* | sort | uniq > $ARCHIVEDIR/all-old-urls
58
59      cd $HTMLDIR
60
61      mv main-5.html main-6.html
62      mv main-4.html main-5.html
63      ...
64      mv main.html   main-1.html
65
66      $HTMLGEN $ARCHIVEDIR/all-old-urls > main.html
67
68      cut '-d|' -f3 $ARCHIVEDIR/latest-news | sort | uniq > $ARCHIVEDIR/loaded-urls-$MOMENT.txt
69
70      exit

The script uses Newsboat to do all the actual RSS work. The first 15 lines set all the variables that the script needs, create the $ARCHIVEDIR folder if it does not exist, and remove the results of the previous run (lines 14 and 15). The $FEEDLIST file, partially shown in Listing 1, contains one feed per line, with its category and URL separated by the pipe character. The numeric prefixes for each category name (e.g., 020 for "World") allow me to add, remove, or rename categories at will, while sorting them as I want, in the final HTML page shown in Figure 10.

Lines 17 and 18 extract the category names from the $FEEDLIST into the temporary rss-channels file and then save them, using the mapfile Bash builtin, in an array called $CHANNELS. Running the script on the $FEEDLIST shown in Listing 1 would fill that array with the four categories "020-World," "030-TechInnovation," "040-FOSS," and "070-Trekking."
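
If mapfile is new to you, this tiny, self-contained sketch shows the same technique outside the script (the temporary file name is arbitrary):

# write two category names, one per line, then read them into an array
printf '020-World\n030-TechInnovation\n' > /tmp/demo-channels
mapfile -t CHANNELS < /tmp/demo-channels
echo "${#CHANNELS[@]}"   # prints 2
echo "${CHANNELS[0]}"    # prints 020-World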

The outer loop in lines 20 to 50 does two things. First, lines 22 to 26 extract from $FEEDLIST all the URLs of the newsfeeds that belong to the current category $C and save them, using the same mapfile technique as in line 18, in another array called $CURFEEDS. When $C is "020-World," for example, that array would contain the URLs of the first two lines of Listing 1.

The real engine of the script is the inner loop shown in lines 27 to 49, which works around what is, for my purpose, a limitation of Newsboat: As of this writing, there is no way to run Newsboat from a script so that it directly dumps all the titles and URLs it finds into a plain-text file. That is not a showstopper, however, because Newsboat caches all that data in an SQLite3 database. So the trick is to tell Newsboat which file to use as the cache and then query that file with the sqlite3 command-line program.

Consequently, each iteration of that loop saves the current element of $CURFEEDS into the temporary file /root/single-feed (line 30) and then tells Newsboat (line 32) to download the feed whose URL is in that file (the -u option), save all of its content in the cache file $SQLITECACHE (-c), and then quit (-x reload).

Once Newsboat has done its job, lines 41 to 48 launch SQLite3 to grab URLs and titles from the cache and save them into the file $ARCHIVEDIR/latest-news-unsorted. Line 41 is the most cryptic at first sight, but it is not so difficult to explain. The <<!EOF syntax is what Bash manuals call a here document. It means that all the lines that follow, up to the line that consists only of the same delimiter string (!EOF, on line 47), must be passed as one file to the standard input of the program called right before the << operator; in this case, that means lines 42 to 46 go to SQLite3.
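
If you have never used here documents, this harmless, self-contained snippet shows the mechanism: the three lines between the markers become the standard input of wc -l, which therefore prints 3:

wc -l <<!EOF
first line
second line
third line
!EOF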

In other words, the first half of line 41 means: "run the SQLite3 program, open the $SQLITECACHE database and, using the | character as a column separator, execute all the commands listed below, up to the !EOF delimiter."

The statements in the here document just tell SQLite3 to fetch all the titles and corresponding links from the Newsboat database (line 45), print them to the standard output, one record per line and without headers (lines 42 to 44), and then quit.

In the second half of line 41, the sed utility adds to each line printed out by SQLite3 a prefix consisting of the name of the current channel followed by a pipe character. The result is appended to the file $ARCHIVEDIR/latest-news-unsorted.
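
To see the whole mechanism outside the script, you can run something like the following on a cache file previously filled by Newsboat (the file name cache.db and the category prefix are chosen just for this example):

# query the Newsboat cache directly, then prefix every line with a category
sqlite3 -separator '|' cache.db "SELECT title, url FROM rss_item;" | sed -e "s/^/040-FOSS|/"
# each line of the output looks like this:
# 040-FOSS|Some headline|https://www.example.com/some-article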

Please note that the actual instructions in the here document may depend on the exact version of SQLite3 available on your system. As an example, check out the discussion [16] where the Newsboat developers (thanks!) explained to me how to do what I needed.

The rest of myrss.sh is much easier to explain – or modify as you wish. After the main loop has ended, the file $ARCHIVEDIR/latest-news-unsorted will contain lines such as

040-FOSS|Alchemy Is A Free PDF File Converter|https://medevel.com/alchemy-file-converter/

plus some blank lines from SQLite3, and possibly duplicated lines, if two RSS feeds contained the same link with the same headline (yes, it happens regularly). That is the reason why line 52 takes all – and only – the lines of the file that contain a URL (grep http), sorts them, removes duplicates (uniq), and saves what remains into $ARCHIVEDIR/latest-news. Line 55 removes the results of previous runs of myrss.sh that are older than 60 days. The reasons for both this action and for line 57, which compacts all the URLs found in the past 60 days into $ARCHIVEDIR/all-old-urls, will become clear later.
