Rolling your own RSS aggregator
Creating the News Page
The list of news items in Figure 10 is the content of one file called main.html. Lines 61 to 64, which save the last six versions of that file under different names, are useful for debugging but optional. The actual creation of main.html happens in line 66, when $ARCHIVEDIR/all-old-urls is passed to the other script, generate-rss-page.pl, which is shown in Listing 3. Line 68 extracts the third column of the latest-news file (i.e., only its URLs) and saves it in a file marked with the current timestamp ($ARCHIVEDIR/loaded-urls-$MOMENT.txt) that will be used in future runs of myrss.sh.
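The archiving step of line 68 is small enough to sketch in a few lines. The snippet below is a Python illustration, not the shell code of Listing 2; the function names and the timestamp format are assumptions made for the example.

```python
from datetime import datetime
from pathlib import Path


def extract_urls(latest_news_text):
    """Return the third pipe-separated column (the URL) of every record."""
    urls = []
    for line in latest_news_text.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 3:  # skip blank or malformed lines
            urls.append(fields[2])
    return urls


def save_loaded_urls(latest_news_text, archive_dir, moment=None):
    """Write the URLs to loaded-urls-<moment>.txt and return that path."""
    # Assumption: the article does not show how $MOMENT is formatted.
    moment = moment or datetime.now().strftime("%Y%m%d%H%M%S")
    target = Path(archive_dir) / f"loaded-urls-{moment}.txt"
    target.write_text("".join(url + "\n" for url in extract_urls(latest_news_text)))
    return target
```

Each run leaves behind one small text file of URLs, which is all that later runs need in order to recognize pages that were already listed.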
Listing 3: generate-rss-page.pl
The best way to explain Listing 3 is to start from the end. I said that last year I rewrote the whole RSS aggregator from scratch, but that is not entirely true. The part of Listing 3 that actually generates the bare-bones HTML page of Figure 10, line 51 onward, is the same ugly code I originally wrote more than a decade ago. Of course, the reason it's still there unchanged is that it's good enough for my needs. In this tutorial, however, that code may also be a useful reminder that you don't need huge, bloated frameworks to create stuff that just works.
The Perl script works with two files. First, in lines 12 to 19, the script copies all the URLs that myrss.sh found in the previous 60 days into the hash called %OLD_URLS. (Compare with line 57 of Listing 2, where myrss.sh saves those URLs into the file $ARCHIVEDIR/all-old-urls, before passing just that file as the first argument to generate-rss-page.pl.)
This step is necessary because I do not want to see URLs that I have already seen, even if they come back with a different headline (which happens frequently) and even if I first saw them several weeks earlier. The 60-day window exists because many interesting websites are not updated often, so their RSS feeds might not change for weeks, no matter how often you download them. For more on the code in Listing 3, see the box entitled "Deep Dive."
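One way to picture the 60-day window, written here as a Python sketch rather than the actual code of Listing 2: Collect the URLs from every loaded-urls-*.txt file written in the last 60 days into a set, which plays the role of %OLD_URLS. Selecting files by modification time and the function name are assumptions made for illustration; the article does not show how myrss.sh picks the files.

```python
import time
from pathlib import Path

WINDOW_DAYS = 60  # the window described in the article


def collect_old_urls(archive_dir, now=None, window_days=WINDOW_DAYS):
    """Return the set of URLs saved by runs within the last window_days."""
    now = time.time() if now is None else now
    cutoff = now - window_days * 24 * 60 * 60
    old_urls = set()
    for path in Path(archive_dir).glob("loaded-urls-*.txt"):
        # Assumption: the file modification time tells when the run happened.
        if path.stat().st_mtime < cutoff:
            continue
        for line in path.read_text().splitlines():
            url = line.strip()
            if url:
                old_urls.add(url)
    return old_urls
```

A set (like a Perl hash) makes the later "have I seen this URL?" check a constant-time lookup, no matter how many URLs the window contains.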
Deep Dive
In Listing 3, lines 25 to 49 are quick to summarize: They open $RSS_ARCHIVE/latest-news, which again contains lines such as
040-FOSS|Alchemy Is A Free PDF File Converter|https://medevel.com/alchemy-file-converter/
Then lines 28 and 29 remove newlines and split each record using the pipe character as a separator, saving each column in a different variable. If the URL of the current line already exists in the %OLD_URLS hash, the page was already seen sometime in the previous two months, so the script jumps to the next news item (line 30). If the URL was never seen before, lines 31 to 35 clean up the title, URL, and channel name, removing non-ASCII characters, user-tracking strings (line 34), and the numeric prefixes from channel names, which aren't needed anymore (line 35).
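The split, skip, and clean-up steps can be sketched as follows. This is a Python illustration, not the Perl of Listing 3. Two details are assumptions, because the article does not show the corresponding lines: "user-tracking strings" are taken to be utm_* query parameters, and the numeric prefix is taken to be digits followed by a hyphen, as in 040-FOSS.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def strip_non_ascii(text):
    """Drop every character outside the ASCII range."""
    return text.encode("ascii", "ignore").decode("ascii")


def strip_tracking(url):
    """Remove utm_* query parameters (an assumed notion of tracking strings)."""
    parts = urlsplit(url)
    kept = [(key, value)
            for key, value in parse_qsl(parts.query, keep_blank_values=True)
            if not key.lower().startswith("utm_")]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_record(line, old_urls):
    """Return (channel, title, url) for a new item, or None to skip the line."""
    line = line.rstrip("\n")
    if line.count("|") < 2:
        return None  # not a channel|title|url record
    channel, title, url = line.split("|", 2)
    if url in old_urls:
        return None  # already seen in the previous two months
    channel = re.sub(r"^\d+-", "", channel)  # "040-FOSS" becomes "FOSS"
    return (strip_non_ascii(channel),
            strip_non_ascii(title).strip(),
            strip_tracking(strip_non_ascii(url)))
```

Note that the lookup in old_urls uses the URL exactly as stored in latest-news, before any clean-up, so that it matches what earlier runs saved.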
If the channel name has changed with respect to the previous record in $RSS_ARCHIVE/latest-news (line 37), the script saves it in $CURRENT_CHANNEL, prints it out as a level-4 HTML header (line 39), and resets the $BLOCK_COUNTER variable. This is what partitions the output in Figure 10 into different sections titled "World," "TechInnovation," and "FOSS." The <ul> and </ul> strings in line 39, and later in line 46, are just the HTML markup for the beginning and end of an unordered, bulleted list.
The individual headlines are formatted as list elements and appended, just like headers and list delimiters, to the $LATEST_NEWS variable (line 43). Once all of latest-news has been scanned, line 51 prints the HTML page all at once to standard output, with the list of headlines in $LATEST_NEWS in the right place.
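The channel-by-channel grouping amounts to watching for a change in the channel name while walking through the records. A minimal Python sketch of that idea follows; it is not the Perl of Listing 3, and the exact markup of each list item (a plain link) is an assumption.

```python
from html import escape


def render_news(records):
    """Build an HTML fragment: one <h4> per channel, one <li> per headline.

    records is an iterable of (channel, title, url) tuples, ordered so that
    items of the same channel are adjacent.
    """
    fragment = []
    current_channel = None
    for channel, title, url in records:
        if channel != current_channel:
            if current_channel is not None:
                fragment.append("</ul>")  # close the previous channel's list
            current_channel = channel
            fragment.append(f"<h4>{escape(channel)}</h4>")
            fragment.append("<ul>")
        fragment.append(f'<li><a href="{escape(url)}">{escape(title)}</a></li>')
    if current_channel is not None:
        fragment.append("</ul>")
    return "\n".join(fragment)
```

Accumulating the fragment first and printing it once at the end mirrors the role of $LATEST_NEWS: The page template only needs one placeholder for the whole list.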
The purpose of lines 44 to 47 is cosmetic but important, at least for me: Every seven items (line 45), the script closes the current list, inserts a blank line, and starts a new one (line 46). This simple trick greatly improves the overall readability of Figure 10 when it contains lots of news because visually separated small lists are easier to scan than just one big wall of text.
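The seven-item trick boils down to a counter that closes one list and opens another whenever it fills up. Here is a hedged Python sketch of that counter, not the original Perl; the function name is hypothetical, and using an empty line as the separator between lists is an assumption about what the script inserts.

```python
BLOCK_SIZE = 7  # items per list, as described in the article


def list_blocks(items, block_size=BLOCK_SIZE):
    """Wrap items in <ul> lists that hold at most block_size entries each."""
    lines = ["<ul>"]
    block_counter = 0
    for item in items:
        if block_counter == block_size:
            # Close the full list, leave a blank line, and start a new one.
            lines += ["</ul>", "", "<ul>"]
            block_counter = 0
        lines.append(f"<li>{item}</li>")
        block_counter += 1
    lines.append("</ul>")
    return "\n".join(lines)
```

With nine headlines, for example, this produces one list of seven items followed by a second list with the remaining two.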
Speaking of readability, and to stress the time savings that RSS makes possible even with just a few lines of code: Note that sorting the headlines alphabetically (line 52 of Listing 2) makes them much faster to scan. During Oscars or Super Bowl week, for example, there will surely be many headlines starting with those words all around the web. But if those headlines are grouped together, regardless of which website they came from, both spotting and ignoring all of them will be much quicker.
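To see why sorting helps, consider this small Python sketch (an illustration only, not line 52 of Listing 2). It sorts channel|title|url records by channel and then by title. Ignoring case in the comparison is an assumption; the article does not say how the original sort treats capitalization.

```python
def sort_records(lines):
    """Sort 'channel|title|url' lines by channel, then by title, ignoring case."""
    def key(line):
        channel, title, _url = line.split("|", 2)
        return (channel, title.casefold())
    return sorted(lines, key=key)
```

After sorting, every headline that starts with "Super Bowl" sits next to the others within its channel, whichever website published it, so a whole cluster can be read or skipped at a glance.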
Conclusion
Using RSS will improve your web experience, and you deserve it. If your favorite website does not show its RSS feed, demand that it does. This article introduced some of the leading open source RSS tools and described how to create a custom RSS tool that delivers the news just the way you like it.
Infos
- RSS 2.0 Specification: https://www.rssboard.org/rss-specification
- "The 'Snob RSS' Hall of (Constructive!) Shame": https://stop.zona-m.net/2021/02/the-snob-rss-hall-of-constructive-shame/
- WikiRSS: https://researchbuzz.me/2023/03/20/turn-wikipedia-into-an-rss-search-engine-with-wikirss/
- OPML: http://dev.opml.org/spec2.html
- RSS advocacy posts: https://stop.zona-m.net/tag/rss
- Akregator: https://userbase.kde.org/Akregator
- Liferea: https://lzone.de/liferea/
- TT-RSS: https://tt-rss.org/
- RSSHub: https://docs.rsshub.app/en/
- RSSHub self-hosting guide: https://docs.rsshub.app/en/install/
- FreeRSS: https://github.com/robdelacruz/freerss
- Rsstail: https://python-rsstail.readthedocs.io/en/latest/
- Newsboat: https://newsboat.org/
- MastoFeed: https://mastofeed.org/
- "Tutorial – Desktop News Feeds" by Marco Fioretti, Linux Magazine, issue 217, December 2018: https://www.linux-magazine.com/Issues/2018/217/Read-Me
- Newsboat usage in scripts: https://github.com/newsboat/newsboat/issues/2320