Rolling your own RSS aggregator

Creating the News Page

The list of news items in Figure 10 is the content of a single file called main.html. Lines 61 to 64, which save the last six versions of that file under different names, are useful for debugging but optional.
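Just to illustrate the idea, such a rotation might look like the following snippet. This is not the actual code of Listing 2, and the backup file names are my own invention:

for i in 5 4 3 2 1; do   # keep the six most recent copies of main.html
  test -f "$ARCHIVEDIR/main.html.$i" && \
    mv "$ARCHIVEDIR/main.html.$i" "$ARCHIVEDIR/main.html.$((i+1))"
done
test -f "$ARCHIVEDIR/main.html" && cp "$ARCHIVEDIR/main.html" "$ARCHIVEDIR/main.html.1"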

The actual creation of main.html happens in line 66, when $ARCHIVEDIR/all-old-urls is passed to the other script, generate-rss-page.pl, shown in Listing 3. Line 68 extracts the third column of the latest-news file (i.e., only its URLs) and saves it in a file marked with the current timestamp ($ARCHIVEDIR/loaded-urls-$MOMENT.txt) that will be used in future runs of myrss.sh.
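In shell terms, those two steps might look roughly like this (the paths and the format of $MOMENT are assumptions on my part):

MOMENT=$(date +%Y%m%d%H%M%S)
generate-rss-page.pl "$ARCHIVEDIR/all-old-urls" > "$ARCHIVEDIR/main.html"           # line 66
cut -d '|' -f 3 "$ARCHIVEDIR/latest-news" > "$ARCHIVEDIR/loaded-urls-$MOMENT.txt"   # line 68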

Listing 3

generate-rss-page.pl

The best way to explain Listing 3 is to start from the end. I said that last year I rewrote the whole RSS aggregator from scratch, but that is not entirely true. The part of Listing 3 that actually generates the bare-bones HTML page of Figure 10, line 51 onward, is the same ugly code I originally wrote more than a decade ago. Of course, the reason it's still there unchanged is that it's good enough for my needs. In this tutorial, however, that code may also be a useful reminder that you don't need huge, bloated frameworks to create stuff that just works.

The Perl script works with two files. First, in lines 12 to 19, it copies into the hash %OLD_URLS all the URLs that myrss.sh found in the previous 60 days (compare with line 57 of Listing 2) and saved into the file $ARCHIVEDIR/all-old-urls before passing that file as the first argument to generate-rss-page.pl.
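In sketch form, loading that file into the hash takes just a few lines of Perl. This is a plausible reconstruction, not the literal code of Listing 3:

#!/usr/bin/perl
use strict;
use warnings;

my %OLD_URLS;
open(my $old, '<', $ARGV[0]) or die "cannot open $ARGV[0]: $!";   # all-old-urls
while (my $url = <$old>) {
    chomp $url;
    $OLD_URLS{$url} = 1;   # the value is irrelevant; only the key is tested later
}
close $old;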

This step is necessary because I do not want to see URLs that I have already seen, even if they reappear under a different headline (this too happens frequently) and even if that first sighting was several weeks earlier. The 60-day window exists precisely because many interesting websites are not updated often: Their RSS feeds might not change for weeks, no matter how often you download them. For more on the code in Listing 3, see the box entitled "Deep Dive."

Deep Dive

In Listing 3, lines 25 to 49 are quick to summarize: They open $RSS_ARCHIVE/latest-news, which again contains lines such as

040-FOSS|Alchemy Is A Free PDF File Converter|https://medevel.com/alchemy-file-converter/

Then lines 28 and 29 remove newlines and split each record at the pipe character, saving each column in a different variable. If the URL of the current line already exists in the %OLD_URLS hash, the page was already seen sometime in the previous two months, so the script skips to the next news item (line 30). If the URL was never seen before, lines 31 to 35 clean up the title, URL, and channel name, removing non-ASCII characters, user-tracking strings (line 34), and the numeric prefixes from channel names, which are no longer needed at this point (line 35).
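Continuing from the previous snippet, here is how I would reconstruct that part of the loop; the regular expressions are plausible stand-ins for whatever Listing 3 really uses:

my $RSS_ARCHIVE = "$ENV{HOME}/rss";   # assumed location
open(my $news, '<', "$RSS_ARCHIVE/latest-news") or die "cannot open latest-news: $!";
while (my $line = <$news>) {
    chomp $line;                                       # line 28: remove the newline
    my ($channel, $title, $url) = split /\|/, $line;   # line 29: one variable per column
    next if exists $OLD_URLS{$url};                    # line 30: seen in the past 60 days
    $title   =~ s/[^[:ascii:]]//g;                     # lines 31 to 33: drop non-ASCII
    $url     =~ s/[?&]utm_[^&]*//g;                    # line 34: drop user-tracking strings
    $channel =~ s/^\d+-//;                             # line 35: drop the numeric prefix
    # formatting of headers and headlines follows here
}
close $news;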

If the channel name has changed with respect to the previous record in $RSS_ARCHIVE/latest-news (line 37), the script saves it in $CURRENT_CHANNEL, prints it out as a level-4 HTML header (line 39), and resets the $BLOCK_COUNTER variable. This is what partitions the output in Figure 10 into the different sections titled "World," "TechInnovation," and "FOSS." The <ul> and </ul> strings in line 39, and later in line 46, are just the HTML markup for the beginning and end of an unordered, bulleted list.
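In Perl, that logic could look like the following fragment of the main loop. The variable names come from the article; everything else is my reconstruction:

# declared before the loop:
# my ($CURRENT_CHANNEL, $LATEST_NEWS, $BLOCK_COUNTER) = ('', '', 0);
if ($channel ne $CURRENT_CHANNEL) {                  # line 37: a new section starts
    $LATEST_NEWS .= "</ul>\n" if $CURRENT_CHANNEL;   # close the previous list, if any
    $LATEST_NEWS .= "<h4>$channel</h4>\n<ul>\n";     # line 39: header, then a new list
    $CURRENT_CHANNEL = $channel;
    $BLOCK_COUNTER   = 0;                            # restart the per-section count
}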

The individual headlines are formatted as list elements and appended, just like the headers and list delimiters, to the $LATEST_NEWS variable (line 43). Once all of latest-news has been scanned, line 51 prints the whole HTML page to standard output in one shot, with the list of headlines in $LATEST_NEWS in the right place.
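Again in hedged sketch form:

$LATEST_NEWS .= "<li><a href=\"$url\">$title</a></li>\n";   # line 43: one headline
# ...and once the whole file has been processed, line 51 onward boils down to:
print "<html>\n<body>\n$LATEST_NEWS</ul>\n</body>\n</html>\n";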

The purpose of lines 44 to 47 is cosmetic but important, at least for me: Every seven items (line 45), the script closes the current list, inserts a blank line, and starts a new one (line 46). This simple trick greatly improves the overall readability of Figure 10 when it contains lots of news because visually separated small lists are easier to scan than just one big wall of text.
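The counting itself needs only a few lines; the markup I use here for the blank gap is a guess:

$BLOCK_COUNTER++;                                # one more item in the current list
if ($BLOCK_COUNTER == 7) {                       # line 45: seventh item reached
    $LATEST_NEWS .= "</ul>\n<p></p>\n<ul>\n";    # line 46: close, insert gap, reopen
    $BLOCK_COUNTER = 0;
}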

Speaking of readability, and to stress the time savings that RSS makes possible even with just a few lines of code: Note that sorting the headlines alphabetically (line 52 of Listing 2) makes them much faster to scan. During Oscars or Super Bowl week, for example, many headlines all around the web will surely start with those words. If those headlines are grouped together, regardless of which websites they came from, both spotting and ignoring all of them becomes much quicker.
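Because every record of latest-news begins with the numerically prefixed channel name, a single plain invocation of sort is enough to keep the channels in the desired order and to alphabetize the headlines within each of them; line 52 of Listing 2 is presumably little more than this (the input file name is an assumption):

sort "$ARCHIVEDIR/raw-news" > "$ARCHIVEDIR/latest-news"   # line 52, more or less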

Conclusion

Using RSS will improve your web experience, and you deserve it. If your favorite website does not advertise an RSS feed, demand that it provide one. This article introduces some of the leading open source RSS tools and describes how to create a custom RSS tool that delivers the news just the way you like it.

The Author

Marco Fioretti (http://mfioretti.substack.com) is a freelance author, trainer, and researcher based in Rome, Italy, who has been working with free/open source software since 1995, and on open digital standards since 2005. Marco also is a board member of the Free Knowledge Institute (http://freeknowledge.eu).
