Find spammers with the help of a database and Google Charts
Sniffing out Spammers
To identify the geographic regions from which link spam originated, a database locates IP addresses and the Google Charts service puts them onto a world map.
Sometimes I imagine how satisfying it would be to track down a spammer or telemarketer's office, like the man in the Snickers commercial [1] who arrives at the office and gets his revenge. Unfortunately, legal and logistical reasons often prevent this. Additionally, it is often the case that the perpetrators are botnets rather than the spammers themselves. Still, it would be interesting to create a graph that pinpoints the geographic regions in which most spam activities originate.
The Internet is the ideal platform for anonymous trickery, but the perpetrator's deeds actually leave a trail – each incoming request on a website includes the sender's IP address (see Figure 1).
Of course, the address could be spoofed, but this is not so simple and just too much trouble for most link spammers.
The DNS system, which resolves hostnames into IP addresses, can often do the same thing in reverse gear. A DNS reverse lookup expects an IP address, and if the spammer's service provider has set up everything as it should be, the script in Listing 1, revlookup, will return a hostname from which you usually can identify the provider. Figure 2 shows that the address I caught spamming, IP 69.162.110.146, belongs to an ISP called lstn.net; a friendly email to the ISP's webmaster, stating the IP and the time (which is important because these IPs are often assigned dynamically) might just be the ticket to stop the spammer's illegal activity for good.
Listing 1
revlookup
The inet_aton() function in Perl's Socket module accepts an IP address in string notation ("x.x.x.x") and returns a data structure for a subsequent call to Perl's gethostbyaddr() function. When called with the AF_INET parameter, as shown in line 17 in revlookup, the function performs the DNS reverse lookup in the IPv4 address space and, if successful, returns a string with the hostname or returns undef if an error occurs. Depending on how busy the DNS server you call on is at the time and how many of its peers it needs to consult to answer your request, this process can take a couple of seconds.
As another option, the whois command-line utility doesn't just work with domains, it also accepts IP addresses as arguments. Figure 3 shows that the provider, Limestone Networks, has registered everything correctly and even provides an email address that spammed webmasters can contact with their complaints. The lookup can be automated in Perl with the CPAN Net::Whois::Raw module, for example; however, it puts significant load on the servers hosted by Network Solutions, who will block access if you perform 100 lookups in quick succession. In other words, searching a complete access log with this module is impossible, even if you cache queries you have already made.
Many spammers use IP addresses without a reverse lookup entry on the DNS system. But even then you can still locate the culprit; IP addresses are assigned to service providers in blocks, and you can download databases with the information necessary to discover the approximate geographic position of any given IP address. MaxMind offers a database file [2] that is free for non-commercial use. The licensing conditions are available in the same directory as the database itself. The CPAN IP::Country::MaxMind module provides an API to match, thus avoiding the need to mess around with data blobs. The IP mappings stored in the database change very slowly; updating once every couple of months should be fine.
After installing the module, you will need another CPAN module, Geo::IP::PurePerl. The MaxMind module's open() constructor loads the local database that you specify, and the inet_atocc() function returns a country code for any IP address (for example, DE for Germany).
The Google Charts API [3] gives you a useful option for plotting these codes on a world map. If you pass in pairs of values to the Google server, it will respond with a PNG-formatted image file. The data format for the pairs of values is slightly unusual in that you need to squash largish volumes of data into the very restricted space offered by a URL and its query parameters.
Simplicity Itself
The API's aptly named Simple Encoding data format will only allow values between 0 and 61, encoded as A-Z (0-25), a-z (26-51), and 0-9 (52-61).
If you assign a value of 23 to Germany, 3 to the USA, and 60 to Japan, you can encode the country codes in the chld URL parameter as "DEUSJP" (DE, US, and JP, concatenated without blanks), and the values as "s:XD8" (s = simple encoding, X = 23, D = 3, and 8 = 60) in chd.
The script in Listing 2, spam2geo, implements the steps I have identified thus far; it analyzes the access.log file from an Apache server under heavy fire from link spammers. The CPAN ApacheLog::Parser module provides a parse_line_to_hash function, which understands the access.log format and returns the individual fields of each log entry as a hash. The client entry includes the spammer's IP address in each case, and a call to the inet_atocc method in line 32 returns the two-letter country code, assuming the database knows it.
Listing 2
spam2geo
If successful, line 36 increments the hash entry for the country, and the program moves on to the next line in the logfile. Because you are not interested in all the URLs – just the ones generated by spammers – line 28 filters out all entries whose path (file hash key) does not match the regular expression posting. The regex should only match URLs used by spammers to post on the forums you are monitoring, so you must modify it to match your local conditions.
Normalization and conversion of the data to the Google format starts in line 42. Because the numeric values for each country in the %by_country hash are not necessarily in the range 0--61 but can assume arbitrary values, spam2geo must determine the limits of the range by use of min and max from List::Util. After doing so, it subtracts $min and divides by $max to squash the numeric values to be represented into the range between 0 and 1 and multiplies the latter value by the number of encoding characters minus 1. Thus, $norm contains a floating point number, which can be converted to an integer and used as an index in the @SYMBOLS array, thus mapping the whole range of values to an element in the array.
Lines 68 and 70 then concatenate the calculated symbols to give strings without separating blanks, for passing in with the chld (country codes) and chd (values) URL parameters. From the programmer's point of view, the order in which the keys and values functions return results is arbitrary, but consistent within the Perl script, and irrelevant to the Google service.
Communications with the Google server are handled by LWP::UserAgent via the http protocol. The URL parameters are set by the query_form() method, which also performs any URL encoding required. The cht parameter specifies the charts type used by the Google Charts service and is set to "t" (topological) for a world map. You can optionally restrict the view to individual continents; however, you need to set the chtm parameter to "world" for a world map.
The chs parameter sets the dimensions of the resulting image to 440x220 pixels. Google Charts uses the colors white, yellow, and red specified as hex RGB values in chco to shade the countries, thus reflecting minimum, medium, and maximum values. So, the settings in Listing 2 leave countries with normalized spam counts around 0 white, values of around 20 yellow, and values of 60 or more red. The "bg,s,EAF7FE" string for the chf parameter stands for background, solid, and the hex value for light blue to color the world's oceans.
All told, the URL will look something like this: http://chart.apis.google.com/chart?cht=t&chs=440x220&chtm=world&chd=
s%3ABFAABAHGQAAA8BAAAAAAAaBAA&chco=ffffff%2Cf4ed28%2Cf11414&chld=GBNLHKEELVKRRUSAPAMDCA
SECNDEPKITPLINMEBRCZUSUAESFR&chf=bg%2Cs%2CEAF7FE
Google takes just a couple of seconds to render and deliver this as the graph shown in Figure 4. If you comment out lines 27-29 in spam2geo, the graph will give you a geographic distribution of all incoming URLs instead (Figure 5).
Although most spam requests originate in China and the US, most of the website's bona fide customers come from Germany. The eog file.png command displays the file produced by Google and retrieved via a web request in the Eye of Gnome utility.
Installation
After downloading the MaxMind GeoIP.dat.gz database [2], unpack the GeoIP.dat file and place the spam2geo script into your current working directory. The CPAN IP::Country::MaxMind, Geo::IP::PurePerl, List::Util, and ApacheLog::Parser modules and all their dependencies are best installed from a CPAN shell. To use the Google API, you do not need to register. You just need to modify line 28 in spam2geo to match your local conditions by changing the /posting/ pattern to match URLs used only by spammers to clutter your discussion groups with parasitic entries.
For more detailed analysis including, for example, the number of forum requests compared with other activities or the preferred browser type used by the spammers (at least what they say they're using), check out the enormous choice provided by the Google Charts API [3], which gives you an easy approach to render any statistical information elegantly in polished chart form.
Infos
- Snickers Cruncher – Telemarketer: http://www.youtube.com/watch?v=R6QATC2C0h8
- Free MaxMind GeoIP database download: http://www.maxmind.com/download/geoip/database/
- "Maps" charts by the Google Charts web service: http://code.google.com/apis/chart/types.html#maps
- Listings for this article: http://www.linux-magazine.com/resources/article_code
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.
-
Fedora 41 Released with New Features
If you're a Fedora fan or just looking for a Linux distribution to help you migrate from Windows, Fedora 41 might be just the ticket.
-
AlmaLinux OS Kitten 10 Gives Power Users a Sneak Preview
If you're looking to kick the tires of AlmaLinux's upstream version, the developers have a purrfect solution.
-
Gnome 47.1 Released with a Few Fixes
The latest release of the Gnome desktop is all about fixing a few nagging issues and not about bringing new features into the mix.
-
System76 Unveils an Ampere-Powered Thelio Desktop
If you're looking for a new desktop system for developing autonomous driving and software-defined vehicle solutions. System76 has you covered.
-
VirtualBox 7.1.4 Includes Initial Support for Linux kernel 6.12
The latest version of VirtualBox has arrived and it not only adds initial support for kernel 6.12 but another feature that will make using the virtual machine tool much easier.