Big Data, Python, and the future of security
Good vs. Bad
When you start processing security-related data to find patterns, you quickly end up in Big Data territory, and you'll need some powerful tools to help you separate the good from the bad.
Intrusion detection and prevention is a difficult problem, much like filtering email spam. Basically, you want to block all the "bad" traffic without blocking any "good" traffic. Because you can't accomplish this perfectly, you have to choose how much bad traffic you're willing to allow and how much good traffic you're willing to block.
Generally, people take one of three positions here. The first is the infamous "we can't block any good traffic; we'll lose sales." The second is "I don't care about inconveniencing anybody; block by default and make sure anything coming through is good." The third option is more subtle and more difficult to implement: You turn to economics, estimate the cost of blocking good traffic (annoyed users, support costs) and the cost of not blocking bad traffic (cleaning up after the occasional intrusion), and make a decision. In practice, however, the third option is rarely based on actual data and mostly comes down to "how much can we annoy users before they yell at us?" Still, it's better than nothing.
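The third position can be put into numbers with very little code. The following is a minimal sketch, not a real cost model: The filter settings, error rates, traffic volumes, and per-incident costs are all made up for illustration, and `expected_cost` is a hypothetical helper.

```python
def expected_cost(false_positive_rate, false_negative_rate,
                  good_requests, bad_requests,
                  cost_per_blocked_good, cost_per_missed_bad):
    """Expected cost of one filter setting over a fixed period."""
    blocked_good = false_positive_rate * good_requests   # good traffic wrongly blocked
    missed_bad = false_negative_rate * bad_requests      # bad traffic wrongly allowed
    return (blocked_good * cost_per_blocked_good
            + missed_bad * cost_per_missed_bad)

# Hypothetical settings: (name, false positive rate, false negative rate)
settings = [
    ("permissive", 0.001, 0.20),
    ("balanced", 0.01, 0.05),
    ("strict", 0.05, 0.01),
]

# Assumed volumes and costs; replace these with your own measurements.
costs = {
    name: expected_cost(fp, fn,
                        good_requests=100_000, bad_requests=1_000,
                        cost_per_blocked_good=2.0, cost_per_missed_bad=50.0)
    for name, fp, fn in settings
}

best = min(costs, key=costs.get)
for name, cost in costs.items():
    print(f"{name:<10} {cost:>10.2f}")
print("cheapest setting:", best)
```

With these invented figures, the balanced setting comes out cheapest at roughly 4,500, whereas both extremes cost more than 10,000. The point is less the result than the fact that the decision becomes something you can recompute once you have real data.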
Big Data Tools
Processing all this information, of course, leads to Big Data. Personally, I'm not a fan of buzzwords, but enough incremental change eventually adds up to something entirely new. Today, I was backing up an email account that contains individual messages about the size of my first hard drive, and the entire mailbox was larger than the combined storage of my first seven or eight computers. The reality is that if you want to start processing security-related data to find patterns, you're going to end up in Big Data territory quite quickly.
In typical open source fashion, you'll be spoiled for choice when it comes to tools for the job. For the purposes of this article, however, I'll mention Hadoop [1], MongoDB [2], and Python. Why Python, you ask? Why not Scala or something else? Python has long been popular in scientific computing and, as such, has a number of extremely powerful data processing and machine learning libraries that are ideally suited to the problem here. As for Hadoop and MongoDB, it's simple: They can store a ton of data, they let you scale performance relatively cheaply, and talking to them to manipulate your data is easy.
Bayesian Filtering
One of the most powerful and simple tools for taking a lot of data and figuring out which of it is "good" and which of it is "bad" is Bayesian probability. This concept alone took spam filtering from manually created lists to something that actually worked in an automated fashion. To make a long story short, you examine your data set, looking for features (e.g., the phrase "refinance your mortgage") that occur in spam email. If you're a mortgage broker, however, this phrase also appears in your ham (good) email. The trick is knowing what percentage of spam email and what percentage of ham email contains it. For example, if 1 percent of your spam contains the term but 10 percent of your legitimate email has it, then it's probably a legitimate term for you despite being abused by spammers.
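The mortgage example can be worked through with Bayes' rule directly. The sketch below uses the 1 percent and 10 percent figures from the text; the prior (what fraction of all mail is spam) is not given there, so the default of 0.5 is my assumption, and `p_spam_given_term` is a hypothetical helper name.

```python
def p_spam_given_term(p_term_spam, p_term_ham, p_spam=0.5):
    """Bayes' rule: probability that a message is spam given it contains the term.

    p_term_spam: fraction of spam messages containing the term
    p_term_ham:  fraction of legitimate messages containing the term
    p_spam:      prior fraction of all mail that is spam (assumed, not from the text)
    """
    p_ham = 1.0 - p_spam
    evidence = p_term_spam * p_spam + p_term_ham * p_ham
    return p_term_spam * p_spam / evidence

posterior = p_spam_given_term(0.01, 0.10)
print(f"{posterior:.3f}")  # prints 0.091
```

Under that assumed prior, a message containing the phrase has only about a 9 percent chance of being spam. Even if you assume 90 percent of all incoming mail is spam (`p_spam=0.9`), the result stays just under 50 percent, so the term on its own is weak evidence for a mortgage broker.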
With Bayesian filtering, if you can codify the data, you can process it. For example, if you record all your network traffic and server logs and then a server suffers a break-in, you can mark all the data from the time of the break-in – assuming you can determine that – as suspicious and then compare it to all the other known good traffic. With luck, Bayesian filtering will be able to find the malicious data, because it will not have occurred in the known good set of data (Figure 1). If you combine this approach with additional data like IP address, country of origin, and time of day, you should be able to eliminate large amounts of "good" traffic from the suspect data set quickly.
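To make the idea of comparing suspect data against known good data a little more tangible, here is a minimal naive Bayes sketch in plain Python. Everything in it is an assumption for illustration: The log lines are invented, tokens are simply whitespace-separated fields, both classes are treated as equally likely beforehand, and add-one (Laplace) smoothing stands in for a proper treatment of unseen tokens. A real deployment would need far more data and better feature extraction.

```python
import math
from collections import Counter

def tokenize(line):
    """Split a log line into lowercase whitespace-separated tokens."""
    return line.lower().split()

def count_tokens(lines):
    """Count how often each token occurs across a set of log lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts

def log_odds_suspicious(line, good_counts, bad_counts, vocab_size):
    """Sum of per-token log likelihood ratios; positive means 'looks suspicious'."""
    good_total = sum(good_counts.values())
    bad_total = sum(bad_counts.values())
    score = 0.0
    for token in tokenize(line):
        # Add-one smoothing so unseen tokens do not produce zero probabilities.
        p_bad = (bad_counts[token] + 1) / (bad_total + vocab_size)
        p_good = (good_counts[token] + 1) / (good_total + vocab_size)
        score += math.log(p_bad / p_good)
    return score

# Invented example data: known good traffic vs. lines from the break-in window.
known_good = [
    "GET /index.html 200",
    "GET /about.html 200",
    "POST /login 302",
    "GET /index.html 200",
]
suspicious = [
    "GET /admin.php?cmd=id 200",
    "POST /upload.php 200",
    "GET /admin.php?cmd=whoami 200",
]

good_counts = count_tokens(known_good)
bad_counts = count_tokens(suspicious)
vocab_size = len(set(good_counts) | set(bad_counts))

normal_score = log_odds_suspicious("GET /index.html 200",
                                   good_counts, bad_counts, vocab_size)
attack_score = log_odds_suspicious("GET /admin.php?cmd=id 200",
                                   good_counts, bad_counts, vocab_size)
print(f"normal request: {normal_score:+.2f}")
print(f"attack request: {attack_score:+.2f}")
```

Tokens that appear in both sets, such as `GET` or `200`, contribute almost nothing to the score, whereas a path seen only during the break-in window pushes it clearly positive. That is the same effect described above: Data that never occurred in the known good set stands out.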
Machine Learning with Python
With these data sets of network traffic, server logs, and so on, what you're mainly doing is classifying and clustering. You want to know "is this data good or bad?" and "what things are related to this data?" For example, many modern viruses phone home to command and control servers or to servers that host the payload. This lets attackers customize the payload for a virus based on the location of the machine requesting it and keep offering new versions to make detection more difficult.
The trick here is to know what these outgoing requests look like. For example, if you have systems running Linux and Firefox and they start sending out web requests with a user agent of Internet Explorer, then that's probably not legitimate traffic. Another clue might be if they start sending out web requests at three in the morning when no one is in the office. These behaviors seem obvious in hindsight, but there are millions of possibilities, and not only do they vary from site to site, but they also keep changing.
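The two clues mentioned here are easy to express as hand-written rules, which also shows why rules alone don't scale. In this sketch, the expected user agent markers, the office hours, and the `anomaly_flags` helper are all assumptions for a hypothetical Linux and Firefox site.

```python
from datetime import datetime

# Assumed baseline for a hypothetical site; every site would differ.
EXPECTED_AGENT_MARKERS = ("Linux", "Firefox")
OFFICE_HOURS = range(8, 19)  # 08:00 through 18:59

def anomaly_flags(user_agent, timestamp):
    """Return a list of reasons why an outgoing request looks unusual."""
    flags = []
    if not all(marker in user_agent for marker in EXPECTED_AGENT_MARKERS):
        flags.append("unexpected user agent")
    if timestamp.hour not in OFFICE_HOURS:
        flags.append("outside office hours")
    return flags

night_ie = anomaly_flags(
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    datetime(2024, 1, 15, 3, 0),
)
daytime_firefox = anomaly_flags(
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
    datetime(2024, 1, 15, 10, 30),
)
print(night_ie)         # prints ['unexpected user agent', 'outside office hours']
print(daytime_firefox)  # prints []
```

Every constant in this snippet would have to be maintained by hand and revisited whenever the environment changes, which is the argument for learning such patterns from the data instead.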
For Python, the two main packages to help deal with the problem are scikit-learn [3] and mlpy [4]. These tools are built on top of NumPy and SciPy, with the performance-critical parts written in C and Fortran, so they're fast and easy to work with.
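As a rough idea of what this looks like in practice, the following sketch trains a naive Bayes classifier with scikit-learn on a handful of request descriptions. The training strings and labels are invented, the choice of `CountVectorizer` plus `MultinomialNB` is mine rather than anything prescribed by the library, and six samples are obviously far too few for real use.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training data: method, path, and user agent hints per request.
requests = [
    "GET /index.html Firefox Linux",
    "GET /news.html Firefox Linux",
    "POST /login Firefox Linux",
    "GET /payload.bin MSIE Windows",
    "POST /gate.php MSIE Windows",
    "GET /update.exe MSIE Windows",
]
labels = ["good", "good", "good", "bad", "bad", "bad"]

# CountVectorizer turns each string into word counts;
# MultinomialNB learns per-class word probabilities from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(requests, labels)

new_requests = [
    "GET /gate.php MSIE Windows",
    "GET /index.html Firefox Linux",
]
predictions = list(model.predict(new_requests))
print(predictions)
```

The same pipeline object accepts any other scikit-learn classifier in place of `MultinomialNB`, so you can experiment with different models without touching the feature extraction step.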