Big Data, Python, and the future of security
Good vs. Bad
When you start processing security-related data to find patterns, you quickly end up in Big Data territory, and you'll need some powerful tools to help you separate the good from the bad.
Intrusion detection and prevention is a difficult problem, much like email spam. Basically, you want to block all the "bad" traffic without blocking any "good" data. Because you can't accomplish this perfectly, you have to make a choice of how much bad traffic you're willing to allow, and how much good traffic you're willing to block.
Generally, people take one of three positions here. The first is the infamous "we can't block any good traffic, or we'll lose sales." The second is "I don't care about inconveniencing anybody; block by default and make sure anything coming through is good." The third option is more subtle and more difficult to implement: You turn to economics and try to figure out the cost of blocking good traffic (annoying users, support costs) and the cost of not blocking bad traffic (cleaning up after the occasional intrusion), and then you make a decision. In practice, however, the third option is rarely based on actual data and mostly comes down to "how much can we annoy users before they yell at us." Still, it's better than nothing.
Big Data Tools
Processing all this information, of course, leads to Big Data. Personally, I'm not a fan of buzzwords, but enough incremental change usually leads to entirely new things. Today, I was backing up an email account that contains individual messages roughly the size of my first hard drive, and the entire mailbox was larger than the storage of my first seven or eight computers put together. The reality is, if you want to start processing security-related data to find patterns, you're going to end up in Big Data territory quite quickly.
In typical open source fashion, you'll be spoiled for choice of tools for the job. For the purposes of this article, however, I'll mention Hadoop [1], MongoDB [2], and Python. Why Python, you ask? Why not Scala or something else? Python has a long history in scientific computing and, as such, has a number of extremely powerful data processing and machine learning libraries that are well suited to the problem here. As for Hadoop and MongoDB, it's simple: They can store a ton of data, they allow you to scale performance cheaply by adding machines, and talking to them to manipulate your data is easy.
Bayesian Filtering
One of the simplest and most powerful tools for taking a lot of data and figuring out which of it is "good" and which of it is "bad" is Bayesian probability. This concept alone took spam filtering from manually created lists to something that actually worked in an automated fashion. To make a long story short, you examine your data set, looking for features (e.g., the phrase "refinance your mortgage") that occur in spam email. If you're a mortgage broker, however, this phrase also appears in your ham (good) email. The trick is knowing what percentage of spam email and what percentage of ham email contains it. For example, if 1 percent of your spam contains the term but 10 percent of your legitimate email has it, then it's probably a legitimate term for you despite being abused by spammers.
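The mortgage broker example can be worked through with Bayes' rule. The following is a minimal sketch rather than a real spam filter: The function name spam_probability and the assumption that spam and ham are equally common before you look at the message (p_spam=0.5) are mine, not something a particular filter prescribes.

```python
def spam_probability(p_term_given_spam, p_term_given_ham, p_spam=0.5):
    """Bayes' rule: probability that a message is spam, given it contains the term.

    p_spam is the assumed share of spam in all mail before looking at the term.
    """
    p_ham = 1.0 - p_spam
    # Total probability of seeing the term in any message.
    evidence = p_term_given_spam * p_spam + p_term_given_ham * p_ham
    if evidence == 0:
        raise ValueError("term never occurs in either class")
    return p_term_given_spam * p_spam / evidence


# 1 percent of spam and 10 percent of ham contain "refinance your mortgage".
mortgage_broker = spam_probability(0.01, 0.10)
print(f"{mortgage_broker:.3f}")  # 0.091
```

With these numbers, a message containing the phrase has only about a 9 percent chance of being spam, so the phrase counts as evidence for ham. A real filter combines many such terms and estimates the percentages from its own mail.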
With Bayesian filtering, if you can codify the data, you can process it. For example, if you record all your network traffic and server logs and then a server suffers a break-in, you can mark all the data from the time of the break-in – assuming you can determine that – as suspicious and then compare it to all the other known good traffic. With luck, Bayesian filtering will be able to find the malicious data, because it will not have occurred in the known good set of data (Figure 1). If you combine this approach with additional data like IP address, country of origin, and time of day, you should be able to eliminate large amounts of "good" traffic from the suspect data set quickly.
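The same idea can be sketched for log data. The example below is illustrative only and makes several assumptions: The log lines are invented, each whitespace-separated field is treated as a token, and the helper names token_scores and score_line are hypothetical. It counts how often each token appears in the suspect window and in the known good data, applies add-one smoothing so unseen tokens don't produce a division by zero, and ranks tokens by their log-odds.

```python
import math
from collections import Counter


def token_scores(suspect_lines, good_lines, alpha=1.0):
    """Return {token: log-odds}; positive means more typical of the suspect data."""
    suspect = Counter(tok for line in suspect_lines for tok in line.split())
    good = Counter(tok for line in good_lines for tok in line.split())
    vocab = set(suspect) | set(good)
    # Smoothed totals: every token in the vocabulary gets alpha extra counts.
    s_total = sum(suspect.values()) + alpha * len(vocab)
    g_total = sum(good.values()) + alpha * len(vocab)
    return {
        tok: math.log((suspect[tok] + alpha) / s_total)
        - math.log((good[tok] + alpha) / g_total)
        for tok in vocab
    }


def score_line(line, scores):
    """Sum the log-odds of known tokens; unknown tokens contribute nothing."""
    return sum(scores.get(tok, 0.0) for tok in line.split())


# Invented sample data standing in for real server logs.
known_good = ["GET /index.html 200", "GET /about.html 200", "POST /login 302"]
suspect_window = [
    "GET /index.html 200",
    "GET /admin/shell.php 200",
    "POST /admin/shell.php 200",
]

scores = token_scores(suspect_window, known_good)
print(max(scores, key=scores.get))  # /admin/shell.php
```

Tokens that appear in both sets, such as GET, score close to zero and drop out, which is how the ordinary traffic gets eliminated from the suspect set. Extra fields like source IP address, country, or hour of day could be added as further tokens.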
Machine Learning with Python
Mainly what you're doing with these data sets of network traffic, server logs, and so on is classifying and clustering. You want to know "is this data good or bad" and "what things are related to this data." For example, many modern viruses phone home to command and control servers or to servers that host the payload. This means attackers can customize the payload for a virus based on the location of the machine requesting it and keep offering new versions to make detection more difficult.
The trick here is to know what these outgoing requests look like. For example, if you have systems running Linux and Firefox and they start sending out web requests with a user agent of Internet Explorer, then that's probably not legitimate traffic. Another clue might be web requests going out at three in the morning, when no one is in the office. These behaviors seem obvious in hindsight, but there are millions of possibilities, and not only do they vary from site to site, but they also keep changing.
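To make the two clues concrete, here is a small hand-written check. Everything in it is hypothetical: the host inventory in EXPECTED_AGENT, the office hours, and the function name flag_request are assumptions for illustration. Rules like these are exactly what doesn't scale to millions of possibilities, but they show the kind of features (user agent, hour of day) a learned model would be fed.

```python
from datetime import datetime

# Hypothetical inventory: the browser each internal host is expected to use.
EXPECTED_AGENT = {"10.0.0.5": "Firefox", "10.0.0.6": "Firefox"}
# Assumed working day, 08:00 through 18:59.
OFFICE_HOURS = range(8, 19)


def flag_request(src_ip, user_agent, timestamp):
    """Return a list of reasons why an outgoing web request looks unusual."""
    reasons = []
    expected = EXPECTED_AGENT.get(src_ip)
    # Hosts missing from the inventory are not checked for user agent.
    if expected and expected not in user_agent:
        reasons.append("unexpected user agent")
    if timestamp.hour not in OFFICE_HOURS:
        reasons.append("outside office hours")
    return reasons


print(flag_request(
    "10.0.0.5",
    "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)",
    datetime(2024, 1, 15, 3, 0),
))  # ['unexpected user agent', 'outside office hours']
```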
For Python, the two main packages to help deal with the problem are scikit-learn [3] and mlpy [4]. These tools are built on top of NumPy and SciPy, with the performance-critical parts written in C and Fortran, so they're fast and easy to work with.