A tour of some important data science techniques

Method in the Madness

Article from Issue 278/2024

Data science is all about gaining insights from mountains of data. We tour some important tools of the trade.

Data is the new oil, and data science is the new refinery. Increasing volumes of data are being collected by websites, retail chains, and heavy industry, and that data is available to data scientists. Their task is to gain new insights from this data while automating processes and helping people make decisions [1]. The details of how they coax real, usable knowledge from these mountains of data vary greatly depending on the business and the nature of the information, but many of the mathematical tools they use are quite independent of the data type. This article introduces you to some of the methods data scientists use to squeeze insights from a sea of numbers.

More than Just Modeling

The term data scientist evokes associations with math nerds, but data science consists of far more than building and optimizing models. First and foremost, it involves understanding a problem and its context.

For example, imagine a bank wants to use an algorithm to predict the probability that a borrower will be able to repay a loan. A data scientist will first want to understand how lending has worked so far, what data has been collected in this field, and whether that data is actually available, bearing in mind data protection requirements. In addition, data scientists need to be able to communicate their findings. Storytelling is more useful than presenting endless rows of numbers, because the audience is likely to be made up of non-mathematicians. The need to explain findings clearly frequently presents a challenge for less extroverted data scientists.

Preparing the Data

What sounds simple in theory often requires time-consuming data cleaning and transformation. Data is not always available in the form you need it. For example, many algorithms only accept numerical input, so numerical features must first be derived from non-numerical data.

To encode such data, the data scientist forms categories and represents them either as numerical values with meaningful distances or as dummy variables, where each value of a characteristic (such as male, female, or nonbinary) becomes a separate variable. As a rule, one of these variables can be omitted: In this data set, someone can only be male if they are neither female nor nonbinary. However, erroneous user input often results in data points that could throw an algorithm off track. These data points need to be identified and cleaned up.
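A minimal sketch of this dummy-variable encoding in Python with the pandas library might look like the following (the sample data is hypothetical):

import pandas as pd

# Hypothetical survey data with a categorical gender column
df = pd.DataFrame({
    "income": [42000, 55000, 38000],
    "gender": ["male", "female", "nonbinary"],
})

# One-hot encode the category; drop_first=True omits one dummy
# variable, because its value is implied by the remaining ones
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True)
print(encoded)

Here pd.get_dummies() creates one column per category value, and drop_first=True implements the rule that one variable can be omitted.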

The data scientist also looks for variables that are genuinely relevant to the model. This is where the information gathered during the understanding phase comes into play. In an exploratory data analysis, often performed in a Jupyter notebook or a similar environment, the data scientist generates and documents findings in order to share them with colleagues (or at least to ensure that the findings are reproducible).
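As a rough sketch, an exploratory analysis of loan data could start like this; the file name and the "defaulted" column are made up for illustration:

import pandas as pd

# Hypothetical loan data; in practice loaded from a database or export
df = pd.read_csv("loans.csv")

# Summary statistics expose implausible values and missing data
print(df.describe(include="all"))
print(df.isna().sum())

# Correlations give a first hint at which variables matter
# for the target (here, a hypothetical binary "defaulted" column)
print(df.corr(numeric_only=True)["defaulted"].sort_values())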

Choosing a Suitable Model

First and foremost, the choice of algorithm depends on the task. If labeled data suitable for training an algorithm is available, data scientists refer to the scenario as supervised learning. For instance, if you have access to historical data on loan defaults, you could use it to train a model that predicts whether future borrowers will repay their loans. The variable used for training is often referred to as the target variable; in this example, it is simply whether or not a loan has been repaid. Other examples would be classification tasks, such as whether a birthmark is indicative of skin cancer or whether a customer is a fraudster.
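To make this concrete, the following sketch trains a simple logistic regression classifier with scikit-learn on such historical data. The file and column names are again hypothetical, the features are assumed to be numeric already (for example, after the dummy encoding shown earlier), and a real project would involve far more careful feature preparation and validation:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical historical loan data with a binary target column
df = pd.read_csv("loans.csv")
X = df.drop(columns=["defaulted"])  # features describing the borrower
y = df["defaulted"]                 # target: was the loan repaid or not?

# Hold out part of the data to check how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predicted default probabilities for unseen borrowers
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))

The model then outputs a probability of default for each new borrower, which is exactly the kind of prediction the bank in the earlier example is after.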
