Editing statistical data with gawk

Stats Don't Lie

© Lead Image © daniilantiq, 123RF.com

© Lead Image © daniilantiq, 123RF.com

Article from Issue 248/2021
Author(s):

With very little overhead, you can access statistics on the spread of COVID-19 using gawk scripts and simple shell commands.

Open a paper, turn on the radio or TV, and you are confronted with statistics concerning the spread of COVID-19. Sometimes, it's hard to wrap your mind around all the data, and even then, you are never sure if the media is taking the data out of context. For instance, a report of 30,000 new infections in a country doesn't provide any information about the number of tests or the positivity rate.

If you want to drill down into the data concerning the spread of COVID-19, you can use gawk scripts and simple shell commands to process, evaluate, and present these statistics in a more meaningful way.

A Few One-Liners

The statistical data on the spread of COVID-19, usually collected daily and organized in timelines, is freely available on the Internet. Our World in Data (OWID) [1] has been recording the data since the beginning of the crisis. On its website, OWID makes the data available for download in various formats, including CSV format, which is well suited for further processing.

To get started, you should download the CSV file from OWID to see what it contains. Looking at the extensive owid-covid-data.csv file with a simple less or cat command may result in eye strain when trying to read the more detailed data for individual countries. Even if you open the file with an editor or OpenOffice Writer, it is still quite difficult to get a useful overview. A better option is to use shell one-liners to extract the information you need.

To find out the column titles in the CSV file, you can use the one-liner shown in line 1 of Listing 1. The head command at the beginning of the pipe outputs the first line of the file. Then, sed converts all commas to line feeds. Next, cat -n numbers the output, and pr distributes the whole thing in columns across the screen, resulting in a useful overview (Figure 1).

Listing 1

Useful One-Liners

01 $ head -n 1 owid-covid-data.csv | sed 's/,/\n/g' | cat -n | pr -T --columns=2 --width=90
02 $ cut -f 3 -d "," owid-covid-data.csv | sort | uniq | sed -r '/World|International/d' | cat -n | pr -T --columns=2 --width=120
03 $ grep -i germany owid-covid-data.csv | cut -f 4-8 -d "," | column -t -s ","
Figure 1: A lean one-liner delivers a quick overview of the column titles in the CSV file.

Looking at Figure 1, you can assume that the first four columns titles (iso_code, continent, location, date) will not change over time, but you can't be so confident about the order of the other column titles. New key data is added all the time, as knowledge about the virus evolves, treatments are discovered, and vaccinations are introduced. For more information on column title descriptions (as well as information on the data and data sources), visit OWID's GitHub page [2].

You may also want to know which countries provide data. The one-liner from line 2 of Listing 1 reveals that OWID has data for over 200 countries (Figure 2). Depending on your terminal size, you can adjust the --columns or --width parameters.

Figure 2: To fit the country output on your screen, adjust the command's parameters.

Finally, you may want to filter out a specific country's data. You can do this with a combination of grep and cut, as shown in line 3 of Listing 1 (Germany in this example). However, this output is missing the column titles. A better way to extract this information is to write a script that takes into account the column sequence – the script needs be able to discover the number of columns and which columns contain the data.

Scripted

Listing 2 shows the script to filter out a specific country's data. Like any script, it starts by specifying the interpreter. In lines 2 to 6, you can see basic declarations that are stored in variables for easier access later. In principle, these are fairly simple things like the URL, the file name, two columns titles needed later, and the first key data (the world population).

Listing 2

Data Evaluation (query.sh)

01 #!/usr/bin/bash
02 URL=https://covid.ourworldindata.org/data/owid-covid-data.csv
03 FILE=owid-covid-data.csv
04 col_population=`sed 1q $FILE | sed 's/,/\n/g' | cat -n | grep "population$" | cut -f 1`
05 col_populationdensity=`sed 1q $FILE | sed 's/,/\n/g' | cat -n | grep "population_density$" | cut -f 1`
06 world_population=`grep -i world $FILE | tail -n 1 | cut -f $col_population -d ","`
07
08 download(){
09   test -e $FILE || wget -O $FILE $URL
10   test `ls --full-time $FILE | gawk '{print $6}'` = `date +%Y-%m-%d` ||
11   wget -O $FILE $URL
12 }
13
14 download
15
16 country_selection_menu(){
17   country=`cut -f 1,3 -d "," $FILE | sed 1d | sort | uniq | sed '/International/d;/World/d' | sed 's/^/(/;s/,/) /'`
18   pr --columns=3 -T <<<$country
19   echo "Enter ISO codes line-by-line and terminate input with Ctrl-D:"
20   readarray -t country_selection
21 }
22
23 col_selection_menu(){
24   sp=`sed 1q $FILE | sed 's/,/\n/g' | cat -n`
25   pr -T --columns=2 <<<$sp
26   echo "Enter columns line-by-line and terminate input with Ctrl-D:"
27   readarray -t col_selection
28 }
29
30 output(){
31   searchstring=`tr [:lower:] [:upper:] <<<$1`
32   local country=`grep "("$searchstring")" <<<$country`
33   outputblock=`cat <(sed 1q $FILE) <(grep -E "^"$searchstring $FILE)`
34   population_country=`tail -n 1 <<<$outputblock | cut -f $col_population -d ","`
35   populationdensity_country=`tail -n 1 <<<$outputblock | cut -f $col_populationdensity -d ","`
36   colformat=`for col in 4 ${col_selection[*]};do cut -f $sp -d "," <<<$outputblock | wc -L | sed 's/^/  %/g;s/$/s/g' ;done`
37   echo -e "\n"
38   echo `tr [:lower:] [:upper:] <<<"$country"`
39   gawk -F "," -v pop=$population_country -v world_pop=$world_population -v density=$populationdensity_country <<<$outputblock '
40     BEGIN{
41       ub1="Population: "pop", percentage of world population: "pop*100/world_pop" %"
42       ub2="Inhabitants per km<+>2<+>: "density
43       stars = gensub(/./, "*", "g", ub1)
44       print stars
45       print ub1
46       print ub2
47       print stars
48     }
49     FNR == 1{
50       printf "'"`echo ${colformat[*]}`"'\n", '"`echo 4 ${col_selection[*]} | sed 's/^/$/;s/ /,$/g'`"'
51     }
52     FNR > 1{
53       printf "'"`echo ${colformat[*]}`"'\n", '"`echo 4 ${col_selection[*]} | sed 's/^/$/;s/ /,$/g'`"'
54       datasetcounter++
55     }
56     END{
57       print "\nDatasets in total: "datasetcounter
58     }
59   '
60 }
61
62 country_selection_menu
63 spalten_selection_menu
64
65 test -e Evaluations && : || mkdir Evaluations
66
67 for land in "${country_selection[@]}"
68 do
69   output "$land"
70 done | tee "Evaluations/""`echo ${country_selection[*]} | sed 's/ /_/g;s/$/_/'`""`date +%a_%d_%b_%Y`"".txt" | less
71
72 read -p "More queries? (Y/N)" continue
73 test $continue = "y" || test $continue = "Y" && $0 || exit 1

The download() function (lines 8-12) first checks to see whether the file has already been downloaded and whether it corresponds to the current date. If the answer is no to either of these criteria, the download takes place, overwriting a previously downloaded file.

Next, country_selection_menu() (lines 16-21) displays all countries for selection (Figure 3). You then input line by line the selected ISO codes, which are loaded into an array. Do not worry about uppercase/lowercase when entering the ISO codes. Press Ctrl+D to confirm your selections.

Figure 3: From the country selection menu, you can choose the ISO codes for the countries you want included in your data.

For column titles, col_selection_menu() (23-28) lets you do the same thing using the column title numbers for input. When inputting your column selections, you do not have to input the date column: It is always parsed by the script, because listing the other data without this important column does not make much sense.

The output() function (lines 30-60) takes care of the data output. Starting in line 39, the function contains a gawk script with Bash commands embedded in it. Use the following format:

'GAWK CODE'"BASH CODE"'GAWK CODE'

This function is passed an ISO code as a parameter, which is converted to uppercase letters in line 31 and ends up in a variable. With this help, grep then retrieves the right name from the list of countries and stores it in a variable (line 32). Line 33 stores the first line of the file (the column titles) and, using grep again, the country's data block in a variable so that Bash or gawk can work with it. Lines 34 and 35 filter out the current population size and density of the country from the output block.

Line 36 creates a format string for a later printf command for gawk. The code determines the longest entry in the output block for each column title you select, creating a number string in which each number has a percent sign (%) at the beginning and an s appended to the end (e.g., %10s %20s ...).

The actual output starts on line 37, which is handled by Bash. The gawk script takes over in line 39, and it expects as parameters the preallocated variables and the output block that was read out.

The BEGIN block (lines 40-48) works with the variables. In ub1 and ub2, strings are assembled, or literally lumped together, by the code simply concatenating them. In line 43, a general substitution occurs; this simply converts each character in ub1 to a star (*) and stores the result in the stars variable. Lines 44 through 47 then output the variable contents.

Two more blocks take care of outputting the appropriate headers and data columns (lines 49-55). After querying the special variable FNR, gawk knows which line is currently up and then executes the appropriate block. The variable datasetcounter (line 54) records how many records there are in total; the result is then output in the END block with the appropriate text. You will notice that not all columns for each country are completely filled: Due to the vast amount of data, data glitches are bound to happen, and some countries do not always collect the same metrics as others.

Last but not least, all previously defined functions must also be called. Lines 62 and 63 handle the query of the countries and columns to be displayed. Line 65 checks if the directory ./Evaluations/ already exists and creates it if necessary to be able to store data there later. The calls in lines 67 to 70 then work through all the selected countries and pass the results to the output function.

At the end of the for loop, another tee statement writes the data to the ./evaluations/ directory, with the appropriate date and country specified, and then copies it to standard output, where less lets you view the data in the terminal (Figure 4). Finally, the routine in lines 72 and 73 prompts you to start more queries. If you want to, the script restarts; otherwise it says goodbye with an exit command.

Figure 4: The script lets you track how the pandemic developed in China or any other country. You choose the data you want to see.

The script lets you view any key data that you are interested in, giving you a more detailed view than what you might find in the media. In particular, the data for individual countries can be more accurately compared. Keep in mind, this script can not tell you how a reproduction (R) value, incidence, or other magic number was calculated in the original data from OWID.

Making Your Own Calculations

How much money does the pharmaceutical industry make on COVID-19 tests worldwide? A precise answer could require weeks of investigation, but you can come up with a quick estimate using gawk and Bash. In the total_tests column, most countries show the total number of tests. All you have to do is add up the current test numbers for all countries and multiply the result by an estimate of the average test price. You now have a rough idea about the worldwide pharmaceutical industry's gross revenue (or turnover) on tests.

Listing 3 provides a script to calculate the gross revenue for COVID-19 tests. In line 2, the script uses the download routine from Listing 2. For this reason, both scripts (Listing 2 and Listing 3) must reside in the directory. Line 3 stores the file name in a variable. However, this can also be left out, since the source statement from line 2 also takes this from the first script (Listing 2).

Listing 3

Test Revenues (turnover.sh)

01 #!/usr/bin/bash
02 source <(sed -n '2,/download$/p' query.sh)
03 FILE=owid-covid-data.csv
04 total_test_col=`sed 's/,/\n/g;1q' $FILE | grep -n total_tests$ | cut -f 1 -d ":"`
05
06 readarray -t all_country <<<`cut -f 3 -d "," $FILE | sort | uniq | sed -r '/World|International/d'`
07
08 for land in "${all_country[@]}"
09 do
10   val_country=`grep -F "$land" $FILE | cut -f $total_test_col -d "," | grep -E "[0-9]" | tail -n 1`
11   grep -q "[0-9]" <<<$val_country && echo "$country,$val_country,`(sed -r 's/\.[0-9]*$//' <<<$val_country) | numfmt --grouping`"
12 done | sort -t "," -k 2 -n | gawk -F "," '
13     BEGIN{
14       printf "%-25s %20s\n","Country","Tests"
15       price_per_test=180
16     }
17     {
18       printf "%-25s %20s\n",$1,$3
19       tests+=$2
20     }
21     END{
22       "numfmt --grouping "tests |& getline tests_grouped
23       printf "\n%-20s %25s\n","Tests taken: ",tests_grouped
24       printf "%-20s %25s\n","Price per test:  ",price_per_test" euros"
25       "numfmt --grouping "tests*price_per_test |& getline megaturnover
26       printf "%-20s %25s\n","Turnover generated:",megaturnover" euros"
27     }
28   '

Line 4 determines the correct column for the test count, and line 6 bundles all the countries into an array. The for loop in lines 8 to 12 goes through all countries one by one and creates three columns separated by commas: the country name, the total number of tests, and a total number formatted to the thousandths decimal place. Then a sort command sorts this data by the second column (total number of tests).

Now a gawk script takes care of the formatted output. The BEGIN block creates a suitable heading and sets the current price for tests. The block outputs the name of the country and the formatted numbers, adding up the numbers from the second column at the same time. The END block then displays the gross revenue for test sales (Figure 5) – a number that the pharmaceutical industry probably would not like to be known.

Figure 5: Based on the the global number of COVID-19 tests and the estimated sales, the pharmaceutical industry is unlikely to go bankrupt in the next few years.

In the END block (line 21), you can see how gawk handles pipes. Pass a shell command to a |& getline variable (lines 22 and 25). gawk then catches the shell command output and stores it in the variable, letting it then work with it downstream. In this way, you can implement even more complicated subroutines, but at the cost of more complex handling. Most importantly, you must close pipes at the right place to produce useful output. In this case, however, it is a simple shell statement, which then also self-terminates.

Buy this article as PDF

Express-Checkout as PDF
Price $2.95
(incl. VAT)

Buy Linux Magazine

SINGLE ISSUES
 
SUBSCRIPTIONS
 
TABLET & SMARTPHONE APPS
Get it on Google Play

US / Canada

Get it on Google Play

UK / Australia

Related content

  • Open Data

    A lot of COVID-19 data is available through online REST APIs. With a little ingenuity and some open source tools, you can extract and analyze the data yourself.

  • Gnuplot

    Use Gnuplot with command-line utilities.

  • Shell Download Manager

    A few lines of shell code and the Gawk scripting language make downloading files off the web a breeze.

  • Xrmap

    If you want to become an expert in world geography, why not get a little help from the CIA?

  • Mining File Data

    Efficient tools to examine, edit, and transfer file information.

comments powered by Disqus
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters

Support Our Work

Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.

Learn More

News