Editing statistical data with gawk
Stats Don't Lie
With very little overhead, you can access statistics on the spread of COVID-19 using gawk scripts and simple shell commands.
Open a paper, turn on the radio or TV, and you are confronted with statistics concerning the spread of COVID-19. Sometimes, it's hard to wrap your mind around all the data, and even then, you are never sure if the media is taking the data out of context. For instance, a report of 30,000 new infections in a country doesn't provide any information about the number of tests or the positivity rate.
If you want to drill down into the data concerning the spread of COVID-19, you can use gawk
scripts and simple shell commands to process, evaluate, and present these statistics in a more meaningful way.
A Few One-Liners
The statistical data on the spread of COVID-19, usually collected daily and organized in timelines, is freely available on the Internet. Our World in Data (OWID) [1] has been recording the data since the beginning of the crisis. On its website, OWID makes the data available for download in various formats, including CSV format, which is well suited for further processing.
To get started, you should download the CSV file from OWID to see what it contains. Looking at the extensive owid-covid-data.csv
file with a simple less
or cat
command may result in eye strain when trying to read the more detailed data for individual countries. Even if you open the file with an editor or OpenOffice Writer, it is still quite difficult to get a useful overview. A better option is to use shell one-liners to extract the information you need.
To find out the column titles in the CSV file, you can use the one-liner shown in line 1 of Listing 1. The head
command at the beginning of the pipe outputs the first line of the file. Then, sed
converts all commas to line feeds. Next, cat -n
numbers the output, and pr
distributes the whole thing in columns across the screen, resulting in a useful overview (Figure 1).
Listing 1
Useful One-Liners
01 $ head -n 1 owid-covid-data.csv | sed 's/,/\n/g' | cat -n | pr -T --columns=2 --width=90 02 $ cut -f 3 -d "," owid-covid-data.csv | sort | uniq | sed -r '/World|International/d' | cat -n | pr -T --columns=2 --width=120 03 $ grep -i germany owid-covid-data.csv | cut -f 4-8 -d "," | column -t -s ","
Looking at Figure 1, you can assume that the first four columns titles (iso_code
, continent
, location
, date
) will not change over time, but you can't be so confident about the order of the other column titles. New key data is added all the time, as knowledge about the virus evolves, treatments are discovered, and vaccinations are introduced. For more information on column title descriptions (as well as information on the data and data sources), visit OWID's GitHub page [2].
You may also want to know which countries provide data. The one-liner from line 2 of Listing 1 reveals that OWID has data for over 200 countries (Figure 2). Depending on your terminal size, you can adjust the --columns
or --width
parameters.
Finally, you may want to filter out a specific country's data. You can do this with a combination of grep
and cut
, as shown in line 3 of Listing 1 (Germany in this example). However, this output is missing the column titles. A better way to extract this information is to write a script that takes into account the column sequence – the script needs be able to discover the number of columns and which columns contain the data.
Scripted
Listing 2 shows the script to filter out a specific country's data. Like any script, it starts by specifying the interpreter. In lines 2 to 6, you can see basic declarations that are stored in variables for easier access later. In principle, these are fairly simple things like the URL, the file name, two columns titles needed later, and the first key data (the world population).
Listing 2
Data Evaluation (query.sh)
01 #!/usr/bin/bash 02 URL=https://covid.ourworldindata.org/data/owid-covid-data.csv 03 FILE=owid-covid-data.csv 04 col_population=`sed 1q $FILE | sed 's/,/\n/g' | cat -n | grep "population$" | cut -f 1` 05 col_populationdensity=`sed 1q $FILE | sed 's/,/\n/g' | cat -n | grep "population_density$" | cut -f 1` 06 world_population=`grep -i world $FILE | tail -n 1 | cut -f $col_population -d ","` 07 08 download(){ 09 test -e $FILE || wget -O $FILE $URL 10 test `ls --full-time $FILE | gawk '{print $6}'` = `date +%Y-%m-%d` || 11 wget -O $FILE $URL 12 } 13 14 download 15 16 country_selection_menu(){ 17 country=`cut -f 1,3 -d "," $FILE | sed 1d | sort | uniq | sed '/International/d;/World/d' | sed 's/^/(/;s/,/) /'` 18 pr --columns=3 -T <<<$country 19 echo "Enter ISO codes line-by-line and terminate input with Ctrl-D:" 20 readarray -t country_selection 21 } 22 23 col_selection_menu(){ 24 sp=`sed 1q $FILE | sed 's/,/\n/g' | cat -n` 25 pr -T --columns=2 <<<$sp 26 echo "Enter columns line-by-line and terminate input with Ctrl-D:" 27 readarray -t col_selection 28 } 29 30 output(){ 31 searchstring=`tr [:lower:] [:upper:] <<<$1` 32 local country=`grep "("$searchstring")" <<<$country` 33 outputblock=`cat <(sed 1q $FILE) <(grep -E "^"$searchstring $FILE)` 34 population_country=`tail -n 1 <<<$outputblock | cut -f $col_population -d ","` 35 populationdensity_country=`tail -n 1 <<<$outputblock | cut -f $col_populationdensity -d ","` 36 colformat=`for col in 4 ${col_selection[*]};do cut -f $sp -d "," <<<$outputblock | wc -L | sed 's/^/ %/g;s/$/s/g' ;done` 37 echo -e "\n" 38 echo `tr [:lower:] [:upper:] <<<"$country"` 39 gawk -F "," -v pop=$population_country -v world_pop=$world_population -v density=$populationdensity_country <<<$outputblock ' 40 BEGIN{ 41 ub1="Population: "pop", percentage of world population: "pop*100/world_pop" %" 42 ub2="Inhabitants per km<+>2<+>: "density 43 stars = gensub(/./, "*", "g", ub1) 44 print stars 45 print ub1 46 print ub2 47 print stars 48 } 49 FNR == 1{ 50 printf "'"`echo ${colformat[*]}`"'\n", '"`echo 4 ${col_selection[*]} | sed 's/^/$/;s/ /,$/g'`"' 51 } 52 FNR > 1{ 53 printf "'"`echo ${colformat[*]}`"'\n", '"`echo 4 ${col_selection[*]} | sed 's/^/$/;s/ /,$/g'`"' 54 datasetcounter++ 55 } 56 END{ 57 print "\nDatasets in total: "datasetcounter 58 } 59 ' 60 } 61 62 country_selection_menu 63 spalten_selection_menu 64 65 test -e Evaluations && : || mkdir Evaluations 66 67 for land in "${country_selection[@]}" 68 do 69 output "$land" 70 done | tee "Evaluations/""`echo ${country_selection[*]} | sed 's/ /_/g;s/$/_/'`""`date +%a_%d_%b_%Y`"".txt" | less 71 72 read -p "More queries? (Y/N)" continue 73 test $continue = "y" || test $continue = "Y" && $0 || exit 1
The download()
function (lines 8-12) first checks to see whether the file has already been downloaded and whether it corresponds to the current date. If the answer is no to either of these criteria, the download takes place, overwriting a previously downloaded file.
Next, country_selection_menu()
(lines 16-21) displays all countries for selection (Figure 3). You then input line by line the selected ISO codes, which are loaded into an array. Do not worry about uppercase/lowercase when entering the ISO codes. Press Ctrl+D to confirm your selections.
For column titles, col_selection_menu()
(23-28) lets you do the same thing using the column title numbers for input. When inputting your column selections, you do not have to input the date
column: It is always parsed by the script, because listing the other data without this important column does not make much sense.
The output()
function (lines 30-60) takes care of the data output. Starting in line 39, the function contains a gawk
script with Bash commands embedded in it. Use the following format:
'GAWK CODE'"BASH CODE"'GAWK CODE'
This function is passed an ISO code as a parameter, which is converted to uppercase letters in line 31 and ends up in a variable. With this help, grep
then retrieves the right name from the list of countries and stores it in a variable (line 32). Line 33 stores the first line of the file (the column titles) and, using grep
again, the country's data block in a variable so that Bash or gawk
can work with it. Lines 34 and 35 filter out the current population size and density of the country from the output block.
Line 36 creates a format string for a later printf
command for gawk
. The code determines the longest entry in the output block for each column title you select, creating a number string in which each number has a percent sign (%
) at the beginning and an s
appended to the end (e.g., %10s %20s ...
).
The actual output starts on line 37, which is handled by Bash. The gawk
script takes over in line 39, and it expects as parameters the preallocated variables and the output block that was read out.
The BEGIN
block (lines 40-48) works with the variables. In ub1
and ub2
, strings are assembled, or literally lumped together, by the code simply concatenating them. In line 43, a general substitution occurs; this simply converts each character in ub1
to a star (*
) and stores the result in the stars
variable. Lines 44 through 47 then output the variable contents.
Two more blocks take care of outputting the appropriate headers and data columns (lines 49-55). After querying the special variable FNR
, gawk
knows which line is currently up and then executes the appropriate block. The variable datasetcounter
(line 54) records how many records there are in total; the result is then output in the END
block with the appropriate text. You will notice that not all columns for each country are completely filled: Due to the vast amount of data, data glitches are bound to happen, and some countries do not always collect the same metrics as others.
Last but not least, all previously defined functions must also be called. Lines 62 and 63 handle the query of the countries and columns to be displayed. Line 65 checks if the directory ./Evaluations/
already exists and creates it if necessary to be able to store data there later. The calls in lines 67 to 70 then work through all the selected countries and pass the results to the output
function.
At the end of the for
loop, another tee
statement writes the data to the ./evaluations/
directory, with the appropriate date and country specified, and then copies it to standard output, where less
lets you view the data in the terminal (Figure 4). Finally, the routine in lines 72 and 73 prompts you to start more queries. If you want to, the script restarts; otherwise it says goodbye with an exit
command.
The script lets you view any key data that you are interested in, giving you a more detailed view than what you might find in the media. In particular, the data for individual countries can be more accurately compared. Keep in mind, this script can not tell you how a reproduction (R) value, incidence, or other magic number was calculated in the original data from OWID.
Making Your Own Calculations
How much money does the pharmaceutical industry make on COVID-19 tests worldwide? A precise answer could require weeks of investigation, but you can come up with a quick estimate using gawk
and Bash. In the total_tests
column, most countries show the total number of tests. All you have to do is add up the current test numbers for all countries and multiply the result by an estimate of the average test price. You now have a rough idea about the worldwide pharmaceutical industry's gross revenue (or turnover) on tests.
Listing 3 provides a script to calculate the gross revenue for COVID-19 tests. In line 2, the script uses the download
routine from Listing 2. For this reason, both scripts (Listing 2 and Listing 3) must reside in the directory. Line 3 stores the file name in a variable. However, this can also be left out, since the source statement from line 2 also takes this from the first script (Listing 2).
Listing 3
Test Revenues (turnover.sh)
01 #!/usr/bin/bash 02 source <(sed -n '2,/download$/p' query.sh) 03 FILE=owid-covid-data.csv 04 total_test_col=`sed 's/,/\n/g;1q' $FILE | grep -n total_tests$ | cut -f 1 -d ":"` 05 06 readarray -t all_country <<<`cut -f 3 -d "," $FILE | sort | uniq | sed -r '/World|International/d'` 07 08 for land in "${all_country[@]}" 09 do 10 val_country=`grep -F "$land" $FILE | cut -f $total_test_col -d "," | grep -E "[0-9]" | tail -n 1` 11 grep -q "[0-9]" <<<$val_country && echo "$country,$val_country,`(sed -r 's/\.[0-9]*$//' <<<$val_country) | numfmt --grouping`" 12 done | sort -t "," -k 2 -n | gawk -F "," ' 13 BEGIN{ 14 printf "%-25s %20s\n","Country","Tests" 15 price_per_test=180 16 } 17 { 18 printf "%-25s %20s\n",$1,$3 19 tests+=$2 20 } 21 END{ 22 "numfmt --grouping "tests |& getline tests_grouped 23 printf "\n%-20s %25s\n","Tests taken: ",tests_grouped 24 printf "%-20s %25s\n","Price per test: ",price_per_test" euros" 25 "numfmt --grouping "tests*price_per_test |& getline megaturnover 26 printf "%-20s %25s\n","Turnover generated:",megaturnover" euros" 27 } 28 '
Line 4 determines the correct column for the test count, and line 6 bundles all the countries into an array. The for
loop in lines 8 to 12 goes through all countries one by one and creates three columns separated by commas: the country name, the total number of tests, and a total number formatted to the thousandths decimal place. Then a sort
command sorts this data by the second column (total number of tests).
Now a gawk
script takes care of the formatted output. The BEGIN
block creates a suitable heading and sets the current price for tests. The block outputs the name of the country and the formatted numbers, adding up the numbers from the second column at the same time. The END
block then displays the gross revenue for test sales (Figure 5) – a number that the pharmaceutical industry probably would not like to be known.
In the END
block (line 21), you can see how gawk
handles pipes. Pass a shell command to a |& getline
variable
(lines 22 and 25). gawk
then catches the shell command output and stores it in the variable, letting it then work with it downstream. In this way, you can implement even more complicated subroutines, but at the cost of more complex handling. Most importantly, you must close pipes at the right place to produce useful output. In this case, however, it is a simple shell statement, which then also self-terminates.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.
-
Red Hat Enterprise Linux 9.5 Released
Notify your friends, loved ones, and colleagues that the latest version of RHEL is available with plenty of enhancements.
-
Linux Sees Massive Performance Increase from a Single Line of Code
With one line of code, Intel was able to increase the performance of the Linux kernel by 4,000 percent.
-
Fedora KDE Approved as an Official Spin
If you prefer the Plasma desktop environment and the Fedora distribution, you're in luck because there's now an official spin that is listed on the same level as the Fedora Workstation edition.
-
New Steam Client Ups the Ante for Linux
The latest release from Steam has some pretty cool tricks up its sleeve.
-
Gnome OS Transitioning Toward a General-Purpose Distro
If you're looking for the perfectly vanilla take on the Gnome desktop, Gnome OS might be for you.
-
Fedora 41 Released with New Features
If you're a Fedora fan or just looking for a Linux distribution to help you migrate from Windows, Fedora 41 might be just the ticket.
-
AlmaLinux OS Kitten 10 Gives Power Users a Sneak Preview
If you're looking to kick the tires of AlmaLinux's upstream version, the developers have a purrfect solution.
-
Gnome 47.1 Released with a Few Fixes
The latest release of the Gnome desktop is all about fixing a few nagging issues and not about bringing new features into the mix.
-
System76 Unveils an Ampere-Powered Thelio Desktop
If you're looking for a new desktop system for developing autonomous driving and software-defined vehicle solutions. System76 has you covered.