How compilers work
Meticulous Transformer
Compilers translate source code into executable programs and libraries. Inside modern compiler suites, a multistage process analyzes the source code, points out errors, generates intermediate code and tables, rearranges a large amount of data, and adapts the code to the target processor.
Below the surface, a black box compiler handles complex processes that require good knowledge of machine theory and formal languages. Given the importance of compilers, it is not surprising that compiler construction is standard curriculum for computer science students. If you have never been to a college-level lecture on compiler theory – or if you went to the lecture but need a refresher course – this article summarizes the basics. (See the box titled "Teaching Material for Compiler Construction.")
Teaching Material for Compiler Construction
When students of computer science need to learn what holds the programming language world together, the best bet is to attend a lecture on compiler construction. Mathematics professor Peter Köhler held a lecture on compilers in the 2016 summer term at the University of Giessen [1], Germany. With his permission, Linux Magazine worked through his notes and summarized the most important points for this article.
In simple terms, a compiler goes through three steps: It parses the source code, analyzes it, and synthesizes the finished program (Figure 1).
In the first step, the compiler parses the source code character by character and tries to identify keywords, variable names, numbers, strings, and all other components – including comments. The scanner takes care of this lexical analysis.
In the second phase, another component checks the syntax and semantics of the program. For example, if a programmer uses a variable that has not yet been declared, or if a semicolon at the end of an expression is missing, the compiler will discover the problem in this phase.
Modern compilers distinguish between syntactic and semantic analysis. Syntax is all about structure, whereas semantics is focused on meaning. The component of the compiler called the parser tries to detect the syntax elements. If this phase is successful, the parser calls the semantic routine.
This delineation makes it possible to optimize tasks and analyze the source code more efficiently.
However, syntactic and semantic analysis turns out to be a non-trivial task, as a simple example shows: the compiler for a dumb calculator needs to parse a text file with a calculation expression. The calculator can only add and multiply numbers, where multiplication/division takes priority over addition/subtraction. If in doubt, brackets specify the sequence. For example, the following is allowed,
(1+3)*(4+5)
but this is not:
(1+)*4
The compiler must understand which of the two expressions is valid and which is not.
For the compiler to identify that the second expression as faulty, the compiler builder first tells the compiler how the correct source code is structured. This information is conveyed through rule sets. The compiler checks whether the source code follows the rules defined in the rule sets. In the example, a single number is obviously a valid expression. If two x
and y
expressions exist, then x + y
, x * y
and (x)
are valid expressions.
The rules can also be written in shorthand. The Backus-Naur Form (BNF) is a very popular form (see Listing 1): The program (program
) consists of an expression (expr
). This expression, in turn, exists if it is either a term
or if it is a term
with the character +
and then another expression after. The last rule is necessary because numbers can consist of several digits (digits
).
Listing 1
Backus-Naur Form
These rules, known as production or derivation rules, form the grammar of the source language. Like English grammar, this grammar provides rules that build the programming language. Abbreviations, such as term
or factor
, are known as symbols. All symbols on the left-hand side are referred to as non-terminals, all others are terminals. Terminals would include digit
, +
, *
, )
, and (
.
One programming language may include several grammars. The choice would depend on the language and requirements. The grammar used in the preceding example can lead to infinitely long expressions and programs because some rules have a recursive structure. The grammar used in the example is considered a context-free grammar (or type 2). A grammar is considered regular if all rules follow one of the two forms:
U ::= t U ::= Vt
In this case, t
is a terminal character; U
and V
are non-terminal characters.
Top-Down Parser
To the delight of developers, a grammar can be used to create a compiler quite elegantly: For all non-terminals, the developer creates a function that takes care of the corresponding evaluation. If necessary, the function calls on other functions to help. For example, the function expr()
, which belongs to expr
, has a structure like the one briefly outlined in Listing 2.
According to the second rule in Listing 1, a correct expression such as 1+2
always starts with a term
. In Listing 2, the expr()
function includes term()
, which still needs to be implemented for an evaluation in the first step. In this example, term()
returns 1
.
Listing 2
expr()
§§nonumberint expr() { int value = term(); if (File.ReadNextChar() == '+') value += expr(); return value; }
In the second step, expr()
then looks at the next character in the source code. If the next character is a +
, the second rule of Listing 1 requires an expression to follow, which is why expr()
calls itself. The result is that expr()
returns 2
, which the compiler also adds to the previously computed intermediate result. The compiler would then have arrived at the end of a valid expression, which is why expr()
returns the calculated value in the example.
At this point, a mature compiler could generate a suitable instruction in machine code. The developer generates functions according to the same principle for all other non-terminals. If you then apply program()
to the source code file, you automatically get the translated program – in the example, this would be the calculated number. Because processing starts with program
, computer scientists also call this symbol the start symbol.
The calls to the individual functions can be displayed as a tree in this scenario: the syntax tree.
Bottom-Up Parsers
In the process described so far, the parser simply proceeds from the start symbol (in the example program
) and then tries to find appropriate grammar rules for each character that has been read. Such parsers are known as top-down parsers.
Other parsers are bottom-up parsers and proceed according to an opposite scheme: They try to replace the source code with grammar rules (reduction) until only the start symbol remains. In practice, such bottom-up parsers usually use an empty stack at the beginning, into which they dump all previously detected, non-terminal characters and all the characters they read. After each new character is read, they check whether the topmost elements on the stack match the right side of a grammar rule. If this is the case, parsers take the elements from the stack and place the non-terminals from the left side of the identified grammar rule on top. The whole process runs until the start symbol is left and the end of the source code is reached. If that doesn't work, there's probably a mistake. A bottom-up parser that works according to this pattern is also known as a shift-reduction parser.
The programming of the scanner and parser is a straightforward consequence of the grammar of the language. This process can also be automated: Tools such as Flex [2] or Bison [3] automatically generate the source code for matching scanners and parsers from the grammar (Figure 2).
Whenever the parser needs more information, it requests it from the scanner. In the example in Listing 2, the expr()
function would thus not read the next character itself; instead, File.ReadNextChar()
simply consults the scanner.
Scanner Internals: Machines
The source code of most programming languages consists of numbers, variable names, keywords, operators, and separators (delimiters). Names and keywords are composed of several letters and characters.
A finite-state machine helps the scanner detect int
, &&&
, and all other components of the language. Imagine it like a ticket vending machine: The customer throws in 20 cents, and the machine changes its status to "20 cents paid." If the customer inserts a dollar, the machine then changes the status to "120 cents paid". This continues until the machine reaches the "price paid in full" status. Similarly, you can make a machine that accepts letters from the source code. The machine switches each character read to a different status: If the scanner has read an i
, it changes to the state "i read"; if an n
is added, it changes to the state "n read." If the t
follows, the machine has finally read the whole term int
and changes its status to "int identified."
This machine has an initial status and at least one final status. The machine for detecting int
has detected the initial "no character read yet" status and the two final "int recognized" and "not the keyword int" statuses. When the final status occurs, computer scientists say the machine has accepted the word int
.
You can also envision a scenario in which boxes (nodes) represent the individual states (Figure 3). The start and end states are highlighted accordingly. Arrows (edges) indicate from which status the machine transitions to another state. The transition will only take place if the machine has read one of the characters written on the arrow.
A regular grammar can be used to construct not only a parser, but also a finite-state machine that accepts the language of the grammar: First, create a status for all non-terminals, plus the start and end statuses. Then, when it reads a character t
, it makes a transition from the s
state to the n
state if there is a grammar rule in the n ::= st
form. Incidentally, this process creates a bottom-up parser for the grammar.
In a simple scenario, you could read the next character and check, with if
queries or a switch
construction, which character it is and finally note, in the form of a variable, the new state to which the machine has changed.
Once the scanner has identified the next element in the source code (such as the keyword int
), it replaces the element with a symbol and returns it to the parser. The aforementioned Flex tool also uses the strategy with finite-state machines.
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
There's a New Open Source Terminal App in Town
Ghostty is a new Linux terminal app that's fast, feature-rich, and offers a platform-native GUI while remaining cross-platform.
-
Fedora Asahi Remix 41 Available for Apple Silicon
If you have an Apple Silicon Mac and you're hoping to install Fedora, you're in luck because the latest release supports the M1 and M2 chips.
-
Systemd Fixes Bug While Facing New Challenger in GNU Shepherd
The systemd developers have fixed a really nasty bug amid the release of the new GNU Shepherd init system.
-
AlmaLinux 10.0 Beta Released
The AlmaLinux OS Foundation has announced the availability of AlmaLinux 10.0 Beta ("Purple Lion") for all supported devices with significant changes.
-
Gnome 47.2 Now Available
Gnome 47.2 is now available for general use but don't expect much in the way of newness, as this is all about improvements and bug fixes.
-
Latest Cinnamon Desktop Releases with a Bold New Look
Just in time for the holidays, the developer of the Cinnamon desktop has shipped a new release to help spice up your eggnog with new features and a new look.
-
Armbian 24.11 Released with Expanded Hardware Support
If you've been waiting for Armbian to support OrangePi 5 Max and Radxa ROCK 5B+, the wait is over.
-
SUSE Renames Several Products for Better Name Recognition
SUSE has been a very powerful player in the European market, but it knows it must branch out to gain serious traction. Will a name change do the trick?
-
ESET Discovers New Linux Malware
WolfsBane is an all-in-one malware that has hit the Linux operating system and includes a dropper, a launcher, and a backdoor.
-
New Linux Kernel Patch Allows Forcing a CPU Mitigation
Even when CPU mitigations can consume precious CPU cycles, it might not be a bad idea to allow users to enable them, even if your machine isn't vulnerable.