Developing an AI system with Markov chains
Sure-Fire Success
Markov chains model systems that jump from state to state with predetermined probabilities, but can they help write new columns like this one after learning from previously written articles?
Hard to believe, but true: This edition of my programming column marks its 20-year anniversary. Lately, I've been diving into artificial intelligence topics, a field that is increasingly replacing traditional jobs, but I wonder if a machine could possibly one day replace authors like me? An AI system would certainly have a number of advantages over a human writer, because a robot would no doubt deliver the manuscript on time every time, to the amazement of my editors, who are not fond of my bad habit of watching deadlines swoosh by. Also, I could afford to work less hard and have more time to kick back, relax, and spend my days pursuing my surfing hobby at various Pacific beaches (Figure 1).
Moody like the Weather
An algorithm that writes columns automatically could be implemented as a so-called Markov chain, named after the Russian mathematician Andrei Markov. Markov chains are stochastic processes with different states that change according to predetermined probabilities.
The weather forecasting model shown in Figure 2, for example, says the chance of a sunny day following a "Sun" state is 65 percent. At 30 percent, the weather will change to the "Clouds" state, and at 5 percent, it goes directly to the "Rain" state. The probabilities apply only to the current state; a change to the next state does not require knowledge of any event in the past, so the probability of a transition is very easy to calculate.
In a transition matrix such as that shown in Figure 2, the computer only needs to select the row with the current state, such as the third row for the Clouds state (C), to find out that there is a 25 percent chance of the weather being sunny, a 25 percent chance of rain, and a 50 percent chance of skies remaining cloudy. Equipped with corresponding random generators, the model jumps back and forth between states, and a later analysis of where this Monte Carlo method leads eventually allows conclusions to be drawn about the most likely weather in the future.
Sentences from the Machine
Such "random walks" can predict not only the weather, they can generate sequences of words that sound like they were written by a human author. For this purpose, a program first analyzes the word sequences used in an extensive body of text. It notes which words follow one another in a sentence, and pieces together a new text from learned phrases. For example, it might realize that, in corpus, three given words are sometimes followed by word A and sometimes by word B. When generating new prose, the algorithm will roll an imaginary die each time to determine whether the three initial words are followed by A or B, according to the probabilities in which they occur in the original text.
To start off, the algorithm separates the body of text into words (tokens) and then analyzes it step by step with a rolling window of words. The first (n – 1) words of the window determine the state of the machine at the current time, and the last word in the window is the value the machine assumes during the transition to the next state.
Rolling Window
For example, for a window size of n=3, from the text
Humpty Dumpty sat on a wall,
Humpty Dumpty had a big fall.
it extracts the sequences "Humpty Dumpty sat," "Dumpty sat on," "sat on a," and so on. The first two words (Humpty Dumpty) respectively determine the current state of the machine; the following word ("sat") is the next state value. In the second part of the nursery rhyme, the algorithm stumbles again on the "Humpty Dumpty" state. However, it sees not the word "sat," but rather the word "had" following the current state. When it later creates random text, the algorithm now has two options: It can either follow up with "sat," or it can choose "had," and it will do so according to calculated random values. The result is text that shows slight human qualities because it copies word sequences, but occasionally performs crazy jumps between contexts, which can sound either confusing or funny.
The oldest Programming Snapshot in Linux Pro Magazine dates back to 2003 (although the German edition started in 1997), and Listing 1 [1] reads all raw manuscripts cobbled together of every single issue since then in the file all.txt
. The file name is first passed to the generate_from()
function in line 68. The statemap_gen
function starting in line 37 then splits the file into tokens with token_feed()
and passes on the resulting list of the window()
function from line 26, which accepts n=4 as the window width. As explained previously, the status of the Markov chain in this case is made up of the three last words; the fourth is part of the result state.
Listing 1
randomtext.py
To prepare the tokenizer of the nltk
package, it needs to be installed with pip3 install nltk
. Moreover, the tokenizer dataset punkt still needs to be downloaded separately in a Python shell (Figure 3). The tokenizer interprets punctuation marks, such as commas or colons (the regular expression in line 7 defines them all), as individual tokens. This approach later increases the readability and entertainment value of the random text.
Figure 4 shows how the algorithm generates a state file of the Markov chain from the nursery rhyme and the window length n=3. Note how the statemap generated offers both "sat" and "had" as a follow-up state for "Humpty Dumpty":
('Humpty', 'Dumpty'): ['sat', 'had']
The Python code shows some of the spice of the language, such as the yield()
operator in the token_feed()
function in line 16: So that Python can iterate over the components of an object using a for
loop, it must be of type Iterable
. These objects do not spill out their elements in one go, but they pass them through one piece at a time with each call to yield()
.
The advantage of this process is that the script does not need to store all of the data in a construct in memory; rather, it can get away with only calculating the data when needed. Therefore, an object can contain an infinite number of elements; as long as they are never needed all at once, the illusion stays alive despite limited memory.
Python also provides a variety of data structures. Double-ended queues, called deque
, as used in the window()
function in lines 26-35, can be modified really fast if the code limits itself to adding or removing elements to or from the head or tail of the queue. The only flaw is that the caller of the window()
function cannot manipulate the deque
object returned, or the algorithm gets confused.
For this reason, line 41 uses key=list(state)
to copy the elements of the deque
object into a Python list, which the statemap_gen()
function can modify to its heart's content. It interprets the first (n – 1) elements as a Markov state, and the last item as a possible next state.
In line 43, key
is a Python tuple that can no longer be modified. The dictionary statemap
maps such tuples to one or more follow-up state values. In this way, the algorithm can later look up possible follow-up state values quickly and select one randomly with random.choice()
in line 60, leading the text on unpredictable paths and imbuing the algorithm with human-like traits.
The result can be seen in Figure 5. However, it demonstrates that the applied Markov chain procedure does not automatically provide exciting new magazine articles. The grammar is still lacking enormously, and the content doesn't really make any sense. This column will probably have to be created manually for the foreseeable future.
Infos
- Listing for this article: ftp://ftp.linux-magazine.com/pub/listings/linux-magazine.com/205/
Buy this article as PDF
(incl. VAT)
Buy Linux Magazine
Subscribe to our Linux Newsletters
Find Linux and Open Source Jobs
Subscribe to our ADMIN Newsletters
Support Our Work
Linux Magazine content is made possible with support from readers like you. Please consider contributing when you’ve found an article to be beneficial.
News
-
Linux Kernel 6.13 Offers Improvements for AMD/Apple Users
The latest Linux kernel is now available, and it includes plenty of improvements, especially for those who use AMD or Apple-based systems.
-
Gnome 48 Debuts New Audio Player
To date, the audio player found within the Gnome desktop has been meh at best, but with the upcoming release that all changes.
-
Plasma 6.3 Ready for Public Beta Testing
Plasma 6.3 will ship with KDE Gear 24.12.1 and KDE Frameworks 6.10, along with some new and exciting features.
-
Budgie 10.10 Scheduled for Q1 2025 with a Surprising Desktop Update
If Budgie is your desktop environment of choice, 2025 is going to be a great year for you.
-
Firefox 134 Offers Improvements for Linux Version
Fans of Linux and Firefox rejoice, as there's a new version available that includes some handy updates.
-
Serpent OS Arrives with a New Alpha Release
After months of silence, Ikey Doherty has released a new alpha for his Serpent OS.
-
HashiCorp Cofounder Unveils Ghostty, a Linux Terminal App
Ghostty is a new Linux terminal app that's fast, feature-rich, and offers a platform-native GUI while remaining cross-platform.
-
Fedora Asahi Remix 41 Available for Apple Silicon
If you have an Apple Silicon Mac and you're hoping to install Fedora, you're in luck because the latest release supports the M1 and M2 chips.
-
Systemd Fixes Bug While Facing New Challenger in GNU Shepherd
The systemd developers have fixed a really nasty bug amid the release of the new GNU Shepherd init system.
-
AlmaLinux 10.0 Beta Released
The AlmaLinux OS Foundation has announced the availability of AlmaLinux 10.0 Beta ("Purple Lion") for all supported devices with significant changes.