The Markov chain algorithm is sometimes used to process a text file and produce different, intelligent-looking text. A list of suffixes is compiled for every two-word prefix in the input, then one of the suffixes is chosen at random. The program is seeded with the first two words of the input text and proceeds until the required number of words has been produced, unless the end of the input text is met first.
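To make the idea concrete, here is a minimal sketch in QBasic-style Basic (the dialect is an assumption; any compiler of that era should accept something close to it). It is not the program described in this section: it rescans the whole word list for every output word, and the sample text, array sizes and output length are made up.

    ' Toy Markov chain: every output word is found by rescanning the
    ' word list for the current two-word prefix.
    RANDOMIZE TIMER
    DIM w$(100), s$(100)
    DATA the,cat,sat,on,the,mat,and,the,cat,slept,*
    n = 0
    DO
        READ t$
        IF t$ = "*" THEN EXIT DO
        n = n + 1: w$(n) = t$
    LOOP
    p1$ = w$(1): p2$ = w$(2)            ' seed with the first two words
    PRINT p1$; " "; p2$;
    FOR k = 3 TO 20                     ' ask for up to 20 words
        c = 0                           ' collect every suffix of the prefix
        FOR i = 1 TO n - 2
            IF w$(i) = p1$ AND w$(i + 1) = p2$ THEN c = c + 1: s$(c) = w$(i + 2)
        NEXT i
        IF c = 0 THEN EXIT FOR          ' prefix occurs only at the text's end
        nxt$ = s$(INT(RND * c) + 1)     ' choose one suffix at random
        PRINT " "; nxt$;
        p1$ = p2$: p2$ = nxt$
    NEXT k
    PRINT

On the ten-word sample it prints the two seed words and then random continuations until it reaches a prefix with no suffix on record.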
Shouldn't prefixes of other sizes be tried? It seems that single-word prefixes produce nonsense, while prefixes of three or more words more or less echo the original text.
It is easy to find versions of the Markov chain algorithm on the net, in Java, Perl, C, C++ and even Awk. There is almost certainly a Basic version of it somewhere, since the program has been around since the early eighties. However, no Basic version turned up easily for this site, and it is a short program anyway.
A program should be considerate of the limitations of many compilers: There are no associative arrays, for example, though you can form your own by using a separate string array for the indices. There is usually an array limit of 64kb, while strings are limited to 32kb.
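A sketch of such a home-made associative array, in the same QBasic-style Basic: the string keys go into one array and the value is held at the matching position in a parallel array, found by a linear search. The names, sizes and the GOSUB routine are illustrative only.

    ' Hand-made associative array: linear search over the key array.
    DIM keys$(1000), value(1000)
    nKeys = 0
    k$ = "example": GOSUB FindSlot      ' look up (or create) a slot
    value(slot) = 42                    ' store a value under that key
    k$ = "example": GOSUB FindSlot
    PRINT value(slot)                   ' prints 42
    END

    FindSlot:
        FOR slot = 1 TO nKeys
            IF keys$(slot) = k$ THEN RETURN
        NEXT slot
        nKeys = nKeys + 1: slot = nKeys: keys$(slot) = k$
    RETURN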
The input text is parsed into words separated by whitespace; punctuation is left attached to words, since this tends to produce more legible results. Every word is denoted by a number according to the order in which it has been met: For example, the second word in the text is thereafter referred to by the number two. Associative arrays are convenient, but not strictly necessary.
Since string manipulation is faster than random access to a file, the input text is stored in a string. You will have to get a more powerful compiler if your input text is longer than 32kb; at least two such compilers are still free. (You could also access memory directly.)
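Reading the whole file into one string takes only a few lines; the file name here is made up. LOF gives the length of the open file in bytes, and INPUT$ fetches that many characters at once, which must stay under the 32kb string limit just mentioned.

    ' Read the entire input file into a single string.
    OPEN "input.txt" FOR BINARY AS #1
    text$ = INPUT$(LOF(1), 1)
    CLOSE #1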
The position of the first character of every word in the text is stored, as well as the word length, so that neither has to be found time and time again.
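A sketch of that single pass, under the same assumptions: one loop over text$ records where each word starts and how long it is, so that word k can later be fetched with MID$ instead of being searched for again. The array names and sizes are illustrative.

    ' Record the start position and length of every word in text$.
    DIM wordStart(5000), wordLen(5000)
    nWords = 0: i = 1
    DO WHILE i <= LEN(text$)
        ' skip the whitespace between words
        DO WHILE i <= LEN(text$) AND MID$(text$, i, 1) <= " "
            i = i + 1
        LOOP
        IF i > LEN(text$) THEN EXIT DO
        nWords = nWords + 1: wordStart(nWords) = i
        DO WHILE i <= LEN(text$) AND MID$(text$, i, 1) > " "
            i = i + 1
        LOOP
        wordLen(nWords) = i - wordStart(nWords)
    LOOP
    ' word number k is then MID$(text$, wordStart(k), wordLen(k))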
Some 85% of prefixes are met only once, so the corresponding suffix is not stored explicitly: It is simply the word immediately following the prefix. Where a prefix has additional suffixes, a list of them is compiled and pointed to by the first occurrence of that prefix. Subsequent instances of the prefix merely point to the first one.
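One possible shape for this scheme, with the text already reduced to word numbers in w(): first(i) points every occurrence of a prefix at its earliest one, and only an earliest occurrence that is repeated gets an explicit suffix list. This is an illustration of the description above, not the program's actual code; the names, sizes and sample numbers are made up.

    ' Build first() and the suffix lists over word numbers w(1..n).
    DIM w(100), first(100), nSuf(100), suf(100, 20)
    DATA 1,2,3,4,1,2,5
    n = 7
    FOR i = 1 TO n: READ w(i): NEXT i
    FOR i = 1 TO n - 2
        first(i) = i                    ' assume a first occurrence
        FOR j = 1 TO i - 1              ' look for an earlier occurrence
            IF w(j) = w(i) AND w(j + 1) = w(i + 1) THEN
                f = first(j): first(i) = f
                IF nSuf(f) = 0 THEN     ' list starts with the first suffix
                    nSuf(f) = 1: suf(f, 1) = w(f + 2)
                END IF
                nSuf(f) = nSuf(f) + 1: suf(f, nSuf(f)) = w(i + 2)
                EXIT FOR
            END IF
        NEXT j
    NEXT i
    ' A prefix met once keeps nSuf() = 0: its suffix is just the next word.

The inner loop above is exactly the pairwise comparison whose cost is discussed next.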
Determining whether a prefix is met several times in the text needed some more work: A thousand words of input require some 500,000 string comparisons, roughly n²/2 of them, if every word is compared to (up to) every other word. Since 500,000 string comparisons only take a second on a 2 GHz processor, you could let it go, in the first instance. When the texts get much longer, the waiting times become nobody's idea of fun, and a tree structure is appropriate. A hash table can be even faster, though that will be discussed in another section. No attempt is made to remove identical suffixes.
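As a sketch of the tree alternative: the prefixes can be kept in a binary search tree keyed on the two-word prefix string, so that finding an earlier occurrence of a prefix no longer means comparing it against every previous word. Again the names and sizes are made up, and this is one possible layout rather than the program's own.

    ' Binary search tree over prefix strings, held in parallel arrays.
    DIM nodeKey$(5000), nodeLeft(5000), nodeRight(5000), nodePos(5000)
    nNodes = 0: root = 0
    k$ = "the cat": p = 1: GOSUB TreeFind    ' firstPos = 0: new prefix
    k$ = "the cat": p = 5: GOSUB TreeFind    ' firstPos = 1: seen before
    PRINT firstPos
    END

    TreeFind:                       ' find k$, or insert it at position p
        firstPos = 0
        IF root = 0 THEN
            nNodes = 1: root = 1: nodeKey$(1) = k$: nodePos(1) = p
            RETURN
        END IF
        cur = root
        DO
            IF k$ = nodeKey$(cur) THEN firstPos = nodePos(cur): RETURN
            IF k$ < nodeKey$(cur) THEN
                IF nodeLeft(cur) = 0 THEN EXIT DO
                cur = nodeLeft(cur)
            ELSE
                IF nodeRight(cur) = 0 THEN EXIT DO
                cur = nodeRight(cur)
            END IF
        LOOP
        nNodes = nNodes + 1
        nodeKey$(nNodes) = k$: nodePos(nNodes) = p
        IF k$ < nodeKey$(cur) THEN nodeLeft(cur) = nNodes ELSE nodeRight(cur) = nNodes
    RETURN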
The result is compact, reasonably fast code that is still very modest in its demands for memory. The power supplies section of this site has been used as input text: The effect is at times indistinguishable from human text, and otherwise acceptable or funny.