Decision trees, neural networks and pattern recognition

Even though there is a BIOS routine which reports the character at a specific location on the screen, this program was written with more general pattern recognition in mind. The code attempts to find the pattern corresponding to a set of features. The binary input data is assumed to be reliable; correlation is the more appropriate technique when the data is noisy.

Decision tree

In general, it is useful to think of causes and the corresponding set of symptoms, every one of which is either present or absent. In this case, the causes are the letters, numbers and symbols, while the symptoms are the pixels in an 8x16 matrix (which are either on or off).

Pixel labelling
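The cause/symptom model above can be sketched as follows. This is a minimal illustration, not the program's own code: the glyphs are hypothetical and cut down to 8x4 (rather than 8x16) pixels, and all names are assumptions.

```python
def glyph_to_symptoms(rows):
    """Flatten a glyph (a list of equal-length strings of '.'/'#')
    into a tuple of 0/1 symptoms, one per pixel."""
    return tuple(1 if c == '#' else 0 for row in rows for c in row)

# Two tiny hypothetical glyphs, cut down to 8x4 for illustration:
glyph_T = ["########",
           "...##...",
           "...##...",
           "...##..."]
glyph_L = ["##......",
           "##......",
           "##......",
           "########"]

# Each cause (a symbol) maps to its vector of binary symptoms (pixels):
causes = {'T': glyph_to_symptoms(glyph_T),
          'L': glyph_to_symptoms(glyph_L)}
```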

Use is made of a decision tree which rejects about half the prospective symbols at every stage by selecting a suitable symptom, so the average number of comparisons is essentially logarithmic in the number of causes. The program optimises the individual stages, not necessarily the tree structure as a whole; in practice, however, the ideal structure does not seem to be much shorter, as the following table shows:

Symbols   Longest path   Average   Ideal
   96           7          6.67     6.58
  224          10          7.89     7.80
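The greedy construction just described can be sketched as below. This is an assumed reimplementation, not the original program: at each node it picks the symptom whose on/off split of the remaining causes comes closest to half and half, and it never retests a symptom already used on the path.

```python
def build_tree(causes, used=frozenset()):
    """causes: dict mapping a symbol name to its tuple of 0/1 symptoms.
    Returns a leaf (a symbol name, or a sorted list when symbols clash)
    or a node (symptom_index, subtree_if_absent, subtree_if_present)."""
    if len(causes) <= 1:
        return next(iter(causes), None)
    n = len(next(iter(causes.values())))
    best, best_gap = None, None
    for s in range(n):
        if s in used:
            continue
        on = sum(v[s] for v in causes.values())
        gap = abs(2 * on - len(causes))        # 0 means a perfect halving
        if 0 < on < len(causes) and (best_gap is None or gap < best_gap):
            best, best_gap = s, gap
    if best is None:                           # remaining causes clash
        return sorted(causes)
    present = {k: v for k, v in causes.items() if v[best]}
    absent = {k: v for k, v in causes.items() if not v[best]}
    return (best,
            build_tree(absent, used | {best}),
            build_tree(present, used | {best}))

def classify(tree, vector):
    """Walk the tree, testing one symptom per stage."""
    while isinstance(tree, tuple):
        symptom, if_absent, if_present = tree
        tree = if_present if vector[symptom] else if_absent
    return tree
```

With four causes and two symptoms, each stage halves the candidates, giving the logarithmic path length mentioned above.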

So far, so good, but there are still a couple of matters to settle: how to choose the symptom to check at each node, and how to store the growing tree compactly.

For the first point, it is noted that the next symptom to check need not depend on the present symptom alone, but possibly on past symptoms as well. Better still, make it a node property.
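One way to realise "make it a node property" is to record, at each node, which symptom to test next along with the two successors. This is a sketch under assumed names, not the program's own data layout.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    symptom: int                          # index of the pixel to test here
    if_absent: Optional['Node'] = None    # successor when the pixel is off
    if_present: Optional['Node'] = None   # successor when the pixel is on
    cause: Optional[str] = None           # set at leaves: the symbol found

def follow(node, vector):
    """Descend from node, letting each node name its own symptom."""
    while node.cause is None:
        node = node.if_present if vector[node.symptom] else node.if_absent
    return node.cause
```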

An array which is dimensioned to 1024 elements but uses only 256 of them is not a satisfactory state of affairs once the trees grow much larger. One way out of this is to use an (additional) indirection array; the sketch below shows the obvious way to number tree nodes when ordering is not important. On a modern computer, the speed difference may be negligible.

Tree node numbering
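The two storage schemes can be contrasted as follows; the layout and names here are assumptions for illustration. The implicit numbering reserves slots for every possible node, which is what inflates the 256-node tree to a 1024-element array, while the indirection scheme packs only the nodes that exist.

```python
def heap_children(i):
    """The 'obvious' implicit numbering: children of node i sit at
    2*i + 1 and 2*i + 2, so a tree of depth d reserves 2**(d+1) - 1
    array slots whether or not they are used."""
    return 2 * i + 1, 2 * i + 2

# Indirection: pack only the nodes that actually exist, and keep a
# parallel array mapping each node to the positions of its children
# (None where there is no child).
nodes = ['root', 'left leaf', 'right leaf']
children = [(1, 2), (None, None), (None, None)]
```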

Some symptoms may be always (or never) present. The program then proceeds, but it is fair to warn the user that such a symptom is useless. Sometimes it is not possible to distinguish between certain causes, either because not enough symptoms have been supplied, or because the causes are not really different. Consider the extended ASCII set, for example (which also includes a second alphabet and graphics symbols): some letters in the local alphabet may clash with standard ASCII characters. Again, the program resolves as many symbols as possible and emits the relevant messages; under certain circumstances it is satisfactory to return only the first cause that matches.
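The two diagnostics just described can be sketched as simple scans over the cause table; the function names are assumptions, not those of the program.

```python
def useless_symptoms(causes):
    """Indices of symptoms that are always (or never) present,
    i.e. take the same value for every cause."""
    vectors = list(causes.values())
    n = len(vectors[0])
    return [s for s in range(n)
            if all(v[s] == vectors[0][s] for v in vectors)]

def clashing_causes(causes):
    """Groups of causes that share an identical symptom pattern
    and therefore cannot be told apart."""
    groups = {}
    for name, vector in causes.items():
        groups.setdefault(vector, []).append(name)
    return [sorted(g) for g in groups.values() if len(g) > 1]
```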

Decision stages

The tree is built up gradually, and node values which are already known are obviously not found again. If the user specifies a previously saved tree, the values are simply filled in from the file.
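A minimal persistence sketch for this incremental building might look like the following. The file format here is an assumption (the original tree files are not JSON); it merely shows the round trip of saving nested (symptom, absent, present) nodes and filling them back in.

```python
import json

def save_tree(tree, path):
    """Write a tree of nested (symptom, if_absent, if_present) nodes."""
    with open(path, 'w') as f:
        json.dump(tree, f)

def load_tree(path):
    """Reload a saved tree; JSON returns tuples as nested lists."""
    with open(path) as f:
        return json.load(f)
```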

The decision tree is really a special type of neural network, though all the input and intermediate (but not output) nodes take only binary values. When the symptoms can take several values, rather than pass/fail only, general neural networks may be appropriate. This could form the subject of a different page.

The source code for this section follows, along with the PCDOS binary and some sample tree files. Of course, different computer models will not have the same fonts.
