That a few lines of code can become an expert at recognising spoken British vowels, or at a wine-tasting test, is fascinating. Of course, an expert may well have been needed to decide which attributes to measure, but still no person can cope with tens of variables at the same time.
People tend to simplify, and seek the three or four most influential factors for the desired classification. (Sometimes that is good enough.) Computers get their results in a different way, as they can examine tens of variables at once. It does not matter if the individual attributes selected are only mildly relevant to the target classification, as long as there are a lot of attributes.
Experts certainly exist, but they tend to be busy, and some matters are thought to be sensitive. Most people are not experts in most matters. A computer can help, sometimes, but it must be stressed that no single method is appropriate for all problems.
The nearest neighbour classifier asserts that a sample and its nearest neighbour are likely to be in the same class, provided there are a lot of samples.
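As a sketch of the idea, in Python rather than the QuickBasic of the program supplied (the function names and the toy data below are illustrative, not taken from that program):

```python
import math

def nearest_neighbour(sample, training_set):
    """Return the class label of the training instance closest to sample.

    training_set is a list of (attributes, class_label) pairs;
    attributes are tuples of numbers.
    """
    best_label, best_dist = None, math.inf
    for attributes, label in training_set:
        dist = math.dist(sample, attributes)  # Euclidean distance
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# A toy two-class example:
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((5.0, 5.0), "B")]
print(nearest_neighbour((0.2, 0.1), train))  # prints A
```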
Database | Samples | Attributes | Classes | Prediction (%) |
Breast tumour prognosis | 194 | 33 | 2 | 70 |
Character recognition | 20,000 # | 16 | 26 | 75 |
Glass types | 214 | 9 | 6 | 73 |
Handwritten digits | 11,000 # | 16 | 10 | 98 / 93 * |
Image segmentation | 210 | 19 | 7 | 86 |
Ionosphere returns | 351 | 34 | 2 | 91 |
Iris plant | 150 | 4 | 3 | 93 |
Liver disorder | 345 | 6 | 2 | 63 |
Lung disease | 27 | 56 | 3 | 56 |
Oncology data | 569 | 30 | 2 | 95 |
Pima diabetes | 768 | 8 | 2 | 69 |
Spoken vowels | 990 | 10 | 11 | 99 / 50 * |
Thyroid data | 215 | 5 | 3 | 97 |
Ultrasonic identification | 208 | 60 | 2 | 87 |
Wine origin | 178 | 13 | 3 | 98 |
* The lower percentage is for different people in the training and testing groups.
# Only the first 950 samples have actually been used.
For the liver disorder, the results really are disappointing: prediction is hardly better than a pure guess, allowing for statistical uncertainty. It appears that most of the attributes supplied are not relevant, and those which are (to some extent) are not enough for a successful classification.
Even when the results are not encouraging, as for speaker-independent vowel recognition, you can still draw some useful conclusions: for example, a successful vowel recognition algorithm will probably have to be trained on the individual, and you do not need to be an expert to see that. Handwritten digits, on the other hand, do not seem to require such customisation.
There are, as usual, variations on a theme, and nearest neighbour classification is no exception. The k-nearest neighbours algorithm finds several nearest neighbours, then applies a majority rule.
The differences in attribute values are normalised by the standard deviation before distances are computed, but it is usually worthwhile to assign different weights to the attributes, according to their relevance. Even a crude adjustment of the weights, such as trying the values 2 and 0.5, gains at least 2% in most of the figures in the table above.
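The combination of majority rule, standard-deviation normalisation and per-attribute weights can be sketched as follows (again in Python, with illustrative names; the weight values are the crude 2 / 0.5 adjustment mentioned above):

```python
import statistics
from collections import Counter

def knn_classify(sample, training_set, k=3, weights=None):
    """k-nearest-neighbour classification by majority vote.

    Each attribute difference is divided by the standard deviation of
    that attribute over the training set, and optionally multiplied by
    a per-attribute weight (e.g. 2 for a relevant attribute, 0.5 for a
    doubtful one).
    """
    n = len(sample)
    # "or 1.0" guards against a zero standard deviation.
    stdevs = [statistics.stdev(row[0][i] for row in training_set) or 1.0
              for i in range(n)]
    weights = weights or [1.0] * n

    def dist(attrs):
        return sum((weights[i] * (sample[i] - attrs[i]) / stdevs[i]) ** 2
                   for i in range(n)) ** 0.5

    neighbours = sorted(training_set, key=lambda row: dist(row[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

train = [((1.0, 10.0), "A"), ((1.2, 12.0), "A"), ((0.9, 11.0), "A"),
         ((3.0, 30.0), "B"), ((3.2, 28.0), "B")]
print(knn_classify((1.1, 11.0), train, k=3, weights=[2.0, 0.5]))  # prints A
```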
One paper proposes checking whether the classifier is improved by removing from the training set those instances on which it performs below its average.
Cross-validation is employed to assess classification algorithms: the data is split into ten groups, each having about the same number of instances of every class. Nine groups are used to train the classifier, while the tenth is used for testing. All the unseen groups can be tested, one at a time, and the average success rate quoted. More than ten groups can be used, but the reduction in statistical uncertainty may not be worthwhile and, for some methods, it requires extra computer effort, if that still matters nowadays. The most demanding test is to train the program on all samples but one, then test it on the unknown sample; this is how the figures in the table were obtained.
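The most demanding test, leave-one-out, is simple to express (a sketch with illustrative names; any classifier can be passed in, here a one-line nearest neighbour rule):

```python
import math

def leave_one_out_accuracy(data, classify):
    """Train on all samples but one, test on the held-out sample,
    repeat for every sample, and return the fraction classified
    correctly.  classify(sample, training_set) is any classifier.
    """
    hits = 0
    for i, (attrs, label) in enumerate(data):
        training_set = data[:i] + data[i + 1:]
        if classify(attrs, training_set) == label:
            hits += 1
    return hits / len(data)

def nn(sample, training_set):
    # Nearest neighbour rule: label of the closest training instance.
    return min(training_set, key=lambda row: math.dist(sample, row[0]))[1]

data = [((0.0,), "A"), ((0.1,), "A"), ((0.9,), "B"), ((1.0,), "B")]
print(leave_one_out_accuracy(data, nn))  # prints 1.0
```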
Some attribute values may be missing. The obvious solution ('omit them from the training set') is not what the literature advises. Instead, it proposes exploiting the remaining attributes to estimate the missing one. For the figures in the table, the (few) records with unknown attribute values have been bypassed.
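One way to exploit the remaining attributes, in keeping with the theme of this page, is to copy the missing value from the nearest complete record over the attributes that are known. This is a hedged sketch of that idea, not the specific estimator any particular paper prescribes:

```python
def impute_nearest(record, complete_records, missing_index):
    """Estimate a missing attribute (record[missing_index] is None) by
    copying it from the complete record that is nearest over the known
    attributes.  Names are illustrative.
    """
    known = [i for i in range(len(record)) if i != missing_index]

    def dist(other):
        return sum((record[i] - other[i]) ** 2 for i in known) ** 0.5

    nearest = min(complete_records, key=dist)
    return nearest[missing_index]

rows = [(1.0, 2.0, 10.0), (1.1, 2.1, 11.0), (5.0, 6.0, 50.0)]
print(impute_nearest((1.05, 2.05, None), rows, 2))  # prints 10.0
```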
There can be nominal attributes, as well as numeric ones. (If they take only two values, they can be treated like boolean attributes.) If there are more nominal values, a text processor can be used to substitute numeric values for them, but that is the least of the problems: you still need to decide on the numerical difference between nominal values. Hopefully, it will be possible to place the nominal values in some order, such as 'lost hands down', 'missed narrowly', 'won- low bonus', 'won- high bonus'. If the values cannot be ordered, then even with the associated numerical differences right, it may not be possible to find a satisfactory numeric representation of nominal features, unless extra, artificial attributes are introduced.
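The extra, artificial attributes can be produced by what is commonly called one-hot encoding: one boolean attribute per category, so that any two distinct nominal values end up at the same distance from each other. A sketch (the category names are illustrative):

```python
def one_hot(value, categories):
    """Encode an unordered nominal value as extra boolean attributes,
    one per category."""
    return [1.0 if value == c else 0.0 for c in categories]

# An ordered nominal attribute can instead be mapped to a simple rank:
outcome_rank = {"lost hands down": 0, "missed narrowly": 1,
                "won- low bonus": 2, "won- high bonus": 3}

print(one_hot("draw", ["win", "draw", "loss"]))  # prints [0.0, 1.0, 0.0]
```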
The program supplied includes a small computer game database for experimentation. Without adjusting weights, the hit rate is 79%. This happens to be as good a rate as for its larger counterpart of 1,000 samples, which is surprising: usually, the more samples, the better.
If you get a hit rate of 100%, don't get too excited! Probably you think the class attribute is in the leftmost column while it is actually in the rightmost, or vice versa. The same goes if you get a very low rate, comparable to a mere guess.
Instance-based classifiers may look like a 'dumb' method (if a statistically sound one), but they are effective. In fact, apart from some hybrid methods which include instance-based classification and improve on it marginally, this otherwise looks like the most successful classifier for most data sets.
Most of the programs on these pages need QuickBasic. If you cannot find it on your OS CD-ROM, it can be downloaded from some sites.
The UCI repository is acknowledged for the databases in this project.