1. The Tagging Itself

What is tagging, and how does it work when HMMs are used to do it? Language exhibits ambiguity. One case of ambiguity is lexical ambiguity: a word can have more than one category assigned to it. For instance,
man can be a verb, but it can also be a noun. Likewise, boat can be a noun and a verb.

Now look at the text
drunk sailors man boats. Intuitively, man is a verb (VBZ, say) in this example, and boats is a plural noun; let's call that NNS. The word drunk has triple lexical ambiguity: it can be an adjective (JJ), an NN, or a VBZ. It is almost as if some process selects the right category. Few speakers of English would argue that drunk in this case is a verb. The correct part-of-speech tagging for our small text is

drunk/JJ sailors/NNS man/VBZ boats/NNS

Let's put all the words in one row and all the grammatical categories in rows below it.

    drunk       sailors     man         boats
    _________   _________   _________   _________
    NN          NNS         NN          NNS
    VBZ         VBZ         VBZ         VBZ
    JJ          0           0           0

Disambiguation simply means finding the right path through the thicket of grammatical categories, column by column: JJ - NNS - VBZ - NNS. Rather than finding the right path, HMMs find the most likely path through this thicket of correct and incorrect grammatical categories. In Viterbi's original work, the thicket arises from 'convolutional codes', and finding the path through the thicket is called 'decoding'. Decoding has a host of applications in signal processing, and language analysis can be viewed as signal processing.
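Here is a minimal sketch of Viterbi decoding over such a lattice. Every probability below is invented purely for illustration; a real tagger would estimate start, transition, and emission probabilities from a tagged corpus.

```python
# A toy Viterbi decoder for the "drunk sailors man boats" lattice.
# All probabilities are made up for illustration, not taken from data.

def viterbi(words, tags, trans, emit, start):
    """Return the most likely tag path for `words`."""
    # best[tag] = (probability of best path ending in tag, that path)
    best = {t: (start.get(t, 0.0) * emit[words[0]].get(t, 0.0), [t])
            for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            e = emit[w].get(t, 0.0)
            # extend whichever previous tag gives the highest score
            score, path = max(
                ((best[p][0] * trans.get((p, t), 0.0) * e, best[p][1] + [t])
                 for p in tags),
                key=lambda x: x[0])
            new[t] = (score, path)
        best = new
    return max(best.values(), key=lambda x: x[0])[1]

tags = ["JJ", "NN", "NNS", "VBZ"]
start = {"JJ": 0.4, "NN": 0.3, "NNS": 0.2, "VBZ": 0.1}
trans = {("JJ", "NNS"): 0.5, ("JJ", "NN"): 0.4,
         ("NNS", "VBZ"): 0.5, ("NNS", "NN"): 0.2,
         ("VBZ", "NNS"): 0.6, ("NN", "NNS"): 0.3,
         ("NN", "VBZ"): 0.3}
emit = {"drunk":   {"JJ": 0.5, "NN": 0.3, "VBZ": 0.2},
        "sailors": {"NNS": 0.8, "VBZ": 0.2},
        "man":     {"NN": 0.6, "VBZ": 0.4},
        "boats":   {"NNS": 0.7, "VBZ": 0.3}}

print(viterbi("drunk sailors man boats".split(), tags, trans, emit, start))
# -> ['JJ', 'NNS', 'VBZ', 'NNS']
```

With these toy numbers the decoder cuts the intended path JJ - NNS - VBZ - NNS through the thicket; in practice the dictionaries would be large and the probabilities would be log-probabilities to avoid underflow.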

Here's one more tiny example of an HMM at work, this time from speech processing. In the Dutch word
natuurlijk, the written /ij/ is realized as a schwa, which I will portray as [@]. A schwa may also correspond to a written /u/. How is the incoming stream of speech to be decoded when a schwa is observed?

    voice             [natu:r    l     @     k]
                      _________  ____  ____  ____
    spelling 1                   l     ij    k
    spelling 2 (not)             l     u     k

The situation is analogous to the
drunk sailors example above. The ambiguity this time is that the schwa ([@]) can be mapped to /ij/ or /u/. The computation has to decide whether to write /ij/ or /u/, and cut its way through a thicket of possibilities, or rather probabilities. The path will take into account whether the previous sound is an /l/ (the Markov assumption!) and compute that, following the /l/, the /ij/ realization is more likely. The most likely decoded path is then /l/ - /ij/. (Word stress is another required clue; I am simplifying enormously.)
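The Markov-conditioned choice can be sketched in a few lines. The probabilities here are invented for illustration only; they are not estimated from Dutch data.

```python
# Which spelling does a heard schwa correspond to? Condition the choice
# on the previous sound (the Markov assumption). Probabilities invented.

# P(written form | schwa observed, previous sound)
p_spelling = {
    ("l", "ij"): 0.7,   # after /l/, a schwa is usually written "ij" (as in -lijk)
    ("l", "u"):  0.3,
}

def decode_schwa(prev_sound):
    """Pick the most likely written form for a schwa after `prev_sound`."""
    candidates = {w: p for (prev, w), p in p_spelling.items()
                  if prev == prev_sound}
    return max(candidates, key=candidates.get)

print(decode_schwa("l"))  # -> ij
```

A full decoder would run Viterbi over the whole sound sequence rather than a single bigram decision, but the conditioning on the previous symbol is the essential ingredient.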

The Natural Language group of the Department of Computer Science at UC Berkeley has an incredibly fast
tagger demo. (The tagger happens not to be based on HMM technology, which does not affect my point.) I cut and pasted a sentence from a legal document I found on the internet into the demo window. The sentence was long enough that I am repeating it here as a block of text (note the pathetic writing).

The offenses were committed under the following circumstances: Deponant states that deponent is
informed by an individual known to the District Attorney's Office, that informant observed threw
a cellular phone at defendant striking informant in the back of the head and thereby causing a
laceration to the back of informant's head requiring staples and substantial pain.


The tagged text, returned to me lickety-split across the Internet at 1200 bps, was:

Tags:The[DT] offenses[NNS] were[VBD] committed[VBN] under[IN] the[DT] following[VBG] circumstances[NNS] :[:] Deponant[NNP] states[VBZ] that[IN] deponent[NN] is[VBZ] informed[VBN] by[IN] an[DT] individual[JJ] known[VBN] to[TO] the[DT] District[NNP] Attorney[NNP] 's[POS] Office[NNP] ,[,] that[DT] informant[NN] observed[VBD] threw[VBD] a[DT] cellular[JJ] phone[NN] at[IN] defendant[JJ]
striking[VBG] informant[NN] in[IN] the[DT] back[NN] of[IN] the[DT] head[NN] and[CC] thereby[RB]
causing[VBG] a[DT] laceration[NN] to[TO] the[DT] back[NN] of[IN] informant[NN] 's[POS] head[NN]
requiring[VBG] staples[NNS] and[CC] substantial[JJ] pain[NN] .[.]

There are a few errors: that is neither a preposition (IN) nor a determiner (DT). Nonetheless, it's an impressive demo.

2. Once the Text Has Been Tagged

Once the text has been enriched with decoded grammatical tags, we can grab chunks of text or engage in partial parsing of text. There are two ways of doing so. One is to write phrase structure rules
or regular expression rules. Another way is to treat chunking as Named Entity Recognition, assigning the appropriate tags to 'regions' of the text, conditioned on previous and/or current words and tags.
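The NER-style route can be sketched as follows. Assume some sequence model has already assigned each token a B/I/O region label (B = begins a chunk, I = inside, O = outside); the labels below are made up for illustration. Recovering the chunks is then mechanical:

```python
# Chunking as region tagging: group contiguous B-I... runs back into chunks.

def bio_to_chunks(tokens, labels):
    chunks, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":                   # a new chunk starts here
            if current:
                chunks.append(current)
            current = [tok]
        elif lab == "I" and current:     # extend the open chunk
            current.append(tok)
        else:                            # "O": close any open chunk
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tokens = "known to the District Attorney 's Office".split()
labels = ["O", "O", "O", "B", "I", "I", "I"]
print(bio_to_chunks(tokens, labels))  # -> ["District Attorney 's Office"]
```

The interesting statistical work is of course in assigning the B/I/O labels in the first place, conditioned on the surrounding words and tags.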

When applying CFGs, we write grammar rules. We might, for instance, suspect that a sequence of capitalized nouns (NNPs) in a row is often important. The hybrid context-free phrase structure / regular expression rule

IMPORTANT_PHRASE  --> (NNP|POS)+
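One way such a rule can be run, sketched under the assumption that tagging has already been done: encode each tag as a single character so that token index equals string index, then let a regular expression find maximal (NNP|POS)+ runs. The fragment is taken from the tagged legal text above.

```python
import re

# Apply the IMPORTANT_PHRASE rule with a regex over the tag sequence.
tagged = [("the", "DT"), ("District", "NNP"), ("Attorney", "NNP"),
          ("'s", "POS"), ("Office", "NNP"), (",", ",")]

# One character per tag: N for NNP, P for POS, x for everything else.
code = "".join("N" if t == "NNP" else "P" if t == "POS" else "x"
               for _, t in tagged)                    # "xNNPNx"

# Each maximal [NP]+ run in the code string is one IMPORTANT_PHRASE.
phrases = [" ".join(w for w, _ in tagged[m.start():m.end()])
           for m in re.finditer(r"[NP]+", code)]

print(phrases)  # -> ["District Attorney 's Office"]
```

The one-character encoding is just a convenience; toolkits that support tag-pattern chunking compile such rules into essentially the same kind of regular-expression match over the tag sequence.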
                                                    