Site map

0. Resume
1. Computational Linguistics
2. Statistical Natural Language Processing
     2.1 HMMs, tagging, chunking
     2.2 Tagging, chunking, Named Entity Recognition (Bikel et al. critique - unfinished)
3. My LexisNexis Risk Solutions Project: Tsunami! (Defunct because of merger/acquisition)
     3.1 Code sample: Forward algorithm in Perl
     3.2 Code sample: Viterbi algorithm in Perl
4. Publications
5. Annotated Core Bibliography


I am not old or famous enough for memoirs, and no one is paying me to do scholarly surveys. So these pages are best viewed as an account of the personal experiences of one linguist in industry. In other words, they are meant to serve as a kind of state-of-the-art report for the working linguist: the state of the art looks different from the vantage point of someone working for a software company than from that of someone working at a major research institution. They are notes to myself and others, hopefully helpful, about where the field is going. A large part of my motivation is that the increasing amount of electronic text creates opportunities of all kinds - financial and intellectual - for those who work with text.

In one view - and I stole the circumlocution from a former colleague - Natural Language Processing is nothing but programming with text. As such, it consists of an array of techniques the practitioner has to be familiar with. The practitioner then offers these techniques to society: "here is what we can do to make document access better". We can summarize documents automatically, we can sift out predicates and arguments, we can find all kinds of named entities, and so on. In another view, NLP is much more - a peephole into the human mind. Participants in the great debates argue, endlessly it sometimes seems, about language acquisition, the representation of language in the brain, and Universal Grammar. I consider looking in on these debates a kind of fringe benefit of being in this field. Yet another view of NLP is as a kind of algorithm celebration. The 80s bathed in the glow of the algorithm (Berlinski 2000, 2001); the 2000s offer a similar celebration of statistical sequencing. Bioinformaticians do it, and textual programmers do it. (On the cross-fertilization between bioinformatics and computational linguistics, look here.)
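The statistical sequencing celebrated here is the business of the forward and Viterbi algorithms on the code-sample pages (3.1 and 3.2, in Perl). Purely as an illustration, here is a minimal forward-algorithm sketch in Python over a toy two-state HMM; the states, words, and probabilities are invented for the example:

```python
# Forward algorithm for a toy HMM: computes P(observation sequence),
# summed over all hidden state paths. All parameters below are
# made up solely for illustration.

states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7},
           "Verb": {"Noun": 0.8, "Verb": 0.2}}
emit_p = {"Noun": {"dog": 0.5, "runs": 0.1},
          "Verb": {"dog": 0.1, "runs": 0.6}}

def forward(observations):
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        # Sum over all predecessor states, then emit the next word.
        alpha = {s: sum(alpha[r] * trans_p[r][s] for r in states) * emit_p[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward(["dog", "runs"]))
```

The Viterbi algorithm has the same trellis structure; it differs only in replacing the sum over predecessor states with a max and keeping backpointers to recover the best tag sequence.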

In any event, NLP was once restricted to the grammars in the Chomsky hierarchy, types 1-3. Researchers adorned these grammars, especially context-free grammars, with grammatical features, embellished them with unification mechanisms, and dragged van Wijngaarden metagrammars into the machinery and called them HPSG grammars - yet Natural Language Processing remained AI-complete, and that was that. This period corresponds to my career as described in the computational linguistics pages.

Next, hesitantly, probability theory insinuated itself onto the scene. Or maybe it had been there all along, gathering dust since the 50s, still humming with the intellectual excitement generated by the WW II code-breaking efforts and Turing's work. Nowadays, data everywhere make it easy to look at populations of words, and everyone wants to be a corpus linguist and apply these techniques. This period corresponds to my statistical NLP pages. I was relatively late in discovering these techniques. I had already given up on NLP and linguistics, and was dying to be a teacher, nurse, electrician or programmer. "Programming with text" seemed to offer nothing but misery and lack of results.

The structure of my pages mirrors this sequence of stages. That's the organizing principle. It is easy to see the trees in the forest in hindsight: the research that stands out and that shapes the field today. But where does someone who works in industry go today? What does a professional have to know to be considered a technical computational linguist? What can employers reasonably expect from us? Do we really have to be expert programmers, excellent linguists, and full computer scientists?

One important disclaimer applies. When one, in addition to reading books and articles, websurfs intensely on a given topic, sometimes entire phrases, or stretches of a sentence, lodge themselves in one's mind. At times it becomes hard to draw the boundary between long-held knowledge and knowledge one acquired five minutes ago while scouring the Internet. I apologize if I crossed boundaries while writing; I never did so with the conscious intent to plagiarize.