
The Statisticians

When Frederick Jelinek walked into IBM's Thomas J. Watson Research Center in 1972, he meant to stay three months. He was a tenured information theorist at Cornell, taking what was supposed to be an unpaid leave, and the speech-recognition group he agreed to look after had just lost its manager. He stayed twenty-one years. It was the kind of accident that defined his life — a man who, by his own account, kept sliding into the work that made him famous rather than choosing it. He had not set out to study speech, or translation, or the strange new discipline that would one day be called natural language processing. He had set out, decades earlier, to find a corner of engineering that did not require him to build anything physical. What he built instead was a way of thinking that would outlast every machine of his era.

The path to that room in Yorktown Heights ran through one of the century's darker passages. Jelinek was born Bedřich Jelínek in 1932 in Kladno, near Prague, the son of a Jewish dentist. When the Nazis occupied Czechoslovakia he was barred from school and taught in shifting underground classes until those, too, were forbidden. His father died after his imprisonment at the Terezín concentration camp; Jelinek, his mother, and his sister survived. In 1949 the family emigrated to New York, where he took evening engineering classes at City College before winning a stipend to attend MIT. There he fell into the orbit of Claude Shannon, whose recent theory of communication had turned messages, noise, and meaning into mathematics. Under Robert Fano he wrote a doctorate in information theory, choosing the field, he later said, precisely because its aim was not the construction of physical systems. He brushed against linguistics only glancingly — his wife Milena enrolled in Noam Chomsky's lectures and he sometimes sat in, briefly tempted to switch fields, until Fano told him to finish his degree first.

That glancing brush mattered, because linguistics was about to become his great antagonist. The reigning conviction of the era, articulated most forcefully by Chomsky, was that language was a system of rules — and that the very idea of assigning a probability to a sentence was, in Chomsky's words, an entirely useless notion. To approach language by counting words seemed not merely crude but conceptually confused. Jelinek's group at IBM proceeded to do exactly that. Rather than hand-code the grammar and phonetics of English, they treated speech recognition as the problem of decoding a message sent through a noisy channel: the speaker has a sentence in mind, the world garbles it into sound, and the machine's task is to recover the most probable original. It was Shannon's code-breaking logic, repurposed. You did not need to know why people said what they said. You needed only to estimate, from enormous quantities of text, how likely any given string of words was to occur, and to weigh that against how likely a given sound was to have come from a given word. The orthodoxy called it heresy. Jelinek called it engineering.
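The weighing described above can be sketched in a few lines of Python. This is a toy under loud assumptions, not a reconstruction of the IBM recognizer: the word-pair probabilities, the acoustic scores, and the names BIGRAM, ACOUSTIC, and decode are all invented for illustration. The decoder simply picks the word string W that maximizes P(W) times P(sound | W).

```python
import math

# Hypothetical language model: P(word | previous word), as if counted from text.
# Every number here is invented for illustration.
BIGRAM = {
    ("<s>", "recognize"): 0.02,
    ("recognize", "speech"): 0.3,
    ("<s>", "wreck"): 0.001,
    ("wreck", "a"): 0.2,
    ("a", "nice"): 0.01,
    ("nice", "beach"): 0.05,
}

# Hypothetical channel model: P(observed sound | word string).
# The sound is deliberately a slightly better match for the wrong sentence.
ACOUSTIC = {
    ("recognize", "speech"): 0.4,
    ("wreck", "a", "nice", "beach"): 0.6,
}

UNSEEN = 1e-6  # crude floor for word pairs never seen in the text


def log_language_model(words):
    """Log-probability of a word string under the bigram model."""
    score = 0.0
    previous = "<s>"
    for word in words:
        score += math.log(BIGRAM.get((previous, word), UNSEEN))
        previous = word
    return score


def decode(candidates):
    """Return the candidate maximizing log P(W) + log P(sound | W)."""
    return max(
        candidates,
        key=lambda w: log_language_model(w) + math.log(ACOUSTIC[w]),
    )


CANDIDATES = list(ACOUSTIC)
best = decode(CANDIDATES)
acoustic_only = max(CANDIDATES, key=lambda w: ACOUSTIC[w])
```

With these made-up numbers the sound alone favors "wreck a nice beach," but the language model finds that string so improbable that the combined score settles on "recognize speech." Nothing in the code knows any grammar; it only multiplies two estimates.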

It is from this period that the most quoted sentence in the field descends — and it is worth pausing on how little we actually know about it. Some version of "every time I fire a linguist, the recognizer gets better" has trailed Jelinek for forty years, repeated in textbooks, lectures, and a thousand conference corridors. But there is no canonical wording, no agreed date, and no published source. One account dates a version to a workshop in 1985; Jelinek himself recalled saying something like it in 1988, in Wayne, Pennsylvania, though the line appears nowhere in the proceedings of that meeting. He never claimed to have actually fired anyone. And he came to dislike the reputation it earned him so thoroughly that in 2004, accepting a lifetime award, he titled his talk "Some of My Best Friends Are Linguists." A colleague who knew him well recalled that Jelinek's favorite adjective for his own statistical models was "moronic" — not a boast but a goad, a standing reminder that the crude methods worked only because no one had yet found something better. The man behind the legend respected the rules he was dismantling. The legend has never bothered to notice.

The deeper provocation came when the group turned the same machinery on translation. By the late 1980s, machine translation was a discredited cause; a famous government report two decades earlier had pronounced it hopeless and choked off its funding. Jelinek's team revived an idea that had been floated and abandoned in 1949 by Warren Weaver, who had wondered whether a foreign language might be treated as one's own written in a secret code. Now there were two things Weaver had lacked: vast quantities of machine-readable text, and computers fast enough to chew through it. The text came from Canada, whose parliament published its proceedings in both English and French — millions of sentences, each one a human translation of the other. The group used these to estimate, statistically, how English words mapped onto French ones, then cast translation as another noisy-channel problem: given a French sentence, find the English sentence most likely to have produced it. No grammar, no hand-built dictionaries of rules, only probabilities learned from data.
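How word mappings can be learned from nothing but paired sentences is easier to see in miniature. The sketch below uses expectation-maximization in the style of what the literature calls IBM Model 1, heavily simplified (no empty word, no sentence-length model). The three-sentence corpus and the names CORPUS, train_word_translation, and best_french are hypothetical, and nothing here should be read as Candide's actual code.

```python
from collections import defaultdict

# Hypothetical parallel corpus of (French, English) pairs, invented for illustration.
CORPUS = [
    ("la maison".split(), "the house".split()),
    ("la fleur".split(), "the flower".split()),
    ("maison bleue".split(), "blue house".split()),
]


def train_word_translation(corpus, iterations=20):
    """Estimate t[(f, e)] = P(French word f | English word e) by EM."""
    french_vocab = {f for french, _ in corpus for f in french}
    # Start with no knowledge: every French word is equally likely.
    t = defaultdict(lambda: 1.0 / len(french_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected (f, e) pairings
        total = defaultdict(float)  # expected pairings per English word
        for french, english in corpus:
            for f in french:
                # Share one unit of credit for f among the English words
                # in the same sentence, in proportion to the current t.
                norm = sum(t[(f, e)] for e in english)
                for e in english:
                    share = t[(f, e)] / norm
                    count[(f, e)] += share
                    total[e] += share
        # Re-estimate the probabilities from the expected counts.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def best_french(t, english_word):
    """Most probable French word for an English word under t."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


table = train_word_translation(CORPUS)
```

No dictionary is supplied. Because "la" keeps company with "the" and "maison" with "house" across sentences, the repeated re-estimation concentrates probability on those pairs, and the leftover words "fleur" and "bleue" fall to "flower" and "blue."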

Their system was called Candide, and the myth that has grown up around it, that it beat the established commercial translators, is more flattering than true. In the government evaluations of the early 1990s, Candide edged the veteran rule-based system Systran on fluency, the naturalness of its English, but trailed it on adequacy, the question of whether the meaning survived. It comfortably beat a rival knowledge-based system. The honest headline was not victory but parity: a five-year-old program built on nothing but statistics had drawn level, on one measure, with a system that had fifteen years of hand-crafted linguistic rules behind it. To anyone paying attention, that was the more astonishing result. The rule-based champions had a head start of more than a decade, and the upstart that knew nothing about language had very nearly caught them simply by reading.

IBM never sold Candide. By around 1995 the effort had dissolved, partly because the translations remained far too slow to be practical, and partly because the men who had built it left. Their destination is one of the stranger codas in the history of computing. Several of the group's leading modelers — Peter Brown, Robert Mercer, the Della Pietra brothers, Lalit Bahl — went to a hedge fund called Renaissance Technologies, where they trained the same instinct on a different noisy channel. The stock market, like French, was a stream of signals concealing a hidden order; if you could model the one statistically, you could model the other. They became extraordinarily wealthy. Jelinek took the other road, founding a language and speech research center at Johns Hopkins and working there until the day he died, at his desk, in 2010.

He did not live to see what his heresy became, but its shape is unmistakable in everything that followed. The conviction that you could learn language from data rather than legislate it with rules; the discipline of measuring progress against shared corpora and fixed benchmarks; the willingness to let a crude method win for as long as it kept winning — these hardened into the working assumptions of an entire field. The probability of a sentence, that supposedly useless notion, turned out to be the thread running from Shannon's wartime mathematics straight through to the systems that, decades later, would learn to write. Jelinek had simply been early, and stubborn, and unromantic enough to count.

