The story everyone knows about Tomáš Mikolov begins in the wrong place. It begins, in the popular retelling, at Google around 2013, with a piece of software called Word2Vec and a single startling line of arithmetic: take the vector for king, subtract the vector for man, add the vector for woman, and you land almost exactly on queen. It is one of the few results in the history of machine learning to escape the journals entirely and become a piece of folklore, the kind of thing a journalist can put in a headline and a reader can feel they understand. Meaning, it seemed to say, was not some ineffable property of the human mind. Meaning was geometry. Words were points in space, and the relationships between them were directions you could travel.
Almost every part of that story is slightly off, and the man at its center has spent years gently saying so. To understand why, you have to start not in Mountain View but in Brno, in the Czech Republic, where a doctoral student was trying to make a speech recognizer transcribe broadcast news a little more accurately.
Mikolov, born in 1982 in the Moravian town of Šumperk, did his PhD at the Brno University of Technology under Jan Černocký, in a speech group with the no-nonsense priorities of speech groups everywhere: lower the word-error rate, or the work does not count. The milestone that actually belongs at the head of his story is a 2010 conference paper, presented at INTERSPEECH in Japan, with the deliberately unglamorous title "Recurrent neural network based language model." Its co-authors included Sanjeev Khudanpur of Johns Hopkins University's Center for Language and Speech Processing — the laboratory built by Frederick Jelinek, the man who had taught a generation that the way to model language was to count, not to parse. The recurrent network Mikolov brought to that lab was, in a sense, the next move in a game Jelinek had started. It beat the carefully tuned statistical models the field had spent two decades perfecting, cutting perplexity by roughly half when several networks were combined and shaving word-error rates on the standard benchmarks even when the older models were given far more data to train on.
What made the recurrent model matter was not raw cleverness but memory. A conventional n-gram model sees the world through a narrow sliding window of a few words; Bengio's earlier neural language model widened that window but kept it fixed. Mikolov's network, by contrast, fed its own hidden state back into itself at every step, so that in principle it could carry information forward across an entire sentence. The architecture was simple — an Elman-style recurrent net with a single hidden layer, trained by unrolling it through time — but the implication was large. Context no longer had to be chopped into fixed lengths. The model could, at least in theory, remember.
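The feedback loop described above can be made concrete in a few lines. What follows is a minimal sketch, not Mikolov's code: the class name `ElmanLM`, the tiny layer sizes, and the random untrained weights are all assumptions made for illustration. It shows only the forward pass, in which the previous hidden state is mixed with the current word to produce the next hidden state and a probability distribution over the following word.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    # Subtract the max for numerical stability before exponentiating.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


class ElmanLM:
    """Hypothetical toy model: one hidden layer fed back into itself."""

    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                    for _ in range(rows)]

        self.hidden_size = hidden_size
        self.U = rand_matrix(hidden_size, vocab_size)   # input word -> hidden
        self.W = rand_matrix(hidden_size, hidden_size)  # previous hidden -> hidden
        self.V = rand_matrix(vocab_size, hidden_size)   # hidden -> next-word scores

    def step(self, word_id, h_prev):
        # With a one-hot input, U @ x is simply column `word_id` of U.
        recurrent = matvec(self.W, h_prev)
        h = [sigmoid(self.U[i][word_id] + recurrent[i])
             for i in range(self.hidden_size)]
        return h, softmax(matvec(self.V, h))

    def run(self, word_ids):
        h = [0.0] * self.hidden_size  # empty memory at the start of a sentence
        probs = None
        for word_id in word_ids:
            h, probs = self.step(word_id, h)
        return h, probs


model = ElmanLM(vocab_size=5, hidden_size=4)
h_a, p_a = model.run([0, 1, 2])
h_b, p_b = model.run([3, 4, 2])
print(h_a != h_b)  # same final word, different histories
```

Both sequences end on the same word, yet the hidden states differ because each carries a trace of everything that came before. A bigram model, which sees only the last word, would treat the two cases identically. Training by unrolling through time is omitted here.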
Mikolov himself considers this his real contribution, and has said so pointedly. In a retrospective written years later, he noted that although Word2Vec became his most cited paper, he never thought of it as his most important; the Word2Vec code, he pointed out, had begun life as a stripped-down offshoot of his recurrent-network project. He went further, listing a string of firsts he attributes to that earlier work, and comparing its significance to AlexNet's. These are a participant's claims, and some are contested (the technique of gradient clipping he says he introduced is more commonly credited to a 2013 paper he co-wrote with Razvan Pascanu and Yoshua Bengio), but they are worth hearing as testimony, because they reveal how differently the field's protagonists rank their own achievements from how the public ranks them.
Then came Google, and Word2Vec, and the part of the story that did go viral. In 2013 Mikolov and colleagues released two short papers describing a pair of shallow models, called CBOW and Skip-gram, that learned word vectors by predicting a word from its neighbors (CBOW) or the neighbors from a word (Skip-gram). The trick was austerity. Earlier neural language models had been expensive, weighed down by a large nonlinear hidden layer; Mikolov threw that layer away, leaving something closer to a log-linear model that could be trained on more than a billion words in under a day. The result was not a deeper idea than what had come before but a faster and more usable one, and that turned out to matter enormously. The accompanying software, open-sourced by Google, made high-quality pre-trained word vectors something any researcher could download and drop into their own system. It became, as one writer put it, the model that launched a thousand embedding papers.
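The difference between the two models is easiest to see in the training examples each one is fed. The sketch below is an illustration under simplifying assumptions, not the released Word2Vec code: the function names `skipgram_pairs` and `cbow_examples` are invented here, the window is fixed, and it covers only how examples are cut from a sentence, not how the vectors are then fitted.

```python
def context_window(tokens, i, window):
    """Words within `window` positions of index i, excluding position i."""
    lo = max(0, i - window)
    hi = min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]


def skipgram_pairs(tokens, window=2):
    """Skip-gram: from the center word, predict each neighbor in turn."""
    return [(center, neighbor)
            for i, center in enumerate(tokens)
            for neighbor in context_window(tokens, i, window)]


def cbow_examples(tokens, window=2):
    """CBOW: from the bag of neighbors, predict the center word."""
    examples = []
    for i, center in enumerate(tokens):
        context = context_window(tokens, i, window)
        if context:
            examples.append((context, center))
    return examples


sentence = "the quick brown fox".split()
print(skipgram_pairs(sentence, window=1))
print(cbow_examples(sentence, window=1))
```

With a window of one, "quick" yields the Skip-gram pairs (quick, the) and (quick, brown), and the single CBOW example ([the, brown], quick). Either way the supervision comes entirely from which words sit near which, which is why no labeled data is needed.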
But it did not, contrary to the legend, produce king − man + woman ≈ queen. That result was first published in a separate paper that same spring, written with Wen-tau Yih and Geoffrey Zweig at Microsoft Research, and the vectors it used came not from Word2Vec at all but from Mikolov's recurrent language model — the Brno-and-Baltimore project, not the Google one. The famous sentence appears in that paper's abstract, describing how the male/female relationship had been learned automatically. The Word2Vec paper, posted a few weeks later, actually cites the other paper for the analogy and offers a different example of its own, about the relationship between big and biggest. So the popular shorthand gets it doubly wrong: wrong paper, and wrong family of model.
There is a further, quieter caveat that the folklore omits entirely, and it is the one most worth keeping in mind. The clean answer depends on a rule. When the system computes king − man + woman and searches for the nearest word, it is forbidden from returning any of the three input words. Lift that restriction, as several later analyses have shown, and the nearest vector to king − man + woman is usually king itself; queen comes second. The result is real, but it is also partly an artifact of how the test is scored — the system is being quietly told not to give the obvious boring answer. A 2020 paper made the point with a deadpan title noting that, by the same arithmetic, man is to doctor as woman is to doctor. Analogies, it argued, are a poor instrument for measuring what these vectors actually encode.
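The effect of the scoring rule is easy to reproduce on a toy scale. In the sketch below, the three-dimensional vectors are invented by hand for illustration and are not real embeddings; the function name `analogy` and its `exclude_inputs` flag are likewise assumptions. The vectors are built so that man and woman lie close together, as they tend to in trained models, which means the offset between them barely moves the result away from king.

```python
import math

# Hypothetical hand-picked vectors, for illustration only.
VECTORS = {
    "king":   (0.8, 0.3, 0.5),
    "queen":  (0.7, -0.3, 0.6),
    "man":    (0.1, 0.3, 0.9),
    "woman":  (0.1, 0.0, 0.9),
    "prince": (0.6, 0.3, 0.3),
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, exclude_inputs):
    """Word whose vector is closest (by cosine) to vec(a) - vec(b) + vec(c)."""
    target = tuple(x - y + z
                   for x, y, z in zip(VECTORS[a], VECTORS[b], VECTORS[c]))
    banned = {a, b, c} if exclude_inputs else set()
    candidates = [w for w in VECTORS if w not in banned]
    return max(candidates, key=lambda w: cosine(target, VECTORS[w]))


print(analogy("king", "man", "woman", exclude_inputs=True))   # queen
print(analogy("king", "man", "woman", exclude_inputs=False))  # king
```

With the three input words barred from the answer, the search returns queen. With the bar lifted, king wins, because subtracting man and adding woman shifts the starting point only slightly. The numbers here are contrived to make the point, but the mechanism is the one the later analyses describe.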
It is important to be clear about what Mikolov did and did not do, because the field has a habit of compressing lineages into single names. He did not invent the idea that a word's meaning lives in the company it keeps; that runs back through Zellig Harris and J. R. Firth in the 1950s, through Hinton's distributed representations in the 1980s, through Bengio's neural language model, through Collobert and Weston's pre-trained vectors a few years before Word2Vec. What Mikolov supplied was efficiency, scale, and a toolkit — the engineering that turned a respectable academic idea into a default tool.
The same vectors that delighted everyone soon began to trouble them. In 2016, researchers used Word2Vec's publicly released Google News vectors to show that the geometry encoded ugly regularities along with charming ones, completing the analogy "man is to computer programmer as woman is to" with "homemaker." That paper became a founding text of the field that would later worry, in chapters still to come, about what large language models absorb from the text we feed them. The line runs forward in people as well as in ideas: Ilya Sutskever, a co-author on the second Word2Vec paper, would carry the thread onward through sequence-to-sequence learning and, eventually, to OpenAI. Mikolov himself has aired a grievance about the origins of that seq2seq work, claiming an uncredited early role; the claim was publicly disputed by Quoc Le, and is best read as one man's account rather than settled history.
What survives all the corrections is the idea underneath, and it is genuinely strange that it works at all. You can take the texture of human language — the way Paris sits to France as Tokyo sits to Japan — and freeze it into a cloud of points, and the analogies become subtraction and addition, directions you can walk. The honest version of the story is messier than the headline: it was a recurrent network, not Word2Vec, that first found the king and the queen; the cleanness of the demonstration owes something to a scoring rule; and the man who built it thinks his most famous result was the least of his work. But the payload remains. For a brief and consequential moment, meaning looked like a place you could point to.