Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.

The Fork

In June 2018, a small team at OpenAI posted a blog entry and a PDF that almost no one outside the field noticed. It described a model called the Generative Pre-trained Transformer, trained to do nothing more glamorous than guess the next word in a sentence, over and over, across seven thousand unpublished novels. Four months later, in October, a team at Google uploaded a paper to arXiv that the field could not stop talking about. It was called BERT, and within weeks it had rewritten the leaderboards that researchers used to measure whether a machine understood language at all. The two releases were not a quarrel. The people behind them worked at different companies, pursued different bets, and as far as anyone has ever reported, bore each other no ill will. But standing where we stand now, it is hard not to read those four months as the moment the road forked—one path leading to the model everyone admired, the other to the model that would eventually swallow the world.

Both paths began at the same place. A year earlier, eight researchers had introduced the Transformer, an architecture built around a mechanism called attention that let a model weigh every word against every other word at once. The original Transformer had two halves: an encoder that read an input and built a rich internal representation of it, and a decoder that generated an output one token at a time. The fork of 2018 is best understood as two teams each walking off with one half. Google's BERT was the encoder, set loose on its own. OpenAI's GPT was the decoder. It is worth saying plainly, because the popular version of this story so often gets it wrong: neither Jacob Devlin nor Alec Radford invented attention, and neither invented the Transformer. They inherited it. What they did was decide which half of it to believe in.
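For readers who want to see what "weigh every word against every other word" amounts to, here is a minimal sketch of attention in plain Python. It is a deliberate simplification and rests on assumptions not drawn from the history above: the function names and toy vectors are invented for illustration, and the learned query, key, and value projections of a real Transformer are left out, so each word's vector plays all three roles.

```python
import math

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to sum to 1.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Toy single-head attention: queries = keys = values = the inputs.

    Assumption: no learned projections, no multiple heads. This only shows
    the 'every word against every other word' weighting step.
    """
    dim = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Score this word against every word, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        # The new representation is a weighted blend of all the words.
        mixed = [
            sum(w * value[i] for w, value in zip(weights, vectors))
            for i in range(dim)
        ]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up two-dimensional "word" vectors.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(toy_vectors)
```

Each row of `weights` sums to one: it is the share of attention one word pays to every word in the sentence, which is all the mechanism is at this level of detail.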

Devlin's bet was the encoder, and the reason it worked was a training trick of disarming simplicity. He had his model read a sentence with roughly fifteen percent of its words hidden, then asked it to guess the missing words from everything around them: the words to the left and the words to the right at the same time. This is what "bidirectional" means, and it was the source of BERT's power. A model that can see both directions builds a fuller picture of meaning than one that reads strictly left to right. The idea of blanking out words and predicting them was not new; psychologists had been doing a version of it since Wilson Taylor's "cloze" readability tests in the 1950s, and BERT's own paper credited that lineage rather than claiming the idea as its own. What was new was making it work at the scale of a deep Transformer, and the results were startling. BERT set a new state of the art on eleven separate tasks at once. It pushed the score on GLUE, the field's main benchmark, to 80.5, a jump of nearly eight points in a single stroke. When it was formally presented at the NAACL conference the following June, it won Best Long Paper. For something like a year afterward, "fine-tune BERT" was simply what serious natural-language work meant, and a small family of successors with names like RoBERTa and ALBERT grew up around it. In 2018 and 2019, if you had asked which of the two forks was the landmark, almost everyone would have said BERT.
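The masking trick is simple enough to sketch in a few lines. What follows is an illustration, not BERT's actual preprocessing: the function name, the example sentence, and the fixed seed are assumptions made for the example, and it hides whole words where BERT works on subword pieces. It also skips a refinement from the BERT paper, in which some chosen positions are swapped for a random word or left unchanged instead of being blanked.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the tokens; return the masked copy and answers.

    Simplified sketch: every chosen position becomes MASK. The names and the
    whole-word tokenization are illustrative assumptions.
    """
    rng = random.Random(seed)
    n_to_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]  # what the model is asked to recover
        masked[pos] = MASK          # what the model actually sees
    return masked, targets

sentence = "the model guesses each hidden word from both sides of the gap".split()
masked, targets = mask_tokens(sentence)
```

The model is shown `masked` and scored only on the positions in `targets`. Because the words on both sides of each blank stay visible, the guess can draw on left and right context together.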

The other fork was quieter by temperament. GPT-1 had arrived four months before BERT and made a softer claim: take a decoder, train it to predict the next word on ordinary text, then fine-tune it for whatever you need. It improved on the state of the art for nine of twelve tasks, a respectable showing, but it did not dominate, and it carried none of BERT's institutional weight. It was never peer-reviewed—not published at a conference, not even posted to arXiv, just a blog post and a technical report. The contrast is almost too neat to be true: the model that would seed the entire frontier of the coming decade was, by the credentialing standards of its own moment, the lesser of the two. Its successor, GPT-2, scaled the same recipe up to 1.5 billion parameters in early 2019 and showed it could handle tasks it had never been trained on. OpenAI chose not to release the full model at first, citing concerns about malicious uses and rolling it out in stages over the following months. The press distilled this into the headline that GPT-2 was "too dangerous to release," a phrase OpenAI never actually used, and plenty of researchers at the time grumbled that the caution was less about safety than about publicity—that the model was no great algorithmic leap and the withholding functioned partly as theater. The full version shipped that November, and the world kept turning.

The two men at the center of the fork were studies in contrast, and the contrast ran the opposite way from what you might expect. Devlin had the pedigree. He had come to Google from Microsoft Research, where he had led the move to neural machine translation, and after BERT his career became a kind of barometer of the talent war between the labs. In early 2023 he left Google for OpenAI, reportedly after warning Google's leadership that the team building its Bard chatbot appeared to be training on data derived from ChatGPT, which he felt crossed a line. Google flatly denied the allegation. By that June, Devlin had returned to his old job and was working on Gemini. The author of the encoder model spent the year shuttling between the two great houses of the decoder era.

Radford, by contrast, had no PhD at all. He had attended Olin College and co-founded a startup as a student before joining OpenAI in 2016, and he became the through-line of the entire GPT lineage: lead author of GPT-1 and GPT-2, a credited contributor to GPT-3 and GPT-4, and later the originator of the systems that became CLIP, DALL·E, and Whisper. He was famously hard to find in public—much of his work prototyped in Jupyter notebooks, his last public tweet posted in 2021, interviews almost nonexistent. When he left OpenAI in December 2024 to pursue independent research, Sam Altman wrote that he was "a genius at the level of einstein" and lamented that he was nowhere near as well known as he should be. The rare glimpse Radford has given of his own process comes from a magazine profile, where he described testing the first GPT as a loop of skepticism—he would tell himself the thing surely "won't be able to do x," code up a quick evaluation, and discover, again and again, that it could.

That loop is, in miniature, the whole argument of the decade to come. The reasons the decoder fork would eventually pull ahead were real and technical: a model trained only to predict the next token can learn from any raw text without elaborate setup, can generate efficiently because each new word need only attend to the words already written, and can fold every task—understanding and generation alike—into the single act of completing a sequence. There was even a quiet observation among OpenAI's researchers that at sufficient scale the bidirectional advantage that made BERT so strong simply stopped mattering for most things. But none of that was obvious in 2018. The decoder's victory was contingent on scale, and the scale that would prove it had not yet been built. For now the fork was genuinely open, and the model everyone admired was the one that read in both directions. The verdict would have to wait.
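The decoder's side of the argument can be sketched the same way. The snippet below is a toy illustration with invented function names, not code from either lab. It shows two things described above: raw text turns into training examples with no labeling at all, and a left-to-right model lets each position see only what has already been written, where a bidirectional one lets every position see everything.

```python
def next_token_pairs(tokens):
    """Turn raw text into (context, next word) examples with no annotation."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def causal_mask(n):
    """Decoder-style visibility: position i may attend to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Encoder-style visibility: every position may attend to every position."""
    return [[True] * n for _ in range(n)]

pairs = next_token_pairs(["the", "road", "forked"])
# pairs == [(["the"], "road"), (["the", "road"], "forked")]

for row in causal_mask(3):
    print(["x" if visible else "." for visible in row])
```

The lower triangle of the causal mask is why generation is cheap: adding a word adds one new row and leaves every earlier row untouched, so nothing already computed needs to be revisited.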

=> The Scaling Bet
