Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.

The Scaling Bet

Jared Kaplan was not, by training, supposed to be the man who told the world how artificial intelligence would grow. He was a theoretical physicist, a string theorist whose work ran through quantum gravity, holography, and the abstract machinery of the conformal bootstrap, and who had held a professorship at Johns Hopkins since 2012. When he began collaborating with OpenAI in 2019, he brought along the single instinct that physicists are taught to trust above almost everything else: that beneath the noise of a complicated system, if you look at it from far enough away, there is often a clean and simple law. The question he set out to ask was, by his own cheerful admission, the dumbest one available. Everyone in the field already knew that bigger models tended to perform better. Kaplan wanted to know how much better. Not in spirit, but in numbers — a curve you could plot, extrapolate, and bet on.

The answer arrived in January 2020, in a paper titled "Scaling Laws for Neural Language Models," led by Kaplan and Sam McCandlish and signed by eight further OpenAI researchers. Across more than two hundred transformer models spanning seven orders of magnitude of compute, they found that a language model's loss — its error in predicting the next token — fell as a smooth power law in three quantities: the number of parameters, the size of the training dataset, and the amount of compute spent. The relationships were almost eerily precise. Performance, the authors wrote, "depends most strongly on scale," and only very weakly on the architectural choices that engineers had agonized over for years; whether a network was deep or wide barely mattered once you fixed its total size. What had been an intuition passed around in conference hallways — bigger keeps working — became, in Kaplan's hands, something closer to thermodynamics. You could now predict, before spending a dollar, roughly how good a model of a given size would be.
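What "a curve you could plot, extrapolate, and bet on" means in practice can be sketched in a few lines. A power law is a straight line on log-log axes, so one can fit a line to a handful of small models and read off a forecast for a much larger one. The sketch below is only an illustration: the functional form L(N) = (N_c / N)^alpha covers the parameter-count law alone, and the constants `n_c` and `alpha` are made-up values, not figures taken from the paper.

```python
import math

def power_law_loss(n_params, n_c=1e14, alpha=0.08):
    """Loss from L(N) = (n_c / N) ** alpha. Both constants are illustrative."""
    return (n_c / n_params) ** alpha

def fit_power_law(sizes, losses):
    """Fit a straight line to (log N, log L) by least squares; return (alpha, n_c)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    # log L = alpha * log(n_c) - alpha * log N, so slope = -alpha.
    alpha = -slope
    n_c = math.exp(intercept / alpha)
    return alpha, n_c

# "Train" four small hypothetical models, then forecast one 100x larger.
small_sizes = [1e6, 1e7, 1e8, 1e9]
observed = [power_law_loss(n) for n in small_sizes]
alpha_hat, n_c_hat = fit_power_law(small_sizes, observed)
predicted = (n_c_hat / 1e11) ** alpha_hat

print(f"fitted alpha: {alpha_hat:.4f}")
print(f"forecast loss at 1e11 parameters: {predicted:.4f}")
```

Because the synthetic losses here are noiseless, the fit recovers the constants exactly; real measurements scatter around the line, and the forecast is only as good as the assumption that the line keeps going.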

The paper closed with a recommendation that would shape the next two years of the field. Larger models, it argued, are far more sample-efficient, so the optimal way to spend a fixed compute budget is to train very large models on a relatively modest amount of data and stop well short of convergence. Most new compute, in other words, should be poured into raw model size; data needed to grow only slowly to keep pace. It was a recipe, and it was about to be cooked.

Four months later, in May 2020, much the same team published the dish. "Language Models are Few-Shot Learners" introduced GPT-3: 175 billion parameters, ninety-six layers, a context window of 2,048 tokens, trained on roughly 300 billion tokens drawn mostly from a filtered slice of the open web, with smaller, higher-quality helpings of books and Wikipedia weighted more heavily in the mix. It was an order of magnitude larger than any comparable model before it, and the scaling-laws paper had quite literally guided its design: the authors credit that earlier work with helping them predict where to put the parameters and the data. The headline result was not merely that GPT-3 was good at language. It was that the model could perform tasks it had never been trained on simply by being shown a handful of examples in its prompt, with no weight updates of any kind. The paper called this few-shot, or in-context, learning, and it was the closest thing the field had yet seen to a machine that learned how to learn on the fly. Human evaluators asked to tell GPT-3's short news articles from human-written ones picked correctly only about half the time, no better than a coin flip.

The reception was a genuine cultural event, the moment the wider public first felt the ground move. When OpenAI opened a gated API in June 2020 — a deliberately cautious rollout, shaped by the anxious staged release of GPT-2 the year before — outsiders fed it poetry, code, and philosophy and recoiled at what came back. The philosopher David Chalmers called it one of the most interesting and important AI systems ever produced; a text-adventure game, AI Dungeon, became for many the first hands-on taste of the thing. And yet the most clarifying voice in the noise belonged to OpenAI's own chief executive. "The GPT-3 hype is way too much," Sam Altman wrote that July. It was impressive, he allowed, but it still made silly mistakes and had serious weaknesses; the technology was, he insisted, just a very early glimpse.

Altman's caution turned out to be prophetic in a way he did not intend, because the recipe that produced GPT-3 was, on one crucial axis, wrong. For two years, the Kaplan prescription — spend your compute on size, starve it of data — was the field's reigning wisdom, and the giants that followed obeyed it. Then, in 2022, a team at DeepMind led by Jordan Hoffmann published "Training Compute-Optimal Large Language Models" and quietly demolished half of it. Training over four hundred models, they found that model size and data should be scaled in roughly equal proportion, not lopsidedly toward size — a rule of thumb of about twenty training tokens for every parameter. To prove it, they built a model called Chinchilla, with 70 billion parameters trained on far more data, and watched it outperform a 280-billion-parameter model trained under the old recipe at the same compute budget. The implication was sharp and a little humbling: GPT-3 and its enormous contemporaries had been the right size for nothing in particular. They were oversized and undertrained — built, as one might put it, at the right price and the wrong shape.
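The arithmetic behind "oversized and undertrained" can be worked through roughly. The sketch below takes the twenty-tokens-per-parameter rule of thumb from the text and adds one assumption that is not in the text: the common approximation that training costs about 6 floating-point operations per parameter per token, so compute C ≈ 6 × N × D. The function names are hypothetical, and the result is a back-of-the-envelope estimate, not the DeepMind team's fitted law.

```python
import math

TOKENS_PER_PARAM = 20        # rule of thumb quoted in the text
FLOPS_PER_PARAM_TOKEN = 6    # assumption: C is roughly 6 * N * D

def training_flops(n_params, n_tokens):
    """Approximate training compute for a model of n_params on n_tokens."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal(budget_flops):
    """Split a compute budget so that tokens = 20 * parameters.

    With D = 20 * N, C = 6 * N * D becomes C = 120 * N**2.
    """
    n_params = math.sqrt(budget_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    return n_params, TOKENS_PER_PARAM * n_params

# GPT-3 as described earlier: 175 billion parameters, about 300 billion tokens.
gpt3_flops = training_flops(175e9, 300e9)
opt_params, opt_tokens = compute_optimal(gpt3_flops)

print(f"GPT-3 tokens per parameter: {300e9 / 175e9:.2f}")
print(f"same budget, rebalanced: {opt_params / 1e9:.0f}B parameters, "
      f"{opt_tokens / 1e12:.2f}T tokens")
```

Under these assumptions, GPT-3's budget would have bought a model of roughly 51 billion parameters trained on about a trillion tokens: less than a third the size, fed more than three times the data.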

What rescues this from being a simple story of error is that the field eventually explained, with some precision, why the two recipes had disagreed. By 2024, two separate reconciliation papers traced the gap to technical bookkeeping rather than any deep disagreement about how learning works. Kaplan had counted only the non-embedding parameters of his models and had fit his curves at relatively small scale; once total parameters were used, much of the discrepancy dissolved. Others isolated the remaining differences to the computational cost of the final layer, the length of the learning-rate warmup, and the careful tuning of the optimizer at different scales — correct for those, and the Kaplan and Chinchilla results snapped into agreement. The lesson the chapter wants to leave is not that the original work was sloppy, but that even a genuine empirical law can mislead when the fine print is read carelessly.
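Why the embedding bookkeeping matters most at small scale can be seen with a rough parameter count. The sketch below rests on assumptions that are not in the text: a standard transformer with about 12 × d_model² weights per layer, a vocabulary of 50,257 tokens, position embeddings ignored, and a width of 12,288 for the GPT-3-sized shape. The function names and the two smaller shapes are hypothetical.

```python
VOCAB_SIZE = 50_257  # assumption: a GPT-2-style vocabulary, not stated in the text

def non_embedding_params(n_layer, d_model):
    """Approximate weights in the transformer layers themselves.

    Assumes about 12 * d_model**2 per layer (attention plus a 4x-wide
    feed-forward block).
    """
    return 12 * n_layer * d_model ** 2

def embedding_params(d_model, vocab_size=VOCAB_SIZE):
    """Weights in the token-embedding table (position embeddings ignored)."""
    return vocab_size * d_model

def embedding_share(n_layer, d_model):
    """Fraction of total parameters that sit in the embedding table."""
    emb = embedding_params(d_model)
    return emb / (emb + non_embedding_params(n_layer, d_model))

# (layers, width) for three hypothetical shapes, smallest to largest.
shapes = {"tiny": (2, 128), "small": (12, 768), "GPT-3-sized": (96, 12288)}
for name, (n_layer, d_model) in shapes.items():
    share = embedding_share(n_layer, d_model)
    print(f"{name:12s} embeddings are {share:.1%} of all parameters")
```

On these numbers the embedding table is most of a tiny model and a rounding error in a GPT-3-sized one, so a curve fit mostly to small models shifts noticeably depending on whether those weights are counted.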

A second open question refuses to close so neatly. GPT-3 had shown that certain abilities — multi-digit arithmetic, for instance — seemed to appear only once a model crossed some threshold of size, absent in smaller versions and suddenly present in larger ones. By 2022 this had a name, emergence, framed by Jason Wei and colleagues as abilities that cannot be predicted by extrapolating from smaller models. It was a thrilling idea, and it was almost immediately contested. In 2023, Rylan Schaeffer and his coauthors argued that emergence might be a mirage — an artifact of choosing harsh, all-or-nothing metrics that turn a smooth underlying improvement into an apparent cliff. Use a gentler measure, they showed, and the jumps often soften into predictable curves. The debate is not settled; later work suggests some genuine discontinuities survive even careful measurement. Whether these leaps are real features of large models or tricks of how we score them remains, for now, an honest unknown.
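The mirage argument is easy to reproduce with toy numbers. Suppose per-token accuracy climbs steadily with model size, but a task is scored as correct only when every token of a ten-token answer is right. The sketch below assumes tokens are right or wrong independently, and the accuracies are invented for illustration, not measurements from any model.

```python
def exact_match_rate(per_token_accuracy, answer_length):
    """Chance that every token of an answer is right, assuming independent tokens."""
    return per_token_accuracy ** answer_length

ANSWER_LENGTH = 10  # hypothetical answer length in tokens

# Hypothetical per-token accuracies for six models of increasing size.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
exact = [exact_match_rate(p, ANSWER_LENGTH) for p in per_token]

for p, e in zip(per_token, exact):
    print(f"per-token accuracy {p:.2f} -> exact-match score {e:.4f}")
```

The per-token column rises in even steps, while the exact-match column sits near zero for the first few models and then climbs steeply. The underlying improvement is the same in both; only the scoring rule turns it into an apparent cliff.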

It is worth saying, too, that Kaplan and his colleagues were not the first to glimpse the curve. In 2017, a little over two years before the OpenAI papers appeared, a Baidu team led by Joel Hestness had already found power-law scaling of error across machine translation, language, vision, and speech: the same shape, in a different lab, largely forgotten in the popular telling. The road to GPT-3 was longer than the legend allows.

Perhaps the most striking thing about the whole episode is how little anyone could explain it. The most powerful recipe in modern AI had been discovered empirically, by measuring rather than understanding, and the people who found it knew this. Asked later why scaling works so smoothly, Dario Amodei — the physicist who had pushed the bet inside OpenAI and would soon leave to co-found Anthropic with Kaplan and others — gave the only honest answer available. We still don't know, he said. The law was real. The reason for it had been left, like so much else, for the next chapter.

