The Alignment Problem

In 2017 a simulated robot taught itself to do a backflip. It was a crude thing, a stick figure of joints and segments tumbling in a physics engine, but the way it learned was the point. No engineer had written down what a backflip was, because no engineer could. The motion is easy to recognize and almost impossible to specify; try to define it in the language a machine optimizes against and you will produce a reward function that the machine promptly games, flailing in ways that technically satisfy your equation while looking nothing like a flip. So instead the researchers showed a person two short clips of the robot's attempts and asked, again and again, which looked more like the thing they wanted. From roughly nine hundred of those binary judgments, about an hour of a non-expert's time, the system inferred what its human supervisor was after and learned to perform it. The lesson buried in that tumbling figure would, five years later, turn an unwieldy language model into one of the fastest-growing consumer software products in history.

The model in question was GPT-3, and by 2020 it presented a strange problem. It was vast and fluent and could continue almost any text put in front of it, yet it was not useful in the way people would later expect of an assistant. It had been trained to predict the next word across a sizable fraction of the internet, which made it a brilliant mimic and an unreliable servant: ask it a question and it might answer, or it might generate more questions, or drift into something tangential and plausible. The gap between that raw capability and a model that follows instructions and declines harmful ones was not a gap of scale. It was a gap of alignment, and the two researchers who supplied the machinery to close it had both come up through the same doctoral program at Berkeley, working on adjacent problems that would eventually be fused into a single pipeline.

Paul Christiano supplied the idea. He had been a conspicuously gifted mathematician (a silver medal at the 2008 International Mathematical Olympiad, a mathematics degree from MIT, a quantum-computing thesis at Berkeley under Umesh Vazirani) before turning, while still a student, toward the question of how one might keep an advanced AI system pointed at human intentions. In 2016, the year before he joined OpenAI to lead a language-model alignment effort, he was a co-author on the paper "Concrete Problems in AI Safety," which gave a now-standard name to the failure he was circling: reward hacking, the tendency of a system to maximize a flawed proxy while perverting the spirit of what its designers meant. The remedy he helped develop appeared the following year in "Deep Reinforcement Learning from Human Preferences," a collaboration between OpenAI and DeepMind whose authors included Christiano, Jan Leike, Shane Legg, and Dario Amodei. Its central move was the one behind the backflip. Rather than hand-coding an objective, you learn the objective from human comparisons of short behavioral snippets, training a separate network to predict what people prefer and then optimizing against that learned reward. The technique was strikingly cheap: in the published account, the system reached its goals while receiving human feedback on only a sliver of its interactions with the environment, about a tenth of a percent by one version's reckoning. Human oversight, it turned out, could be applied to state-of-the-art systems without drowning anyone in labor.
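The step of learning a reward from pairwise comparisons can be made concrete with a small sketch. It assumes the common Bradley-Terry formulation, in which the chance that a person prefers one clip over another depends on the difference between the clips' summed predicted rewards. The function names and the plain lists of per-step rewards are illustrative assumptions, not code from the paper.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Modeled probability that a human prefers clip A over clip B.

    Each argument is a list of per-step rewards predicted by the reward
    model for one clip (hypothetical inputs for this sketch).
    """
    total_a = sum(rewards_a)
    total_b = sum(rewards_b)
    # Logistic function of the difference in summed reward, which equals
    # a softmax over the two clip totals.
    return 1.0 / (1.0 + math.exp(total_b - total_a))


def preference_loss(rewards_a, rewards_b, human_prefers_a):
    """Cross-entropy between the model's prediction and the human's choice."""
    p_a = preference_probability(rewards_a, rewards_b)
    p_chosen = p_a if human_prefers_a else 1.0 - p_a
    return -math.log(p_chosen)
```

Minimizing this loss over many comparisons nudges the reward model to score the preferred clip higher; the learned reward then stands in for the hand-written objective that nobody could specify.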

This is the technique that became known as reinforcement learning from human feedback, and it is worth resisting the temptation to call Christiano its inventor. Learning a reward signal from a person's judgments had antecedents — Bradley Knox and Peter Stone's TAMER framework from 2008 and 2009, in which an agent learns from a human's evaluative signal, and a wider body of preference-based reinforcement learning that the 2017 paper itself cited. What that paper did was scale the idea up to deep networks and contemporary benchmarks and show it worked at the frontier. Christiano is most accurately described as one of the principal architects of the method, the person who carried it from a niche idea into the center of how modern models are built, not as the lone author of a thing that sprang from nothing.

John Schulman supplied the engine. His path ran through the U.S. Physics Olympiad team, a physics degree from Caltech, and a Berkeley doctorate in Pieter Abbeel's robotics lab, where he became one of the most consequential figures in deep reinforcement learning. In December 2015, shortly before finishing his PhD, he was among the founding team of OpenAI. His signature contribution arrived in July 2017, a month after the preferences paper: Proximal Policy Optimization, a reinforcement-learning algorithm that he and his co-authors designed to be stable and, above all, simple. Its predecessor, his own Trust Region Policy Optimization, worked well but demanded a complicated constrained optimization that was awkward to implement and to combine with other architectures. PPO replaced that machinery with what the paper called a clipped surrogate objective — a way of preventing each update from pushing the policy too far from where it had been, using only first-order gradients. The authors claimed it kept the reliability of the older method while being much simpler to implement and more general, and the field agreed almost immediately. PPO became the default reinforcement-learning algorithm across robotics, games, and simulation, which is why, when the time came to fine-tune a language model against a learned reward, it was the obvious optimizer to reach for.
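The clipped surrogate objective is compact enough to sketch for a single sample. Here `ratio` is the probability of an action under the updated policy divided by its probability under the old one, and `advantage` estimates how much better than expected that action turned out. The function name and the default `epsilon` of 0.2 are illustrative choices for this sketch; a real implementation averages this quantity over batches and differentiates it.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO-style clipped objective (to be maximized).

    ratio:     new-policy probability / old-policy probability of the action
    advantage: estimate of how much better the action was than expected
    epsilon:   how far the ratio may move from 1 before clipping applies
               (illustrative default)
    """
    # Restrict the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Taking the smaller of the two terms removes any gain from moving
    # the policy further than the clip range allows.
    return min(ratio * advantage, clipped_ratio * advantage)
```

With a positive advantage, raising the ratio beyond 1 + epsilon earns nothing extra, so the update has no incentive to push the policy far from where it was. Only ordinary first-order gradients of this expression are needed.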

The two threads converged in 2022, in a system called InstructGPT. Its recipe, described by a team that included both Schulman and Christiano, ran in three stages: human labelers first wrote demonstrations of good behavior and the base model was fine-tuned to imitate them; those labelers then ranked sets of model outputs, and a reward model was trained to predict their preferences; finally, the model was optimized against that reward using PPO. The result was the single most important data point of the whole arc. A version of InstructGPT with 1.3 billion parameters produced outputs that human raters preferred to those of the 175-billion-parameter GPT-3, a model more than a hundred times larger. Alignment, not raw scale, was what made a model useful. The technique was not free; naive application degraded performance on standard benchmarks, a cost the field came to call the alignment tax, which the authors offset by mixing pretraining gradients back into the objective. Nor was it neutral: the model was shaped by the judgments of roughly forty contractors whose demographics, the paper itself acknowledged, were far from representative of the people who would eventually use it. But it doubled the rate at which the model gave truthful answers on one benchmark, and it worked.
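The third stage, including the pretraining mix that offsets the alignment tax, can be sketched as a per-sample objective. This is a minimal illustration under stated assumptions rather than the paper's code: the penalty for drifting from the fine-tuned starting model is common practice in this kind of training and is added here as an assumption, and the function name, argument names, and coefficient values are all hypothetical.

```python
def rl_stage_objective(reward, logp_policy, logp_reference,
                       pretrain_logp, beta=0.02, gamma=1.0):
    """Per-sample quantity the final training stage tries to increase.

    reward:         reward-model score for the sampled response
    logp_policy:    log-probability of the response under the current model
    logp_reference: log-probability under the supervised starting model
    pretrain_logp:  log-likelihood of a pretraining text sample under the
                    current model (the "pretraining mix" term)
    beta, gamma:    weighting coefficients (illustrative values only)
    """
    # Assumed drift penalty: grows as the policy assigns more probability
    # to the response than the starting model did.
    drift_penalty = beta * (logp_policy - logp_reference)
    # The pretraining term rewards the model for still predicting
    # ordinary text well, which counteracts benchmark regressions.
    return reward - drift_penalty + gamma * pretrain_logp
```

Setting `gamma` to zero recovers plain optimization against the learned reward, which is the configuration in which the benchmark regressions described above would be expected to appear.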

ChatGPT, released on the last day of November 2022, was in OpenAI's own description a sibling of InstructGPT — the same method, with conversational data added. Within two months it was estimated by analysts at UBS to have reached a hundred million monthly users, which they called the fastest ramp they could recall for any consumer application. The machinery beneath that explosion was the backflip, scaled to language: a reward learned from human preference, optimized by PPO.

The people who built it spent the years afterward drifting toward institutions concerned less with capability than with restraint. Christiano left OpenAI in 2021 to found the Alignment Research Center, whose evaluation arm later red-teamed GPT-4 and spun out as the independent nonprofit METR. In 2024 he was named head of AI safety at the U.S. government's new AI Safety Institute. The appointment reportedly drew unease among some staff over his ties to effective altruism, though that account rested on anonymous sourcing and was contested by others who thought him plainly qualified. Schulman left OpenAI for Anthropic in August 2024, then left Anthropic six months later to co-found Thinking Machines Lab. He is routinely described as a chief architect of ChatGPT, a description he has gently refused. "I get too much credit for ChatGPT," he wrote on departing OpenAI, pointing to the broader team behind it, a modesty that sits oddly well on someone who had, in the most literal sense, built the engine that made the thing run.
