In a workshop room in Florence in October 2012, a graduate student named Alex Krizhevsky presented a result that most of the computer-vision researchers in attendance were not prepared to believe. The contest he had entered, the ImageNet Large Scale Visual Recognition Challenge, asked machines to sort photographs into a thousand categories (leopard, container ship, mushroom, motor scooter) using roughly 1.2 million training images. For years the field had advanced by increments, shaving perhaps a single percentage point off its error rate from one year to the next. Krizhevsky's system had not shaved a point. It had cut the top-five error rate, the share of images for which the correct label was not among the system's five best guesses, to 15.3 percent, against 26.2 percent for the next-best entry. That was a margin of nearly eleven points, the kind of discontinuity that does not occur in a mature research field unless something fundamental has shifted underneath it.
The veterans in the room were skeptical, and they had earned the right to be: neural networks had promised much before and delivered mostly embarrassment. But Yann LeCun was present, and he recognized what he was looking at. He called it a turning point, and he was right. Before that autumn, almost none of the leading vision papers relied on neural networks; within a couple of years, almost all of them would. To understand why a single image-recognition result mattered far beyond computer vision — why it sits, in a sense, one degree of separation from ChatGPT — you have to follow two people who arrived at that room from opposite directions, neither of whom set out to build anything resembling a language model.
The first was Fei-Fei Li, who had spent the previous six years arguing an unfashionable thesis. Born in Beijing in 1976 and raised in Chengdu, she emigrated to New Jersey at sixteen, worked weekends in her family's dry-cleaning business, and went on to study physics at Princeton and earn a doctorate at Caltech under Pietro Perona. By around 2006, watching her colleagues chase ever-cleverer algorithms, she had concluded they were optimizing the wrong thing. The bottleneck in computer vision, she believed, was not the models but the absence of data — the lack of any large, messy, real-world catalogue of labeled images from which a system could actually learn what the world looks like. Her ambition was almost comically large. She wanted, as she put it, to map out the entire world of objects, and she took the scale of that target partly from the psychologist Irving Biederman, whose back-of-the-envelope estimate from the 1980s held that a person recognizes something on the order of thirty thousand distinct kinds of things.
To organize that universe, Li borrowed a scaffold already standing on her own campus: WordNet, George Miller's hierarchical map of the English language, which sorted nouns into nested categories. ImageNet would simply hang pictures on WordNet's branches. The labeling was the hard part, and the breakthrough was unglamorous — Amazon's Mechanical Turk, the crowdsourcing marketplace, which let an army of anonymous workers sort images by the millions and briefly made an academic project one of the platform's largest users. The 2009 paper that introduced the database, with Jia Deng as lead author, described a collection already running to millions of images, with the stated intention of populating most of WordNet's tens of thousands of categories. The following year Li's team launched the annual competition that would turn the dataset into a proving ground. For two years the contest produced respectable, incremental progress. Then SuperVision entered.
That was the name under which Krizhevsky's team actually competed; "AlexNet" is a label the field applied afterward, and it is worth remembering that the 2012 paper never used it. Krizhevsky, born in Ukraine and raised in Canada, was a doctoral student in Geoffrey Hinton's lab at Toronto, where he had already built the small image datasets that other researchers would lean on for years. His teammate on the network was a fellow Hinton student, Ilya Sutskever, who pushed the conviction that ImageNet was the problem worth attacking. The architecture they built was not, in its bones, new. It was a convolutional neural network, the design Yann LeCun had refined through the late 1980s and 1990s for reading handwritten digits, whose own ancestry ran back through Kunihiko Fukushima's Neocognitron to ideas about the visual cortex. What Krizhevsky did was make it deeper — eight learned layers, some sixty million parameters — and train it on a scale no one had managed, using engineering tricks that each bought a crucial margin: rectified linear units that let the network learn several times faster, dropout to keep it from memorizing its training set, and aggressive augmentation of the images themselves.
The decisive ingredient was hardware. The network was too large to fit on a single graphics card, so Krizhevsky split it across two consumer NVIDIA GTX 580 GPUs, each with three gigabytes of memory and a retail price around five hundred dollars, and wrote the training code by hand in CUDA. It ran for roughly five to six days. That detail is easy to romanticize, and Sutskever has since waved off the legend that a trunk full of gaming cards changed the course of history; he says he simply bought them online. But the substance survives the deflation: AlexNet demonstrated that a pair of off-the-shelf GPUs could train a deep network to record-setting performance, and that demonstration standardized GPU-accelerated deep learning more or less overnight.
Honesty requires noting that they were not first to the underlying idea. At the Swiss lab IDSIA, Dan Cireșan and Jürgen Schmidhuber had been training convolutional networks on GPUs since 2011, winning a string of vision contests — including a traffic-sign benchmark at superhuman accuracy — and publishing a multi-column network months before AlexNet appeared, a line of work the 2012 paper explicitly cites. The fair verdict is that Cireșan and Schmidhuber got there first on GPUs, but that AlexNet, by winning on ImageNet's scale and in ImageNet's spotlight, was the result that galvanized the entire field. Both things are true, and the chapter loses nothing by saying so.
What happened next is the part that connects this computer-vision milestone to everything that followed. Hinton, Krizhevsky, and Sutskever formed a tiny company, DNNresearch, with no product and three employees, and Hinton ran a quiet auction for it during the December 2012 conference at Lake Tahoe. Google, Microsoft, Baidu, and the young DeepMind all bid; by Cade Metz's account the price opened at twelve million dollars, climbed in million-dollar increments over several hours, and reached forty-four million before Google won. Years later, after a Nobel Prize had landed on the whole enterprise, Hinton would compress the entire story into a single dry line: Ilya thought we should do it, Alex made it work, and I got the Nobel Prize. The auction is the moment the proof became people. Three researchers who had shown that depth plus data plus GPUs worked were now inside the company that could afford to scale all three without limit.
One of those three is the through-line. Ilya Sutskever went to Google, helped develop sequence-to-sequence learning, and in 2015 left to co-found OpenAI, where as chief scientist he would oversee the research that produced the GPT models and, eventually, ChatGPT. The young man who insisted on training on ImageNet was, three years later, co-founding the lab that would, roughly a decade after Florence, put a language model in front of a hundred million people. The image-recognition win was never a language milestone in itself. It was the event that supplied the proof, the playbook, and the personnel.
Krizhevsky, the man who actually made it work, is the quiet counterpoint. He joined Google with the others, worked on photos and self-driving cars, and left in 2017, reportedly having lost interest in the work — a phrasing that comes to us secondhand and that he has never, characteristically, bothered to confirm. He keeps a famously low profile and has largely stayed out of the field he detonated. There is something fitting in that. The person who lit the fuse walked away from the fire, leaving the others to follow it wherever it led — which, within a decade, was very nearly everywhere.