Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.
The standard story is that three stubborn men invented backpropagation, the algorithm that teaches neural networks, and were ignored for twenty years before the world caught up. It is a good story, and almost every part of it is wrong. The most reliable witness against it is Geoffrey Hinton himself, who has spent years swatting away the credit. "I've seen things in the press that say that I invented backpropagation," he once remarked, "and that is completely wrong. It's one of these rare cases where an academic feels he has got too much credit for something." The honest version is stranger and more interesting than the legend: the field's defining idea had already been discovered, more than once, by people who never heard the words "neural network," and the three figures the press would eventually crown as the godfathers of deep learning earned their place not by inventing the method but by believing in it when belief had gone out of fashion.
The method in question is simple to state and was, in 1986, anything but new. A neural network is a stack of artificial neurons, each connection carrying a tunable weight; training the network means nudging those weights until the output is right. The difficulty is the one Marvin Minsky named in 1961 as the "credit-assignment problem" — when a system of many interacting parts produces a wrong answer, how do you decide which internal decision deserves the blame? Backpropagation answers this by running the error backward through the network and using the chain rule of calculus to apportion responsibility, layer by layer, to every weight. Jürgen Schmidhuber would later compress the whole enterprise into a sentence: "Machine learning is the science of credit assignment." The mathematics that solves it — the reverse mode of automatic differentiation — was published by the Finnish researcher Seppo Linnainmaa in a 1970 master's thesis that never mentioned neural networks at all. Paul Werbos proposed using it to train them in his 1974 Harvard dissertation. The underlying gradient technique was older still, worked out by the control theorists Henry Kelley and Arthur Bryson around 1960 to optimize flight paths. By the time the famous paper appeared, the idea had a long and largely forgotten pedigree.
That paper — "Learning representations by back-propagating errors," in Nature on the ninth of October, 1986 — carries three names in an order worth remembering: David Rumelhart first, Hinton second, Ronald Williams third. Its genuine contribution was not the algorithm but a demonstration: it showed, cleanly and persuasively, that a multilayer network trained by backpropagation would discover useful internal representations in its hidden layers, learning features no one had hand-coded. That, plus the sheer reach of its publication, is what lodged the method in the field's imagination. The paper had only four references and cited none of its true precursors, an omission Hinton attributes not to bad faith but to simple ignorance of a history scattered across disciplines that did not talk to one another. The fuller argument lived in the two volumes of Parallel Distributed Processing, edited by Rumelhart and James McClelland the same year, which served as the manifesto of the connectionist revival.
The name that should anchor the chapter, then, is one the wider world has largely forgotten. David Rumelhart, a cognitive psychologist at San Diego, was by most accounts the senior intellectual force behind the 1986 work; Hinton has flatly called backpropagation "his invention." Rumelhart developed Pick's disease, a frontotemporal dementia, and died in 2011, which is why his name is absent from the honor that later defined the others. "He died of a brain disease," Hinton said at his Turing lecture. "Otherwise, he would be here instead of me." The myth of three lone heroes is, in no small part, an artifact of an award that caps its recipients at three and refuses to give it posthumously.
While the theorists argued, the most concrete vindication of the whole approach was quietly happening in New Jersey. Yann LeCun, French-born and freshly out of a Toronto postdoc with Hinton, joined AT&T Bell Laboratories in Holmdel in 1988 and set about making backpropagation earn its keep. Building on Kunihiko Fukushima's Neocognitron of 1980 — which had supplied the architecture of shared weights and local receptive fields but trained it by other means — LeCun added end-to-end learning by backpropagation and produced the convolutional networks that came to be called LeNet. By 1989 they were reading handwritten ZIP codes; the line of work culminated in the LeNet-5 system described in 1998. This was not a laboratory curiosity. NCR deployed LeNet-based check readers starting in June 1996, and by 2001 the system was estimated to be reading about twenty million checks a day, roughly a tenth of all the checks written in the United States. The supposedly dead field was, the whole time, parsing the nation's money.
This is the detail that punctures the "winter" so often described as total. Neural networks in the 1990s and early 2000s were unfashionable and underfunded, not extinct. The work was hard to publish — "neural networks" had become almost a slur in grant committees — but Sepp Hochreiter and Schmidhuber published the LSTM in 1997, LeCun's nets read checks, and a quiet apparatus kept the research alive. Its institutional lifeline was the Canadian Institute for Advanced Research, whose program Hinton directed from 2004, knitting together the groups in Toronto, Montreal, and New York. Yoshua Bengio has described the revival as a deliberate act rather than a lucky thaw: the field, he said, had gone out of fashion around 2005 until a small group decided that something needed to be done to revive it because it was "the right thing to do." Part of what they did was rhetorical. "Deep learning" was, in effect, a rebranding of neural networks around 2006, a fresh name for an old idea that had become unsellable under its own.
Schmidhuber has never let the credit question rest, and any honest account has to give him his due without surrendering to him. From his institute in Switzerland he has pressed, year after year, the case that the celebrated trio failed to acknowledge the originators — Linnainmaa for the method, Werbos for its application to networks — and his critique of the 2018 Turing Award lists eight specific priority disputes. On the dates and the prior publications he is substantially right, and most of the field quietly concedes as much. Where opinion divides is on tone and on the deeper question of what deserves reward: the first statement of an idea, or the labor of making it work. The fairest verdict is that both matter, and that a discipline can owe its origins to one person and its arrival to another.
Of the three, Bengio is the one whose work runs straight down the spine of this book. In 2003, with collaborators, he published "A Neural Probabilistic Language Model," which learned distributed representations of words — what we now call word embeddings — and used a neural network to predict the next word in a sequence, directly attacking the curse of dimensionality that had hobbled the old n-gram models. It is the conceptual ancestor of word2vec and of every neural language model since. In 2014 the attention mechanism arrived in a paper by Dzmitry Bahdanau, Kyunghyun Cho, and Bengio on machine translation, and that mechanism is the direct precursor of the Transformer architecture that would, three years later, make the modern era possible. The road from Bengio's hidden layers to ChatGPT is nearly a straight line.
The crowning came late and somewhat by accident of category. In 2012, Hinton and his students Alex Krizhevsky and Ilya Sutskever entered AlexNet in the ImageNet competition and won by a margin so wide it ended the argument — a top-five error of 15.3 percent against 26.2 for the runner-up. Hinton's own summary is characteristically dry: "Ilya thought we should do it, Alex made it work, and I got the Nobel Prize." Google bought their tiny company for forty-four million dollars after Hinton steered an auction between Google and Baidu. In March 2019 the Association for Computing Machinery named Bengio, Hinton, and LeCun the recipients of the 2018 Turing Award, "the Fathers of the Deep Learning Revolution," for breakthroughs that had made deep neural networks central to computing. The man who had not sat down since 2005 because of a ruined back, the great-great-grandson of George Boole whose middle name was Everest, had finally been told he was right.
And then, almost immediately, the victors grew afraid of the victory. In May 2023 Hinton left Google to speak freely about the dangers of the technology he had nurtured, telling interviewers he had once thought human-level AI was thirty to fifty years away and no longer believed it. He consoled himself with the usual excuse, that if he had not done it someone else would have. Bengio turned his energy toward safety, signing the statement that placed the risk of human extinction from AI alongside pandemics and nuclear war. The gardeners who had tended the unfashionable seedlings through the long winter lived to see them overgrow the whole landscape — and to wonder, having waited so long for spring, whether they should have.