
The Memory Problem

In 1991, at the Technical University of Munich, a student named Josef Hochreiter — everyone called him Sepp — finished his diploma thesis and, almost without anyone noticing, explained why neural networks could not remember. The thesis carried a forgettable academic title, Untersuchungen zu dynamischen neuronalen Netzen, and it was written in German, which is one reason the rest of the field would take years to catch up to it. His supervisor, a young researcher named Jürgen Schmidhuber, would later call it one of the most important documents in the history of machine learning. For a long time it was also one of the most ignored.

The problem Hochreiter had diagnosed was easy to state and brutal in its consequences. Recurrent neural networks were supposed to be the architecture for sequences — speech, handwriting, language, anything that unfolds in time — because they carried information forward in a hidden state, processing one step after another. To train them, researchers unrolled the network across time into something very deep, one layer for every moment in the sequence, and pushed the error signal backward through all of it. Hochreiter showed mathematically what happened to that signal on the way back. At each step it was multiplied by a factor tied to the network's weights. If the factor was smaller than one, the signal shrank geometrically until, by the time it reached the early steps, it had effectively vanished. If the factor was larger than one, it blew up into wild, useless oscillations. Either way, the network could not learn which distant event in the past mattered to the present. In practice, a standard recurrent network could not reliably connect things more than about ten time steps apart. This was the quiet wall that the whole project of sequence learning had been hitting, and now there was a name for it.
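The arithmetic behind that wall is easy to reproduce. What follows is a toy sketch in Python, not Hochreiter's derivation: it assumes the per-step factor is a single constant number, whereas in a real network it is a product of matrices that changes from step to step. The function name backpropagated_signal and the sample factors are illustrative choices, not anything taken from the thesis.

```python
def backpropagated_signal(factor: float, steps: int, initial: float = 1.0) -> float:
    """Size of an error signal after being multiplied by `factor` once per time step.

    Toy assumption: the factor is one constant scalar for every step.
    """
    return initial * factor ** steps


# Compare a shrinking factor, a factor of exactly one, and a growing factor.
for factor in (0.9, 1.0, 1.1):
    after_10 = backpropagated_signal(factor, 10)
    after_100 = backpropagated_signal(factor, 100)
    print(f"factor={factor}: after 10 steps {after_10:.3g}, after 100 steps {after_100:.3g}")
```

With a factor of 0.9 the signal is down to roughly a third of its size after ten steps and to a few hundred-thousandths after a hundred. With 1.1 it has grown past ten thousand by the hundredth step. Only a factor of exactly one leaves it intact.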

Three years later, in 1994, Yoshua Bengio, Patrice Simard, and Paolo Frasconi arrived at essentially the same conclusion from a different direction, publishing "Learning long-term dependencies with gradient descent is difficult." That two independent groups had identified the same obstacle made it harder to dismiss as an idiosyncrasy, and the two strands of work would eventually be braided together in a 2001 chapter co-authored by Hochreiter, Bengio, Frasconi, and Schmidhuber. But independent confirmation of a problem is not a solution, and by then the solution had, in fact, already been written down.

It appeared in 1997 in the journal Neural Computation, under a name that has since become one of the most recognizable in the field: "Long Short-Term Memory," by Hochreiter and Schmidhuber. The abstract stated the agenda without ornament, promising to review the 1991 analysis and then address it with a new method. The trick at the center of LSTM was something they called the constant error carousel. Inside each memory cell sat a unit with a self-connection of weight one. Because the error was multiplied by one at every step rather than by some shrinking fraction, it could travel backward across hundreds or even thousands of steps without fading away — the paper claimed it could bridge time lags in excess of a thousand discrete steps. Around that protected core, multiplicative gates learned to manage the traffic: an input gate decided when new information was allowed to write into the cell, and an output gate decided when the cell's contents were released to the rest of the network. It is worth being precise here, because popular accounts usually are not: the original 1997 design had only those two gates. The forget gate, which lets a cell learn to reset itself and which is now treated as a standard part of the architecture, came later, added by Felix Gers, Schmidhuber, and Fred Cummins around the turn of the decade in work pointedly titled "Learning to Forget."
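The two-gate mechanism can be sketched in a few lines of Python. This is a simplified scalar illustration under stated assumptions, not the 1997 paper's formulation. The gates here are driven by hypothetical hand-set "write" and "read" signals with fixed weights (40 and -20), where a real LSTM learns its gate weights and feeds the gates from inputs and recurrent connections. The tanh function stands in for the paper's squashing functions, and the names memory_cell_step and run_demo are invented for the example.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def memory_cell_step(state: float, value: float, write: float, read: float):
    """One step of a scalar memory cell with an input gate and an output gate.

    There is no forget gate, as in the original two-gate design.
    Assumption: gate weights are fixed by hand (40, -20) rather than learned.
    """
    input_gate = sigmoid(40.0 * write - 20.0)   # near 1 when write=1, near 0 when write=0
    output_gate = sigmoid(40.0 * read - 20.0)   # near 1 when read=1, near 0 when read=0
    candidate = math.tanh(value)                # squashed new information
    # Self-connection of weight one: the old state is carried over unscaled.
    new_state = 1.0 * state + input_gate * candidate
    output = output_gate * math.tanh(new_state)
    return new_state, output


def run_demo(lag: int = 1000):
    """Store one value, ignore `lag` distractor inputs, then read the cell."""
    state, _ = memory_cell_step(0.0, value=0.5, write=1.0, read=0.0)
    for step in range(lag):
        distractor = math.sin(step)  # arbitrary noise the cell should not store
        state, _ = memory_cell_step(state, value=distractor, write=0.0, read=0.0)
    return memory_cell_step(state, value=0.0, write=0.0, read=1.0)


final_state, final_output = run_demo()
print(final_state, math.tanh(0.5))  # the stored value survives the lag almost unchanged
```

Because the state is multiplied by one at each step, the derivative of the new state with respect to the old state is also one, which is the property the carousel relies on. The same arithmetic shows why a forget gate was later useful: with the input gate open, this cell can only accumulate, never reset.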

The reception was tepid, and the reasons have hardened into legend. It is widely repeated that the foundational LSTM work was rejected from the field's flagship conference, NIPS, and the chronology is genuinely tangled: the name was coined in a 1995 technical report, a narrower companion paper did appear in the NIPS 1996 proceedings, and the full paper landed in Neural Computation. The most coherent reading, supported by the recollection of researchers who were close to the field, is that the major paper was turned away and rerouted to the journal while the smaller one squeaked through — though the exact submission and year have never been pinned to a primary document, and the rejection is best described as reported rather than fully established. Whatever the precise paper trail, the verdict of the moment was clear enough. The idea had arrived before the hardware and the data that would prove it right.

Vindication took the better part of a decade and ran through a chain of students. Alex Graves, working in Schmidhuber's lab at IDSIA in Switzerland, was central; with colleagues he introduced connectionist temporal classification in 2006, which let an LSTM train on raw, unsegmented sequences like audio and handwriting. In 2009, a network built on that approach became the first recurrent net to win international handwriting-recognition competitions. Then the dam broke. By 2015 Google was using LSTM in voice recognition and announcing dramatic cuts in transcription errors; in 2016 its new neural machine translation system, an LSTM architecture, reported reducing translation errors by an average of around sixty percent and triggered the overhaul of Google Translate that millions of people felt overnight. The same family of models ended up inside Siri and Alexa. The 1997 paper, ignored at birth, became one of the most cited articles in the history of computing. It would reign as the dominant way to model sequences until 2017, when the Transformer arrived and, in time, displaced it.

The other half of this story belongs to the advisor, and it is a stranger one. Jürgen Schmidhuber, born in Munich in 1963, has spent decades as the field's most relentless keeper of its own record. Various outlets have called him the father of modern AI, of generative AI, even of deep learning — though he himself rejects the last title, insisting it belongs to the Ukrainian mathematician Alexey Ivakhnenko, whose group built working deep networks in 1965. What made him famous beyond LSTM was a campaign over credit assignment that he has never stopped waging. He critiqued a celebrated 2015 Nature review by Yann LeCun, Bengio, and Geoffrey Hinton for citing one another while neglecting earlier pioneers. He rose at a 2016 tutorial on generative adversarial networks to argue, at length and to the audible irritation of the audience, that the idea echoed his own work from 1990. When the same three researchers won the 2018 Turing Award, he published a point-by-point objection; when Hinton shared the 2024 Nobel Prize in Physics, he called it, flatly, a prize for plagiarism.

The temptation is to file him under crank, and the field has often done exactly that — LeCun once told the New York Times that Schmidhuber was "manically obsessed with recognition" and prone to standing up at the end of every talk to claim credit, and a profile of him reached for the image of Rodney Dangerfield, the comedian who got no respect. There are even meme communities devoted to the bit. But the honest verdict is more uncomfortable, because when independent observers have actually checked his specific technical claims, a surprising number hold up. His early-1990s "fast weight" networks were later shown to be formally equivalent, up to normalization, to a class of modern linearized Transformers. His Highway Networks of 2015, built explicitly on LSTM-style gating, were a direct precursor to the residual networks that became ubiquitous. The man is impossible and frequently right at the same time, and a fair history has to hold both facts at once.

That same doubled quality runs through LSTM itself. The core analysis of the vanishing gradient was Hochreiter's, sole-authored in that German thesis; the architecture that solved it was genuinely joint, with Hochreiter as first author and Schmidhuber as the advisor who framed the problem and shaped the answer. The two never fell into the public feuds that defined Schmidhuber's relations with so many others — Hochreiter remained, for years, an adviser to Schmidhuber's company. There is something fitting in that. They had named the memory problem and then built the thing that defeated it, and the deepest irony of Schmidhuber's long crusade is that he was fighting, more than anything, against a kind of forgetting — refusing to let the field's own error signal vanish on the way back through time.

