Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.
A scientific paper is, among other things, a ranking. The first author is understood to have done the most; the last is often the senior figure who lent the lab and the gravity; everyone between is sorted by a quiet, sometimes bruising calculus of credit. So when eight researchers at Google finished the paper that would remake their field, the small rebellion they chose was almost invisible. They placed an asterisk beside every name, and beneath it a single sentence: equal contribution, listing order is random. It was a deliberate refusal of the hierarchy the convention would have imposed on them — a decision to flatten the byline on principle. One of them, glancing at an early draft and finding his own name printed first, was startled enough to say he hadn't been thinking about the order at all, and joked afterward that if he had understood what the paper would become, he might have worried about it considerably more.
The paper was "Attention Is All You Need," posted in June 2017 and presented that December at the field's marquee machine-learning conference in Long Beach. Its claim was narrow on the page and enormous in consequence. For years the best systems for translating one sequence of words into another had leaned on recurrent networks, which read a sentence one token at a time, carrying a running memory forward, and on the convolutions that sometimes supplemented them. That sequential reading was a bottleneck: because each step waited on the one before, the work could not be spread efficiently across the parallel hardware that was, by 2017, the real engine of progress. The eight proposed to throw the machinery out. Their architecture, which they called the Transformer, relied on attention alone — a mechanism that lets every word in a sentence weigh its relationship to every other word at once, in a single parallel sweep. The model reached a new high-water mark on the standard English-to-German benchmark, scoring 28.4 where the previous best, including ensembles, had trailed by more than two points, and it set a fresh single-model record on English-to-French after training for three and a half days on eight GPUs — a fraction of the cost the leading systems had demanded.
The temptation, looking back, is to tell this as a story of sudden vision, but it was nothing of the kind. It accreted. The thread began with Jakob Uszkoreit, the son of a computational linguist, who had drifted into Google's translation group as an intern, abandoned plans for a doctorate, and come to believe that self-attention could replace recurrence outright. The idea met skepticism; it was, after all, a rejection of the very techniques that defined the state of the art. What turned a contrarian hunch into a project was a kind of collaborative gravity. Noam Shazeer, a Google veteran whose early work had powered the search engine's spelling suggestions, was drawn in almost by accident, catching a hallway remark about replacing the sequential networks with attention, and he supplied the refinements that made the thing work — the scaled dot-product attention, the multi-head design, the position encoding that let a parallel model keep track of word order. Ashish Vaswani, with Illia Polosukhin, built and trained the first working Transformers. Niki Parmar designed, tuned, and evaluated an exhausting parade of variants. Łukasz Kaiser, a former tenured logician who had crossed over from French academia, joined with Aidan Gomez, then a roughly twenty-year-old intern, to build the open codebase that accelerated everyone's experiments. Llion Jones, a Welsh engineer who had come up through YouTube, owned the initial code, the visualizations, the efficient inference. The famous contribution footnote reads, clause by clause, less like a credit assignment than a cast list — each sentence a person, each person indispensable, none of them first.
It was Jones, too, who named the paper. A couple of nights before the deadline the team realized they still had no title, and someone observed that the work amounted to a wholesale bet on a single idea. Jones thought of the Beatles, who had once recorded a song called "All You Need Is Love," and offered the obvious echo. He would later wave off the moment with characteristic deflation, saying that he was British, that it had taken roughly five seconds of thought, and that he never expected the others to actually use it. The architecture's own name came from Uszkoreit, who simply liked the word "transformer." Two small christenings, often blurred together afterward, that between them gave the modern era both its slogan and its noun.
Here a measure of restraint is owed, because the paper has since been folded into a tidy legend in which eight people invented modern artificial intelligence in a single stroke. They did not, and the truer story is more interesting. Attention itself was not theirs; it had been introduced three years earlier by Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, who used it to help a translation network align source and target words. The Transformer also stood on the encoder-decoder lineage of sequence-to-sequence learning and on a body of work in normalization and convolutional modeling. What the eight contributed was not the spark of attention but the audacity to make it the whole engine — to discard recurrence and convolution entirely and discover that what remained was not only sufficient but faster, more scalable, and almost frighteningly general. Their novelty was subtraction, and subtraction turned out to be the thing.
The irony that hangs over the chapter is that the company holding this idea did not grasp what it held. By one account, around the time of publication Shazeer went to Google's leadership with a proposal that, in retrospect, reads like prophecy: abandon the search index, the crown jewel, and train one enormous network on transformers instead. Nothing came of it. Sam Altman, watching from outside, would later remark that when the paper appeared he did not think anyone at Google realized what it meant. The people who did realize were, increasingly, leaving.
They left in a slow scatter rather than a single exodus, and where they went is now a map of the industry the paper created. Aidan Gomez co-founded Cohere in 2019 and built it into an enterprise-AI company valued, after a 2025 financing extension, at around seven billion dollars. Shazeer departed in 2021 with a colleague to start Character.AI, the conversational-chatbot company, frustrated that Google would not release a chatbot of its own; in 2024 Google brought him back through a roughly $2.7 billion arrangement that was a licensing deal for the technology paired with a talent move, not a purchase of the company, which continued to operate independently. He became a co-lead of Google's Gemini effort — and then, in June 2026, announced he was leaving once more, this time for OpenAI. Vaswani and Parmar founded Adept and then Essential AI. Polosukhin built NEAR, which began as a tool to write code from plain language before pivoting into a blockchain. Uszkoreit turned to biology, co-founding Inceptive to design mRNA molecules with deep learning. Jones, reported as the last of the eight to walk out of Google in 2023, co-founded the Tokyo-based Sakana AI, betting on smaller, nature-inspired models. Kaiser, the one who started no company, went to OpenAI and helped lead the reasoning models that arrived in 2024. Their ventures are worth a great deal in the aggregate, though the figures are too unlike one another — venture valuations, a volatile crypto-token market capitalization, a licensing headline — to honestly add into a single sum.
What ties the diaspora back to the page is the same idea that started it. In 2018, barely a year after Long Beach, two laboratories built directly on the Transformer and opened the modern era: OpenAI's GPT, a decoder-only model, and Google's own BERT, an encoder-only one. The "T" at the end of every later acronym is the architecture eight people declined to rank themselves over. The chapter that began with a refusal to be sorted ends with the world sorting itself around their invention — and with the quiet recognition that the most consequential idea of the decade was written by a group that insisted, on the record, that no one among them had been first.
=> The Fork