Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.

Chapter 5: A Generalised Model

For a long stretch in the late 2010s, the most exciting problem in virtual humans was how to make a face move correctly. The body of theory that grew up around it was elegant and discrete. You decided what an agent meant to communicate, you planned the behaviour that would carry that meaning — a glance here, a hand raised there, a nod timed to a stressed syllable — and then a piece of software called a realiser turned that plan into motion. The plan was written in a markup language, the way a web page is written in HTML, and the realiser read the markup and rendered the gestures. The pipeline had a name — intent, to behaviour, to realisation — and a small constellation of academic engines that implemented it. To anyone building believable characters, that pipeline looked like the road ahead.
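To make the plan-then-realise idea concrete, here is a minimal sketch of a toy realiser. It is an illustration only: the tag and attribute names (`plan`, `gaze`, `gesture`, `nod`, `start`, `type`, `target`) and the `realise` function are invented for this example and are not taken from any real markup language or engine. The plan is a small piece of markup, and the realiser reads it and turns it into a time-ordered list of motions.

```python
import xml.etree.ElementTree as ET

# Hypothetical behaviour plan. Every tag and attribute name here is
# invented for illustration; real behaviour markup was far richer.
PLAN = """
<plan>
  <gesture type="raise_hand" start="0.4"/>
  <gaze target="listener" start="0.0"/>
  <nod start="1.2"/>
</plan>
"""


def realise(markup):
    """Turn a behaviour plan into a time-ordered list of (start, motion) pairs."""
    root = ET.fromstring(markup.strip())
    timeline = []
    for element in root:
        start = float(element.get("start", "0"))
        # Use the 'type' or 'target' attribute as extra detail when present.
        detail = element.get("type") or element.get("target") or ""
        motion = f"{element.tag}:{detail}" if detail else element.tag
        timeline.append((start, motion))
    # Sort by start time so motions play in the planned order.
    return sorted(timeline)


for start, motion in realise(PLAN):
    print(f"{start:.1f}s {motion}")
```

The point of the sketch is the division of labour: the plan says what should happen and when, and the realiser alone decides how that becomes motion.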

It was a good map of a country that was about to be paved over.

The destination, it turns out, was correct. The people sketching the future of virtual humans were right about where the field was going. They predicted that characters would become autonomous — that the influencer's curated backstory and the live-streamer's real-time presence would fuse into a single entity, and that the human puppeteer working the strings behind the avatar would simply be removed and replaced by software. They even had a name for the thing that would result: a virtual robot. And in December 2022 it arrived, more or less on schedule, in the shape of an AI VTuber called Neuro-sama — a language model wearing an animated avatar and a synthetic voice, livestreaming for hours without a script, playing games, bantering with chat, picking fights, holding a personality. No operator. By early 2026 she was breaking the platform records held by flesh-and-blood streamers. The prediction was vindicated almost to the letter.

What was wrong was everything underneath it. The vehicle the field expected to drive to that destination — the markup languages, the rule-based realisers, the carefully engineered "cognitive architecture" that would orchestrate language and gesture from a central module — was quietly abandoned while everyone was looking at the horizon. The large language model, which barely figured in the original picture, turned out to be the engine, the steering wheel and most of the road.

You can read the whole shift in the fate of a single piece of software. Of the academic realisers that once represented the cutting edge, most are now dormant or gone — last meaningful releases a decade old, repositories with no recent activity, their authors moved on to other things. One survives and is actively developed, and the way it survives is the entire story in miniature: it stayed alive by bolting a language model, a neural speech recogniser and frame-by-frame neural generation onto its old symbolic spine, so that it can now run the rule-based pipeline and the neural one side by side. The system endured by absorbing the very techniques that made its original design obsolete. That is not a paradigm winning. That is a paradigm being kept on life support by its successor.

The same irony runs through the commercial exemplars. The two facial-animation companies most often held up as proof that machine learning had conquered the face are, on inspection, the procedural holdouts — one built on anatomical muscle simulation, the other openly marketing the fact that it needs no neural network at all. They are thriving precisely because they didn't go neural. The genuine neural wave came from somewhere else entirely: audio-driven face models, diffusion talking-heads that generate a speaking face from a single still image and a voice clip, and the open-sourcing of the tools that put a lip-synced face on any character in real time. And the predicted bottleneck — the worry that chatbots would need text-driven faces, not just speech-driven ones — was real, but the industry routed around it rather than solving it as posed. The standard production stack today is almost boringly linear: a language model writes the words, a text-to-speech system speaks them, and an audio-driven model animates the face from that audio. The clever inversion of recognition models that some imagined would crack the problem never became the method. The pipeline just got longer and simpler at the same time.
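The linear stack described above can be sketched in a few lines. This is a toy under loud assumptions: `write_words`, `speak`, `animate` and `respond` are hypothetical stand-ins, not the API of any real language model, text-to-speech system or animation service, and the "audio" is just a list of fake amplitude values. What the sketch does show is the shape of the pipeline: each stage consumes only the output of the one before it, so the face is driven by the audio and never sees the text.

```python
from dataclasses import dataclass


@dataclass
class Frame:
    time: float        # seconds from the start of the utterance
    mouth_open: float  # 0.0 (closed) to 1.0 (fully open)


def write_words(prompt):
    """Stand-in for a language model call (hypothetical)."""
    return f"You said: {prompt}"


def speak(text):
    """Stand-in for text-to-speech: one fake amplitude sample per character."""
    return [0.0 if ch == " " else 1.0 for ch in text]


def animate(audio, sample_rate=10):
    """Stand-in for an audio-driven face model: mouth opening follows amplitude."""
    return [Frame(time=i / sample_rate, mouth_open=a) for i, a in enumerate(audio)]


def respond(prompt):
    """Run the three stages in order: words, then audio, then face frames."""
    text = write_words(prompt)
    audio = speak(text)
    return animate(audio)  # note: the face stage receives audio only


frames = respond("hi")
print(len(frames), frames[0], frames[3])
```

Swapping any stub for a real service would leave `respond` unchanged, which is one way to read the claim that the pipeline got longer and simpler at once.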

The starkest lesson is financial. The worldview that treated a virtual human as a bespoke, high-craft artefact — a "black art" practised by specialists and sold at a premium — assumed that difficulty itself was a moat. In February 2026 the company that best embodied that assumption entered receivership, having raised more than a hundred and thirty-five million dollars, lost its marquee customers and both its founders, and watched its standalone product get commoditised by the foundation-model giants. Meanwhile the capability it had sold at a premium became something you assemble from off-the-shelf parts: a character engine, a hosted language model, an animation microservice. A rival that pivoted in time survived by selling the assembly itself. The moat drained in about three years.

None of this makes the original forecast foolish. The honest verdict is almost the opposite: the direction was unusually sharp and the mechanism was unusually wrong, and the single event that explains the gap is the one nobody building these systems in 2020 could have weighted properly — the public arrival of large language models at the end of 2022, which dissolved the careful modular architectures into a pile of learned weights. The people who predicted the virtual robot were imagining an elaborate clockwork that would, eventually, behave like a mind. What they got instead was a mind-shaped statistical engine that made most of the clockwork unnecessary.

The puppeteer did vanish, exactly as predicted. It's just that nobody expected the strings to vanish too.
