Supporters of Marcus Endicott’s Patreon can access weekly or monthly consultations on this topic.

Chapter 7: Conclusion

In 2021, the smart bet on virtual humans looked something like this: the bandwidth was almost here, the game engines were drifting into the cloud, and consumer mixed reality was a season or two away from going mainstream. Lay those pieces side by side and the conclusion seemed to write itself. Lifelike conversational figures would soon spill out of the laboratory and into the marketplace, embodied and three-dimensional, reaching across every screen and headset we owned. The infrastructure was not yet in place, but all the building blocks were on the horizon. It was a confident, plausible, and almost entirely misdirected forecast, and the strange thing is that it still arrived at roughly the right destination.

Five years on, virtual humans really did converge toward mass-market relevance. A genuine commercial marketplace now exists, with dozens of vendors selling AI-driven avatars for training, marketing, and customer service, and with valuations that would have sounded fanciful in 2021. The directional intuition was sound. What collapsed was the causal story underneath it. Almost every specific enabler named as the engine of this transformation either underdelivered or actively went into reverse, while the thing that actually did the work was nowhere in the original text.

Start with the road that was supposed to carry everything: the last mile of 5G. The promise was that new bandwidth at the edge would complete the picture, complementing cloud-delivered game engines so that rich virtual humans could be streamed to anyone, anywhere. In practice, 5G's contribution to delivering immersive experiences turned out to be marginal relative to its billing. The bottleneck was never really the last mile; it sat further back in the network, in the middle-mile routing that no consumer upgrade addressed. The most candid verdict came from within the telecom industry itself, where one carrier's forward-looking research conceded that extended reality, autonomous driving, and digital twins had failed to take off not because the networks were too slow but because the devices, the demand, and the surrounding conditions were not ready. The performance of the network, it turned out, had never been the problem.

The cloud game engine fared no better. The forecast assumed that game engines were steadily relocating to the cloud and that this infrastructure would expand on cue to host a generation of streamed virtual humans. Instead, the most visible attempt to make cloud gaming mainstream shut down in early 2023 after failing to find an audience, and the damage to the idea's reputation lingered. The services that survived settled into life as niche conveniences rather than the booming backbone the prediction required. More tellingly, the virtual humans that did flourish mostly walked out of the game-engine house altogether. The breakout successes run as flat, two-dimensional talking-head video or as browser and on-device avatars, needing no streamed engine at all. A real-time three-dimensional engine still matters for one slice of the field, the embodied non-player characters being built into games, but the broader market simply went around it. Even the chip-makers pushing the embodied vision began moving inference onto local devices rather than off to the cloud, the opposite of the streamed model the forecast leaned on.

Then there was the headset, the piece that was meant to make all of this ubiquitous. Mixed reality was supposed to become the ambient medium through which virtual humans reached everyone. Consumer extended reality instead became a study in disappointment. The most ambitious premium device arrived in 2024 at a price that kept it firmly in the enthusiast tier, sold in modest numbers, and saw its marketing support quietly cut to almost nothing within the year, with analysts blaming cost, form factor, and a thin catalogue of native software. Headset sales across the category declined rather than climbed. The companies that had bet hardest on virtual presence for work began dismantling those efforts, shutting down their flagship collaboration spaces and redirecting attention toward phones and lightweight glasses. Ubiquity did arrive, but not in the shape anyone had drawn. It came as AI assistants living on the devices people already carried, and as smart glasses that added a voice rather than a virtual world.

So if the bandwidth, the cloud engines, and the headsets were not the engine, what was? The answer is the one technology the original forecast never named. The generative-AI wave that broke in late 2022 is what actually made conversational virtual humans viable, because large language models made open-ended dialogue cheap and abundant in a way the older, hand-built systems never could. The earlier architecture imagined a tidy division of labour: a dialogue system handling the words on one side, a behaviour system rendering the gestures on the other, with conversation reaching virtual reality through that pairing. That whole framing came apart. The intent-driven dialogue pipelines that defined the field were displaced almost wholesale by language models, to the point where even the companies that built those pipelines now describe the old approach as bringing a sports car onto a speed-limited road. And conversation did not arrive through a virtual body at all. It reached the public overwhelmingly through plain text and voice on ordinary screens, the two-dimensional chat window rather than the avatar.

There was, in fairness, one genuinely prescient call buried in the old text, and it deserves credit. The forecast recognised that the manual, rule-based methods for animating these figures were giving way to automated approaches built on neural networks. That trend proved exactly right and then some. The work of generating gesture and facial motion from speech moved decisively to data-driven models, and at the field's leading competition the diffusion-based systems came to dominate; in one striking year, some synthetic gestures were even judged more humanlike than recorded human motion, though the same systems remained much worse at matching gesture to the meaning of the speech, a gap that is still far from closed. The commoditisation went far enough that a leading vendor open-sourced its facial-animation engine entirely. The instinct about neural methods was correct. What did not survive was the two-box picture around it, since the realiser and the dialogue system dissolved into unified, multimodal pipelines conditioned on language models rather than persisting as two neat components.

All of which makes the honest verdict less a grade than a distinction. The hunch was right and the engines were wrong. The marketplace that exists today is real, but it is flourishing as something other than the embodied extended-reality vision that was forecast, and it is not uniformly healthy even on its own terms. One of the more prominent builders of lifelike computer-generated humans tipped into receivership in early 2026, a reminder that a thriving category can be brutal to its individual members. The figures did converge toward the mainstream, more or less on schedule. They simply came by a road no one had mapped, powered by an engine no one had named, and they arrived, for the most part, on the flat screens that the whole adventure had been trying to escape.
