Dong, H. (2026). The impact of virtual digital human cognition on technological anxiety among university teachers: A moderating effect based on mutual trust and competitive atmosphere (虚拟数字人认知对高校教师技术焦虑的影响:基于相互信任与竞争氛围的调节作用). Modern Educational Technology, 2026(1).
Examining how university teachers’ understanding of virtual digital humans relates to technological anxiety in higher education, the study reports that stronger virtual-digital-human cognition is associated with higher technological anxiety, while mutual trust weakens that association and a competitive atmosphere strengthens it, indicating that workplace climate conditions how strongly perceived technological change translates into anxiety. Its evidential value lies in moving beyond a simple adoption narrative by identifying opposite social moderators around the same anxiety effect, but the evidence is limited by its cross-sectional sample from 12 universities in Shaanxi, which constrains causal inference and broader generalization.
Fan, Y. (2026). Research on the design of animated virtual characters under the evolution of the virtual human integration ecosystem. In W. Pedrycz, J. Wang, & J. Li (Eds.), Advances in information, computing and technology: ICICT 2025 (Lecture Notes in Networks and Systems, Vol. 1734). Springer.
The chapter examines how animated virtual characters should be designed within an ecosystem in which AI, VR, and 3D animation converge, and argues that effective character development depends on integrating technological infrastructure, aesthetic design, and system-level deployment around real-time animation pipelines built on tools such as Unity3D and Omniverse Audio2Face, with blendshape-based facial animation presented as a foundation for lifelike expression and multimodal responsiveness; its main evidential value is a clear synthesis linking technical animation frameworks to the broader ecological demands of intelligent and emotionally expressive virtual characters, but the chapter’s limitation is that its conclusions remain largely framework-driven and tool-oriented rather than being supported by direct comparative evidence on user response, performance outcomes, or cross-context deployment.
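To make the blendshape mechanism concrete, the following is a minimal sketch of how a neutral face mesh is deformed by weighted expression deltas; the vertex count, shape names, and random geometry are illustrative stand-ins, not the chapter's own code.

```python
import numpy as np

# Minimal blendshape sketch: a face mesh is deformed as the neutral
# geometry plus a weighted sum of per-expression delta meshes.
# Vertex count and shape names are illustrative, not from the chapter.
N_VERTS = 5000

rng = np.random.default_rng(0)
neutral = rng.standard_normal((N_VERTS, 3))          # neutral face vertices
deltas = {
    "jawOpen":   rng.standard_normal((N_VERTS, 3)) * 0.01,
    "smileLeft": rng.standard_normal((N_VERTS, 3)) * 0.01,
    "browRaise": rng.standard_normal((N_VERTS, 3)) * 0.01,
}

def blend(weights: dict[str, float]) -> np.ndarray:
    """Return deformed vertices: neutral + sum_i w_i * delta_i, w_i in [0, 1]."""
    out = neutral.copy()
    for name, w in weights.items():
        out += np.clip(w, 0.0, 1.0) * deltas[name]
    return out

# A driver (e.g., audio-to-expression) would stream weight vectors per frame.
frame = blend({"jawOpen": 0.6, "smileLeft": 0.2})
print(frame.shape)  # (5000, 3)
```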
Fu, Y., & Liu, B. (2026). Research on resolving the copyright protection dilemma of virtual idols (虚拟偶像著作权保护困境破解研究). Open Journal of Legal Science. Advance online publication.
Addressing how copyright law should protect virtual idols, the paper concludes that protection should be disaggregated rather than treated as a single undifferentiated object: visual design should be handled as artwork, voice through a proposed synthetic-voice category, character settings through textual or artistic protection plus unfair-competition law, and overall ownership through allocation rules tied to investment, creative contribution, and contract. Its evidential value lies in offering a concrete transitional legal framework responsive to the composite nature of virtual idols, but the argument is primarily doctrinal and normative rather than empirically tested against case outcomes or industry-wide disputes.
Huang, Y., & Jiao, J. (2026). The impact of virtual digital human interactivity on customer’s continuance usage intention: A cognitive and emotional perspective. Journal of Hospitality Marketing & Management, 1–37.
This study examines how virtual digital human interactivity sustains users’ willingness to keep using services by tracing two linked pathways: a cognitive route in which interactivity increases perceived agency and robot-service fit, and an emotional route in which interactivity increases perceived experience and emotional connection. The main conclusion is that greater interactivity strengthens continuance intention through both routes, with algorithm transparency altering the strength of those effects. Its evidential value lies in testing a multi-path model across experiments and a survey, but reliance on scenario-based designs and self-reported continuance intention limits confidence that the same effects translate to actual long-term service use.
Jiang, J., Zeng, W., Zheng, Z., Yang, J., Liang, C., Liao, W., Liang, H., Chen, W., Wang, X., Zhang, Y., & Gao, M. (2026). OmniHuman-1.5: Instilling an active mind in avatars via cognitive simulation. In Proceedings of the International Conference on Learning Representations (ICLR 2026).
The paper addresses the gap between physically plausible avatar animation and semantically meaningful behavior and concludes that adding a dual-system cognitive simulation layer, with high-level multimodal semantic planning coupled to a specialized multimodal diffusion architecture, produces more contextually coherent, emotionally resonant, and logically consistent avatar motion while also improving lip-sync, video quality, motion naturalness, and prompt alignment in reported evaluations; its main evidential value is a strong claim that avatar generation benefits from explicit modeling of intent and context rather than low-level audio-reactive motion alone, but the study remains limited by benchmark-centered validation of generated behavior, so whether this “active mind” framing captures genuine interactive reasoning rather than improved conditioned synthesis in open-ended real-world use remains unresolved.
Li, J., et al. (2026). Lightweight high-fidelity low-bitrate talking face compression for 3D video conference. arXiv.
The paper addresses ultra-low-bitrate 3D talking-face transmission for video conferencing and concludes that a metadata-driven representation combining FLAME parameters with 3D Gaussian Splatting can preserve fine facial geometry and appearance while achieving superior rate–distortion performance at very low bitrates, including more than 7× compression for the face model and real-time-oriented reconstruction from transmitted facial metadata; its main evidential value is a concrete claim that high-fidelity 3D conferencing can be made substantially more transmission-efficient without reverting to dense video or expensive implicit rendering, but the study is limited by task-specific evaluation on face-centric conferencing scenarios, so robustness to broader subjects, environments, and less constrained real-world communication settings remains unproven.
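A rough back-of-the-envelope calculation shows why parameter metadata is so much cheaper to transmit than pixels; the parameter counts, 16-bit quantization, and frame rate below are illustrative assumptions, not the paper's actual codec design.

```python
# Rough bitrate arithmetic for metadata-driven talking-face transmission:
# instead of sending pixels, send per-frame FLAME-style parameters.
# Counts and quantization here are illustrative assumptions only.
N_EXPRESSION = 100   # expression coefficients per frame
N_POSE = 6           # jaw + global head rotation, say
BITS_PER_PARAM = 16  # quantized to 16 bits
FPS = 25

params_per_frame = N_EXPRESSION + N_POSE
bits_per_second = params_per_frame * BITS_PER_PARAM * FPS
print(f"metadata stream: {bits_per_second / 1000:.1f} kbps")  # ~42.4 kbps

# The static appearance model (the Gaussian splats) is transmitted once;
# afterward only this low-dimensional motion metadata flows in real time,
# which is how rate-distortion gains over dense video become possible.
```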
Li, Z., Pun, C.-M., Fang, C., Wang, J., & Cun, X. (2026). PersonaLive! Expressive portrait image animation for live streaming. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2026).
The paper targets real-time portrait animation for live streaming and concludes that combining expressive motion control with appearance distillation and autoregressive micro-chunk streaming enables substantially faster generation while preserving expression realism, long-horizon stability, and visual quality, with reported speedups of roughly 7–22× over earlier diffusion-based portrait animation systems; its main evidential value is a concrete demonstration that diffusion-based portrait animation can be pushed toward genuinely streaming use rather than offline rendering, but the study remains limited by evaluation centered on portrait live-stream scenarios, leaving performance across more varied subjects, backgrounds, interaction patterns, and unconstrained production settings unresolved.
Lin, F., & Wu, Y. (2026). From pixelated to biopolitical: The genealogy of the humanness in China's virtual anchors. In Y. Chandra & R. Fan (Eds.), Artificial intelligence and the future of human relations. Springer.
The chapter traces how “humanness” has been socially constructed in China’s virtual-anchor industry over roughly two decades and argues that this construction shifted through four phases—pixelated, pent-up, expressive, and finally biopolitical humanness—showing that virtual anchors evolved from crude technical experiments into strategically governed media figures whose value lies in affective labor, platform integration, and audience management rather than resemblance alone; its main evidential value is a historically differentiated account that links changing forms of human-likeness to industry development and wider social control, but the chapter’s limitation is that its genealogical interpretation foregrounds periodization and conceptual framing more than direct comparative evidence about how audiences or creators across different cases actually experienced those shifts.
Lu, R., et al. (2026). MIRRORTALK: Forging personalized avatars via disentangled style and hierarchical motion control. arXiv.
The paper addresses personalized talking-avatar generation and concludes that separating speaker style from speech semantics, then controlling facial motion hierarchically across regions, improves both lip-sync accuracy and preservation of an individual’s characteristic speaking style relative to prior methods; its main evidential value is showing that personalization gains need not come at the expense of synchronization quality, but the study is limited by evaluation centered on benchmark-style talking-face tasks, so robustness across more diverse real-world recording conditions, longer interactions, and broader avatar settings remains unresolved.
Lyu, J., Qu, L., Zhang, W., Jiang, H., Liu, K., Zhou, Z., Xia, X., Xue, J., & Chua, T.-S. (2026). AUHead: Realistic emotional talking head generation via action units control. In Proceedings of the International Conference on Learning Representations (ICLR 2026).
The paper targets fine-grained emotional control in talking-head generation and concludes that treating facial action units as an explicit control space improves emotional realism while maintaining strong lip synchronization, visual coherence, and identity consistency, with benchmark results reported to surpass prior methods on expressive talking-head synthesis; its main evidential value is showing that AU-based disentanglement can make emotion control more precise than coarser label-based or implicit conditioning approaches, but the study remains limited by evaluation within benchmark emotional talking-head settings rather than broader real-world conversational or stylistically varied video conditions.
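As a minimal illustration of what AU-based conditioning means in practice, the sketch below builds a continuous action-unit intensity vector; the AU subset and the example recipe follow common FACS descriptions and are assumptions, not the paper's exact control space.

```python
import numpy as np

# Sketch of action-unit (AU) conditioning: emotion is expressed as
# continuous intensities over FACS action units rather than a single
# categorical label, allowing finer-grained control.
AUS = ["AU1", "AU2", "AU4", "AU6", "AU9", "AU12", "AU15", "AU20", "AU25", "AU26"]

def au_vector(intensities: dict[str, float]) -> np.ndarray:
    """Build a dense AU control vector in [0, 1] for the generator."""
    v = np.zeros(len(AUS), dtype=np.float32)
    for au, x in intensities.items():
        v[AUS.index(au)] = np.clip(x, 0.0, 1.0)
    return v

# An "anger-like" recipe: brow lowerer (AU4) plus nose wrinkler (AU9),
# with lips-part (AU25) as a free knob, finer than a discrete label.
cond = au_vector({"AU4": 0.8, "AU9": 0.5, "AU25": 0.3})
print(cond)
```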
Meng, R., Wu, W., Yin, Y., Li, Y., & Ma, C. (2026). EchoTorrent: Towards swift, sustained, and streaming multi-modal video generation. arXiv.
The paper addresses the efficiency–quality trade-off in streaming multi-modal human video generation and concludes that its post-training framework can sustain longer-horizon generation with better temporal consistency, identity preservation, and audio–lip synchronization while reducing inference cost through single-pass calibration and tail-focused alignment; the study’s main evidential value is a coherent empirical claim that streaming degradation can be mitigated without abandoning real-time aims, but its limitation is that the evidence is confined to benchmark-style comparisons and model ablations for human video generation, leaving broader generalization to other video domains and real deployment conditions untested.
Mo, J., Chen, H., Ye, C., Wang, Z., & Chen, C. (2026). Exploring the drivers of users' adoption of museum digital humans. npj Heritage Science, 14, Article 43.
The study examines what drives adoption of a museum digital human and finds that behavioral intention is shaped less by information quality alone than by a dual pathway in which information richness improves perceived information quality and aesthetic experience, aesthetic experience strengthens usefulness, ease of use, and flow, and flow is the strongest direct predictor of intention, while information quality does not significantly increase flow; this gives the paper solid evidential value as a theory-testing case showing that immersive and aesthetic engagement matters alongside cognitive evaluation in heritage interpretation, but its conclusions are constrained by a narrowly concentrated sample dominated by Chinese university-aged respondents and by reliance on a single National Museum of China case, which limits generalizability across visitor groups and museum contexts.
Pei, G., Dong, B., Jin, J., Meng, L., & Zhang, J. (2026). Dynamic processing of conversational intelligence features in marketing digital humans and its neural mechanisms (营销数字人对话智能特征的动态加工与神经机制). Advances in Psychological Science, 34(2), 227–238.
Focusing on how conversationally intelligent marketing digital humans may shape consumer behavior, the paper argues that effects should be understood through multidimensional dialogue features, multi-turn interaction, and the distinct roles of cognitive and affective trust, and it proposes that clarifying the dynamic processing and neural basis of these trust pathways could guide feature optimization for better consumer experience and greater business efficiency. Its evidential value lies in synthesizing a concrete trust-centered research framework that links marketing digital humans to behavioral and neurocognitive mechanisms, but as a forward-looking theoretical program rather than a completed empirical study, it does not itself provide tested findings on those mechanisms or their commercial effects.
Sun, X., Wang, F., & Jin, W. (2026). Continuance intention of cultural museum virtual human based on PLS-SEM analysis of MRT and UGT. npj Heritage Science, 14, Article 139.
Sun, Wang, and Jin analyze why users keep engaging with cultural museum virtual humans by testing an integrated Media Richness Theory and Uses and Gratifications model, with cultural identity as a mediator and information literacy as a moderator, on Chinese museum users using PLS-SEM. They find that richer media features improve gratification, that hedonic and technology gratification increase continuance intention and cultural identity, and that information literacy strengthens the effect of technology gratification on continued use. This makes the study useful as a current theory-building empirical account of virtual humans in Chinese museum settings, but it is limited by its focus on continuance intention rather than measured heritage outcomes and by reliance on one survey-based analytical design.
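The moderation claim at the heart of such models can be illustrated with a simple interaction-term regression; the simulated data and OLS estimation below are a sketch of the statistical idea only, since the authors themselves use PLS-SEM.

```python
import numpy as np

# Sketch of a moderation test: does information literacy (moderator)
# strengthen the effect of technology gratification on continuance
# intention? Data and coefficients are simulated, not the study's.
rng = np.random.default_rng(42)
n = 500
tech_grat = rng.normal(size=n)        # technology gratification
info_lit = rng.normal(size=n)         # information literacy (moderator)
# Generating process with a positive interaction (moderation) term:
intention = (0.4 * tech_grat + 0.2 * info_lit
             + 0.3 * tech_grat * info_lit
             + rng.normal(scale=0.5, size=n))

# OLS with an interaction term: X = [1, TG, IL, TG*IL]
X = np.column_stack([np.ones(n), tech_grat, info_lit, tech_grat * info_lit])
beta, *_ = np.linalg.lstsq(X, intention, rcond=None)
print(dict(zip(["const", "TG", "IL", "TG x IL"], beta.round(3))))
# A reliably positive TG x IL coefficient is the moderation claim.
```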
Wang, R., Feng, J., Tian, L., Luo, H., Li, C., Zhou, L., Zhang, H., Wu, Y., & He, X. (2026). JoyAvatar: Unlocking highly expressive avatars via harmonized text-audio conditioning. arXiv.
The paper addresses a common weakness of avatar video generation: strong audio synchronization but weak compliance with complex text instructions involving full-body motion, camera movement, background change, and object interaction. Its main claim is that harmonized text-audio conditioning, implemented through twin-teacher training and a decoupled inference strategy, enables longer and more expressive avatar videos with better text controllability than prior methods while preserving audio-visual synchronization. Its evidential value lies in extending avatar generation beyond narrow talking-head performance toward richer multimodal control, while the main limitation is that the evidence is still benchmarked within the paper’s own evaluation setting and therefore supports improved controllability under curated test conditions more clearly than robustness across unconstrained real-world scenes and interaction demands.
Wang, Z., et al. (2026). 3DXTalker: Unifying identity, lip sync, emotion, and spatial dynamics in expressive 3D talking avatars. arXiv.
The paper targets expressive 3D talking-avatar generation from a single image and speech, aiming to unify identity preservation, lip synchronization, emotional expression, and head-pose dynamics within one controllable framework. Its main claim is that a combination of curated 2D-to-3D identity modeling, richer audio cues beyond standard speech embeddings, and flow-matching-based dynamics modeling improves identity generalization, lip-sync accuracy, emotional nuance, and natural spatial motion over prior baselines. Its evidential value lies in presenting a genuinely integrated generation framework rather than optimizing only one expressive dimension, while the main limitation is that the evidence remains benchmark-based within the paper’s own curated training and evaluation setup, so broader robustness across unconstrained speakers, styles, and deployment conditions is not yet established.
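Flow matching, the dynamics-modeling technique the paper builds on, trains a network to regress a velocity field along straight-line paths between noise and data; the sketch below shows the generic objective with a toy MLP and a hypothetical 64-dimensional motion latent, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Minimal flow-matching step: sample t, interpolate between noise x0 and
# data x1, and regress the model's velocity toward (x1 - x0). The MLP and
# the 64-dim "motion latent" are illustrative stand-ins.
D = 64
model = nn.Sequential(nn.Linear(D + 1, 256), nn.SiLU(), nn.Linear(256, D))

def flow_matching_loss(x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)                  # noise sample
    t = torch.rand(x1.shape[0], 1)             # time in [0, 1]
    xt = (1 - t) * x0 + t * x1                 # straight-line path
    target_v = x1 - x0                         # ground-truth velocity
    pred_v = model(torch.cat([xt, t], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

batch = torch.randn(32, D)                     # fake motion latents
loss = flow_matching_loss(batch)
loss.backward()
print(float(loss))
```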
Xie, S., Cong, X., Yu, B., Gui, Z., Gui, J., Tang, Y. Y., & Kwok, J. T.-Y. (2026). Toward fine-grained facial control in 3D talking head generation. arXiv.
The paper addresses fine-grained control of local facial motion in 3D talking-head generation, especially lip-synchronization errors and facial jitter that weaken realism. Its main claim is that FG-3DGS improves temporal consistency, high-frequency mouth-and-eye control, and lip-sync accuracy over recent baselines by separating low- and high-frequency facial dynamics and adding a refined post-rendering alignment stage. Its evidential value lies in presenting a targeted advance for controllable, high-fidelity 3D avatar animation, while the main limitation is that the evidence is centered on benchmark comparisons for visual performance and does not by itself establish robustness across more varied real-world speaking conditions, identities, and interaction settings.
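The low/high-frequency separation idea can be illustrated with a simple band split of a motion signal; the moving-average filter below is a generic stand-in, not FG-3DGS's learned decomposition.

```python
import numpy as np

# Sketch of frequency-band separation for facial motion: smooth head/pose
# trends go through a low-pass branch, while rapid mouth/eye movement is
# the high-frequency residual, modeled and supervised separately.
def split_bands(signal: np.ndarray, win: int = 9) -> tuple[np.ndarray, np.ndarray]:
    kernel = np.ones(win) / win
    low = np.convolve(signal, kernel, mode="same")   # slow component
    high = signal - low                              # fast residual
    return low, high

t = np.linspace(0, 4 * np.pi, 200)
motion = np.sin(t) + 0.2 * np.sin(15 * t)            # slow pose + fast lips
low, high = split_bands(motion)
print(low[:3].round(3), high[:3].round(3))
# The two bands can then be recombined after per-band refinement.
```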
Xie, X. (2026). Mechanisms and strategies of trust repair toward virtual streamers in e-commerce live streaming. E-Commerce Letters, 15(1), 10–16.
The paper examines how trust can be repaired after failures by virtual streamers in e-commerce live streaming, focusing on how consumers attribute errors differently to virtual rather than human agents. Its main conclusion is that repair should combine attributional intervention with reconstruction of competence, integrity, and benevolence, while using the plasticity of virtual personas to restore perceived interactive fairness and avoiding excessive anthropomorphism that creates further ethical risk. Its evidential value lies in offering a clear conceptual repair model tailored to virtual streamers, while the main limitation is that the study is framework-building through theoretical deduction rather than direct empirical testing, so the proposed mechanisms and strategies remain more suggestive than demonstrated.
Yang, S., Lyu, Y., Chen, Z., Li, Y., Dong, B., Han, X., Yang, P., Wang, Z., Rao, A., Liu, Z., Dong, J., Fu, H., Shan, C., Liu, X., Wang, L., & Si, C. (2026). Human-centric content generation with diffusion models: A survey. TechRxiv.
Yang et al.’s February 18, 2026 TechRxiv preprint surveys diffusion-model methods for human-centric generation across faces, bodies, and behavior-related tasks, organizing the area into a unified task-level framework and concluding that diffusion models are a strong general foundation for this domain while open challenges remain in control, realism, and broader task coverage; its value is as a recent synthesis of a fast-moving literature, but because it is a non-peer-reviewed survey rather than a comparative experiment, it mainly consolidates existing work instead of providing new validating evidence.
Ye, Q., Li, Y., Luo, Y., & Pang, Z. (2026). The impact of AI anchor anthropomorphism on users' willingness to co-create value in tourism live-streaming contexts: The mediating role of social presence and the moderating role of perceived control. Frontiers in Psychology, 16, Article 1724176.
The study examines whether more human-like AI tourism anchors increase users’ willingness to co-create value during live streaming and identifies the psychological conditions under which that effect operates. Its main finding is that higher anthropomorphism increases co-creation willingness, partly because it strengthens social presence, while greater perceived control weakens the anthropomorphism-to-social-presence pathway without weakening the direct effect on co-creation willingness. Its evidential value lies in isolating a specific mechanism linking AI-anchor design to participatory user behavior in tourism live streaming, while the main limitation is that the evidence is bounded to experimentally manipulated anthropomorphism within a single tourism live-streaming context, which limits broader generalization across platforms, behaviors, and real market settings.
Ye, Y., & Song, Z. (2026). CF-GAT: Curvature-fused graph attention network for high-precision unordered facial point cloud landmark detection. IEEE Transactions on Circuits and Systems for Video Technology. Advance online publication.
The paper addresses the performance ceiling imposed by reliance on 2D texture mapping and synthetic digital face templates in 3D facial landmark detection, arguing that these dependencies introduce systematic errors because digital facial geometry diverges from real human anatomy. CF-GAT, trained on approximately 200,000 real-world 3D facial scans, achieved superior noise robustness, stronger cross-subject generalization, and finer-grained landmark localization than conventional template-dependent approaches. The study's evidential value is constrained by the dataset's origin within a single acquisition pipeline at a single institution, leaving cross-scanner and cross-demographic generalization unconfirmed by independent validation.
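The geometric intuition behind curvature features on unordered point clouds can be sketched with local PCA over nearest neighbors; this is the generic surface-variation measure, not CF-GAT's fused attention architecture.

```python
import numpy as np

# Sketch of per-point curvature features on an unordered point cloud:
# for each point, fit a local PCA over its k nearest neighbors; the
# smallest-eigenvalue ratio approximates surface variation (curvature).
def curvature_features(points: np.ndarray, k: int = 16) -> np.ndarray:
    n = len(points)
    curv = np.zeros(n)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    for i in range(n):
        nbrs = points[np.argsort(d2[i])[:k]]          # k nearest neighbors
        cov = np.cov((nbrs - nbrs.mean(0)).T)         # local 3x3 covariance
        w = np.sort(np.linalg.eigvalsh(cov))          # ascending eigenvalues
        curv[i] = w[0] / max(w.sum(), 1e-12)          # variation in [0, 1/3]
    return curv

pts = np.random.default_rng(1).standard_normal((256, 3))
print(curvature_features(pts)[:5].round(4))
# Such scalars could then be fused into graph-attention edge weights.
```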
Yu, T., Qiao, Q., Shen, L., Zhou, K., Hu, J., Sheng, D., ... & Liu, S. (2026). SoulX-FlashHead: Oracle-guided generation of infinite real-time streaming talking heads. arXiv preprint arXiv:2602.07449.
The preprint addresses the problem of maintaining high-fidelity talking-head generation under real-time, effectively unbounded streaming conditions. It concludes that streaming-aware spatiotemporal pre-training with temporal audio context caching and oracle-guided bidirectional distillation reduces audio-feature instability, identity drift, and long-sequence error accumulation enough to deliver state-of-the-art results on HDTF and VFHQ, with a Lite variant reported at 96 FPS on a single RTX 4090. Evidential value is strongest for benchmarked speed-and-quality performance supported by a 782-hour aligned training corpus, but the study’s evidence remains concentrated on two benchmarks and the authors’ training setup, so broader real-world generalization is less directly established.
Zhang, Y., & Xu, Q. (2026). Whether and how to introduce AI-driven virtual streamers: Selling mode selection in live streaming commerce. Electronic Commerce Research. Advance online publication.
Zhang and Xu’s January 16, 2026 Electronic Commerce Research article models three live-streaming commerce modes—human-only, virtual-only, and human–virtual collaboration—to test when brands should introduce AI virtual streamers, finding that adoption depends on consumer distrust and technological barriers: high levels favor human-only selling, low levels favor virtual-only selling, and intermediate levels make collaborative streaming most profitable, with collaboration also yielding the highest social welfare; its evidential value is a clear formal decision framework for mode selection, but its main limitation is that the results are theoretical and parameter-driven rather than validated with observed market behavior.
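The threshold structure of Zhang and Xu's conclusion can be sketched as a simple decision rule; the cutoffs below are purely illustrative, whereas the paper derives mode boundaries from profit functions.

```python
# Sketch of the threshold logic for choosing a selling mode. The fixed
# cutoffs and the averaged "friction" score are hypothetical stand-ins
# for the paper's derived profit-maximizing boundaries.
def choose_mode(distrust: float, tech_barrier: float) -> str:
    """distrust and tech_barrier normalized to [0, 1]; cutoffs illustrative."""
    friction = (distrust + tech_barrier) / 2
    if friction > 0.66:
        return "human-only"                   # high distrust/barriers
    if friction < 0.33:
        return "virtual-only"                 # low distrust/barriers
    return "human-virtual collaboration"      # intermediate: most profitable

for d, b in [(0.9, 0.8), (0.1, 0.2), (0.5, 0.5)]:
    print((d, b), "->", choose_mode(d, b))
```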
Zhen, D., Zheng, X., Zhang, R., Jiang, Z., Yan, Y., Tao, M., & Yin, S. (2026). SoulX-LiveAct: Towards hour-scale real-time human animation with neighbor forcing and ConvKV memory. arXiv preprint arXiv:2603.11746.
This study develops an autoregressive diffusion system for long-duration real-time human animation that targets stable temporal conditioning and constant-memory streaming; it reports strongest evidence on HDTF, where lip-sync and distributional/video-quality metrics exceed the compared baselines, and it also claims real-time deployment efficiency at 20 FPS on two H100/H200 GPUs with lower per-frame compute than the cited real-time alternatives, but evidential strength is tempered because gains are not uniform on EMTD, where lip-sync and human-fidelity scores improve while FID and FVD worsen markedly, and the evaluation remains limited to two human-animation benchmarks at 512×512.
Zhong, L., Wang, Y., Yue, Z., & Yang, Y. (2026). Study on the influence of intelligent human–computer interaction of AI virtual anchors on consumers' initial trust and value co-creation behavior under the technophobia. Frontiers in Psychology, 16, Article 1732258.
The study examines how intelligent interaction by AI virtual anchors in e-commerce shapes consumers’ initial trust and value co-creation under technophobia, with emphasis on guidance, recognition, analysis, and feedback as interaction dimensions. Its main finding is that all four dimensions increase perceived usefulness and ease of use, these perceptions strengthen initial trust, and initial trust predicts participation behaviors but not citizenship behaviors, while technophobia weakens the conversion of perceived ease of use into trust more than the usefulness pathway. Its evidential value lies in specifying a bounded trust-formation mechanism for virtual-anchor commerce, while the main limitation is that the results are drawn from a single TAM-based survey model of early consumer responses and therefore support association within that setting more strongly than broader causal or cross-context generalization.
Zhou, Y., Zhang, Z., Wu, S., Jia, J., Jiang, Y., Sun, W., Liu, X., Min, X., & Zhai, G. (2026). MI3S: A multimodal large language model assisted quality assessment framework for AI-generated talking heads. Information Processing & Management, 63(1), Article 104321.
The paper develops an objective quality-assessment framework for AI-generated talking heads across image quality, aesthetics, identity consistency, and lip-sync, using a multimodal large language model plus a temporal memory filter to better match human visual perception. Its main finding is that the system reaches a prediction-to-human perceptual correlation of 0.7946 on THQA and improves consistency by about 3.4% over earlier assessment methods, supporting its claim as a stronger automatic evaluator for generated talking-head videos. Its evidential value is as a benchmarked advance in evaluation rather than generation, while the main limitation is that validation appears concentrated on one 800-video dataset and one correlation-based performance framing, which narrows evidence for broader generalization.
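How such an evaluator is itself scored can be illustrated by correlating predicted quality with human mean opinion scores; the Spearman rank correlation below is a standard metric in video-quality work, computed here on simulated data rather than THQA.

```python
import numpy as np

# Sketch of scoring a quality-assessment model against human judgment:
# correlate predicted quality with mean opinion scores (MOS). The data
# is simulated; the paper's 0.7946 refers to its own predictions on THQA.
def spearman(pred: np.ndarray, mos: np.ndarray) -> float:
    rp = pred.argsort().argsort().astype(float)   # ranks (assumes no ties)
    rm = mos.argsort().argsort().astype(float)
    return float(np.corrcoef(rp, rm)[0, 1])

rng = np.random.default_rng(7)
mos = rng.uniform(1, 5, size=200)                 # human scores, 200 clips
pred = mos + rng.normal(scale=0.8, size=200)      # imperfect predictions
print(f"SRCC = {spearman(pred, mos):.4f}")
```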
Zhu, L., Lin, L., Ye, Z., Wu, J., Hou, X., Li, Y., Liu, Y., & Chen, J. (2026). MANGO: Natural multi-speaker 3D talking head generation via 2D-lifted enhancement. arXiv.
The paper targets two-person 3D talking-head generation with natural alternation between speaking and listening, arguing that pure image-level supervision can correct facial-motion noise introduced by pseudo-3D labels. Its main claim is that a two-stage system combining dual-audio interaction modeling with 2D photometric refinement yields more accurate, realistic, and controllable conversational head motion than prior approaches, supported by experiments and a new MANGO-Dialog dataset spanning more than 50 hours of aligned 2D-3D dialogue across 500-plus identities. Evidential value is strongest as a technical advance for multi-speaker conversational animation, while the main limitation is that the evidence is tied to benchmarked avatar-generation performance rather than broader validation of robustness across unconstrained real-world conversational conditions.
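The image-level supervision the paper relies on can be sketched as a masked photometric loss between rendered and ground-truth frames; the L1 form and the differentiable-render stand-in below are generic assumptions, not MANGO's exact pipeline.

```python
import torch

# Sketch of photometric (image-level) supervision: render the animated 3D
# face, compare pixels against the ground-truth video frame, and let the
# gradient correct motion noise inherited from pseudo-3D labels.
def photometric_loss(rendered: torch.Tensor, target: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Masked L1 over face pixels; tensors are (B, 3, H, W) / (B, 1, H, W)."""
    return ((rendered - target).abs() * mask).sum() / mask.sum().clamp(min=1)

B, H, W = 2, 64, 64
rendered = torch.rand(B, 3, H, W, requires_grad=True)  # differentiable render
target = torch.rand(B, 3, H, W)                        # ground-truth frames
mask = torch.ones(B, 1, H, W)                          # face-region mask
loss = photometric_loss(rendered, target, mask)
loss.backward()  # gradients would flow back into the 3D motion parameters
print(float(loss))
```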