AvatarStudio: High-fidelity and Animatable 3D Avatar Creation from Text

Jianfeng Zhang1* Xuanmeng Zhang2* Huichao Zhang1 Jun Hao Liew1 Chenxu Zhang1 Yi Yang3 Jiashi Feng1
1 ByteDance Inc. 2 University of Technology Sydney 3 Zhejiang University
Existing text-to-avatar methods are either limited to static avatars that cannot be animated, or struggle to generate animatable avatars with high quality and precise pose control. To address these limitations, we propose AvatarStudio, a coarse-to-fine generative model that generates explicit textured 3D meshes for animatable human avatars. Specifically, AvatarStudio begins with a low-resolution NeRF-based representation for coarse generation, then incorporates SMPL-guided articulation into the explicit mesh representation to support avatar animation and high-resolution rendering. To ensure view consistency and pose controllability of the resulting avatars, we introduce a 2D diffusion model conditioned on DensePose for Score Distillation Sampling supervision. By effectively leveraging the synergy between the articulated mesh representation and the DensePose-conditional diffusion model, AvatarStudio can create high-quality avatars from text that are ready for animation, significantly outperforming previous methods. Moreover, it supports many applications, e.g., multimodal avatar animation and style-guided avatar creation.
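To make the Score Distillation Sampling supervision mentioned above more concrete, the following is a minimal NumPy sketch of the generic SDS gradient with respect to a rendered image. It is not the AvatarStudio implementation: the function name, the `predict_noise` callable (standing in for a DensePose-conditioned diffusion model), and the weighting w(t) = 1 - alpha_bar[t] are all assumptions made for illustration.

```python
import numpy as np

def sds_image_gradient(rendered, predict_noise, t, alpha_bar, rng, text, densepose):
    """Toy SDS gradient w.r.t. a rendered image (hypothetical helper)."""
    a = alpha_bar[t]
    noise = rng.standard_normal(rendered.shape)
    # Forward diffusion: blend the rendering with Gaussian noise at timestep t.
    noisy = np.sqrt(a) * rendered + np.sqrt(1.0 - a) * noise
    # Hypothetical denoiser call, conditioned on the prompt and a DensePose map.
    noise_pred = predict_noise(noisy, t, text, densepose)
    weight = 1.0 - a  # assumed weighting w(t); other choices are common
    # SDS skips the denoiser Jacobian and uses the weighted noise residual.
    return weight * (noise_pred - noise)

# Usage with stand-in data and a denoiser that recovers the injected noise.
rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.999, 0.01, 1000)   # assumed noise schedule
rendered = rng.random((8, 8, 3))             # stand-in for a rendered view
densepose = np.zeros((8, 8, 3))              # stand-in for a DensePose map

def oracle_denoiser(noisy, t, text, densepose):
    a = alpha_bar[t]
    return (noisy - np.sqrt(a) * rendered) / np.sqrt(1.0 - a)

grad = sds_image_gradient(rendered, oracle_denoiser, 500, alpha_bar, rng,
                          "a chef", densepose)
```

In a real pipeline this image-space gradient would be backpropagated through a differentiable renderer to the avatar parameters. With the oracle denoiser above, the residual vanishes, so the rendering receives no update.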


Example generated avatars

AvatarStudio generates high-quality avatars in a multi-view consistent way.

Bruce Lee
Donald Trump
Kim Kardashian
Terracotta Warriors
Albert Einstein
A man wearing kilt
Captain America
A chef dressed in white
A man with dreadlocks
Lara Croft in Tomb Raider
A karate master wearing a black belt
A professional boxer
A man with curly hair wearing glasses
An American football player
Wolfgang Amadeus Mozart

Avatar creation with more complicated prompts

AvatarStudio also handles more complicated prompts, generating avatars that closely match their detailed descriptions.

Elderly woman, dressed in a traditional Native American outfit, holding dream catchers, braided hair
Cute chibi Lara Croft, game, Pixar design, studio lighting, modern Disney style, 3D character
Chibi Thor with Mjolnir, cute, volumetric lighting, reflective textures, game, character
Medieval soldier holding two longswords on hands, fantasy, game, character
Tesla trooper, wearing Mecha suit, scifi, game character, unreal, 3D rendering, fantasy
Chibi, single boy, cute, magician's outfit, top hat, magic wand, curly hair, shiny shoes
Young man, dressed in a futuristic cyberpunk outfit, neon accents, holding a high-tech gadget
Elderly gentleman, dressed in a vintage suit, monocle, holding walking canes on hands
Teenage boy, dressed in a modern hip-hop style, baseball cap tilted, holding basketballs
Chibi, 1boy, cute, knight armor, helmet, holding toy knife on hands, Pixar design
Elderly man, dressed in a traditional samurai outfit, holding katana
Chibi, 1girl, hanfu, cat ears, cat girl, silk robe, wavy hair, wearing traditional sandals
Stealthy ninja holding dual katanas, 3D, game character, unreal
A little girl dressed as Wonder Woman, chibi style, volumetric lighting, Disney style
Strong Slayer, holding machete on hands, game character, 3D rendering, unreal
Cute chibi Son Goku, Sporty style outfit, shoes, nike jacket, little boy, cartoon

Comparison Results

We compare AvatarStudio with other text-guided generation methods.

Assassin Creed

A standing Captain Jack Sparrow from Pirates of the Caribbean

A man wearing a bomber jacket

A karate master wearing a black belt

Generations with fewer optimization steps

We compare avatars generated with fewer optimization steps against the original results. Left: results obtained with reduced optimization steps (1 hour). Right: original results (2.5 hours). Even when optimized with fewer steps, the model still yields results comparable to the original ones.

Fewer steps
Fewer steps
Abraham Lincoln
Harry Potter
A karate master wearing a black belt

Multimodal Avatar Animation

AvatarStudio provides high-quality and easy-to-use animation, allowing users to drive the generated avatars with multimodal signals, such as text or video.

Text-driven animation. We adopt MDM to convert text prompts, like "A person is punching a bag", into SMPL sequences for animation.

Video-driven animation. We use VIBE to estimate SMPL sequences from driving videos for animation.
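Both animation modes end with an SMPL sequence driving the avatar mesh. As a rough illustration of that last step, here is a minimal NumPy sketch of linear blend skinning, the standard deformation used with SMPL-style bodies. It assumes the SMPL poses (for example from MDM or VIBE) have already been converted to per-joint 4x4 transforms; that conversion, and all function and variable names here, are hypothetical and not taken from AvatarStudio.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, joint_transforms):
    """Pose a mesh: vertices (V,3), weights (V,J), joint_transforms (J,4,4)."""
    ones = np.ones((len(vertices), 1))
    homo = np.concatenate([vertices, ones], axis=1)            # (V,4)
    # Per-vertex transform as a weighted sum of the joint transforms.
    blended = np.einsum("vj,jab->vab", weights, joint_transforms)
    posed = np.einsum("vab,vb->va", blended, homo)             # (V,4)
    return posed[:, :3]

def animate(vertices, weights, transform_sequence):
    """Apply one set of joint transforms per frame of the driving sequence."""
    return [linear_blend_skinning(vertices, weights, T) for T in transform_sequence]

# Toy example: three vertices, two joints, two frames.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])       # rows sum to 1
identity = np.eye(4)
lift = np.eye(4)
lift[1, 3] = 1.0                                               # move joint 1 up by 1
frames = animate(rest, weights, [np.stack([identity, identity]),
                                 np.stack([identity, lift])])
```

In the second frame the middle vertex, influenced equally by both joints, moves up by half the joint's translation, which is the blending behaviour that keeps the surface smooth around joints.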

Stylized avatar creation

AvatarStudio supports stylized avatar creation: users simply provide an additional style image alongside the text prompt.

Style image
A chef
A karate master
A girl wearing skirt
Style image
A karate master
Style image
A girl wearing dress
A karate master
A girl wearing skirt
Style image
A chef
A karate master
A ninja


