
daVinci-MagiHuman
The Lens
daVinci-MagiHuman does audio and video in one model. No separate video generation, no separate voice synthesis, no stitching: one 15-billion-parameter transformer takes text and a reference image and jointly produces video and audio.
The numbers are real: 5-second 1080p video in 38 seconds on a single H100. Supports Mandarin, Cantonese, English, Japanese, Korean, German, and French. Beats Ovi 1.1 (80% win rate) and LTX 2.3 (60.9% win rate) in human evaluation. The full model stack is released: base model, distilled model, super-resolution model, and inference code.
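Those throughput claims are easy to sanity-check. A quick back-of-envelope from the quoted figures only (nothing below comes from the repo itself):

```python
# Arithmetic on the quoted throughput: 5 s of 1080p output per 38 s of
# H100 compute. Well below real time, but fine for batch avatar work.
clip_len_s = 5
gen_time_s = 38
print(f"generation runs {gen_time_s / clip_len_s:.1f}x slower than real time")
print(f"one H100-hour yields ~{3600 // gen_time_s} five-second clips")
```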
From Shanghai's GAIR Lab and Sand.ai.
The catch: you need serious hardware. The fast inference numbers assume an H100, and a 15B-parameter model isn't running on a consumer GPU. No license file is listed; check before commercial use. And 'joint audio-video generation' is still early. The 5-second clip limit means this is for avatars and short-form content, not video production.
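The consumer-GPU point follows from the parameter count alone. A rough weights-only estimate, assuming bf16 storage at 2 bytes per parameter (activations and caches add more on top):

```python
# Rough VRAM estimate for the 15B-parameter model, weights only.
# Assumes bf16/fp16 (2 bytes per parameter).
params = 15e9
bytes_per_param = 2
weights_gb = params * bytes_per_param / 1024**3
print(f"weights alone: ~{weights_gb:.0f} GB")  # ~28 GB
print("vs 24 GB on an RTX 4090, 80 GB on an H100")
```

At ~28 GB the weights alone overflow a 24 GB flagship consumer card before the first frame is generated, which is why the H100's 80 GB matters here.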
Free vs Self-Hosted vs Paid
Fully free. Open source research release. No paid tier, no hosted version. You need your own GPU infrastructure: an H100 or equivalent for reasonable inference times. The model weights are on Hugging Face.
Free to use. You pay for GPU compute, and you'll need a lot of it.
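Since the weights live on Hugging Face, fetching them typically looks like the sketch below. The repo id is a placeholder, not confirmed by this page; check the project's Hugging Face listing for the real one.

```python
# Minimal sketch of pulling released weights with huggingface_hub
# (pip install huggingface_hub). The repo id below is hypothetical:
# this blurb doesn't name the actual repo, so look it up first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SII-GAIR/daVinci-MagiHuman",  # hypothetical id
    local_dir="./davinci-magihuman",
)
print(f"weights downloaded to {local_dir}")
```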
About
- Owner: SII - Generative Artificial Intelligence Research Lab (GAIR) (Organization)
- Stars: 1,602
- Forks: 139