AInspiro

AI Voice Cloning Showdown: ElevenLabs vs Doubao vs Hailuo — Which Wins for Chinese?

🤖 This article was generated by AI. Content is for informational purposes only.

How far has AI voice cloning come in 2026?

Two years ago, voice cloning felt like a toy: cloned voices had a mechanical quality and were instantly identifiable as fake. Testing again this year, the results are honestly startling. A 3-second sample is enough to clone a voice, and with a 10-second sample the result is nearly indistinguishable from the real thing. Give the AI a recording and it mimics not just the timbre but also the speaking rhythm, breathing patterns, and even verbal tics.

This article compares three mainstream tools: ElevenLabs (the overseas benchmark), Volcengine Doubao Voice (the domestic rising star), and Hailuo AI/MiniMax (the multimodal player).

Where the three tools stand

ElevenLabs: the overseas voice cloning benchmark. v3 model + Flash v2.5, supports 29 languages. Free tier: 10K characters/month. Starter $5/mo, Creator $22/mo, Pro $99/mo. Strongest English cloning; Chinese is decent but has an accent. Mature developer ecosystem, well-documented API.

Volcengine Doubao Voice: ByteDance's offering, Seed-ICL 2.0 model. The domestic leader in Chinese speech synthesis and cloning. Pay-per-use pricing, cheaper than ElevenLabs. Supports instruction-based emotion control — describe the emotion in natural language, and it adjusts tone accordingly.

Hailuo AI/MiniMax: multimodal approach, voice is just one component. Advantage: voice + video + text integrated, suited for digital avatars. On voice cloning quality alone, slightly behind the other two, but the integrated solution saves the hassle of connecting multiple tools.

Bottom line: English → ElevenLabs, Chinese → Doubao, digital avatars → Hailuo.

Chinese cloning quality tested

Cloning with the same 10-second Chinese sample:

  • Doubao Voice: highest timbre fidelity, natural Chinese intonation, strong emotion control. Can adjust emotion via natural language instructions — "say it more excitedly," and it genuinely gets excited
  • ElevenLabs: good timbre fidelity, but a slight accent in Chinese; it sounds like a foreigner who speaks Chinese very well. Emotional expression tends to be flat
  • Hailuo AI: decent timbre fidelity, but long-sentence phrasing isn't natural enough, occasionally swallows words
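
To repeat a comparison like this, each tool needs the same reference clip. The following is a minimal sketch, assuming the sample is an uncompressed PCM WAV file, that uses Python's standard `wave` module to check a clip's length and cut it to 10 seconds. The function names and file paths are illustrative and not part of any of the three tools.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def trim_wav(src: str, dst: str, seconds: float) -> None:
    """Copy at most the first `seconds` of src into dst, keeping its format."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        # readframes returns fewer frames if the file is shorter than requested
        frames = reader.readframes(int(seconds * reader.getframerate()))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        # the frame count in the header is corrected when the file is closed
        writer.writeframes(frames)


if __name__ == "__main__":
    # "sample.wav" is a placeholder path for your own recording
    trim_wav("sample.wav", "sample_10s.wav", 10.0)
    print(wav_duration_seconds("sample_10s.wav"))
```

Uploading the identical trimmed file to each tool keeps the sample length from becoming a hidden variable in the comparison.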

One more test dimension: dialects. Doubao's cloning of Cantonese and Sichuanese is clearly better than ElevenLabs', presumably because Doubao's training data covers more Chinese dialects. ElevenLabs' handling of dialects amounts to reading dialect vocabulary with Mandarin intonation, which sounds unnatural.

Emotion control: Doubao's killer feature

This is Doubao Voice's most surprising feature. ElevenLabs adjusts emotion mainly by selecting preset voices — "excited," "sad," "calm" are different voice models. Doubao uses natural language instructions — input "say it like catching up with an old friend," and it genuinely adjusts speed, pauses, and intonation.

For short video voiceovers and audiobooks, this feature is incredibly practical. No need to clone a different voice for each emotion — one voice model handles all emotions. And emotional transitions are smoother — ElevenLabs has a "disjointed" feel when switching between emotion voices, while Doubao's instruction-based control transitions smoothly within the same voice.
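
The difference between the two approaches, as described here, can be sketched as a small data model. This is an illustration under assumptions, not either vendor's API: `Line`, `plan_with_preset_voices`, `plan_with_instructions`, and every field name and voice ID below are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical data model for illustration only. None of these names come
# from the ElevenLabs or Volcengine APIs.


@dataclass(frozen=True)
class Line:
    text: str
    emotion: str  # a preset label, or a free-form instruction


def plan_with_preset_voices(lines, voice_by_emotion):
    """Preset approach: each emotion maps to its own voice model."""
    return [{"voice_id": voice_by_emotion[line.emotion], "text": line.text}
            for line in lines]


def plan_with_instructions(lines, voice_id):
    """Instruction approach: one voice, emotion passed as natural language."""
    return [{"voice_id": voice_id, "instruction": line.emotion, "text": line.text}
            for line in lines]


def voice_switches(plan):
    """Count how often consecutive segments change voice model."""
    ids = [segment["voice_id"] for segment in plan]
    return sum(a != b for a, b in zip(ids, ids[1:]))


script = [
    Line("We finally shipped it!", "excited"),
    Line("Here is what changed.", "calm"),
    Line("And there is more coming!", "excited"),
]
preset_plan = plan_with_preset_voices(
    script, {"excited": "voice-excited", "calm": "voice-calm"}
)
instruction_plan = plan_with_instructions(script, "voice-main")

if __name__ == "__main__":
    print(voice_switches(preset_plan), voice_switches(instruction_plan))
```

In this toy script the preset plan changes voice model at every emotion boundary, while the instruction plan never does, which is one way to picture why transitions within a single voice can sound smoother.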

Price comparison

ElevenLabs: free 10K characters/month, Starter $5/mo (30K chars), Creator $22/mo (100K chars), Pro $99/mo (500K chars)

Doubao Voice: pay-per-character, about 0.5-1 RMB per 10K characters. No monthly minimum; you pay only for what you use. For Chinese scenarios, at most about one-fifth of ElevenLabs' cost

Hailuo AI: limited free quota, Pro version charges by feature module
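
To put the ElevenLabs and Doubao figures on the same scale, the subscription prices can be converted to an effective cost per 10K characters. This is a rough sketch using only the prices quoted in this article. The exchange rate of 7.2 RMB per USD is an assumption, and the calculation assumes each monthly quota is fully used.

```python
# Plan figures are the ones quoted in this article.
ELEVENLABS_PLANS = {
    "Starter": (5.0, 30_000),    # (USD per month, characters per month)
    "Creator": (22.0, 100_000),
    "Pro": (99.0, 500_000),
}
DOUBAO_RMB_PER_10K = (0.5, 1.0)  # range quoted in this article
ASSUMED_RMB_PER_USD = 7.2        # assumption, not from the article


def rmb_per_10k(usd_per_month: float, chars_per_month: int,
                rmb_per_usd: float = ASSUMED_RMB_PER_USD) -> float:
    """Effective RMB cost per 10K characters if the whole quota is used."""
    return usd_per_month * rmb_per_usd / chars_per_month * 10_000


def cost_table() -> dict[str, float]:
    """Per-10K-character cost in RMB for each ElevenLabs paid tier."""
    return {name: round(rmb_per_10k(usd, chars), 2)
            for name, (usd, chars) in ELEVENLABS_PLANS.items()}


if __name__ == "__main__":
    for name, cost in cost_table().items():
        print(f"ElevenLabs {name}: ~{cost} RMB per 10K characters")
    low, high = DOUBAO_RMB_PER_10K
    print(f"Doubao Voice: {low}-{high} RMB per 10K characters")
```

Under these assumptions the paid ElevenLabs tiers land at roughly 12 to 16 RMB per 10K characters, so the quoted Doubao range sits comfortably below one-fifth of that. Unused quota or a different exchange rate would shift the ratio.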

How to choose

Pure Chinese scenarios (short video voiceover, audiobooks, customer service voice) → Doubao Voice. Best Chinese quality, cheapest price, strongest emotion control.

Multilingual scenarios (overseas content, cross-language projects) → ElevenLabs. 29-language support, unrivaled English cloning.

Digital avatars / video + voice → Hailuo AI. Voice is one component; when you need sync with video, the integrated solution saves hassle.


AI voice cloning is genuinely usable now, but it carries risks: scam calls and fake celebrity voices have already started appearing. The technology itself is neutral; where and how it's used is the real question. For legitimate content creation, these three tools are sufficient to replace most manual voiceover work.