Comparison of Text-to-Speech Tools and Services

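Choosing the right text-to-speech (TTS) tool or service is critical to a project's success. This article compares the major TTS options on the market to help you make the best choice.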

Cloud Services

1. Google Cloud Text-to-Speech

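Features

WaveNet technology - high-quality speech synthesis developed by DeepMind
Multilingual support - 40+ languages and 220+ voices
SSML support - fine-grained control over speech output
Neural voices - the latest generation of high-quality voices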
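
As a rough illustration of what SSML input looks like, the sketch below wraps plain sentences in a `<speak>` document, with a `<prosody>` speaking rate and `<break>` pauses between sentences. It uses only the Python standard library; `build_ssml` and its parameters are illustrative names, not part of any SDK. With the Node.js client in the code example below, such a string would be passed as `input: { ssml: ... }` instead of `input: { text: ... }`.

```python
from xml.sax.saxutils import escape


def build_ssml(sentences, rate="medium", pause_ms=300):
    """Wrap plain-text sentences in a minimal SSML <speak> document.

    Each sentence becomes an <s> element, a <break> separates consecutive
    sentences, and everything sits inside one <prosody> element that sets the
    speaking rate. Text is XML-escaped so characters like & and < stay valid.
    """
    parts = []
    for i, sentence in enumerate(sentences):
        if i > 0:
            parts.append(f'<break time="{pause_ms}ms"/>')
        parts.append(f"<s>{escape(sentence)}</s>")
    return f'<speak><prosody rate="{rate}">{"".join(parts)}</prosody></speak>'


ssml = build_ssml(["你好，欢迎使用文本转语音服务", "价格 < $16 & 音质优秀"], rate="slow")
print(ssml)
```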

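Pricing

Standard voices: $4.00 per 1 million characters
WaveNet voices: $16.00 per 1 million characters
Monthly free tier: 4 million characters for Standard voices (1 million for WaveNet voices)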
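
Because billing is per character with a monthly free allowance, a bill can be estimated with simple arithmetic: only characters beyond the free allowance are charged. The helper below is a hypothetical sketch; the rates are the ones listed above, and the free allowances used in the example (4 million characters for Standard, 1 million for WaveNet) are assumptions to check against the current price list, since they vary by voice type.

```python
def monthly_cost(chars: int, price_per_million: float, free_chars: int = 0) -> float:
    """Estimate a monthly bill (USD) for per-character TTS pricing.

    Only characters beyond the monthly free allowance are billed, pro rata.
    """
    billable = max(0, chars - free_chars)
    return billable / 1_000_000 * price_per_million


# Assumed rates from the list above; free allowances are assumptions to verify.
print(monthly_cost(10_000_000, 4.00, free_chars=4_000_000))   # 24.0 (Standard)
print(monthly_cost(3_000_000, 16.00, free_chars=1_000_000))   # 32.0 (WaveNet)
print(monthly_cost(500_000, 4.00, free_chars=4_000_000))      # 0.0 (within free tier)
```

The same function applies to the other per-character services in this article by swapping in their rates and allowances.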

Code Example

javascript

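// Node.js example (requires: npm install @google-cloud/text-to-speech)
// Authenticates via Application Default Credentials (e.g. GOOGLE_APPLICATION_CREDENTIALS)
const fs = require('fs');
const textToSpeech = require('@google-cloud/text-to-speech');
const client = new textToSpeech.TextToSpeechClient();

async function synthesizeText(text) {
  const request = {
    input: { text: text },
    voice: {
      languageCode: 'cmn-CN',     // Google lists Mandarin voices under cmn-CN
      ssmlGender: 'FEMALE',
      name: 'cmn-CN-Wavenet-A'
    },
    audioConfig: {
      audioEncoding: 'MP3',
      speakingRate: 1.0,          // 1.0 = normal speed
      pitch: 0,                   // pitch shift in semitones
      volumeGainDb: 0             // volume gain in dB
    }
  };

  const [response] = await client.synthesizeSpeech(request);
  return response.audioContent;
}

// Usage
synthesizeText('你好，欢迎使用文本转语音服务')
  .then(audio => {
    fs.writeFileSync('output.mp3', audio, 'binary');
    console.log('Audio saved');
  })
  .catch(err => console.error('Synthesis failed:', err));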

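Pros and Cons

Pros

Excellent audio quality, close to a human voice
Broad language support
Stable, reliable API
Generous free tier

Cons

WaveNet voices are relatively expensive
Not directly accessible from mainland China (requires a proxy)
Relatively few Chinese voices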

2. Amazon Polly

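Features

Neural voices - deep-learning-based NTTS engine
Brand Voice - a custom voice built exclusively for your brand
Real-time streaming - supports streaming audio output
SSML tags - rich options for controlling speech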

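Pricing

Standard voices: $4.00 per 1 million characters
Neural voices: $16.00 per 1 million characters
Free tier: 5 million characters/month for Standard voices (1 million for Neural voices) during the first 12 months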

Code Example

python

# Python example (requires: pip install boto3 and configured AWS credentials)
import boto3

polly = boto3.client('polly', region_name='us-east-1')

def synthesize_speech(text, voice_id='Zhiyu', out_path='output.mp3'):
    response = polly.synthesize_speech(
        Text=text,
        OutputFormat='mp3',
        VoiceId=voice_id,   # Zhiyu: Mandarin Chinese female voice
        Engine='neural'
    )

    # AudioStream is a streaming body: read it fully and write it to disk
    with open(out_path, 'wb') as f:
        f.write(response['AudioStream'].read())

    return out_path

# Usage
synthesize_speech('你好，这是 Amazon Polly 的示例')

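Pros and Cons

Pros

Deep integration with the AWS ecosystem
Supports custom Brand Voice
Handles long text well (asynchronous synthesis tasks)
Detailed documentation and tutorials

Cons

Access from mainland China may be unreliable
Custom voice services are expensive
Steep learning curve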
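
Synchronous `SynthesizeSpeech` requests have a per-request text limit (about 3,000 billed characters at the time of writing; check the current Polly quotas), so long documents are usually either split into pieces or sent through the asynchronous `StartSpeechSynthesisTask` API, which writes its output to S3. A hypothetical splitter that breaks text at sentence boundaries, so that each piece stays under a chosen limit, might look like this:

```python
import re


def split_text(text, max_chars=1500):
    """Split text into chunks of at most max_chars, breaking after sentence ends.

    Sentences end at Chinese or Western terminal punctuation. A single sentence
    longer than max_chars is hard-split so that no chunk exceeds the limit.
    """
    # Zero-width split keeps each punctuation mark attached to its sentence.
    sentences = [s for s in re.split(r"(?<=[。！？!?.；;])", text) if s]
    chunks, current = [], ""
    for sentence in sentences:
        # Hard-split any sentence that is too long on its own.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current += sentence
    if current:
        chunks.append(current)
    return chunks


text = "第一句。第二句！Third sentence? " * 3
for chunk in split_text(text, max_chars=20):
    print(len(chunk), chunk)
```

Each chunk can then be passed in turn to the `synthesize_speech` function from the example above.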

3. Microsoft Azure Speech Service

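Features

Neural voices - more than 100 neural voices
Emotional styles - supports multiple emotions and speaking styles
Custom Voice - create a unique brand voice
Speech translation - speech translation built into the same service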
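
Emotional styles are selected through SSML rather than plain text. The sketch below builds an SSML document around Azure's `mstts:express-as` element; the helper name is hypothetical, and the available `style` values (and whether `styledegree` is honored) depend on the voice, so treat `cheerful` as an assumption to verify against Azure's voice list.

```python
from xml.sax.saxutils import escape, quoteattr


def build_express_as_ssml(text, voice="zh-CN-XiaoxiaoNeural", style="cheerful",
                          styledegree=1.0, lang="zh-CN"):
    """Build Azure SSML asking a neural voice to speak `text` in a given style."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" '
        f'xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<mstts:express-as style={quoteattr(style)} styledegree="{styledegree}">'
        f'{escape(text)}'  # escape &, < and > in the spoken text
        '</mstts:express-as></voice></speak>'
    )


ssml = build_express_as_ssml("今天天气真好！")
print(ssml)
```

With the Python SDK example below, the string would be passed to `synthesizer.speak_ssml_async(ssml).get()` instead of `speak_text_async`.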

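Pricing

Standard voices: $4.00 per 1 million characters
Neural voices: $16.00 per 1 million characters
Custom neural voices: $32.00 per 1 million characters
Free tier: 0.5 million characters/month for neural voices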

Code Example

python

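# Python SDK example (requires: pip install azure-cognitiveservices-speech)
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription='YOUR_KEY',
    region='eastasia'
)
speech_config.speech_synthesis_voice_name = 'zh-CN-XiaoxiaoNeural'

# audio_config=None keeps the audio in memory instead of playing it
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config,
    audio_config=None
)

result = synthesizer.speak_text_async('你好，Azure 语音服务').get()

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    # Write the synthesized audio bytes to disk
    with open('output.wav', 'wb') as f:
        f.write(result.audio_data)
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    print('Synthesis canceled:', details.reason, details.error_details)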

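Pros and Cons

Pros

Top-tier Chinese voice quality
Rich emotional expression
Powerful customization
Directly accessible from mainland China

Cons

High barrier to entry for custom voices
Relatively complex API
Requires an Azure account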

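4. Baidu Speech Synthesis

Features

Hybrid offline/online - supports both offline and online modes
Optimized for Chinese - deeply tuned for Chinese
Multiple voices - a range of male and female voices
SSML support - supports Speech Synthesis Markup Language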

Pricing

Online synthesis: free quota plus pay-as-you-go
Offline synthesis: one-time SDK purchase
Contact sales for detailed pricing

Code Example

python

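# Python example (requires: pip install baidu-aip)
from aip import AipSpeech

APP_ID = 'YOUR_APP_ID'
API_KEY = 'YOUR_API_KEY'
SECRET_KEY = 'YOUR_SECRET_KEY'

client = AipSpeech(APP_ID, API_KEY, SECRET_KEY)

result = client.synthesis(
    '你好，这是百度语音合成',
    'zh',   # language
    1,      # client type (ctp), fixed value 1
    {
        'vol': 5,   # volume
        'pit': 5,   # pitch
        'spd': 5,   # speed
        'per': 4    # voice (speaker) selection
    }
)

# On success the SDK returns audio bytes; on failure it returns a dict describing the error
if not isinstance(result, dict):
    with open('audio.mp3', 'wb') as f:
        f.write(result)
else:
    print('Synthesis failed:', result)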

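Pros and Cons

Pros

Excellent results for Chinese
Stable access within China
Relatively low price
Offline option available

Cons

Limited language support
Audio quality slightly behind the major international providers
Documentation is not always up to date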

Open-Source Options

1. Coqui TTS

Overview

Coqui TTS is an open-source deep learning toolkit for text-to-speech.

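Features

Multiple models - Tacotron, VITS, Glow-TTS, and more
Multilingual training - supports training models for new languages
Community - widely used, although Coqui (the company) shut down in early 2024 and maintenance has moved to community forks
Easy deployment - Docker image available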

Installation and Usage

bash

# Install
pip install TTS

# List available models
tts --list_models

# Synthesize speech
tts --text "你好，这是开源TTS示例" \
    --model_name tts_models/zh-CN/baker/tacotron2-DDC-GST \
    --out_path output.wav

python

# Calling from Python
from TTS.api import TTS
tts = TTS(model_name="tts_models/zh-CN/baker/tacotron2-DDC-GST")
tts.tts_to_file(text="你好世界", file_path="output.wav")

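Pros and Cons

Pros

Completely free and open source
Highly customizable
Large user community
Supports training your own models

Cons

Requires technical expertise
High training costs
Complex deployment
Few Chinese pretrained models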

2. VITS

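Overview

VITS is an end-to-end TTS model that generates high-quality speech.

Features

Single-stage generation - no separate vocoder required
High-quality output - close to human voice quality
Multi-speaker - supports multiple voices
Fast inference - capable of real-time synthesis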
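
"Real-time" is usually quantified by the real-time factor (RTF): wall-clock synthesis time divided by the duration of the audio produced, so an RTF below 1 means audio is generated faster than it plays back. A minimal, model-agnostic way to measure it, assuming a synthesis function that returns raw audio samples (the `fake_tts` stand-in is hypothetical):

```python
import time
from typing import Callable, Sequence


def real_time_factor(synthesize: Callable[[str], Sequence[float]],
                     text: str, sample_rate: int) -> float:
    """Return RTF = wall-clock synthesis time / duration of the generated audio."""
    start = time.perf_counter()
    samples = synthesize(text)
    elapsed = time.perf_counter() - start
    if len(samples) == 0:
        raise ValueError("synthesizer returned no audio")
    duration = len(samples) / sample_rate  # seconds of audio produced
    return elapsed / duration


# Hypothetical stand-in for a real model: returns 2 s of silence at 22,050 Hz.
def fake_tts(text: str) -> list:
    return [0.0] * (2 * 22050)


rtf = real_time_factor(fake_tts, "你好，这是 VITS 示例", 22050)
print(f"RTF = {rtf:.4f}")
```

For a real model, run a warm-up call first and average over several sentences, since the first call often includes one-off loading costs.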

Project Structure

VITS/
├── monotonic_align/
├── text/
│   ├── cleaners.py
│   ├── symbols.py
│   └── zh.py
├── configs/
│   └── config.json
├── models/
│   └── pretrained.pth
└── inference.py

Inference Example

python

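# Assumes the official VITS repository (jaywalnut310/vits) is the working directory,
# so that models.py, utils.py, commons.py and text/ are importable.
# Chinese input needs a fork whose text cleaners handle Chinese (e.g. text/zh.py).
import torch
import soundfile as sf
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
hps = utils.get_hparams_from_file("configs/config.json")  # model hyperparameters
net_g = SynthesizerTrn(len(symbols), hps.data.filter_length // 2 + 1,
                       hps.train.segment_size // hps.data.hop_length, **hps.model).to(device)
net_g.eval()
utils.load_checkpoint("models/pretrained.pth", net_g, None)  # pretrained weights

# Convert text to a symbol ID sequence (with blank tokens if the config requires them)
seq = text_to_sequence("你好，这是 VITS 示例", hps.data.text_cleaners)
if hps.data.add_blank:
    seq = commons.intersperse(seq, 0)
x = torch.LongTensor(seq).unsqueeze(0).to(device)
x_lengths = torch.LongTensor([x.size(1)]).to(device)

# Run inference and save the waveform
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8,
                        length_scale=1.0)[0][0, 0].cpu().float().numpy()
sf.write("output.wav", audio, hps.data.sampling_rate)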

3. PaddleSpeech

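Overview

Baidu's open-source speech toolkit, covering both speech recognition and speech synthesis.

Features

Optimized for Chinese - designed with Chinese in mind
Rich model zoo - many pretrained models
Easy to use - simple API
Production-ready - intended for industrial deployment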

Installation and Usage

bash

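# Install
pip install paddlespeech

# Command-line usage
paddlespeech tts --input "你好，世界" --output output.wav

python

# Python API
from paddlespeech.cli.tts.infer import TTSExecutor
tts_executor = TTSExecutor()
wav_file = tts_executor(
    text="你好，欢迎使用 PaddleSpeech",
    output="output.wav",
    am="fastspeech2_mix",   # acoustic model for mixed Chinese/English text
    voc="hifigan_csmsc",    # vocoder
    lang="mix",             # required with the mixed-language acoustic model
    spk_id=174              # speaker ID for the multi-speaker mixed model
)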

Browser-Native API

Web Speech API

A speech synthesis API built into modern browsers, with no external dependencies.

Basic Usage

javascript

// Check for support
if ('speechSynthesis' in window) {
  console.log('Web Speech API is supported');
}

// Simplest usage
const utterance = new SpeechSynthesisUtterance('你好，世界');
utterance.lang = 'zh-CN';
speechSynthesis.speak(utterance);

// List available voices
function listVoices() {
  const voices = speechSynthesis.getVoices();
  voices.forEach(voice => {
    console.log(`${voice.name} (${voice.lang}) ${voice.default ? '- default' : ''}`);
  });
}

// Voices load asynchronously in some browsers; list them once they are ready
speechSynthesis.onvoiceschanged = listVoices;

Complete Example

javascript

class TextToSpeech {
  constructor() {
    this.synth = window.speechSynthesis;
    this.utterance = null;
    this.voices = [];
    this.loadVoices();
  }

  loadVoices() {
    this.voices = this.synth.getVoices();
    if (this.voices.length === 0) {
      // Some browsers load voices asynchronously; refresh once they arrive
      this.synth.onvoiceschanged = () => {
        this.voices = this.synth.getVoices();
      };
    }
  }

  speak(text, options = {}) {
    // Stop anything currently playing
    this.synth.cancel();

    this.utterance = new SpeechSynthesisUtterance(text);

    // Use the first voice whose language matches the requested prefix
    const lang = options.lang || 'zh';
    const voice = this.voices.find(v => v.lang.startsWith(lang));
    if (voice) this.utterance.voice = voice;
    this.utterance.lang = voice ? voice.lang : lang;

    // ?? (not ||) keeps valid zero values such as volume: 0
    this.utterance.rate = options.rate ?? 1.0;      // speaking rate (0.1 to 10)
    this.utterance.pitch = options.pitch ?? 1.0;    // pitch (0 to 2)
    this.utterance.volume = options.volume ?? 1.0;  // volume (0 to 1)

    // Event listeners
    this.utterance.onstart = () => console.log('Playback started');
    this.utterance.onend = () => console.log('Playback ended');
    this.utterance.onerror = (e) => console.error('Playback error:', e.error);

    this.synth.speak(this.utterance);
  }

  pause() {
    this.synth.pause();
  }

  resume() {
    this.synth.resume();
  }

  stop() {
    this.synth.cancel();
  }
}

// Usage
const tts = new TextToSpeech();
tts.speak('你好，这是一个完整的示例', {
  lang: 'zh-CN',
  rate: 0.9,
  pitch: 1.0
});

Browser Compatibility

Browser	Support	Notes
Chrome	✅ Full support	Widest selection of voices
Firefox	✅ Full support	Fewer voices
Safari	✅ Full support	macOS/iOS
Edge	✅ Full support	Chromium-based
IE	❌ Not supported	-

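Recommendations

Personal Projects / Learning

Browser API - free, simple, and sufficient for basic needs
Coqui TTS - for learning deep-learning-based TTS
Baidu Speech - for Chinese-language projects

Commercial Applications

Scenario	Recommended Option
Enterprise applications	Azure Speech
International projects	Google TTS / Amazon Polly
Users mainly in mainland China	Baidu Speech / Azure China
Cost-sensitive	Hybrid approach (cloud + local)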
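
One way to implement the hybrid approach is a fallback chain: try the preferred backend first and move to the next one (for example, a local model) on network errors or an exhausted quota. The sketch below assumes each backend is a callable that takes text and returns audio bytes; `cloud_tts` and `local_tts` are hypothetical placeholders, not real SDK calls.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Tuple[str, Callable[[str], bytes]]


def synthesize_with_fallback(text: str, backends: Sequence[Backend]) -> Tuple[str, bytes]:
    """Try each (name, synth_fn) backend in order; return the first success."""
    errors: List[str] = []
    for name, synth in backends:
        try:
            return name, synth(text)
        except Exception as exc:  # e.g. network failure or exhausted quota
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS backends failed: " + "; ".join(errors))


# Hypothetical backends for illustration only.
def cloud_tts(text: str) -> bytes:
    raise ConnectionError("quota exceeded")  # simulate a failing cloud call


def local_tts(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for real audio bytes


name, audio = synthesize_with_fallback("你好", [("cloud", cloud_tts), ("local", local_tts)])
print(name)  # local
```

Reversing the order (local first, cloud only for failures or premium content) is the cost-minimizing variant of the same pattern.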

Special Requirements

Best possible audio quality → Azure Neural Voice
Lowest latency → locally deployed VITS
Custom voice → Azure Custom Voice / train an open-source model
Offline use → PaddleSpeech / Coqui TTS

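Summary

When choosing a TTS solution, weigh the following:

Budget - pay-as-you-go cloud services vs. self-hosted open source
Audio quality - neural voices vs. standard voices
Language support - primarily Chinese vs. multilingual
Technical capacity - ready to use vs. custom development
Compliance - data privacy and access stability

For most projects, start with the browser's native API or a cloud provider's free tier, then move to a paid plan or a self-hosted deployment as usage grows.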

Published 2025-06-28
