Voice Cloning¶
Clone any speaker's voice from a short (3–10 s) reference audio clip. Whisper-powered auto-transcription is bundled, so a transcript of the reference clip is optional.
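from strands_omnivoice import omnivoice_clone

omnivoice_clone(
    text="In a parallel universe, machines speak softly.",  # words to speak in the cloned voice
    output="/tmp/cloned.wav",                               # where the generated audio is written
    ref_audio="/tmp/ref.wav",                               # reference clip of the target speaker
    ref_text="optional transcript of ref.wav",              # omit to auto-transcribe via Whisper
)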
Hear it in action¶
Step 1: original auto-voice sample, used as the clone reference. [audio sample]
Step 2: same voice, different words, generated by omnivoice_clone. [audio sample]
Best Practices¶
- Length: 3–10 seconds of reference audio. Longer clips slow inference and may reduce quality.
- Cleanliness: minimal background noise, single speaker.
- Same language: for native pronunciation, use a reference clip in the same language as your target text. Cross-lingual cloning works but carries the source-language accent.
- ref_text: optional. If omitted, OmniVoice auto-transcribes the reference clip via Whisper.
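The length guideline above can be checked before a clip is passed to omnivoice_clone. The following is a minimal sketch using only Python's standard `wave` module. `check_reference_clip` is a hypothetical helper, not part of strands_omnivoice, and it assumes the reference is an uncompressed PCM WAV file. It only measures duration; background noise and speaker count are not checked.

```python
import wave

MIN_SECONDS = 3.0   # lower bound from the length guideline
MAX_SECONDS = 10.0  # upper bound from the length guideline


def check_reference_clip(path):
    """Return a list of warnings for a PCM WAV reference clip (empty if none)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()

    duration = frames / rate  # clip length in seconds
    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(
            f"clip is {duration:.1f}s; aim for at least {MIN_SECONDS:.0f}s"
        )
    elif duration > MAX_SECONDS:
        warnings.append(
            f"clip is {duration:.1f}s; trim to {MAX_SECONDS:.0f}s or less"
        )
    return warnings
```

Calling `check_reference_clip("/tmp/ref.wav")` returns an empty list when the clip falls inside the recommended range, so the result can be used directly as a pre-flight check.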
Pipeline: ASR → Clone¶
from strands_omnivoice import omnivoice_transcribe, omnivoice_clone

# Step 1: transcribe the reference clip.
t = omnivoice_transcribe(audio_path="/tmp/ref.wav")
ref_text = t["content"][1]["json"]["transcript"]

# Step 2: clone the voice, passing the transcript explicitly.
omnivoice_clone(
    text="The quick brown fox jumps over the lazy dog.",
    output="/tmp/cloned.wav",
    ref_audio="/tmp/ref.wav",
    ref_text=ref_text,
)
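The pipeline above reads the transcript from a fixed position in the tool result. As a more defensive alternative, the sketch below scans the content blocks for the first one that carries a transcript. `extract_transcript` is a hypothetical helper, and the result shape (a `content` list whose items may hold a `json` dict with a `transcript` key) is an assumption inferred from the indexing shown above, not a documented contract.

```python
def extract_transcript(result):
    """Return the first transcript found in a tool result's content blocks."""
    for block in result.get("content", []):
        payload = block.get("json") if isinstance(block, dict) else None
        if isinstance(payload, dict) and "transcript" in payload:
            return payload["transcript"]
    raise KeyError("no 'transcript' field found in tool result")


# Hand-built result in the assumed shape (not real tool output).
sample = {
    "content": [
        {"text": "transcription complete"},
        {"json": {"transcript": "hello there"}},
    ]
}
print(extract_transcript(sample))  # prints: hello there
```

Raising `KeyError` when nothing is found keeps a failed transcription from silently passing an empty ref_text into the clone step.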
Or let an agent orchestrate:
# Assumes `agent` is an already-configured agent with these tools registered.
agent("""1. omnivoice_transcribe /tmp/ref.wav
2. omnivoice_clone with that transcript and text='Hello cloned world',
   output=/tmp/cloned.wav
3. audio_play it.""")
Tuning¶
- num_step: diffusion steps. Lower (16) is faster; higher (32+) is cleaner.
- guidance_scale: classifier-free guidance strength. Default is 2.0; raise it for a stronger voice match.
- speed / duration: control pacing.
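One way to keep these settings consistent across calls is to bundle them into named presets. The sketch below is illustrative only: `PRESETS`, `build_clone_kwargs`, and `clone_with_preset` are hypothetical names, and it assumes the tuning parameters are accepted by omnivoice_clone as keyword arguments alongside text, output, and ref_audio.

```python
# Hypothetical presets built from the step counts and default guidance above.
PRESETS = {
    "draft": {"num_step": 16, "guidance_scale": 2.0},  # faster
    "final": {"num_step": 32, "guidance_scale": 2.0},  # cleaner
}


def build_clone_kwargs(text, output, ref_audio, preset="draft", **overrides):
    """Merge a preset with per-call overrides into one keyword-argument dict."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = {"text": text, "output": output, "ref_audio": ref_audio}
    kwargs.update(PRESETS[preset])
    kwargs.update(overrides)  # overrides win over preset values
    return kwargs


def clone_with_preset(**kwargs):
    """Call omnivoice_clone with preset-derived arguments."""
    # Imported lazily so the helpers above work without the package installed.
    from strands_omnivoice import omnivoice_clone

    # Assumption: the tuning parameters are passed as keyword arguments.
    return omnivoice_clone(**build_clone_kwargs(**kwargs))
```

For example, `build_clone_kwargs("Hello", "/tmp/out.wav", "/tmp/ref.wav", preset="final", guidance_scale=3.0)` yields a dict with 32 steps and a raised guidance scale, which could then be passed on for a stronger voice match.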