Voice Cloning

Clone any speaker's voice from a short reference audio clip of 3–10 seconds. Whisper-powered auto-transcription is bundled, so a transcript of the reference clip is optional.

from strands_omnivoice import omnivoice_clone

omnivoice_clone(
    text="In a parallel universe, machines speak softly.",
    output="/tmp/cloned.wav",
    ref_audio="/tmp/ref.wav",
    ref_text="optional transcript of ref.wav",  # omit to auto-transcribe via Whisper
)

Hear it in action

Step 1: original auto-voice sample, used as the clone reference. [audio sample]

Step 2: same voice, different words, generated by omnivoice_clone. [audio sample]

Best Practices

  • Length: 3–10 seconds of reference audio. Longer clips slow inference and can reduce quality.
  • Cleanliness: minimal background noise, single speaker.
  • Same language: for native pronunciation, use a reference clip in the same language as your target text. Cross-lingual cloning works but carries the source-language accent.
  • ref_text: optional. If omitted, OmniVoice auto-transcribes via Whisper.
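
The length guideline above can be checked before a clip is passed to omnivoice_clone. The following is a minimal sketch using only Python's standard-library wave module. The helper name check_reference_clip and its thresholds are illustrative and not part of OmniVoice. It assumes an uncompressed PCM WAV file and does not attempt to detect noise or multiple speakers.

```python
import wave


def check_reference_clip(path, min_s=3.0, max_s=10.0):
    """Return (duration_seconds, problems) for a PCM WAV reference clip.

    Hypothetical helper: the 3-10 s window mirrors the guideline above.
    """
    with wave.open(path, "rb") as wf:
        # Duration = number of audio frames / frames per second.
        duration = wf.getnframes() / wf.getframerate()

    problems = []
    if duration < min_s:
        problems.append(f"too short: {duration:.1f}s < {min_s:.0f}s")
    elif duration > max_s:
        problems.append(f"too long: {duration:.1f}s > {max_s:.0f}s")
    return duration, problems
```

An empty problems list means the clip falls inside the recommended window; otherwise, trimming or re-recording the reference is likely worth the effort before cloning.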

Pipeline: ASR → Clone

from strands_omnivoice import omnivoice_transcribe, omnivoice_clone

t = omnivoice_transcribe(audio_path="/tmp/ref.wav")
ref_text = t["content"][1]["json"]["transcript"]

omnivoice_clone(
    text="The quick brown fox jumps over the lazy dog.",
    output="/tmp/cloned.wav",
    ref_audio="/tmp/ref.wav",
    ref_text=ref_text,
)

Or let an agent orchestrate:

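# Assumes `agent` is an agent instance already configured with the
# omnivoice tools and an audio playback tool (audio_play).
agent("""1. omnivoice_transcribe /tmp/ref.wav
        2. omnivoice_clone with that transcript and text='Hello cloned world'
           output=/tmp/cloned.wav
        3. audio_play it.""")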
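
The direct pipeline reads the transcript from a fixed position in the tool result (content[1]). As a more defensive sketch, the helper below scans the content blocks for the first one that carries a json payload with a transcript field. The name extract_transcript is hypothetical, and the result shape is inferred only from the indexing shown in the pipeline snippet, so the real structure may differ.

```python
def extract_transcript(result):
    """Return the first transcript found in a tool result's content blocks.

    Assumed shape (inferred from the pipeline snippet, not confirmed):
    {"content": [..., {"json": {"transcript": "..."}}, ...]}
    """
    for block in result.get("content", []):
        payload = block.get("json") if isinstance(block, dict) else None
        if isinstance(payload, dict) and "transcript" in payload:
            return payload["transcript"]
    raise KeyError("no 'transcript' field found in tool result")
```

With this helper, ref_text = extract_transcript(t) replaces the hard-coded index and fails with an explicit error if no transcript is present.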

Tuning

  • num_step: number of diffusion steps. Lower values (16) run faster; higher values (32+) give cleaner audio.
  • guidance_scale: classifier-free guidance scale. The default is 2.0; raise it for a stronger match to the reference voice.
  • speed / duration: control pacing.
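
One way to keep these settings organized is a small table of named presets. The sketch below is illustrative only. The preset names and the tuning_kwargs helper are hypothetical, the values are taken from the ranges listed above, and it assumes that omnivoice_clone accepts num_step and guidance_scale as keyword arguments, which this page implies but does not show.

```python
# Hypothetical presets built from the values mentioned in the Tuning list.
PRESETS = {
    "fast": {"num_step": 16, "guidance_scale": 2.0},   # fewer steps, quicker
    "clean": {"num_step": 32, "guidance_scale": 2.0},  # more steps, cleaner
}


def tuning_kwargs(preset="clean", **overrides):
    """Return a copy of a preset's settings with any overrides applied."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    kwargs = dict(PRESETS[preset])  # copy so the preset table stays unchanged
    kwargs.update(overrides)
    return kwargs


# Assumed usage (keyword names not confirmed by this page):
# omnivoice_clone(text="Hello", output="/tmp/cloned.wav",
#                 ref_audio="/tmp/ref.wav", **tuning_kwargs("fast"))
```

Starting from the "fast" preset while iterating on text, then switching to "clean" for the final render, keeps the trade-off between speed and quality explicit.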