Voice Cloning

Clone any speaker's voice from a short (3–10 second) reference audio clip. Whisper-powered auto-transcription is bundled, so supplying a transcript of the reference clip is optional.

from strands_omnivoice import omnivoice_clone

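omnivoice_clone(
    text="In a parallel universe, machines speak softly.",  # words to synthesize in the cloned voice
    output="/tmp/cloned.wav",  # where the generated audio is written
    ref_audio="/tmp/ref.wav",  # 3–10 s clip of the speaker to clone
    ref_text="optional transcript of ref.wav",  # omit to auto-transcribe via Whisper
)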

Hear it in action

Step 1 — original auto-voice sample (used as the clone reference):

Step 2 — same voice, different words, generated by omnivoice_clone:

Best Practices

Length: use 3–10 seconds of reference audio. Longer clips slow inference and may reduce quality.
Cleanliness: use a recording with minimal background noise and a single speaker.
Same language: for native pronunciation, use a reference clip in the same language as your target text. Cross-lingual cloning works but carries the source-language accent.
ref_text: optional. If omitted, OmniVoice auto-transcribes the reference clip via Whisper.
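
The length guideline above can be checked before calling omnivoice_clone. The following is a minimal sketch using only Python's standard-library `wave` module; the helper name `check_reference_clip` is hypothetical and not part of strands_omnivoice, and it assumes the reference is an uncompressed PCM WAV file. It checks duration only; background noise and speaker count still need a listen.

```python
import wave


def check_reference_clip(path, min_s=3.0, max_s=10.0):
    """Return (duration_seconds, problems) for a PCM WAV reference clip.

    Hypothetical helper: the 3-10 s window mirrors the guideline above.
    """
    with wave.open(path, "rb") as wf:
        # Duration is the frame count divided by the sample rate.
        duration = wf.getnframes() / float(wf.getframerate())

    problems = []
    if duration < min_s:
        problems.append(f"too short: {duration:.2f}s (want at least {min_s}s)")
    elif duration > max_s:
        problems.append(f"too long: {duration:.2f}s (want at most {max_s}s)")
    return duration, problems
```

An empty `problems` list means the clip falls inside the recommended window; otherwise, trim or re-record the reference before cloning.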

Pipeline: ASR → Clone

from strands_omnivoice import omnivoice_transcribe, omnivoice_clone

# Transcribe the reference clip, then pull the transcript out of the tool result.
t = omnivoice_transcribe(audio_path="/tmp/ref.wav")
ref_text = t["content"][1]["json"]["transcript"]

omnivoice_clone(
    text="The quick brown fox jumps over the lazy dog.",
    output="/tmp/cloned.wav",
    ref_audio="/tmp/ref.wav",
    ref_text=ref_text,
)
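
The pipeline above reads the transcript from a fixed position in the result (`content[1]`). As a more defensive alternative, the sketch below scans the content blocks for the first `json` payload that carries a `transcript` key. The helper name `extract_transcript` is hypothetical, and the result shape is assumed from the indexing expression shown above rather than from a documented schema.

```python
def extract_transcript(result):
    """Return the transcript from a tool result shaped like the one above.

    Assumed shape: {"content": [..., {"json": {"transcript": "..."}}, ...]}
    """
    for block in result.get("content", []):
        # Skip anything that is not a dict with a "json" payload.
        payload = block.get("json") if isinstance(block, dict) else None
        if isinstance(payload, dict) and "transcript" in payload:
            return payload["transcript"]
    raise KeyError("no 'transcript' field found in tool result content")
```

With this helper, `ref_text = extract_transcript(t)` would replace the hard-coded index, and a missing transcript raises a clear error instead of an `IndexError`.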

Or let an agent orchestrate:

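# `agent` is assumed to be an agent instance already created with the
# omnivoice_transcribe, omnivoice_clone, and audio_play tools registered.
agent("""1. omnivoice_transcribe /tmp/ref.wav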
        2. omnivoice_clone with that transcript and text='Hello cloned world'
           output=/tmp/cloned.wav
        3. audio_play it.""")

Tuning

num_step: number of diffusion steps. Lower values (16) are faster; higher values (32+) give cleaner audio.
guidance_scale: classifier-free guidance strength. Defaults to 2.0; raise it for a stronger match to the reference voice.
speed / duration: control pacing.
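
To make the speed/quality trade-off concrete, here is a minimal sketch that bundles the values mentioned above into named presets. The helper `tuning_kwargs` and the preset names are hypothetical, and it is an assumption that omnivoice_clone accepts `num_step` and `guidance_scale` as keyword arguments. `speed` and `duration` are left out because their value ranges are not stated here.

```python
def tuning_kwargs(quality="fast", guidance_scale=2.0):
    """Build keyword arguments for the tuning parameters described above.

    Hypothetical helper: "fast" and "clean" map to the step counts
    mentioned in the text (16 and 32).
    """
    steps = {"fast": 16, "clean": 32}
    if quality not in steps:
        raise ValueError(f"unknown quality preset: {quality!r}")
    return {"num_step": steps[quality], "guidance_scale": guidance_scale}


# Possible usage (assumes omnivoice_clone takes these keyword arguments):
# omnivoice_clone(
#     text="Hello cloned world",
#     output="/tmp/cloned.wav",
#     ref_audio="/tmp/ref.wav",
#     **tuning_kwargs("clean", guidance_scale=3.0),  # 3.0 is an illustrative value
# )
```

Starting from the "fast" preset and switching to "clean" only when artifacts are audible keeps iteration quick while tuning.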