An AI-powered pipeline to transcribe, translate, and re-synthesize audio. This tool can be used to "clean" noisy speech by re-generating it with a high-quality Text-to-Speech (TTS) engine, or to perform automatic dubbing into other languages.
- Robust Transcription: Uses `faster-whisper` to extract text and precise timestamps from audio.
- Translation Options:
  - Offline NMT: Uses `argostranslate` (OpenNMT/CTranslate2) for private, offline translation between languages.
  - LLM Translation: Uses Google Gemini (`gemini-2.5-flash`) for context-aware translations (requires `GOOGLE_API_KEY`).
- Advanced Synthesis Engines:
  - F5-TTS & E2-TTS: Flow-matching engines for high-quality speech synthesis. Models are fetched automatically from the Hugging Face Hub if not found locally.
  - Chatterbox TTS: A multilingual speech synthesis engine supporting custom models (e.g., `t3_cs.safetensors` for Czech).
- Voice Cloning: Uses a high-quality sample (`--ref-audio-file`) to clone the voice of the target speaker.
- Dynamic Speed Adaptation: Automatically checks whether generated segments fit their designated time slots and applies time-stretching/speed-up dynamically to prevent overlaps. Supports `pytsmod` or `audiostretchy` for fallback time-stretching.
- Modular Workflow: Run only transcription, only translation, only synthesis, or the full end-to-end pipeline.
- Audio Slicing & Cropping: Process specific time ranges (`--time-start` and `--time-end`) or crop leading/trailing silence (`--crop`).
- Clone the repository.
- Install dependencies:
pip install -r requirements.txt
Note: For time-stretching capabilities, ensure you have either `pytsmod` or `audiostretchy` installed. For Chatterbox TTS, ensure `chatterbox-tts` is installed.
Translate English audio to Czech using Argos Translate and save the transcription data:
python voice_processor.py --input-file noisy_track.wav --output-language cs --transcribe-only

Translate English audio to Czech using Google Gemini with a context file:
export GOOGLE_API_KEY="your-api-key"
python voice_processor.py --input-file noisy_track.wav --output-language cs --translation-engine llm --translation-context-file translation-context.txt

Use a clean reference clip to clone the voice and re-generate the entire track:
python voice_processor.py --input-file noisy_track.wav \
--ref-audio-file clean_sample.wav \
--ref-text-file clean_sample.txt

Re-synthesize audio from a pre-existing transcription YAML file (useful for iterative tuning of TTS parameters):
python voice_processor.py --synthesize-only --transcript-file transcription.yaml \
--ref-audio-file clean_sample.wav \
--ref-text-file clean_sample.txt

Use the Chatterbox multilingual TTS engine (e.g., for Czech):
python voice_processor.py --input-file noisy_track.wav \
--tts-engine chatterbox \
--ref-audio-file clean_sample.wav \
--output-language cs

- `--input-file`: Path to source audio (required unless `--synthesize-only` or `--translate-only` is specified).
- `--output-file`: Output path (default: `regenerated_track.wav`).
- `--transcript-file`: Path to save/load metadata (default: `transcription.yaml`).

- `--input-language`: Language code of source audio (e.g., `en`, `cs`). If not provided for transcription, Whisper will attempt auto-detection.
- `--output-language`: Target language code for translation.
- `--translation-engine`: Choose between `nmt` (offline Argos Translate) or `llm` (Google Gemini); default: `nmt`.
- `--translation-context-file`: Path to a text file containing context to guide the LLM during translation.

- `--whisper-model`: Whisper model size (`base`, `small`, `medium`, `large-v3`; default: `base`).
- `--transcribe-only`: Run only transcription and exit.
- `--translate-only`: Run only translation on an existing transcript file and exit.
- `--synthesize-only`: Run only synthesis on an existing transcript file and exit.

- `--tts-engine`: The synthesis engine to use (`f5-tts` or `chatterbox`; default: `f5-tts`).
- `--ref-audio-file`: Reference audio file for voice cloning (required by both TTS engines).
- `--ref-text-file`: Path to a `.txt` file containing the text spoken in the reference audio (used by F5-TTS).
- `--f5-hf-repo`: Hugging Face repo ID for F5-TTS (default: `SWivid/F5-TTS`).
- `--f5-model-type`: Choose the `F5-TTS` or `E2-TTS` variant.
- `--checkpoint-freq`: Save intermediate audio every N segments (default: `100`; `0` to disable).

- `--base-speed`: Baseline synthesis speed multiplier (default: `1.0`). Increase if the target language has more syllables.
- `--max-speed`: Maximum allowed speedup ratio (default: `1.4`). Segments requiring more speedup to fit are marked `CRITICAL`.
- `--time-start`: Start time in seconds for processing a slice of the audio (default: `0`).
- `--time-end`: End time in seconds for processing a slice of the audio.
- `--crop`: Trim leading and trailing silence from the final output using time slice boundaries.
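
To make the interaction between `--base-speed` and `--max-speed` concrete, here is a minimal sketch of how a per-segment speed could be chosen. It is an illustration under stated assumptions, not the code in `voice_processor.py`: the names `SpeedPlan` and `plan_speed` are hypothetical, and `generated_s` is assumed to be the clip duration at 1.0x speed.

```python
from dataclasses import dataclass


@dataclass
class SpeedPlan:
    speed: float    # speed multiplier to apply to the generated clip
    critical: bool  # True when the clip cannot fit even at max_speed


def plan_speed(generated_s: float, slot_s: float,
               base_speed: float = 1.0, max_speed: float = 1.4) -> SpeedPlan:
    """Pick a speed multiplier so a generated clip fits its time slot.

    Hypothetical helper: the real tool may compute this differently.
    """
    if generated_s <= 0 or slot_s <= 0:
        raise ValueError("durations must be positive")
    # Speed-up needed for the clip to fit exactly into the slot.
    required = generated_s / slot_s
    # Never go slower than the baseline, never faster than the cap.
    speed = min(max(required, base_speed), max_speed)
    return SpeedPlan(speed=speed, critical=required > max_speed)
```

Under these assumptions, a 3.0 s clip in a 2.5 s slot needs a 1.2x speed-up and fits, while a 4.0 s clip in a 2.0 s slot would need 2.0x, so it is capped at 1.4x and flagged as critical.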
The project includes utility scripts to optimize and normalize transcripts in YAML format before synthesis.
`normalize_transcription.py` re-segments a YAML transcription file so that each output segment aligns with a natural sentence (split at periods, `.`).
- Splitting: Timestamps for split segments are linearly interpolated by character count, and a configurable gap is inserted between them.
- Merging: Short segments that belong to the same sentence are merged.
- Usage:
python normalize_transcription.py input.yaml -o output.yaml --max-words-in-segment 30 --gap-for-splitting 1.0
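
The splitting rule can be sketched as follows. This is only an illustration of linear interpolation by character count, not the script's actual code. The function name `split_segment` and the `start`/`end`/`text` keys are assumptions about the transcript schema, and the gap is assumed to be carved out of the segment's own time span (dropping to zero if the span is too short to hold it).

```python
import re


def split_segment(start: float, end: float, text: str, gap: float = 1.0):
    """Split one segment at periods; interpolate timestamps by character count."""
    # Keep each period attached to the sentence it ends.
    sentences = [s.strip() for s in re.findall(r"[^.]+\.?", text) if s.strip()]
    if len(sentences) <= 1:
        return [{"start": start, "end": end, "text": text.strip()}]

    total_gap = gap * (len(sentences) - 1)
    speech = (end - start) - total_gap
    if speech <= 0:
        # Span too short to hold the gaps: fall back to no gap at all.
        gap, speech = 0.0, end - start

    total_chars = sum(len(s) for s in sentences)
    out, cursor = [], start
    for sentence in sentences:
        # Each sentence gets a share of the time proportional to its length.
        duration = speech * len(sentence) / total_chars
        out.append({
            "start": round(cursor, 3),
            "end": round(cursor + duration, 3),
            "text": sentence,
        })
        cursor += duration + gap
    return out
```

Note that this naive period split would also break decimals such as "3.5"; a real implementation would need to guard against that.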
`optimize_transcription_for_tts.py` compacts a YAML transcription file by merging short segments. This helps TTS engines that struggle with very short clips.
- Merging rules (in priority order):
  - Break at long gaps between segments (`--min-gap-for-break`, default 3s).
  - Break at sentence-ending punctuation (`.`, `?`, `!`).
  - Break at pause punctuation (`,`, `;`, `:`) when approaching the word limit.
  - Break when segment would exceed `--max-words-in-segment`.
  - Keep single oversized segments as-is.
- Usage:
python optimize_transcription_for_tts.py input.yaml -o output.yaml --max-words-in-segment 20 --min-gap-for-break 3.0
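
One way to read the merging rules is as a greedy pass over consecutive segments. The sketch below is an interpretation, not the script's implementation. The `start`/`end`/`text` keys are an assumed schema, `merge_segments` is a hypothetical name, and `near_limit` (the fraction of the word limit that counts as "approaching" it) is an invented parameter.

```python
SENTENCE_END = (".", "?", "!")
PAUSE = (",", ";", ":")


def merge_segments(segments, max_words=20, min_gap=3.0, near_limit=0.8):
    """Greedily merge consecutive segments, breaking by the priority rules."""
    merged = []
    current = None
    for seg in segments:
        if current is None:
            current = dict(seg)
            continue
        cur_words = len(current["text"].split())
        new_words = len(seg["text"].split())
        tail = current["text"].rstrip()
        must_break = (
            seg["start"] - current["end"] >= min_gap       # 1. long gap
            or tail.endswith(SENTENCE_END)                 # 2. sentence end
            or (tail.endswith(PAUSE)
                and cur_words >= near_limit * max_words)   # 3. pause near limit
            or cur_words + new_words > max_words           # 4. word limit
        )
        if must_break:
            merged.append(current)
            current = dict(seg)
        else:
            current["text"] = f'{tail} {seg["text"].strip()}'
            current["end"] = seg["end"]
    if current is not None:
        merged.append(current)
    return merged
```

Because this pass only merges and never splits, a single segment that already exceeds the word limit passes through unchanged, which matches the last rule.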
`clean_yaml.py` filters out empty or noisy segments (e.g., dot-only `.` segments hallucinated by Whisper) from a transcription YAML file.
- Backups: Automatically creates a backup of the original file (e.g., `transcription.yaml.bak`) before processing.
- Usage:
python clean_yaml.py transcription.yaml
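
The filter-and-backup behaviour can be approximated with the standard library alone. This is a sketch under assumptions rather than the contents of `clean_yaml.py`. The helper names are hypothetical, segments are assumed to be dictionaries with a `text` key, and "noisy" is assumed to mean "contains no letters or digits". YAML loading and saving are left out.

```python
import shutil
from pathlib import Path


def is_noise(text: str) -> bool:
    """True for text with no letters or digits, e.g. '', '.', or ' ... '."""
    return not any(ch.isalnum() for ch in text)


def drop_noise_segments(segments):
    """Return only the segments whose text carries real content."""
    return [seg for seg in segments if not is_noise(seg.get("text") or "")]


def backup_file(path: str) -> Path:
    """Copy `path` to `path` + '.bak' and return the backup path."""
    src = Path(path)
    dst = src.with_name(src.name + ".bak")
    shutil.copy2(src, dst)
    return dst
```

Calling `backup_file` before overwriting the transcript mirrors the backup step described above.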