ScriptSmith-Engine turns a plain text script into a fully edited, narrated YouTube video — automatically.
Write your script as a .txt file, then run a small pipeline of scripts that:
- Narrates it with Gemini TTS and transcribes the result into timestamped segments.
- Plans and generates scene images with AI (MiniMax for scene planning, Runware/GPT Image for the artwork), synced to those timestamps.
- Adds sound effects, automatically picked and placed by AI and sourced from Freesound.
- Assembles everything into a final video, syncing images and audio on a timeline.
The goal is to go from "I have a script" to "I have a video" with minimal manual editing — each step can also be run independently, so you can swap in your own audio, images, or SFX at any stage.
run_pipeline.py → narration audio + timestamps
run_image_generate.py → scene plan + AI images
run_sfx.py → sound effects mixed into audio
run_editor.py → final video
git clone https://github.com/TheUnknown550/ScriptSmith-Engine.git
cd ScriptSmith-Engine
python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt
cp .env.example .env
python run_pipeline.py
python run_image_generate.py
python run_sfx.py
python run_editor.py
Fill in your API keys in .env before running the pipeline (see APIs Required below). Each script can also be run on its own; see the per-part sections further down for options.
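Since the four steps are separate scripts, they can also be chained from a small driver. The following is a minimal sketch, not something shipped with the repository: `build_commands` and `run_all` are hypothetical helper names, and it assumes each script exits with a non-zero status when it fails.

```python
import subprocess
import sys

# Script names and their order are taken from the pipeline overview above.
STEPS = [
    "run_pipeline.py",
    "run_image_generate.py",
    "run_sfx.py",
    "run_editor.py",
]


def build_commands(python=sys.executable):
    """Return one argv list per pipeline step, in run order."""
    return [[python, script] for script in STEPS]


def run_all():
    """Run every step in order and stop at the first failure."""
    for cmd in build_commands():
        print("Running:", " ".join(cmd))
        # check=True raises CalledProcessError on a non-zero exit code
        # (assumption: the scripts report failure that way).
        subprocess.run(cmd, check=True)
```

Call `run_all()` from the repository root with the virtual environment active. Flags such as `--skip-tts` can be appended to the matching entry returned by `build_commands()`.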
This project relies on a few external services. You'll need an API key for each:
| Service | Used for | Get a key |
|---|---|---|
| Google Gemini | Text-to-speech narration (GOOGLE_API_KEY) | aistudio.google.com |
| MiniMax | Scene planning + sound effect planning (MINIMAX_API_KEY) | minimax.io |
| Runware | AI scene image generation (RUNWARE_API_KEY) | runware.ai |
| Freesound | Sound effect search and download (FREESOUND_API_KEY, FREESOUND_CLIENT_ID, FREESOUND_CLIENT_SECRET) | freesound.org/apiv2/apply |
Also required locally:
- FFmpeg: audio/video processing (must be on your PATH)
- A CUDA-capable GPU is recommended for transcription (faster-whisper), but not required
.\.venv\Scripts\activate
pip install -r requirements.txt
Copy .env.example to .env and fill in your API keys:
cp .env.example .env
GOOGLE_API_KEY= # Gemini TTS
MINIMAX_API_KEY= # Scene planning + SFX planning
RUNWARE_API_KEY= # Image generation
FREESOUND_API_KEY= # SFX search and download
FREESOUND_CLIENT_ID= # Freesound app client ID
FREESOUND_CLIENT_SECRET=
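A quick preflight check can catch a blank key or a missing FFmpeg before any API call is made. This is a sketch under stated assumptions: the helper names `missing_keys` and `ffmpeg_on_path` are hypothetical, and it assumes the values from .env have already been loaded into the process environment.

```python
import os
import shutil

# Key names are copied from the .env listing above.
REQUIRED_KEYS = [
    "GOOGLE_API_KEY",
    "MINIMAX_API_KEY",
    "RUNWARE_API_KEY",
    "FREESOUND_API_KEY",
    "FREESOUND_CLIENT_ID",
    "FREESOUND_CLIENT_SECRET",
]


def missing_keys(env):
    """Return the required key names that are unset or blank in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


def ffmpeg_on_path():
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None
```

For example, `missing_keys(os.environ)` returns an empty list when every key is set.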
Reads script.txt, generates narration with Gemini TTS, and transcribes it into per-segment timestamps.
python run_pipeline.py
For slightly slower pacing:
python run_pipeline.py --pace 0.90
Reuse existing audio and only regenerate timestamps:
python run_pipeline.py --skip-tts
Outputs:
- output/audio/full.wav
- output/transcripts/segments.json
- output/transcripts/segments.txt
- output/transcripts/segments.srt
The script is split into 1–6 TTS requests depending on word count (roughly 250 words per request, capped at 6 total). The audio for each chunk is loudness-normalised and crossfaded at the joins so the result sounds like one continuous recording.
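The chunking rule can be sketched as follows. The 250-word and 6-request figures come from the paragraph above; rounding up and splitting on word boundaries into near-equal chunks are assumptions for illustration (the engine may well split on sentence boundaries instead), and `request_count` and `split_script` are hypothetical names.

```python
import math

WORDS_PER_REQUEST = 250  # "roughly 250 words per request"
MAX_REQUESTS = 6         # "capped at 6 total"


def request_count(word_count):
    """Number of TTS requests for a script with word_count words (1 to 6)."""
    needed = math.ceil(word_count / WORDS_PER_REQUEST)
    return max(1, min(MAX_REQUESTS, needed))


def split_script(text):
    """Split text into request_count(...) chunks of near-equal word count."""
    words = text.split()
    n = request_count(len(words))
    base, extra = divmod(len(words), n)
    chunks, start = [], 0
    for i in range(n):
        # The first `extra` chunks each take one additional word.
        end = start + base + (1 if i < extra else 0)
        chunks.append(" ".join(words[start:end]))
        start = end
    return chunks
```

Under these assumptions a 600-word script becomes three chunks of 200 words, and anything above 1,500 words still uses six requests with longer chunks.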
Uses MiniMax M3 to read transcript segments and write one image prompt per scene. Images are generated via Runware (GPT Image 2).
Build only the scene plan (no images generated, fast):
python run_image_generate.py --plan-only
Build scene plan and generate all images:
python run_image_generate.py
Reuse existing scene plan and generate images only:
python run_image_generate.py --generate-only
Generate one random scene to test image quality:
python run_image_generate.py --generate-only --test
Outputs:
- output/image_plan/scene_plan.json
- output/image_plan/scene_prompts.txt
- output/images/<timestamp>.png
Image filenames are timestamp-only (e.g. 01-40-700.png) so the editor can read them directly without needing the transcript.
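As an illustration of the naming scheme, the sketch below turns a scene start time into a filename. It assumes the pattern is minutes-seconds-milliseconds (MM-SS-mmm), which is inferred from the single example 01-40-700.png and not confirmed elsewhere in this README; `timestamp_filename` is a hypothetical helper.

```python
def timestamp_filename(seconds, ext=".png"):
    """Format a start time in seconds as MM-SS-mmm plus a file extension."""
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(rem_ms, 1000)
    return f"{minutes:02d}-{secs:02d}-{millis:03d}{ext}"
```

With this assumption, a scene starting at 100.7 seconds maps to 01-40-700.png. How the engine names scenes past the one-hour mark is not stated, so this sketch simply lets the minutes field grow.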
To control image quality and cost, set one of the following in .env:
RUNWARE_IMAGE_QUALITY=low # $0.006/image ← default
RUNWARE_IMAGE_QUALITY=medium # $0.053/image
RUNWARE_IMAGE_QUALITY=high # $0.211/image
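Because the planner writes one image prompt per scene, a rough cost estimate is the scene count times the per-image price. The sketch below uses the prices listed above, which may change on Runware's side; `estimate_cost` is a hypothetical helper, not part of the project.

```python
# Per-image prices in USD, copied from the quality settings above.
PRICE_PER_IMAGE = {"low": 0.006, "medium": 0.053, "high": 0.211}


def estimate_cost(num_scenes, quality="low"):
    """Approximate image cost in USD, assuming one image per scene."""
    return round(num_scenes * PRICE_PER_IMAGE[quality], 3)
```

For a 40-scene plan this gives about $0.24 at low, $2.12 at medium, and $8.44 at high quality.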
If MINIMAX_API_KEY is missing, the planner falls back to a local heuristic.
Uses MiniMax to decide which scenes get a sound effect, searches Freesound for each, downloads the audio, and mixes it into the narration at the correct timestamp.
Plan SFX and download files (no mixing yet — good for reviewing choices):
python run_sfx.py --plan-only
Full run (plan, download, and mix):
python run_sfx.py
Reuse existing SFX plan and only re-mix:
python run_sfx.py --mix-only
Outputs:
- output/sfx_plan/sfx_plan.json: which scenes get SFX, what was found, volume levels
- output/sfx/*.mp3: downloaded Freesound previews (cached, reused on reruns)
- output/audio/full_with_sfx.wav: narration with SFX mixed in
How density is controlled:
- 1 SFX per ~20 seconds of video maximum
- MiniMax targets hard scene changes, major reveals, the first scene, and the last scene
- Narration stays at full volume; SFX are layered underneath at 0.15–0.55 volume
- Leading silence is stripped from each SFX file so the hit lands exactly on the scene change
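The rules above can be sketched in a few lines. The 20-second budget and the 0.15–0.55 volume range come from the list; rounding the budget down, the silence threshold of 0.01, and all three helper names are assumptions made for illustration, and samples are treated as plain floats in the range -1.0 to 1.0.

```python
import math

SECONDS_PER_SFX = 20                    # "1 SFX per ~20 seconds of video maximum"
MIN_VOLUME, MAX_VOLUME = 0.15, 0.55     # SFX volume range under the narration


def max_sfx_count(duration_seconds):
    """Upper bound on the number of SFX for a video of the given length."""
    return math.floor(duration_seconds / SECONDS_PER_SFX)


def clamp_volume(volume):
    """Keep an SFX volume inside the 0.15-0.55 range."""
    return max(MIN_VOLUME, min(MAX_VOLUME, volume))


def strip_leading_silence(samples, threshold=0.01):
    """Drop samples before the first one whose magnitude exceeds threshold."""
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            return samples[i:]
    return samples[:0]  # the whole clip is silent
```

Under these assumptions a 95-second video allows at most four effects, and a requested volume of 0.9 is pulled down to 0.55.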
To adjust volumes or swap a sound, edit output/sfx_plan/sfx_plan.json and rerun python run_sfx.py --mix-only.
Stitches scene images into a video synced to the narration audio.
Using the AI-generated images (default):
python run_editor.py
Using images from a custom folder:
python run_editor.py --images "D:\path\to\img"
Using the SFX-mixed audio:
python run_editor.py --audio output\audio\full_with_sfx.wav
Full run with custom images and SFX audio:
python run_editor.py --images "D:\path\to\img" --audio output\audio\full_with_sfx.wav
Output:
output/video/final_video.mp4
The editor auto-detects whether your images have timestamp filenames (e.g. 01-40-700.png) and uses them as timeline anchors. If they don't, it falls back to pairing images to transcript segments in order. GPU encoding (h264_nvenc) is used automatically if available.
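The detection and fallback described above might look roughly like this. It is a sketch, not the editor's actual code: the MM-SS-mmm pattern is inferred from the example filename, the rule that every image must match before timestamps are used is an assumption, "in order" is interpreted as sorted by filename, and `parse_timestamp` and `place_images` are hypothetical names.

```python
import re
from pathlib import Path

# Assumed pattern: minutes-seconds-milliseconds, e.g. 01-40-700.
TIMESTAMP_NAME = re.compile(r"^(\d{2,})-(\d{2})-(\d{3})$")


def parse_timestamp(filename):
    """Return the start time in seconds, or None if the name is not a timestamp."""
    match = TIMESTAMP_NAME.match(Path(filename).stem)
    if match is None:
        return None
    minutes, seconds, millis = (int(group) for group in match.groups())
    return minutes * 60 + seconds + millis / 1000


def place_images(filenames, segment_starts):
    """Pair each image with a start time (in seconds) on the timeline."""
    parsed = [parse_timestamp(name) for name in filenames]
    if filenames and all(t is not None for t in parsed):
        # Timestamp filenames: use them directly as timeline anchors.
        return sorted(zip(parsed, filenames))
    # Fallback: pair images with transcript segment start times in order.
    return list(zip(segment_starts, sorted(filenames)))
```

In the fallback branch, `zip` stops at the shorter input, so extra images or extra segments are silently left unpaired in this sketch.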
The examples/ folder contains a real sample run end-to-end:
- examples/script.txt: a complete sample input script
- examples/scene_plan.json: the AI-generated scene plan for the opening scenes (output of Part 2)
- examples/final_video_preview.mp4: a short preview clip of the final rendered video (output of Part 4)
Copy examples/script.txt to script.txt to try the full pipeline yourself.
MIT — see LICENSE.