AI Captions · Local Whisper · No API Costs
Add Captions to Your Video
Drop your video — Whisper transcribes the audio and burns captions directly into the frames. Four caption styles, every aspect ratio, no per-video fee. Your audio never leaves our server.
Unlimited captions · Cancel anytime
How it works
Drop your video
.mp4 · .mov · .m4v · .webm
→
Whisper transcribes
Local · ~1.3× realtime · ~95% accuracy
→
Download captioned MP4
Burned in · plays everywhere
Why use this
Captions that look right and sound right
Whisper-quality transcription
OpenAI Whisper (small.en model) runs locally on our server via whisper.cpp. ~95% word accuracy on clear English audio. Handles accents, background music, and natural speech well.
Four caption styles
Subtle (small bottom-center, default), Bold (large white-on-black for TikTok/Reels), Cinematic (italic, raised), and Block (semi-opaque background). All scale at 720p, 1080p, and 4K.
OCR fallback for animations
If your video has no spoken audio (animation with kinetic typography, marketing video with text overlays), Tesseract reads the on-screen text and uses that as the caption track instead.
Your audio stays private
Whisper runs entirely on our box. Your audio is never sent to OpenAI, Google, AssemblyAI, or any other transcription service. The temporary audio extract is deleted at the end of your render.
Any aspect ratio
One-click platform presets for TikTok, Reels, YouTube Shorts, YouTube, Instagram, and Twitter. Captions position themselves correctly for each aspect ratio, with no manual tweaking.
Flat rate, unlimited
$5/month for unlimited captioned renders. No per-video fee, no per-minute charge, no API surcharge. Most paid caption services charge $0.10–$0.50 per minute, so at $5/mo you break even after 10–50 minutes of video per month.
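The break-even range is plain division. A quick check, using only the prices quoted on this page (the function name is just for illustration):

```python
# Break-even point for a flat monthly fee versus per-minute billing.
# Figures from the text: $5/month flat, $0.10 to $0.50 per minute elsewhere.

def break_even_minutes(flat_fee: float, per_minute_rate: float) -> float:
    """Minutes of video per month at which the flat fee equals per-minute billing."""
    return flat_fee / per_minute_rate

FLAT_FEE = 5.00
low = break_even_minutes(FLAT_FEE, 0.50)   # against the priciest per-minute rate
high = break_even_minutes(FLAT_FEE, 0.10)  # against the cheapest per-minute rate
print(f"Break-even: {low:.0f} to {high:.0f} minutes per month")
```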
FAQ
Adding Captions to Video — Common Questions
How does Animation Machine generate captions?
We run OpenAI's Whisper (small.en model, ~487MB) locally on our server using whisper.cpp, the optimized C++ port. The model transcribes your video's audio and produces an SRT subtitle track. We then burn those captions directly into the video frames with FFmpeg's libass-based subtitles filter, so they are permanently visible wherever the video plays, with no separate caption file required.
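For the curious, the three steps described above can be sketched as command lines. This is a minimal illustration, not our production code: the file names, the model path, the `whisper-cli` binary name, and the AAC audio re-encode are assumptions, and it expects simple file names without characters that need escaping in an FFmpeg filter.

```python
from pathlib import Path

def build_pipeline_commands(video: str,
                            model: str = "models/ggml-small.en.bin",  # assumed path
                            whisper_bin: str = "whisper-cli") -> list[list[str]]:
    """Return the commands for: extract audio -> transcribe to SRT -> burn in."""
    stem = Path(video).stem
    wav, srt, out = f"{stem}.wav", f"{stem}.srt", f"{stem}.captioned.mp4"

    # 1. whisper.cpp expects 16 kHz mono 16-bit PCM WAV input.
    extract = ["ffmpeg", "-y", "-i", video, "-vn",
               "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", wav]

    # 2. -osrt writes an SRT file; -of sets the output path without extension.
    transcribe = [whisper_bin, "-m", model, "-f", wav, "-osrt", "-of", stem]

    # 3. The subtitles filter renders the SRT into the frames via libass.
    burn = ["ffmpeg", "-y", "-i", video, "-vf", f"subtitles={srt}",
            "-c:a", "aac", out]

    return [extract, transcribe, burn]

for cmd in build_pipeline_commands("clip.mov"):
    print(" ".join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.run(cmd, check=True)` in order.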
How accurate are the captions?
Whisper is one of the strongest open-source ASR (automatic speech recognition) models available. For clear English audio, the small.en model reaches about 95% word accuracy. It handles accents, background music, and natural speech well. Quality drops on heavily distorted audio or unusual jargon, but typical YouTube-style narration transcribes cleanly.
Can I customize how the captions look?
Yes. There are four built-in caption styles: Subtle (small bottom-center, default), Bold (large white-on-black TikTok-style at 42pt with thick outline), Cinematic (italic with raised margin), and Block (semi-opaque dark background pill). All four scale with the output resolution, from 720p to 4K.
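One way styles like these can be expressed is through the `force_style` option of FFmpeg's subtitles filter, which overrides ASS style fields. The sketch below is an assumption about how such a mapping might look. Only the 42pt Bold size comes from the description above; every other value is a hypothetical placeholder.

```python
# Hypothetical style table: ASS style fields passed to libass via force_style.
# Only FontSize=42 for "bold" is taken from the text; other values are guesses.
CAPTION_STYLES = {
    # Alignment=2 is bottom-center in ASS numbering.
    "subtle": {"FontSize": 18, "Alignment": 2, "Outline": 1},
    # White text (&H00FFFFFF) with a thick black outline (&H00000000).
    "bold": {"FontSize": 42, "Bold": 1, "Outline": 3,
             "PrimaryColour": "&H00FFFFFF", "OutlineColour": "&H00000000"},
    # MarginV raises the caption above the bottom edge.
    "cinematic": {"FontSize": 24, "Italic": 1, "MarginV": 60},
    # BorderStyle=3 draws a box behind the text; the colour here is
    # half-transparent black in ASS &HAABBGGRR notation (assumed value).
    "block": {"FontSize": 24, "BorderStyle": 3, "OutlineColour": "&H80000000"},
}

def subtitles_filter(srt_path: str, style: str) -> str:
    """Build the -vf argument for a named style."""
    overrides = ",".join(f"{k}={v}" for k, v in CAPTION_STYLES[style].items())
    # The single quotes keep the commas from splitting the filter graph.
    return f"subtitles={srt_path}:force_style='{overrides}'"

print(subtitles_filter("clip.srt", "bold"))
```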
Does this work for animations or videos with no spoken audio?
Yes. When Whisper finds no usable speech, the system automatically falls back to OCR (Tesseract) and reads on-screen text from your video frames instead. So a kinetic-typography marketing animation gets captioned from its visible text, even though there's no spoken narration.
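As a rough illustration of the fallback logic, the sketch below decides whether a transcript contains usable speech and, if not, turns per-frame OCR text into timed caption cues. The word threshold, the marker pattern, and both function names are assumptions for illustration; the OCR text is assumed to have already been read from sampled frames.

```python
import re

# Bracketed or parenthesised non-speech markers, e.g. "[BLANK_AUDIO]" or "(music)".
NON_SPEECH = re.compile(r"\[[^\]]*\]|\([^)]*\)")

def has_usable_speech(transcript: str, min_words: int = 3) -> bool:
    """True if the transcript has at least min_words once markers are removed."""
    spoken = NON_SPEECH.sub(" ", transcript)
    return len(spoken.split()) >= min_words

def ocr_frames_to_cues(frames, frame_interval: float):
    """Merge consecutive frames showing the same text into (start, end, text) cues.

    frames: list of (timestamp_seconds, ocr_text), sorted by timestamp.
    """
    cues = []
    for ts, text in frames:
        text = " ".join(text.split())  # normalise whitespace
        if not text:
            continue  # nothing on screen in this frame
        if cues and cues[-1][2] == text and abs(cues[-1][1] - ts) < 1e-9:
            # Same text as the frame just before: extend the current cue.
            cues[-1] = (cues[-1][0], ts + frame_interval, text)
        else:
            cues.append((ts, ts + frame_interval, text))
    return cues

transcript = "[BLANK_AUDIO]"
frames = [(0.0, "SALE"), (1.0, "SALE"), (2.0, ""), (3.0, "50% OFF")]
source = "whisper" if has_usable_speech(transcript) else "ocr"
print(source, ocr_frames_to_cues(frames, 1.0))
```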
Are my videos uploaded to third parties for transcription?
No. Whisper runs entirely on our server. Your audio is never sent to OpenAI, Google, AssemblyAI, Deepgram, or any other transcription service. The temporary audio extract is deleted at the end of your render.
What input formats are supported?
MP4, MOV, M4V, and WebM video files. You can also caption HTML animations (Claude Design exports, Lottie, GSAP, CSS) — the OCR fallback reads on-screen text since these typically don't have audio.
How long does captioning take?
Whisper transcription runs at roughly 1.3× realtime on our 2-core CPU box, so a 30-second video adds about 23–25 seconds to the render. The caption burn-in step adds another ~5–10 seconds. Total overhead is typically 30–60 seconds for short videos.
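A back-of-the-envelope estimate, assuming "1.3× realtime" means 1.3 seconds of audio processed per second of wall-clock time and using the 5–10 second burn-in range quoted above (the function name is illustrative):

```python
def estimate_caption_overhead(duration_s: float,
                              realtime_factor: float = 1.3,
                              burn_in_range_s: tuple[float, float] = (5.0, 10.0)):
    """Return (low, high) extra render seconds for a video of duration_s."""
    transcribe_s = duration_s / realtime_factor  # faster than realtime when > 1
    return (transcribe_s + burn_in_range_s[0],
            transcribe_s + burn_in_range_s[1])

low, high = estimate_caption_overhead(30)
print(f"30 s video: about {low:.0f} to {high:.0f} s of extra render time")
```

Actual times vary with server load and audio content, so treat the result as an order-of-magnitude guide.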
How much does it cost?
$5 per month for unlimited renders, including unlimited captions. No per-video fee, no per-minute fee, no API surcharges. Cancel anytime.
Also available