Add captions to X (Twitter) videos
X (Twitter) videos run wide: 16:9, up to 1920 × 1080, and up to two minutes twenty seconds long. The tool above transcribes yours on your own device, lays the words on the frame at the timing they were actually spoken, and exports a captioned MP4 or a subtitle file. No account, no upload, no watermark. Drop a clip in and the words appear; everything below explains how to make them land on a feed that plays your video before anyone has decided to listen.
Captioning a video built for the timeline
X is the rare place where the video and the text live side by side, and the timeline doesn't slow down for either. A clip sits between a quote-tweet and a screenshot, framed wide at 16:9 so it fills the column on desktop and forms a clean landscape band on mobile. People come to X to read, then occasionally to watch, so captions aren't decoration here. They're how your video joins the conversation it's posted into.
The tool above is built for exactly that handoff. Drop in your file — up to the 2 minute 20 second ceiling X allows — and an on-device speech model writes out every word with the moment it was spoken. From there you place the captions on the frame, tune the look, and export. Nothing about the clip leaves your machine while you work, and nothing about the result asks for a login when you're done.
Why muted autoplay makes captions the whole point
A video on the timeline starts playing the instant it scrolls into view, and it starts silent. That's the condition every X clip is born into: motion with no sound, competing against text that's already legible. If the first second can't be read, it reads as nothing, and the thumb keeps moving.
Captions flip that. They turn a muted, ambiguous rectangle into something a passing reader can parse without reaching for the volume — a claim, a punchline, a number, a name. The point isn't only accessibility, though it's that too; on X the caption is often the only version of your audio anyone receives. Get the opening line on screen, in time, and the clip earns the half-second of attention it needs to be worth un-muting.
Word-perfect timing, ninety-nine languages, fully editable
The transcription runs on a speech model that detects the spoken language on its own and aligns each word individually, not line by line. A caption doesn't dump a whole sentence and wait; it tracks the voice word for word, which keeps fast talking and quick cuts from drifting out of sync. It handles around ninety-nine languages, and if it guesses wrong or you want the captions in a different language, you can set the language and re-generate.
Then every word is yours to fix. The transcript is editable text: correct a name, a handle, or a piece of jargon the model hadn't met, trim a filler word, and the caption on the video updates with the same timing. You're not re-typing subtitles from scratch — you're proofreading a draft that already knows when each word lands.
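To make the idea concrete, here is a minimal sketch of a word-timed transcript in Python. It is an illustration only, not the tool's actual data model: the `Word` class and the `correct_word` and `group_into_lines` helpers are hypothetical names, and the sample words and timings are made up. It shows why fixing a word leaves the timing alone, and how a words-per-line setting could split the same transcript into caption lines.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    """One transcribed word (hypothetical structure, for illustration)."""
    text: str
    start: float  # seconds from the start of the clip
    end: float    # seconds from the start of the clip


def correct_word(words: list[Word], index: int, new_text: str) -> list[Word]:
    """Return a copy of the transcript with one word's text replaced.

    Only the text changes; start and end are carried over untouched.
    """
    fixed = list(words)
    fixed[index] = replace(words[index], text=new_text)
    return fixed


def group_into_lines(words: list[Word], words_per_line: int) -> list[list[Word]]:
    """Split the transcript into caption lines of at most words_per_line words."""
    if words_per_line < 1:
        raise ValueError("words_per_line must be at least 1")
    return [words[i:i + words_per_line] for i in range(0, len(words), words_per_line)]


# Made-up sample: the model misheard a handle as a plain word.
transcript = [
    Word("follow", 0.00, 0.32),
    Word("at", 0.32, 0.45),
    Word("acme", 0.45, 0.90),
    Word("today", 0.90, 1.30),
]

edited = correct_word(transcript, 2, "@acme_hq")
lines = group_into_lines(edited, words_per_line=2)
```

Because each word carries its own start and end, an edit is a text swap rather than a re-sync, and regrouping into shorter or longer lines never has to touch the timestamps.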
It runs in your browser — the file never leaves your device
This is not a service you upload to. The AI model downloads to your browser once, then runs locally on your own hardware, so your video is processed where it already is. There's no server in the loop, no copy sitting in someone's storage bucket, and nothing from your footage is ever used to train a model. That's a literal guarantee, not a privacy paragraph.
For X that matters more than it might sound. Plenty of what gets posted there is reactive and unreleased — a clip from a call, a draft announcement, something embargoed until the tweet goes live. Because the work happens entirely on your device, you can caption any of it without that file touching the open internet a moment before you choose to publish.
Four caption styles for a wide, scrollable frame
X's 16:9 shape gives you a wide canvas and a short attention budget, so the caption has to read in a glance without burying the picture. There are four animated styles to match the clip: Karaoke, where each word lights up as it's said; Highlighted, which boxes the active word in a color; Minimal, one clean word at a time; and Dynamic, one word with a small pop. Karaoke and Highlighted suit talking-head and interview clips where you want the pace visible; Minimal and Dynamic keep a busier or more graphic shot uncluttered.
Everything about the look is adjustable. Choose a typeface — Inter, Montserrat, Oswald, Lora, or JetBrains Mono — set its weight and size, and pick text and highlight colors, an outline, and a shadow so the words hold up over any background. Position is yours too: top, center, or bottom, nudged to taste, with control over words per line. On X, sitting the captions a little above the very bottom edge keeps them clear of the player's scrubber and the engagement row.
Exporting at the right size for X
Once the captions are placed, export a finished MP4 with the words burned into the picture — the safest route for X, since baked-in text plays for everyone, on every client, under autoplay, with no setting to toggle. Render optimized for sharing to keep the upload light, or at source quality when you want the full 1920 × 1080 to survive the platform's re-encode. It's hardware-accelerated, and there's no watermark on the way out.
Keep the frame at 16:9 so X presents it full-width instead of letterboxing it. If you'd rather hand the platform its own subtitle track, download an SRT or VTT and attach it to the upload. Either way you can fine-tune any line's timing before you export — pull a caption a few frames earlier so the opening reads the instant the muted clip starts. Every style, language, and export option here is free, for everyone, with nothing gated at the finish line.
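For readers curious what the subtitle-file route involves, the sketch below builds SRT text from timed captions in Python. It assumes captions are available as `(start, end, text)` tuples in seconds, which is an assumption about the data and not a description of how the tool works internally; `srt_timestamp` and `to_srt` are hypothetical helper names. For simplicity the `shift` parameter moves every caption by the same amount, where the tool itself lets you adjust individual lines.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = max(0, round(seconds * 1000))  # clamp negative times to zero
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[tuple[float, float, str]], shift: float = 0.0) -> str:
    """Build SRT text from (start, end, text) cues, optionally shifted in time.

    A negative shift pulls every cue earlier; times are clamped at zero.
    """
    blocks = []
    for number, (start, end, text) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n"
            f"{srt_timestamp(start + shift)} --> {srt_timestamp(end + shift)}\n"
            f"{text}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


# Made-up sample cues, pulled 100 ms earlier so the opening line lands sooner.
cues = [
    (0.20, 1.40, "Here's the number"),
    (1.40, 2.75, "nobody is talking about."),
]
srt_text = to_srt(cues, shift=-0.1)
```

VTT follows the same idea with a `WEBVTT` header line and a period in place of the comma in each timestamp.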
Questions
For reliability, yes. Burning the captions into the MP4 means they show for every viewer the moment the clip autoplays, with no setting to enable — which fits how X plays video silently in the timeline. Export the burned-in MP4 for that. If you'd rather give X a separate subtitle track, the tool also exports SRT and VTT to attach to your upload.
Big enough to read at a glance on a phone, since most timeline viewing happens on mobile. On a 1920 × 1080 frame, a heavy weight with a clear outline or shadow holds up against any background. Keep lines short, a few words each, and place them a little above the bottom edge so they clear the player controls.
X presents timeline video in 16:9 landscape at up to 1920 × 1080 and allows clips up to 2 minutes 20 seconds. Keep your export at 16:9 so X shows it full-width instead of letterboxing. The tool handles the full length without trimming.
No. The AI model downloads to your browser once and then runs locally, so the file is processed on your own device and never sent to a server. Nothing from your footage is stored or used to train anything — useful when you're captioning a clip before the tweet is even live.
Yes. The model auto-detects the spoken language across roughly ninety-nine of them, and you can override it or re-generate the captions in a different language. Every word stays editable afterward, so you can fix a name or a handle and keep the original timing.
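The format figures quoted on this page (16:9, up to 1920 × 1080, at most 2 minutes 20 seconds) can be turned into a quick pre-upload check. The Python sketch below is one way to do that; `check_for_x` is a hypothetical helper, the 1% aspect-ratio tolerance is an arbitrary choice, and the limits are simply the numbers stated above, not values read from X's current upload rules.

```python
from math import isclose

# Limits as stated on this page; treat them as assumptions, not an official spec.
MAX_SECONDS = 140                   # 2 minutes 20 seconds
MAX_WIDTH, MAX_HEIGHT = 1920, 1080
TARGET_RATIO = 16 / 9


def check_for_x(width: int, height: int, duration: float) -> list[str]:
    """Return a list of problems; an empty list means the clip fits the stated limits."""
    if width <= 0 or height <= 0:
        return ["width and height must be positive"]

    problems = []
    # Allow a small tolerance so sizes that round slightly off 16:9 still pass.
    if not isclose(width / height, TARGET_RATIO, rel_tol=0.01):
        problems.append(f"frame is {width}x{height}, which is not 16:9")
    if width > MAX_WIDTH or height > MAX_HEIGHT:
        problems.append(f"frame exceeds {MAX_WIDTH}x{MAX_HEIGHT}")
    if duration > MAX_SECONDS:
        problems.append(f"clip runs {duration:.1f}s, over the {MAX_SECONDS}s ceiling")
    return problems


full_hd = check_for_x(1920, 1080, 140)     # landscape clip at the stated maximums
vertical = check_for_x(1080, 1920, 150.0)  # portrait clip that also runs long
```

Returning a list of messages instead of a single pass/fail flag makes it easy to show every issue at once, for example a vertical clip that is also too long.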