How to Add Captions and Transcripts to Video
Captions and transcripts make audio and video accessible — and WCAG requires them. Here is what each one is, what the rules say, and how to add them properly.
Captions and transcripts are how you make audio and video usable by people who cannot hear it — and WCAG requires them. In short: captions display all speech and meaningful sound, synchronised with the video, for deaf and hard-of-hearing viewers; a transcript is the full text of the content; subtitles (strictly) assume you can hear and only translate dialogue; and audio description narrates important visuals for blind viewers. WCAG requires captions for prerecorded video at Level A (Success Criterion 1.2.2), a transcript or equivalent for prerecorded audio-only content (1.2.1), and audio description for video at Level AA (1.2.5). You add captions with a caption file — usually WebVTT or SRT — attached via the HTML5 <track> element or uploaded to your video host. This guide explains each piece, the exact requirements, and how to do it well, including why auto-captions always need editing.
It builds on our guides to what web accessibility is and why it matters and to what WCAG is and the difference between A, AA and AAA, since these requirements live in WCAG's Perceivable principle.
Captions, subtitles, transcripts and audio description
People use these words loosely, but for accessibility the distinctions matter, because they meet different needs. Here is the precise breakdown:
| Term | What it is | Who it is for |
|---|---|---|
| Captions | Synchronised on-screen text of all dialogue plus meaningful non-speech audio ([applause], [phone rings], speaker names) | Deaf and hard-of-hearing viewers, and anyone watching with sound off |
| Subtitles | Synchronised text of dialogue only, often a translation, assuming the viewer can hear | Viewers who can hear but do not understand the spoken language |
| Transcript | The full text of the content — all speech and relevant sound/visual information — as readable text, not synced to playback | People who prefer or need to read; also search engines |
| Audio description | An extra narration track describing important visual information (actions, scene changes, on-screen text) during natural pauses | Blind and low-vision viewers who cannot see the visuals |
The most common confusion is captions versus subtitles. Because they look similar on screen, people treat them as synonyms — but subtitles, strictly defined, omit sound effects and speaker identification because they assume you can hear the audio and just need the words in another language. Captions are what accessibility requires, because a deaf viewer needs to know not only what is said but that a phone is ringing, that ominous music is playing, or which of two off-screen people is speaking. (You will often see "closed captions", meaning captions the viewer can turn on or off, as opposed to "open captions" burned permanently into the video.)
What WCAG actually requires
The relevant rules sit under WCAG's Perceivable principle, in Guideline 1.2 (Time-based Media). The key success criteria, with their levels:
- 1.2.1 Audio-only and Video-only (Prerecorded) — Level A. For prerecorded audio-only content (a podcast, an audio clip), provide a transcript (or other text alternative). For prerecorded video-only content (a silent clip), provide a text or audio description of the visuals.
- 1.2.2 Captions (Prerecorded) — Level A. For prerecorded video that has audio, provide captions. This is the core, baseline requirement and it sits at Level A, the most basic tier — meaning a site cannot even claim minimal conformance without it.
- 1.2.3 Audio Description or Media Alternative (Prerecorded) — Level A. Provide either audio description or a full text alternative for prerecorded video.
- 1.2.4 Captions (Live) — Level AA. Provide captions for live audio content in synchronised media.
- 1.2.5 Audio Description (Prerecorded) — Level AA. Provide audio description for prerecorded video content.
The practical summary for most teams aiming at WCAG 2.2 Level AA (the standard target, as we explain in what WCAG is and the difference between A, AA and AAA): your prerecorded videos need accurate captions and audio description, your podcasts and audio clips need a transcript, and any live video with audio needs live captions (1.2.4 covers live audio in synchronised media, not audio-only streams). Captions at Level A are the non-negotiable floor.
This table maps the common media types to what you must provide:
| Media type | Minimum requirement | WCAG criterion |
|---|---|---|
| Prerecorded video with audio | Captions (and audio description for AA) | 1.2.2 (A), 1.2.5 (AA) |
| Prerecorded audio only (podcast) | Transcript / text alternative | 1.2.1 (A) |
| Prerecorded video only (silent clip) | Text or audio description of visuals | 1.2.1 (A) |
| Live video with audio | Live captions | 1.2.4 (AA) |
| Decorative/background video with no audio info | No captions needed, but ensure it does not auto-play disruptively and can be paused | 2.2.2 Pause, Stop, Hide (A) |
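If you track media in a content inventory or checklist script, the mapping above can be kept as a small lookup. The following is only a sketch in Python: the `Media` enum, the `REQUIREMENTS` table and the `required` function are names invented for illustration, and the entries simply restate the criteria listed in this section.

```python
from enum import Enum


class Media(Enum):
    PRERECORDED_VIDEO_WITH_AUDIO = "prerecorded video with audio"
    PRERECORDED_AUDIO_ONLY = "prerecorded audio only"
    PRERECORDED_VIDEO_ONLY = "prerecorded video only"
    LIVE_VIDEO_WITH_AUDIO = "live video with audio"


# (criterion, level, what to provide), restating the list and table above.
REQUIREMENTS = {
    Media.PRERECORDED_VIDEO_WITH_AUDIO: [
        ("1.2.2", "A", "captions"),
        ("1.2.3", "A", "audio description or full text alternative"),
        ("1.2.5", "AA", "audio description"),
    ],
    Media.PRERECORDED_AUDIO_ONLY: [
        ("1.2.1", "A", "transcript or other text alternative"),
    ],
    Media.PRERECORDED_VIDEO_ONLY: [
        ("1.2.1", "A", "text or audio description of the visuals"),
    ],
    Media.LIVE_VIDEO_WITH_AUDIO: [
        ("1.2.4", "AA", "live captions"),
    ],
}


def required(media: Media, target: str = "AA") -> list[str]:
    """List what a media type needs to meet the target level ("A" or "AA")."""
    allowed = {"A": {"A"}, "AA": {"A", "AA"}}[target]
    return [
        f"{criterion} ({level}): {what}"
        for criterion, level, what in REQUIREMENTS[media]
        if level in allowed
    ]


for item in required(Media.PRERECORDED_VIDEO_WITH_AUDIO):
    print(item)
```

Passing `"A"` as the target drops the AA-only items, which makes the gap between the Level A floor and the usual AA target easy to see for each media type.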
Caption file formats: WebVTT and SRT
Captions are stored in a caption file: a plain-text file that pairs lines of text with the timestamps at which they should appear and disappear. Two formats dominate:
- WebVTT (.vtt): Web Video Text Tracks, the format designed for the web and the one used with the HTML5 <track> element. It supports positioning, basic styling and metadata. A cue looks like a start and end time (00:00:04.000 --> 00:00:07.500) followed by the caption text.
- SRT (.srt): SubRip, an older, extremely widely supported format. It is simpler (numbered cues with timestamps and text) and is accepted by most video platforms, though it lacks WebVTT's styling features.
Both are human-readable and editable in any text editor, which matters because you will often need to correct them by hand. For self-hosted HTML5 video, prefer WebVTT. For hosted platforms, upload whichever of .vtt or .srt the platform accepts.
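To make the difference between the two formats concrete, here is a minimal sketch of an SRT-to-WebVTT conversion in Python. It assumes well-formed SRT input with blank lines between cues. The function name `srt_to_vtt` and the sample captions are made up for illustration. The two changes it makes are the ones that matter in practice: a WebVTT file starts with a `WEBVTT` header line, and its timestamps use a full stop rather than a comma before the milliseconds.

```python
import re

# Matches an SRT timing line such as "00:00:04,000 --> 00:00:07,500".
SRT_TIMING = re.compile(
    r"(\d{2}:\d{2}:\d{2}),(\d{3}) --> (\d{2}:\d{2}:\d{2}),(\d{3})"
)


def srt_to_vtt(srt_text: str) -> str:
    """Convert well-formed SRT caption text to WebVTT text."""
    # Drop a byte-order mark if present and normalise line endings.
    text = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n{2,}", text):
        lines = block.split("\n")
        # SRT cues start with a numeric counter; WebVTT does not need it.
        if lines and lines[0].strip().isdigit():
            lines = lines[1:]
        if not lines:
            continue
        # WebVTT uses "." before the milliseconds where SRT uses ",".
        lines[0] = SRT_TIMING.sub(r"\1.\2 --> \3.\4", lines[0])
        cues.append("\n".join(lines))
    # A WebVTT file begins with the "WEBVTT" header followed by a blank line.
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


sample_srt = """1
00:00:04,000 --> 00:00:07,500
[phone rings]

2
00:00:08,000 --> 00:00:11,250
MAYA: Sorry, I need to take this.
"""

print(srt_to_vtt(sample_srt))
```

A conversion like this only changes the container. It does not fix wrong words or missing sounds, so the captions still need a human review.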
Adding captions to HTML5 video with <track>
If you host your own video with the HTML5 <video> element, you attach captions using the <track> element. The pattern is straightforward: a track with kind="captions", pointing at your .vtt file, with a language and a label.
Conceptually, a <video> element contains a <source> for the video file and one or more <track> children. Each track declares kind (captions, subtitles, descriptions, chapters or metadata), src (the path to the .vtt file), srclang (the language code, e.g. en), and label (the human-readable name shown in the player's menu, e.g. "English"). Adding the default attribute to a track tells the browser to enable it by default. You can include several track elements, for example captions in multiple languages, and the browser's native player exposes them in its captions menu.
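The key points to get right: use kind="captions" (not subtitles) for accessibility captions so the intent is correct; provide an accurate srclang and a clear label; and make sure your host serves the .vtt file correctly, ideally with the text/vtt content type and from the same origin as the page (a cross-origin caption file needs CORS headers and the crossorigin attribute on the video element). The browser then renders the captions over the video and lets the user toggle them.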
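As an illustration of that pattern, the sketch below generates the markup from a list of caption files. Python is used only as a neutral scripting language here; on a real page you would normally write the HTML by hand or in your templates. The `Track` class, the `video_markup` function, the file paths and the `video/mp4` type are all placeholders chosen for the example.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Track:
    src: str                 # path to the .vtt file (placeholder paths below)
    srclang: str             # language code, e.g. "en"
    label: str               # name shown in the player's captions menu
    kind: str = "captions"   # use "captions" for accessibility captions
    default: bool = False    # enable this track by default


def video_markup(video_src: str, tracks: list[Track]) -> str:
    """Build <video> markup with one <track> child per caption file."""
    lines = [
        "<video controls>",
        f'  <source src="{escape(video_src)}" type="video/mp4">',
    ]
    for t in tracks:
        attrs = (
            f'kind="{escape(t.kind)}" src="{escape(t.src)}" '
            f'srclang="{escape(t.srclang)}" label="{escape(t.label)}"'
        )
        if t.default:
            attrs += " default"
        lines.append(f"  <track {attrs}>")
    lines.append("</video>")
    return "\n".join(lines)


print(video_markup("talk.mp4", [
    Track("captions/talk.en.vtt", "en", "English", default=True),
    Track("captions/talk.de.vtt", "de", "Deutsch", kind="subtitles"),
]))
# <video controls>
#   <source src="talk.mp4" type="video/mp4">
#   <track kind="captions" src="captions/talk.en.vtt" srclang="en" label="English" default>
#   <track kind="subtitles" src="captions/talk.de.vtt" srclang="de" label="Deutsch">
# </video>
```

In this example the English track is the accessibility caption track and is on by default, while the German track is marked as subtitles because it is a translation for viewers who can hear the audio.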
Using hosted video platforms
Most sites embed video through a hosting platform rather than self-hosting, and the major platforms all support captions — but the responsibility to make them accurate is yours. Typically you either upload a prepared caption file (.srt or .vtt) or use the platform's editor to create and time captions. Platforms also generate automatic captions, which brings us to the most important caveat in this whole topic.
Auto-captions: a starting point, never the finish
Automatic speech recognition has improved dramatically, and auto-generated captions are genuinely useful — they save enormous time by producing a first draft. But auto-captions are not accessible on their own, and publishing them unedited is a mistake. They reliably get things wrong:
- Misheard words, especially homophones and uncommon vocabulary.
- Names and technical terms mangled or misspelled.
- Missing or wrong punctuation, which changes meaning and makes captions hard to read.
- No speaker identification, so dialogue between people becomes an unattributed blur.
- No non-speech sounds — the [door slams] and [phone rings] that a deaf viewer needs are simply absent.
- Errors during music, accents, fast speech or background noise.
Inaccurate captions can be worse than none, because they actively mislead the viewer. (Disability advocates sometimes call sloppy auto-captions "craptions" for exactly this reason.) The correct workflow is therefore: generate auto-captions to save time, then review and correct them — fix the words, add punctuation, label speakers, and include meaningful sounds — before you publish. Editing a draft is far faster than captioning from scratch, so you get the speed benefit without sacrificing accuracy.
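Part of that review can be triaged mechanically before a person reads the captions against the audio. The sketch below scans a WebVTT file for three of the warning signs listed above: no bracketed non-speech sounds, no speaker labels, and cues without punctuation. It is a rough heuristic and nothing more. The function name `review_hints`, the "NAME:" speaker-label convention, the square-bracket convention for sounds and the 50% threshold are all assumptions made for this example, and a clean result does not mean the words are correct.

```python
import re

# WebVTT timing line; the hours part is optional in the format.
TIMING = re.compile(
    r"^(?:\d{2,}:)?\d{2}:\d{2}\.\d{3} --> (?:\d{2,}:)?\d{2}:\d{2}\.\d{3}"
)
SOUND = re.compile(r"\[[^\]]+\]")               # e.g. [door slams]
SPEAKER = re.compile(r"^[A-Z][A-Za-z .'-]*:")   # e.g. "MAYA:" (assumed style)
PUNCTUATION = re.compile(r"[.?!,;:\]]")


def review_hints(vtt_text: str) -> list[str]:
    """Return hints that a caption file may be unedited auto-captions."""
    cue_texts = []
    for block in re.split(r"\n{2,}", vtt_text.replace("\r\n", "\n").strip()):
        lines = block.split("\n")
        # Everything after the timing line is the cue's text.
        for i, line in enumerate(lines):
            if TIMING.match(line):
                cue_texts.append(" ".join(lines[i + 1:]).strip())
                break
    hints = []
    if not any(SOUND.search(t) for t in cue_texts):
        hints.append("No bracketed non-speech sounds found.")
    if not any(SPEAKER.match(t) for t in cue_texts):
        hints.append("No speaker labels found.")
    unpunctuated = sum(1 for t in cue_texts if t and not PUNCTUATION.search(t))
    if cue_texts and unpunctuated / len(cue_texts) > 0.5:
        hints.append("Most cues contain no punctuation.")
    return hints


raw_draft = """WEBVTT

00:00:04.000 --> 00:00:07.500
sorry i need to take this

00:00:08.000 --> 00:00:11.250
okay no problem
"""

for hint in review_hints(raw_draft):
    print("-", hint)
```

A file that triggers all three hints is almost certainly a raw draft. A single-speaker video with no meaningful sounds may legitimately trigger the first two, which is why the output is a list of things to check and not a verdict.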
Transcripts: the text alternative
A transcript is the full content of the media rendered as text. For audio-only content like a podcast, a transcript is the WCAG requirement (1.2.1, Level A) — there is nothing to caption, so the text alternative carries the content. For video, captions are the core requirement, but offering a transcript as well is strong practice:
- It serves people who simply prefer to read or to skim, and people who find audio difficult to process.
- It is fully searchable, both on your page and by search engines.
- For deaf-blind users relying on a braille display, a transcript can be read at their own pace.
- If it also describes the key visual information, it serves as a full text alternative for the video, which is one way to satisfy 1.2.3.
A good transcript includes all dialogue, speaker labels, and descriptions of important sounds and (for video) key visual information, so that someone reading it gets the full content without the media.
The SEO and universal-usability bonus
Captions and transcripts are one of those accessibility features with a large side benefit, which makes them easy to justify. Because search engines cannot watch a video or listen to audio, the text you provide — captions and especially transcripts — is the crawlable content that tells them what the media is about. That improves discoverability, can surface your content for more queries, and feeds the same machine-readability that helps AI search systems (a theme in how to check if your site is ready for AI search).
And the universal usability case is just as strong: a large share of social and web video is watched with the sound off — on public transport, in open-plan offices, in bed beside a sleeping partner. Captions serve all of those people, not only deaf and hard-of-hearing viewers. Like the curb cut built for wheelchairs that helps everyone with a suitcase, captions designed for accessibility quietly improve the experience for a huge mainstream audience.
A practical workflow
Putting it together, a reliable process for accessible media:
- Identify your media and its type — prerecorded video with audio, audio-only, live — to know which requirements apply.
- Generate a first-draft caption file with an auto-caption tool or platform feature to save time.
- Review and correct every caption: words, punctuation, speaker labels, and meaningful non-speech sounds. This step is mandatory.
- Choose your delivery: WebVTT with <track> for self-hosted HTML5 video, or upload .srt/.vtt to your platform.
- Add a transcript on the page (always for audio-only; as a bonus for video).
- Add audio description for prerecorded video to reach AA, narrating key visuals in the natural pauses.
- Test it — turn on the captions, mute the audio, and confirm the captions alone convey the full content, then check the transcript and the player's keyboard operability.
That last step ties into broader testing: confirming media controls are operable is part of how to make a website keyboard accessible and a line item in how to run an accessibility audit.
Common mistakes
The recurring errors are easy to name: publishing raw auto-captions without editing; confusing subtitles with captions and so omitting sound effects and speaker labels; providing no transcript for podcasts (a direct Level A failure); forgetting audio description and so missing AA on video; burning open captions into a video when closed captions the user can toggle would be better; and auto-playing video with sound, which is disruptive for everyone and especially for screen-reader users. Each is straightforward to avoid once you know the distinctions this guide lays out.
The bottom line
Captions render all speech and meaningful sound for people who cannot hear; transcripts give the full content as text; subtitles only translate dialogue; and audio description narrates visuals for people who cannot see. WCAG requires captions for prerecorded video at Level A (1.2.2), a transcript for prerecorded audio (1.2.1), and audio description for video at AA (1.2.5). Deliver captions with a WebVTT or SRT file via the HTML5 <track> element or your video host, always edit auto-captions for accuracy, and add transcripts for the accessibility, SEO and universal-usability win. For the standard behind these rules, see what WCAG is and the difference between A, AA and AAA; for the wider case, what web accessibility is and why it matters.
Frequently asked questions
What is the difference between captions and subtitles?
Captions are written for people who cannot hear the audio: they include all spoken dialogue plus meaningful non-speech sounds such as [door slams], [music playing] and speaker identification. Subtitles, in the strict sense, assume the viewer can hear but may not understand the language, so they translate or transcribe dialogue only and omit sound effects. In everyday use the terms are blurred, but for accessibility you specifically need captions, because they convey the full audio experience, not just the words.
Does WCAG require captions on video?
Yes. WCAG Success Criterion 1.2.2 (Captions, Prerecorded) requires captions for prerecorded video that has audio, at Level A — the most basic tier. Separately, 1.2.1 covers prerecorded audio-only and video-only content (an audio file needs a transcript; a silent video needs a description), and 1.2.5 (Audio Description, Prerecorded) requires audio description for prerecorded video at Level AA. Live audio content has its own criterion (1.2.4, captions for live audio, at AA). Meeting WCAG AA therefore means captions and audio description for your prerecorded video.
What is a transcript and when do I need one?
A transcript is the full text of an audio or video's content — all speech and relevant sound or visual information — presented as readable text on the page or as a linked document. WCAG requires a transcript (or equivalent) for prerecorded audio-only content such as a podcast at Level A. For video, captions are the core requirement, but providing a transcript as well is good practice: it helps people who prefer reading, is fully searchable, and gives search engines crawlable text.
Are automatic captions good enough?
Not on their own. Auto-captions from video platforms and speech-to-text tools are a helpful first draft, but they regularly get words wrong, drop punctuation, misspell names and technical terms, and fail to mark speaker changes or sound effects. Inaccurate captions can be worse than none because they mislead. The correct workflow is to generate auto-captions to save time, then review and edit them for accuracy, punctuation, speaker labels and meaningful sounds before publishing.
What caption file formats should I use?
The two common formats are WebVTT (.vtt) and SubRip (.srt). WebVTT is the format designed for the web and is used with the HTML5 <track> element; it supports styling and positioning. SRT is a simple, widely supported format that most video platforms accept. Both are plain-text files that pair lines of caption text with start and end timestamps. For self-hosted HTML5 video, use WebVTT with <track>; for hosted platforms, upload the .vtt or .srt file the platform supports.