Packaging Subtitles

Unified Packager allows you to package and prepare your subtitles for streaming delivery (using statically packaged files, or dynamic packaging with Origin):

General workflow for adding subtitles to a stream

Whether you are preparing your subtitles for streaming delivery using static packaging with Packager or dynamic packaging with Origin, the general rule is that subtitles have to be packaged in a fMP4 container (.ismt or .cmft) before they can be added to a stream. All styling information and editorial changes should be made before packaging, using the relevant encoder or subtitle tooling. When packaged in a fMP4 container, adding a subtitle track to a stream works the same as adding audio or video tracks:

  • For streaming delivery using statically packaged files, add the .ismt or .cmft with subtitles to your mp4split input when generating the client manifest (.mpd or .m3u8)
  • For streaming delivery using dynamic packaging with Origin for VOD, add the .ismt or .cmft with subtitles to your mp4split input when generating the server manifest (.ism)
  • For streaming delivery using dynamic packaging with Origin for Live, the encoder should POST the subtitles one track per language to the publishing point

This page explains how to package your subtitles in a fMP4 container. Example command-lines for adding fMP4-packaged subtitles to different kinds of streams can be found in the relevant parts of the documentation, listed below:


The three exceptions to the general rule that you need to package your subtitles in a fMP4 container before you can add them to a stream are:

Supported formats for subtitles

You can use TTML (Timed Text Markup Language), WebVTT (Web Video Text Tracks) or SRT (SubRip Text) as your source and use Packager to convert between these formats, as well as to package TTML and WebVTT in a fMP4 container.

For more information on these different formats, please read our blog about subtitles: Welcome to the jungle: caption and subtitle formats in video streaming. In short, WebVTT and SRT are nearly identical formats in plain-text, whereas TTML is XML-based.

New in version 1.10.16.

In addition to the above it is possible to extract subtitles from a CEA-608 embedded captions track, and store them as TTML or WebVTT.

Source                     Possible outputs
WebVTT (or SRT)            TTML, WebVTT in fMP4
TTML                       WebVTT, TTML in fMP4
CEA-608 embedded captions  TTML, WebVTT

Supported TTML profiles

The TTML specification defines the use of profiles. Each profile specifies a certain feature set. You can learn more about these profiles and their features in our blog about subtitles: Welcome to the jungle: caption and subtitle formats in video streaming. Packager can package TTML subtitles that follow any of the following profiles: DFXP, SMPTE-TT, EBU-TT-D, SDP-US, CFF-TT and the IMSC1 Text Profile. Unified Origin supports all of those profiles as well.

Difference between WebVTT and SRT

WebVTT is based on SRT and the two are very similar, with only small differences in formatting. The most important difference is that WebVTT has an official specification, recommended by the W3C, which allows for more advanced formatting features (such as positioning).
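Because the differences are largely mechanical (a WEBVTT header, a period instead of a comma before the milliseconds, no numeric cue counters), a basic SRT-to-WebVTT conversion can be sketched in a few lines. This is purely illustrative and ignores advanced cue settings and styling; it is not how mp4split performs the conversion:

```python
def srt_to_webvtt(srt_text: str) -> str:
    """Convert basic SRT to WebVTT: add the WEBVTT header, switch the
    decimal comma in timestamps to a period, and drop the numeric cue
    counters. Advanced styling and cue settings are not handled."""
    lines = []
    for line in srt_text.splitlines():
        if "-->" in line:
            # SRT timestamps use a comma before milliseconds; WebVTT uses a period
            line = line.replace(",", ".")
        elif line.strip().isdigit():
            # Drop the bare numeric cue counter SRT puts above each timestamp
            continue
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines)

srt = "1\n00:00:15,000 --> 00:00:18,000\nAt the left we can see...\n"
print(srt_to_webvtt(srt))
```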

When using WebVTT or SRT as input for mp4split, do consider that:

  • For SRT, mp4split assumes the input file is encoded in ASCII unless it starts with a Byte Order Marker (BOM) that describes how the input should be transformed to Unicode
  • For WebVTT, mp4split always interprets the input files as being encoded as Unicode (regardless of any BOM), because WebVTT is UTF-8 by definition
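The two encoding rules above can be mimicked with a small helper: honour a BOM for SRT (otherwise assume ASCII), but always decode WebVTT as UTF-8. A hypothetical sketch of this logic (mp4split's actual implementation may differ):

```python
import codecs

def decode_subtitle(raw: bytes, fmt: str) -> str:
    """Decode subtitle bytes following the rules above:
    WebVTT is UTF-8 by definition; SRT is ASCII unless a BOM says otherwise."""
    if fmt == "webvtt":
        # Always UTF-8, regardless of any BOM ("utf-8-sig" strips one if present)
        return raw.decode("utf-8-sig")
    # SRT: a leading BOM selects the Unicode transformation to use
    for bom, encoding in ((codecs.BOM_UTF8, "utf-8"),
                          (codecs.BOM_UTF16_LE, "utf-16-le"),
                          (codecs.BOM_UTF16_BE, "utf-16-be")):
        if raw.startswith(bom):
            return raw[len(bom):].decode(encoding)
    return raw.decode("ascii")  # no BOM: assume ASCII
```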


Neither WebVTT nor SRT contains signaling for the language of the subtitles in the file. Therefore, always specify the language when using WebVTT or SRT as input for mp4split (using the --track_language command-line option); otherwise, the signaled language defaults to English.

Packaging TTML, WebVTT or SRT in fMP4

When you use Packager to package your subtitles in a fMP4 container, we follow ISO 14496-30 in almost all cases. This results in the following:

  • When using WebVTT (or SRT) as input, the resulting fMP4 will use the wvtt codec
  • When using TTML as input, the resulting fMP4 will use the stpp codec

There are only two exceptions to this rule, which are related to packaging TTML and explained in the relevant section below.

When packaging subtitles in a fMP4 container, the following options may be relevant:

  • When you need to add language signaling (required for WebVTT or SRT) or overrule it (if the source does not contain language signaling and you do not add any, English is the default): --track_language.
  • When you need to define a 'role' for the subtitles track, or want to add signaling for an accessibility feature: --track_role and --track_kind.
  • When you want to specify the duration of the fragments in which the subtitles are stored in the fMP4 to align it with the fragment duration of the other media in your stream: --fragment_duration (the default for all formats is to create a fragment for each separate subtitle cue).
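To illustrate what --fragment_duration changes: by default every cue becomes its own fragment, while a fixed duration groups cues by the fragment window their start time falls into. A simplified illustration of that grouping (not mp4split's actual algorithm), with cue start times in seconds:

```python
def group_cues(cue_starts, fragment_duration=None):
    """Group cue start times into fragments. Without a fragment_duration,
    each cue is its own fragment (the default behaviour described above)."""
    if fragment_duration is None:
        return [[t] for t in cue_starts]
    fragments = {}
    for t in cue_starts:
        # Index of the fixed-duration window this cue starts in
        fragments.setdefault(int(t // fragment_duration), []).append(t)
    return [fragments[k] for k in sorted(fragments)]

cues = [15.0, 18.2, 20.1, 62.5, 119.9]
print(group_cues(cues))      # default: one fragment per cue
print(group_cues(cues, 60))  # cues grouped into 60-second fragments
```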

WebVTT (or SRT) in fMP4

New in version 1.7.31.

To create a fMP4 with subtitles that are formatted according to the wvtt codec, use WebVTT (or SRT) subtitles as input. Whether the input is WebVTT or SRT makes no functional difference, but you should always specify the language of the track that you are packaging (using --track_language), because WebVTT and SRT files do not contain language signaling. Specifying a fragment duration that fits well with the other tracks in the stream is recommended too (using --fragment_duration), as the default is to use a variable fragment size where each subtitle cue equals a fragment:


mp4split -o tears-of-steel-wvtt-nl.ismt \
  --fragment_duration=60/1 \
  tears-of-steel-nl.webvtt --track_language=nl

mp4split -o tears-of-steel-wvtt-de.ismt \
  --fragment_duration=60/1 \
  tears-of-steel-de.webvtt --track_language=de


Packaging WebVTT in fMP4 directly, instead of relying on Unified Origin to generate WebVTT fragments from a fMP4 with TTML formatted subtitles, preserves WebVTT-specific cue settings that define individual subtitle positioning, region and styling information.

TTML in fMP4

To create a fMP4 with subtitles that are formatted according to the stpp codec, use TTML subtitles as input: [1]


mp4split -o tears-of-steel-ttml-nl.ismt \
  tears-of-steel-nl.ttml
This command creates a file with a single track, which is why the TTML input file should contain only one language. If you have a single TTML file that contains multiple languages then you will have to extract separate TTML files for each language first.

As already noted above, there are two exceptions to take into account when packaging TTML in fMP4:

  • When you use SMPTE-TT formatted TTML with bitmaps as your input, the samples in the fMP4 are automatically formatted according to the SMPTE-TT specification
  • When you are statically packaging HTTP Smooth Streaming (Packaging for HTTP Smooth Streaming (HSS)), you should use command-line option --brand=piff to ensure that the older dfxp codec is used, so that the timing of the @begin and @end attributes in the resulting fMP4 is relative to the start of each sample, instead of relative to the start of the track


The distinction between the stpp and dfxp codec is only relevant for statically packaged content. When you are working with Unified Origin, timing will be adjusted automatically if necessary.
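To make the timing difference concrete: with stpp, the @begin and @end attributes are relative to the start of the track, while with the older dfxp brand they are relative to the start of each sample. A toy illustration of the rebasing involved (not mp4split code; times in seconds):

```python
def rebase_cue_times(cues, fragment_start):
    """Rebase track-relative cue times (stpp-style) to times relative to
    the start of the containing sample/fragment (dfxp-style)."""
    return [(begin - fragment_start, end - fragment_start)
            for begin, end in cues]

# Cues inside a fragment that starts at t=60s on the track timeline
print(rebase_cue_times([(62.5, 65.0), (66.0, 69.5)], 60.0))
# dfxp-style result: [(2.5, 5.0), (6.0, 9.5)]
```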

Converting WebVTT (or SRT) to TTML

When you convert WebVTT or SRT to TTML, the TTML will have a default styling and layout that in general should work well (see the overview of supported cue components below). To convert WebVTT or SRT to TTML, use a WebVTT or SRT file as input and specify an output with .ttml or .dfxp as the extension.


mp4split -o tears-of-steel-nl.ttml \
  tears-of-steel-nl.webvtt --track_language="nl"

mp4split -o tears-of-steel-fr.ttml \
  tears-of-steel-fr.webvtt --track_language="fr"

Supported cue components

When converting WebVTT or SRT to TTML, only a limited set of markup features is converted to their TTML equivalents. Others are either ignored or escaped (see the example below). The markup features that will be converted are the following:

Name     Description
<b></b>  Bolds the textual content
<i></i>  Italicises the text
<u></u>  Underlines the textual content
<s></s>  Specifies a line strike through on the text

Here is an example of a regular WebVTT file with some cue point component elements:

WebVTT cue point example:


WEBVTT

00:00:15.000 --> 00:00:18.000
At the <u>left</u> we can see...

00:00:18.167 --> 00:00:20.083 position:35% line:20 align:left
At the <u>right</u> we can see the...

00:00:20.083 --> 00:00:22.000
...the <c.highlight>head-snarlers</c>

00:00:22.000 --> 00:00:24.417
Everything is safe.
<i>Perfectly</i> safe.

Result after converting to TTML:

<?xml version="1.0" encoding="utf-8"?>
<tt xmlns="..." xml:lang="en">
  <body>
    <div xml:lang="en">
      <p begin="00:00:15.000" end="00:00:18.000" region="speaker">
        At the <span tts:textDecoration="underline">left</span> we can see...
      </p>
      <p begin="00:00:18.167" end="00:00:20.083" region="speaker">
        At the <span tts:textDecoration="underline">right</span> we can see the...
      </p>
      <p begin="00:00:20.083" end="00:00:22.000" region="speaker">
        ...the &lt;c.highlight&gt;head-snarlers&lt;/c&gt;
      </p>
      <p begin="00:00:22.000" end="00:00:24.417" region="speaker">
        Everything is safe.<br />
        <span tts:fontStyle="italic">Perfectly</span> safe.
      </p>
    </div>
  </body>
</tt>


The cue settings (cue 2) are ignored when converting to TTML, and unrecognized styling in the payload is escaped (cue 3).
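The conversion behaviour shown above (supported tags mapped to TTML spans, everything else escaped) can be sketched as follows. This is an illustrative approximation, not mp4split's implementation:

```python
import re
from xml.sax.saxutils import escape

# WebVTT/SRT tags with a TTML equivalent (see the table above)
TTML_STYLES = {
    "b": 'tts:fontWeight="bold"',
    "i": 'tts:fontStyle="italic"',
    "u": 'tts:textDecoration="underline"',
    "s": 'tts:textDecoration="lineThrough"',
}

def cue_to_ttml(payload: str) -> str:
    """Map supported cue tags to TTML spans; escape unrecognized markup."""
    out = []
    for token in re.split(r"(</?[^>]+>)", payload):
        m = re.fullmatch(r"<(/?)([^>]+)>", token)
        if m and m.group(2) in TTML_STYLES:
            if m.group(1):  # closing tag
                out.append("</span>")
            else:
                out.append(f"<span {TTML_STYLES[m.group(2)]}>")
        else:
            out.append(escape(token))  # unrecognized markup is escaped
    return "".join(out)

print(cue_to_ttml("At the <u>left</u> we can see..."))
print(cue_to_ttml("...the <c.highlight>head-snarlers</c>"))
```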


[1] To create text samples it is important that Unified Packager can derive correct timing information from the TTML source. While the TTML spec is liberal (and sometimes ambiguous) in this respect, Packager assumes timing in HH:MM:SS.mmm format in the @begin and @end attributes of the tt/body/div/p element. Timing on a different element under tt/body is allowed, but only at a single, consistent level. For instance, SMPTE-TT encoders may choose either tt/body/div or tt/body/div/div, but should not use both in one file.

Converting TTML to WebVTT

In general, TTML offers a lot more flexibility regarding document structure and styling of cues. When converting TTML to WebVTT, only a subset of this extra information will be maintained:

  • Bold text
  • Italicized text
  • Underlined text
  • Strike through text

Also, only explicit line breaks will be respected (<br />), meaning cues spread out over more than one paragraph (<p>) will end up on one line in WebVTT.
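The line-break rule can be illustrated with a toy sketch: explicit <br /> elements become WebVTT newlines, while all other element boundaries collapse onto one line. For simplicity this sketch drops styling tags entirely, so it does not show the bold/italic/underline/strike-through mapping listed above:

```python
import re

def ttml_to_webvtt_lines(ttml_fragment: str) -> str:
    """Flatten TTML cue text for WebVTT: only an explicit <br/> produces
    a line break; other element boundaries do not break lines."""
    text = re.sub(r"<br\s*/?>", "\n", ttml_fragment)
    text = re.sub(r"<[^>]+>", "", text)  # drop remaining tags (styling ignored)
    # Normalise whitespace within each resulting line
    return "\n".join(" ".join(line.split()) for line in text.split("\n"))

print(ttml_to_webvtt_lines("Everything is safe.<br/> <span>Perfectly</span> safe."))
```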


Converting image-based TTML to WebVTT is not supported. When using image-based TTML as an input for Origin, use dynamic track selection (see Using dynamic track selection) to filter out the image-based TTML input when requesting HLS.

Extracting embedded captions (to TTML or WebVTT)

To extract embedded captions from a video track, specify the video track carrying the embedded captions as input and specify an output with either a .webvtt or .ttml extension, depending on the format in which you want to store the extracted captions:


mp4split -o captions.ttml \
  video-with-captions.mp4 --track_type=video

mp4split -o captions.webvtt \
  video-with-captions.mp4 --track_type=video

When extracting the captions, Packager will take the language information from the video track that carries the embedded captions and add it to its output, if the specified output is TTML (for WebVTT it will not, because WebVTT does not support language signaling). To have Packager signal a different language in its TTML output, use its --track_language option:


mp4split -o captions.ttml \
  video-with-captions.mp4 --track_type=video --track_language="es"