Must Fix: alignment of Groups of Pictures (GOPs) across bitrates

Each bitrate needs to be GOP aligned, meaning that GOPs start at the same presentation times across all bitrates. This enables the player to switch between bitrates without significant degradation of the rendered video.
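
One way to verify this is to compare keyframe timestamps across renditions. The sketch below assumes ffprobe (part of FFmpeg) is available and uses a placeholder file name; run it for each bitrate and diff the outputs, which should be identical:

#!/bin/bash

# Print the presentation timestamps of all keyframes in the first video
# track; '-skip_frame nokey' makes the decoder skip everything else.
ffprobe -v error -skip_frame nokey -select_streams v:0 \
  -show_entries frame=pts_time -of csv=p=0 input-1000k.mp4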

Should Fix: fragment boundaries are aligned across all tracks (audio, video, text)

To ensure maximum compatibility, audio, video and text tracks should be perfectly aligned at all fragment boundaries. This is advantageous when multiplexing tracks within a single MPEG-2 transport stream, when creating Virtual subclips, and when using Capture without frame accuracy (i.e., without transcoding). Without perfect alignment, a virtual subclip or captured clip may contain (small) gaps of audio or video at the start or end of the clip, resulting in potential playback issues.

Generally speaking, this requires the following:

  • Video with an integer frame rate (i.e., not a drop-frame rate such as 29.97)

  • Audio with a sample rate of 48 kHz (i.e., not 44.1 kHz)

  • An encoder that supports non-integer fragment durations

What you want to achieve is a fragment duration that fits an integer number of audio and video frames (fragmenting text cues is much more flexible). To calculate this duration, it is important to know that an AAC audio frame consists of 1024 samples. This means that one frame of AAC audio with a sample rate of 48 kHz is 1024/48000 seconds long. Of course, the length of one frame of video is simply 1 divided by the frame rate (e.g., 1/25 seconds for a frame rate of 25).

So, what is a sensible fragment duration that is both a multiple of 1024/48000 and 1/25? First, we need to know the least common multiple of these two numbers. Reduced to lowest terms, 1024/48000 is 8/375, and the least common multiple of two fractions is the least common multiple of their numerators divided by the greatest common divisor of their denominators: lcm(8, 1) / gcd(375, 25) = 8/25, or 0.32 seconds. Then it's simply a matter of finding a fragment duration that fits your use case and is a multiple of this least common multiple. A sensible duration in this case would be 1.92 seconds (6 times 8/25), for example. This equals 90 audio frames and 48 video frames.
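
This calculation is easy to script. The sketch below assumes awk is available and hardcodes 48 kHz AAC audio (1024 samples per frame), 25 fps video and the (arbitrary) multiple of 6 used above; adjust the variables to match other content:

#!/bin/bash

# Compute the shortest duration that contains a whole number of both
# audio and video frames, and expand an example multiple of it.
awk -v sample_rate=48000 -v samples_per_frame=1024 -v fps=25 '
  function gcd(a, b,    t) { while (b) { t = a % b; a = b; b = t }; return a }
  BEGIN {
    # One audio frame lasts samples_per_frame/sample_rate seconds;
    # reduce that fraction to lowest terms (1024/48000 -> 8/375).
    g = gcd(samples_per_frame, sample_rate)
    an = samples_per_frame / g; ad = sample_rate / g
    # One video frame lasts 1/fps seconds. The least common multiple of
    # the two durations is lcm(an, 1) / gcd(ad, fps) = an / gcd(ad, fps).
    n = an; d = gcd(ad, fps)
    printf "shortest common duration: %d/%d = %.2f seconds\n", n, d, n / d
    m = 6   # pick any multiple that fits your use case
    printf "%.2f seconds = %d audio frames, %d video frames\n", m * n / d, m * n * sample_rate / (samples_per_frame * d), m * n * fps / d
  }'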

Finally, you configure your encoder to encode your content with a GOP length of 48 frames and, in some cases, to use a fragment duration of 1.92 seconds.
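
How to configure this differs per encoder. As an illustrative sketch, the following assumes FFmpeg with libx264 and 25 fps input, with placeholder file names; '-sc_threshold 0' disables scene-cut keyframes so that every GOP is exactly 48 frames long. Using identical keyframe settings for every bitrate also satisfies the GOP alignment requirement discussed above:

#!/bin/bash

# Encode fixed 48-frame GOPs and 48 kHz AAC audio.
ffmpeg -i mezzanine.mov \
  -c:v libx264 -r 25 -g 48 -keyint_min 48 -sc_threshold 0 \
  -c:a aac -ar 48000 \
  output.mp4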

Note

For VOD content, Unified Packager can be used to adjust the fragment duration. Text tracks can be fragmented according to whatever duration is preferred. In the case of audio, Packager can adjust this duration to any multiple of one audio frame. However, for video there is less flexibility, as Packager will only be able to change the duration to a multiple of the GOP length.

Must Fix: subtitle cues follow a sequential timeline aligned with other tracks

Subtitle cues, whether formatted as WebVTT or TTML, must be sequential and their timing must be aligned with all other tracks. This probably sounds like common sense, but this requirement is especially relevant for fragmented TTML subtitles, as these subtitles signal timestamps both on a sample (MP4) and a cue (TTML) level, where a sample can contain multiple cues.

Possible problems are erroneous cues that signal a time range that does not align with the timeline of the media in other tracks, a later cue whose time range predates earlier cues, or cues that end before they start.
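
As a (hypothetical) illustration of the latter two problems, consider the following WebVTT snippet:

WEBVTT

00:00:10.000 --> 00:00:12.000
A correct cue.

00:00:14.000 --> 00:00:13.000
Broken: this cue ends before it starts.

00:00:05.000 --> 00:00:07.000
Broken: this cue's time range predates the cues before it.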

Should Fix: in case of B-frames, use negative composition time offsets (and no edit lists)

The order in which frames need to be decoded (DTS) is not always equal to the order in which they should be presented (PTS). That's why each frame has a decode timestamp (DTS) and a presentation timestamp (PTS). B-frames are an important reason for this, as they increase encoding efficiency by relying not only on data from prior frames, but also on data from frames that are presented after the B-frame.

The PTS of a frame (or sample, which essentially means the same thing but is used more often in this context) is calculated based on its DTS. If a sample's PTS is not equal to its DTS, there is an offset. Offsetting the PTS relative to the DTS can be done using two mechanisms:

  • A track level edit list in 'moov.trak.edts.elst'

  • A sample level composition time offset (CTO) in 'moof.traf.trun'

To get a better understanding of this, first take a look at the start of a track without B-frames, where there is no need for CTOs or an edit list and the PTS of each sample is equal to its DTS:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ P ][ P ][ P ][ P ][ P ][IDR][ P ][ P ][ P ][ P ][ P ][ P ]
PTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13

The start of a track with B-frames looks very different, as the decode and presentation order of the samples can no longer be the same. The track below includes positive CTOs to account for this. They ensure that the P-frame that is to be presented fourth is decoded second, because the B-frames that are to be presented second and third rely on information from this P-frame:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ B ][ B ][ P ][ B ][ B ][IDR][ P ][ B ][ B ][ P ][ B ][ B ]
CTO   1    3    0    0    3    0    0    1    3    0    0    3    0    0
PTS   1    4    2    3    7    5    6    8    11   9    10   14   12   13

As you can see, introducing these positive CTOs in this case means that the PTS of the very first frame is no longer '0', but '1' instead. To make sure the track still starts at '0', an edit list is present as well, which in this case signals that media_time=1, or, in other words, that PTS '1' should actually be considered '0'.
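
Schematically, such an edit list looks as follows (a simplified sketch of the relevant 'elst' fields from ISO/IEC 14496-12, not actual tool output):

moov.trak.edts.elst:
  entry_count      = 1
  segment_duration = <duration of the whole presentation>
  media_time       = 1   # presentation starts at composition time '1'
  media_rate       = 1   # play at normal speed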

The problem is that this can potentially lead to sync issues, since certain packaging workflows may remove edit lists, leading to misalignment when different tracks that originally contained different edit lists are bundled together in a stream.

Furthermore, it remains open to interpretation whether the start times of all fragments, which are signaled in the fMP4's index ('mfra', or 'sidx' in the case of CMAF), should be understood as referring to DTS or PTS. As long as DTS does not equal PTS at the start of each fragment, this ambiguity is a problem.

Fortunately, these issues can be solved by introducing negative CTOs. This approach can guarantee that PTS equals DTS for the first sample of each fragment without the need for an edit list. This also makes sure that the PTS of the first sample of each track aligns across tracks that are encoded according to different video profiles (with and without B-frames).

When we take the earlier example that used B-frames with positive CTOs and an edit list, but now introduce negative CTOs so that the edit list can be left out, it looks like this:

DTS   0    1    2    3    4    5    6    7    8    9    10   11   12   13
    [IDR][ P ][ B ][ B ][ P ][ B ][ B ][IDR][ P ][ B ][ B ][ P ][ B ][ B ]
CTO   0    2   -1   -1    2   -1   -1    0    2   -1   -1    2   -1   -1
PTS   0    3    1    2    6    4    5    7    10   8    9    13   11   12

In practice, this recommendation means that you should use version 1 'trun' boxes, and that the DTS of the first keyframe in a fragment should be equal to its PTS (i.e., no CTO and no edit list). Any samples that follow should use a CTO where applicable, whether negative or positive.
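
Version 1 'trun' boxes are needed because they store the composition time offset as a signed integer, whereas version 0 only allows unsigned values. For the first fragment of the example above, the relevant fields would look something like this (again a simplified sketch, not actual tool output):

moof.traf.trun (version 1, sample-composition-time-offsets present):
  sample_count = 7
  sample 1: composition_time_offset =  0   # IDR: PTS equals DTS
  sample 2: composition_time_offset =  2
  sample 3: composition_time_offset = -1
  sample 4: composition_time_offset = -1
  sample 5: composition_time_offset =  2
  sample 6: composition_time_offset = -1
  sample 7: composition_time_offset = -1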

DASH, Smooth and HLS all support the use of negative CTOs.

Verifying your configuration

#!/bin/bash

# Use 'input' variable to specify input file
# Input file may have multiple tracks, but should contain video only
# Command below will check whether sync samples (IDR frames) in input track(s) have a CTO.
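# In the parsed MP4Box output, field 2 is the sample number, field 4 the
# DTS, field 6 the CTS (i.e., PTS) and field 8 the sync sample flag
# ('1' for sync samples).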
input=tears-of-steel-avc1-1000k.mp4

awk '{ CTO = $6 - $4 } ; \
  $8 == "1" { print "#"$2": ", $6, "(PTS) -", $4,"(DTS) =", CTO, "(CTO)" } ; \
  CTO != 0 && $8 == "1" { yes++ } ; CTO == 0 && $8 == "1" { no++ } \
  END { print "Found", yes+0, "sync samples with composition time offset, and", no+0, "without" }' \
  <(MP4Box -dtsx -std -quiet ${input})

Note

From input with negative CTOs, Origin (and Packager) will produce HLS TS output where the PTS of some frames is smaller than their DTS. This may result in errors when trying to verify the stream with certain transport stream specific tooling. However, PTS < DTS should not cause any issues for OTT delivered transport streams, as content is not streamed continuously but in self-contained segments. In fact, some of Apple's HLS example streams have PTS < DTS: https://developer.apple.com/streaming/examples/.

However, if you want to avoid PTS < DTS and rely on edit lists instead, it is possible to instruct Packager to do so using --positive_composition_offsets (but note that we do not recommend this).

Edit lists for audio tracks

Edit lists do have a clear use for audio tracks, as these tracks often contain samples for decoder initialization (priming) that must not be rendered. Using an edit list, the PTS of these samples can be shifted such that the PTS of the first sample that should be rendered is equal to the PTS of the first sample in each of the video tracks.