Synchronizing Audio and Video Across PTS Domains
Part 4 of 6: Audio/Video Streaming with Swift and GStreamer
In the first three parts of this series, we built a transcoding pipeline, paced its output at real time, and solved the audio processing chain. This post tackles the hardest problem in the system: making audio and video play in sync, even when their timestamps come from different universes.
The Easy Case
For most media files (MKV, MP4, AVI), audio and video timestamps live in the same domain. A video frame with PTS 10.0 seconds and an audio buffer with PTS 10.0 seconds represent the same moment in the content. The server normalizes both to start from zero:
Server normalizes:
video PTS 145.3 → emitted as 0.0 (first frame)
video PTS 145.34 → emitted as 0.04
audio PTS 145.31 → emitted as 0.01
Client reconstructs:
currentPosition = playbackBasePosition + displayedFramePTS
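As a sketch of that server-side normalization (the type and names here are illustrative, not from the actual server): the first PTS seen anchors the timeline, and everything after it is emitted relative to that anchor.

```swift
// Hypothetical sketch of zero-based PTS normalization; PTSNormalizer is
// an illustrative name, not a type from the actual server.
struct PTSNormalizer {
    private var basePTS: Double?

    // Returns the PTS relative to the first timestamp seen.
    mutating func normalize(_ rawPTS: Double) -> Double {
        if basePTS == nil { basePTS = rawPTS } // first buffer anchors the timeline
        return rawPTS - basePTS!
    }
}
```

With the numbers above, `normalize(145.3)` yields 0.0 and `normalize(145.34)` yields approximately 0.04.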
The client uses audio as the master clock. A SyncController compares each video frame's PTS against the audio decoder's current playback time and decides what to do:
func decideSyncAction(for framePTS: Double) -> SyncAction {
    let audioTime = audioDecoder.currentPlaybackTime
    let drift = framePTS - audioTime

    if abs(drift) <= 0.050 {      // Within 50ms
        return .display
    } else if drift > 0.050 {     // Video ahead of audio
        return .hold
    } else if drift < -0.100 {    // Video > 100ms behind audio
        return .drop
    } else {
        return .display           // Slightly late (-100ms to -50ms); display anyway
    }
}
Drift (video PTS - audio time)
◀──────────────────────────────────────────────▶
drop display hold
◀──────┤─────────────┼─────────────┤──────────▶
< -100ms -50ms to +50ms > +50ms
▲
│
slightly late zone
-100ms to -50ms
(display to recover)
The "slightly late" zone between -50ms and -100ms displays the frame rather than dropping it. Brief stalls (a garbage collection pause, a slow decode) can push video behind audio temporarily; dropping frames during recovery would make the stutter worse. Only frames more than 100ms behind are considered unrecoverable.
The display link callback (driven by the screen's refresh rate) queries this controller for every frame:
func displayLinkCallback() {
    guard let firstFrame = videoFrameQueue.first else { return }
    let action = syncController.decideSyncAction(for: firstFrame.pts)

    switch action {
    case .display:
        let frame = videoFrameQueue.removeFirst()
        displayFrame(frame.pixelBuffer, at: frame.pts)
        currentPosition = playbackBasePosition + frame.pts
    case .hold:
        break // Leave frame in queue; try again next vsync
    case .drop:
        videoFrameQueue.removeFirst()
        // May cascade: drop additional late frames in the same callback
    }
}
For standard media files, this works well. But Blu-ray discs break the assumption that audio and video share a PTS domain.
The BDMV Problem
Blu-ray discs use MPEG-2 Transport Streams (M2TS). When GStreamer's tsdemux processes a Blu-ray transport stream, it places video and audio into different PTS epochs. In our testing, video PTS values start around 3,600,000 seconds (the raw 90 kHz PTS counter is near its wraparound point), while audio PTS values start near zero.
tsdemux output PTS domains (Blu-ray):
Video PTS: 3,600,001.234 3,600,001.276 3,600,001.318 ...
Audio PTS: 0.012 0.036 0.060 ...
▲ ▲
└──── Same content moment ─────────┘
but ~3.6 million seconds apart
Naive normalization (subtract a fixed offset) does not work because the epochs are unrelated to each other. You cannot assume that video PTS 3,600,001.234 minus audio PTS 0.012 is a meaningful value.
The First Attempt: Epoch Stripping
Our initial approach was to detect the ~3,600,000-second epoch in video PTS values and subtract it:
// DON'T DO THIS (or at least, not as your primary path)
func stripEpoch(_ presentationTimestamp: Double) -> Double {
    guard presentationTimestamp > 100_000 else { return presentationTimestamp }
    let epochSize: Double = 3_600_000.0
    let epoch = floor(presentationTimestamp / epochSize) * epochSize
    return presentationTimestamp - epoch
}
This produced results that looked approximately correct. Video at PTS 3,600,001.234 became 1.234. Audio at PTS 0.012 stayed at 0.012. Close enough?
No. The error was about 1 to 2 seconds, varying by disc. The problem is that the epoch boundary does not align with the segment start. tsdemux creates GStreamer segments with start and time fields that define how raw PTS values map to content time, and those segment boundaries depend on the disc's program structure and where the seek landed. Subtracting a fixed epoch ignores this segment metadata entirely.
The Solution: Segment Stream Time
GStreamer provides a function that converts a raw PTS into a domain-independent content time: gst_segment_to_stream_time. This function uses the segment's start, stop, and time fields to compute a position relative to the segment origin. Two streams from the same demuxer will produce stream time values where 0.0 represents the same content moment, regardless of their raw PTS domains.
┌─────────────────────────────────────────────────┐
│ gst_segment_to_stream_time │
│ │
│ Video segment: │
│ start = 3,600,000.000 │
│ raw PTS = 3,600,001.234 │
│ stream_time = 1.234 │
│ │
│ Audio segment: │
│ start = 0.000 │
│ raw PTS = 1.246 │
│ stream_time = 1.246 │
│ │
│ Both reference the same content position │
│ (within 12ms of each other in this example) │
└─────────────────────────────────────────────────┘
The video pull loop captures this at the first emitted frame:
if let segmentPtr = gst_sample_get_segment(sample) {
    let rawPTS = buffer.pointee.pts
    if rawPTS != UInt64.max {
        let streamTime = gst_segment_to_stream_time(
            segmentPtr, GST_FORMAT_TIME, rawPTS
        )
        if streamTime != UInt64.max {
            startupVideoSegmentStreamTimeSecs =
                Double(streamTime) / Double(GST_SECOND_NS)
        }
    }
}
The audio pull loop then uses this video stream time to compute the correct target PTS for its alignment:
if let videoStreamTime = startupVideoSegmentStreamTimeSecs {
    if let segPtr = gst_sample_get_segment(audioSample) {
        let seg = segPtr.pointee
        let audioSegStart = Double(seg.start) / Double(GST_SECOND_NS)
        let audioSegTime = Double(seg.time) / Double(GST_SECOND_NS)
        // The audio PTS that corresponds to the video's stream time
        let targetAudioPTS = audioSegStart + videoStreamTime - audioSegTime
    }
}
The formula PTS = seg.start + stream_time - seg.time is the inverse of gst_segment_to_stream_time. It answers: "what raw audio PTS corresponds to this content moment?" Audio samples with PTS earlier than this target are dropped during startup alignment; the first retained sample lands at the same content position as the first video frame.
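The forward and inverse mappings can be sketched with the example numbers from the diagram above. This is a simplified model: the real gst_segment_to_stream_time also accounts for the segment's rate, applied_rate, and clipping, which this sketch ignores.

```swift
// Simplified sketch of the segment-to-stream-time mapping and its inverse.
// Ignores rate/applied_rate/clipping, which real GstSegment math handles.
func streamTime(rawPTS: Double, segStart: Double, segTime: Double) -> Double {
    rawPTS - segStart + segTime            // gst_segment_to_stream_time, simplified
}

func targetPTS(streamTime: Double, segStart: Double, segTime: Double) -> Double {
    segStart + streamTime - segTime        // inverse: content moment back to raw PTS
}
```

With the video segment above (start = 3,600,000.0, time = 0.0), a raw PTS of 3,600,001.234 maps to a stream time of 1.234; feeding that stream time through the inverse with the audio segment (start = 0.0, time = 0.0) gives the target audio PTS of 1.234.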
This approach eliminates the 1-2 second error of epoch stripping because it uses the actual segment metadata rather than assuming a fixed epoch boundary.
M2TS Pre-Scan: Finding the IDR
Before the server even starts the GStreamer pipeline for Blu-ray content, it performs a pre-scan of the raw M2TS byte stream. The scan reads the first ~30 MB of transport stream data and parses PES (Packetized Elementary Stream) headers to find two things:
- The first video IDR keyframe PTS (in 90 kHz ticks)
- The first audio PES PTS (in 90 kHz ticks)
M2TS Pre-Scan
┌──────────────────────────────────────────────────────────┐
│ │
│ 192-byte M2TS packets │
│ ┌──────┬───────────────────────────────────────┐ │
│ │ 4B │ 188-byte TS packet (sync: 0x47) │ │
│ │ BD │ ┌─────┬──────────────────────────────┐│ │
│ │ hdr │ │ PID │ payload (PES or data) ││ │
│ │ │ └─────┴──────────────────────────────┘│ │
│ └──────┴───────────────────────────────────────┘ │
│ │
│ For each video PID PES packet: │
│ - Accumulate payload across TS packets (max 4 KB) │
│ - Scan for NAL start codes │
│ - H.264: look for NAL type 5 (IDR) │
│ - HEVC: look for NAL type 19-20 (IDR_W_RADL/IDR_N_LP) │
│ - Extract 33-bit PTS from PES header (90 kHz clock) │
│ │
│ Result: │
│ firstVideoPTS = 324,027,891 ticks │
│ firstVideoIDRPTS = 324,031,491 ticks │
│ firstAudioPTS = 5,400 ticks │
│ preIDRGap = (31491 - 27891) / 90000 = 0.040s │
│ │
└──────────────────────────────────────────────────────────┘
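The gap arithmetic in the result box is plain tick-to-second conversion on the 90 kHz clock:

```swift
// Gap computation from the pre-scan result above (tick values from the diagram).
let firstVideoPTS: UInt64 = 324_027_891     // 90 kHz ticks
let firstVideoIDRPTS: UInt64 = 324_031_491  // 90 kHz ticks
let preIDRGapSeconds = Double(firstVideoIDRPTS - firstVideoPTS) / 90_000.0
// 3,600 ticks / 90,000 ticks-per-second = 0.040 seconds of pre-IDR frames
```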
The pre-IDR gap tells the video pull loop how many seconds of corrupt frames to drop after a seek. When tsdemux seeks into the middle of a transport stream, it starts demuxing at the nearest random access point. The H.264 decoder begins producing output immediately, but the frames between the first PES and the first IDR are decoded without valid reference frames. They appear as green blocks or garbled macroblocks. The gap calculation tells the video startup gate (from Part 2) exactly when to start accepting frames:
// In the video pull loop
// In the video pull loop
if let preIDRGap = bdmvPreIDRVideoGapSeconds,
   let firstRawPTS = bdmvFirstVideoRawPTS {
    if presentationTimestamp < firstRawPTS + preIDRGap {
        return true // Drop this frame; it is pre-IDR garbage
    }
}
The IDR detection itself requires accumulating payload across transport stream packet boundaries. An IDR NAL unit rarely starts at the beginning of a PES; SPS, PPS, and SEI data typically precede it and can consume the first one or two TS packets. The pre-scanner accumulates up to 4 KB of video PES payload to handle Blu-ray discs where closed caption SEI payloads are hundreds of bytes.
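The 33-bit PTS extraction mentioned in the scan follows the PES header layout from ISO/IEC 13818-1, where five bytes interleave marker bits with the timestamp. A sketch of the decode (the function name is illustrative):

```swift
// Decode the 33-bit PTS from the five PTS bytes of a PES header.
// ISO/IEC 13818-1 layout: '0010' pts[32:30] '1' pts[29:15] '1' pts[14:0] '1'
func decodePESPTS(_ b: [UInt8]) -> UInt64? {
    guard b.count >= 5 else { return nil }
    let high = UInt64(b[0] & 0x0E) << 29                       // pts[32:30]
    let mid  = UInt64(b[1]) << 22 | UInt64(b[2] & 0xFE) << 14  // pts[29:15]
    let low  = UInt64(b[3]) << 7  | UInt64(b[4]) >> 1          // pts[14:0]
    return high | mid | low
}
```

For example, the bytes 0x21 0x00 0x01 0x2A 0x31 decode to 5,400 ticks, the first audio PTS from the diagram above.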
PAT Gate
A subtle detail: the pre-scanner skips all PES packets that arrive before the first PAT (Program Association Table, PID 0x0000). GStreamer's tsdemux cannot demux the stream until it has seen a PAT, because the PAT tells it which PIDs correspond to which programs. PES packets before the PAT are real data, but tsdemux will not output them. If the pre-scanner counted those PES packets, it would overestimate the audio/video gap.
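A minimal sketch of that gate, assuming 188-byte TS packets with the PID in the low 13 bits of header bytes 1-2 (`patSeen`, `tsPID`, and `shouldCountPacket` are illustrative names, not from the actual scanner):

```swift
// Sketch of the PAT gate: ignore packets until PID 0x0000 has been seen.
var patSeen = false

func tsPID(_ packet: [UInt8]) -> Int? {
    guard packet.count >= 4, packet[0] == 0x47 else { return nil } // TS sync byte
    return Int(packet[1] & 0x1F) << 8 | Int(packet[2])             // 13-bit PID
}

func shouldCountPacket(_ packet: [UInt8]) -> Bool {
    guard let pid = tsPID(packet) else { return false }
    if pid == 0x0000 { patSeen = true } // PAT: tsdemux can demux from here on
    return patSeen
}
```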
The Four Alignment Strategies
The audio pull loop uses the video's startup anchor to determine where to begin emitting audio. There are four strategies, tried in order:
func deriveStartupAudioAlignmentTargetPTS(...) -> (Double?, Strategy) {
    // Strategy 1: Observed audio origin + video offset
    // Most reliable for BDMV. Uses the earliest audio PTS seen during
    // the gating phase plus the video's relative offset from seek origin.
    if let videoOffset = videoRelativeOffset,
       let observedOrigin = observedAudioOriginPTS {
        let delta = abs(observedOrigin - timelineOrigin)
        if delta <= 0.250 {
            return (observedOrigin + videoOffset, .observedOriginPlusVideoOffset)
        }
    }

    // Strategy 2: Absolute video anchor
    // Fallback for file-based pipelines where audio and video share a domain.
    if let absoluteAnchor = videoAbsoluteAnchorPTS {
        return (absoluteAnchor, .absoluteVideoAnchor)
    }

    // Strategy 3: Requested start position + video offset
    // When no audio has been observed yet during gating.
    if let videoOffset = videoRelativeOffset {
        return (timelineOrigin + videoOffset, .requestedStartPlusVideoOffset)
    }

    // Strategy 4: Unavailable
    return (nil, .unavailable)
}
Strategy 1 is the primary path for Blu-ray content. The observedAudioOriginPTS is tracked during the audio gate phase (Part 2): even while audio samples are being held or dropped, the gate records the earliest raw PTS it sees. This is critical because if the gate drops 200ms of audio, the first retained sample's PTS would be 200ms later than the true origin, and the alignment would be off by that amount.
The alignment decision for each audio sample is a simple comparison:
func shouldDropStartupAudioSample(
    sampleStartPTS: Double,
    sampleDuration: Double,
    targetPTS: Double
) -> Bool {
    let sampleEndPTS = sampleStartPTS + sampleDuration
    return sampleEndPTS < targetPTS
}
A sample is dropped only if its entire duration ends before the target. This allows one buffer's worth of overlap (~32ms), which is preferable to dropping a sample whose second half contains audio that should be heard.
Client-Side Synchronization
On the client, RawFramePlaybackManager receives video frames and audio samples from the server over a WebSocket connection. Video frames go through a VideoDecoder (VideoToolbox VTDecompressionSession) and land in a queue of decoded pixel buffers. Audio samples go to the AudioDecoder (AVAudioEngine) and are scheduled on a player node.
The display link fires at the screen's refresh rate (60 or 120 Hz) and pulls frames from the queue:
Display Link Callback (every ~16ms or ~8ms)
┌──────────────────────────────────────────────────────┐
│ │
│ 1. Read audioDecoder.currentPlaybackTime │
│ │
│ 2. Peek at first frame in videoFrameQueue │
│ │
│ 3. syncController.decideSyncAction(framePTS) │
│ ┌─────────────────────────────────────────┐ │
│ │ drift = framePTS - audioTime │ │
│ │ │ │
│ │ |drift| <= 50ms → DISPLAY │ │
│ │ drift > 50ms → HOLD │ │
│ │ drift < -100ms → DROP │ │
│ │ -100ms to -50ms → DISPLAY (recover) │ │
│ └─────────────────────────────────────────┘ │
│ │
│ 4. Display → enqueue pixel buffer on │
│ AVSampleBufferDisplayLayer │
│ Hold → do nothing (try again next vsync) │
│ Drop → remove frame, check next frame too │
│ │
└──────────────────────────────────────────────────────┘
The AVSampleBufferDisplayLayer renders the pixel buffer at the next compositing opportunity. Because we display at most one frame per display link callback, the maximum display rate matches the screen refresh rate, which is at least as high as the content frame rate (24, 30, or 60 fps).
Position Tracking
The client maintains a currentPosition that represents the absolute position in the media timeline:
// When displaying a video frame:
currentPosition = playbackBasePosition + displayedFramePTS
// For audio-only content (no display link):
currentPosition = playbackBasePosition + audioDecoder.currentPlaybackTime
playbackBasePosition is the offset passed to the server when starting playback. If the user seeks to 1:30:00 and the server starts streaming from that point, video frames arrive with PTS starting near 0.0, and playbackBasePosition is 5400.0. The displayed position (which drives the scrubber and subtitle timing) is 5400.0 + framePTS.
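Worked through with the numbers from the text (the frame PTS here is a hypothetical value):

```swift
// User seeks to 1:30:00; server restarts the stream with PTS near zero.
let playbackBasePosition = 90.0 * 60.0  // 5400.0 seconds
let displayedFramePTS = 2.0             // hypothetical PTS of the displayed frame
let currentPosition = playbackBasePosition + displayedFramePTS
// Scrubber and subtitle timing both read 5402.0
```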
The Thread-Safe Ingress Mailbox
Video frames and audio samples arrive on the WebSocket receive thread, which is not the main actor. The client cannot process them on the main thread directly because that would block the WebSocket connection. Instead, incoming data goes into a StreamIngressMailbox:
WebSocket Thread Main Actor
──────────────── ──────────
didReceiveVideoFrame(data)
│
├─── sessionID check ──▶ reject if stale
│
▼
ingressMailbox.enqueue(.video(data))
│
▼
processIngressEvent()
│
├── decode video frame
├── update buffer tracking
└── yield to AsyncStream
The mailbox uses NSLock for thread safety and an AsyncStream for the main actor to drain. This design avoids two problems: it does not block the WebSocket thread on main actor availability, and it does not require the main actor to poll. The AsyncStream wakes the processing task only when data arrives.
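A sketch of that shape, assuming the StreamIngressMailbox name from the text (the wakeup-stream design and member names are illustrative, not the actual implementation): the lock protects the queue, and an AsyncStream of wakeups lets the main actor's task sleep until data arrives.

```swift
import Foundation

enum IngressEvent {
    case video(Data)
    case audio(Data)
}

// Illustrative sketch of a lock-protected mailbox drained via AsyncStream.
final class StreamIngressMailbox {
    private let lock = NSLock()
    private var queue: [IngressEvent] = []
    private var continuation: AsyncStream<Void>.Continuation?
    let wakeups: AsyncStream<Void>

    init() {
        var cont: AsyncStream<Void>.Continuation?
        wakeups = AsyncStream<Void> { cont = $0 }
        continuation = cont
    }

    // Called on the WebSocket thread; never blocks on the main actor.
    func enqueue(_ event: IngressEvent) {
        lock.lock()
        queue.append(event)
        lock.unlock()
        continuation?.yield() // wake the draining task only when data arrives
    }

    // Called from the main actor's processing task after each wakeup.
    func drain() -> [IngressEvent] {
        lock.lock()
        defer { lock.unlock() }
        let events = queue
        queue.removeAll()
        return events
    }
}
```

The draining task would `for await _ in mailbox.wakeups { … }` and call `drain()` on each wakeup, so the main actor neither polls nor blocks the producer.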
Session ID filtering happens at the earliest possible point. When a seek or pause tears down the current session (Part 6), the first thing it does is nil out currentSessionID. Any packets still in flight from the old session are rejected immediately by the shouldAcceptIngressPacket check, before they touch the mailbox.
What Comes Next
Audio and video are now synchronized, even across Blu-ray's split PTS domains. The server paces output at 1x, the client buffers and renders with audio as master clock, and the sync controller keeps video within 50ms of audio.
But what happens when the network cannot sustain the current bitrate? In a conventional streaming system, you would measure client-side throughput and adjust. In a 1x-paced system, that measurement is meaningless. Part 5 explains why, and what we use instead.