Building a Real-Time Transcoding Pipeline with GStreamer and Swift

Part 1 of 6: Audio/Video Streaming with Swift and GStreamer

When I started building ShowShark, a personal media server that streams to Apple devices, I needed a transcoding engine that could take any media file and produce H.264 or HEVC in real time. AVFoundation can decode media, but it cannot transcode arbitrary formats into a WebSocket-friendly stream. FFmpeg is the standard answer, but its C API is enormous and its licensing is complicated. GStreamer offered something neither could: a modular pipeline architecture where I could wire together demuxers, decoders, and encoders like plumbing, and pull individual encoded frames out of appsink elements for delivery over the network.

This post covers how ShowShark constructs its transcoding pipelines from Swift, including media format detection, codec-specific decoder selection, and the hard-won lessons about which GStreamer elements to trust with your audio.

The Architecture at 30,000 Feet

ShowShark's streaming path looks like this:

                          ShowShark Server (macOS)
  ┌────────────────────────────────────────────────────────────────┐
  │                                                                │
  │   Media File                                                   │
  │       │                                                        │
  │       ▼                                                        │
  │   ┌────────┐    ┌─────────┐    ┌─────────┐                     │
  │   │Demuxer │───▶│ Video   │───▶│ Video   │───▶ appsink         │─── WebSocket ──▶ Client
  │   │        │    │ Decoder │    │ Encoder │    (pull loop)      │   (H.264/HEVC)
  │   │        │    └─────────┘    └─────────┘                     │
  │   │        │                                                   │
  │   │        │    ┌─────────┐    ┌──────────────────────────┐    │
  │   │        │───▶│ Audio   │───▶│ PCM path (default)       │    │
  │   │        │    │ Decoder │    │   audioconvert ! volume  │───▶ appsink ──▶ Client
  │   └────────┘    └─────────┘    │                          │    │  (F32LE)
  │                                │ ── or ──                 │    │
  │                                │                          │    │
  │                                │ AAC path (client request)│    │
  │                                │   audioconvert ! volume  │───▶ appsink ──▶ Client
  │                                │   ! avenc_aac ! aacparse │    │  (ADTS)
  │                                └──────────────────────────┘    │
  │                                                                │
  └────────────────────────────────────────────────────────────────┘

The entire pipeline is a single GStreamer graph constructed as a string, parsed with gst_parse_launch, and set to PLAYING. Two pull loops (one for video, one for audio) run as Swift Tasks, pulling encoded frames and audio samples from their respective appsink elements. Those samples are serialized as Protocol Buffer messages and sent to the client over a WebSocket connection.

The client never sees GStreamer. It receives H.264/HEVC NAL units and PCM or AAC audio, decodes them with VideoToolbox and AVAudioEngine, and synchronizes playback locally. We will cover the pull loops in Part 2 and client-side synchronization in Part 4; this post focuses on how the pipeline itself gets built.

Step 1: Detecting the Source Media

Before constructing a pipeline, we need to know what we are working with. GStreamer provides GstDiscoverer, which probes a file and reports its container format, codecs, dimensions, frame rate, channel layout, and more.

func detectMediaFormat() -> DetectedMediaFormat {
    let uri = URL(fileURLWithPath: filePath).absoluteString

    var error: UnsafeMutablePointer<GError>?
    // The timeout is a GstClockTime in nanoseconds: 10 seconds here.
    guard let discoverer = gst_discoverer_new(10 * 1_000_000_000, &error) else {
        return .fallback
    }
    defer { gst_object_unref(discoverer) }

    guard let info = gst_discoverer_discover_uri(discoverer, uri, &error) else {
        return .fallback
    }
    defer { g_object_unref(UnsafeMutableRawPointer(info)) }

    let container = detectContainerFormat(info)
    // ... extract video codec, audio codec, dimensions, frame rate, HDR flags
}

The returned DetectedMediaFormat struct captures everything the pipeline builder needs:

struct DetectedMediaFormat {
    let container: String        // "matroska", "mp4", "mpeg-ts", "avi", ...
    let videoCodec: String       // "h264", "hevc", "av1", "vp9", "mpeg4", ...
    let audioCodec: String       // "aac", "ac3", "dts", "truehd", "mp3", ...
    let audioChannels: Int       // 2, 6 (5.1), 8 (7.1), ...
    let videoWidth: Int
    let videoHeight: Int
    let videoFrameRate: Double
    let isHDRVideo: Bool
    let hasPackedBFrames: Bool   // Legacy MPEG-4 AVI concern
    let needsPARCorrection: Bool // Non-square pixels
    let prefersSoftwareH264Decoder: Bool
    // ...
}

A few things worth noting about this detection phase:

Pixel aspect ratio correction. Some media (especially DVD rips) stores video at one resolution but is meant to be displayed at another. A 720x480 NTSC DVD frame with a PAR of 32:27 should display at 853x480. We detect non-square pixels and calculate the corrected display width, ensuring the output has square pixels so the client does not need to worry about PAR.
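
As a concrete check on that arithmetic, the corrected width falls out of the stored width and the PAR fraction. A minimal sketch (the helper name is mine, not ShowShark's):

```swift
// Hypothetical helper: compute the square-pixel display width from the
// stored width and the pixel aspect ratio (PAR) numerator/denominator.
func correctedDisplayWidth(storedWidth: Int, parNum: Int, parDen: Int) -> Int {
    Int((Double(storedWidth) * Double(parNum) / Double(parDen)).rounded())
}
// 720 * 32/27 = 853.33..., so a 720x480 NTSC frame displays at 853x480
```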

Packed B-frames. Old MPEG-4 Part 2 content (DivX, Xvid) sometimes packs multiple frames into a single packet. This flag triggers a different compatibility profile downstream.

H.264 stream compatibility checks. Not all H.264 streams work with Apple's hardware decoder. We parse the AVC decoder configuration record from GStreamer's caps to check the NAL length size field; a value of 3 (invalid per the spec, but produced by some encoders) forces a software decoder fallback. We also check whether vtdec_hw can actually sink the discovered caps, catching malformed streams before they crash the hardware decoder.
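
To make the NAL length check concrete: per ISO/IEC 14496-15, the AVC decoder configuration record stores lengthSizeMinusOne in the low two bits of its fifth byte, and only 0, 1, and 3 are allowed. A minimal parser sketch (not ShowShark's actual code):

```swift
// Extract the NAL length size (in bytes) from an avcC record.
// Byte 4's low two bits hold lengthSizeMinusOne; the spec permits
// only the values 0, 1, and 3.
func nalLengthSize(avcC: [UInt8]) -> Int? {
    guard avcC.count > 4 else { return nil }
    return Int(avcC[4] & 0x03) + 1
}
// A result of 3 (lengthSizeMinusOne == 2) is reserved by the spec, but
// some encoders emit it anyway -- that is the case that forces the
// software decoder fallback.
```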

Step 2: Choosing the Right Demuxer

GStreamer's decodebin can automatically select a demuxer and decoder for any format. In theory. In practice, automatic selection causes real problems.

func buildDemuxer(for container: String) -> String {
    switch container {
    case "matroska": return "matroskademux"
    case "mpeg-ps":  return "mpegpsdemux"
    case "mpeg-ts":  return "tsdemux"
    case "mp4":      return "qtdemux"
    case "avi":      return "avidemux"
    default:         return "decodebin"  // Last resort
    }
}

Using explicit demuxers matters for two reasons. First, it gives us named output pads (demux.video_0, demux.audio_1) that let the user select specific tracks. Second, it avoids decodebin's auto-plugging, which can select the wrong decoder for certain audio codecs. More on that in a moment.

MPEG transport streams (tsdemux) and MPEG program streams (mpegpsdemux) are special cases. Their output pads use PID-derived or hex stream-ID names instead of sequential video_0/audio_0 indices. For these containers, we cannot use indexed pad names; instead, we route by media type using capsfilter:

demux. ! queue ! capsfilter caps="video/x-h264" ! ...
demux. ! queue ! capsfilter caps="audio/x-ac3;audio/x-eac3;audio/x-dts" ! ...

Step 3: Video Decoder Selection

The decoder chain is codec-specific. For H.264 and HEVC, we strongly prefer Apple's hardware decoder (vtdec_hw), which operates on the GPU and outputs IOSurface-backed frames:

func buildVideoDecoderPipeline(for codec: String, ...) -> String {
    switch codec.lowercased() {
    case "h264":
        return "h264parse name=video_input_parse "
             + "! queue name=video_postparse_queue "
             + "! vtdec_hw name=video_decoder"

    case "hevc", "h265":
        return "h265parse name=video_input_parse "
             + "! queue name=video_postparse_queue "
             + "! vtdec_hw name=video_decoder"

    case "av1":
        if vtdecHwSupportsCodec("video/x-av1") {
            return "av1parse ! vtdec_hw"    // M3+ Apple Silicon
        } else {
            return "dav1ddec n-threads=4"   // Software fallback
        }

    case "vp9":  return "vp9dec"
    case "mpeg2": return "mpegvideoparse ! avdec_mpeg2video"
    case "mpeg4": return "mpeg4videoparse ! avdec_mpeg4"
    // ...
    }
}

AV1 hardware detection deserves a closer look. Apple added hardware AV1 decode starting with M3 chips, but there is no public API to query this. We use GStreamer's factory capability inspection:

func vtdecHwSupportsCodec(_ capsString: String) -> Bool {
    guard let factory = gst_element_factory_find("vtdec_hw") else { return false }
    defer { gst_object_unref(UnsafeMutableRawPointer(factory)) }
    guard let queryCaps = gst_caps_from_string(capsString) else { return false }
    defer { gst_caps_unref(queryCaps) }
    return gst_element_factory_can_sink_any_caps(factory, queryCaps) != 0
}

On an M3 or later Mac, vtdec_hw's sink pad capabilities include video/x-av1, so this returns true. On M1 and M2, it does not, and we fall back to dav1ddec (the dav1d software decoder). One wrinkle with dav1d: its default thread count creates one thread per logical core, which on an M1 Ultra means 20 threads. That degree of parallelism thrashes CPU caches and starves NWConnection's internal networking threads, dropping WebSocket throughput from ~11 Mbps to ~1.5 Mbps. Limiting dav1d to 4 threads solves this.

Step 4: The HEVC Performance Cliff

This one cost me days. When the source video uses a software decoder (VP9, AV1 on M1/M2, MPEG-4, MPEG-2), the decoded frames live in CPU memory. Encoding those CPU-resident frames to HEVC via vtenc_h265 achieves roughly 12 frames per second. The same frames encoded to H.264 via vtenc_h264 achieve approximately 162 fps.

That is roughly a 13x gap, and it comes from two compounding factors. HEVC encoding simply requires more compute per frame than H.264. On top of that, when frames originate from vtdec_hw, they arrive as IOSurface objects already on the GPU, so the encoder can process them without a CPU-to-GPU copy on every frame. Software-decoded frames pay that upload cost on every single frame.

The fix is simple: if the selected decoder is not vtdec_hw, override the codec preference to H.264.

let usesHardwareDecoder = videoDecoder.contains("vtdec_hw")

if codecPreference == .hevc && !usesHardwareDecoder {
    logger.info(
        "Overriding HEVC -> H.264: decoder outputs CPU memory. "
        + "HEVC encoding from CPU memory is ~13x slower than H.264."
    )
    codecPreference = .h264
}

This check is intentionally coupled to the decoder selection logic. When buildVideoDecoderPipeline learns to use vtdec_hw for a new codec (say, AV1 on M3+), the override automatically stops triggering for that codec.

Step 5: Audio Decoder Chains (and Why decodebin Cannot Be Trusted)

This section exists because of a bug that took an unreasonable amount of time to find.

GStreamer's decodebin can auto-select an audio decoder. For most codecs, it works fine. For DTS audio, it does not. When decodebin auto-plugs a DTS decoder, the output is a constant DC signal at approximately 0.937 amplitude. Not silence; not noise; a flat, unwavering DC level that sounds like a loud hum. The decoded audio data has the right sample rate, the right channel count, and the right buffer timing. Everything looks correct except the actual audio samples, which are garbage.

The fix is explicit decoder chains for every audio codec:

switch codec.lowercased() {
case "aac":
    decoderChain = "aacparse ! avdec_aac"

case "dts", "dts-hd":
    decoderChain = "dcaparse ! avdec_dca"   // NEVER use decodebin for DTS

case "ac3":
    decoderChain = buildDolbyDecoderChain(for: "ac3", preferEac3: false)

case "truehd":
    decoderChain = "avdec_truehd"

case "mp3":
    decoderChain = "mpegaudioparse ! avdec_mp3"

case "opus":
    decoderChain = "opusparse ! opusdec"

case "flac":
    decoderChain = "flacparse ! flacdec"
// ...
}

Dolby audio (AC-3 and E-AC-3) has its own complications. Some Matroska files signal AC-3 in the stream caps but E-AC-3 in the container tags. The two codecs use different decoders; selecting the wrong one causes mid-stream decoder failure when the demuxer renegotiates caps. We handle this by checking both the caps and the tags during detection, and by ranking GStreamer's decoder factories appropriately:

func preferLibavDolbyDecodersIfAvailable(preferEac3: Bool) {
    let preferredEac3Rank: UInt32 = preferEac3 ? 1024 : 896
    let preferredAc3Rank: UInt32  = preferEac3 ? 896 : 1024
    setDecoderRank("avdec_eac3", rank: preferredEac3Rank)
    setDecoderRank("avdec_ac3",  rank: preferredAc3Rank)
    setDecoderRank("a52dec",     rank: 0)  // Demote; causes issues with DD+ hybrid
}
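
The setDecoderRank helper is not shown above; one plausible implementation adjusts the factory's rank in the registry via GStreamer's plugin-feature API (a sketch under that assumption):

```swift
// Sketch: raise or lower an element factory's rank so auto-pluggers
// (and decodebin, where it is still used) prefer or avoid it.
// Assumes GStreamer's C API is imported into Swift.
func setDecoderRank(_ factoryName: String, rank: UInt32) {
    guard let factory = gst_element_factory_find(factoryName) else { return }
    defer { gst_object_unref(UnsafeMutableRawPointer(factory)) }
    // An element factory is a GstPluginFeature; the rank lives there.
    gst_plugin_feature_set_rank(
        UnsafeMutablePointer<GstPluginFeature>(OpaquePointer(factory)),
        rank
    )
}
```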

Step 6: The Audio Processing Chain

Once audio is decoded, it needs processing before it reaches the client. The pipeline segment looks like this:

decoder ! audioconvert ! audiowsinclimit ! audioresample ! volume ! caps ! appsink
            │                  │                             │
            │                  │                             └─ Channel-aware gain
            │                  └─ DC blocker (20 Hz FIR high-pass)
            └─ Downmix multichannel to stereo

Downmix gain compensation. When audioconvert downmixes 5.1 or 7.1 audio to stereo, it normalizes the mix matrix coefficients to prevent clipping. This makes the output significantly quieter than a native stereo track: about -8 dB for 5.1 and -9 dB for 7.1. We apply channel-aware gain to compensate:

static func stereoDownmixGain(sourceChannels: Int) -> Double {
    switch sourceChannels {
    case ...2:  return 1.0   // No downmix needed
    case 3...4: return 1.8   // ~+5 dB
    case 5...6: return 2.5   // ~+8 dB (compensates 5.1 normalization)
    default:    return 3.0   // ~+9.5 dB (compensates 7.1 normalization)
    }
}
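
A quick sanity check that these linear gains match the dB figures in the comments, using gain_dB = 20 * log10(linear):

```swift
import Foundation

// Convert a linear amplitude gain to decibels.
func decibels(_ linear: Double) -> Double {
    20.0 * log10(linear)
}
// decibels(1.8) ~ +5.1 dB, decibels(2.5) ~ +8.0 dB, decibels(3.0) ~ +9.5 dB
```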

DC offset removal. Downmixing introduces a DC offset that varies between buffers, producing audible clicks and pops. This needs a high-pass filter. The critical detail: use a FIR filter, not an IIR filter. GStreamer provides audiocheblimit (a Chebyshev IIR filter); do not use it. IIR filters are recursive, and their feedback path can accumulate numerical errors without bound. In testing, the IIR filter worked for a few minutes and then the output diverged to infinity. The audiowsinclimit element is a windowed-sinc FIR filter; with no feedback path, its output stays bounded whenever its input does.

audiowsinclimit mode=high-pass cutoff=20 length=501

The 20 Hz cutoff sits at the bottom edge of human hearing, so the filter removes DC and infrasonic rumble without touching audible content. The 501-tap length gives a steep transition band without excessive latency.
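
Put together, the audio processing segment comes out roughly like this (illustrative values; the volume and rate depend on the source, and the caps reflect the F32LE stereo output mentioned earlier):

```
audioconvert
  ! audiowsinclimit mode=high-pass cutoff=20 length=501
  ! audioresample
  ! volume volume=2.5
  ! audio/x-raw,format=F32LE,rate=48000,channels=2
  ! appsink name=audiosink emit-signals=false sync=false
```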

Step 7: Video Processing Order

A subtle optimization: videoscale runs before videoconvert in the pipeline. Color format conversion is a per-pixel operation. If you convert color space on a 4K frame and then scale it to 1080p, you have processed 4x more pixels than necessary. Scaling first and converting second means videoconvert operates on the smaller frame.

decoder ! videoscale ! video/x-raw,width=1920,height=1080
        ! videoconvert ! video/x-raw,format=NV12,colorimetry=bt709
        ! vtenc_h264

When no scaling is needed and the decoder is vtdec_hw, we skip both videoscale and videoconvert entirely. Those are software elements that force a GPU-to-CPU-to-GPU round trip on every frame. Omitting them allows decoded IOSurface frames to flow directly into the hardware encoder without a copy.
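
A sketch of that decision (the helper name and signature are mine, not ShowShark's):

```swift
// Build the video processing segment of the pipeline string. Skip the
// software elements entirely when hardware-decoded IOSurface frames can
// flow straight into the encoder.
func buildVideoProcessing(usesHardwareDecoder: Bool,
                          needsScaling: Bool,
                          width: Int, height: Int) -> String {
    if usesHardwareDecoder && !needsScaling {
        return ""  // no videoscale/videoconvert: zero-copy into vtenc_*
    }
    var chain = ""
    if needsScaling {
        // Scale first so videoconvert touches the smaller frame.
        chain += "! videoscale ! video/x-raw,width=\(width),height=\(height) "
    }
    chain += "! videoconvert ! video/x-raw,format=NV12"
    return chain
}
```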

Step 8: Assembling the Full Pipeline

With all the pieces selected, the pipeline is assembled as a string and parsed:

let pipelineString = """
    filesrc location="\(filePath)" ! \(demuxer) name=demux \
    demux.video_\(videoTrackIndex) ! \(videoDemuxQueue) ! \(videoDecoder) \
    ! \(videoPostDecodeQueue) \(videoProcessing) \
    ! \(videoEncoder) ! \(videoPostEncodeQueue) ! \(videoParser) \
    ! \(videoCaps) ! appsink name=videosink emit-signals=false sync=false \
    demux.audio_\(audioTrackIndex) ! \(audioDemuxQueue) ! \(audioPipeline) \
    ! appsink name=audiosink emit-signals=false sync=false
    """

var error: UnsafeMutablePointer<GError>?
guard let pipeline = gst_parse_launch(pipelineString, &error) else {
    // handle error
}

A few things to note about the appsink configuration:

  • sync=false: Both appsinks pull data as fast as the pipeline produces it. We handle real-time pacing ourselves in the pull loops (Part 2 covers this). Using sync=true creates pipeline backpressure that starves the audio branch through the shared demuxer.
  • emit-signals=false: We poll the appsinks with gst_app_sink_try_pull_sample rather than using signal callbacks. This gives us control over the pull cadence.
  • max-buffers=2 (video) vs max-buffers=512 (audio): Video uses tight backpressure to limit memory. Audio needs a deeper buffer because during startup, the H.264 encoder may not emit its first frame until it has received SPS/PPS data, and audio samples arriving in the meantime need somewhere to go.
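
Spelled out with those buffer limits, the two appsink declarations become:

```
appsink name=videosink emit-signals=false sync=false max-buffers=2
appsink name=audiosink emit-signals=false sync=false max-buffers=512
```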

After parsing, we extract named elements from the pipeline for later use:

let videoSink = gst_bin_get_by_name(bin, "videosink")
let audioSink = gst_bin_get_by_name(bin, "audiosink")
let encoder   = gst_bin_get_by_name(bin, "videoencoder")

The encoder reference matters for adaptive bitrate, which we will cover in Part 5.

Step 9: Dynamic Bitrate from Swift

VideoToolbox's H.264 and HEVC encoders support mid-session bitrate changes. GStreamer exposes this via g_object_set, but that function is variadic, and Swift cannot call variadic C functions. The workaround is g_object_set_property with an explicit GValue:

func setEncoderBitrate(_ kbps: Int) {
    guard let encoder = videoEncoderElement else { return }

    // G_TYPE_UINT = G_TYPE_MAKE_FUNDAMENTAL(7) = (7 << 2)
    let gTypeUInt: GType = 7 << 2

    var value = GValue()
    g_value_init(&value, gTypeUInt)
    g_value_set_uint(&value, guint(kbps))
    g_object_set_property(
        UnsafeMutablePointer<GObject>(OpaquePointer(encoder)),
        "bitrate",
        &value
    )
    g_value_unset(&value)
}

The GType calculation (7 << 2) is the macro expansion of G_TYPE_UINT. GLib reserves the low two bits of a type ID, so a fundamental type numbered N has the ID N << G_TYPE_FUNDAMENTAL_SHIFT, where the shift is 2; G_TYPE_MAKE_FUNDAMENTAL(7) therefore expands to 7 << 2 = 28. This is the kind of thing you discover when the compiler tells you it cannot import a C macro.

Step 10: Starting the Pipeline

The pipeline goes through a specific startup sequence:

PAUSED ──(wait for preroll)──▶ PLAYING ──▶ pull loops start

Setting the pipeline to PAUSED first is important. In the paused state, GStreamer negotiates all the element caps, allocates buffers, and performs preroll (each sink receives at least one buffer). This is where you find out if an element is missing or incompatible.

gst_element_set_state(pipeline, GST_STATE_PAUSED)

var currentState: GstState = GST_STATE_VOID_PENDING
let waitResult = gst_element_get_state(pipeline, &currentState, nil, 10 * GST_SECOND_NS)

if waitResult == GST_STATE_CHANGE_FAILURE {
    // Pipeline is broken; an element is missing or caps cannot negotiate
}

if waitResult == GST_STATE_CHANGE_ASYNC
   && currentState.rawValue <= GST_STATE_READY.rawValue {
    // Pipeline stuck in READY; likely a missing plugin
}

Once preroll succeeds, we perform an initial seek (if the user is resuming from a saved position), set the pipeline to PLAYING, and spawn the pull loop tasks. The pull loops are the subject of Part 2.

What We Have So Far

At this point, we have a running GStreamer pipeline tailored to the source media. The demuxer, decoder, processing elements, and encoder have all been selected based on the file's actual format rather than guesswork. The pipeline is producing encoded video frames and processed audio samples, and two appsinks are ready for the pull loops to consume them.

In the next post, we will look at how those pull loops work: how they pace output in real time, how they handle startup buffering, and why the entire thing is structured around a state-struct pattern that keeps 1,000-line async loops manageable.