Audio in Streaming: Downmixing, DC Offset, Gain Compensation, and Codec Pitfalls
Part 3 of 6: Audio/Video Streaming with Swift and GStreamer
Audio is the part of a streaming system that people notice only when it goes wrong. A dropped video frame is invisible; a dropped audio sample is a click. A DC offset is inaudible at first, then becomes a loud pop every time the offset shifts. A badly chosen decoder produces a flat hum that sounds like sixty-cycle interference but is actually the decoder outputting a constant value for every sample.
This post covers the audio side of ShowShark's pipeline: how multichannel audio gets downmixed and compensated on the server, how it crosses the network, and how the client decodes and renders it through AVAudioEngine. Most of these topics were born from bugs.
The Server-Side Audio Chain
In Part 1, we showed the GStreamer audio pipeline segment. Here it is again with annotations:
┌─────────────────────────────────────────────────────────────────────┐
│ GStreamer Audio Pipeline │
│ │
│ Source Audio │
│ (DTS 5.1, AC-3, AAC, etc.) │
│ │ │
│ ▼ │
│ ┌──────────┐ Explicit decoder, NOT decodebin │
│ │ Decoder │ e.g., dcaparse ! avdec_dca │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌──────────────┐ Downmixes 5.1/7.1 to stereo │
│ │ audioconvert │ Normalizes mix matrix → volume loss │
│ └────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌────────────────────────────────────────────┐ │
│ │ audiowsinclimit mode=high-pass cutoff=20 │ FIR DC blocker │
│ │ length=501 │ (NOT IIR) │
│ └────┬───────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────┐ │
│ │ audioresample │ Resamples to 48 kHz │
│ └────┬──────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────┐ Channel-aware gain compensation │
│ │ volume volume=G │ G = f(source channel count) │
│ └────┬────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────┐ F32LE interleaved stereo @ 48 kHz │
│ │ appsink │ (or AAC ADTS for watchOS/HLS) │
│ └──────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Every element in this chain exists because of a specific problem we encountered. Let's walk through them.
Downmix Gain Compensation
When GStreamer's audioconvert downmixes 5.1 surround to stereo, it applies a normalization matrix to prevent clipping. The standard downmix equation for 5.1 is roughly:
L_out = L + 0.707 * C + 0.707 * Ls
R_out = R + 0.707 * C + 0.707 * Rs
The sum of coefficients on each output channel exceeds 1.0, so audioconvert divides by the coefficient sum to keep the output within [-1.0, 1.0]. This prevents clipping but reduces the output level by about 8 dB for 5.1 and 9 dB for 7.1 content. The result: a movie with a 5.1 DTS soundtrack plays noticeably quieter than one with a native stereo track.
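The arithmetic behind that "about 8 dB" figure is worth making explicit. A minimal sketch, using the coefficients from the downmix equation above (actual audioconvert matrices may differ in details such as LFE handling):

```swift
import Foundation

// Coefficient sum for one output channel of the 5.1 -> stereo downmix above:
// the full-weight front channel, plus center and one surround at 0.707 (-3 dB).
let coefficientSum = 1.0 + 0.707 + 0.707        // = 2.414

// audioconvert divides by this sum to stay within [-1.0, 1.0],
// so the normalization gain is its inverse.
let normalizationGain = 1.0 / coefficientSum    // ~0.414

// Express the level loss in decibels: 20 * log10(gain).
let lossDB = 20.0 * log10(normalizationGain)    // ~-7.65 dB
```

That ~-7.65 dB per output channel is the "about 8 dB" quieter playback users hear on 5.1 content; 7.1 adds another surround coefficient to the sum and loses slightly more.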
We compensate with a channel-aware gain stage:
static func stereoDownmixGain(sourceChannels: Int) -> Double {
    switch sourceChannels {
    case ...2:  return 1.0  // No downmix, no boost
    case 3...4: return 1.8  // ~+5 dB
    case 5...6: return 2.5  // ~+8 dB
    default:    return 3.0  // ~+9.5 dB
    }
}
These values were tuned by measuring the RMS level of the same audio track in stereo and 5.1 mixes. The 2.5x gain for 5.1 restores the output to approximately the same perceptual loudness as a native stereo track. It is not perfect for every mix (dialogue-heavy content may differ from action sequences), but it eliminates the most common complaint: "why is this movie so quiet?"
An earlier version used a fixed 1.35x gain for all multichannel content. That was not enough for 7.1 and too much for 3.0 surround variants.
The DC Offset Problem
Downmixing introduces a DC offset. This is not a theoretical concern; it manifests as audible artifacts. The DC level shifts between GStreamer buffers, and each shift produces a click or pop at the buffer boundary.
The root cause is numerical: audioconvert performs floating-point matrix multiplication on each buffer independently. Rounding errors accumulate slightly differently depending on the input values, and the resulting DC component varies from buffer to buffer. The shifts are small (a few millivolts equivalent), but speakers and headphones reproduce them faithfully as transient clicks.
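The DC component of a buffer is simply its mean, which makes it easy to log per buffer during diagnosis. A small illustrative helper (the function name is ours, not from the actual codebase):

```swift
// Hypothetical diagnostic: the DC component of a buffer is its mean.
// Logging this per GStreamer buffer is how a shifting offset like the
// one described above becomes visible.
func dcOffset(of samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    return samples.reduce(0, +) / Float(samples.count)
}

// A symmetric alternating signal has zero mean; add a constant offset
// and the mean reports it back.
let clean: [Float] = [0.5, -0.5, 0.5, -0.5]
let shifted = clean.map { $0 + 0.01 }
// dcOffset(of: clean) == 0.0, dcOffset(of: shifted) ~= 0.01
```

Watching this value jump between consecutive buffers confirms the buffer-boundary clicks are DC shifts rather than dropped samples.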
The fix is a high-pass filter that removes everything below 20 Hz:
audiowsinclimit mode=high-pass cutoff=20 length=501
The choice of filter type matters more than it might seem. GStreamer provides two high-pass filter elements:
| Element | Type | Stability |
|---|---|---|
| audiocheblimit | IIR (Chebyshev) | Feedback loop can diverge |
| audiowsinclimit | FIR (windowed sinc) | Mathematically stable |
We originally used audiocheblimit. It worked for about ten minutes of playback and then the output diverged to infinity. IIR filters use feedback (the output of previous samples feeds back into the filter equation), and numerical errors in that feedback path can grow without bound. The divergence was not gradual; the audio went from fine to unusable in under a second.
audiowsinclimit is a finite impulse response filter. Its output depends only on the current and previous input samples; there is no feedback path. It cannot diverge. The 501-tap length gives a steep enough transition band to remove DC without audibly affecting bass content. The filter adds about 5ms of latency, which is negligible for streaming.
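The latency figure follows directly from the tap count: a symmetric (linear-phase) FIR filter delays the signal by half its length. The arithmetic:

```swift
// Group delay of a symmetric FIR filter is (N - 1) / 2 samples.
let taps = 501
let sampleRate = 48_000.0

let delaySamples = Double(taps - 1) / 2.0          // 250 samples
let delayMs = delaySamples / sampleRate * 1000.0   // ~5.2 ms
```

At 48 kHz, 250 samples is about 5.2 ms, which is where the "about 5ms" figure comes from.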
Client-Side Audio Playback
The server sends audio as either interleaved Float32 PCM (the default) or AAC in ADTS framing (for watchOS/HLS). The client's AudioDecoder handles both.
PCM Playback
AVAudioEngine's AVAudioPlayerNode expects non-interleaved audio (each channel in its own buffer), but the server sends interleaved data (samples alternate between channels: L0, R0, L1, R1, ...). The decoder de-interleaves on the fly:
for frame in 0..<frameCount {
    for channel in 0..<channelCount {
        let sourceIndex = frame * channelCount + channel
        let sample = sourcePointer[sourceIndex] * attenuation
        channelBuffers[channel][frame] = max(-1.0, min(1.0, sample))
    }
}
The attenuation factor comes from a peak guard pass that scans the incoming buffer first:
var bufferPeakAbsSample: Float = 0
for i in 0..<sampleCount {
    let absValue = abs(sourcePointer[i])
    if absValue > bufferPeakAbsSample {
        bufferPeakAbsSample = absValue
    }
}

let attenuation: Float
if bufferPeakAbsSample > 1.0 {
    attenuation = min(1.0, 0.98 / bufferPeakAbsSample)
} else {
    attenuation = 1.0
}
The server's gain compensation can occasionally push samples above 1.0 on particularly loud passages. The peak guard attenuates the entire buffer to keep the loudest sample at 0.98, preserving the relative dynamics while preventing clipping. A final hard clamp to [-1.0, 1.0] catches any remaining edge cases.
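Put together, the two passes reduce to a small pure function. This is a self-contained sketch of the same logic (the function name and array-based signature are ours for illustration; the real decoder works on raw pointers):

```swift
// Sketch of the peak-guard logic described above: scan for the loudest
// absolute sample, attenuate the whole buffer if it exceeds full scale,
// then hard-clamp as a final safety net.
func guardedSamples(_ input: [Float], ceiling: Float = 0.98) -> [Float] {
    // Pass 1: find the buffer's peak absolute sample.
    var peak: Float = 0
    for s in input { peak = max(peak, abs(s)) }

    // Only attenuate when the buffer actually exceeds full scale,
    // so quiet buffers pass through untouched.
    let attenuation: Float = peak > 1.0 ? min(1.0, ceiling / peak) : 1.0

    // Pass 2: apply attenuation, then clamp to [-1.0, 1.0].
    return input.map { max(-1.0, min(1.0, $0 * attenuation)) }
}

// A buffer peaking at 1.2 is scaled so its loudest sample lands at 0.98,
// preserving the relative dynamics of the other samples.
let out = guardedSamples([0.6, -1.2, 0.3])
```

Scaling the whole buffer, rather than clamping only the offending samples, is what preserves the waveform's shape; per-sample clamping would flatten peaks and add distortion.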
AAC Decoding
For AAC content, the server wraps encoded audio in ADTS (Audio Data Transport Stream) framing. The client parses ADTS headers to extract individual AAC frames:
private let adtsSampleRates = [96000, 88200, 64000, 48000, 44100, 32000,
                               24000, 22050, 16000, 12000, 11025, 8000]

func parseADTSFrames(from data: Data) -> [(data: Data, sampleRate: Int, channels: Int)] {
    var frames: [(Data, Int, Int)] = []
    var offset = 0
    while offset + 7 <= data.count {
        // Validate sync word (0xFFF)
        guard data[offset] == 0xFF,
              (data[offset + 1] & 0xF0) == 0xF0 else { break }
        // Extract frame length from ADTS header (13 bits across bytes 3-5)
        let frameLength = (Int(data[offset + 3] & 0x03) << 11)
            | (Int(data[offset + 4]) << 3)
            | (Int(data[offset + 5] >> 5))
        guard frameLength >= 7, offset + frameLength <= data.count else { break }
        // Extract sample rate index and channel configuration
        let samplingIndex = Int((data[offset + 2] >> 2) & 0x0F)
        let channelConfig = Int((data[offset + 2] & 0x01) << 2)
            | Int((data[offset + 3] >> 6) & 0x03)
        guard samplingIndex < adtsSampleRates.count else { break }
        frames.append((Data(data[offset..<offset + frameLength]),
                       adtsSampleRates[samplingIndex],
                       channelConfig))
        offset += frameLength
    }
    return frames
}
Each parsed frame is wrapped in an AVAudioCompressedBuffer with packet descriptions and fed to AVAudioConverter for decoding. The converter outputs PCM buffers that follow the same scheduling path as direct PCM input.
Buffer Scheduling and the Audio Clock
Buffers are scheduled on the AVAudioPlayerNode with specific timestamps computed from PTS values:
let sampleTime = AVAudioFramePosition(pts * sampleRate)
let audioTime = AVAudioTime(sampleTime: sampleTime, atRate: sampleRate)
playerNode.scheduleBuffer(buffer, at: audioTime)
The player node's internal clock then becomes the master clock for A/V synchronization. The currentPlaybackTime property combines the node's base time with its reported player time:
var currentPlaybackTime: Double {
    guard let nodeTime = playerNode.lastRenderTime,
          let playerTime = playerNode.playerTime(forNodeTime: nodeTime) else {
        return 0
    }
    return baseTime + (Double(playerTime.sampleTime) / playerTime.sampleRate)
}
This value is what the sync controller (Part 4) queries to decide whether a video frame should be displayed, held, or dropped.
Underrun Detection
The decoder monitors gaps between consecutive buffers. If the end time of one buffer does not align with the start time of the next (within 1ms tolerance), it logs an underrun warning. Underruns indicate that the server is not sending audio fast enough, either because of network congestion or because the transcoding pipeline has stalled. The adaptive bitrate controller (Part 5) uses dropped frame counts from the sync controller, but audio underruns are logged separately for diagnostics.
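The continuity check itself is a small piece of state. A minimal sketch of the idea, with illustrative names (the real decoder presumably also logs the gap size and context):

```swift
// Sketch of the underrun check: compare each buffer's start PTS against
// the predicted end of the previous buffer, within a 1 ms tolerance.
struct UnderrunDetector {
    private var expectedNextPTS: Double?
    let toleranceSeconds = 0.001  // 1 ms

    // Returns true if a gap was detected before this buffer.
    mutating func observe(startPTS: Double, duration: Double) -> Bool {
        defer { expectedNextPTS = startPTS + duration }
        guard let expected = expectedNextPTS else { return false }
        return abs(startPTS - expected) > toleranceSeconds
    }
}

var detector = UnderrunDetector()
let first = detector.observe(startPTS: 0.0, duration: 0.2)    // no history yet
let second = detector.observe(startPTS: 0.2, duration: 0.2)   // contiguous
let third = detector.observe(startPTS: 0.45, duration: 0.2)   // 50 ms gap
```

Using `abs` catches both gaps (server starving the client) and overlaps (timestamps running backwards after a glitch), either of which indicates a problem upstream.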
The DTS Decoder Disaster
This story was mentioned in Part 1, but it deserves a fuller treatment because the debugging process illustrates a category of problem unique to media pipelines.
DTS (Digital Theater Systems) audio is common in Blu-ray and DVD rips. GStreamer supports it through several decoder elements: avdec_dca (libav), dtsdec (libdca), and a52dec (for AC-3, which DTS sometimes wraps). When decodebin is given a DTS stream, it auto-selects one of these decoders based on plugin rankings.
The decoder it selected produced output with the correct sample rate (48 kHz), the correct channel count (2, after downmix), and the correct buffer durations. gst_pad_get_current_caps() reported valid audio caps. The pipeline ran without errors. But every sample in every buffer was approximately 0.937. Not silence (0.0), not noise, not the original audio. A constant DC value.
Diagnosing this took longer than it should have because every standard check passed. The caps were correct. The buffer sizes matched the expected sample count. The pipeline reported no errors or warnings. We added logging at every stage of the audio chain and confirmed that the DC signal was present at the very first buffer out of the decoder, before audioconvert or any other processing touched it.
The fix was to bypass decodebin for DTS entirely and use an explicit decoder chain:
dcaparse ! avdec_dca
The dcaparse element handles DTS bitstream parsing (frame alignment, sync word detection), and avdec_dca is libav's DTS decoder, which has been reliable across every DTS variant we have tested: DTS, DTS-HD Master Audio, and DTS-HD High Resolution.
After this experience, we moved every audio codec to an explicit decoder chain. The full table:
AAC → aacparse ! avdec_aac
DTS/DTS-HD → dcaparse ! avdec_dca
AC-3 → ac3parse ! avdec_ac3 (with rank management)
E-AC-3 → ac3parse ! avdec_eac3 (with rank management)
TrueHD → avdec_truehd
MP3 → mpegaudioparse ! avdec_mp3
MP2 → mpegaudioparse ! avdec_mp2float
Opus → opusparse ! opusdec
Vorbis → vorbisparse ! vorbisdec
FLAC → flacparse ! flacdec
decodebin is only used as a last resort for unknown codecs. The explicit chains add a few lines of code but eliminate an entire class of silent failures.
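The selection logic reduces to a lookup over the table above. A minimal sketch, assuming a normalized codec string as input (the function name and codec keys are ours; the real code also handles rank management for the Dolby cases, as discussed below):

```swift
// Codec-to-pipeline lookup implied by the table above. Returns nil for
// unknown codecs, signalling a fallback to decodebin.
func explicitDecoderChain(for codec: String) -> String? {
    switch codec {
    case "aac":          return "aacparse ! avdec_aac"
    case "dts", "dtshd": return "dcaparse ! avdec_dca"
    case "ac3":          return "ac3parse ! avdec_ac3"
    case "eac3":         return "ac3parse ! avdec_eac3"
    case "truehd":       return "avdec_truehd"
    case "mp3":          return "mpegaudioparse ! avdec_mp3"
    case "mp2":          return "mpegaudioparse ! avdec_mp2float"
    case "opus":         return "opusparse ! opusdec"
    case "vorbis":       return "vorbisparse ! vorbisdec"
    case "flac":         return "flacparse ! flacdec"
    default:             return nil  // unknown: fall back to decodebin
    }
}
```

Keeping the mapping in one function makes the decodebin fallback an explicit, greppable decision rather than something that happens silently inside plugin ranking.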
Dolby Audio Complications
AC-3 (Dolby Digital) and E-AC-3 (Dolby Digital Plus) share enough of their bitstream format that GStreamer's cap detection sometimes gets confused. Two specific problems:
Problem 1: Caps/tag mismatch. Some Matroska files signal AC-3 in the stream caps but report E-AC-3 in the container's codec tag. If we select an AC-3-only decoder, it fails when the demuxer renegotiates caps to E-AC-3 mid-stream.
Problem 2: IEC 61937 alignment. Some Dolby streams in Matroska are exposed with IEC 61937 alignment in their AC-3 caps. This alignment is used for S/PDIF passthrough and signals that the bitstream may contain E-AC-3 content wrapped in an AC-3 compatible framing.
We handle both cases during detection:
// Problem 1: Check both caps and tags
if container == "matroska" && audioCodec == "ac3" {
if tagString.contains("e-ac-3") || tagString.contains("eac3") {
audioCodec = "eac3" // Override
}
}
// Problem 2: IEC 61937 signals possible E-AC-3
if mediaType == "audio/x-ac3"
&& capsString.contains("alignment=iec61937") {
return "eac3" // Treat as E-AC-3
}
We also manage GStreamer's decoder rankings so that decodebin (when it must be used for Dolby) prefers the libav decoders over a52dec, which has caused intermittent decode-block errors on DD+ hybrid streams.
Client-Side DC Blocking
The server's FIR filter handles DC from downmixing, but the client adds its own protection. AVAudioEngine's AVAudioUnitEQ is configured as a 20 Hz high-pass filter on the audio player node:
let eq = AVAudioUnitEQ(numberOfBands: 1)
eq.bands[0].filterType = .highPass
eq.bands[0].frequency = 20.0
eq.bands[0].bandwidth = 1.0
eq.bands[0].bypass = false
engine.attach(eq)
engine.connect(playerNode, to: eq, format: format)
engine.connect(eq, to: engine.mainMixerNode, format: format)
This is a belt-and-suspenders measure. If the server's filter is ever bypassed (audio-only sessions use a different pipeline path), the client's EQ catches any residual DC. It also handles the edge case where network-induced buffer gaps create discontinuities in the audio stream that the server's filter would not have seen.
Audio Packet Coalescing
In Part 2, we described audio packet coalescing: batching small GStreamer buffers into 200ms bundles to reduce packet overhead. Here are the numbers:
Without coalescing:
Buffer duration: ~23ms (1,024 frames @ 44.1 kHz)
Packets/second: ~43
Payload per send: ~8 KB
With coalescing (200ms target):
Bundle duration: ~200ms (8-9 buffers merged)
Packets/second: ~5
Payload per send: ~70 KB
The reduction from 43 to 5 packets per second eliminates significant overhead in protobuf serialization, WebSocket framing, and TCP segment processing. The coalescing logic maintains PTS continuity by flushing immediately on any PTS discontinuity greater than 3ms:
if let pendingEndPTS = state.pendingAudioEndPTS {
    let gap = abs(sampleStartPTS - pendingEndPTS)
    if gap > 0.003 {  // 3 ms threshold
        flushPendingAudioSample(state: &state, reason: "pts_discontinuity")
    }
}
This prevents merging audio from different timelines (which would happen after a seek or stream reset) into a single packet.
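The two flush rules (200 ms target reached, or PTS discontinuity detected) can be sketched as one small accumulator. Types and names here are illustrative, not the actual server code:

```swift
// Minimal coalescer sketch: accumulate buffers toward the 200 ms target,
// but flush early on a PTS discontinuity greater than 3 ms.
struct AudioChunk { let pts: Double; let duration: Double }

struct Coalescer {
    private var pending: [AudioChunk] = []
    private var pendingEndPTS: Double?
    let targetDuration = 0.2   // 200 ms bundle target
    let gapThreshold = 0.003   // 3 ms discontinuity threshold

    // Returns a flushed bundle when one is ready, else nil.
    mutating func add(_ chunk: AudioChunk) -> [AudioChunk]? {
        var flushed: [AudioChunk]? = nil
        // Discontinuity: flush what we have before starting a new run.
        if let end = pendingEndPTS, abs(chunk.pts - end) > gapThreshold {
            flushed = pending
            pending = []
        }
        pending.append(chunk)
        pendingEndPTS = chunk.pts + chunk.duration
        // Target reached: flush the accumulated bundle.
        let total = pending.reduce(0) { $0 + $1.duration }
        if total >= targetDuration {
            let bundle = pending
            pending = []
            pendingEndPTS = nil
            return flushed ?? bundle
        }
        return flushed
    }
}
```

Feeding it ~23 ms buffers, a bundle flushes after 9 chunks (207 ms), which matches the 8-9 buffers per bundle and ~5 packets per second quoted above.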
What Comes Next
We now have a server that produces properly decoded, gain-compensated, DC-filtered audio alongside encoded video, and a client that decodes both and schedules audio on AVAudioEngine with precise timestamps. The remaining question: how does the client decide when to display each video frame relative to audio playback? That is the synchronization problem, and it gets considerably more interesting when the audio and video timestamps come from different PTS domains. Part 4 covers it.