The Full Teardown Pattern: Simplifying Pause, Seek, Resume, and End-of-Stream - Part 6 of 6
Part 6 of 6: Audio/Video Streaming with Swift and GStreamer
Over the course of this series, we have built a transcoding pipeline (Part 1), paced its output with pull loops (Part 2), solved the audio processing chain (Part 3), synchronized audio and video across PTS domains (Part 4), and adapted bitrate to network conditions (Part 5). This final post covers the architectural decision that simplified all of it: the full teardown pattern for pause, seek, and resume.
The Problem with Keeping Pipelines Alive
The conventional approach to pause and seek in a media pipeline is to keep it running. Pause means setting the pipeline to GST_STATE_PAUSED. Seek means calling gst_element_seek_simple and waiting for the pipeline to flush and restart at the new position. Resume means setting the pipeline back to GST_STATE_PLAYING. This preserves decoder state, encoder state, and internal buffers.
It also creates an enormous surface area for bugs.
ShowShark's first implementation followed this conventional approach. Over several months, we accumulated a list of issues:
- Pausing and resuming sometimes produced a 0.5-second audio pop
- Seeking near the end of a file occasionally left the pipeline in a wedged state
- Seeking backward in certain HEVC streams caused persistent A/V desynchronization that did not recover
- Resuming after a long pause (minutes) sometimes caused the encoder to emit frames with incorrect timing
- Seeking during the startup buffer phase could race with the pull loops, causing duplicate frames
- gst_element_set_state(GST_STATE_NULL) occasionally blocked indefinitely due to VideoToolbox encoder cleanup
- The pause/resume code path and the seek code path were separate, with different state management, different cleanup logic, and different race conditions
- The reconnection-after-network-loss code path was a third variant
Each of these bugs required its own fix, its own state flag, its own edge case handling. The pause/resume and seek implementation grew to approximately 1,700 lines of code, much of it devoted to managing transient states between "paused" and "playing" or between "seeking" and "done seeking."
The Insight
The fundamental insight was this: starting a fresh playback session from a given position is a solved problem. We do it every time the user presses play. If pause-then-resume is just "stop, then play from the saved position," and seek is just "stop, then play from the new position," then every playback lifecycle event collapses into a single code path.
Before (three code paths):
┌─────────────────────┐ ┌──────────────────┐ ┌───────────────────┐
│ Pause/Resume │ │ Seek │ │ Reconnect │
│ - GST_STATE_PAUSED │ │ - gst_seek() │ │ - detect loss │
│ - preserve decoder │ │ - flush pipeline │ │ - save position │
│ - preserve encoder │ │ - wait for flush │ │ - reconnect │
│ - handle audio pop │ │ - re-sync A/V │ │ - restart session │
│ - handle wedge │ │ - handle race │ │ - resume from pos │
│ ~600 lines │ │ ~550 lines │ │ ~500 lines │
└─────────────────────┘ └──────────────────┘ └───────────────────┘
After (one code path):
┌──────────────────────────────────────────────────────────┐
│ startPlayback(from: position) │
│ - send StartPlaybackRequest to server │
│ - server creates fresh pipeline at position │
│ - client initializes decoders from StreamInitialization │
│ - startup buffering → playing │
│ ~170 lines (plus shared teardown) │
└──────────────────────────────────────────────────────────┘
The net result: approximately 1,700 lines of complex state management removed, approximately 170 lines of straightforward start-from-position logic added. Every bug in the original pause, seek, and reconnection code paths was eliminated, because the code that contained them no longer exists.
The Teardown Sequence
When the user pauses, seeks, or loses the network connection, the client executes the same teardown sequence:
func teardownPlaybackState() {
// 1. Immediately reject all in-flight packets
currentSessionID = nil
// 2. Stop consumer loops
videoConsumerTask?.cancel()
audioConsumerTask?.cancel()
// 3. Stop display link and audio position tracking
stopDisplayLink()
stopAudioPositionTracking()
// 4. Gracefully shut down audio (prevents pop)
audioDecoder?.pause()
audioDecoder?.shutdown()
// 5. Shut down video decoder
videoDecoder?.shutdown()
// 6. Finish AsyncStream continuations
videoStreamContinuation?.finish()
audioStreamContinuation?.finish()
// 7. Clear all buffer state
videoFrameQueue.removeAll()
videoBufferStartPTS = nil
videoBufferEndPTS = nil
audioBufferStartPTS = nil
audioBufferEndPTS = nil
videoPTSOffset = nil
// 8. Reset EOS flags
eosReceived = false
videoConsumerDone = false
audioConsumerDone = false
}
Step 1 is the most important. Setting currentSessionID to nil immediately causes all incoming packets to be rejected by the ingress filter:
func shouldAcceptIngressPacket(sessionID: String) -> Bool {
guard let currentID = currentSessionID else { return false }
return sessionID == currentID
}
This eliminates an entire class of race conditions. When the user seeks, the server may still be sending frames from the old position. Those frames are in flight on the network and will arrive over the next few hundred milliseconds. Without immediate session ID invalidation, those stale frames would be decoded and displayed, causing a brief flash of the old position before the new frames arrive. With session ID invalidation, they are silently dropped before touching any decoder.
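To make the behavior concrete, here is a minimal self-contained sketch of the filter in isolation (IngressFilter is an illustrative wrapper for this post, not a type from the codebase):

```swift
// Each incoming packet carries the session ID it was produced under.
// Packets from a torn-down session are dropped before reaching any decoder.
struct IngressFilter {
    var currentSessionID: String?

    func shouldAccept(sessionID: String) -> Bool {
        guard let currentID = currentSessionID else { return false }
        return sessionID == currentID
    }
}

var filter = IngressFilter(currentSessionID: "session-A")

// A frame from the live session passes through.
let liveAccepted = filter.shouldAccept(sessionID: "session-A")

// Teardown step 1: invalidate the session ID.
filter.currentSessionID = nil

// In-flight frames from the old session are now silently dropped,
// even though they carry the ID that was valid when they were sent.
let staleAccepted = filter.shouldAccept(sessionID: "session-A")
```

Note that the nil case rejects everything, not just mismatches: between teardown and the next startPlayback there is no session at all, so no packet should be accepted.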
Step 4 is subtle. Calling audioDecoder.pause() before shutdown() drains the audio player node gracefully. Shutting down the audio engine while buffers are still playing produces an audible click. Pausing first lets the current audio buffer finish rendering, then shutdown clears the engine state cleanly.
Pause and Resume
Pause saves the current position, tears down, and sends a stop request to the server:
func pause() {
pausedPosition = currentPosition
playbackState = .paused
teardownPlaybackState()
sendStopPlaybackRequest()
}
Resume starts a new session from the saved position:
func resume() {
startPlayback(from: pausedPosition)
}
The display layer (AVSampleBufferDisplayLayer) is not torn down during pause. The last displayed frame remains visible as a frozen image. This is important for the user experience; tearing down the display layer would show a black screen during pause, which feels broken. The layer survives across sessions and is reused when the new session begins emitting frames.
Seek
Seek is nearly identical to pause-then-resume, with an additional reentrancy guard:
func seekToPosition(_ targetPosition: Double) {
guard !isSeekInProgress else { return }
isSeekInProgress = true
teardownPlaybackState()
sendStopPlaybackRequest()
resetIngressBuffer()
startPlayback(from: targetPosition)
isSeekInProgress = false
}
The isSeekInProgress flag prevents duplicate seeks during the async operations inside startPlayback. Without it, a user scrubbing the timeline quickly could trigger overlapping seek operations that race to create sessions.
resetIngressBuffer creates a fresh StreamIngressMailbox, discarding any packets from the old session that might still be queued. This is belt-and-suspenders alongside the session ID check; it ensures the mailbox is empty when the new session starts.
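A sketch of why replacing the mailbox wholesale is simpler than filtering it (this StreamIngressMailbox is a stand-in queue for illustration, not ShowShark's actual type):

```swift
// Stand-in packet and mailbox types for illustration only.
struct Packet { let sessionID: String }

final class StreamIngressMailbox {
    private(set) var queued: [Packet] = []
    func enqueue(_ p: Packet) { queued.append(p) }
}

var mailbox = StreamIngressMailbox()
mailbox.enqueue(Packet(sessionID: "old-session"))   // stale packet still queued

// resetIngressBuffer: swap in a fresh mailbox rather than scanning and
// filtering the old one, so the new session starts from a provably empty queue.
func resetIngressBuffer() -> StreamIngressMailbox { StreamIngressMailbox() }
mailbox = resetIngressBuffer()

let countAfterReset = mailbox.queued.count
```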
Scrubbing Optimization
When the user scrubs the timeline (dragging the playback position rapidly), we do not want to start a full new session for every intermediate position. The client distinguishes between "freeze for scrubbing" and "seek to final position":
func freezeForScrubbing() {
// Lightweight: stop display link and consumer loops, pause audio
// Do NOT tear down decoders or stop server session
}
func unfreezeFromScrubbing() {
// If the seek distance is small, just resume without restarting
}
When the user lifts their finger from the scrubber, a single seek to the final position triggers the full teardown-and-restart. This keeps scrubbing responsive without generating dozens of server sessions.
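The scrub-end decision reduces to a small pure function. A hedged sketch (the threshold value and names here are illustrative, not the shipped tuning):

```swift
// Decide what to do when the user lifts their finger from the scrubber.
// Small movements resume in place; larger ones take the full
// teardown-and-restart seek path.
enum ScrubEndAction { case resumeInPlace, fullSeek }

func actionOnScrubEnd(from startPosition: Double,
                      to finalPosition: Double,
                      smallSeekThreshold: Double = 2.0) -> ScrubEndAction {
    // If the final position is within the threshold of where playback froze,
    // the buffered session can simply continue.
    abs(finalPosition - startPosition) <= smallSeekThreshold
        ? .resumeInPlace
        : .fullSeek
}
```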
Reconnection
Network disconnection follows the same pattern:
func handleConnectionLost() {
pausedPosition = currentPosition
playbackState = .reconnecting
teardownPlaybackState()
}
func handleReconnected() {
startPlayback(from: pausedPosition)
}
The client monitors the WebSocket connection state. On disconnection, it saves the position and tears down. When the connection is reestablished and authentication succeeds, it starts a fresh session from the saved position. The user sees a brief buffering indicator and then playback continues from where it left off.
This approach handles an edge case that the original implementation struggled with: if the network drops during a seek, the old implementation had to track whether the seek had completed on the server, whether the pipeline was in a seeking or paused state, and whether the decoder had been reinitialized. The teardown pattern does not care; whatever state the system was in, it tears everything down and starts fresh.
End-of-Stream Buffer Draining
End-of-stream (EOS) handling is the one area where "tear down and restart" does not apply, because there is no restart. The stream is finished, and the remaining buffered content needs to play out completely.
The naive approach is to transition to the .completed state when the server sends StreamEndOfStream. But the client has 1-3 seconds of buffered video frames and scheduled audio buffers at that point. Completing immediately would cut off the end of the movie.
Server-Side EOS
The server waits for both the video and audio pull loops to finish before sending StreamEndOfStream. Video typically finishes first (it hits the end of the stream and gst_app_sink_try_pull_sample returns nil). Audio may continue for another second or two. The server gives audio a 2-second timeout after video ends:
Video pull loop ends ─────┐
│
├─ Wait up to 2 seconds
│
Audio pull loop ends ─────┤
│
▼
Send StreamEndOfStream
Client-Side EOS Drain
When the client receives StreamEndOfStream, it does not immediately complete playback. Instead, it sets a flag and finishes the AsyncStream continuations:
func didReceiveStreamEnd(sessionID: String, ...) {
guard sessionID == currentSessionID else { return }
eosReceived = true
videoStreamContinuation?.finish()
audioStreamContinuation?.finish()
}
Finishing the continuations causes the consumer loops to drain naturally. The video consumer loop iterates for await packet in videoStream; when the stream finishes, the loop exits and sets videoConsumerDone = true. The audio consumer does the same.
The display link callback monitors three conditions:
// In displayLinkCallback(), after normal sync logic:
if eosReceived && videoConsumerDone && videoFrameQueue.isEmpty {
completeAfterBufferDrain()
}
Only when all three are true does playback actually complete. This means every decoded frame is displayed and every scheduled audio buffer finishes rendering.
Audio-Only EOS
For audio-only content (music streaming), there is no display link. The audio position tracking timer (which fires 4 times per second) handles EOS instead:
// In audio position tracking timer:
if eosReceived && audioConsumerDone {
if !audioDecoder.hasScheduledBuffersRemaining {
completeAfterBufferDrain()
}
}
The hasScheduledBuffersRemaining check is important. The audio consumer loop finishes when it has submitted all buffers to the player node. But the player node may still be rendering the last few buffers. Completing before they finish would clip the last second of a song. The check polls the player node's buffer completion count and waits until all submitted buffers have been rendered.
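The bookkeeping behind that check amounts to comparing two counters. A sketch under stated assumptions (ScheduledBufferTracker is a stand-in, not the real AVAudioPlayerNode wrapper; in practice the rendered count would be driven by the player node's buffer completion callbacks):

```swift
// Each buffer scheduled on the player node increments `submitted`;
// the node's completion handler increments `rendered`. Draining is
// complete only when every submitted buffer has been rendered.
final class ScheduledBufferTracker {
    private(set) var submitted = 0
    private(set) var rendered = 0

    func didSchedule() { submitted += 1 }
    func completionHandlerFired() { rendered += 1 }

    var hasScheduledBuffersRemaining: Bool { rendered < submitted }
}

let tracker = ScheduledBufferTracker()
for _ in 0..<3 { tracker.didSchedule() }   // consumer loop submits its last buffers
tracker.completionHandlerFired()
tracker.completionHandlerFired()
let stillDraining = tracker.hasScheduledBuffersRemaining   // one buffer left

tracker.completionHandlerFired()
let fullyDrained = !tracker.hasScheduledBuffersRemaining   // safe to complete
```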
State Reset
All EOS-related flags (eosReceived, videoConsumerDone, audioConsumerDone) are reset in three places:
- startPlayback() — beginning of a new session
- stopPlayback() — explicit stop
- teardownPlaybackState() — pause, seek, disconnect
This ensures clean state for the next playback session regardless of how the previous one ended.
The Cost
The full teardown pattern has one genuine drawback: seek latency. Starting a new server session, negotiating codecs, waiting for the first keyframe, and buffering enough data to begin playback takes 1-2 seconds. An in-pipeline seek can be nearly instantaneous for formats with dense keyframes.
For ShowShark, this tradeoff is acceptable. The 1-2 second seek time is consistent and predictable. The alternative was an in-pipeline seek that was fast when it worked but produced desynchronized audio, wedged pipelines, or corrupted output when it did not. Predictable and correct beats fast and fragile.
The tradeoff also simplifies the ABR controller (Part 5). Since each seek creates a new session, the controller starts fresh with a 2 Mbps baseline. There is no need to carry over bitrate state from a session that was playing different content at a different position.
The Broader Principle
The full teardown pattern is an application of a broader principle: when the cost of reconstructing state is low enough, do not try to preserve it through transitions. GStreamer pipelines are expensive to create (10-50ms depending on the codec), but that cost is dwarfed by the network round trip and startup buffering that dominate seek latency. Since we are already paying for a 1-2 second delay, the incremental cost of creating a fresh pipeline is negligible.
The pattern also reflects a property of the system's architecture. Because the server's startPlayback handler already handles arbitrary start positions, codec negotiation, and stream initialization, there is no new code needed to support pause, resume, seek, or reconnection. Each of those operations reduces to "stop; start from position X."
Series Conclusion
Over six posts, we have covered:
- Pipeline construction — detecting source media, selecting codecs, and assembling GStreamer elements into a transcoding graph
- Pull-based streaming — extracting frames with software pacing, startup burst buffering, and the state-struct pattern
- Audio processing — downmix compensation, DC offset removal, codec-specific decoder selection, and the DTS corruption story
- A/V synchronization — segment stream time alignment for Blu-ray, startup audio gating, and client-side sync with audio as master clock
- Adaptive bitrate — why client throughput is meaningless in a 1x-paced system, and how server-side send timing provides the real congestion signal
- Full teardown — collapsing pause, seek, resume, and reconnection into a single code path by destroying and recreating state rather than preserving it
Each of these topics emerged from production problems. The DTS decoder that outputs DC. The IIR filter that diverges. The BDMV PTS epoch that is off by two seconds. The ABR controller that oscillates. None of these problems are described in textbooks, and most of them are invisible until you encounter specific media files on specific hardware.
ShowShark streams video to iPhones, iPads, Apple TVs, Macs, and Vision Pro. The architecture described in this series handles everything from a 480p AVI with MPEG-4 Part 2 video and MP3 audio to a 4K HDR Blu-ray remux with DTS-HD Master Audio. The same pipeline construction, the same pull loops, the same sync controller. The complexity lives in the details; the architecture stays simple.