Audio Watermarking: Echo Hiding, Cepstral Detection, and What Audio Seal Changed
Working echo-hiding watermark in your browser with cepstral analysis. The math, the perceptual psychology, and the academic landscape — from Bender 1996 to Meta's Audio Seal (2024). Plus a live round-trip that produces a .wav.
The third article in our watermarking trilogy (after image and text) — and the most fun one to implement, because audio gives you cepstral analysis, which is one of the most elegant pieces of signal-processing math you'll meet.
We just shipped a working echo-hiding watermark at /check/audio. Drop any audio file. Click Embed + download watermarked .wav. Re-upload the .wav. Watch the cepstral peak materialize at 1 ms quefrency, and the z-score jump from ≈ 0 to well over 4. The code is at lib/watermark-audio.ts — about 280 lines, including the WAV encoder.
The perceptual physics
Echo hiding exploits a quirk of human hearing called the precedence effect (or Haas effect, 1949). When two sounds arrive at your ears within ~5 ms of each other, your auditory system fuses them into a single perception and uses only the first arrival to localize the source. The trailing copy is heard as part of the "tonal character" of the sound, not as a separate echo.
This means you can add a tiny delayed copy of an audio signal to itself with delay τ in the 0.5–2 ms range and amplitude α around -40 to -50 dB, and the result is essentially indistinguishable from the original to a listener. Bender, Gruhl, Morimoto, and Lu turned this into a watermarking scheme in 1996.
The scheme in one equation
The embedding is dead simple:
y[n] = x[n] + α · x[n − τ]
Where:
- x[n] is the original audio sample at index n
- τ is the echo delay in samples (e.g., 44 samples for 1 ms at 44.1 kHz)
- α is the echo amplitude (we use 0.005, about -46 dB)
If you want to encode bits, you vary τ: bit 0 = 0.5 ms delay, bit 1 = 1.0 ms delay (for example). The receiver detects which delay is present and reads the bit.
Our demo uses a single bit (the presence of a 1 ms echo) across the entire audio. That's enough for the simplest "watermarked / not" check. Multi-bit schemes segment the audio into frames and encode one bit per frame.
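In code, the embed step really is just that one equation inside a loop. Here is a minimal sketch (the function name and signature are illustrative, not the actual lib/watermark-audio.ts API):

```typescript
// Echo-hiding embed: y[n] = x[n] + α · x[n − τ].
// Illustrative sketch; names are hypothetical.
function embedEcho(
  x: Float32Array,       // original samples, normalized to [-1, 1]
  delaySamples: number,  // τ in samples, e.g. 44 for 1 ms at 44.1 kHz
  alpha = 0.005,         // echo amplitude, ≈ -46 dB
): Float32Array {
  const y = new Float32Array(x.length);
  for (let n = 0; n < x.length; n++) {
    // Samples before index τ have no delayed copy to draw from.
    const echo = n >= delaySamples ? x[n - delaySamples] : 0;
    y[n] = x[n] + alpha * echo;
  }
  return y;
}

// A 1 ms echo at 44.1 kHz:
// const watermarked = embedEcho(samples, Math.round(0.001 * 44100));
```

Note that with α = 0.005 the watermarked signal can exceed [-1, 1] by at most 0.5%, so in practice you would clamp or rescale before writing 16-bit PCM.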
Detection — cepstral analysis
If you've never seen the cepstrum before, this is the moment to pay attention. It's one of the most beautiful inventions in signal processing.
A real echo at delay τ in the time domain produces a multiplicative ripple in the magnitude spectrum:
|Y(f)|² ≈ |X(f)|² · |1 + α · e^(-2πifτ)|²
Take the log:
log|Y(f)|² ≈ log|X(f)|² + log|1 + α · e^(-2πifτ)|²
The echo becomes an additive sinusoidal ripple in the log-magnitude spectrum, with period 1/τ along the frequency axis — equivalently, a ripple that oscillates at "rate" τ if you treat the spectrum itself as a signal. To detect that ripple, take the FFT of the log-magnitude spectrum. The result is called the cepstrum (an anagram of "spectrum" — Bogert, Healy, and Tukey coined it in 1963 explicitly because the new domain "behaves like a spectrum of a spectrum"). The horizontal axis of the cepstrum is "quefrency," not frequency, measured in time units.
In the cepstrum, an echo at delay τ shows up as a peak at quefrency = τ.
The full detector:
1. Window x[n] with a Hann window
2. Compute |FFT(x · w)|
3. Take log: log|FFT(x · w)|
4. FFT again (or IFFT — the log-magnitude spectrum is real and even, so either gives the same real part)
5. Look for peaks in the real part
6. A peak at quefrency τ = an echo at delay τ
Average across many windows of the audio and the peak rises out of the noise floor like a flare.
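The six detection steps can be sketched directly. This version is deliberately naive — O(N²) DFT loops instead of a real FFT, and a hypothetical function name rather than the lib/watermark-audio.ts API — but it implements exactly the window → log-magnitude → second transform pipeline described above:

```typescript
// Real cepstrum of one audio frame: Hann window, log|DFT|, DFT again,
// keep the real part. Naive O(N²) DFTs for clarity; use an FFT in production.
function realCepstrumSketch(frame: Float32Array): Float64Array {
  const N = frame.length;
  // 1. Hann window
  const w = new Float64Array(N);
  for (let n = 0; n < N; n++) {
    w[n] = frame[n] * 0.5 * (1 - Math.cos((2 * Math.PI * n) / (N - 1)));
  }
  // 2–3. log-magnitude spectrum
  const logMag = new Float64Array(N);
  for (let k = 0; k < N; k++) {
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const ang = (-2 * Math.PI * k * n) / N;
      re += w[n] * Math.cos(ang);
      im += w[n] * Math.sin(ang);
    }
    logMag[k] = Math.log(Math.hypot(re, im) + 1e-12); // floor avoids log(0)
  }
  // 4–5. transform the log spectrum; the real part is the cepstrum
  const ceps = new Float64Array(N);
  for (let q = 0; q < N; q++) {
    let re = 0;
    for (let k = 0; k < N; k++) {
      re += logMag[k] * Math.cos((-2 * Math.PI * q * k) / N);
    }
    ceps[q] = re / N;
  }
  return ceps;
}
```

An echo at delay τ samples with amplitude α shows up as a peak of roughly α/2 at ceps[τ] (since log|1 + α·e^(iθ)| ≈ α·cos θ for small α), which is what the averaging step then lifts out of the noise floor.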
This works because the cepstrum has a critical property: it separates source (vocal fold vibration, instrument resonance) from system (the room, the recording chain, the watermark echo). The watermark — being a fixed delay added to all of the audio — lives in the system domain. The original audio content lives in the source domain. They occupy different regions of the cepstrum, so the watermark peak doesn't fight with the audio's natural cepstrum.
In our implementation:
const cepstrum = realCepstrum(samples);
const peakQ = findPeakNear(cepstrum, expectedQuefrency);
const peakValue = cepstrum[peakQ];
const zScore = (peakValue - baselineMean) / baselineStdDev;
For a typical 5-second audio clip, the baseline cepstrum has stddev ≈ 0.001. An embedded -46 dB echo produces a peak of ≈ 0.02–0.05 at the right quefrency — a 20-50σ signal. The detection is, statistically speaking, ridiculously strong.
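One way to make the z-score step concrete: compute the baseline mean and standard deviation from the cepstrum itself, excluding a small guard band around the expected quefrency so the peak doesn't contaminate its own baseline. This is a sketch under that assumption — the helper name and parameters are ours, not the actual lib/watermark-audio.ts API:

```typescript
// Z-score of the cepstral bin at the expected echo quefrency, measured
// against the surrounding baseline bins. Hypothetical helper.
function cepstralZScore(
  ceps: Float64Array,
  expectedQ: number, // expected echo quefrency, in samples
  lo = 5,            // skip the lowest quefrencies (spectral envelope lives there)
  hi = 300,          // match the panel's 300-quefrency preview range
  guard = 2,         // exclude bins within ±guard of the expected peak
): number {
  const baseline: number[] = [];
  for (let q = lo; q < Math.min(hi, ceps.length); q++) {
    if (Math.abs(q - expectedQ) > guard) baseline.push(ceps[q]);
  }
  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length;
  const variance =
    baseline.reduce((a, b) => a + (b - mean) ** 2, 0) / baseline.length;
  const std = Math.sqrt(variance) || 1e-12; // avoid division by zero
  return (ceps[expectedQ] - mean) / std;
}
```

With the numbers above (baseline stddev ≈ 0.001, peak ≈ 0.02–0.05), this returns the 20–50σ figure; the dashboard's threshold of z > 4 leaves a huge margin.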
What about the audio quality?
This is the audiophile question. The honest answer:
- For most music, α = 0.005 (-46 dB) with τ = 1 ms is completely inaudible. Professional listeners on critical-quality monitors fail to detect the watermark in blind A/B tests at that strength.
- For very clean, sparse audio (solo piano, isolated voice), trained ears can sometimes hear a slight "ambience" change at α > 0.01.
- For dense music (rock, electronic), even α = 0.02 is inaudible.
The tradeoff frontier:
- α small (≤ 0.005): inaudible, but a weaker detection signal that re-compression breaks sooner
- α large (≥ 0.02): may be audible to some listeners, but much stronger detection that survives more processing
- τ small (< 0.5 ms): harder to hear, since it blends with the first reflections of natural rooms
- τ large (> 5 ms): perceptible as a separate echo
Our demo sits at the inaudible-and-still-detectable corner. Production schemes (Audio Seal, WavMark) train neural models to find better tradeoff points.
Survival under attacks
The classical attack landscape:
What survives:
- ✅ Re-encoding through MP3 at ~128 kbps and above
- ✅ Resampling within ±20% (the echo just moves to a different sample count)
- ✅ Mild equalization
- ✅ Most consumer-grade audio editing software's "normalize" operation
- ✅ Playing through a speaker and re-recording with a mic, for clean recordings
What kills the watermark:
- ❌ Aggressive lowpass filtering below ~4 kHz (too few ripple periods survive in the remaining band for the cepstral peak to stand out)
- ❌ Heavy compression (MP3 below 64 kbps mangles the spectral envelope)
- ❌ Time-stretching or pitch-shifting (the τ relationship breaks)
- ❌ Mixing with another loud signal at similar amplitude (drowns the echo)
- ❌ Applying an inverse echo — subtracting α · y[n − τ] to approximately deconvolve the watermark (the classic anti-watermark attack)
- ❌ Vocoder-style re-synthesis (entire audio is reconstructed; watermark gone)
That last one is important: AI voice clones are often produced through neural vocoders, which means a real echo-hidden watermark on the original speaker's recordings would be wiped in the clone. So echo hiding doesn't help with "is this voice cloned?" — it only helps with "is this an unmodified copy of an authorized recording?"
For voice cloning, the right watermark is on the generated audio, embedded by the cloning tool — like Audio Seal does at the model side. Detection then becomes "is this a known AI voice synthesizer's output?"
Audio Seal (Meta, 2024)
The current state of the art for AI-generated audio watermarking is Audio Seal (Roman et al., Meta FAIR, 2024). Key differences from echo hiding:
- Neural encoder/decoder pair. A small U-Net learns to embed a watermark imperceptibly across the audio. A separate detector network learns to read it back.
- Sample-accurate localization. Audio Seal can identify which exact samples of an audio clip carry the watermark, not just "this clip is watermarked." Useful when only part of an audio file is AI-generated (e.g., a deepfake spliced into a real interview).
- Survives aggressive editing. Audio Seal was trained against a strong augmentation pipeline (MP3, lowpass, noise addition, time stretch). Its robustness profile is much better than echo hiding's.
- Open-sourced. Meta released the model weights and detector under a permissive license. This is the audio equivalent of SynthID-Text — the issuer (Meta) wants their watermark to be the de facto standard.
- Capacity of ~16 bits per audio chunk. Enough for "this is from Voicebox" + "this is chunk #N" tracking.
The math is conceptually similar to our DCT image watermark and our Kirchenbauer text watermark: a small key-driven signal added to the data, detected by statistical correlation. The carrier is just neural-network-shaped instead of "additive echo."
WavMark (2023) and the open-source family
WavMark (Chen et al., 2023) is the open-source predecessor of Audio Seal — a learned watermark for speech with similar architecture. It's not as robust as Audio Seal but is well-documented and easy to integrate.
For DIY: WavMark + PyTorch is the path of least resistance. For production: Audio Seal is more battle-tested.
The detection-vs-provenance lesson, audio edition
The same meta-argument applies as for image and text: watermarking helps with cooperating issuers (you trust the AI lab to embed). For uncooperative issuers, watermarking does nothing. The long-term answer for audio is C2PA Audio Manifests, which carry cryptographic signatures from the recording or generation tool. Adobe Audition signs at export. Some VoIP platforms are testing manifest signing at the gateway.
Until C2PA-for-audio is universal, layered detection (cepstral check + spectral forensics + neural classifier ensemble) is the practical bridge.
What's in your dashboard
When you upload audio at /check/audio, the new "Echo-hiding watermark scan" panel does the cepstral analysis we just described and renders:
- A cepstrum preview chart — the first 300 quefrencies. The expected echo quefrency is highlighted in amber; the strongest detected peak is in green. If green lines up with amber and is much taller than the rest, you have a watermarked audio file.
- A z-score of the peak vs. the cepstrum baseline distribution. Threshold > 4 is a clean detection.
- An embed + WAV download button that runs our embed pipeline (decode → embed echo → encode 16-bit PCM WAV → download). Re-upload the WAV and the panel flips to "detected."
The WAV encoder is included in lib/watermark-audio.ts (~50 lines). 16-bit PCM, mono, with a proper RIFF header carrying the WAVE identifier and the "fmt " (note the trailing space) and "data" chunks. Real WAV files, playable in any media player.
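For reference, an encoder of that shape fits in a few dozen lines. This is a minimal sketch of the same idea, not the actual lib/watermark-audio.ts source:

```typescript
// Minimal 16-bit PCM mono WAV encoder: RIFF header, "fmt " chunk, "data" chunk.
// Illustrative sketch of the format described above.
function encodeWav(samples: Float32Array, sampleRate: number): ArrayBuffer {
  const dataBytes = samples.length * 2; // 16 bits per sample
  const buf = new ArrayBuffer(44 + dataBytes);
  const view = new DataView(buf);
  const writeTag = (off: number, s: string) => {
    for (let i = 0; i < s.length; i++) view.setUint8(off + i, s.charCodeAt(i));
  };
  writeTag(0, "RIFF");
  view.setUint32(4, 36 + dataBytes, true); // RIFF chunk size (little-endian)
  writeTag(8, "WAVE");
  writeTag(12, "fmt ");                    // note the trailing space
  view.setUint32(16, 16, true);            // fmt chunk size
  view.setUint16(20, 1, true);             // audio format: PCM
  view.setUint16(22, 1, true);             // channels: mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  writeTag(36, "data");
  view.setUint32(40, dataBytes, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, Math.round(s * 32767), true);
  }
  return buf;
}
```

The returned ArrayBuffer can be wrapped in a Blob of type audio/wav and handed to a download link, which is the shape of the embed-and-download button described above.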
What's still missing
A few things we'd add to make this production-grade:
- Multi-bit encoding. Currently we embed presence/absence of a single echo. Segmenting the audio into frames and encoding 32+ bits would let us carry an issuer + serial number.
- Spread-spectrum watermarks in addition to echo hiding. Different carrier, different robustness profile, ensemble detection.
- Audio Seal detection. Once a usable JS port of the detector exists, wire it into the panel like we did with c2pa-js for images.
- Length warnings. Detection on < 2-second clips is unreliable; we should warn the user.
- MP3 encoding for embed downloads. Currently we output WAV (large). MP3 from JS is doable but adds ~200 KB of encoder code.
These are next-phase work. The math, the round-trip, and the academic story are live now.
What this completes
This is the third of three watermarking implementations on the same statistical foundation:
- Image (DCT mid-band, Cox et al. 1997) — walkthrough
- Text (Green-list, Kirchenbauer 2023) — walkthrough
- Audio (Echo hiding, Bender 1996) — this article
All three follow the same recipe: pick a signal carrier (DCT coefficients, vocabulary tokens, cepstral peaks), add a small key-driven perturbation, detect via statistical test on the key. The differences are about where you embed and how you read back. The math is the same family.
Every academic paper in any of these domains will, ultimately, be a variation on this theme. If you understand z-tests and key-driven biasing, you can read any watermarking paper.
Further reading
- W. Bender, D. Gruhl, N. Morimoto, A. Lu. Techniques for data hiding. IBM Systems Journal, 1996.
- B. Bogert, M. Healy, J. Tukey. The quefrency alanysis of time series for echoes. Proc. Symposium on Time Series Analysis, 1963 (the cepstrum paper).
- R. San Roman, P. Fernandez, H. Elsahar, A. Défossez, T. Furon, T. Tran. Proactive Detection of Voice Cloning with Localized Watermarking. arXiv:2401.17264, 2024 (Audio Seal).
- G. Chen, Y. Wu, B. Zheng. WavMark: Watermarking for Audio Generation. arXiv:2308.12770, 2023.
- F. Petitcolas, R. Anderson, M. Kuhn. Information Hiding — A Survey. Proc. IEEE 1999 (classical watermarking landscape; still relevant).
For the broader trust-on-the-internet argument, see What is C2PA.