research · 12 min read

Video Watermarking: Per-Frame DCT, Temporal Schemes, and Stable Signature

Why video watermarking is the hardest modality, what Hartung & Girod started in 1998, what Stable Signature changed in 2024, and how our per-frame DCT scanner works on real videos in your browser.

This closes our watermarking series. We've covered image, text, and audio; video is the fourth and hardest modality. We just shipped a working per-frame watermark scanner at /check/video — it runs the same DCT mid-band detector from the image module on every sampled frame and aggregates the per-frame results into a video-level verdict.

The code is at lib/video-forensics.ts. The per-frame scan is the trivial part. The interesting story is why video watermarking is hard, what production schemes do differently, and where the field is going.

Why video is harder than image, text, or audio

For a still image, watermarking is a one-shot game. You embed a signal in one set of bytes; you detect it in one set of bytes. Done.

Text and audio are one-dimensional: tokens or samples in sequence. The math (Kirchenbauer-style hash, cepstral analysis) operates over a single dimension.

Video is two-dimensional in space + one-dimensional in time. The watermark needs to survive:

  1. Per-frame compression (H.264, HEVC, AV1, VP9) — each frame is independently or predictively compressed, often heavily.
  2. Inter-frame prediction — most frames in a video stream are predicted from neighbors, not stored fully. Embedding a strong watermark on every frame fights this prediction (the encoder spends bits compensating, blowing up file size).
  3. Temporal attacks — frame drops, frame insertions, time stretching, slow motion, fast-forwarding. None of these affect a still image but all destroy synchronization-dependent video watermarks.
  4. Spatial attacks — every spatial attack from image watermarking (crop, rotate, color shift, re-encode) applies, frame by frame.
  5. Resolution and frame-rate scaling — common in transcoding pipelines (1080p → 720p, 60fps → 30fps).

The attack surface is roughly multiplicative across spatial + temporal axes. Any single attack class can defeat a scheme that didn't account for it.

Three families of approaches

Family 1: Per-frame spatial watermark

Treat each frame as an image; embed a watermark using your favorite image scheme; detect by majority vote across frames.

This is what our scanner does. Pros:

  • Reuses image watermarking literature wholesale
  • Robust to frame drops (you just have fewer detections; the remaining frames carry the signal)
  • Robust to time stretching (frames don't get distorted; just played at different timing)
  • Survives inter-frame compression if the watermark occupies frequency bands the codec preserves at realistic bitrates (mid-band coefficients, not fine high-frequency detail)

Cons:

  • File size penalty: per-frame embedding fights inter-frame prediction
  • Doesn't survive cross-frame averaging (temporal smoothing or motion-blur interpolation defeats it)
  • Doesn't catch temporal AI artifacts (the watermark is on individual frames, not on the temporal coherence)

Pioneered by Hartung and Girod, "Watermarking of uncompressed and compressed video," Signal Processing, vol. 66, 1998.

Family 2: Temporal watermark

Embed the signal in the relationships between frames: inter-frame delays, motion vector statistics, scene cut patterns. The watermark lives in the time dimension specifically.

Pros:

  • Cheap on file size (the per-frame appearance is unmodified)
  • Resilient to per-frame compression
  • Can survive resolution changes

Cons:

  • Defeated by frame-rate conversion (re-interpolation destroys timing relationships)
  • Defeated by frame drops or duplications
  • Lower capacity (fewer bits per second of video)

Used in digital cinema for screener traceability (different cinemas get videos with slightly different motion vector patterns; a leaked screener can be traced).
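To make the idea concrete, here is a toy sketch of a timing-based scheme — not any production design, and all names are hypothetical. It hides one bit per scene cut in the parity of the cut's frame index:

```typescript
// Toy temporal watermark: one bit per scene cut, carried by the parity of
// the cut's frame index. Purely illustrative; real schemes modulate motion
// vectors or millisecond-scale timing, not frame parity.
function embedTimingBits(cutFrames: number[], bits: number[]): number[] {
  // Nudge each cut forward one frame when its parity disagrees with the bit.
  return cutFrames.map((f, i) => (f % 2 === bits[i] ? f : f + 1));
}

function extractTimingBits(cutFrames: number[]): number[] {
  return cutFrames.map((f) => f % 2);
}
```

Note how fragile this is: dropping a single frame before a cut, or re-timing the stream through frame-rate conversion, flips every downstream parity — exactly the frame-drop and re-interpolation weaknesses listed in the cons.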

Family 3: Latent-space watermark for generative video

Modern. Embed the watermark in the initial latent noise used to seed a video diffusion model (similar to how Tree-Ring works for images). The watermark propagates through the generation pipeline naturally.

Pros:

  • Built into generation, no post-hoc embedding step
  • Survives most spatial and temporal attacks
  • Can encode model identity, generation timestamp, prompt hash

Cons:

  • Only applicable to videos generated by that specific model
  • Detection requires access to (or inversion of) the generation pipeline
  • Doesn't help with real video that needs watermarking after the fact

Pioneered by Fernandez et al., "The Stable Signature: Rooting Watermarks in Latent Diffusion Models," ICCV 2023, and extended to video diffusion in Stable Signature Video (2024).

What our scanner does

The scanner at /check/video is family 1: per-frame DCT detection.

For each of the 12 evenly-spaced frames we sample from the video:

  1. Convert to luminance
  2. Block-DCT (8×8)
  3. Extract mid-band coefficient F[3,4] from each block
  4. Compute z-score against the keyed-sign expectation under our demo pattern
  5. Mark frame as "detected" if z > 2.33 (one-sided p ≈ 0.01)
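Steps 2–4 can be sketched as follows. This is a minimal illustration; `dctCoeff` and `frameZ` are hypothetical names, not the actual lib/video-forensics.ts API:

```typescript
// Single 2-D DCT-II coefficient F[u][v] of one 8x8 luminance block.
function dctCoeff(block: number[][], u: number, v: number): number {
  const cu = u === 0 ? Math.SQRT1_2 : 1;
  const cv = v === 0 ? Math.SQRT1_2 : 1;
  let sum = 0;
  for (let x = 0; x < 8; x++) {
    for (let y = 0; y < 8; y++) {
      sum +=
        block[x][y] *
        Math.cos(((2 * x + 1) * u * Math.PI) / 16) *
        Math.cos(((2 * y + 1) * v * Math.PI) / 16);
    }
  }
  return 0.25 * cu * cv * sum;
}

// z-score of the keyed-sign statistic over a frame's blocks. Under the
// null (no watermark), sign(F[3][4]) matches the key on ~half the blocks,
// so matches ~ Binomial(n, 0.5) and z follows the normal approximation.
function frameZ(blocks: number[][][], keySigns: number[]): number {
  const n = blocks.length;
  let matches = 0;
  blocks.forEach((b, i) => {
    if (Math.sign(dctCoeff(b, 3, 4)) === keySigns[i]) matches++;
  });
  return (matches - n / 2) / Math.sqrt(n / 4);
}
```

The keyed-sign test is what makes cropping survivable: the statistic averages over many blocks, so losing some of them just lowers n rather than breaking detection.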

Aggregate into the video verdict:

  • Mean z across frames — the cumulative signal strength
  • Frames detected count — how many individual frames cleared threshold
  • Majority verdict — detected if > half the frames clear threshold
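The aggregation itself is a few lines; here is an illustrative sketch (the result shapes are assumptions, not the scanner's actual types):

```typescript
// Video-level aggregation over per-frame detection results.
interface FrameResult {
  z: number;
  detected: boolean; // true when that frame's z cleared 2.33
}

function aggregateVerdict(frames: FrameResult[]) {
  const meanZ = frames.reduce((s, f) => s + f.z, 0) / frames.length;
  const framesDetected = frames.filter((f) => f.detected).length;
  return {
    meanZ,
    framesDetected,
    detected: framesDetected * 2 > frames.length, // strict majority
  };
}
```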

We also show:

  • A per-frame z bar chart (12 bars, color-coded: green = detected, cyan = small positive, rose = negative)
  • The key-frame sign-match heatmap (full-resolution DCT block grid from the mid-clip frame)

This is enough for the user to see what frame-level watermark detection looks like.

What we don't do (and why)

We don't embed. Re-encoding a video in the browser requires either ffmpeg.wasm (~20 MB Wasm bundle) or a JS-native codec (slow). Neither is justified for a demo. If you want to round-trip the watermark, extract a key frame from your video, watermark it on /check/image, and verify there. The math is identical.

We don't detect temporal watermarks. That would require either:

  • Knowing the specific scheme (we'd implement the inverse for one family)
  • Access to motion vector data (browsers don't expose this — we'd need to parse the video container ourselves)

We don't detect Stable Signature / latent watermarks. Those need the generator's decoder model (~500 MB for a typical SVD model). Production detection happens server-side.

These omissions are honest: the path forward for video watermarking is the family 3 schemes integrated with C2PA video manifests for cryptographic provenance. Per-frame DCT is the educational reference, useful for triaging unsigned content.

The attack landscape

What survives our per-frame DCT scanner:

  • ✅ H.264/H.265 re-compression at ~1 Mbps and above
  • ✅ Cropping (block alignment isn't strict because we average across many blocks)
  • ✅ Frame rate down-conversion (30fps → 24fps): you have fewer frames but each one still carries the signal
  • ✅ Mild color/gamma adjustments
  • ✅ Sub-clip extraction (e.g., 10s clip out of a 60s file)

What kills it:

  • ❌ Heavy denoising / motion blur / temporal smoothing
  • ❌ Re-rendering the video as a screen recording at 4× zoom
  • ❌ Strong rotation or perspective distortion
  • ❌ Format conversions that involve full re-encoding at low bitrates (< 500 kbps)
  • ❌ Adversarial removal: spectral subtraction at the watermarked DCT band
  • ❌ Frame-blending interpolation (e.g., 24fps → 60fps via optical flow)
  • ❌ Re-photographing the video off a monitor

The same fundamental message as for image watermarks: this is a cooperative-signal scheme. It survives reasonable transcoding. It doesn't survive an adversary who knows the scheme and is willing to degrade quality.

What production looks like in 2026

The current state of the art for video watermarking:

  1. Adobe Premiere + C2PA — Premiere's "Content Credentials" pipeline signs video exports at the manifest level. The video bytes themselves don't carry a survivable watermark, but the manifest does. Adoption is growing in newsrooms.

  2. Stable Signature Video (Meta, 2024) — extends the still-image latent-space watermark to video diffusion models. Open-sourced detector.

  3. AudioSeal for the audio track — many "video watermarks" in 2026 are actually audio watermarks on the audio track, where the math is much cleaner and the attack landscape friendlier. We covered audio watermarking in detail in our previous article.

  4. Per-frame neural watermarks — adaptations of StegaStamp and friends to per-frame video embedding. Survives compression and minor temporal attacks; loses to aggressive transcoding.

  5. Hybrid schemes — combine per-frame watermark + temporal fingerprint + audio-track watermark + C2PA manifest. Detection runs all four; majority verdict wins.

The hybrid approach is where this is going, and where we'll head with the detector. Each individual scheme is brittle; combining them with statistical voting is robust.
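That voting layer is simple to express. A minimal sketch, with assumed shapes rather than a real API:

```typescript
// Majority vote across independent detectors (per-frame watermark,
// temporal fingerprint, audio-track watermark, C2PA manifest check).
type DetectorResult = { scheme: string; detected: boolean };

function hybridVerdict(results: DetectorResult[]): boolean {
  const positives = results.filter((r) => r.detected).length;
  return positives * 2 > results.length; // strict majority wins
}
```

The robustness comes from independence: an attack that strips the per-frame DCT signal typically leaves the audio-track watermark and the manifest untouched.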

What "real video watermarking" looks like at a Hollywood studio

Worth a paragraph on what serious deployment looks like. The film industry has been watermarking screeners since the 1990s. A typical major-studio pipeline:

  1. Each authorized cinema receives a uniquely watermarked copy of the film. The watermark is a combination of:

    • Subtle per-frame DCT bias (different key per cinema)
    • Spread-spectrum audio watermark in the soundtrack
    • Per-scene timing modulation (different cinemas have different inter-cut delays measured in milliseconds — invisible to viewers but recoverable from a leaked file)
  2. If a screener leaks online, the studio's detection pipeline runs all three checks against the leaked file. Two of three agreeing is enough for a confidence call. The studio then knows which cinema leaked it.

  3. The schemes are deliberately classified. Studios don't publish their watermark designs because that would tell pirates how to remove them. The cat-and-mouse continues, but the deterrent effect is real — most cinemas don't leak.

The lesson for content authentication on the open web: cryptographic provenance (C2PA) for the cooperative path, statistical watermarking for traceability, both layered behind detection pipelines that don't depend on any single scheme. Same lesson we've drawn for image, text, and audio.

The four-modality conclusion

Across all four modalities, watermarking is one of three pillars of digital content authentication:

| Pillar | Approach | Strength |
|---|---|---|
| Cryptographic provenance | C2PA, Content Credentials | Verifies who and when. Bulletproof when present. |
| Statistical watermarking | Spread-spectrum, green-list, echo hiding | Survives transcoding, doesn't require key cooperation from carrier. |
| Forensic detection | ELA, FFT, perplexity, spectral analysis | Catches unsigned content; statistical inference, error-prone. |

A real authentication system uses all three. Our forensic dashboard is the third pillar with hooks into the second. The first (C2PA) is integrated where present. This is the architecture; the rest is just implementation.

What we will keep building

Three near-term items:

  1. Multi-key watermark scanning — extend each modality's scanner to test multiple known academic and production schemes per upload (not just our demo pattern). This is the high-ROI move because it directly increases what we can detect.

  2. C2PA signing capability — let users sign their own content. This is the monetizable Pro feature.

  3. API access — same scanners over HTTP, for fraud teams and moderation pipelines.

Items we deprioritized:

  • Real-time video watermark embed (file size + codec hell)
  • Browser-side video re-encoding (ffmpeg.wasm is too heavy for our use case)
  • Detection of proprietary closed-key schemes (we'd just be guessing at keys)

Further reading

  • F. Hartung, B. Girod. Watermarking of uncompressed and compressed video. Signal Processing, vol. 66, 1998.
  • P. Fernandez, A. Sablayrolles, T. Furon, H. Jegou, M. Douze. The Stable Signature: Rooting Watermarks in Latent Diffusion Models. ICCV 2023.
  • Stable Signature Video addendum (2024).
  • P. Fernandez, et al. Video Seal: Open and Efficient Video Watermarking. arXiv:2412.09492, 2024 (Meta's video extension of AudioSeal).
  • D. Megías et al. Survey on Video Watermarking. Multimedia Tools and Applications, 2023.

For the parallel image, text, and audio stories, see the earlier posts in this series.

For the larger argument about why cryptographic provenance is the long-term answer, see What is C2PA.