guide · 10 min read

How AI Video Forensics Work: From Frame Consistency to Face-Landmark Drift

Detecting AI video is harder than detecting AI images. The signal is in motion — temporal consistency, optical-flow flicker, face-landmark stability, and audio-visual sync. A practical guide.

A still image gives you one chance to fool a forensic check. A video gives you thousands. Every frame of an AI video has to maintain consistent geometry, lighting, hair, clothing, and identity — and most generators don't quite manage it. The tells are there if you know where to look. Here's the field guide.

Why video is harder to fake (and harder to detect)

Generators have to solve two problems at once: each frame must look photorealistic, and the frames must be temporally consistent. A character's left earlobe in frame 1 has to be the same earlobe in frame 60. A reflection in a coffee cup has to track the camera. A flickering streetlight has to flicker on a believable rhythm.

That extra constraint creates extra forensic surface. The five most reliable video tells are below — none is bulletproof; all combine well.

1. Frame-to-frame consistency

Take any short clip and extract every Nth frame. Pixel-diff adjacent frames after motion compensation. Real video, even at high frame rates, has predictable inter-frame differences (sensor noise + actual motion). AI video has too-clean inter-frame transitions in static regions, and too-noisy transitions in detail-heavy regions.

The simplest version of this check: look at a region of skin, count how many distinct micro-textures appear in 30 frames. Real skin has constant micro-flicker; AI skin often has fewer than 5 distinct textures cycling.
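A minimal NumPy sketch of the inter-frame diff idea, using synthetic frames in place of a decoded clip and skipping motion compensation (the scene is static). The noise level and frame count are illustrative assumptions, not calibrated values:

```python
import numpy as np

def interframe_diff(frames):
    """Mean absolute pixel difference between consecutive frames."""
    return float(np.mean([
        np.abs(a.astype(np.int16) - b.astype(np.int16)).mean()
        for a, b in zip(frames, frames[1:])
    ]))

rng = np.random.default_rng(0)
base = rng.integers(30, 225, (64, 64), dtype=np.uint8)

# "Real" footage: the same scene plus fresh sensor noise every frame.
real = [np.clip(base + rng.normal(0, 3, base.shape), 0, 255).astype(np.uint8)
        for _ in range(30)]
# "AI-like" footage: a static region rendered identically every frame.
fake = [base.copy() for _ in range(30)]

print(interframe_diff(real) > 1.0, interframe_diff(fake))  # True 0.0
```

Real footage never hits an inter-frame diff of exactly zero: the sensor-noise floor guarantees a baseline difference even in perfectly static regions.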

2. Optical-flow flicker

Compute optical flow between consecutive frames (Lucas-Kanade or Farnebäck — open-source). The flow field for a real video is locally smooth: pixels in a coherent moving object share similar motion vectors. AI video often has micro-flicker — small regions where the flow vector flips chaotically frame-to-frame.

You can sometimes see this with the naked eye on hair, fabric edges, or background foliage. A leaf in a real video sways smoothly. An AI leaf sometimes jitters in a way no wind could produce.
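A sketch of the flicker measurement, assuming the flow fields have already been computed (e.g. stacked outputs of OpenCV's `cv2.calcOpticalFlowFarneback`). The synthetic shapes, noise levels, and thresholds here are illustrative:

```python
import numpy as np

def flicker_score(flows):
    """Fraction of pixels whose flow vector reverses direction
    (negative dot product) between consecutive flow fields.
    flows: array of shape (T, H, W, 2)."""
    a, b = flows[:-1], flows[1:]
    dots = (a * b).sum(axis=-1)   # per-pixel dot product across time
    return float((dots < 0).mean())

rng = np.random.default_rng(1)
# Smooth pan: every pixel moves (1, 0) with mild noise -> few reversals.
smooth = np.tile([1.0, 0.0], (20, 32, 32, 1)) + rng.normal(0, 0.1, (20, 32, 32, 2))
# Chaotic micro-flicker: motion direction is random every frame.
chaotic = rng.normal(0, 1.0, (20, 32, 32, 2))

assert flicker_score(smooth) < 0.05
assert flicker_score(chaotic) > 0.3
```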

3. Face-landmark drift

Most synthetic faces are passable in any single frame. Across 100 frames, they drift. Use any standard face-landmark detector (dlib's 68-point predictor or MediaPipe's face mesh — both open-source; MediaPipe runs in-browser):

  • Track 68 facial landmarks across the clip
  • Plot inter-eye distance vs time
  • Plot jaw-line length vs time
  • Plot ear-to-ear width vs time

A real face's landmarks vary within a tight band (the face is rigid; only the camera moves). An AI face's landmarks wander — a few pixels every few frames, but the wander accumulates. By the end of a 5-second clip, the geometry has often shifted enough to fail a stability test.
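A sketch of the stability test, taking the landmark tracks as given from a detector such as dlib or MediaPipe. Indices 36 and 45 are the outer eye corners in the 68-point convention; the jitter and drift magnitudes are invented for the demo:

```python
import numpy as np

def landmark_drift(landmarks, i, j):
    """Coefficient of variation of the i-j landmark distance over a clip.
    landmarks: (T, N, 2) array of (x, y) points per frame."""
    d = np.linalg.norm(landmarks[:, i] - landmarks[:, j], axis=-1)
    return float(d.std() / d.mean())

rng = np.random.default_rng(2)
T, N = 150, 68
face = rng.uniform(20, 80, (1, N, 2))            # one rigid face layout
rigid = face + rng.normal(0, 0.2, (T, N, 2))     # detector jitter only
# AI-style drift: each landmark wanders on a random walk that accumulates.
drifting = rigid + np.cumsum(rng.normal(0, 0.15, (T, N, 2)), axis=0)

assert landmark_drift(drifting, 36, 45) > landmark_drift(rigid, 36, 45)
```

In practice the landmark tracks should first be normalized for head pose and scale, so that camera motion doesn't masquerade as geometric drift.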

4. Audio-visual sync

If the clip has talking, lip-sync is a heavy tell. Real lip movement leads audio by ~80–120ms because the mouth has to form the shape before the sound exits. AI generators often slip outside that window — sometimes audio leads, sometimes the lips are too synchronized (perfect alignment, which is also unnatural).

The math: cross-correlate mouth-aspect-ratio time series against audio energy time series. Real video: peak correlation at a small negative lag (mouth leads). AI video: peak at zero lag, or at random lag.
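A sketch of the lag search in NumPy, using a synthetic mouth-aspect-ratio series and an audio-energy series that trails it by three frames. Real series would come from a landmark tracker and the audio RMS envelope; the three-frame offset is an illustration:

```python
import numpy as np

def sync_lag(mouth, audio, max_lag=10):
    """Lag (in frames) at which the mouth series best matches audio.
    Negative = mouth leads audio; zero = suspiciously perfect sync."""
    m = (mouth - mouth.mean()) / mouth.std()
    a = (audio - audio.mean()) / audio.std()

    def corr_at(lag):  # correlate m[t + lag] against a[t]
        if lag >= 0:
            x, y = m[lag:], a[:len(m) - lag]
        else:
            x, y = m[:lag], a[-lag:]
        return float((x * y).mean())

    return max(range(-max_lag, max_lag + 1), key=corr_at)

rng = np.random.default_rng(3)
mouth = rng.normal(size=200)
audio = np.roll(mouth, 3)        # audio delayed 3 frames: mouth leads

print(sync_lag(mouth, audio))    # -3
```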

5. Compression and noise residual across frames

Apply our same image-forensic primitives to individual frames:

  • ELA on a frame should look like ELA on a real photo (high delta on edges, low on smooth regions)
  • Noise residual should reveal sensor noise that's consistent in pattern across all frames from the same source

AI video frames often produce ELA maps that are uniformly dark (born at one compression level) or noise residuals that are too clean. If you average noise residuals across many frames, real footage gives you a stable PRNU fingerprint; AI footage gives you mush.
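A sketch of the residual-averaging idea in NumPy. The denoiser here is just a 3×3 box blur (PRNU work typically uses a wavelet denoiser), and the fixed `prnu` pattern and noise levels are invented for the demo:

```python
import numpy as np

def residual(frame):
    """High-frequency residual: frame minus a 3x3 box blur."""
    f = frame.astype(np.float64)
    p = np.pad(f, 1, mode="edge")
    h, w = f.shape
    blur = sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0
    return f - blur

def fingerprint_strength(frames):
    """Average per-frame residuals: a fixed sensor pattern survives
    averaging, while per-frame random noise cancels toward zero."""
    avg = np.mean([residual(f) for f in frames], axis=0)
    return float(np.abs(avg).mean())

rng = np.random.default_rng(4)
scene = np.full((32, 32), 128.0)
prnu = rng.normal(0, 2, scene.shape)   # fixed per-pixel sensor pattern
real = [scene + prnu + rng.normal(0, 2, scene.shape) for _ in range(50)]
fake = [scene + rng.normal(0, 2, scene.shape) for _ in range(50)]

assert fingerprint_strength(real) > 3 * fingerprint_strength(fake)
```

Averaging is the whole trick: random noise shrinks roughly with the square root of the frame count, while a genuine sensor fingerprint stays put.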

What defeats every video detector

  • Re-shooting from a screen — the analog hop launders most digital artifacts
  • Heavy compression after generation — drops the high-frequency telltales
  • Short clips (under 2 seconds) — not enough frames to compute drift
  • Low resolution — fewer pixels means less forensic signal per frame

How to combine the signals

A single check on its own has a 20–30% false-positive rate at useful sensitivity. Five independent checks at the same threshold drop the joint false-positive rate dramatically — but only if the failures are independent. They usually are: lip-sync drift and face-landmark drift come from different generator components, so their false positives rarely coincide.
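The arithmetic behind that claim, assuming fully independent checks (real failures are only approximately independent, so treat these numbers as a best case):

```python
from math import comb

def joint_fp_rate(p, k, m):
    """P(at least m of k independent checks false-positive) when each
    check has an individual false-positive rate of p."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(m, k + 1))

# Five checks, each with a 25% individual false-positive rate:
print(round(joint_fp_rate(0.25, 5, 5), 5))   # 0.00098 -- all five agree
print(round(joint_fp_rate(0.25, 5, 3), 4))   # 0.1035  -- majority vote
```

Requiring unanimous agreement turns a 25% per-check rate into roughly one false alarm in a thousand; a looser majority vote trades some of that back for sensitivity.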

The right way to use these is the same as for images: show the user every signal, let them see which fired and which didn't, and treat agreement across signals as the call.

What's coming

Three things are reshaping video forensics in 2026:

  1. C2PA on video — Content Credentials work on video files just like images. Adobe Premiere can sign exports, news cameras can sign at capture. When present, a valid signature beats every heuristic.
  2. Audio-side forensics — vocoder fingerprinting, spectral analysis. Audio AI moves slower than video AI and often leaves cleaner tells.
  3. Behavioral forensics — micro-expressions, blink patterns, breathing. Real humans blink every 2–10 seconds. Some AI faces blink on perfect rhythm. (Also: many AI faces don't blink enough.)

We don't run video forensics on the site yet — it's the next modality after audio. In the meantime, our image forensic dashboard lets you analyze any video frame as a still, and the C2PA panel reads signatures from any signed file, video included.

The honest framing

Video detection is statistically easier than image detection because you have more frames to play with — but it's harder operationally because the signal is spread across time and most users only have a single short clip. The treadmill is the same: every generator release weakens the heuristics, every C2PA adoption strengthens the alternative.

Until provenance is universal, layered forensics remains the right tool. Watch the motion. Watch the eyes. Watch the seams between frames.