
Laundering Attacks Against AI-Content Detectors — May 2026

What still defeats AI-detection systems in 2026: screenshots, recaptures, diffusion-purification, paraphrase, neural codecs, telephony pipelines, homoglyph substitution. Plus the residual-AUC numbers from RAID, StealthDiffusion, SIDA, ADD-C, Deepfake-Eval-2024, and DMF — and the laundering-robustness UI panel we add to every deep-scan result.

The short answer is that cheap laundering still breaks brittle detectors — but it no longer breaks every detector equally. The harder, more honest answer is that the attacks that win in 2026 are the ones that change the detector's preferred evidence without changing the user-visible meaning very much. Paraphrase wins on text. Telephony and pitch shifting win on audio. Recapture and diffusion-paraphrase win on image and video.

For a product that promises "AI detection that shows its work," the only defensible response is to never present a verdict as a single-point claim. Every deep-scan result should expose whether the verdict survives the ways content actually gets laundered in the wild. This post lays out what the residual-AUC numbers actually say, then specifies the laundering-robustness panel we add to every result page on /check/{image,audio,video,text}.

Image: where the headlines are wrong

The public evidence now separates mild platform-like corruption from true laundering.

Mild JPEG, resizing, and light Gaussian noise do reduce performance, but recent social-media-oriented detectors like SIDA keep AUC close to baseline:

| Attack | Detector | Clean AUC | After attack |
|---|---|---:|---:|
| JPEG 70 | SIDA-7B | 87.3 | 86.2 |
| JPEG 80 | SIDA-7B | 87.3 | 85.8 |
| Resize 0.5× | SIDA-7B | 87.3 | 86.8 |
| Resize 0.75× | SIDA-7B | 87.3 | 87.1 |
| Gaussian noise σ=5 | SIDA-7B | 87.3 | 85.3 |
| Gaussian noise σ=10 | SIDA-7B | 87.3 | 84.1 |

But the bigger failures come from derivative-image creation, transferable adversarial examples, and diffusion-based laundering:

| Attack | Detector | Clean | Residual |
|---|---|---:|---:|
| Transferable adversarial ε=32/255 (RAID) | DINOv2 | 81.2 AUROC | 63.8 |
| Transferable adversarial ε=32/255 (RAID) | CoDE | 87.5 AUROC | 75.4 |
| Diffusion-based laundering (StealthDiffusion) | CNNSpot on GenImage | 95.49 ACC | 0.87 ACC |
| Same, with TRIM robustness wrapper | CNNSpot + TRIM | 89.56 | recovered to 99.83 on StealthDiffusion |
| Screenshot / photo of screen / print | SPAI | 91.0 AVG AUC | not in public table, authors flag as limitation |

The authors of spectral detectors such as SPAI explicitly list screenshots, memes, photos of screens, and printed material as a limitation, because those channels corrupt the frequency cues the detectors rely on. RAID and StealthDiffusion show that carefully crafted perturbations can still push strong detectors far toward failure.
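To make "residual" concrete: it is the same metric recomputed on the attacked copies of the same labelled items. The sketch below is a minimal pure-Python illustration, not the evaluation code of any cited benchmark. The function names `auc` and `residual_auc` and the toy scores are ours and invented for illustration.

```python
from typing import Sequence, Tuple


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive (label 1)
    scores higher than a random negative (label 0); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def residual_auc(
    labels: Sequence[int],
    clean_scores: Sequence[float],
    attacked_scores: Sequence[float],
) -> Tuple[float, float, float]:
    """Return (clean AUC, residual AUC after the attack, absolute drop)."""
    clean = auc(labels, clean_scores)
    residual = auc(labels, attacked_scores)
    return clean, residual, clean - residual


# Toy numbers, invented for illustration only (1 = AI-generated, 0 = real).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_clean = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
toy_attacked = [0.4, 0.8, 0.2, 0.3, 0.2, 0.1]  # laundering lowers two fake scores
print(residual_auc(toy_labels, toy_clean, toy_attacked))
```

The pairwise loop is quadratic, which is fine for a sanity check on a small evaluation set; a production harness would more likely use a library implementation.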

Audio: noise is not the real enemy

The biggest 2026 mistake in audio is thinking "noise" is the threat. Recent work finds pitch shifting, time stretching, echo / reverb, codec compression, real telephony pipelines, and especially neural codecs are more damaging than ordinary additive noise.

| Attack | Detector | Clean | Residual |
|---|---|---:|---:|
| MP3 transcoding | RawNet3 + RandAug + F-SAT | 97.0 ACC | 96.6 |
| AAC transcoding | RawNet3 + RandAug + F-SAT | 97.0 ACC | 96.8 |
| Real comms (codecs + packet loss) | GMM / LCNN / AASIST on ADD-C | baseline | +5.30 EER, -3.16 AUC, -3.34 F1 average degradation |
| Same, with targeted augmentation | ADD-C training strategy | baseline | EER essentially unchanged, AUC down only 0.1% |
| Echo 0.1s | Wav2Vec2 on WaveFake | near-perfect | 0.558 ACC |
| Echo 0.1s | HuBERT | near-perfect | 0.823 |
| Echo 0.1s | Whisper | near-perfect | 0.900 |
| Pitch / time stretch | 10-model benchmark | near-perfect | "severe deterioration" even for foundation models |
| Neural codecs | 10-model benchmark | near-perfect | "the most substantial challenge" |

The RADAR Challenge 2026 built its full evaluation pipeline around real codecs, packet loss, echo, reverb, bandwidth limiting, 8 kHz resampling, and speech perturbation. The signal is clear: targeted augmentation can largely recover the loss, but plain detectors trained without it will fail under realistic delivery channels.
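As a rough illustration of what a telephony-style augmentation can look like, the sketch below band-limits a 16 kHz signal, drops it to 8 kHz, and applies a mu-law round trip. It is a deliberately crude stand-in, not the RADAR or ADD-C pipeline: the moving-average filter is a weak anti-alias filter, the companding uses the continuous mu-law formula rather than the exact G.711 segment encoding, and there is no packet loss. All function names are hypothetical.

```python
import math
from typing import List

MU = 255.0  # mu-law parameter used in North American / Japanese telephony


def lowpass_moving_average(samples: List[float], width: int = 3) -> List[float]:
    """Crude low-pass: centred moving average (assumption: good enough for a proxy)."""
    half = width // 2
    out = []
    for i in range(len(samples)):
        window = samples[max(0, i - half): min(len(samples), i + half + 1)]
        out.append(sum(window) / len(window))
    return out


def decimate_by_two(samples: List[float]) -> List[float]:
    """16 kHz -> 8 kHz by keeping every second sample."""
    return samples[::2]


def mu_law_roundtrip(x: float, bits: int = 8) -> float:
    """Compress, quantize to 2**bits levels, and expand one sample in [-1, 1]."""
    x = max(-1.0, min(1.0, x))
    compressed = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    levels = 2 ** bits - 1
    quantized = round((compressed + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0
    return math.copysign(math.expm1(abs(quantized) * math.log1p(MU)) / MU, quantized)


def telephony_proxy(samples_16k: List[float]) -> List[float]:
    """Band-limit, downsample to 8 kHz, then apply mu-law quantization noise."""
    narrowband = decimate_by_two(lowpass_moving_average(samples_16k))
    return [mu_law_roundtrip(s) for s in narrowband]


# Usage: a 10 ms, 440 Hz tone sampled at 16 kHz becomes 80 samples at 8 kHz.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
print(len(telephony_proxy(tone)))
```

Feeding both the original and the proxied clip to the detector during training or evaluation is the basic shape of targeted augmentation; real pipelines would swap in actual codecs and channel simulation.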

Video: the hardest evidence isn't FF++ anymore

The hardest public evidence is no longer "FF++ C23 vs C40." It's in-the-wild drift, screen recapture with Moiré, and cross-platform delivery conditions.

| Attack | Detector | Result |
|---|---|---:|
| In-the-wild 2024 distribution shift | Best off-the-shelf open-source video model on Deepfake-Eval-2024 | Max AUC 0.58 (vs near-1.0 in original papers) |
| In-the-wild adaptation | GenConViT fine-tuned on Deepfake-Eval-2024 | AUC 0.82 |
| Commercial best on Deepfake-Eval-2024 | Best commercial video model in benchmark | AUC 0.79 |
| Camera recapture / screen recording with Moiré | FTCN | 90.2 → 65.9 (LG) / 65.3 (BenQ) / 70.6 (Lenovo) / 68.9 (Samsung) |
| Same | LipForensics | 90.6 → 80.3 / 80.8 / 84.4 / 79.8 |
| Same | AltFreezing | 92.5 → 80.4 / 81.3 / 83.7 / 82.9 |
| Capture-device variation | Family average | -9.5 AUC iPhone 13 / -12.0 Samsung S22+ / max -25.4 |

Deepfake-Eval-2024 halves open-source AUCs vs. older academic benchmarks. DMF then shows that recording a video off a screen, a very practical user action, is enough to degrade strong detectors heavily and in hardware-dependent ways. That is a serious operational lesson for trust-and-safety, KYC, and newsroom workflows where recaptured clips are common.

Text: paraphrase + Unicode obfuscation

Text is the least standardized modality in the public 2025–2026 robustness literature, but the direction is clear:

  • Paraphrase / humanizer rewrites are still the dominant attack
  • Homoglyphs, character swaps, zero-width insertion are practical evasion routes now treated as first-class
  • The Query-of-Deviations (QOD) defense retains strong original performance while consistently improving robustness across paraphrase, homograph, random char swap, and zero-width insertion

Normalization is not optional. Neither is detector abstention when the verdict changes after normalization passes.
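A minimal sketch of that normalize-then-compare step, using only Python's standard `unicodedata` module. The zero-width list is not exhaustive, the mixed-script check is a simple heuristic based on Unicode character names, and `verdict_with_abstention` plus the toy detector are hypothetical. Note that NFKC folds compatibility forms (fullwidth letters, ligatures) but does not map Cyrillic or Greek lookalikes to Latin, which is why the mixed-script flag is separate.

```python
import unicodedata
from typing import Callable, List

# Common invisible characters; an assumption, not a complete list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def normalize_text(text: str) -> str:
    """Strip zero-width characters, then apply NFKC normalization."""
    stripped = "".join(ch for ch in text if ch not in ZERO_WIDTH)
    return unicodedata.normalize("NFKC", stripped)


def obfuscation_flags(text: str) -> List[str]:
    """Cheap side-channel flags for Unicode-level evasion."""
    flags = []
    if any(ch in ZERO_WIDTH for ch in text):
        flags.append("zero_width")
    if unicodedata.normalize("NFKC", text) != text:
        flags.append("not_nfkc")
    # The first word of a character name is usually its script (LATIN, CYRILLIC...).
    scripts = {unicodedata.name(ch, "").split(" ")[0] for ch in text if ch.isalpha()}
    if len(scripts & {"LATIN", "CYRILLIC", "GREEK"}) > 1:
        flags.append("mixed_script")
    return flags


def verdict_with_abstention(
    detector: Callable[[str], float], text: str, threshold: float = 0.5
) -> str:
    """Abstain when the verdict differs between raw and normalized text."""
    raw_positive = detector(text) >= threshold
    normalized_positive = detector(normalize_text(text)) >= threshold
    if raw_positive != normalized_positive:
        return "abstain"
    return "ai" if raw_positive else "human"


# Toy detector, invented for illustration: fires on one telltale word.
def toy_detector(text: str) -> float:
    return 0.9 if "delve" in text else 0.1


print(verdict_with_abstention(toy_detector, "Let us del\u200bve into it"))
```

Here the zero-width space hides the word from the raw pass but not from the normalized pass, so the sketch abstains instead of reporting either verdict.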

What still wins, and why

The attacks that win in 2026 are the ones that change the detector's preferred evidence without changing the user-visible meaning very much. That's why paraphrase works on text, telephony and time-scale edits work on audio, and recapture plus diffusion-based paraphrase work on image and video. They target the detector's feature dependency, not the content's semantics.

Ranked roughly from most to least damaging: for images, derivative creation, then true adversarial examples, then diffusion paraphrase, then pure mild corruption. For audio, channel laundering, then pitch/time perturbation, then additive noise. For video, delivery laundering and recapture matter more than pure progress in synthetic generation. For text, paraphrase plus Unicode obfuscation beats raw cleverness.

Defenses that survive

The defense that survives best is not a single detector. It's a pipeline combining provenance, modality-specific laundering checks, variant-based robustness testing, and calibrated abstention.

The practical defensive stack:

| Move | Why it survives | Confidence |
|---|---|---|
| Provenance-first, then content detection | OpenAI and Google both push the dual C2PA + SynthID model where rich metadata and durable watermarking reinforce each other | High |
| Laundering-stress evaluation on normalized variants | Robust models are those whose verdict survives JPEG / resize / channel effects / recapture | High |
| Side-channel laundering detectors | Recapture, replay, telephony, channel effects leave traces persisting even when "AI-ness" traces fade. Sightengine positions recapture and GenAI detection as complementary; Microsoft's Scoop treats recapture as a core attack on provenance | High |
| Test-time robust wrappers / targeted augmentation | TRIM (images), F-SAT (audio), telephony-aware augmentation (ADD-C) all show robustness can be materially improved without giving up the original detector family | Medium |
| Abstain on fragile cases | Deepfake-Eval-2024 makes clear many detectors fail under real-world drift; showing instability is more honest than hiding it behind a single score | High |

The variant set we run

For every deep scan, we run the detector on a variant set, not a single pass.

Image: original, JPEG 85, JPEG 70, shortest-edge 1024 resize, 0.75 scale, mild denoise, crop-preserving face/object region pass if salient subject.

Audio: original PCM canonicalization, Opus-like proxy, MP3/AAC proxy, 8 kHz telephony proxy, loudness-normalized proxy, small time-scale or room-effect stress.

Video: original keyframes, 720p H.264 proxy, 8 fps frame-decimated proxy, central crop and face crop streams, recapture/Moiré analysis pass.

Text: raw, normalized Unicode, whitespace-normalized, quote-preserved segmentation, paraphrase-sensitivity pass.
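One way to keep that variant set explicit and versionable is a plain lookup table. The sketch below simply transcribes the lists above; the identifier strings, the `VARIANT_SETS` name, and the `variants_for` helper are our own labels for illustration, not an existing API.

```python
from typing import Dict, List

# Variant names are hypothetical identifiers for the transforms listed above.
VARIANT_SETS: Dict[str, List[str]] = {
    "image": [
        "original", "jpeg_q85", "jpeg_q70", "resize_shortest_edge_1024",
        "scale_0.75", "mild_denoise", "salient_subject_crop",
    ],
    "audio": [
        "original_pcm", "opus_like_proxy", "mp3_aac_proxy",
        "telephony_8khz_proxy", "loudness_normalized", "time_scale_or_room_effect",
    ],
    "video": [
        "original_keyframes", "h264_720p_proxy", "frame_decimated_8fps",
        "central_crop", "face_crop", "recapture_moire_pass",
    ],
    "text": [
        "raw", "unicode_normalized", "whitespace_normalized",
        "quote_preserved_segmentation", "paraphrase_sensitivity",
    ],
}


def variants_for(modality: str, has_salient_subject: bool = True) -> List[str]:
    """Return the variant names to run for one upload.

    The image crop pass only applies when a salient subject was found.
    """
    if modality not in VARIANT_SETS:
        raise ValueError(f"unknown modality: {modality!r}")
    names = list(VARIANT_SETS[modality])
    if modality == "image" and not has_salient_subject:
        names.remove("salient_subject_crop")
    return names


print(variants_for("image", has_salient_subject=False))
```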

The Robustness Under Laundering panel

Every deep-scan result page gets a panel called Robustness under laundering. It answers: "does this verdict survive the ways content actually gets laundered in the wild?"

| Field | What we show | Why it matters |
|---|---|---|
| Stability badge | Stable · Fragile · Fails under laundering | Fast read before details |
| Baseline score | Raw detector score on original input | Keeps the old mental model |
| Median mild-variant score | Median across normal sharing/editing transforms | Prevents over-weighting a lucky pass |
| Worst-case mild-variant score | Lowest score across mild transforms | Surfaces fragility immediately |
| Flip rate | Fraction of mild variants that cross threshold | Measures verdict brittleness |
| Detector agreement | Agreement among model families | A single detector can be wrong for the wrong reason |
| Provenance status | C2PA found/not found, signature valid, watermark found | Highest-value positive evidence when available |
| Side-channel laundering flags | Recapture, replay, telephony, Moiré, screenshot-like, Unicode obfuscation | Explains why the base detector may be unreliable |
| Variant gallery | Each transform, preview/diff, score delta, evidence-view changes | The "shows its work" moment |
| Decision note | One paragraph in plain English | Converts technical evidence into operator action |
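The numeric fields in that panel reduce to a few aggregates over per-variant scores. The sketch below shows one plausible way to compute them. It assumes scores in [0, 1], a single decision threshold, and that "cross threshold" means landing on the other side of the threshold from the baseline verdict; the function name and example scores are invented.

```python
from statistics import median
from typing import Dict


def panel_aggregates(
    baseline: float, mild_scores: Dict[str, float], threshold: float = 0.5
) -> Dict[str, object]:
    """Compute the numeric panel fields from per-variant detector scores."""
    if not mild_scores:
        raise ValueError("need at least one mild variant score")
    values = list(mild_scores.values())
    baseline_positive = baseline >= threshold
    # A flip is a variant whose verdict differs from the baseline verdict.
    flips = sum(1 for s in values if (s >= threshold) != baseline_positive)
    return {
        "baseline_score": baseline,
        "median_mild_score": median(values),
        "worst_case_mild_score": min(values),  # lowest score, as in the table
        "flip_rate": flips / len(values),
        "score_deltas": {name: s - baseline for name, s in mild_scores.items()},
    }


# Invented example scores for one image.
example = panel_aggregates(
    baseline=0.82,
    mild_scores={"jpeg_q85": 0.80, "jpeg_q70": 0.61, "scale_0.75": 0.44, "mild_denoise": 0.78},
)
print(example["median_mild_score"], example["worst_case_mild_score"], example["flip_rate"])
```

In this example one of four variants drops below the threshold, so the flip rate is 0.25 even though the median stays on the positive side.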

The scoring rule is simple:

stability_score =
  0.40 * median_mild_score
+ 0.20 * worst_case_mild_score
+ 0.15 * detector_agreement
+ 0.15 * provenance_bonus
+ 0.10 * side_channel_consistency
- flip_rate_penalty
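A runnable version of that rule might look like the sketch below. The five weights are the ones stated above. Everything else is an assumption: inputs are taken to be in [0, 1], `flip_rate_penalty` is modelled as a linear `flip_penalty_weight * flip_rate` (the rule does not define it), and the result is clamped to [0, 1].

```python
def stability_score(
    median_mild_score: float,
    worst_case_mild_score: float,
    detector_agreement: float,
    provenance_bonus: float,
    side_channel_consistency: float,
    flip_rate: float,
    flip_penalty_weight: float = 0.25,  # assumption: not specified in the rule
) -> float:
    """Weighted stability score using the stated weights.

    All inputs are assumed to lie in [0, 1].
    """
    inputs = {
        "median_mild_score": median_mild_score,
        "worst_case_mild_score": worst_case_mild_score,
        "detector_agreement": detector_agreement,
        "provenance_bonus": provenance_bonus,
        "side_channel_consistency": side_channel_consistency,
        "flip_rate": flip_rate,
    }
    for name, value in inputs.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    raw = (
        0.40 * median_mild_score
        + 0.20 * worst_case_mild_score
        + 0.15 * detector_agreement
        + 0.15 * provenance_bonus
        + 0.10 * side_channel_consistency
        - flip_penalty_weight * flip_rate  # assumed linear penalty
    )
    return max(0.0, min(1.0, raw))  # assumed clamp


# Invented example: decent median, weak worst case, no provenance.
print(stability_score(0.695, 0.44, 0.8, 0.0, 0.5, 0.25))
```

The positive weights sum to 1.0, so with no flips the score stays on the same 0 to 1 scale as its inputs.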

Mapped to copy:

  • Stable — verdict stays on the same side of threshold across nearly all mild variants, provenance or side channels support it, low disagreement
  • Fragile — verdict changes under one or more mild variants, or detector disagreement is high
  • Fails under laundering — original-only positive or negative, but median or worst-case mild-variant result flips
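The three labels can be sketched as a small rule function. The bullets overlap (a worst-case flip is also "one or more variants changed"), so this sketch makes choices that are ours, not settled policy: "Fails under laundering" is reserved for a median flip, any other flip is "Fragile", and "Stable" requires zero flips. The `threshold` and `min_agreement` values are placeholders.

```python
from statistics import median
from typing import Sequence


def stability_badge(
    baseline: float,
    mild_scores: Sequence[float],
    detector_agreement: float,
    has_supporting_evidence: bool,  # provenance or side channels back the verdict
    threshold: float = 0.5,         # assumed decision threshold
    min_agreement: float = 0.8,     # assumed cut-off for "low disagreement"
) -> str:
    """Map variant results to Stable / Fragile / Fails under laundering."""
    if not mild_scores:
        raise ValueError("need at least one mild variant score")
    baseline_positive = baseline >= threshold
    any_flip = any((s >= threshold) != baseline_positive for s in mild_scores)
    median_flips = (median(mild_scores) >= threshold) != baseline_positive

    if median_flips:
        return "Fails under laundering"
    if any_flip or detector_agreement < min_agreement or not has_supporting_evidence:
        return "Fragile"
    return "Stable"


print(stability_badge(0.9, [0.85, 0.45, 0.91], 0.9, True))
```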

Standards context: C2PA + SynthID, in concert

OpenAI's May 2026 update explicitly framed C2PA content credentials and Google's SynthID as complementary: watermarking can survive transformations like screenshots more often, while metadata carries richer context when it survives. Google announced a parallel unification across Gemini, Search, and Chrome-facing experiences.

This is directionally very good for products like ours, but treat it as supporting evidence, not a complete answer. Metadata can be stripped, platform labels are inconsistently surfaced, and widely distributed media often loses provenance visibility long before reaching an investigator.

Recapture is now important enough that vendors and researchers treat it as its own attack class. Sightengine's 2026 recapture docs call it one of the most common methods to bypass originality checks and strip provenance signals. Microsoft's 2025 Scoop work frames recapture as a major attack on provenance-based authentication and proposes depth-based mitigation.

Market implication: if you only do model-output detection, you'll lose the provenance war. If you only do provenance, you'll lose the laundering war. The winning bundle is both.

What we ship next

1. The laundering-stress harness, exposed in-product. A modality-agnostic "variant runner" in Cloudflare Queues + Workers. Each upload fans out into a fixed set of mild variants and a smaller set of aggressive probes. Transformed artifacts go to R2, scores to D1, and a compact verdict to KV for fast rendering. The UI shows original score, median, worst-case, flip rate, and per-transform deltas. For text: normalized diffs. For image/video: thumbnails plus ELA / FFT / channel decomposition / noise residual per variant. This upgrades positioning from "we ran a model" to "we tested whether the verdict survives laundering."

2. Provenance + side-channel laundering as first-class citizens. Before the main detector runs: C2PA verification, watermark verification where available, modality-specific laundering checks. For image/video: recapture / Moiré / screen-photo suspicion. For audio: replay suspicion, channel family inference, bandwidth-limit detection, telephony-pipeline heuristics. For text: Unicode normalization, invisible-character flagging. In the deep-scan page and API response, separate "synthetic-content evidence" from "laundering / delivery evidence" from "provenance evidence."

3. A robustness-aware verdict policy with calibrated abstention. Replace one global threshold with a policy engine. A "strong AI-generated" result should require score stability over mild variants, acceptable detector agreement, and no evidence the observed delivery path plausibly erased the model's preferred features. "Fragile" should be the default whenever mild transforms cause flips, provenance is absent, or side-channel laundering flags are present. "Inconclusive after laundering" should be available and should not be treated as a product failure. API returns verdict, confidence, stability, flip_rate, variant_scores, provenance, laundering_flags.
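To show how the response fields and the policy could fit together, here is a hedged sketch. The field names come from the list above. The rest is assumed: the verdict strings, the `no_ai_evidence` label for stable negatives, the thresholds, using the median variant score as `confidence`, and returning "inconclusive_after_laundering" when the median variant flips sides.

```python
from dataclasses import asdict, dataclass
from statistics import median
from typing import Dict, List


@dataclass
class DeepScanResult:
    verdict: str
    confidence: float
    stability: str
    flip_rate: float
    variant_scores: Dict[str, float]
    provenance: Dict[str, bool]
    laundering_flags: List[str]


def apply_policy(
    baseline: float,
    variant_scores: Dict[str, float],
    detector_agreement: float,
    provenance: Dict[str, bool],
    laundering_flags: List[str],
    threshold: float = 0.5,      # assumed
    min_agreement: float = 0.8,  # assumed
) -> DeepScanResult:
    """Policy sketch: stable verdicts only when nothing undermines them."""
    scores = list(variant_scores.values())
    if not scores:
        raise ValueError("need at least one variant score")
    baseline_positive = baseline >= threshold
    flips = sum(1 for s in scores if (s >= threshold) != baseline_positive)
    mid = median(scores)
    provenance_present = bool(
        (provenance.get("c2pa_found") and provenance.get("signature_valid"))
        or provenance.get("watermark_found")
    )
    if (mid >= threshold) != baseline_positive:
        verdict, stability = "inconclusive_after_laundering", "fails_under_laundering"
    elif flips or not provenance_present or laundering_flags or detector_agreement < min_agreement:
        verdict, stability = "fragile", "fragile"
    else:
        verdict = "strong_ai_generated" if baseline_positive else "no_ai_evidence"
        stability = "stable"
    return DeepScanResult(
        verdict=verdict,
        confidence=mid,  # placeholder choice, not a calibrated probability
        stability=stability,
        flip_rate=flips / len(scores),
        variant_scores=dict(variant_scores),
        provenance=dict(provenance),
        laundering_flags=list(laundering_flags),
    )


result = apply_policy(
    0.9, {"jpeg_q85": 0.88, "jpeg_q70": 0.91}, 0.9,
    {"c2pa_found": True, "signature_valid": True, "watermark_found": False}, [],
)
print(asdict(result))
```

`asdict(result)` yields a plain dictionary that can be serialized as the API response body.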

Open questions

The biggest unresolved issue is public standardization. Image, audio, and video have much better robustness evidence than text, but even there the literature still mixes AUC, EER, accuracy, and attack-success rates — making clean cross-paper ranking hard.

For text, the evidence supports paraphrase and character-level obfuscation as major attack families, but there's no public 2025–2026 benchmark with a comparable residual-AUC matrix.

For video, public reporting on x264 vs x265 vs AV1 vs platform-specific TikTok/Reels/YouTube pipelines is still thinner than it should be. The strongest public evidence is on in-the-wild drift and recapture with Moiré, not codec-isolated ROC tables.

For provenance, adoption is improving but ecosystem retention is uneven. C2PA + durable watermarking looks like the right direction, yet public evidence still suggests metadata visibility and retention across real distribution pipelines remain inconsistent.

The strategic open question that matters most: how fast attackers will optimize against published robustness panels. Assume they will — which means your internal transform set, thresholds, and aggregation logic should be versioned, partly hidden, and continually refreshed with in-the-wild samples.

That's the /check we're building.