
The Benchmarks an AI-Detection SaaS Should Publish in 2026 — Monthly

A practical, operator-grade shortlist of the AI-detection benchmarks worth publishing against in 2026: NTIRE Robust In-the-Wild, Synthbuster, RobustSora, ASVspoof 5, ADD-C, RAID, MAGE. Includes Cloudflare Workers AI vs Modal vs Replicate compute costs, license posture, and the 'no training on eval data' methods card every honest dashboard needs.

If you sell AI-detection in 2026, "99% accurate" is a lawsuit waiting to happen. The 2025 FTC enforcement against Workado made that explicit. The defensible play is the opposite: commit to a small, honest, monthly public dashboard against benchmarks that stress what actually matters — post-processing, unseen generators, telephony fraud, paraphrase laundering.

This is the shortlist we'd publish against, why each one, what they cost to run, and the methods card every credible scorecard needs.

The shortlist (5–8 total)

  1. NTIRE 2026 Robust In-the-Wild AI-Generated Image Detection — CVPR workshop hidden test, train ~500k, community reports include Flux / SD3.5 / Imagen3 / MJ7 / GPT-Image-1. Top entries ~92% balanced accuracy in-distribution; ~80% on robust track. Credible third-party yardstick.
  2. Synthbuster (2024–2026 rolling) — In-the-wild synthetic images scraped from social platforms + known generator corpora, emphasizes unknown-generator generalization. ~400k–800k images, hidden test server. SOTA 80–90% AUROC on hidden test; large drop for unseen generators.
  3. RobustSora (2026) — Text-to-video (Sora-class) plus Veo-class, heavy recompression and screen capture. 15k–30k clips. Early reports 65–80% AUC — few detectors generalize here yet. Forward-looking.
  4. ASVspoof 5 (Logical Access + telephony track) — 100k+ utterances, includes neural codec and diffusion TTS. EER 1–5% on LA, 5–12% on telephony. Gold standard for spoofed speech.
  5. ADD-C (telephony deepfake, 2025) — GSM/VoIP artifacts, ~50k–100k clips. EER 6–15% depending on channel. Directly relevant to KYC / insurance fraud.
  6. RAID (Robust AI Detection, 2025 refresh) — Multi-domain human vs LLM (GPT-4/5, Claude, Gemini 3, Llama 4), paraphrase and editing attacks, 200k–500k docs. SOTA 80–92% AUROC in-distribution; 60–75% under paraphrase attacks. Strong attack coverage.
  7. MAGE (2026) — Targets GPT-5 / Claude / Gemini 3 / Llama 4, includes editing chains and human-AI collaboration. Hidden eval server. Early leaderboard 70–85% AUROC. Current-gen coverage.
  8. (Optional 8th) DF40 (2025) — 40 manipulation pipelines including diffusion-video and reenactment. SOTA 75–88% AUC; strong domain shift. Broader video attack surface.

Rationale. Each has either a hidden test server or a strong robustness component. Together they cover social recompression, unseen generators, telephony fraud, and paraphrase attacks. Confidence in the shortlist: 0.7. The dissenting view says to prioritize only hidden-test leaderboards (NTIRE, ASVspoof, MAGE) to avoid overfitting, at the cost of slower iteration and less transparency.

Compute cost — three platforms

The math for monthly runs assumes ~800k images, ~30k video clips (1 fps × 32 frames), ~150k audio utterances (3–10 s each), and ~700k text docs.
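To make those volume assumptions concrete, here is a minimal sketch that converts them into inference units. The figures come straight from the paragraph above; the constant and function names are hypothetical, and no prices or throughput numbers are assumed.

```python
# Monthly evaluation volumes as stated in the text above.
IMAGES = 800_000
VIDEO_CLIPS = 30_000
FRAMES_PER_CLIP = 32            # sampled at 1 fps
AUDIO_UTTERANCES = 150_000
AUDIO_SECONDS_RANGE = (3, 10)   # per utterance
TEXT_DOCS = 700_000


def monthly_workload() -> dict:
    """Return inference units per modality (hypothetical helper)."""
    low_s, high_s = AUDIO_SECONDS_RANGE
    return {
        "image_inferences": IMAGES,
        "video_frame_inferences": VIDEO_CLIPS * FRAMES_PER_CLIP,
        "audio_hours_low": AUDIO_UTTERANCES * low_s / 3600,
        "audio_hours_high": AUDIO_UTTERANCES * high_s / 3600,
        "text_docs": TEXT_DOCS,
    }
```

Under these assumptions video contributes 960,000 frame inferences, slightly more than the image set, and audio totals roughly 125 to 417 hours.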

| Platform | Pros | Cons | Monthly cost |
|---|---|---|---:|
| Cloudflare Workers AI | Data locality with R2, zero egress, good for smaller models | Limited GPU SKUs; long video batches may queue | $2k–$6k |
| Modal | A10 / A100 / H100 on demand, good for bursty monthly jobs, reproducible with pinned images | More expensive than Workers AI for small models | $3k–$8k (~8–20k GPU-hours aggregate) |
| Replicate | Simple packaging, per-second billing | Less cluster locality control, pricier at scale | $4k–$10k depending on video throughput |

Confidence: 0.6. If you aggressively sub-sample video (keyframe plus scene-cut detection) and cache embeddings, a Workers AI run can come in under $3k per month. If you run diffusion-reconstruction detectors (DRCT-style) on every image, costs rise substantially.
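A rough sketch of the two cost levers mentioned here, keyframe sub-sampling and embedding caching, might look like the following. Everything in it is an illustrative assumption: frames are toy lists of grayscale values, the scene-cut rule is a simple mean-absolute-difference threshold, and `EmbeddingCache` and `select_keyframes` are hypothetical names, not part of any platform's API.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Frame = Sequence[int]  # toy stand-in: flattened grayscale pixels, 0-255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel absolute difference between two equal-size frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames: List[Frame], cut_threshold: float = 30.0) -> List[int]:
    """Keep frame 0 plus every frame that differs from the last kept frame
    by more than cut_threshold (a crude scene-cut heuristic)."""
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > cut_threshold:
            kept.append(i)
    return kept


class EmbeddingCache:
    """Memoize embeddings by content hash so a repeated payload
    triggers no second model call."""

    def __init__(self, embed_fn: Callable[[bytes], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of real model calls made

    def get(self, payload: bytes) -> List[float]:
        key = hashlib.sha256(payload).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(payload)
        return self._store[key]
```

Only the frames returned by `select_keyframes` would be sent to the detector, and `misses` gives a direct count of billable inferences. A production pipeline would likely use a proper shot-boundary detector and a persistent store instead of an in-memory dict.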

Licensing — what you can publish

Reporting metrics is generally commercial-friendly for ASVspoof results, NTIRE challenge scores, and Synthbuster scores; RAID and M4 typically allow it as well. Data redistribution is much narrower: most image and video sets are research-only. Maintain a clean separation: evaluation only, no training on restricted sets, and confirm per-dataset TOS before any commercial training claim.
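One way to back the "no training on restricted sets" separation with something auditable is a hash-overlap check between the training and evaluation manifests. The sketch below is an assumption about how such a check could work, not a description of an existing tool; `sha256_hex` and `find_leaks` are hypothetical helpers.

```python
import hashlib
from typing import Iterable, Set


def sha256_hex(data: bytes) -> str:
    """Content hash of one sample's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_leaks(train_hashes: Iterable[str], eval_hashes: Iterable[str]) -> Set[str]:
    """Return hashes that appear in both the training and evaluation manifests.
    An empty set means no byte-identical overlap was found."""
    return set(train_hashes) & set(eval_hashes)
```

An exact hash only catches byte-identical copies. A recompressed or resized duplicate produces a different SHA-256, so near-duplicate detection (for example perceptual hashing) would be needed on top of this for image and video sets.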

Image benchmarks worth knowing

GenImage (2023–2025 updates). ~1.3–2M images, multi-generator (SD 1.x/2.x/XL, Midjourney v5/6; 2025 forks include SD3 / Flux). Balanced real vs. synthetic, multiple compression levels. SOTA 90–96% AUROC in-distribution; 70–85% under heavy JPEG / resize / crop. Reproducibility high.

WildFake (2025). ~250k images, real social vs. synthetic with heavy edits, filters, screenshots. NTIRE-affiliated. 75–88% AUROC; strong penalty for screenshot recompression. Closer to journalist intake conditions.

DRCT (Diffusion Reconstruction Contrastive Training, 2025). Uses diffusion-prior reconstruction residuals. SDXL / SD3 / Flux subsets, ~200k samples. 85–93% AUROC reported. GPU-heavy. Aligns nicely with "show the evidence" residual visualizations.

RRDataset / RRBench (Real-vs-Rendered). Photoreal renders vs camera captures, CGI and NeRF renders. 150k–300k samples. 90%+ in controlled; 75–85% in the wild. Tests confusion between CGI and camera — relevant for insurance and KYC.

Video benchmarks worth knowing

FaceForensics++ (FF++). 1,000+ real videos, 4 manipulation methods, multiple compressions. Near-ceiling AUC on raw (>99%); 85–95% compressed. Legacy baseline — easy win but saturated.

DFDC. ~100k clips, archival (Meta / Kaggle). 90%+ AUC on public; private test unknown. Historical comparability only.

SAFE Benchmark (2025). Social-media-style edits, captions, overlays; multi-modal cues. 10k–20k curated clips. 70–85% balanced accuracy. Industry-academia collaboration with hidden test server. Journalist-realistic intake conditions.

Deepfake-Eval-2024 / GenVideo (rolling). Mixed legacy + diffusion-video coverage, 10k–50k clips. Supplementary.

Audio benchmarks worth knowing

MLAAD (Music & Long-form Audio AI Detection, 2025). Music from Suno / Udio-class models plus long-form podcasts with TTS inserts. ~20k tracks / clips. 70–90% track-level AUC; frame-level harder. Covers SONICS-style music detection.

Speech DF Arena (2026 leaderboard). Hidden test across vendors and open TTS (including GPT-5-TTS-class). High reproducibility for submissions. Third-party credibility lever.

In-the-Wild / CtrSVDD / Echoes 2026. Real-world captures with adversarial noise — stress tests for robustness.

Text benchmarks worth knowing

M4 (Multi-generator, Multi-domain, Multi-lingual; 2025). Many generators and temperatures, cross-domain. ~1M short / long texts. 85–95% in-distribution; cross-model drop notable.

HELM-Detect / vendor red-team sets (2025–2026). Proprietary but some public slices. Supplementary, not primary.

What we ship

Three concrete moves, ranked by ROI vs. effort.

1. Monthly Robustness Dashboard, anchored to NTIRE 2026 + ASVspoof 5 + RAID. ROI: high. Effort: medium. Build a monthly CI pipeline — Modal for video / image, Workers AI for text / audio. Per-artifact evidence (ELA, FFT, residual heatmaps, prosody features) in R2 with signed URLs. Publish aggregate AUROC / EER with confidence intervals and month-over-month drift deltas. Add a short methods card per benchmark clarifying "no training on eval data." This is the public-credibility move that competitors can't fake.

2. Telephony Fraud Pack (ADD-C + ASVspoof telephony). ROI: high for insurance / KYC. Effort: medium. Optimize a Conformer-based spoof detector for 8 kHz; add channel-robust features (CQCC, LFCC, phase). Provide an API flag telephony_mode=true and publish EER on ADD-C monthly. Pair with explainability plots (spectral tilt, phase inconsistency).

3. Unseen-Generator Stress Test (Synthbuster + RobustSora). ROI: strategic differentiation. Effort: medium-high. Maintain a small internal holdout of Flux / SD3.5 / Imagen3 / MJ7 / GPT-Image-1 and Sora / Veo-class samples collected under license. Report the gap between public and unseen sets — that gap is the most honest measure of detector durability against next-month's generators.
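The metrics named in the three moves above (AUROC with confidence intervals, month-over-month drift, EER, and the public-versus-unseen gap) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a reference scorer: label 1 means synthetic or spoofed, a higher score means more likely synthetic, the interval is a percentile bootstrap, and all function names are hypothetical. Official benchmark scoring scripts may use different conventions.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a synthetic item (label 1) outscores a real one
    (label 0); ties count as half. O(n_pos * n_neg), fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels: Sequence[int], scores: Sequence[float], n_boot: int = 1000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (point, low, high): AUROC with a percentile bootstrap interval."""
    point = auroc(labels, scores)  # also validates that both classes exist
    rng = random.Random(seed)
    n = len(labels)
    stats: List[float] = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if 0 < sum(ys) < n:  # skip resamples that lost a class
            stats.append(auroc(ys, [scores[i] for i in idx]))
    stats.sort()
    low = stats[int(alpha / 2 * n_boot)]
    high = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return point, low, high


def equal_error_rate(labels: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Return (eer, threshold). Tries every observed score as a threshold and
    picks the one where miss rate and false-alarm rate are closest."""
    spoof = [s for y, s in zip(labels, scores) if y == 1]
    bona = [s for y, s in zip(labels, scores) if y == 0]
    if not spoof or not bona:
        raise ValueError("need at least one item of each class")
    best = None
    for t in sorted(set(scores)):
        miss = sum(s < t for s in spoof) / len(spoof)        # spoof not flagged
        false_alarm = sum(s >= t for s in bona) / len(bona)  # bona fide flagged
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2, t)
    return best[1], best[2]


def delta(current: float, reference: float) -> float:
    """Signed change: month-over-month drift, or unseen minus public."""
    return current - reference
```

The same `delta` helper covers both the drift column (this month versus last month) and the generalization gap (unseen-generator holdout versus public set). For benchmark-scale data the quadratic loops would need replacing with a rank-based AUROC and a single sorted sweep for EER.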

Open questions

  • Will 2026–2027 generator watermarking (C2PA / Content Credentials) reduce the need for forensic detection in mainstream pipelines, shifting the value to tamper detection instead?
  • How quickly do Sora / Veo-class models erase the temporal artifacts that current video detectors rely on?
  • Are paraphrase-robust text detectors approaching an irreducible ceiling without provenance signals?
  • Licensing risk: several "in-the-wild" sets have ambiguous redistribution terms. Maintain legal review before marketing claims.
  • Compute variance: hidden test servers can bottleneck submissions; plan buffer time before publishing monthly numbers.

That's the /check we're building a monthly dashboard around.