Benchmarks & honest limits

How AI detection is actually evaluated

Rigorous AI-content detectors are measured against public research datasets that deliberately try to break them. Below are the benchmarks the field leans on, what each one stresses — and a straight answer about where this tool stands relative to them.

Where Could This Be True? stands

We run fast, transparent forensic heuristics in your browser — not a machine-learning model trained on, or scored against, the benchmarks on this page. So we do not publish an accuracy percentage: we have not evaluated against a labelled benchmark set, and a number we hadn't measured would just be marketing.

What we give you instead is the evidence itself — error-level analysis, noise residuals, spectral and statistical signals, C2PA provenance — each shown with what it measures and where it fails. Treat every verdict as evidence, not proof.

How each signal works: read the methodology.

Public benchmarks, by modality

These are widely used public research benchmarks. We summarize what each one stresses rather than quote scores: reported numbers vary by method and split, and belong to the teams that publish them.

Image benchmarks mix camera photos with outputs from many generators, then pile on compression, resizing and screenshots — the same laundering a picture survives on its way through social media.

  • GenImage (2023)

    Large multi-generator set (diffusion + GAN) with several compression levels — tests whether a detector holds up beyond one model family.

  • Synthbuster (2023)

    Generalization to unseen generators — the hard case where a detector trained on yesterday's models meets today's.

  • WildFake (2024)

    In-the-wild social images with heavy edits, filters and screenshot recompression — close to real journalist-intake conditions.

Video benchmarks span classic face-swap deepfakes through to recent diffusion video, usually at several compression levels because re-encoding destroys the subtle traces detectors rely on.

  • FaceForensics++ (2019)

    The long-standing face-manipulation baseline: multiple forgery methods across raw and compressed video.

  • DFDC (Deepfake Detection Challenge, 2020)

    A large, varied deepfake corpus from a public challenge, useful mainly for historical comparability across detectors.

  • DF40 (2024)

    Forty distinct manipulation pipelines including diffusion video and reenactment — a broad, modern attack surface.

Speech benchmarks grew out of anti-spoofing for voice authentication; newer sets add neural text-to-speech, telephony channels and AI-generated music.

  • ASVspoof series (2015–2024)

    The reference anti-spoofing challenge for synthetic and replayed speech, including telephony-channel conditions.

  • MLAAD (2024)

    Multi-language synthetic-speech detection; tests generalization across languages and TTS systems.

Text benchmarks pit human writing against many LLMs across domains, then attack the detector with paraphrasing and light editing — the cheapest, most common evasion.

  • RAID (2024)

    Multi-domain human-vs-LLM detection with explicit paraphrase and adversarial editing attacks.

  • M4 (2024)

    Multi-generator, multi-domain, multilingual machine-generated-text detection; tests cross-model generalization.

What a credible benchmark number requires

Robustness to laundering

A score that collapses after a screenshot or a re-encode is not useful. Serious benchmarks re-test under compression, resizing and recapture.
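
To make the idea concrete, here is a minimal sketch of how a robustness gap could be reported. It is not how any benchmark on this page computes its scores. The record layout and the function names `accuracy_by_condition` and `robustness_drop` are hypothetical, and the condition labels ("clean", "jpeg_q50") are illustrative only.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Return accuracy per laundering condition.

    `records` is an iterable of (condition, true_label, predicted_label)
    tuples. This layout is an assumption made for the example.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for condition, truth, predicted in records:
        totals[condition] += 1
        hits[condition] += int(truth == predicted)
    return {condition: hits[condition] / totals[condition] for condition in totals}


def robustness_drop(per_condition, baseline="clean"):
    """Return how much accuracy each condition loses relative to the baseline."""
    base = per_condition[baseline]
    return {
        condition: base - accuracy
        for condition, accuracy in per_condition.items()
        if condition != baseline
    }


# Toy data: the same four items scored untouched and after a JPEG re-encode.
records = [
    ("clean", "ai", "ai"),
    ("clean", "real", "real"),
    ("jpeg_q50", "ai", "real"),   # the re-encode hid the synthetic trace
    ("jpeg_q50", "real", "real"),
]
per_condition = accuracy_by_condition(records)
print(per_condition)
print(robustness_drop(per_condition))
```

Reporting the per-condition drop alongside the clean score shows at a glance whether a detector survives laundering or only works on pristine files.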

Unseen generators

New models ship constantly. The number that matters is performance on generators the detector never trained on — not the in-distribution best case.
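
One common way to approximate the unseen-generator case is a leave-one-generator-out split: every generator takes a turn being held out of training entirely. The sketch below is illustrative only. The `(sample_id, generator)` tuple layout and the function name `leave_one_generator_out` are assumptions for the example, and it covers synthetic samples only; real photos would need their own disjoint train/test split.

```python
def leave_one_generator_out(samples):
    """Yield (held_out_generator, train_ids, test_ids) for each generator.

    `samples` is a list of (sample_id, generator_name) tuples describing
    synthetic items. This layout is an assumption made for the example.
    """
    generators = sorted({generator for _, generator in samples})
    for held_out in generators:
        # Train on every generator except the held-out one...
        train_ids = [sid for sid, generator in samples if generator != held_out]
        # ...and test only on the generator the detector has never seen.
        test_ids = [sid for sid, generator in samples if generator == held_out]
        yield held_out, train_ids, test_ids


# Toy data with three hypothetical generator names.
samples = [
    ("img-1", "diffusion-a"),
    ("img-2", "diffusion-a"),
    ("img-3", "gan-b"),
    ("img-4", "diffusion-c"),
]
for held_out, train_ids, test_ids in leave_one_generator_out(samples):
    print(held_out, train_ids, test_ids)
```

The score on each held-out fold, rather than the pooled in-distribution score, is the closer proxy for how a detector behaves when a new model ships.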

Hidden test sets

Public splits get overfit. Credible leaderboards keep test labels hidden behind an evaluation server so reported numbers reflect generalization, not memorization.

Evaluation, not training, on the data

Scores only mean something when the eval set was kept strictly separate from training. Mixing the two inflates every metric.
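
A basic leakage check can be sketched as follows, assuming each split is available as a mapping from file name to raw bytes. The function names `content_digest` and `find_leaks` are hypothetical. Note the limitation: exact-byte hashing only catches identical files, so a re-encoded or resized copy would slip through, and catching those would need a perceptual or near-duplicate comparison that this sketch does not attempt.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def find_leaks(train_items, eval_items):
    """Return sorted names of eval items whose exact bytes also appear in training.

    Both arguments map a file name to its raw bytes. This layout is an
    assumption made for the example.
    """
    train_digests = {content_digest(data) for data in train_items.values()}
    return sorted(
        name
        for name, data in eval_items.items()
        if content_digest(data) in train_digests
    )


# Toy data: one eval file is a byte-for-byte copy of a training file.
train_items = {"train/a.png": b"pixels-1", "train/b.png": b"pixels-2"}
eval_items = {"eval/x.png": b"pixels-2", "eval/y.png": b"pixels-3"}
print(find_leaks(train_items, eval_items))
```

An empty result here does not prove the splits are clean; it only rules out the most obvious form of overlap.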

Benchmark entries are short summaries of public research datasets, named for orientation only; we are not affiliated with them and we report no scores against them. Dataset scope and naming evolve — consult each project's own publication for authoritative details.