research·May 23, 2026·12 min read

The State of AI Audio and Voice Clone Detection — May 2026

What actually works for detecting AI-generated speech and music in 2026: ASVspoof 5 winners, Speech DF Arena rankings, the AudioSeal and SynthID watermark stack, why telephony codecs break detectors, and why Suno songs need a different model from ElevenLabs voices.

Audio deepfake detection is mature enough to ship but not mature enough to trust as a single universal score. The strongest public evidence in May 2026 still says the same thing: models that look excellent on one benchmark can degrade sharply on new generators, new codecs, telephony paths, replay, and full-song music generation.

In other words, a practical audio-checking pipeline should treat detection as a layered forensic workflow, not a one-model verdict. The most robust pattern today: provenance and watermark checks first, then a speech-specialized detector, then a separate music and singing route.

The current threat surface is broader than classic TTS. ElevenLabs ships Eleven v3 as its latest expressive speech model. OpenAI's Voice Engine was described as a custom-voice model in a limited preview (and explicitly not widely available as of mid-2024). Sesame is publicly focused on highly natural conversational voice. Suno is a mainstream AI music generator. Google ships Lyria 3 with SynthID audio watermarking.

So your detector has to handle expressive speech, cloned voices, conversational turn-taking, and end-to-end music generation — not only older vocoder artifacts.

What the latest benchmarks actually say

The latest major public challenge is ASVspoof 5, announced in 2024 and still the most relevant ASVspoof benchmark family in May 2026. (There's no separate public ASVspoof 2025 grand challenge result set as of this writing — when people say "ASVspoof 2024/2025" in practice they usually mean the 2024 workshop systems and the 2026 evaluation paper.)

ASVspoof 5 separates two realities.

In the closed condition, the best systems were still mostly classic anti-spoofing ensembles over waveform and spectrogram inputs — AASIST, RawNet2, ResNet, ConvViT-style backends. The top closed Track 1 system (T32) used a waveform transformer. Other top closed systems included T24 (waveform + mel-spectrogram, ResNet + AASIST + ConvViT-Base fusion) and T45 (waveform frontend with RawNet2 + AASIST backend).

In the open condition, the winners shifted decisively toward self-supervised frontends. T45 used wav2vec2-large with GAT, MFA-Res2Net, and LSTM backends. T36 used WavLM-Base + MLP. T27 used WavLM-Base + MHFA + WAP with logistic-regression calibration.

That split matters because it's the clearest public benchmark evidence that SSL frontends are now the default serious option for modern speech deepfake detection.

The uncomfortable part is generalization. In the post-challenge cross-dataset package, four top ASVspoof 5 open-condition systems scored ~3.30–4.33% EER on an ASVspoof 5 subset but jumped above 10% EER on ASVspoof 2015, 2019 LA, 2021 LA, 2021 DF, and the In-the-Wild set. The paper explicitly concluded that generalization remains a major unsolved problem.

A second benchmark, Speech DF Arena from late 2025, evaluates many models across many datasets and reports both average and pooled EER. Its strongest proprietary entry, Whispeak, reported 3.05% average / 3.00% pooled EER. Among open-source families: XLSR+SLS at 13.84 / 15.68%, XLSR-Mamba at 14.21 / 20.12%, Wav2Vec2-AASIST at 18.02 / 19.47%. Classic raw-waveform baselines like AASIST, RawGAT-ST, and RawNet2 were substantially worse on average.

Two takeaways: public SOTA has clearly moved to SSL encoders, and pooled-threshold generalization is still weak enough that calibration and modality routing matter a lot.
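
The gap between average and pooled EER is easier to see with a small computation. The sketch below is illustrative only: the function names are hypothetical, the scores are made-up toy values (not benchmark data), and it assumes a higher score means "more likely spoof". Average EER picks the best threshold separately for each dataset, while pooled EER forces one threshold across all of them.

```python
from typing import List, Tuple

Dataset = Tuple[List[float], List[float]]  # (bona fide scores, spoof scores)


def eer(bonafide: List[float], spoof: List[float]) -> float:
    """Approximate equal error rate; assumes higher score = more likely spoof."""
    best_gap, best_eer = float("inf"), 1.0
    for t in sorted(set(bonafide) | set(spoof)):
        # False alarm: bona fide audio scored at or above the threshold.
        false_alarm = sum(s >= t for s in bonafide) / len(bonafide)
        # Miss: spoofed audio scored below the threshold.
        miss = sum(s < t for s in spoof) / len(spoof)
        gap = abs(false_alarm - miss)
        if gap < best_gap:
            best_gap, best_eer = gap, (false_alarm + miss) / 2
    return best_eer


def average_and_pooled_eer(datasets: List[Dataset]) -> Tuple[float, float]:
    """Average EER tunes a threshold per dataset; pooled EER shares one threshold."""
    per_set = [eer(b, s) for b, s in datasets]
    pooled_bona = [x for b, _ in datasets for x in b]
    pooled_spoof = [x for _, s in datasets for x in s]
    return sum(per_set) / len(per_set), eer(pooled_bona, pooled_spoof)


# Toy scores: each set is perfectly separable, but set B is shifted upward.
toy_a = ([0.1, 0.2, 0.3], [0.4, 0.5, 0.6])
toy_b = ([0.5, 0.6, 0.7], [0.8, 0.9, 1.0])
avg_eer, pooled_eer = average_and_pooled_eer([toy_a, toy_b])
```

On these toy values the average EER is 0% while the pooled EER is about 33%, because no single threshold suits both score distributions. That is the kind of miscalibration a pooled number exposes and an average hides.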

Model shortlist for a serious 2026 audio checker

Fake-Mamba (the technical favorite if licensing is cleared). Architecture: XLS-R frontend + PN-BiMamba backend. Cross-dataset results when trained on ASVspoof 2019 LA: 0.97% EER on ASVspoof 2021 LA, 1.74% on 2021 DF, 5.85% on In-the-Wild. Reported inference is comfortably faster than real time (~0.0279 RTF for 1-second audio, i.e. about 28 ms of compute per second of audio on the reported setup). The code is public, but the captured GitHub page didn't expose an explicit OSS license, so commercial SaaS use needs clarification.

T27-style WavLM-Base (challenge-proven generalizer). Architecture: WavLM-Base frontend + MHFA and WAP backends + LR calibration. ASVspoof 5 post-challenge cross-dataset: 3.30% on the ASVspoof 5 subset, 10.40 / 17.33 / 18.7 / 10.63 / 13.37% on the other five sets. No production-ready public codebase, but the recipe is reproducible.

T36-style WavLM-Base + MLP (simpler WavLM route). 3.37% on the ASVspoof 5 subset, 10.8 / 16.27 / 15.73 / 11.57 / 14.71% on the others. Good benchmark credibility, lower implementation novelty.

XLSR+SLS (strong pooled open-source baseline). 13.84 / 15.68% in Speech DF Arena — one of the better public open-source results.

Nes2NetX (efficiency play, blurry licensing). Foundation-model-driven anti-spoofing with a nested Res2Net backend. Speech DF Arena: 16.11 / 17.04% EER.

wav2vec2 + AASIST (best clean-license starter, not best detector). MIT-licensed repo. Speech DF Arena: 18.02 / 19.47% EER. Weaker than newer XLS-R and Mamba systems but legally clean and has a large public user base.

For prosody-aware detection (which matters specifically for Eleven v3 and Sesame-style conversational voices), ProSDD argues that standard benchmark-trained systems fail on expressive and emotional attacks. It reports reducing ASVspoof 2024 EER from 39.62% to 7.38% when trained on 2024 data, with large gains on emotional spoof datasets.

Watermarks and provenance: your deterministic lane

The biggest product design mistake is treating watermark detection as a competitor to forensic detection. It isn't. A watermark or provenance hit is a high-confidence positive with limited coverage. A forensic classifier is broader but much noisier.

AudioSeal is the most important open watermarking project for audio right now. Paper: arXiv:2401.17264. Code and weights are released under MIT, and the repo explicitly allows commercial use. It is a jointly trained generator-detector pair with sample-level localization, optional 16-bit attribution, and streaming support in AudioSeal 0.2. In the paper's runtime table, watermark detection took ~3.25 ms for unwatermarked samples vs 1710.70 ms for WavMark. The paper reports robustness to AAC, MP3, Encodec, resampling, speed change, noise, and echo.
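
Sample-level localization is what makes it possible to say which part of a clip carries a watermark. The sketch below does not call AudioSeal itself. It assumes you already have a list of per-sample watermark probabilities from some detector, and the function name and the 0.5 threshold are hypothetical placeholders.

```python
from typing import List, Tuple


def watermarked_segments(
    probs: List[float], sample_rate: int, threshold: float = 0.5
) -> List[Tuple[float, float]]:
    """Turn per-sample watermark probabilities into (start_s, end_s) segments."""
    segments: List[Tuple[float, float]] = []
    start = None  # index where the current watermarked run began, if any
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            segments.append((start / sample_rate, i / sample_rate))
            start = None
    if start is not None:  # run extends to the end of the clip
        segments.append((start / sample_rate, len(probs) / sample_rate))
    return segments


# Toy input at a 4 Hz "sample rate" to keep the list short.
toy_probs = [0.1] * 4 + [0.9] * 8 + [0.2] * 4
toy_segments = watermarked_segments(toy_probs, sample_rate=4)
```

For the toy input this yields a single segment from 1.0 s to 3.0 s. A real pipeline would probably also merge nearby runs and drop very short ones before showing them in a UI.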

SynthID Audio is strategically important: Google watermarks audio generated or published through Lyria and NotebookLM podcast generation, says the watermark is imperceptible, and designs it to survive noise, MP3 compression, and speed changes. Google also launched SynthID Detector, a verification portal and audio check flow in Gemini.

Resemble is building toward the same layered story commercially. Its docs expose a multimodal detection product, a beta watermark API for upload-and-detect, and a source-tracing API that returns the likely AI platform used.

Always run provenance and watermark checks before the general detector. A positive watermark or provenance finding should outrank a soft forensic score in both UX and policy logic. Sparse high-confidence evidence is still some of the best evidence you can show.
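
One way to encode that precedence is to decide the headline from deterministic evidence first and fall back to the forensic score only when nothing deterministic fired. This is a minimal sketch: the field names, labels, and the 0.5 cut-off are hypothetical and are not taken from any vendor API.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    watermark_hit: bool       # a watermark detector fired (hypothetical field)
    provenance_says_ai: bool  # signed metadata declares AI generation (hypothetical field)
    forensic_score: float     # classifier output in [0, 1], higher = more likely synthetic


def headline(e: Evidence, suspicious_at: float = 0.5) -> str:
    """Pick the headline verdict; deterministic evidence outranks the soft score."""
    if e.watermark_hit:
        return "synthetic: watermark detected"
    if e.provenance_says_ai:
        return "synthetic: declared by provenance metadata"
    if e.forensic_score >= suspicious_at:  # placeholder threshold, needs calibration
        return "suspicious: forensic score only"
    return "no evidence of synthesis found"
```

With this ordering, a clip with a watermark hit and a low forensic score still reports "synthetic: watermark detected", and a high forensic score with no deterministic evidence is reported only as "suspicious".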

What survives codec re-encoding (and what breaks in live calls)

Public evidence from the last 18 months points to a practical rule: don't overfit to ultra-high-frequency junk or narrow generator fingerprints if your files will pass through communications channels.

The most useful discriminative band in one of the few explicit interpretability studies (iWAX) was broadly the 128 Hz to 8 kHz region. The 0–128 Hz band was much less informative; 128 Hz–8 kHz actually slightly beat the full spectrum.
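
A quick sanity check for any clip is how much of its spectral energy actually sits in that 128 Hz to 8 kHz band after it has passed through a codec path. The sketch below is not the iWAX method. It is a plain naive DFT written with the standard library, only suitable for short frames, and the function name is hypothetical.

```python
import cmath
import math
from typing import List


def band_energy_fraction(
    samples: List[float], sample_rate: int, lo_hz: float = 128.0, hi_hz: float = 8000.0
) -> float:
    """Share of non-DC spectral energy between lo_hz and hi_hz (naive O(n^2) DFT)."""
    n = len(samples)
    in_band = total = 0.0
    for k in range(1, n // 2 + 1):  # skip DC, stop at Nyquist
        freq = k * sample_rate / n
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples)
        )
        power = abs(coeff) ** 2
        total += power
        if lo_hz <= freq <= hi_hz:
            in_band += power
    return in_band / total if total else 0.0


def tone(freq_hz: float, sample_rate: int = 16000, n: int = 400) -> List[float]:
    """Synthetic sine wave used only to exercise the function."""
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]
```

A 1 kHz test tone lands almost entirely inside the band, while a 40 Hz tone lands almost entirely outside it. In practice you would use an FFT library rather than this loop.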

The same work shows why telephony is painful. The ADD-C benchmark was built for realistic communication scenarios using AMR-WB, EVS, IVAS, Opus, Speex, SILK, plus packet loss. Baseline detectors degrade by an average 5.30 EER points moving from clean to communication conditions. AUC and F1 also drop.

If your uploaded audio originated in VoIP, wireless calling, or conferencing, assume degraded detector reliability unless the model was trained or adapted on those exact codec paths.

Benchmark winners reinforce this. In ASVspoof 5, top closed and open submissions used codec augmentation, mp3/ogg augmentation, RawBoost, low-pass filtering, reverberation, and noise. The challenge analysis stratified attacks by codec groups: no codec, DSP 16 kHz, DSP 8 kHz, Encodec, MP3. Re-encoding robustness comes less from a magic architecture and more from architecture + realistic augmentation + calibration.

For live call / Zoom scenarios: feasible, but only with caution. Fake-Mamba is real-time on the reported GPU setup; AudioSeal supports streaming; Resemble markets sub-second operation for conferencing. But a newer paper on "first greeting" detection frames 0.5–2.0 second decisions under communication degradation as an active research problem, not a solved deployment fact. Trust real-time triage; don't hard-block accounts on the first second of speech.
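
That caution can be built into the decision loop itself: accumulate chunk scores, allow only a provisional flag early on, and never emit a blocking action from the detector alone. A minimal sketch, where the function name, the status labels, the 4-second minimum, and the 0.7 threshold are all hypothetical placeholders rather than recommended values:

```python
from typing import Iterable, List, Tuple


def live_triage(
    chunk_scores: Iterable[float],
    chunk_seconds: float = 0.5,
    min_seconds: float = 4.0,  # placeholder: evidence required before a firm flag
    flag_at: float = 0.7,      # placeholder threshold, needs per-channel calibration
) -> List[Tuple[float, float, str]]:
    """Return (elapsed_s, running_mean, status) after each scored chunk."""
    timeline: List[Tuple[float, float, str]] = []
    total = 0.0
    for i, score in enumerate(chunk_scores, start=1):
        total += score
        elapsed = i * chunk_seconds
        mean = total / i
        if mean < flag_at:
            status = "listening"
        elif elapsed < min_seconds:
            status = "provisional_flag"  # too little audio to act on
        else:
            status = "flag_for_review"   # triage signal, never an automatic block
        timeline.append((elapsed, mean, status))
    return timeline
```

With ten consecutive high-scoring half-second chunks, the first chunk yields only a provisional flag and the status becomes "flag_for_review" once four seconds have accumulated.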

Replay is the extra trap. Recent Interspeech work shows replay attacks remain damaging and that retraining with room impulse responses helps. Flag "synthetic voice," "possible replay," and "codec-degraded / low-confidence" as separate result types — not one collapsed scalar.

Music, singing, and Suno: a different problem

Speech deepfake detection is no longer enough.

SONICS is a large-scale end-to-end synthetic song dataset with 97,164 songs totaling 4,751 hours, including 49,074 fake songs generated by Suno and Udio and 48,090 real songs. Detectors trained on older singing-deepfake setups perform badly on these end-to-end synthetic songs.

The singing-only literature points the same way. CtrSVDD introduced a controlled singing-voice deepfake dataset with 47.64 hours of bona fide singing and 260.34 hours of fake singing across 14 methods and 164 singer identities. Valuable for a singing lane but doesn't solve the Suno/Udio problem — full-song generation changes accompaniment, arrangement, lyrics, and long-range structure all at once.

Echoes (2026) is the newest dataset worth watching. It is designed for music deepfake detection under realistic, provider-diverse conditions and contains 3,577 tracks over 110 hours from ten popular AI music systems. Its authors argue that provider diversity plus semantic alignment produces detectors that transfer better.

Commercial movement is real too: Deezer's AI music detection can detect tracks from Suno and Udio, and 2026 newsroom updates say the system is improving toward generalizability beyond per-model training. Deezer's research team also published a Fourier-based explanation of AI-music artifacts and reported a simple interpretable criterion matching deep-learning baselines, with >99% accuracy in several scenarios.

The product implication: route by content type first. Detect mostly speech → run the speech countermeasure stack. Detect singing or full-song music → don't pretend the speech score is enough. Either run a music detector or explicitly label the result "speech detector out of domain."
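
The routing rule can be sketched as a small dispatch function. The content-type labels, the dictionary shape, and the detector callables here are hypothetical. The one behavior the sketch is meant to show is that a singing or music file never silently receives a speech score.

```python
from typing import Callable, Optional, Sequence

Detector = Callable[[Sequence[float]], float]  # audio samples -> score in [0, 1]


def route_and_score(
    content_type: str,
    audio: Sequence[float],
    speech_detector: Detector,
    music_detector: Optional[Detector] = None,
) -> dict:
    """Send audio to the matching detector lane, or label it out of domain."""
    if content_type == "speech":
        return {"lane": "speech", "score": speech_detector(audio), "note": None}
    if content_type in ("singing", "music"):
        if music_detector is not None:
            return {"lane": "music", "score": music_detector(audio), "note": None}
        # No music model available: refuse to report a speech score.
        return {"lane": None, "score": None, "note": "speech detector out of domain"}
    # Anything else (e.g. mixed or unknown) is left unscored in this sketch.
    return {"lane": None, "score": None, "note": "content type not routed"}
```

How mixed content should be handled (for example, scoring separated stems) is left open here. The sketch simply declines to score it.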

The direction of travel in benchmarks like AT-ADD 2026 is toward type-agnostic audio fake detection precisely because current speech-centric systems don't transfer reliably across sound, singing, and music.

What we're building, in this order

1. Two-lane /check/audio: provenance/watermark probe (AudioSeal locally, hooks for SynthID and vendor APIs) plus a speech-first detector. Three evidence blocks in the UI: watermark/provenance hits, per-window fakeness timeline, codec/replay confidence notes.
2. Telephony / conferencing robustness harness before promising live-call detection. Internal benchmark from the newest speech generators, laundered through real conferencing and telephony paths (Zoom, Meet, Teams, Opus transcodes, speaker replay, mobile call captures). Separate "live-call calibrated" vs "uploaded-file calibrated" thresholds.
3. Music and singing branch instead of forcing speech models to cover Suno/Udio. Content-type router that labels files as speech / singing / music / mixed / unknown before any deepfake scoring. Music-specific evidence panels (long-context spectral repetition, Fourier-artifact views).
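
A per-window fakeness timeline with channel-specific thresholds could look roughly like the sketch below. It is an illustration under stated assumptions, not the shipped implementation: the function name, the window and hop sizes, the threshold values, and the `score_window` callable (standing in for a real detector) are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

# Placeholder thresholds; real values would come from per-channel calibration.
THRESHOLDS: Dict[str, float] = {"uploaded_file": 0.5, "live_call": 0.7}


def fakeness_timeline(
    samples: Sequence[float],
    sample_rate: int,
    score_window: Callable[[Sequence[float]], float],
    channel: str,
    window_s: float = 1.0,
    hop_s: float = 0.5,
) -> List[dict]:
    """Score overlapping windows and flag each against the channel's threshold."""
    win = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    threshold = THRESHOLDS[channel]
    timeline = []
    # Always emit at least one window, even for clips shorter than window_s.
    for start in range(0, max(len(samples) - win, 0) + 1, hop):
        end = min(start + win, len(samples))
        score = score_window(samples[start:end])
        timeline.append(
            {
                "start_s": start / sample_rate,
                "end_s": end / sample_rate,
                "score": score,
                "flagged": score >= threshold,
            }
        )
    return timeline
```

The same clip can produce different flags per channel: a window scoring 0.5 is flagged under the placeholder "uploaded_file" threshold but not under the stricter "live_call" one.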

Open questions

The biggest unknown is coverage against the newest commercial speech systems in the wild — especially the most expressive and conversational ones. There are strong product signals for Eleven v3, Sesame, and Google's 2026 audio stack, but no public independent benchmark that specifically isolates them against the major public detectors.

The second is licensing clarity for several of the most interesting public models. Fake-Mamba and Nes2Net are public as code, but captured repo pages didn't expose a clear permissive license. That alone can change what's rational to ship first in commercial SaaS.

The third is how well current forensics transfer to the next wave of all-type audio generators. The field itself is acknowledging this gap through AT-ADD 2026 and RADAR 2026, both oriented around robustness under transformations and generalization across speech, singing, music, and other audio types.

Finally, the hardest open question is policy and explanation design: when should the system say "synthetic," when should it say "suspicious," and when should it say "out of domain"? The empirical evidence supports exposing those as different result types. Today's best detectors are good enough to aid review, ranking, and triage. They are not good enough to deserve a single universal truth score without context.

That's what we ship in /check/audio: a layered forensic pipeline with explicit evidence, codec awareness, and honest verdict types. The Fake-Mamba-class detector and full music branch land next.