Client-Side AI Detection in 2026 — What's Actually Feasible in the Browser
A feasibility map for running AI-content detection entirely on-device in May 2026: WebGPU vs WASM SIMD vs Transformers.js, realistic latency on M-series Mac and mid-range Android, browser memory ceilings, the bundle budgets per modality (1–12 MB), and the phased rollout order (image first, video last) that maps to a 'fast private scan' vs 'deep cloud scan' UX.
The strongest privacy argument in AI-content detection, "your file never leaves your machine," only works when the detection actually runs on the user's machine. By May 2026, that's increasingly viable, but only if you pick the right model for each modality and pair it with the right runtime. This post is our working map of what's actually stable, what isn't, and the phased rollout we'd build a privacy-first /check experience around.
State of browser inference in May 2026
WebGPU is enabled by default in Chrome, Edge, and Firefox stable. Safari (macOS and iOS) supports it via Metal, but with tighter memory ceilings and more frequent device loss under pressure. It is production-viable, provided you implement graceful fallback to WASM.
ONNX Runtime Web (ORT Web). The WebGPU execution provider is stable for CNNs, small transformers, and most ops in opset ≤ 18. INT8 quantization works reliably; 4-bit quantization is still inconsistent. Cold-start cost is dominated by model fetch and compilation, so caching the model in IndexedDB is essential.
Transformers.js. Good for small encoder models (DistilBERT class). Larger LLMs remain impractical in-browser, apart from toy 100M-class models. The WebGPU backend has improved, but memory fragmentation on Safari is still an operational risk.
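A minimal sketch of the fetch-once, cache-by-version pattern behind the IndexedDB advice. The `ModelStore` interface, `MemoryModelStore`, `modelCacheKey`, and `loadModelBytes` are hypothetical names invented for illustration, not part of ORT Web. In a browser the store would wrap IndexedDB; an in-memory stand-in is used here so the logic runs anywhere.

```typescript
// Hypothetical storage interface; a browser build would back it with IndexedDB.
interface ModelStore {
  get(key: string): Promise<Uint8Array | undefined>;
  put(key: string, bytes: Uint8Array): Promise<void>;
}

// In-memory stand-in so the sketch runs outside a browser.
class MemoryModelStore implements ModelStore {
  private readonly entries = new Map<string, Uint8Array>();
  async get(key: string): Promise<Uint8Array | undefined> {
    return this.entries.get(key);
  }
  async put(key: string, bytes: Uint8Array): Promise<void> {
    this.entries.set(key, bytes);
  }
}

type ModelFetcher = (url: string) => Promise<Uint8Array>;

// Including the version in the key means a retrained model never reuses stale bytes.
function modelCacheKey(url: string, version: string): string {
  return `${url}@${version}`;
}

// Return cached bytes when present; otherwise fetch once and store them.
async function loadModelBytes(
  url: string,
  version: string,
  store: ModelStore,
  fetchModel: ModelFetcher
): Promise<{ bytes: Uint8Array; fromCache: boolean }> {
  const key = modelCacheKey(url, version);
  const cached = await store.get(key);
  if (cached !== undefined) {
    return { bytes: cached, fromCache: true };
  }
  const bytes = await fetchModel(url);
  await store.put(key, bytes);
  return { bytes, fromCache: false };
}
```

This only removes the fetch part of the cold start. Compilation still happens each time a session is created, so warm-up cost does not drop to zero.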
Summary. Small CNNs and small transformers are stable client-side. Medium vision transformers (ViT Base) are marginal on Android. Anything > 150M parameters is not reliable cross-device.
What's viable by modality
Image — your best first client-side modality
Smallest credible footprint: 5M parameter CNN INT8 ≈ 5–8 MB model. Runtime memory peak ≈ 80–200 MB depending on resolution. Bundle budget: 8–12 MB per image model.
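The footprint figures follow from simple arithmetic, sketched below. The function names are hypothetical, sizes are in decimal megabytes, and the sketch counts weights only. The gap between the 5 MB of raw INT8 weights and the 5–8 MB quoted above is assumed here to be graph metadata and any layers left unquantized.

```typescript
// Weight storage for a model with a uniform bit width per parameter.
function quantizedWeightMB(paramCount: number, bitsPerWeight: number): number {
  const bytes = (paramCount * bitsPerWeight) / 8;
  return bytes / 1_000_000; // decimal megabytes
}

// Size of one dense input tensor, e.g. an RGB image as float32.
function inputTensorMB(
  height: number,
  width: number,
  channels: number,
  bytesPerValue: number
): number {
  return (height * width * channels * bytesPerValue) / 1_000_000;
}

const int8WeightsMB = quantizedWeightMB(5_000_000, 8); // 5 MB of raw weights
const fp32WeightsMB = quantizedWeightMB(5_000_000, 32); // 20 MB unquantized
const inputMB = inputTensorMB(1024, 1024, 3, 4); // about 12.6 MB for one 1024px frame
```

The input tensor alone is larger than the quantized model, which is one reason the runtime peak depends so heavily on resolution.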
Realistic latency (1024px image):
| Device | Cold | Warm |
|---|---|---|
| M2 / M3 MacBook | 600–900 ms (incl. compile) | 80–150 ms |
| Mid Android Chrome | 1.5–3 s | 250–600 ms |
| Cheap Windows laptop (integrated GPU) | 1.2–2 s | 200–500 ms |
Best candidates:
- A small EfficientNet-B0-scale CNN trained on diffusion vs camera vs GAN residual artifacts.
- A patch-based residual detector (128×128 grid) for memory-bounded heatmap overlays.
- A pure C2PA verifier in WASM (OpenSSL or WebCrypto). No ML required, and a very strong privacy story.
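One way to read the patch-based candidate is to tile the image into fixed-size patches, score each independently, and paint the scores back as a heatmap. The sketch below covers only the tiling step. It assumes "128×128" means patch size in pixels, and `Patch` and `patchGrid` are hypothetical names.

```typescript
interface Patch {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Tile an image into patchSize x patchSize regions, left to right, top to bottom.
// Patches on the right and bottom edges are clipped to the image bounds.
function patchGrid(imageWidth: number, imageHeight: number, patchSize = 128): Patch[] {
  const patches: Patch[] = [];
  for (let y = 0; y < imageHeight; y += patchSize) {
    for (let x = 0; x < imageWidth; x += patchSize) {
      patches.push({
        x,
        y,
        width: Math.min(patchSize, imageWidth - x),
        height: Math.min(patchSize, imageHeight - y),
      });
    }
  }
  return patches;
}

const fullHd = patchGrid(1024, 1024); // 8 x 8 = 64 patches
```

Because only one patch needs to be resident on the GPU at a time, peak memory is bounded by patch size rather than by image resolution.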
Confidence: high for CNN approach. Dissent: ViT-based detectors generalize better cross-generator, but are too heavy client-side.
Text — trivial to run client-side
Client-side LLM detectors are not credible in 2026. The better approach is stylometric feature extraction (perplexity ratio, burstiness, token entropy) + a shallow classifier (XGBoost or logistic regression) compiled to WASM.
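A rough sketch of two of these features plus a logistic scorer, to show how little code the client-side path needs. Several things here are assumptions: the tokenizer is a naive regex, "burstiness" is taken as the coefficient of variation of sentence lengths (one common operationalization among several), and the weights passed to `logisticScore` would come from offline training. Perplexity ratio is left out because it needs a language model.

```typescript
// Naive lowercase word tokenizer; a real pipeline would use its own tokenizer.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}

// Shannon entropy (in bits) of the empirical token distribution.
function tokenEntropyBits(tokens: string[]): number {
  if (tokens.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const t of tokens) counts.set(t, (counts.get(t) ?? 0) + 1);
  let entropy = 0;
  counts.forEach((count) => {
    const p = count / tokens.length;
    entropy -= p * Math.log2(p);
  });
  return entropy;
}

// Burstiness as std / mean of sentence lengths measured in tokens (assumed definition).
function sentenceLengthBurstiness(text: string): number {
  const lengths = text
    .split(/[.!?]+/)
    .map((sentence) => tokenize(sentence).length)
    .filter((n) => n > 0);
  if (lengths.length < 2) return 0;
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance = lengths.reduce((a, n) => a + (n - mean) * (n - mean), 0) / lengths.length;
  return Math.sqrt(variance) / mean;
}

// Logistic regression scorer: sigmoid(bias + sum of feature * weight).
function logisticScore(features: number[], weights: number[], bias: number): number {
  let z = bias;
  features.forEach((f, i) => {
    z += f * (weights[i] ?? 0);
  });
  return 1 / (1 + Math.exp(-z));
}
```

A caller would compute the feature vector once per document and pass it to the scorer with trained weights; an XGBoost model would replace `logisticScore` without changing the feature code.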
Footprint: XGBoost model < 1 MB. Entire bundle < 2 MB. Latency: all devices < 50 ms for < 3k tokens.
The accuracy ceiling is lower than a server-scale transformer ensemble and more brittle against paraphrasing — but client-side text gives you an immediate privacy-forward story without any GPU dependency.
Confidence: high for feasibility. Dissent: without a strong encoder, cross-model generalization degrades quickly.
Audio anti-spoof — feasible under 30 seconds
Target model size: 1–5M parameter CNN on log-Mel spectrogram. Decode audio with WebCodecs, compute Mel spectrogram via WASM FFT, run CNN classifier via ORT Web.
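To give a feel for the input the CNN sees, the sketch below computes mel band centers and the number of spectrogram frames for a clip. It uses the common HTK-style mel formula. The sample rate, window, hop, and band count in the example are assumptions, not values from our pipeline, and the function names are hypothetical.

```typescript
// HTK-style mel scale conversion.
function hzToMel(hz: number): number {
  return 2595 * Math.log10(1 + hz / 700);
}

function melToHz(mel: number): number {
  return 700 * (Math.pow(10, mel / 2595) - 1);
}

// Center frequencies of nMels triangular filters, evenly spaced on the mel scale.
function melBandCentersHz(nMels: number, fMinHz: number, fMaxHz: number): number[] {
  const melMin = hzToMel(fMinHz);
  const melMax = hzToMel(fMaxHz);
  const centers: number[] = [];
  for (let i = 1; i <= nMels; i++) {
    centers.push(melToHz(melMin + ((melMax - melMin) * i) / (nMels + 1)));
  }
  return centers;
}

// Number of STFT frames when no padding is applied.
function stftFrameCount(totalSamples: number, windowSamples: number, hopSamples: number): number {
  if (totalSamples < windowSamples) return 0;
  return 1 + Math.floor((totalSamples - windowSamples) / hopSamples);
}

// Assumed setup: 10 s at 16 kHz, 25 ms window (400 samples), 10 ms hop (160 samples).
const frames = stftFrameCount(160_000, 400, 160); // 998 frames
const centers = melBandCentersHz(64, 0, 8000); // 64 bands up to Nyquist
```

Under those assumptions a 10-second clip becomes a 998 × 64 spectrogram, roughly 64k values, which is small next to a 1024px image tensor.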
Footprint: 3M params INT8 ≈ 3–4 MB.
Latency (10-second clip):
| Device | Cold | Warm |
|---|---|---|
| M-series Mac | 1.2 s | 200–400 ms |
| Mid Android | 2–4 s | 500 ms – 1.2 s |
Safari mobile struggles with large AudioBuffers — you must chunk audio.
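Chunking can be as simple as computing sample ranges up front and processing one range at a time. This is a sketch under assumptions: `SampleRange` and `chunkRanges` are hypothetical names, and the optional overlap is there so that artifacts straddling a chunk boundary are still seen whole by at least one chunk.

```typescript
interface SampleRange {
  start: number; // inclusive sample index
  end: number; // exclusive sample index
}

// Split totalSamples into windows of chunkSamples, each overlapping the previous
// one by overlapSamples. The final chunk is clipped to the end of the clip.
function chunkRanges(
  totalSamples: number,
  chunkSamples: number,
  overlapSamples = 0
): SampleRange[] {
  const step = chunkSamples - overlapSamples;
  if (chunkSamples <= 0 || step <= 0) {
    throw new Error("chunkSamples must be positive and larger than overlapSamples");
  }
  const ranges: SampleRange[] = [];
  for (let start = 0; start < totalSamples; start += step) {
    const end = Math.min(start + chunkSamples, totalSamples);
    ranges.push({ start, end });
    if (end === totalSamples) break; // avoid emitting a redundant tail chunk
  }
  return ranges;
}
```

Per-chunk scores can then be averaged or max-pooled into a clip-level score, so no single large buffer has to be held for the whole clip.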
Confidence: moderate. Dissent: audio spoofing detection often needs larger models for robustness against new TTS families.
Video — frame-sampled heuristic only
Full video deepfake detection client-side isn't realistic cross-device. The feasible subset is frame sampling: take every Nth frame, run the image detector on each sampled frame, and aggregate the per-frame results into a temporal consistency score. Memory spikes, thermal throttling on Android, and user patience all bite quickly.
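A sketch of the sampling and aggregation steps, with the per-frame detector left out. The post does not define the temporal consistency score, so the mean absolute change between consecutive sampled frames is used here purely as a stand-in. All names are hypothetical.

```typescript
// Indices of the frames to analyze when sampling every Nth frame.
function sampledFrameIndices(totalFrames: number, everyN: number): number[] {
  if (everyN < 1) throw new Error("everyN must be at least 1");
  const indices: number[] = [];
  for (let i = 0; i < totalFrames; i += everyN) indices.push(i);
  return indices;
}

interface VideoScore {
  meanScore: number; // average per-frame detector score
  meanAbsDelta: number; // stand-in consistency measure: average jump between neighbors
}

// Aggregate per-frame scores (each assumed to be in [0, 1]) into a clip-level summary.
function aggregateFrameScores(scores: number[]): VideoScore {
  if (scores.length === 0) return { meanScore: 0, meanAbsDelta: 0 };
  let sum = 0;
  let deltaSum = 0;
  let previous: number | undefined;
  scores.forEach((score) => {
    sum += score;
    if (previous !== undefined) deltaSum += Math.abs(score - previous);
    previous = score;
  });
  return {
    meanScore: sum / scores.length,
    meanAbsDelta: scores.length > 1 ? deltaSum / (scores.length - 1) : 0,
  };
}
```

Raising `everyN` trades temporal coverage for lower memory and heat, which is the main lever on mid-range Android.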
Bundle budget: reuse the image model.
Confidence: high. Dissent: high-end laptops can do more — but that excludes your mobile-heavy investigative users.
WebGPU constraints in 2026
Practical memory ceilings (not theoretical):
| Browser | Effective ceiling |
|---|---|
| Safari macOS | ~1.5–2 GB before device loss |
| Safari iOS | ~700 MB – 1 GB |
| Chrome desktop | 2–4 GB depending on GPU |
| Android Chrome | 500 MB – 1.5 GB |
Model size ceiling for safety: keep any single model under 20 MB. Prefer under 10 MB.
Other operational notes: no background threads for GPU compilation beyond worker scope. Shader compilation adds cold latency. WebGPU is unavailable in some hardened enterprise environments. Fallback hierarchy: WebGPU → WASM SIMD → server.
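The fallback hierarchy and the 20 MB ceiling can be expressed as a small pure function, sketched below. `DeviceCaps`, `pickBackend`, and `nextBackend` are hypothetical names. In a browser, `hasWebGPU` would typically be derived from whether `navigator.gpu.requestAdapter()` resolves to a non-null adapter, and how `hasWasmSimd` is detected is left open here.

```typescript
type Backend = "webgpu" | "wasm-simd" | "server";

interface DeviceCaps {
  hasWebGPU: boolean;
  hasWasmSimd: boolean;
}

// Preference order from the text: WebGPU, then WASM SIMD, then server.
const FALLBACK_ORDER: Backend[] = ["webgpu", "wasm-simd", "server"];

function isUsable(backend: Backend, caps: DeviceCaps): boolean {
  if (backend === "webgpu") return caps.hasWebGPU;
  if (backend === "wasm-simd") return caps.hasWasmSimd;
  return true; // the server path is always available as a last resort
}

// Pick the first usable backend; models over the size ceiling go straight to server.
function pickBackend(caps: DeviceCaps, modelSizeMB: number, maxModelMB = 20): Backend {
  if (modelSizeMB > maxModelMB) return "server";
  for (const backend of FALLBACK_ORDER) {
    if (isUsable(backend, caps)) return backend;
  }
  return "server";
}

// After a runtime failure (for example WebGPU device loss), step down one level.
function nextBackend(failed: Backend, caps: DeviceCaps): Backend {
  const startIndex = FALLBACK_ORDER.indexOf(failed) + 1;
  for (const backend of FALLBACK_ORDER.slice(startIndex)) {
    if (isUsable(backend, caps)) return backend;
  }
  return "server";
}
```

Keeping this logic free of browser globals makes it easy to unit-test the policy separately from capability probing.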
Privacy and sales reality
The privacy story has different weight by audience:
| Audience | Privacy strength |
|---|---|
| Journalists | Strong: embargoed material, whistleblower leaks. Client-side scanning is a differentiator. |
| Trust & Safety teams | Less sensitive. Already process content server-side. Throughput and audit logs matter more. |
| KYC / insurance | Moderately strong. "File never leaves device unless you choose to escalate." |
| Legal / eDiscovery | High sensitivity. Client-side preview analysis reduces chain-of-custody concerns. |
Bottom line: privacy matters most for journalists and legal buyers. Less decisive for platforms. Confidence: moderate-high. Dissent: some enterprise buyers distrust client-side because results are harder to centrally audit.
Browser extension as deployment surface
Pros: auto-analyze images on page, persistent model cache, better file-system access, lower friction for journalists. Cons: extension store review policies, enterprise lockdown environments, harder monetization gating.
Strategic view: extension is a strong secondary surface — after the web client stabilizes.
Concrete /check architecture
Default flow:
- User uploads file.
- Browser detects modality.
- Attempt client-side:
  - C2PA verification first (instant, pure WASM).
  - If image: lightweight CNN.
  - If text: stylometric detector.
  - If audio: small anti-spoof if < 30 s.
  - If video: frame sample + image detector.
- If model too large, WebGPU unavailable, file exceeds threshold, or user requests "deep scan" → offer server-side escalation with explicit consent toggle.
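The default flow above can be sketched as a pure planning function that decides which local steps to run and whether to offer the cloud escalation. The step names, the `ScanRequest` shape, and the 50 MB file threshold are all hypothetical. The sketch also assumes text never needs WebGPU, since the stylometric path has no GPU dependency.

```typescript
type Modality = "image" | "text" | "audio" | "video";

interface ScanRequest {
  modality: Modality;
  fileSizeMB: number;
  audioSeconds?: number; // only meaningful for audio
  webgpuAvailable: boolean;
  deepScanRequested: boolean;
}

interface ScanPlan {
  localSteps: string[];
  offerCloudEscalation: boolean; // escalation still requires explicit user consent
}

const MAX_LOCAL_FILE_MB = 50; // hypothetical threshold, not from the post
const MAX_LOCAL_AUDIO_SECONDS = 30;

function planScan(req: ScanRequest): ScanPlan {
  const localSteps: string[] = ["c2pa-verify"]; // always first, no ML involved
  const needsGpu = req.modality !== "text"; // assumption: stylometry runs without a GPU
  let offerCloudEscalation =
    req.deepScanRequested ||
    req.fileSizeMB > MAX_LOCAL_FILE_MB ||
    (needsGpu && !req.webgpuAvailable);

  switch (req.modality) {
    case "image":
      localSteps.push("image-cnn");
      break;
    case "text":
      localSteps.push("stylometric");
      break;
    case "audio":
      // Unknown duration is treated as too long for the local model.
      if ((req.audioSeconds ?? Infinity) < MAX_LOCAL_AUDIO_SECONDS) {
        localSteps.push("audio-anti-spoof");
      } else {
        offerCloudEscalation = true;
      }
      break;
    case "video":
      localSteps.push("frame-sample", "image-cnn");
      break;
  }
  return { localSteps, offerCloudEscalation };
}
```

The function only returns a plan. Nothing is uploaded unless the UI layer shows the escalation option and the user opts in.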
UI framing:
- "Fast private scan (on your device)"
- "Deep scan (cloud ensemble)"
Bundle budgets
| Modality | Target |
|---|---|
| Text | 1–2 MB |
| Image | 8–12 MB |
| Audio | 4–6 MB |
| C2PA verifier | 1–2 MB |
Lazy-load per modality route. Never ship all modalities at once.
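A minimal sketch of per-modality lazy loading, assuming each detector is exposed through an async loader. In a real bundle each loader would usually be a dynamic `import()` of that modality's chunk; the names `DetectorLoader` and `createLazyDetectors` are hypothetical.

```typescript
type DetectorLoader<T> = () => Promise<T>;

// Returns a getter that starts a modality's loader on first use and reuses the
// same promise afterwards, so each bundle is requested at most once.
function createLazyDetectors<T>(
  loaders: Record<string, DetectorLoader<T>>
): (modality: string) => Promise<T> {
  const started = new Map<string, Promise<T>>();
  return (modality: string): Promise<T> => {
    const existing = started.get(modality);
    if (existing !== undefined) return existing;
    const loader = loaders[modality];
    if (loader === undefined) {
      return Promise.reject(new Error(`no detector registered for ${modality}`));
    }
    const pending = loader();
    started.set(modality, pending);
    // Forget failed loads so a later call can retry instead of reusing the rejection.
    pending.catch(() => {
      started.delete(modality);
    });
    return pending;
  };
}
```

Nothing is fetched when the registry is created, so a user who only checks images never downloads the audio or text bundles.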
Phased rollout order
1. Image (highest ROI, most visual evidence alignment)
2. Text stylometric
3. C2PA verifier
4. Audio
5. Video sampling
What we ship next
1. Client-side image CNN + C2PA verifier. ROI: very high. Effort: moderate. Train or fine-tune a 5M parameter EfficientNet-class CNN on diffusion vs camera artifacts. Export to ONNX opset 17. Post-training INT8 quantization. Integrate via ONNX Runtime Web with WebGPU backend. Cache model in IndexedDB. Build patch-grid heatmap visualization. Add WASM C2PA signature verification before the ML stage.
2. Text stylometric WASM detector. ROI: high. Effort: low. Feature extraction in pure JS and WASM (token entropy, sentence-length variance, function-word frequency, and perplexity from either a small 20M-parameter masked LM or precomputed heuristics). A 20M-parameter LM is roughly 20 MB even at INT8, well over the 1–2 MB text budget, so only the precomputed-heuristics route fits that budget. Train the XGBoost classifier offline, export it to JSON, then load and run it in WASM. Entire inference under 50 ms.
3. Client-first / server-fallback orchestration layer. ROI: high. Effort: moderate. Device-capability detection at runtime. Estimate available GPU memory via probe allocation. If below threshold, auto-select WASM or server. Add an explicit "Keep analysis local" vs "Run deep cloud scan" toggle. Log user choice for product analytics.
Open questions
- How well do 5M parameter image CNNs generalize to 2026 diffusion families without frequent retraining?
- Will Safari iOS WebGPU stability materially improve over the next 12 months?
- Is there a compact multimodal joint embedding model under 20M parameters that meaningfully improves cross-generator robustness?
- Will regulatory pressure require centralized audit logs that weaken the client-side privacy pitch?
- How tolerant are journalists of 1–2 second cold starts on large images?
That's the /check we're rebuilding around a "fast private scan" default.