Demo · live hate-speech classifier
Hate-speech classification. 9 protected categories. Distinguishes use from mention.
A frozen 15M-parameter v7 backbone feeds 9 per-category linear probes (women, LGBTQ+, jews, muslims, black, asian, latino, disabled, migrants). HateCheck adversarial AUROC is 0.9339, ahead of Detoxify (0.91), Perspective API (~0.87), and HateBERT (0.85-0.88). The classifier distinguishes use ("I hate X", P=0.87) from mention ("Saying I hate X is bigoted", P=0.10), a distinction that trips up production classifiers.
How this works
- v7 backbone + 9 per-category linear probes. A single-direction probe loses category-specific signal because it collapses 9 distinct hate vectors into one. v3 deploys one probe per identity (women, LGBTQ+, jews, muslims, black, asian, latino, disabled, migrants).
- Aggregator score = max across probes. The top-scoring category fires at its own calibrated 5%-FPR threshold; scores are not averaged across categories.
- Calibration set. Per-category probes are trained on HateCheck, CONAN, and CivilComments, with a 5%-FPR threshold calibrated per identity.
- Use vs mention. The model preserves quotation/sarcasm boundaries that string-matching and Perspective-style classifiers don't.
- Signed receipt. Each call produces a SHA-256 audit record. The same head powers the Bluesky labeler at bsky.bhala.ai.
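The aggregation and calibration bullets above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the deployed code: the function names `threshold_at_fpr` and `aggregate`, the strict `score > threshold` firing rule, and the use of held-out benign scores for calibration are all hypothetical choices made for the example.

```python
import math
from typing import Dict, List, Tuple

CATEGORIES = ["women", "LGBTQ+", "jews", "muslims", "black",
              "asian", "latino", "disabled", "migrants"]


def threshold_at_fpr(benign_scores: List[float], target_fpr: float = 0.05) -> float:
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Assumption: calibration uses probe scores on held-out non-hateful text.
    """
    if not benign_scores:
        raise ValueError("need at least one benign score")
    ordered = sorted(benign_scores)
    # Number of benign examples allowed to score strictly above the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - allowed - 1]


def aggregate(scores: Dict[str, float],
              thresholds: Dict[str, float]) -> Tuple[str, float, bool]:
    """Take the max across probes and test it against that category's own threshold."""
    top = max(scores, key=scores.get)
    top_score = scores[top]
    fired = top_score > thresholds[top]
    return top, top_score, fired


if __name__ == "__main__":
    # Illustrative numbers only; these are not outputs of the real model.
    thresholds = {c: 0.5 for c in CATEGORIES}
    scores = {c: 0.05 for c in CATEGORIES}
    scores["migrants"] = 0.8
    print(aggregate(scores, thresholds))
```

Taking the maximum rather than the mean means a strong signal in one category is not diluted by eight near-zero probes, and comparing against the winning category's own threshold keeps the false-positive rate controlled per identity.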
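A per-call audit record of the kind the signed-receipt bullet describes might look like the sketch below. The record fields, the canonical-JSON encoding, and the use of HMAC-SHA256 as the signing step are assumptions for illustration; the source only states that each call yields a SHA-256 audit record, and `make_receipt` and `verify_receipt` are hypothetical names.

```python
import hashlib
import hmac
import json
import time
from typing import Dict, Optional


def _canonical(record: Dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte encoding to hash.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_receipt(text: str, scores: Dict[str, float], fired_category: Optional[str],
                 secret_key: bytes, timestamp: Optional[int] = None) -> Dict:
    """Build an audit record; the input is stored as a hash, not as raw text."""
    record = {
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "scores": scores,
        "fired_category": fired_category,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }
    payload = _canonical(record)
    record["record_sha256"] = hashlib.sha256(payload).hexdigest()
    # Assumed signing scheme: HMAC-SHA256 with a server-side secret.
    record["signature"] = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    return record


def verify_receipt(record: Dict, secret_key: bytes) -> bool:
    """Recompute the digest and signature over the record body and compare."""
    body = {k: v for k, v in record.items()
            if k not in ("record_sha256", "signature")}
    payload = _canonical(body)
    expected_sig = hmac.new(secret_key, payload, hashlib.sha256).hexdigest()
    digest_ok = hashlib.sha256(payload).hexdigest() == record["record_sha256"]
    return digest_ok and hmac.compare_digest(expected_sig, record["signature"])
```

Hashing the input instead of storing it lets an auditor confirm which text was scored without the record itself retaining the content.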