Demo · live hate-speech classifier

Hate-speech classification. 9 protected categories. Distinguishes use from mention.

Frozen 15M-parameter v7 backbone feeding 9 per-category linear probes (women, LGBTQ+, jews, muslims, black, asian, latino, disabled, migrants). HateCheck adversarial AUROC 0.9339, which beats Detoxify (0.91), Perspective API (~0.87), and HateBERT (0.85-0.88). Distinguishes use ("I hate X", P=0.87) from mention ("Saying I hate X is bigoted", P=0.10). Confusing the two is the failure mode that trips up production classifiers.

How this works

  • v7 backbone + 9 per-category linear probes. A single-direction probe loses category-specific signal because it collapses 9 distinct hate vectors into one. v3 deploys one probe per identity (women, LGBTQ+, jews, muslims, black, asian, latino, disabled, migrants).
  • Aggregator score = max across probes. The top-scoring category fires at its own calibrated 5%-FPR threshold; probe scores are not averaged.
  • Calibration set. Per-category probes are trained on HateCheck + CONAN + CivilComments, with a 5%-FPR threshold calibrated per identity.
  • Use vs mention. The model preserves quotation/sarcasm boundaries that string-matching and Perspective-style classifiers don't.
  • Signed receipt. SHA-256 audit record per call. Same head powers the Bluesky labeler at bsky.bhala.ai.
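
The scoring path described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the deployed code: the function names (probe_score, threshold_at_fpr, aggregate, receipt), the logistic form of the probe, the strict ">" firing rule, and the receipt layout are all hypothetical, and the numbers in the usage section are toy values rather than model outputs. The receipt here is only a SHA-256 digest of the call record; the signing step is not shown because the page does not describe it.

```python
import hashlib
import json
import math

CATEGORIES = ["women", "LGBTQ+", "jews", "muslims", "black",
              "asian", "latino", "disabled", "migrants"]

def probe_score(embedding, weights, bias):
    """One linear probe: logistic of a dot product with the frozen embedding.
    (Assumed form; the page only says the probes are linear.)"""
    logit = sum(e * w for e, w in zip(embedding, weights)) + bias
    return 1.0 / (1.0 + math.exp(-logit))

def threshold_at_fpr(benign_scores, fpr=0.05):
    """Pick a threshold so that at most `fpr` of benign scores lie strictly above it."""
    ranked = sorted(benign_scores)
    allowed = math.floor(fpr * len(ranked))  # false positives tolerated
    return ranked[len(ranked) - allowed - 1]

def aggregate(scores, thresholds):
    """Max across probes: the top category is tested against its own threshold."""
    top = max(scores, key=scores.get)
    return {
        "category": top,
        "score": scores[top],
        "flagged": scores[top] > thresholds[top],
    }

def receipt(text, result):
    """SHA-256 digest over a canonical JSON record of the call (hypothetical layout)."""
    record = {
        "input_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "result": result,
    }
    payload = json.dumps(record, sort_keys=True)  # sorted keys keep the digest stable
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

if __name__ == "__main__":
    # Toy per-category scores and thresholds, for illustration only.
    scores = {c: 0.10 for c in CATEGORIES}
    scores["migrants"] = 0.80
    thresholds = {c: 0.50 for c in CATEGORIES}
    result = aggregate(scores, thresholds)
    print(result)
    print(receipt("example input", result))
```

Testing only the top category against its own threshold keeps each identity's false-positive rate tied to its own calibration, which averaging across probes would not do.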