Demo · live hate-speech classifier

Hate-speech classification. 9 protected categories. Distinguishes use from mention.

Nine detectors, one per protected identity (women, LGBTQ+, Jews, Muslims, Black, Asian, Latino, disabled, migrants). On the HateCheck adversarial suite the classifier reaches an AUROC of 0.9339, ahead of Detoxify (0.91), Perspective API (~0.87), and HateBERT (0.85-0.88). It distinguishes use ("I hate X", P=0.87) from mention ("Saying I hate X is bigoted", P=0.10), the failure mode that trips up production classifiers.

How this works

Nine per-category detectors. A single detector loses category-specific signal because it collapses nine distinct hate directions into one. Bhala instead calibrates one detector per identity (women, LGBTQ+, Jews, Muslims, Black, Asian, Latino, disabled, migrants).
Aggregator. The overall score is the maximum across the nine probes. Rather than averaging, the top-scoring category fires only if it clears its own calibrated 5%-FPR threshold.
Calibration set. Per-category probes are trained on HateCheck, CONAN, and CivilComments, and each identity gets its own threshold set at a 5% false-positive rate (FPR).
Use vs mention. The model respects quotation and sarcasm boundaries that string matching and Perspective-style classifiers miss.
Signed receipt. Each call returns a SHA-256 audit record. The same head powers the Bluesky labeler at bsky.bhala.ai.
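
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Bhala's implementation: the function names (threshold_at_fpr, classify, audit_digest), the toy benign scores, and the layout of the audit payload are all hypothetical. It also assumes that "top category" means the category with the highest raw probe score, and it shows only a plain SHA-256 digest, since the page does not describe how the receipt is signed.

```python
import hashlib
import json

CATEGORIES = ["women", "LGBTQ+", "jews", "muslims", "black",
              "asian", "latino", "disabled", "migrants"]


def threshold_at_fpr(benign_scores, target_fpr=0.05):
    """Return a cutoff that at most `target_fpr` of the benign scores exceed."""
    if not benign_scores:
        raise ValueError("need at least one benign score to calibrate")
    ranked = sorted(benign_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # benign examples allowed to fire
    return ranked[min(allowed, len(ranked) - 1)]


def classify(scores, thresholds):
    """Max-aggregate the per-category scores, then test the top category
    against its own threshold (no averaging across categories)."""
    top = max(scores, key=scores.get)
    return {
        "category": top,
        "score": scores[top],
        "flagged": scores[top] > thresholds[top],
    }


def audit_digest(text, result):
    """SHA-256 hex digest over a canonical JSON payload (hypothetical layout).
    This is a hash only, not a cryptographic signature."""
    payload = json.dumps({"text": text, "result": result},
                         sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Toy calibration data: the same 20 made-up benign scores for every category.
benign = [i / 40 for i in range(20)]
thresholds = {c: threshold_at_fpr(benign) for c in CATEGORIES}

# Made-up probe outputs echoing the use/mention probabilities quoted on this page.
use_text = "I hate X"
use_scores = {c: 0.02 for c in CATEGORIES}
use_scores["migrants"] = 0.87
use_result = classify(use_scores, thresholds)

mention_text = "Saying I hate X is bigoted"
mention_scores = {c: 0.02 for c in CATEGORIES}
mention_scores["migrants"] = 0.10
mention_result = classify(mention_scores, thresholds)

receipt = audit_digest(use_text, use_result)
```

Taking the maximum keeps a strong signal in one category from being diluted by eight near-zero scores, which is what averaging would do. Comparing against a per-category threshold lets each identity keep its own operating point.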

Other live demos