Benchmarks
Every claim, verifiable.
Reproducible numbers across bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.
Bias Removal
We applied a single named operator, “remove bias”, to every sentence in four canonical English bias benchmarks. An independently trained classifier verified each result. Zero failures across 15,966 test cases and 28 protected dimensions.
The operator identifies and corrects in a single pass — 100% correction across 15,966 sentences means 100% were identified. Detection is implicit: the operator only shifts what it geometrically locates as biased, confirmed by an independent classifier post-shift.
The model was pretrained on isiZulu only. It has never seen English during training. The operator transfers because the geometry of bias is a structural property of the embedding space, not a language-specific pattern.
| Benchmark | Dimensions | Test pairs | Correction rate |
|---|---|---|---|
| BBQ (Bias Benchmark for Question Answering) | 7 | 6,864 | 100.0% |
| StereoSet (stereotype measurement dataset) | 8 | 6,010 | 100.0% |
| CrowS-Pairs (crowdsourced stereotype pairs) | 9 | 1,508 | 100.0% |
| WinoBias (gender bias in coreference resolution) | 4 | 1,584 | 100.0% |
| Combined | 28 | 15,966 | 100.0% |
12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias
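A minimal sketch of the evaluation loop described above. The `encode`, `apply_operator`, and `bias_classifier` names are illustrative stand-ins rather than the production API; the point is the shape of the check: shift once, then let an independent classifier judge the result.

```python
# Illustrative sketch only: `encode`, `apply_operator`, and `bias_classifier`
# are hypothetical stand-ins for the real pipeline, not our API.
def correction_rate(sentences, encode, apply_operator, bias_classifier):
    corrected = 0
    for text in sentences:
        z = encode(text)                              # frozen, isiZulu-pretrained backbone
        z_shifted = apply_operator(z, "remove bias")  # single named operator, one pass
        if not bias_classifier.predict(z_shifted):    # independent post-shift verification
            corrected += 1
    return corrected / len(sentences)

# correction_rate(bbq_sentences, ...) -> 1.0 reported across all four benchmarks
```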
For compliance teams
- These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes), which we conduct together during pilot.
- Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
- Results reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
Hate Speech Detection
The model flags hate speech directed at 8 protected groups (Black people, women, migrants, disabled people, Jewish people, Muslims, LGBT+ people, and POC) and catches 76–98% of it while wrongly flagging fewer than 1 in 20 clean posts. We never fine-tuned the model on hate data; a linear probe on frozen weights is all it takes. Proof is on HateCheck and CONAN, two standard public hate-speech benchmarks.
Trained + evaluated on: HateCheck (Röttger et al. 2021) + CONAN (Fanton et al. 2021) — per-group linear probes on held-out 30% test split.
Per-group detection rates
One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at a 5% false-positive rate. AUROC is the area under the ROC curve: 1.0 is perfect, 0.5 is random. A sketch of the probe setup follows the table.
| Group | Catch rate @ 5% FPR | AUROC | Test posts |
|---|---|---|---|
| Black people | 98.5% | 0.9908 | 66 |
| Disabled people | 98.4% | 0.9875 | 123 |
| Women | 90.0% | 0.9821 | 240 |
| Migrants | 94.3% | 0.9810 | 246 |
| Jewish people | 93.6% | 0.9807 | 109 |
| Muslims | 88.4% | 0.9733 | 319 |
| LGBT+ people | 81.1% | 0.9596 | 297 |
| POC (other) | 76.0% | 0.9305 | 75 |
| Generic (max-pool) | 30.1% | 0.8205 | — |
The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
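For readers who want to reproduce the setup, here is a minimal per-group probe in scikit-learn. The frozen `encode` function is an assumed stand-in for the backbone; everything else is standard: fit a logistic-regression probe on frozen embeddings, hold out 30%, then read off AUROC and the catch rate at 5% FPR.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

def train_group_probe(texts, labels, encode, test_size=0.3, seed=0):
    """Fit one linear probe for one protected group on frozen embeddings."""
    X = np.stack([encode(t) for t in texts])            # backbone stays frozen
    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=test_size, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    scores = probe.predict_proba(X_te)[:, 1]
    auroc = roc_auc_score(y_te, scores)                  # area under the ROC curve
    fpr, tpr, _ = roc_curve(y_te, scores)
    idx = np.searchsorted(fpr, 0.05, side="right") - 1   # last operating point with FPR <= 5%
    catch_at_5fpr = tpr[idx]                              # catch rate at 5% false positives
    return probe, auroc, catch_at_5fpr
```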
vs. published baselines (overall AUROC)
| Model | Params | AUROC | Method |
|---|---|---|---|
| Bhala v7 (ours) | 15M | 0.9339 | Linear probe on frozen weights · no fine-tuning of any kind |
| Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw |
| Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline |
| HateBERT (Caselli 2020) | 110M | 0.85-0.88 | BERT-base fully fine-tuned on Reddit hate corpus |
| HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations |
The decisive measurement
'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive hate scores nearly 9x apart. That use/mention distinction emerged from frozen weights, without a single hate-labeled training example.
HateCheck functional breakdown
Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.
| Statement type | Mean P(hate) | True label |
|---|---|---|
| Direct hate ('I hate X') | 0.866 | hateful |
| Slurs (raw) | 0.855 | hateful |
| Threats | 0.835 | hateful |
| Spell attacks (typos, leet) | 0.945 | hateful |
| Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful |
| Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful |
| Positive identity ('I love X') | 0.296 | non-hateful |
| Slur reclamation (in-group) | 0.308 | non-hateful |
| Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful |
| Profanity not directed at group | 0.230 | non-hateful |
| Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful |
| Negation ('I don't hate X') | 0.415 | non-hateful |
Production threshold calibration
| FPR target | TPR | Use case |
|---|---|---|
| 1% | 64.5% | high-precision review queue |
| 5% | 91.5% | default production threshold |
| 10% | 97.6% | aggressive-recall mode |
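The thresholds above come from a standard ROC sweep on validation scores. A short sketch, assuming you already have probe scores and labels for a held-out set:

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_fpr(y_true, scores, target_fpr=0.05):
    """Largest score cutoff whose validation false-positive rate stays at or below target."""
    fpr, tpr, thresholds = roc_curve(y_true, scores)
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return float(thresholds[idx]), float(tpr[idx])

# thr, recall = threshold_for_fpr(val_labels, val_scores, 0.05)  # default production point
```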
Sentiment Analysis
Two results here: the operator algebra (100% sentiment flip on held-out test data) and the AfriSenti classification benchmark (77–91% of fine-tuned SOTA with a frozen backbone). They measure different things: the flip measures structural control; AfriSenti measures downstream accuracy.
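How the flip number is measured, sketched with hypothetical names (`encode`, `apply_operator`, and an independent `sentiment_clf` are illustrative, not the production API): shift each held-out embedding once, then check whether a separately trained sentiment classifier reverses its verdict.

```python
# Hypothetical names throughout; this shows the shape of the measurement, not our API.
def sentiment_flip_rate(texts, labels, encode, apply_operator, sentiment_clf):
    flipped = 0
    for text, label in zip(texts, labels):
        z_shifted = apply_operator(encode(text), "flip sentiment")  # one operator, one pass
        if sentiment_clf.predict(z_shifted) != label:               # polarity reversed
            flipped += 1
    return flipped / len(texts)   # 1.0 = every held-out sentence flipped
```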
Sentiment Analysis (Cross-Language)
Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.
| Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
|---|---|---|---|
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
Intent Classification
Two benchmarks: MASSIVE (multilingual, 51 languages, 60 intents) and Injongo (8 Bantu languages, published SOTA comparison). In both cases Sozisi uses a frozen backbone with a lightweight probe head — no fine-tuning per language.
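The probe head itself is small. A sketch in scikit-learn, with `encode` standing in for the frozen backbone: one multinomial linear classifier over the 60 MASSIVE intents, with nothing in the backbone updated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_intent_probe(train_texts, train_intents, encode):
    """Lightweight probe head: one linear classifier over frozen embeddings."""
    X = np.stack([encode(t) for t in train_texts])   # frozen, isiZulu-pretrained backbone
    return LogisticRegression(max_iter=2000).fit(X, train_intents)

def intent_accuracy(probe, texts, intents, encode):
    X = np.stack([encode(t) for t in texts])
    return float((probe.predict(X) == np.asarray(intents)).mean())
```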
MASSIVE Swahili — the commercial case
60-intent benchmark. GPT-4o and InkubaLM both saw Swahili in training; Sozisi was trained on isiZulu only (true zero-shot). Sozisi beats GPT-4o outright; only Swahili-pretrained InkubaLM scores higher.
| Model | Parameters | Score | Method |
|---|---|---|---|
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot) |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web |
Injongo — 8 Bantu languages, head-to-head
Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat them on 4 of 8.
| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
|---|---|---|---|---|---|
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | behind |
SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%
Cross-Lingual Transfer
Zero target-language training. Structural transfer from isiZulu to 17 languages across 10 families.
Every language below was absent from training. No fine-tuning. No retraining. Language adaptation takes under 2 seconds.
MASSIVE intent accuracy — zero-shot
| Language | Family | Accuracy |
|---|---|---|
| Swahili | Bantu | 71.0% |
| Urdu | Indo-Aryan | 66.0% |
| Mongolian | Mongolic | 64.0% |
| Tagalog | Austronesian | 62.3% |
| Korean | Koreanic | 61.7% |
| Amharic | Semitic | 60.9% |
| Hindi | Indo-Aryan | 60.3% |
| Javanese | Austronesian | 58.6% |
| Japanese | Japonic | 56.5% |
| Tamil | Dravidian | 56.2% |
| Kannada | Dravidian | 53.6% |
| Telugu | Dravidian | 50.5% |
Named Entity Recognition (MasakhaNER, isiZulu)
Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.
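A sketch of that head, using PyTorch with the pytorch-crf package as a stand-in for whatever CRF implementation runs in production: the frozen backbone supplies per-token embeddings, and only the linear projection and CRF transitions are trained.

```python
import torch
import torch.nn as nn
from torchcrf import CRF   # pytorch-crf package; a stand-in, not necessarily the production CRF

class FrozenBackboneCRF(nn.Module):
    """Linear emission scores over frozen per-token embeddings, decoded with a CRF."""
    def __init__(self, hidden_dim, num_tags):
        super().__init__()
        self.emit = nn.Linear(hidden_dim, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, token_embeddings, tags, mask):
        emissions = self.emit(token_embeddings)       # (batch, seq, num_tags)
        return -self.crf(emissions, tags, mask=mask)  # negative log-likelihood

    def decode(self, token_embeddings, mask):
        return self.crf.decode(self.emit(token_embeddings), mask=mask)
```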
See it on your data
Most pilots are live in under two weeks via REST API.