Benchmarks

Every claim, verifiable.

Reproducible numbers across bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.

100% · 28 protected dimensions · 15,966 pairs · 2026-04-25

Bias Removal

We applied a single named operator — “remove bias” — to every sentence in four canonical English bias benchmarks. An independently trained classifier verified each result. Zero failures across 15,966 test cases and 28 protected dimensions.

The operator identifies and corrects bias in a single pass — 100% correction across 15,966 sentences means 100% were identified. Detection is implicit: the operator only shifts what it geometrically locates as biased, confirmed by an independent classifier post-shift.

The model was pretrained on isiZulu only. It has never seen English during training. The operator transfers because the geometry of bias is a structural property of the embedding space, not a language-specific pattern.
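For a concrete picture of the verification loop, here is a minimal numpy/scikit-learn sketch: embed, apply an operator, then let an independently trained classifier judge the result. The projection-style operator, the random stand-in embeddings, and every name below are illustrative assumptions; the sketch is closer to a simple statistical baseline than to the learned, patented operator reported in the table.

```python
# Minimal sketch of the verification loop: embed -> apply operator -> re-classify.
# Random vectors stand in for sentence embeddings, and projecting out a single
# direction stands in for the operator. Names here are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 256

# Stand-in "bias direction" in embedding space.
bias_dir = rng.normal(size=dim)
bias_dir /= np.linalg.norm(bias_dir)

def remove_bias(embeddings: np.ndarray) -> np.ndarray:
    """Illustrative operator: project out the component along bias_dir."""
    return embeddings - np.outer(embeddings @ bias_dir, bias_dir)

# Synthetic neutral vs. biased embeddings; biased ones are shifted along bias_dir.
neutral = rng.normal(size=(2000, dim))
biased = rng.normal(size=(2000, dim)) + 3.0 * bias_dir

# Independent verifier, trained separately from the operator.
verifier = LogisticRegression(max_iter=1000).fit(
    np.vstack([neutral, biased]), np.array([0] * 2000 + [1] * 2000)
)

# Apply the operator to fresh biased sentences and check the verifier's verdict.
test = rng.normal(size=(1000, dim)) + 3.0 * bias_dir
correction_rate = (verifier.predict(remove_bias(test)) == 0).mean()
print(f"correction rate: {correction_rate:.1%}")
```

The point of the sketch is the protocol, not the operator: detection and correction happen in one step, and a separate classifier supplies the pass/fail verdict.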

Benchmark | Dimensions | Test pairs | Correction rate
BBQ (Bias Benchmark for Question Answering) | 7 | 6,864 | 100.0%
StereoSet (stereotype measurement dataset) | 8 | 6,010 | 100.0%
CrowS-Pairs (crowdsourced stereotype pairs) | 9 | 1,508 | 100.0%
WinoBias (gender bias in coreference resolution) | 4 | 1,584 | 100.0%
Combined | 28 | 15,966 | 100.0%

12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias

Age · Disability · Gender identity · Gender (occupational stereotype) · Gender (pronoun coreference) · Nationality · Physical appearance · Profession · Race / color · Religion · Sexual orientation · Socioeconomic status

For compliance teams

  • These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes) — which we conduct together during pilot.
  • Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
  • Results reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
8 protected groups · catches 9 in 10 hate posts · 5% false-positive rate · 2026-04-28

Hate Speech Detection

The model flags hate speech directed at 8 protected groups — Black, women, migrants, disabled, Jewish, Muslim, LGBT+, and POC — and catches 76–98% of it per group while wrongly flagging fewer than 1 in 20 clean posts. We never fine-tuned the model on hate data; a linear probe on frozen weights is all it takes. Proof is on HateCheck and CONAN, the two standard adversarial hate-speech benchmarks.

Trained and evaluated on HateCheck (Röttger et al. 2021) and CONAN (Fanton et al. 2021) — per-group linear probes, scored on a held-out 30% test split.

8
Protected groups covered
Black · women · migrants · disabled · Jewish · Muslim · LGBT+ · POC
76–98%
Hate caught per group
At 5% false-positive rate — our default threshold
<1 in 20
Clean posts flagged
False-positive rate at default threshold
0
Fine-tuning steps
Frozen backbone — linear probe only, 13KB weights per group

Per-group detection rates

One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at 5% false-positive rate. AUROC is the area under the ROC curve — 1.0 is perfect, 0.5 is random.

Group | Catch rate @ 5% FPR | AUROC | Test posts
Black people | 98.5% | 0.9908 | 66
Disabled people | 98.4% | 0.9875 | 123
Women | 90.0% | 0.9821 | 240
Migrants | 94.3% | 0.9810 | 246
Jewish people | 93.6% | 0.9807 | 109
Muslims | 88.4% | 0.9733 | 319
LGBT+ people | 81.1% | 0.9596 | 297
POC (other) | 76.0% | 0.9305 | 75
Generic (max-pool) | 30.1% | 0.8205 | —

The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
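Each row above comes from the same recipe: freeze the backbone, fit a logistic-regression probe on that group's HateCheck + CONAN examples, and read off AUROC and the catch rate at 5% FPR on the held-out 30% split. The sketch below follows that recipe; the embed placeholder and the toy labels are assumptions standing in for the real pipeline.

```python
# Sketch of one per-group probe: logistic regression on frozen embeddings,
# 70/30 split, AUROC and catch rate at 5% FPR. `embed` is a placeholder for
# the frozen backbone; texts and labels are toy stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

def embed(texts):
    # Placeholder for frozen-backbone sentence embeddings, shape (n, dim).
    return np.random.default_rng(0).normal(size=(len(texts), 256))

texts = [f"post {i}" for i in range(1000)]          # HateCheck + CONAN posts for ONE group
labels = np.array([i % 2 for i in range(1000)])     # 1 = hateful toward that group

X_train, X_test, y_train, y_test = train_test_split(
    embed(texts), labels, test_size=0.30, random_state=0, stratify=labels
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # ~13 KB of weights
scores = probe.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
catch_rate = tpr[np.searchsorted(fpr, 0.05, side="right") - 1]    # TPR at the last FPR <= 5%
print(f"AUROC {roc_auc_score(y_test, scores):.4f} · catch rate @ 5% FPR {catch_rate:.1%}")
```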

vs. published baselines (overall AUROC)

Model | Params | AUROC | Method
Bhala v7 (ours) | 15M | 0.9339 | Linear probe on frozen weights · no fine-tuning of any kind
Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw
Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline
HateBERT (Caselli 2020) | 110M | 0.85–0.88 | BERT-base fully fine-tuned on Reddit hate corpus
HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations

The decisive measurement

'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive hate scores roughly 9× apart. That use/mention distinction was already present in the frozen representation: the backbone never saw a hate-labeled example, and the probe on top is only a linear readout.

HateCheck functional breakdown

Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.

Statement type | Mean P(hate) | True label
Direct hate ('I hate X') | 0.866 | hateful
Slurs (raw) | 0.855 | hateful
Threats | 0.835 | hateful
Spell attacks (typos, leet) | 0.945 | hateful
Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful
Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful
Positive identity ('I love X') | 0.296 | non-hateful
Slur reclamation (in-group) | 0.308 | non-hateful
Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful
Profanity not directed at group | 0.230 | non-hateful
Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful
Negation ('I don't hate X') | 0.415 | non-hateful

Production threshold calibration

FPR target | TPR | Use case
1% | 64.5% | high-precision review queue
5% | 91.5% | default production threshold
10% | 97.6% | aggressive-recall mode
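These operating points come from calibrating a score threshold on labeled data: choose the threshold so the target share of clean posts lands above it, then measure how much hate clears the same bar. A minimal sketch with synthetic stand-in scores:

```python
# Sketch of threshold calibration: choose the probe-score threshold so that
# roughly `target_fpr` of clean posts score above it, then report the hate
# catch rate (TPR) at that threshold. Scores below are synthetic stand-ins.
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr):
    # (1 - target_fpr) quantile of scores on clean posts.
    return np.quantile(scores[labels == 0], 1.0 - target_fpr)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)                              # 1 = hateful
scores = np.clip(0.5 * labels + rng.normal(0.25, 0.2, size=5000), 0.0, 1.0)

for target in (0.01, 0.05, 0.10):
    t = threshold_for_fpr(scores, labels, target)
    tpr = (scores[labels == 1] >= t).mean()
    print(f"FPR target {target:.0%}: threshold {t:.3f} · TPR {tpr:.1%}")
```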

Caveats

  • HateCheck + CONAN are adversarial benchmark corpora — strong signal on robustness but not a substitute for live firehose validation. Bluesky production eval is in progress.
  • Per-group probes each use a separate linear head trained on that group's examples. The generic max-pool probe (AUROC 0.82) shows per-group specialization is worth the marginal cost.
  • Probe training takes ~5 minutes on a laptop CPU. Re-training against your own labeled data requires no GPU.
100% flip · Zulu + Swahili · 77–91% of fine-tuned SOTA

Sentiment Analysis

Two results here: the operator algebra (100% sentiment flip on held-out test data) and the AfriSenti classification benchmark (77–91% of fine-tuned SOTA with a frozen backbone). They measure different things — the flip measures structural control; AfriSenti measures downstream accuracy.

100%
Sentiment flip — isiZulu
547 held-out test sentences
100%
Sentiment flip — KiSwahili
Zero-shot. Model never saw Swahili.
77%
Sentiment flip — English
Zero-shot cross-family transfer

Sentiment Analysis (Cross-Language)

Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.

Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA
Swahili | 53.7% wF1 | 60.5% wF1 | 89%
Xitsonga | 50.0% wF1 | 54.9% wF1 | 91%
Igbo | 65.7% wF1 | 80.8% wF1 | 81%
Yoruba | 57.2% wF1 | 68.0% wF1 | 84%
Hausa | 61.9% wF1 | 80.9% wF1 | 77%

Operator algebra — flip accuracy

A named operator applied at inference time. An independent classifier verifies the shift took effect.

Task | Zulu | Swahili | English | Test cases
Sentiment shift (negative → positive) | 100% | 100% | — | 263
Intent redirect (12 categories) | — | 94% | 77% | 1,969
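Flip accuracy is scored by a closed loop: embed the negative test sentences, apply the named operator, and count how many an independently trained sentiment classifier now reads as positive. The sketch below illustrates that loop with stand-in functions; none of the names are our production API.

```python
# Toy sketch of how flip accuracy is scored: embed negative sentences, apply
# the named operator, and let an independent sentiment classifier judge the
# result. `embed`, `apply_operator` and `is_positive` are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dim = 256
pos_dir = rng.normal(size=dim)
pos_dir /= np.linalg.norm(pos_dir)

def embed(sentences):
    # Stand-in frozen encoder; negative sentences sit on the negative side.
    return rng.normal(size=(len(sentences), dim)) - 2.0 * pos_dir

def apply_operator(embeddings, name="shift_sentiment_positive"):
    # Stand-in for the named operator applied at inference time.
    return embeddings + 4.0 * pos_dir

def is_positive(embeddings):
    # Stand-in independent verifier: 1 = reads as positive sentiment.
    return (embeddings @ pos_dir > 0).astype(int)

negatives = [f"negative sentence {i}" for i in range(263)]
flip_accuracy = is_positive(apply_operator(embed(negatives))).mean()
print(f"flip accuracy: {flip_accuracy:.1%}")
```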

Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.

Beats GPT-4o · SOTA on 4 of 8 Bantu · 18× smaller

Intent Classification

Two benchmarks: MASSIVE (multilingual, 51 languages, 60 intents) and Injongo (8 Bantu languages, published SOTA comparison). In both cases Sozisi uses a frozen backbone with a lightweight probe head — no fine-tuning per language.

MASSIVE Swahili — the commercial case

60-intent benchmark. GPT-4o saw Swahili in its web pretraining corpus and InkubaLM was pretrained on it; Sozisi was pretrained on isiZulu only (true zero-shot) and still beats GPT-4o.

Model | Parameters | Score | Method
Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot)
GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus
InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web

Injongo — 8 Bantu languages, head-to-head

Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat it on 4 of 8 languages.

Sozisi (ours)
15M params
Frozen backbone, isiZulu pretraining only
AfroXLMR-76L
270M
Fine-tuned per target language
Efficiency
18×
Smaller model, matches or beats on 4 of 8 languages
Language | Sozisi | Public SOTA | SOTA model | Δ | Status
isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA
KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied
Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA
isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA
ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | Behind
Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | Behind
Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | Behind
Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | Behind

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%

40+ languages · 10 families · zero retraining

Cross-Lingual Transfer

Zero target-language training. Structural transfer from isiZulu to 17 languages across 10 families.

Every language below was absent from training. No fine-tuning. No retraining. Language adaptation takes under 2 seconds.

MASSIVE intent accuracy — zero-shot

Language | Family | Accuracy
Swahili | Bantu | 71.0%
Urdu | Indo-Aryan | 66.0%
Mongolian | Mongolic | 64.0%
Tagalog | Austronesian | 62.3%
Korean | Koreanic | 61.7%
Amharic | Semitic | 60.9%
Hindi | Indo-Aryan | 60.3%
Javanese | Austronesian | 58.6%
Japanese | Japonic | 56.5%
Tamil | Dravidian | 56.2%
Kannada | Dravidian | 53.6%
Telugu | Dravidian | 50.5%
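The protocol behind this table is deliberately simple: fit one intent probe on isiZulu embeddings from the frozen backbone, then apply the same probe, unchanged, to every target language. A minimal sketch follows, with the embed placeholder and toy utterances standing in for the real encoder and the MASSIVE splits.

```python
# Sketch of the zero-shot protocol: one probe, fit on isiZulu only, applied
# unchanged to every target language through the shared frozen encoder.
# `embed` and the toy utterances/labels are placeholders, not the MASSIVE pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(utterances, lang):
    # Placeholder for the shared frozen backbone; returns (n, dim) embeddings.
    seed = sum(map(ord, lang))
    return np.random.default_rng(seed).normal(size=(len(utterances), 256))

zulu_utterances = [f"isiZulu utterance {i}" for i in range(600)]
zulu_intents = np.arange(600) % 60                 # 60 MASSIVE intent classes (toy labels)

probe = LogisticRegression(max_iter=2000).fit(embed(zulu_utterances, "zul"), zulu_intents)

for lang in ("swa", "urd", "mon", "tgl"):          # never used to fit the probe
    utterances = [f"{lang} utterance {i}" for i in range(300)]
    gold = np.arange(300) % 60
    accuracy = (probe.predict(embed(utterances, lang)) == gold).mean()
    print(f"{lang}: zero-shot intent accuracy {accuracy:.1%}")
```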

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.

96.6%
Token Accuracy
77.7%
Span F1
78.2%
Precision
77.2%
Recall

See it on your data

Most pilots are live in under two weeks via REST API.