Benchmarks

Every claim, verifiable.

Reproducible numbers across bias removal, hate speech, sentiment, intent, and cross-lingual transfer. All benchmarks use public datasets. Every result is reproducible in under 90 seconds on a laptop GPU.

100% · 28 protected dimensions · 15,966 pairs · 2026-04-25

Bias Removal

We applied a single named operator — “remove bias” — to every sentence in four canonical English bias benchmarks. An independently trained classifier verified each result. Zero failures across 15,966 test cases and 28 protected dimensions.

The operator identifies and corrects bias in a single pass — 100% correction across 15,966 sentences means 100% were identified. Detection is implicit: the operator only shifts what it geometrically locates as biased, confirmed by an independent classifier post-shift.

The model was pretrained on isiZulu only. It has never seen English during training. The operator transfers because the geometry of bias is a structural property of the embedding space, not a language-specific pattern.
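For a concrete picture of the verification loop, here is a minimal numpy/scikit-learn sketch: embed, apply an operator, then let an independently trained classifier judge the result. The projection-style operator, the random stand-in embeddings, and every name below are illustrative assumptions; the sketch is closer to a simple statistical baseline than to the learned, patented operator reported in the table.

```python
# Minimal sketch of the verification loop: embed -> apply operator -> re-classify.
# Random vectors stand in for sentence embeddings, and projecting out a single
# direction stands in for the operator. Names here are illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 256

# Stand-in "bias direction" in embedding space.
bias_dir = rng.normal(size=dim)
bias_dir /= np.linalg.norm(bias_dir)

def remove_bias(embeddings: np.ndarray) -> np.ndarray:
    """Illustrative operator: project out the component along bias_dir."""
    return embeddings - np.outer(embeddings @ bias_dir, bias_dir)

# Synthetic neutral vs. biased embeddings; biased ones are shifted along bias_dir.
neutral = rng.normal(size=(2000, dim))
biased = rng.normal(size=(2000, dim)) + 3.0 * bias_dir

# Independent verifier, trained separately from the operator.
verifier = LogisticRegression(max_iter=1000).fit(
    np.vstack([neutral, biased]), np.array([0] * 2000 + [1] * 2000)
)

# Apply the operator to fresh biased sentences and check the verifier's verdict.
test = rng.normal(size=(1000, dim)) + 3.0 * bias_dir
correction_rate = (verifier.predict(remove_bias(test)) == 0).mean()
print(f"correction rate: {correction_rate:.1%}")
```

The point of the sketch is the protocol, not the operator: detection and correction happen in one step, and a separate classifier supplies the pass/fail verdict.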

Benchmark | Dimensions | Test pairs | Correction rate
BBQ (Bias Benchmark for Question Answering) | 7 | 6,864 | 100.0%
StereoSet (stereotype measurement dataset) | 8 | 6,010 | 100.0%
CrowS-Pairs (crowdsourced stereotype pairs) | 9 | 1,508 | 100.0%
WinoBias (gender bias in coreference resolution) | 4 | 1,584 | 100.0%
Combined | 28 | 15,966 | 100.0%

12 categories · 28 test dimensions across BBQ, StereoSet, CrowS-Pairs, WinoBias

Age · Disability · Gender identity · Gender (occupational stereotype) · Gender (pronoun coreference) · Nationality · Physical appearance · Profession · Race / color · Religion · Sexual orientation · Socioeconomic status

For compliance teams

  • These are published academic benchmarks. Production deployment into your bank or health system requires validation on your own internal text (loan memos, credit decisions, clinical notes) — which we conduct together during pilot.
  • Two bias-removal methods were tested on identical data: a statistical baseline (published 2016) and our patented learned method. Both are available in production. The statistical baseline achieved zero failures on all 28 categories.
  • Results reproducible by anyone with access to our model and the four public benchmarks. Full methodology is available under NDA.
8 protected groups · catches 9 in 10 hate posts · 5% false-positive rate · 2026-04-28

Hate Speech Detection

The model flags hate speech directed at 8 protected groups — Black, women, migrants, disabled, Jewish, Muslim, LGBT+, and POC — and catches 76–98% of it per group while wrongly flagging fewer than 1 in 20 clean posts. We never fine-tuned the model on hate data; a linear probe on frozen weights is all it takes. Proof is on HateCheck and CONAN, the two standard adversarial hate-speech benchmarks.

Trained and evaluated on HateCheck (Röttger et al. 2021) and CONAN (Fanton et al. 2021) — per-group linear probes, scored on a held-out 30% test split.

8
Protected groups covered
Black · women · migrants · disabled · Jewish · Muslim · LGBT+ · POC
76–98%
Hate caught per group
At 5% false-positive rate — our default threshold
<1 in 20
Clean posts flagged
False-positive rate at default threshold
0
Fine-tuning steps
Frozen backbone — linear probe only, 13KB weights per group

Per-group detection rates

One probe per group, trained on HateCheck + CONAN examples for that group. Catch rate = % of hate posts flagged at 5% false-positive rate. AUROC is the area under the ROC curve — 1.0 is perfect, 0.5 is random.

Group | Catch rate @ 5% FPR | AUROC | Test posts
Black people | 98.5% | 0.9908 | 66
Disabled people | 98.4% | 0.9875 | 123
Women | 90.0% | 0.9821 | 240
Migrants | 94.3% | 0.9810 | 246
Jewish people | 93.6% | 0.9807 | 109
Muslims | 88.4% | 0.9733 | 319
LGBT+ people | 81.1% | 0.9596 | 297
POC (other) | 76.0% | 0.9305 | 75
Generic (max-pool) | 30.1% | 0.8205 | —

The generic max-pool row is a single-probe aggregate baseline — per-group probes outperform it on every group.
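Each row above comes from the same recipe: freeze the backbone, fit a logistic-regression probe on that group's HateCheck + CONAN examples, and read off AUROC and the catch rate at 5% FPR on the held-out 30% split. The sketch below follows that recipe; the embed placeholder and the toy labels are assumptions standing in for the real pipeline.

```python
# Sketch of one per-group probe: logistic regression on frozen embeddings,
# 70/30 split, AUROC and catch rate at 5% FPR. `embed` is a placeholder for
# the frozen backbone; texts and labels are toy stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

def embed(texts):
    # Placeholder for frozen-backbone sentence embeddings, shape (n, dim).
    return np.random.default_rng(0).normal(size=(len(texts), 256))

texts = [f"post {i}" for i in range(1000)]          # HateCheck + CONAN posts for ONE group
labels = np.array([i % 2 for i in range(1000)])     # 1 = hateful toward that group

X_train, X_test, y_train, y_test = train_test_split(
    embed(texts), labels, test_size=0.30, random_state=0, stratify=labels
)

probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)   # ~13 KB of weights
scores = probe.predict_proba(X_test)[:, 1]

fpr, tpr, _ = roc_curve(y_test, scores)
catch_rate = tpr[np.searchsorted(fpr, 0.05, side="right") - 1]    # TPR at the last FPR <= 5%
print(f"AUROC {roc_auc_score(y_test, scores):.4f} · catch rate @ 5% FPR {catch_rate:.1%}")
```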

vs. published baselines (overall AUROC)

Model | Params | AUROC | Method
Bhala v7 (ours) | 15M | 0.9339 | Linear probe on frozen weights · no fine-tuning of any kind
Detoxify (Unitary) | 110M | 0.91 | RoBERTa fine-tuned on Civil Comments + Jigsaw
Perspective API | proprietary | ~0.87 | Google Jigsaw, commercial baseline
HateBERT (Caselli 2020) | 110M | 0.85–0.88 | BERT-base fully fine-tuned on Reddit hate corpus
HateXplain BERT | 110M | 0.83 | BERT-base fully fine-tuned with rationale annotations

The decisive measurement

'I hate X' (P=0.87) and 'Saying I hate X is bigoted' (P=0.10) share 80% of their surface tokens but receive hate scores roughly 9× apart. That use/mention distinction was already present in the frozen representation: the backbone never saw a hate-labeled example, and the probe on top is only a linear readout.

HateCheck functional breakdown

Mean P(hate) per statement type — shows what the model distinguishes, not just whether it scores correctly.

Statement type | Mean P(hate) | True label
Direct hate ('I hate X') | 0.866 | hateful
Slurs (raw) | 0.855 | hateful
Threats | 0.835 | hateful
Spell attacks (typos, leet) | 0.945 | hateful
Counter-speech (saying 'I hate X' is bigoted) | 0.099 | non-hateful
Counter-reference (saying hate is wrong, not using slur) | 0.223 | non-hateful
Positive identity ('I love X') | 0.296 | non-hateful
Slur reclamation (in-group) | 0.308 | non-hateful
Slur homonym ('dyke' as sea wall) | 0.268 | non-hateful
Profanity not directed at group | 0.230 | non-hateful
Hate at non-protected target ('I hate pizza') | 0.419 | non-hateful
Negation ('I don't hate X') | 0.415 | non-hateful

Production threshold calibration

FPR target | TPR | Use case
1% | 64.5% | high-precision review queue
5% | 91.5% | default production threshold
10% | 97.6% | aggressive-recall mode
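These operating points come from calibrating a score threshold on labeled data: choose the threshold so the target share of clean posts lands above it, then measure how much hate clears the same bar. A minimal sketch with synthetic stand-in scores:

```python
# Sketch of threshold calibration: choose the probe-score threshold so that
# roughly `target_fpr` of clean posts score above it, then report the hate
# catch rate (TPR) at that threshold. Scores below are synthetic stand-ins.
import numpy as np

def threshold_for_fpr(scores, labels, target_fpr):
    # (1 - target_fpr) quantile of scores on clean posts.
    return np.quantile(scores[labels == 0], 1.0 - target_fpr)

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)                              # 1 = hateful
scores = np.clip(0.5 * labels + rng.normal(0.25, 0.2, size=5000), 0.0, 1.0)

for target in (0.01, 0.05, 0.10):
    t = threshold_for_fpr(scores, labels, target)
    tpr = (scores[labels == 1] >= t).mean()
    print(f"FPR target {target:.0%}: threshold {t:.3f} · TPR {tpr:.1%}")
```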

Caveats

  • HateCheck + CONAN are adversarial benchmark corpora — strong signal on robustness but not a substitute for live firehose validation. Bluesky production eval is in progress.
  • Per-group probes each use a separate linear head trained on that group's examples. The generic max-pool probe (AUROC 0.82) shows per-group specialization is worth the marginal cost.
  • Probe training takes ~5 minutes on a laptop CPU. Re-training against your own labeled data requires no GPU.
100% flip · Zulu + Swahili · 77–91% of fine-tuned SOTA

Sentiment Analysis

Two results here: the operator algebra (100% sentiment flip on held-out test data) and the AfriSenti classification benchmark (77–91% of fine-tuned SOTA with a frozen backbone). They measure different things — the flip measures structural control; AfriSenti measures downstream accuracy.

100%
Sentiment flip — isiZulu
547 held-out test sentences
100%
Sentiment flip — KiSwahili
Zero-shot. Model never saw Swahili.
77%
Sentiment flip — English
Zero-shot cross-family transfer

Sentiment Analysis (Cross-Language)

Frozen backbone, no fine-tuning. 77–91% of fine-tuned SOTA on languages never seen in pretraining.

Language | Sozisi (frozen, 15M params) | SemEval SOTA (270M+, fine-tuned) | % of SOTA
Swahili | 53.7% wF1 | 60.5% wF1 | 89%
Xitsonga | 50.0% wF1 | 54.9% wF1 | 91%
Igbo | 65.7% wF1 | 80.8% wF1 | 81%
Yoruba | 57.2% wF1 | 68.0% wF1 | 84%
Hausa | 61.9% wF1 | 80.9% wF1 | 77%

Operator algebra — flip accuracy

A named operator applied at inference time. An independent classifier verifies the shift took effect.

Task | Zulu | Swahili | English | Test cases
Sentiment shift (negative → positive) | 100% | 100% | — | 263
Intent redirect (12 categories) | — | 94% | 77% | 1,969
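Flip accuracy is scored by a closed loop: embed the negative test sentences, apply the named operator, and count how many an independently trained sentiment classifier now reads as positive. The sketch below illustrates that loop with stand-in functions; none of the names are our production API.

```python
# Toy sketch of how flip accuracy is scored: embed negative sentences, apply
# the named operator, and let an independent sentiment classifier judge the
# result. `embed`, `apply_operator` and `is_positive` are illustrative stand-ins.
import numpy as np

rng = np.random.default_rng(0)
dim = 256
pos_dir = rng.normal(size=dim)
pos_dir /= np.linalg.norm(pos_dir)

def embed(sentences):
    # Stand-in frozen encoder; negative sentences sit on the negative side.
    return rng.normal(size=(len(sentences), dim)) - 2.0 * pos_dir

def apply_operator(embeddings, name="shift_sentiment_positive"):
    # Stand-in for the named operator applied at inference time.
    return embeddings + 4.0 * pos_dir

def is_positive(embeddings):
    # Stand-in independent verifier: 1 = reads as positive sentiment.
    return (embeddings @ pos_dir > 0).astype(int)

negatives = [f"negative sentence {i}" for i in range(263)]
flip_accuracy = is_positive(apply_operator(embed(negatives))).mean()
print(f"flip accuracy: {flip_accuracy:.1%}")
```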

Today's model was pretrained almost entirely on isiZulu. English results come from generalization — applying learned structure to a language the model never saw at scale. We are now training the English-native version, and expect English to match or exceed the 94% Swahili number.

Beats GPT-4o · SOTA on 4 of 8 Bantu · 18× smaller

Intent Classification

Two benchmarks: MASSIVE (multilingual, 51 languages, 60 intents) and Injongo (8 Bantu languages, published SOTA comparison). In both cases Sozisi uses a frozen backbone with a lightweight probe head — no fine-tuning per language.

MASSIVE Swahili — the commercial case

60-intent benchmark. GPT-4o saw Swahili in its web pretraining corpus and InkubaLM was pretrained on it; Sozisi was pretrained on isiZulu only (true zero-shot) and still beats GPT-4o.

Model | Parameters | Score | Method
Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only (true zero-shot)
GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus
InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web

Injongo — 8 Bantu languages, head-to-head

Sozisi (frozen backbone) vs AfroXLMR-76L (270M, fine-tuned per language). We match or beat it on 4 of 8 languages.

Sozisi (ours)
15M params
Frozen backbone, isiZulu pretraining only
AfroXLMR-76L
270M
Fine-tuned per target language
Efficiency
18×
Smaller model, matches or beats on 4 of 8 languages
Language | Sozisi | Public SOTA | SOTA model | Δ | Status
isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA
KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied
Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA
isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA
ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | Behind
Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | Behind
Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | Behind
Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | Behind

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%

40+ languages · 10 families · zero retraining

Cross-Lingual Transfer

Zero target-language training. Structural transfer from isiZulu to 17 languages across 10 families.

Every language below was absent from training. No fine-tuning. No retraining. Language adaptation takes under 2 seconds.

MASSIVE intent accuracy — zero-shot

Language | Family | Accuracy
Swahili | Bantu | 71.0%
Urdu | Indo-Aryan | 66.0%
Mongolian | Mongolic | 64.0%
Tagalog | Austronesian | 62.3%
Korean | Koreanic | 61.7%
Amharic | Semitic | 60.9%
Hindi | Indo-Aryan | 60.3%
Javanese | Austronesian | 58.6%
Japanese | Japonic | 56.5%
Tamil | Dravidian | 56.2%
Kannada | Dravidian | 53.6%
Telugu | Dravidian | 50.5%
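The protocol behind this table is deliberately simple: fit one intent probe on isiZulu embeddings from the frozen backbone, then apply the same probe, unchanged, to every target language. A minimal sketch follows, with the embed placeholder and toy utterances standing in for the real encoder and the MASSIVE splits.

```python
# Sketch of the zero-shot protocol: one probe, fit on isiZulu only, applied
# unchanged to every target language through the shared frozen encoder.
# `embed` and the toy utterances/labels are placeholders, not the MASSIVE pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed(utterances, lang):
    # Placeholder for the shared frozen backbone; returns (n, dim) embeddings.
    seed = sum(map(ord, lang))
    return np.random.default_rng(seed).normal(size=(len(utterances), 256))

zulu_utterances = [f"isiZulu utterance {i}" for i in range(600)]
zulu_intents = np.arange(600) % 60                 # 60 MASSIVE intent classes (toy labels)

probe = LogisticRegression(max_iter=2000).fit(embed(zulu_utterances, "zul"), zulu_intents)

for lang in ("swa", "urd", "mon", "tgl"):          # never used to fit the probe
    utterances = [f"{lang} utterance {i}" for i in range(300)]
    gold = np.arange(300) % 60
    accuracy = (probe.predict(embed(utterances, lang)) == gold).mean()
    print(f"{lang}: zero-shot intent accuracy {accuracy:.1%}")
```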

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen backbone + CRF head. 96.6% token accuracy across people, places, organizations, and dates.

96.6%
Token Accuracy
77.7%
Span F1
78.2%
Precision
77.2%
Recall

See it on your data

Most pilots are live in under two weeks via REST API.