We audited six open LLMs for bias. The 0.4B model beat every 7B we tested — and RLHF wasn't the reason.
The short version
We took the 28-axis fairness battery from our Gemma 4 audit — BBQ, StereoSet, CrowS-Pairs, WinoBias, ~16,000 sentence pairs — and applied the output-level version of the same probe to six open-weight LLMs: LLaMA-2 7B base and chat, Mistral 7B base and instruct, Phi-2 2.7B, and InkubaLM 0.4B.
The metric is pseudo-log-likelihood. For each pair (stereotype, anti-stereotype) we ask: which version does the model assign higher probability to? Average over 50 pairs per axis, 28 axes, and you get a stereotype-preference rate per model — how often the model's own output distribution prefers the stereotyped completion. A perfectly clean model sits at exactly 0.500 (random). A maximally biased one sits at 1.0 (always stereotype) or 0.0 (always anti).
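In code, the per-pair check is just two mean-log-prob calls and a comparison. A minimal sketch, assuming a Hugging Face causal LM (the checkpoint name and the tiny `pairs` list are placeholders, not our audit harness):

```python
# Minimal sketch of the pairwise check: mean log P(token | left context) for
# each sentence, then "which one does the model prefer?" Checkpoint name and
# the `pairs` list are illustrative placeholders, not the audit harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lelapa/InkubaLM-0.4B"  # example; some checkpoints need trust_remote_code=True
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()

def mean_logprob(sentence: str) -> float:
    """Mean log-probability of each token given everything to its left."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)            # predict token t+1 from its prefix
    picked = logprobs.gather(1, ids[0, 1:].unsqueeze(1)).squeeze(1)
    return picked.mean().item()

pairs = [("<stereotype sentence>", "<anti-stereotype sentence>")]    # 50 per axis in the audit
prefers_stereo = [mean_logprob(s) > mean_logprob(a) for s, a in pairs]
rate = sum(prefers_stereo) / len(pairs)   # 0.5 = random, 1.0 = always prefers the stereotype
```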
InkubaLM 0.4B has fewer than half the biased axes of any 7B Western model we tested. Size and RLHF don't explain the gap.
| Model | Params | Stereotype-pref rate | Biased axes (≥60% pref) |
|---|---|---|---|
| InkubaLM 0.4B | 0.4B | 0.550 | 7 / 28 |
| LLaMA-2 7B base | 7B | 0.599 | 15 / 28 |
| LLaMA-2 7B chat | 7B | 0.611 | 16 / 28 |
| Mistral 7B base | 7B | 0.611 | 17 / 28 |
| Phi-2 2.7B | 2.7B | 0.612 | 17 / 28 |
| Mistral 7B instruct | 7B | 0.618 | 18 / 28 |
InkubaLM's preference rate sits 5 points above random; the Western models (Phi-2 included) sit 10–12 points above. On the count of axes where the model preferred the stereotype version on at least 60% of pairs, the Western models cluster at 15–18. InkubaLM is at 7.
Four things worth saying about this — including a frank note on what the audit can't tell you.
Update (May 1) — encoder-probe results just landed for InkubaLM
When this post first went up we said the encoder-level audit was "coming soon" for the non-Gemma models. We've just finished it for InkubaLM, the most interesting case (cleanest output, smallest model). The result inverts the surface story:
At the encoder level, InkubaLM has bias on 28 of 28 axes — the same count Gemma 4 produced. Output cleanness does not translate to encoder cleanness.
We ran the same 8-lens geometric probe described in the Gemma 4 audit — linear probe, MLP probe, whitened linear, operator-direction consistency, centroid distance, subspace residual, effective rank, Bhattacharyya overlap — across all 28 axes, every probed layer.
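For concreteness, the simplest of those lenses, the per-layer linear probe, looks roughly like the sketch below. The checkpoint, mean-pooling choice, and train/test split are illustrative assumptions, not the audit's exact setup:

```python
# Rough sketch of one lens: a per-layer linear probe that tries to separate
# stereotype from anti-stereotype sentences using the model's hidden states.
# Checkpoint, pooling, and split are assumptions for illustration only.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lelapa/InkubaLM-0.4B"  # example checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True).eval()

def embed(sentence: str, layer: int) -> np.ndarray:
    """Mean-pooled hidden state at one layer (pooling choice is an assumption)."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        h = model(ids).hidden_states[layer][0]  # [T, d]
    return h.mean(dim=0).numpy()

def linear_probe_accuracy(stereo: list[str], anti: list[str], layer: int) -> float:
    X = np.stack([embed(s, layer) for s in stereo + anti])
    y = np.array([1] * len(stereo) + [0] * len(anti))
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=2000).fit(Xtr, ytr)
    return clf.score(Xte, yte)  # ~0.5 = no linear structure, ~1.0 = linearly separable

# Sweeping `layer` over every probed layer gives the layer-wise profiles
# (early-layer peak vs. final-layer collapse) discussed below.
```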
The three regimes match Gemma's almost line-for-line:
| Benchmark family | What fired | Reading |
|---|---|---|
| BBQ (7 axes) | 5 of 7 axes hit linear + MLP at 0.99–1.00 throughout; the other two (Religion, Sexual_orientation) show a nonlinear gap — Religion lin 0.84 / mlp 0.99 (+16pp), Sexual_orientation lin 0.77 / mlp 1.00 (+22pp) | Pair-level + curved structure — same fingerprint as Gemma 4 |
| CrowS-Pairs (9 axes) | Linear near chance, subspace residual 0.70–0.81 — anti-stereotype embeddings live in a different manifold from stereotype | Manifold-separated, invisible to a standard linear-probe audit. CrowS race-color: subspace 0.81 |
| StereoSet + WinoBias (12 axes) | Linear probe below chance on WinoBias (0.10–0.12), but subspace residual 0.71–0.76 | Pure manifold-level encoding. The same "linear probe gives a green light, subspace says structurally severe" pattern Gemma 4 showed on race-color |
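The "manifold-separated" readings in the CrowS-Pairs and StereoSet/WinoBias rows come from the subspace-residual lens. As a rough illustration of what that kind of check can look like (a sketch under our own assumptions about rank and inputs, not the audit's exact implementation), fit a low-rank subspace to one class's embeddings and measure how much of the other class it fails to reconstruct:

```python
# Illustrative subspace-residual check: fit a low-rank PCA subspace to the
# stereotype embeddings and ask how much of the anti-stereotype embeddings'
# variance it fails to explain (and vice versa). A high residual means the two
# classes live on different manifolds even when a linear probe sees nothing.
# The rank and the embedding matrices are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA

def subspace_residual(A: np.ndarray, B: np.ndarray, rank: int = 8) -> float:
    """Fraction of B's variance unexplained by the top-`rank` subspace fit on A."""
    pca = PCA(n_components=rank).fit(A)
    B_centered = B - pca.mean_
    B_projected = B_centered @ pca.components_.T @ pca.components_
    residual = np.linalg.norm(B_centered - B_projected) ** 2
    total = np.linalg.norm(B_centered) ** 2
    return residual / total  # ~0.0 = same subspace, ~1.0 = fully off-manifold

# X_stereo, X_anti: [n_pairs, d] arrays of sentence embeddings for one axis and layer
# score = max(subspace_residual(X_stereo, X_anti), subspace_residual(X_anti, X_stereo))
```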
One contrast worth flagging: Gemma 4's BBQ axes show RLHF suppression — the linear probe peaks at 0.80–0.95 in early layers, then collapses to 0.26–0.35 by the final layer. InkubaLM's BBQ axes show no such collapse — the linear probe stays at its peak of 0.77–1.00 through the final layer (depending on axis). That's consistent with InkubaLM not having gone through the same RLHF regime. The bias structure is the same; the surface polish that hides it from output-level audits isn't. That's why InkubaLM looks cleanest at the output level: no RLHF surface polish, but also no RLHF worsening the encoder geometry beyond what pretraining wrote in.
The deployment implication is the sharper version of the point made in the next section: if you tap any of these encoders for retrieval, semantic search, or classifier features, you inherit the encoded bias. The output-level table at the top of this post tells you which model is least likely to say something biased. The encoder-level result tells you no model in the test escapes encoded bias entirely.
LLaMA-2 / Mistral / Phi-2 encoder probes are queued and will be added when they finish. Based on the Gemma + InkubaLM pattern — two models with very different sizes, languages, and alignment regimes producing the same 28/28 BIAS_PRESENT verdict — we expect the remaining models to land in the same place. Open the audit JSON yourself: logs/inkubalm_full_algebra.json (mirrored under the existing Gemma audit dataset).
What this audit is measuring (and what the Gemma post measured)
The Gemma 4 post measured encoder geometry: linear and MLP probes applied to hidden states at every layer. That's what the model knows internally, regardless of what it chooses to say.
This post measures what the model says — pseudo-log-likelihood of stereotype vs anti-stereotype completions. That's where RLHF and alignment training have the most leverage. It's the level at which a model gets called "biased" or "clean" in most published audits.
These are complementary tests of the same 28 axes. The Gemma work found that for Gemma 4, the output level looked clean while the encoder-level structure was intact across all 28 axes. That distinction matters because RAG, semantic search, classifier deployments, and any embedding-flavored downstream pipeline tap the encoder directly — bypassing the output gate entirely. A model that passes a PLL audit can still be unsafe when its hidden states get pulled into a retrieval index.
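Concretely, "bypassing the output gate" looks like this in a typical retrieval setup: documents are ranked by similarity between pooled hidden states, and no token is ever generated. A minimal sketch, with the checkpoint and pooling choice as illustrative assumptions:

```python
# Sketch of an embedding-flavored pipeline tapping the encoder directly:
# documents are ranked by cosine similarity between mean-pooled hidden states.
# No token is ever generated, so output-level behavior never comes into play.
# Checkpoint and pooling choice are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lelapa/InkubaLM-0.4B"  # example; the pattern is the same for any audited model
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, output_hidden_states=True).eval()

def embed(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        h = model(ids).hidden_states[-1][0]   # final-layer hidden states, [T, d]
    v = h.mean(dim=0)
    return v / v.norm()

corpus = ["document one ...", "document two ..."]        # your retrieval corpus
index = torch.stack([embed(d) for d in corpus])          # the "retrieval index"
query_vec = embed("a user query")
ranking = (index @ query_vec).argsort(descending=True)   # biased geometry = biased ranking
```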
For the five models in this post other than InkubaLM, the encoder-level audit is rolling out one at a time. InkubaLM finished first — see the update section above. It produced the same 28/28 BIAS_PRESENT verdict as Gemma 4 despite being the cleanest model at the output level. LLaMA-2, Mistral, and Phi-2 encoder probes are queued. Treat the table above as necessary but not sufficient: a clean output level is a precondition for safe deployment, not proof of safety, especially for embedding-based systems.
Why is InkubaLM cleaner? Three candidates, weighed honestly.
Candidate 1: smaller models encode less bias because they have less capacity.
If this were the explanation, Phi-2 (2.7B, ~7× InkubaLM's size, well below the 7B group) should sit between InkubaLM and the 7B models. It doesn't. Phi-2 is at 0.612 — indistinguishable from the 7B group, far above InkubaLM. Reject.
Candidate 2: RLHF cleans up output bias.
RLHF should push stereotype preference down — that's the alignment story. Looking at base-vs-tuned pairs:
- LLaMA-2 7B base 0.599 → chat 0.611 (+0.012, worse)
- Mistral 7B base 0.611 → instruct 0.618 (+0.007, worse)
Both chat-tuned models are slightly more stereotype-preferring than their bases. That doesn't mean RLHF made them more biased in the way the world cares about — RLHF shapes what the model actually says in conversation, which isn't quite what PLL measures. But it does mean RLHF as currently practiced isn't moving stereotype-preference rates downward in any consistent way at the level a fairness benchmark scores. Doesn't explain the InkubaLM gap.
Candidate 3: pretraining data composition.
InkubaLM's pretraining mix was scoped to African languages — substantially different statistical regularities than Common Crawl-heavy Western training. Stereotypes, when they're learned, are downstream of which patterns occur frequently in training. A more specialized, geographically and linguistically narrower corpus produces a different bias surface. InkubaLM is biased on 25% of axes (7 of 28); the Western group is biased on 54–64% (15–18 of 28). That roughly 30–40-percentage-point gap (in the fraction of axes flagged) is far larger than RLHF's marginal effect (about a point on the stereotype-preference rate, and in the wrong direction) and orthogonal to scale (Phi-2 doesn't help).
This is the most defensible explanation. It also reverses a common assumption in alignment work — that the path to safer models runs through bigger pretraining + better RLHF on top. The data here suggests pretraining composition matters more than either pretraining scale or post-hoc alignment, at least on the metric we can directly measure.
What this means for African NLP teams (briefly)
If you're shipping intent, sentiment, or other classification for Bantu languages in production, the practical question is whether a small purpose-built encoder can do the job — not which large Western LLM is least bad. Our BENCHMARKS page has the head-to-head numbers; the short version is that for Bantu specifically, frozen-encoder probing on a 15M-parameter model trained on the right data is competitive with or beats both 0.4B African and 7B+ Western LLMs on intent and sentiment.
That's a deployment-pattern observation, not a competitive flex. The interesting research finding here is the data-composition story — which would be just as true if a different team had built InkubaLM, or if Bhala's encoder didn't exist.
What this audit can't tell you — and what's next
The output-level audit is necessary but not sufficient. A model with a clean PLL profile can still have bias structurally encoded in its representations (we know this is the case for Gemma 4). If your product wraps any of the audited models for retrieval, semantic search, or classifier features, you need the encoder-level audit too. Those runs are queued.
The 50-pair-per-axis sample is small. All six runs use n=50 from each of the 28 source benchmarks. That's enough to see large effects (the InkubaLM gap is about 8× larger than the run-to-run noise) but not enough to support fine-grained per-axis claims.
The instruction-tuned variants tell a confusing story. Both LLaMA-2 chat and Mistral instruct came out marginally worse than their bases on this metric. We don't have a confident explanation; possibilities include the instruct templates interacting with the PLL scoring, the chat models being slightly more confident in their next-token predictions in general, or genuine RLHF non-monotonicity. Worth a separate post once we've controlled for the templating effect.
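One way to control for the templating confound, sketched under assumptions (the model id, the user-turn wording, and the idea of scoring the assistant turn are ours, not an established protocol): score each pair twice, once on the raw sentence and once wrapped in the instruct model's own chat template, and check whether the base-vs-instruct gap survives.

```python
# Hedged sketch of a templating control: wrap each sentence in the instruct
# model's chat template before PLL scoring, so the scorer sees the special
# tokens the model was tuned on. Model id and the user-turn wording are
# illustrative assumptions, not an established protocol.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example instruct model

def wrap_in_template(sentence: str) -> str:
    messages = [
        {"role": "user", "content": "Complete the sentence."},
        {"role": "assistant", "content": sentence},
    ]
    return tok.apply_chat_template(messages, tokenize=False)

raw = "<stereotype or anti-stereotype sentence>"
templated = wrap_in_template(raw)
# Score both `raw` and `templated` with the same mean-log-prob function used for
# the headline table, then compare preference rates under the two renderings.
```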
Reproduce
The 28-axis output-level audit is a few hundred lines of Python; the methodology is published with the Gemma 4 audit. The benchmark sources are public:
- BBQ — Parrish et al. 2022
- CrowS-Pairs — Nangia et al. 2020
- StereoSet — Nadeem et al. 2021
- WinoBias — Zhao et al. 2018
Per-model audit JSON (28-axis stereotype-preference rates, mean log-prob gaps, per-axis breakdowns) will be released alongside the encoder-probe follow-up. If you want to run the same audit on your own deployment, talk to us.
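For readers who want to sanity-check the headline table once that JSON is out, the aggregation itself is small. A sketch under assumed field names (the real schema and filename may differ):

```python
# Hedged sketch of the aggregation behind the headline table. The filename and
# JSON layout below are assumptions; the released schema may differ.
import json

def summarize(path: str, threshold: float = 0.60):
    with open(path) as f:
        per_pair = json.load(f)                      # assumed: {axis: [bool, ...] preferences}
    per_axis = {ax: sum(p) / len(p) for ax, p in per_pair.items()}
    overall = sum(per_axis.values()) / len(per_axis) # stereotype-preference rate (equal n per axis)
    biased_axes = [ax for ax, r in per_axis.items() if r >= threshold]  # "biased axes (>=60% pref)"
    return overall, len(biased_axes), per_axis

# rate, n_biased, _ = summarize("logs/inkubalm_output_audit.json")  # hypothetical filename
```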
This work was conducted by the Bhala research team in late April 2026. Inference for the LLaMA, Mistral, and Phi audits ran on vLLM; InkubaLM ran on local transformers due to a known issue with its custom attention path. The PLL math is identical across both engines (mean log P(token | left context)) and the per-axis numbers are reported from each model's own audit JSON without rescaling.
Continue reading
A 15M Zulu-only model beats GPT-4o on Swahili — and understands Korean without ever seeing it
A 15M-parameter encoder pretrained on isiZulu — and nothing else — reaches 73.2% on Swahili intent (above GPT-4o zero-shot at 70.6%) and 72.5% on Korean using only a linear probe on the frozen encoder. Korean has nothing in common with Zulu — different family, different script, never seen in pretraining. By the strictest version of the field's gold-standard test (frozen encoder + linear probe + zero target-language data), this is the strongest published cross-lingual transfer result we know of.
Silver labels are noisy by design. Bhala's audit catches the worst of them — top-10 precision: 100%.
Every production NLP team is sitting on silver-labeled training data — auto-tagged at scale, noisy by design. Bhala's audit tool surfaces the real mislabels in those corpora using just 100 hand-curated seeds and zero sentiment supervision. Top-10 precision on held-out validation: 100%. AUROC: 0.732. The same seeds curated in one Bantu language transfer cross-lingually to surface clear errors and policy-boundary cases in another Bantu language with no extra supervision. The product play: 5–10× reviewer-time multiplier across the AI lifecycle.
We audited Gemma 4. The bias didn't go away — it went into hiding.
Standard fairness audits call Gemma 4 clean. We ran a stronger one and found bias intact in all 28 protected dimensions we tested. Here's what it means for your deployment, how to audit any open model the same way, and a live API you can paste into a terminal right now to flip a biased sentence.