·Bhala AI Team·9 min read

RLHF makes LLM bias invisible. Here's what it actually looks like underneath.


The short version

This post is a tour. We're going to show you what bias actually looks like inside an LLM — not what the model says, what its encoder encodes. Most fairness audits stop at the output: ask the model a leading question, see what it says, score it on a stereotype scale. That tells you what the model has been trained to say. It doesn't tell you what's still there underneath.

We audited six open-weight LLMs at the output level: LLaMA-2 7B (base and chat), Mistral 7B (base and instruct), Phi-2 2.7B, and InkubaLM 0.4B. So far we have also run the encoder-level probe on one of them, InkubaLM; the other five are queued. First the surface: what the models say. Then we go inside, layer by layer, and show you the geometry the model uses to decide. The two pictures don't match. InkubaLM, the model that looks cleanest at the output, shows bias on all 28 axes at the encoder, the same count as Gemma 4 in our earlier audit. RLHF is doing a real thing, but it isn't what most people think it's doing.

We started at the output, with the standard pseudo-log-likelihood (PLL) test. For each (stereotype, anti-stereotype) sentence pair, we ask: which one does the model assign higher probability to? Averaging over 50 pairs per axis across 28 axes gives a stereotype-preference rate per model. A model with no preference sits at 0.500 (chance); a maximally biased one sits at 1.0.
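A minimal sketch of that scoring step, assuming the per-token log-probabilities have already been pulled from the model. The function names are hypothetical, the sentence score follows the mean log P(token | left context) described in the footnote, and counting ties as half a win is an assumption rather than something the post specifies.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float]]  # (stereotype, anti-stereotype) token log-probs


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalised sentence score: mean of log P(token | left context)."""
    if not token_logprobs:
        raise ValueError("sentence has no scored tokens")
    return sum(token_logprobs) / len(token_logprobs)


def stereotype_preference_rate(pairs: List[Pair]) -> float:
    """Fraction of pairs where the stereotype sentence scores higher.

    Ties count as half a win (an assumption for this sketch).
    """
    if not pairs:
        raise ValueError("no pairs to score")
    wins = 0.0
    for stereo, anti in pairs:
        s, a = mean_logprob(stereo), mean_logprob(anti)
        if s > a:
            wins += 1.0
        elif s == a:
            wins += 0.5
    return wins / len(pairs)


# Toy log-probabilities, invented for illustration only.
toy_pairs: List[Pair] = [
    ([-1.0, -2.0], [-2.0, -3.0]),        # stereotype scores higher
    ([-3.0, -3.0], [-1.0, -1.0]),        # anti-stereotype scores higher
    ([-0.5, -0.5, -0.5], [-0.5, -1.5]),  # stereotype scores higher
    ([-2.0], [-1.0, -1.0, -1.0]),        # anti-stereotype scores higher
]
rate = stereotype_preference_rate(toy_pairs)
print(rate)  # 0.5: two of four pairs prefer the stereotype
```

Normalising by length keeps a longer sentence from being penalised simply for having more tokens, which matters when the two sentences in a pair differ in length.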

At the surface, InkubaLM 0.4B looks cleanest — fewer than half the biased axes of any larger model we tested. Hold that thought, because at the encoder level, the picture inverts.

| Model | Params | Stereotype-pref rate | Biased axes (≥60% pref) |
|---|---|---|---|
| InkubaLM 0.4B | 0.4B | 0.550 | 7 / 28 |
| LLaMA-2 7B base | 7B | 0.599 | 15 / 28 |
| LLaMA-2 7B chat | 7B | 0.611 | 16 / 28 |
| Mistral 7B base | 7B | 0.611 | 17 / 28 |
| Phi-2 2.7B | 2.7B | 0.612 | 17 / 28 |
| Mistral 7B instruct | 7B | 0.618 | 18 / 28 |

InkubaLM's preference rate sits 5 points above random; the larger models sit 10–12 points above. On the count of axes where the model preferred the stereotype version on at least 60% of pairs, the larger models cluster at 15–18. InkubaLM is at 7.
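The two summary numbers in this paragraph can be sketched as follows. The helper names and the per-axis rates are hypothetical; only the 0.60 threshold and the overall rates come from the table.

```python
from typing import Dict, List


def flag_biased_axes(axis_rates: Dict[str, float], threshold: float = 0.60) -> List[str]:
    """Axes where the stereotype version was preferred on at least `threshold` of pairs."""
    return sorted(axis for axis, rate in axis_rates.items() if rate >= threshold)


def points_above_chance(rate: float, chance: float = 0.5) -> float:
    """Distance from the chance rate, in percentage points."""
    return round((rate - chance) * 100, 1)


# Hypothetical per-axis rates, invented for illustration only.
example_axes = {"axis_a": 0.62, "axis_b": 0.48, "axis_c": 0.60, "axis_d": 0.55}
flagged = flag_biased_axes(example_axes)
print(flagged)  # ['axis_a', 'axis_c']

# Overall rates taken from the table above.
print(points_above_chance(0.550))  # 5.0  (InkubaLM)
print(points_above_chance(0.599))  # 9.9  (LLaMA-2 7B base)
print(points_above_chance(0.618))  # 11.8 (Mistral 7B instruct)
```

With 50 pairs per axis, the 60% threshold corresponds to the stereotype sentence winning at least 30 of the 50 comparisons.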

But that's only what the models say. We also probed what they encode, and the picture flips.

What the encoder tells you that the output doesn't

When we ran the encoder-level probe on InkubaLM — the same 8-lens geometric audit (linear probe, MLP probe, whitened linear, operator-direction consistency, centroid distance, subspace residual, effective rank, Bhattacharyya overlap) we used on Gemma 4 — the surface story inverted.

These eight lenses are mostly well-established interpretability techniques: linear and MLP probes (Alain & Bengio 2016), whitened-linear variants (Hewitt & Liang 2019), subspace residual via PCA, effective rank (Roy & Vetterli 2007), and Bhattacharyya overlap (1943). The contribution is the audit framework — applying all eight systematically across 28 axes and every probed layer, and the rules for reading them together (linear-vs-MLP gap = nonlinear bias structure, high subspace residual = manifold-separated bias the linear probe can't see). Anyone with access to the model can reimplement and verify.
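As an illustration of one of these lenses, here is a minimal NumPy sketch of effective rank in the sense of Roy & Vetterli (2007): the exponential of the Shannon entropy of the normalised singular values. Centring the hidden states first and the relative cutoff for near-zero singular values are assumptions of this sketch, not details confirmed by the audit.

```python
import numpy as np


def effective_rank(hidden_states: np.ndarray) -> float:
    """Effective rank of an (n_samples, n_dims) matrix.

    exp(H(p)) where p_i = sigma_i / sum_j sigma_j over the singular values.
    """
    centred = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    sigma = np.linalg.svd(centred, compute_uv=False)
    if sigma.size == 0 or sigma.max() == 0.0:
        return 0.0  # all points identical: no spread to measure
    sigma = sigma[sigma > sigma.max() * 1e-12]  # drop numerically zero directions
    p = sigma / sigma.sum()
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))


# Variance spread evenly over 4 directions gives an effective rank of 4.
isotropic = np.vstack([np.eye(4), -np.eye(4)])
# All variance along a single direction gives an effective rank of 1.
rank_one = np.outer(np.arange(-3.0, 4.0), np.array([1.0, 2.0, 3.0]))

print(effective_rank(isotropic))
print(effective_rank(rank_one))
```

A low effective rank at a layer means the representations for that axis are concentrated in a few directions; a high one means the variance is spread across many.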

The result:

At the encoder level, InkubaLM has bias on 28 of 28 axes — the same count Gemma 4 produced. Output cleanness does not translate to encoder cleanness.

The three regimes match Gemma's almost line-for-line:

| Benchmark family | What fired | Reading |
|---|---|---|
| BBQ (7 axes) | 5 of 7 axes hit linear + MLP at 0.99–1.00 throughout; the other two (Religion, Sexual_orientation) show a nonlinear gap: Religion lin 0.84 / mlp 0.99 (+16pp), Sexual_orientation lin 0.77 / mlp 1.00 (+22pp) | Pair-level + curved structure; same fingerprint as Gemma 4 |
| CrowS-Pairs (9 axes) | Linear near chance, subspace residual 0.70–0.81; anti-stereotype embeddings live in a different manifold from stereotype | Manifold-separated, invisible to a standard linear-probe audit. CrowS race-color: subspace 0.81 |
| StereoSet + WinoBias (12 axes) | Linear probe below chance on WinoBias (0.10–0.12), but subspace residual 0.71–0.76 | Pure manifold-level encoding. The same "linear probe gives a green light, subspace says structurally severe" pattern Gemma 4 showed on race-color |
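To make the "subspace residual" column concrete, here is one plausible way to compute such a quantity with PCA. This is an assumption for illustration, not the audit's confirmed definition: fit the top-k principal directions on one group's embeddings, then measure what fraction of the other group's squared norm falls outside that subspace. The function name and the choice of k are hypothetical.

```python
import numpy as np


def subspace_residual(reference: np.ndarray, query: np.ndarray, k: int) -> float:
    """Fraction of `query` energy lying outside the top-k PCA subspace of `reference`.

    0.0 means the query points sit entirely inside the reference subspace;
    1.0 means they are orthogonal to it.
    """
    mean = reference.mean(axis=0, keepdims=True)
    # Rows of vt are orthonormal principal directions, strongest first.
    _, _, vt = np.linalg.svd(reference - mean, full_matrices=False)
    basis = vt[:k]
    centred = query - mean
    projected = centred @ basis.T @ basis
    total = float(np.sum(centred ** 2))
    if total == 0.0:
        return 0.0
    return float(np.sum((centred - projected) ** 2) / total)


# Toy reference set that lives in the x-y plane of a 3-d space.
reference = np.array([[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0],
                      [0.0, 2.0, 0.0], [0.0, -2.0, 0.0]])

in_plane = subspace_residual(reference, np.array([[3.0, 1.0, 0.0]]), k=2)
off_plane = subspace_residual(reference, np.array([[0.0, 0.0, 5.0]]), k=2)
mixed = subspace_residual(reference, np.array([[1.0, 0.0, 1.0]]), k=2)
print(in_plane, off_plane, mixed)
```

Under this definition a group can be linearly inseparable along any single direction and still score a high residual, which is the kind of pattern the CrowS-Pairs and WinoBias rows describe.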

One contrast worth flagging: Gemma 4's BBQ axes show RLHF suppression — linear probe peaks at 0.80–0.95 in early layers, then collapses to 0.26–0.35 by the final layer. InkubaLM's BBQ axes show no such collapse — linear probe stays at peak 0.77–1.00 through the final layer (depending on axis). That's consistent with InkubaLM not having gone through the same RLHF regime. The bias structure is the same; the surface polish that hides it from output-level audits isn't. That's why InkubaLM looks cleanest at the output (no RLHF surface effort, but also no RLHF worsening the encoder geometry beyond what pretraining wrote in).
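The contrast between the two layer-wise profiles can be summarised with a simple peak-versus-final check. The sketch below is illustrative only: the function name, the 0.3 drop threshold, and both accuracy curves are hypothetical and merely shaped like the ranges quoted above.

```python
from typing import Dict, List, Union


def probe_collapse(layer_accuracy: List[float],
                   min_drop: float = 0.3) -> Dict[str, Union[int, float, bool]]:
    """Compare peak linear-probe accuracy with the final layer's accuracy."""
    if not layer_accuracy:
        raise ValueError("need at least one layer")
    peak = max(layer_accuracy)
    final = layer_accuracy[-1]
    drop = round(peak - final, 3)
    return {
        "peak": peak,
        "peak_layer": layer_accuracy.index(peak),  # first layer reaching the peak
        "final": final,
        "drop": drop,
        "collapsed": drop >= min_drop,
    }


# Hypothetical curves: an early peak followed by a collapse, and a flat profile.
suppressed_profile = [0.70, 0.90, 0.85, 0.50, 0.30]
flat_profile = [0.80, 0.95, 0.99, 0.99]

print(probe_collapse(suppressed_profile))
print(probe_collapse(flat_profile))
```

A collapsed profile only says the information stops being linearly readable near the output; it does not show that earlier layers, which are what an embedding pipeline taps, are any cleaner.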

The deployment implication is sharp: if you tap any of these encoders for retrieval, semantic search, or classifier features, you inherit the encoded bias regardless of how clean the output looks. The output-level table tells you which model is least likely to say something biased. The encoder-level result tells you that neither model probed so far escapes encoded bias. The LLaMA-2, Mistral, and Phi-2 encoder probes are queued. Gemma 4 and InkubaLM differ widely in size, language mix, and alignment regime, yet both produced the same 28/28 BIAS_PRESENT verdict, so we expect the remaining five to land in the same place.

Download the per-axis, per-layer probe results: inkubalm_full_algebra.json (76 KB), and the companion gemma4_full_algebra.json from the Gemma 4 audit.

What each level of this audit is measuring

The Gemma 4 post measured encoder geometry: linear and MLP probes applied to hidden states at every layer. That's what the model knows internally, regardless of what it chooses to say.

This post measures what the model says — pseudo-log-likelihood of stereotype vs anti-stereotype completions. That's where RLHF and alignment training have the most leverage. It's the level at which a model gets called "biased" or "clean" in most published audits.

These are complementary tests of the same 28 axes. The Gemma work found that for Gemma 4, the output level looked clean while the encoder-level structure was intact across all 28 axes. That distinction matters because RAG, semantic search, classifier deployments, and any other embedding-based downstream pipeline tap the encoder directly, bypassing the output gate entirely. A model that passes a PLL audit can still be unsafe when its hidden states get pulled into a retrieval index.

For the other five models in this post, the encoder-level audit is rolling out one at a time — and we already showed above what InkubaLM looks like at that level (28/28 BIAS_PRESENT, despite being the cleanest at the output). LLaMA-2, Mistral, and Phi-2 encoder probes are queued. Treat the output-level table at the top as necessary but not sufficient: a clean output level is a precondition for safe deployment, not proof of safety, especially for embedding-based systems.

Why is InkubaLM cleaner? Three candidates, weighed honestly.

Candidate 1: smaller models encode less bias because they have less capacity.

If this were the explanation, Phi-2 (2.7B, ~7× InkubaLM's size, well below the 7B group) should sit between InkubaLM and the larger models. It doesn't. Phi-2 is at 0.612 — indistinguishable from the larger models, far above InkubaLM. Reject.

Candidate 2: RLHF cleans up output bias.

RLHF should push stereotype preference down — that's the alignment story. Looking at base-vs-tuned pairs:

  • LLaMA-2 7B base 0.599 → chat 0.611 (+0.012, worse)
  • Mistral 7B base 0.611 → instruct 0.618 (+0.007, worse)

Both chat-tuned models are slightly more stereotype-preferring than their bases. That doesn't mean RLHF made them more biased in the way the world cares about: RLHF targets what the model says in conversation, which isn't quite what PLL measures. But it does mean RLHF as currently practiced isn't moving stereotype-preference rates downward in any consistent way at the level a fairness benchmark scores. So RLHF doesn't explain the InkubaLM gap.

Candidate 3: pretraining data composition.

InkubaLM's pretraining mix is heavily weighted toward African languages: roughly 1.9B tokens of African-language data and 360M tokens of English (per the InkubaLM paper, arXiv:2408.17024). The larger models from frontier labs are predominantly English-language web text with a long tail of other languages. Stereotypes, when they're learned, are downstream of which patterns occur frequently in training and which patterns dominate the corpus. A pretraining mix dominated by a different set of languages and registers produces a different bias surface. InkubaLM is biased on 25% of axes (7 of 28); the larger models are biased on 54–64% (15–18 of 28). That gap of roughly 30 to 40 percentage points (in the fraction of axes flagged) is far larger than the base-to-tuned shift shown above (about 1 point on stereotype-preference rate, and in the wrong direction) and orthogonal to scale (Phi-2 doesn't help).
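The figures in this comparison can be recomputed directly from the output-level table. The sketch below uses only the rates and axis counts reported there; the variable and function names are hypothetical.

```python
TOTAL_AXES = 28

# (stereotype-preference rate, biased-axis count) from the output-level table.
table = {
    "InkubaLM 0.4B": (0.550, 7),
    "LLaMA-2 7B base": (0.599, 15),
    "LLaMA-2 7B chat": (0.611, 16),
    "Mistral 7B base": (0.611, 17),
    "Phi-2 2.7B": (0.612, 17),
    "Mistral 7B instruct": (0.618, 18),
}


def axis_share_pct(n_biased: int) -> float:
    """Share of the 28 axes flagged as biased, in percent."""
    return 100.0 * n_biased / TOTAL_AXES


def tuning_shift_points(base: str, tuned: str) -> float:
    """Change in stereotype-preference rate from base to tuned, in percentage points."""
    return round((table[tuned][0] - table[base][0]) * 100, 1)


inkuba_share = axis_share_pct(table["InkubaLM 0.4B"][1])
larger_shares = [axis_share_pct(n) for name, (_, n) in table.items()
                 if name != "InkubaLM 0.4B"]

gap_low = min(larger_shares) - inkuba_share
gap_high = max(larger_shares) - inkuba_share
print(round(gap_low, 1), round(gap_high, 1))  # 28.6 39.3

print(tuning_shift_points("LLaMA-2 7B base", "LLaMA-2 7B chat"))      # 1.2
print(tuning_shift_points("Mistral 7B base", "Mistral 7B instruct"))  # 0.7
```

The two quantities are in different units (share of axes flagged versus mean preference rate), so the comparison is one of rough magnitude rather than a like-for-like ratio.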

This is the most defensible explanation. It also reverses a common assumption in alignment work — that the path to safer models runs through bigger pretraining + better RLHF on top. The data here suggests pretraining composition matters more than either pretraining scale or post-hoc alignment, at least at the metric we can directly measure.

What this means for African NLP teams (briefly)

If you're shipping Bantu intent, sentiment, or classification in production, the practical question is whether a small purpose-built encoder can do the job — not which large general-purpose LLM is least bad. Our BENCHMARKS page has the head-to-head numbers; the short version is that for Bantu specifically, frozen-encoder probing on a 15M-parameter model trained on the right data is competitive with or beats both InkubaLM and the larger frontier-lab LLMs on intent and sentiment.

That's a deployment-pattern observation, not a competitive flex. The interesting research finding here is the data-composition story — which would be just as true if a different team had built InkubaLM, or if Bhala's encoder didn't exist.

What this audit can't tell you — and what's next

The output-level audit is necessary but not sufficient. A model with a clean PLL profile can still have bias structurally encoded in its representations (we know this is the case for Gemma 4). If your product wraps any of the audited models for retrieval, semantic search, or classifier features, you need the encoder-level audit too. Those runs are queued.

The 50-pair-per-axis sample is small. All six runs use n=50 pairs for each of the 28 axes, drawn from the source benchmarks. That's enough to see large effects (the InkubaLM gap is about 8× larger than the run-to-run noise) but not for fine-grained per-axis claims.
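A rough way to see why per-axis claims are fragile at n=50 is the binomial standard error of a preference rate. This is sampling error under the assumption that pairs are independent coin flips at the chance rate; it is a separate quantity from the run-to-run noise mentioned above, which the post does not define.

```python
import math


def binomial_se(p: float, n: int) -> float:
    """Standard error of a proportion estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


PAIRS_PER_AXIS = 50
N_AXES = 28

per_axis_se = binomial_se(0.5, PAIRS_PER_AXIS)         # one axis, 50 pairs
pooled_se = binomial_se(0.5, PAIRS_PER_AXIS * N_AXES)  # all 1,400 pairs pooled

print(round(per_axis_se, 4))  # 0.0707
print(round(pooled_se, 4))    # 0.0134

# How many standard errors above chance is a 60% rate on a single axis?
z_at_threshold = (0.60 - 0.50) / per_axis_se
print(round(z_at_threshold, 2))  # 1.41
```

On a single axis, a 60% preference rate sits about 1.4 standard errors above chance, so individual axis flags can flip with resampling, while a rate pooled over all 1,400 pairs is considerably tighter.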

The instruction-tuned variants tell a confusing story. Both LLaMA-2 chat and Mistral instruct came out marginally worse than their bases on this metric. We don't have a confident explanation; possibilities include the instruct templates interacting with the PLL scoring, the chat models being slightly more confident in their next-token predictions in general, or genuine RLHF non-monotonicity. Worth a separate post once we've controlled for the templating effect.

Reproduce

The 28-axis output-level audit is a few hundred lines of Python; the methodology is described in the Gemma 4 audit post. The benchmark sources are public: BBQ, CrowS-Pairs, StereoSet, and WinoBias.

The full per-axis, per-layer audit results (every probe metric: linear, MLP, whitened linear, centroid, operator-direction consistency, effective rank, subspace residual, Bhattacharyya; every layer probed; every axis tested) are downloadable directly: inkubalm_full_algebra.json and gemma4_full_algebra.json.

If you arrive at different conclusions on the same data, we want to hear about it. If you want to run the same audit on your own deployment, talk to us.


This work was conducted by the Bhala research team in late April 2026. Inference for the LLaMA, Mistral, and Phi audits ran on vLLM; InkubaLM ran on local transformers due to a known issue with its custom attention path. The PLL math is identical across both engines (mean log P(token | left context)) and the per-axis numbers are reported from each model's own audit JSON without rescaling.
