Bhala AI Team · 22 min read

A 15M Zulu-only model beats GPT-4o on Swahili — and understands Korean without ever seeing it

research · multilingual · transfer-learning · low-resource

Highlights

  • 15M parameters, trained only on isiZulu (~1 hour on a laptop GPU)
  • 73.2% on Swahili intent → beats GPT-4o zero-shot (70.6%), and Bhala is roughly 10,000× smaller (different zero-shot regimes — full comparison in body)
  • 72.5% on Korean intent with a linear probe on a frozen encoder — a language the model never saw
  • 39–43× over random across Korean / Hindi / Amharic — none of them in pretraining
  • Linear probe beats the 2-layer MLP probe on all three new languages: structure is so cleanly linearly separable the MLP was overfitting

The short version

When a frontier lab announces a "multilingual" model, the unstated assumption is that every language they list was in pretraining. Add a new language, you re-pretrain. Add Korean to a model that didn't see Korean? Either you fine-tune on Korean text, or you accept terrible quality.

We took a different route. The encoder is 15M parameters, pretrained on isiZulu and nothing else. To run it on a new language, we don't touch the encoder. We don't fine-tune. We don't see a single labeled example. We adapt the input using a proprietary pipeline trained on a small monolingual sample, hand the result to the frozen Bhala encoder, and put a small probe on top.

On MASSIVE — the standard 51-language, 60-intent benchmark — using only a linear probe on the frozen encoder (the field's gold-standard representation-quality test):

| Language | Linear probe | 2-layer MLP probe | Lift over random |
| --- | --- | --- | --- |
| Korean | 72.49% | 69.85% | 42.8× |
| Hindi | 69.65% | 66.67% | 41.1× |
| Amharic | 66.46% | 63.76% | 39.2× |
| Swahili (Bantu sister) | 64.2% | 73.2% (beats GPT-4o's 70.6%) | 38–43× |

None of these languages were in the pretraining corpus. The encoder weights were never updated for them. The linear probe has zero capacity to learn 60-way intent classification on its own — its job is to read structure that's already in the encoder's geometry.

The startling part: on Korean / Hindi / Amharic, the linear probe beats the MLP probe by 2.6–3.0 points. The structure is so cleanly linearly separable that adding nonlinear capacity on top hurts — the MLP was overfitting on a representation that didn't need it. That's an unusual signature; it means the encoder isn't just "good enough for some classifier to work" — its geometry directly encodes the task in a linearly readable form.

The Swahili result (73.2% with the MLP probe, above GPT-4o zero-shot at 70.6%) is the marketing headline. The Korean linear-probe result is the scientifically interesting one. Swahili is a Bantu sister language — close enough to Zulu that some transfer is expected. Korean is from a different family, uses a different writing system, and has no genealogical relationship to Zulu of any kind. The encoder reads it anyway, with the strictest possible probe.

The conclusion isn't "isiZulu is a magical pivot language." The conclusion is: most of what a language model learns is structural, and that structure is shared across human languages. If your encoder learns the structure cleanly, a small input adaptation step is enough to wire a new language into it.

What we mean by "trained on Zulu only"

To be precise about what's frozen and what isn't:

Frozen across all languages: the 15M-parameter encoder, pretrained on ~40M tokens of isiZulu text in roughly one hour on a laptop-class GPU. The encoder weights never move at evaluation time.

Fitted per language: a proprietary input pipeline that adapts target-language text into a form the frozen encoder accepts. It's fitted from a small monolingual sample — no labels, no parallel corpus, no fine-tuning of the encoder. The pipeline's architecture itself is proprietary.

The probe on top is trained on MASSIVE intent labels. We never update the encoder. The probe has nothing like the capacity to learn intent classification from scratch — its job is to read structure that's already there.

Why isiZulu? (it's a design philosophy, not an accident)

We didn't pick isiZulu because it was the language at hand. We picked it because solving language modeling at the hardest end of the data curve forces the encoder to actually learn structure — there's no shortcut available. That makes hardest-case pretraining a forcing function for the kind of representations that transfer.

Four reasons this matters:

1. Low-resource forecloses the "scale will save us" shortcut. A model trained on terabytes of English can fake compositional understanding by memorizing surface n-gram statistics. The data is dense enough to look right. With a single low-resource language and ~40M tokens, that shortcut is gone. The encoder has to compress the language structurally because there isn't enough surface data to memorize the long tail. What you're left with — when it works — is structure, not statistics.

2. Bantu morphology forces structural representation. isiZulu is agglutinative and morphologically rich — noun classes, subject and object agreement, tense and aspect as morphemes that combine productively. The encoder can't just learn word-level distributions; it has to learn how meaningful pieces combine. That's exactly the kind of compositional substrate that transfers across families, because every human language combines meaningful pieces — only the surface symbols change.

3. The hardest case is a fair test of the architecture. If a 15M-parameter encoder trained on the sparsest realistic corpus produces cleanly linearly separable representations on never-seen languages, the credit belongs to the architecture and training objective, not to data brute force. If we'd trained on English Common Crawl and gotten the same numbers, we couldn't have ruled out "memorization at scale" as the explanation.

4. Zulu words encode philosophical depth in their etymology. Per Mhlambi's work on rationality and African philosophy, the language's everyday vocabulary carries built-in metaphysics — and crucially, that metaphysics is morphologically structured, not a footnote. Take the root -ntu, roughly "being / personhood":

  • umu-ntu — a person (one being)
  • aba-ntu — people (the plural class of beings)
  • isi-ntu (isintu) — the ways / manner / language of beings (the same isi- noun class that gives us isiZulu and isiXhosa — "the language of")
  • ubu-ntu (ubuntu) — ubu- = becoming / state-of-being; so ubuntu = the state of becoming a -ntu, of being-in-relation

One root, four productive noun-class prefixes, four systematically related concepts. This isn't poetic etymology — it's a compositional morpheme grammar where the prefix tells you which kind of -ntu thing you're talking about. The encoder learning to predict the next morpheme in Zulu is forced to learn this prefix-class system to do well; it can't fake it with surface n-grams.

A note on tokenization (why standard sub-word tokenizers struggle here)

Standard sub-word tokenizers used by most multilingual models are frequency-based — they merge character sequences statistically, with no built-in awareness of morpheme boundaries. For agglutinative languages like Zulu, this means a single morpheme can be fragmented across multiple tokens, and the same morpheme can receive different fragmentation depending on which word it appears in. The result: the encoder cannot easily learn that -ntu is the same unit of meaning whether it appears in umuntu, abantu, isintu, or ubuntu — the morpheme-as-meaning-unit relationship is hidden from the model by the input layer itself.
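
To see the fragmentation concretely, here is a minimal sketch using an off-the-shelf multilingual subword tokenizer (XLM-R's, loaded via Hugging Face transformers — chosen purely for illustration, not part of Bhala's pipeline). The exact splits depend on the tokenizer and its vocabulary, so treat the output as illustrative:

```python
# Sketch: a frequency-based subword tokenizer splits the Zulu root -ntu
# differently depending on the word it appears in, hiding morpheme identity.
# Requires: pip install transformers sentencepiece
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

for word in ["umuntu", "abantu", "isintu", "ubuntu"]:
    pieces = tokenizer.tokenize(word)
    print(f"{word:>7} -> {pieces}")

# Whatever the specific splits, the point is that -ntu is not guaranteed to
# surface as one stable token across the four words, so the encoder never
# sees it as a single recurring unit of meaning.
```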

This is well-documented in the low-resource and morphologically-rich-language NLP literature — work presented at AfricaNLP, SIGMORPHON, WiNLP, and ACL workshops on subword tokenization for agglutinative languages has consistently found that frequency-based segmentation degrades downstream performance on these languages relative to morphologically-aware alternatives. Bhala's proprietary input pipeline addresses this by incorporating morphological role information learned per language, so morpheme identity is preserved across the contexts where it appears. That preserved identity is the substrate on which the encoder's structural learning becomes possible — without it, the four-prefix demonstration above would be invisible to the model.

The morphology is recursive in a way that lets the language make philosophical statements about itself. The foundational ubuntu maxim — umuntu ngumuntu ngabantu — translates as "a person is a person through other persons," or, more morphologically faithfully, "a person's personhood is through other persons."

English can capture the self-reference, but it has to use derivational morphology (person → personhood) stacked on top of possessives and prepositions. Zulu does the same work at the inflectional level: umuntu and abantu are productive forms of the same root -ntu, parallel members of different noun classes, not derivations. The philosophical claim falls out of the basic grammar, not from word-formation rules layered on top — and the encoder learning Zulu morphology is forced to internalize that grammar as a baseline, not as a special case.

The verbal system is just as dense. Take ngiyakuthanda — which can mean either "I love you" or "I am loving you" depending on context and tone. It decomposes morphologically into ngi- (I, subject) + -ya- (verb extension / focus marker) + -ku- (you, object) + -thanda (love).

One word in Zulu, three or four words in English. The agent, object, verb, and aspectual framing are all bound into a single agglutinated unit. Every word in a Zulu sentence is doing this kind of compositional packing, where multiple grammatical and semantic roles are bound into morphemes that the encoder has to learn to combine productively.

And we haven't even touched the phonology, which adds three more layers of structural depth that don't exist in any Indo-European language a frontier model is typically pretrained on.

Zulu is tonal. The same written word can shift in tense, mood, or meaning purely based on tone — which standard orthography doesn't mark.

Zulu has clicks. The consonants written c, q, x (plus combinations like nc, gq, ngc) are click phonemes, productive members of the consonantal inventory rather than expressive flourishes. Some clicks are themselves tonal.

Zulu has ideophones — a productive open word class of sound-symbolic words like gqi! (sudden thud), bhe! (abrupt appearance), mfu! (soft, fluffy), qha (utterly, completely). Ideophones are recognized as a distinct grammatical category in Bantu linguistics; English has onomatopoeia as a marginal poetic device, but nothing like Bantu ideophones as a productive grammatical class. They couple phonology directly to specific sensory or evaluative meaning, forcing the encoder to learn an iconic phoneme-to-meaning mapping that isn't a regular phenomenon in any major frontier-pretraining language.

Standard Zulu orthography marks the clicks and ideophones but not the tones, which means a text-only encoder is already working with a lossy projection of the language — yet the structural signal is robust enough that the encoder still cleanly transfers cross-family. That's evidence the relevant generalization isn't a fragile one tied to surface phonetics; it's deep enough to survive having an entire phonological dimension stripped out at the input layer.

Or take Sawubona — the standard word for "hello" — which literally means "I see you", with the plural sanibonani meaning "we see you / I see you all". (The cognate form salibonani in Northern Ndebele — a sister language in the same Nguni family — carries the same meaning, exactly the kind of cross-language structural continuity our encoder is built to exploit.) Greeting in Zulu is an act of recognition, conjugated for number; the social relation is in the morphology, not added on as context.

The relational, the social, the metaphysical, the temporal — none of it is an optional add-on in Zulu. It's all structural, all decomposable into morphemes the encoder has to learn to combine. An encoder that has to predict the next morpheme in Zulu is learning a system where each prediction carries more meaning-bearing structure than predicting the next sub-word in a Common Crawl English corpus. The encoder isn't just learning what comes next; it's learning the shape of meaning-bearing structures that recur across human languages.

The implication: if it works in Zulu, it should work anywhere — but the converse isn't true. A model that works on English doesn't necessarily work on Bemba; the data shortcut isn't available there. Choosing the hardest case first means everything that comes after is downhill.

We're not claiming isiZulu has a monopoly on any of these properties. Finnish, Turkish, Hungarian, and Arabic satisfy reasons 1–3 (low-resource forcing function, dense morphology, fair architectural test). Sanskrit, Hebrew, Arabic, Mandarin, and Greek all encode philosophical depth in their morphology and etymology with mechanisms different from Bantu but comparably rich (root-and-pattern, character-as-concept, productive prefix systems).

What's distinctive about isiZulu is that it satisfies all four conditions together — low data resource, dense agglutinative noun-class morphology, productive structural composition, and integrated philosophical-relational content — in a single language. We chose it because we wanted the combination, not because it's the only valid choice. Other principled choices exist; the choice itself shouldn't be hand-waved.

Why frozen + linear probe is the gold-standard test

Most published "cross-lingual transfer" results sit in one of two looser regimes:

  1. The model fine-tunes on the target language — even a small amount of labeled data adapts the encoder's representations. You can no longer tell whether the original representations were already useful, or whether the fine-tuning step did all the work.
  2. The model saw the target language during pretraining — mBERT, XLM-R, NLLB, every frontier multilingual model. By the time you evaluate, the encoder has already digested raw text in the target language, even if not labeled. That's not zero-shot in the strict sense.

Our setup forecloses both escape hatches.

| | Standard fine-tuning / multilingual pretraining | Bhala — frozen encoder + probe on a never-seen language |
| --- | --- | --- |
| What's being tested | The encoder's adaptability to the new language | The encoder's intrinsic cross-lingual structure |
| Could the model "cheat" via target-language memorization? | Yes — fine-tuning lets it learn target-language patterns directly | No — the encoder weights never move |
| What 70% accuracy implies | The model was trainable | The decision boundary already existed in the frozen geometry |

A linear probe on a frozen encoder can only succeed if the embeddings are already linearly separable for the target task in the target language. There's no nonlinear classifier to patch up an inadequate representation; there's no fine-tuning step to rotate it into shape. The geometry has to already encode the distinction. For a task label — say, positive sentiment or "intent to book a flight" — to be linearly separable in Korean using a Zulu-pretrained encoder, the encoder must have learned a representation space where:

  • "Positive sentiment" vectors in Zulu occupy roughly the same region as "positive sentiment" vectors in Korean — even though it has never seen a single Korean word.
  • The geometric direction for "intent to book a flight" survives the language change.

If we'd fine-tuned, you could fairly object that the model picked up Korean-specific embeddings during the fine-tune. But with a frozen encoder, no such adaptation is possible. The only way the probe can win is if the original Zulu-only training produced a universal conceptual geometry that already applies to Korean.
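
In code, the probing regime is roughly the following. This is a minimal sketch, not Bhala's evaluation harness: frozen_encoder, adapter, and the train/test variables are hypothetical placeholders for the frozen 15M encoder, the per-language input adaptation, and the MASSIVE splits; the only trained component is a single linear classifier fit with scikit-learn.

```python
# Minimal sketch of the frozen-encoder + linear-probe regime (placeholder names).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def embed(texts):
    # The encoder (and the fitted input adapter) are frozen: forward passes only.
    return np.stack([frozen_encoder.encode(adapter(t)) for t in texts])

X_train, X_test = embed(train_texts), embed(test_texts)

# A single linear map is the entire probe. It can only reach ~70% on 60 intents
# if those classes are already linearly separable in the frozen embedding space.
probe = LogisticRegression(max_iter=2000)
probe.fit(X_train, train_labels)

acc = accuracy_score(test_labels, probe.predict(X_test))
print(f"linear-probe accuracy: {acc:.2%} (random baseline ~1.69%)")
```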

To our knowledge, the published literature does not contain a result with this exact combination: a model trained only on a single low-resource language, encoder frozen, a linear probe achieving high accuracy on a typologically distant language for intent / sentiment tasks. Cross-lingual transfer papers either (a) fine-tune, (b) pretrain on the target language too, or (c) use a small seed dictionary or bilingual lexicon. We're claiming something stricter than any of those.

In the broader interpretability literature, frozen-encoder + linear probe is usually treated as the gold standard for testing whether a model has learned genuine, transferable abstractions. We pass that test on languages the model was never trained on. That's the historic part of the result, not the absolute accuracy number.

What we have measured under each regime

Both linear and 2-layer probes are frozen-encoder reads. The linear-probe number is the stricter test: no nonlinear capacity, just the encoder's raw geometry projected through a single matrix multiply.

| Test | Probe | Acc | Random | Notes |
| --- | --- | --- | --- | --- |
| MASSIVE Korean | Linear | 72.49% | 1.69% | 42.8× random; never seen in pretraining; higher than the MLP probe |
| MASSIVE Hindi | Linear | 69.65% | 1.69% | 41.1× random; higher than the MLP probe |
| MASSIVE Amharic | Linear | 66.46% | 1.69% | 39.2× random; higher than the MLP probe |
| MASSIVE Korean | 2-layer | 69.85% | 1.69% | Overfit; linear is better |
| MASSIVE Hindi | 2-layer | 66.67% | 1.69% | Overfit; linear is better |
| MASSIVE Amharic | 2-layer | 63.76% | 1.69% | Overfit; linear is better |
| MASSIVE Swahili | Linear | 64.2% | 1.69% | 38× random; Bantu sister, in-distribution |
| MASSIVE Swahili | 2-layer | 73.2% | 1.69% | Above GPT-4o zero-shot at 70.6% |
| Injongo Zulu | Linear | 82.7% | — | 8-language Bantu intent benchmark |
| Injongo Zulu | 2-layer | 93.7% | — | Beats AfroXLMR-76L (270M, fine-tuned) by +3.9pp |

The clean read: on three typologically distant languages the model has never seen (Korean, Hindi, Amharic), the strictest version of the gold-standard test passes — and the linear probe outperforms the MLP probe. On Bantu (Swahili, Zulu), where the encoder has structural overlap, the MLP head adds capacity. On the cross-family transfers, the structure is already so cleanly linear that the MLP overfits.
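
The linear-vs-MLP comparison in the table amounts to swapping the probe head while keeping the embeddings fixed. A sketch, reusing the placeholder embeddings from the earlier probe example (the hidden width here is an arbitrary illustration, not the probe we used):

```python
# Same frozen embeddings, two probe heads: the only thing that changes is
# the capacity of the classifier reading the encoder's geometry.
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

probes = {
    "linear":  LogisticRegression(max_iter=2000),
    "2-layer": MLPClassifier(hidden_layer_sizes=(256,), max_iter=500),
}

for name, probe in probes.items():
    probe.fit(X_train, train_labels)     # encoder weights never move
    print(f"{name:>7} probe accuracy: {probe.score(X_test, test_labels):.2%}")

# On Korean / Hindi / Amharic the linear head wins: the extra nonlinear
# capacity of the MLP only gives it room to overfit.
```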

To our knowledge, no other published model satisfies this exact test condition: frozen encoder, linear probe only, zero target-language data anywhere in pretraining, on a typologically distant language. Standard multilingual models (mBERT, XLM-R, AfroXLMR, BLOOM, LLaMA) violate the zero-target-language condition by construction — they all pretrained on Korean / Hindi / Amharic. The test isn't applicable to them as published; the comparison is unique to us.

An open challenge. That "to our knowledge" is doing real work — we want to be wrong about it if there's a counterexample. We did a focused search of the cross-lingual transfer literature — multilingual encoder papers, AfricaNLP / WiNLP / SIGMORPHON workshop proceedings, low-resource NLP work — looking for any published result with all four conditions: (a) frozen encoder, (b) linear probe only (no fine-tuning, no MLP head), (c) zero target-language text anywhere in pretraining, (d) evaluated on a typologically distant language family. We didn't find one. We did not exhaustively review every paper in the field, so the claim is "we couldn't find one and we looked," not "no such paper exists in the universe." If you know of a published model that satisfies all four, send the citation and we'll update this post — and run our own head-to-head against it. Falsification of the claim would be more useful than confirmation; we'd rather know.

Why this is a stronger claim than it looks

Three things make the result load-bearing rather than coincidental.

1. The probe can't fabricate task structure. A small probe on a frozen 15M encoder, trained on an intent-classification dataset, doesn't have the capacity to invent the structure of MASSIVE on its own. If it reaches 70% accuracy, the structural signal is already in the encoder's representations. We're measuring, not training. People often miss this distinction — when a probe succeeds on a frozen model, the credit belongs to the encoder, not the probe.

2. The result is not a tokenizer artifact. Point a Zulu-fitted input pipeline at Korean Hangul without script adaptation and token coverage collapses; the probe falls to ~40% accuracy — token noise, not transfer. With proper script adaptation, coverage rises above 98% and accuracy jumps by 25–30 absolute points. The encoder isn't doing anything different — only the input adaptation changed. So what we measure post-adaptation is genuine encoder structure, not a fluke of token noise (coverage is computed as in the sketch after this list).

3. The lift over random is not subtle. MASSIVE has 60 intent classes; the random baseline is 1.69%. A model that "kind of works" lands around 5–10× over random. Our Korean result sits at roughly 43× over random, Hindi at 41×, Amharic at 39× (the arithmetic is in the sketch below). These are numbers in the same league as proper multilingual encoders much larger than ours — without target-language pretraining, without fine-tuning, on a 15M-parameter encoder.
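
The two checks referenced in points 2 and 3 are simple to state precisely. The sketch below defines token coverage against a generic Hugging Face-style tokenizer (a stand-in — the real pipeline is proprietary) and reproduces the lift arithmetic; small differences from the table's 42.8× / 41.1× / 39.2× are rounding of the 1.69% baseline.

```python
# (2) Token coverage: fraction of tokens the vocabulary can represent at all.
#     `tokenizer` is any Hugging Face-style tokenizer object (illustrative stand-in).
def token_coverage(texts, tokenizer):
    ids = [i for t in texts for i in tokenizer.encode(t, add_special_tokens=False)]
    return 1.0 - sum(i == tokenizer.unk_token_id for i in ids) / max(len(ids), 1)

# (3) Lift over random for 60-way intent classification.
random_baseline = 0.0169  # ~1/60, as reported for this MASSIVE setup
for lang, acc in [("Korean", 0.7249), ("Hindi", 0.6965), ("Amharic", 0.6646)]:
    print(f"{lang:>8}: {acc / random_baseline:.1f}x over random")
```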

How does the encoder learn anything portable from Zulu?

The honest answer: structure travels further than vocabulary does.

isiZulu is morphologically rich. Building a model that can predict the next unit of meaning in Zulu forces it to internalize the kind of compositional regularities that hold across human languages. A model that learns those well enough to predict Zulu sentences cleanly is also a model that has learned the shape of how human language tends to combine units of meaning.

What doesn't change between Zulu and Korean: most sentences combine subjects with predicates. Modifiers narrow heads. Temporal phrases anchor events. Negation flips truth values. Questions request information. Most of MASSIVE's intent labels — set an alarm, request a refund, send a message — depend on these compositional shapes far more than they depend on the specific phonology or orthography of the language.

What does change: the input symbols. Korean uses Hangul, Hindi uses Devanagari, Amharic uses Ge'ez, Zulu uses the Latin alphabet. None of these symbol sets carry meaning to an encoder trained on the others. The job of the proprietary input pipeline is to take Korean text and produce a representation whose statistical signatures look enough like Zulu that the frozen encoder treats them as legitimate input.

That's the trick: the encoder doesn't learn Korean. It learns the shape of human language during Zulu pretraining, then we hand it Korean wearing a costume the encoder recognizes.

Why this matters for the 6,000+ languages today's models ignore

The dominant approach to multilingual coverage looks like this: scrape every text corpus you can find in every language, throw it all into pretraining, and hope the data balance works out. This produces models that are competent in a few hundred languages, limp through a few hundred more, and cannot meaningfully process several thousand at all.

The cost structure of that approach is brutal for the long tail. To add a language with under 10MB of digitized text, you essentially can't. There isn't enough data to balance the pretraining mix. Fine-tuning on what little data exists collapses the model. So roughly 6,000 of the world's living languages — covering several hundred million speakers — sit outside the frontier of any production system.

The structural-transfer route changes that math. To add a new language you don't need a corpus large enough to retrain anything. You need a small monolingual sample — order of magnitude is megabytes, not gigabytes — to fit the proprietary input pipeline. You need labeled data only for the task (intent, sentiment, hate, summarization), and that labeled data is not language-specific at the encoder level — the encoder has never seen the language, so the labels you use can come from any well-resourced relative.
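
At the integration level, adding a language looks roughly like the sketch below. The function names (fit_input_adapter, frozen_encoder.encode) are hypothetical stand-ins — the adapter's internals are proprietary — but the data requirements shown are the ones described above: an unlabeled monolingual sample for the adapter, labeled examples only for the probe.

```python
# Hypothetical workflow sketch for wiring a new language into the frozen encoder.
from sklearn.linear_model import LogisticRegression

# 1. Fit the per-language input adapter from a small unlabeled monolingual sample
#    (megabytes, not gigabytes). The encoder itself is never touched.
adapter = fit_input_adapter(monolingual_sample)   # proprietary step, placeholder name

# 2. Embed task data through the adapter and the frozen encoder (forward passes only).
def embed(texts):
    return [frozen_encoder.encode(adapter(t)) for t in texts]

# 3. Train only a small probe on labeled task data — labels that can come from
#    any well-resourced relative, since the encoder never sees the language anyway.
probe = LogisticRegression(max_iter=2000)
probe.fit(embed(labeled_texts), labels)

# 4. Serve: adapter -> frozen encoder -> probe. No fine-tuning anywhere.
predictions = probe.predict(embed(new_texts))
```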

We're not the first people to notice that structural representations transfer. Cross-lingual word embeddings, mBERT, XLM-R, FLORES — there's a long literature on this. What we think is new and useful is the cleanness of the demonstration: a 15M-parameter encoder, a single low-resource pretraining language, no fine-tuning, no parallel data, no target-language training, and the result holds across writing systems unrelated to Zulu (Hangul, Devanagari, Ge'ez) and language families unrelated to Bantu (Koreanic, Indo-Aryan, Semitic).

If a 15M-parameter encoder pretrained on isiZulu can do this, the question for the field shifts from "do we have enough data in language X" to "do we have a working input adaptation for language X." That's a much easier problem to solve at scale.

What we're not claiming — and what the right comparison actually is

There is no honest "we lose zero-shot to frontier on Korean" claim, because frontier models are never zero-shot on Korean. They saw tens of billions of Korean tokens during pretraining. They're zero-shot on the task (no fine-tuning on intent labels); they're heavily pretrained on the language. Calling that comparison "zero-shot" obscures the asymmetry.

We're the inverse: zero-shot on the language (the encoder has never seen a Korean word), supervised on the task (a small probe trained on labeled intent examples). Different setups; different kinds of zero-shot; ours is the harder one.

| Setup | Language exposure in pretraining | Task supervision |
| --- | --- | --- |
| Bhala on Korean | None — the encoder never saw Korean | Small probe on labeled intent |
| GPT-4o on Korean | Tens of billions of Korean tokens | None (prompted) |
| XLM-R fine-tuned on Korean | Tens of billions of Korean tokens | Full fine-tune on intent |

The only honest "frontier vs Bhala" comparison would be a frontier model that was also language-zero-shot on Korean. That comparison doesn't exist for production frontier models, because they are never language-zero-shot on any major language.

Where the comparison narrows toward fair, we already win. On Bantu — the language family the encoder pretrained on — frozen Bhala 15M with a probe gets 93.7% on Injongo Zulu, beating AfroXLMR-76L (270M parameters, fine-tuned) at 89.8%. +3.9pp at 18× fewer parameters, with the encoder never fine-tuned. That's the direction the data points: when the encoder has structural relevance to the language family, a small frozen model beats a fine-tuned model an order of magnitude larger.

Where transfer breaks down (the honest other side). The cross-family transfer story we just told isn't uniform across the 17 languages we've tested. The post leads with Korean / Hindi / Amharic / Swahili because those are the upper-tail. The lower-tail looks different: Hungarian 28%, Azerbaijani 34%, Turkish 43%, Finnish 46%, Afrikaans 50% (all on the same MASSIVE benchmark, all on the benchmarks page). At the parameter scale we're working with (15M), structural transfer scales with typological proximity to the encoder's pretraining substrate — Bantu morphology overlaps with Korean / Hindi / Amharic in ways it doesn't overlap with Uralic (Hungarian, Finnish), Turkic (Turkish, Azerbaijani), or Germanic (Afrikaans). The honest version of the universality claim is "structural transfer scales with typological compatibility, and currently breaks down in language families typologically distant from Bantu — likely a capacity issue we expect to soften with larger encoders." The 6,000+ long-tail story still holds for languages typologically near Bantu (most of Africa, much of South and Southeast Asia, parts of Oceania); it weakens for the more distant families until we scale up.

But the moat isn't Korean or Swahili anyway. Korean has 80M speakers and tens of billions of pretraining tokens in every frontier model. Swahili has 200M speakers and growing frontier coverage. The interesting question — and the one our customers actually face — is what happens for the other 6,000+ languages: Bemba, Tsonga, Lozi, Kanuri, Wolof, Quechua, Aymara, Tigrinya, the long tail of every family.

For these languages, the comparison stops being apples-to-apples and becomes apples-to-nothing. No frontier model can serve them at any quality, because none of the prerequisites for fine-tuning a frontier model exist:

  • Their tokenizers don't segment the language cleanly — every word fragments into UNK or sub-word noise.
  • There aren't enough labeled examples to fine-tune a 70B-parameter model meaningfully.
  • There isn't enough monolingual data for the frontier model to absorb the language structure even in continued pretraining.
  • The GPU budget to attempt any of the above is out of reach for the businesses that actually serve these languages.

A 15M encoder pretrained on a single low-resource language — with a small per-language input adaptation step — reaches intent classification at 39–43× over random on a language it has never seen, using a few hundred megabytes of monolingual data and a few minutes of compute. For a regional bank or government agency deploying intent classification on Bemba, Pashto, Quechua, or any of the 6,000+ underrepresented languages, the question isn't "is Bhala 5pp behind GPT-4o?" — it's "is Bhala the only thing that works?" Today, yes.

The 72.5% Korean number is the lower bound, not the ceiling

This was a 15M-parameter encoder pretrained on ~40M tokens of one low-resource language. Korean / Hindi / Amharic were never in the training mix — and never need to be. More pretraining data in any language we choose to add (more Bantu, English, math, code) enriches the structural substrate the encoder learns. Because Korean / Hindi / Amharic are not in any of those mixes, adding more text doesn't violate the gold-standard test condition for them. It just makes the encoder better at the abstract structural patterns that transfer cross-family.

The implication: scaling the encoder, or pretraining on a richer corpus, should improve cross-family transfer. Extrapolating from the trend, we anticipate Korean linear-probe results in the 80%+ range from larger backbones — though scaling laws for cross-lingual transfer can be non-monotonic, so we treat that as a directional projection rather than a guarantee. Today's number is what a single-low-resource-language pretraining run buys you on a laptop; the upper bound is empirical and we'll publish whatever it actually is.

The bottom line

The cost of supporting an underrepresented language should be measured in megabytes and human-hours, not in pretraining budget. The long tail of human languages — the 6,000+ that frontier multilingual models cannot reach at any price — is reachable today by a small focused team running a 15M-parameter encoder on a laptop. That changes who gets to ship language AI, and for whom.

Bhala is the system that makes this work in production. If your business depends on serving a language that frontier models ignore, talk to us.
