Research
Sozisi: 15M Parameters, No Compromises
A 15M-parameter model. New SOTA on isiZulu, isiXhosa, and Sesotho intent detection (Injongo), beating AfroXLMR-76L. Competitive cross-family transfer against models 28× larger. All benchmarks verified, reproducible, and run with frozen backbone + MLP probe.
Model Overview
Most AI memorizes words. We model the scaffolding underneath them.
Every language is different. The structure underneath isn't: morphology, noun-class agreement, argument composition. Teach a model vocabulary and it knows one language. Teach it structure and it knows thousands.
Over 6,000 languages are underserved by today’s AI (Stanford HAI, 2024). We pretrained a 15M-parameter model on one morphologically-rich language (isiZulu) and it works zero-shot across 40+ more — because the scaffolding is what they share.
Every benchmark below follows the same rule: the language under test was not in our training data. We say so explicitly each time, because the normal assumption is "big multilingual model beats a specialist." Here the specialist is the 422M-parameter competitor that trained on the target language. We're the 15M-parameter model that didn't.
“Doesn’t GPT-4 already learn structure from web data?”
Incidentally, yes — for languages with abundant training data. English, Chinese, Spanish, French: a huge web corpus forces any sufficiently large model to absorb morphology and composition by statistical accident. For everything else, “structure” is whatever the model can infer from a sliver of the corpus. That’s why GPT-4o scores 71.2% on isiZulu intent and 70.6% on KiSwahili, even though both languages are in its pretraining corpus — competent, not native.
We take the opposite path. Structural features (morphological roles, noun-class agreement, argument composition) are explicit inputs during pretraining. The model isn't hoping to pick up grammar; it's taught it. That's why a 15M-parameter model pretrained on isiZulu alone matches or beats a 270M competitor that was fine-tuned in-language on isiZulu, isiXhosa, and Sesotho. Those results wouldn't be possible if everyone already modeled structure well enough.
On English, the giant models win. We never claimed to beat English on English. On the 6,000 languages they barely touch, the math changes.
MASSIVE Swahili — the commercial case
MASSIVE is the standard intent-classification benchmark (60 intents, Amazon-curated). Only one of the three models below had zero Swahili text in its pretraining — ours. GPT-4o and InkubaLM both ingested Swahili during their pretraining; their 'zero-shot' is task-level (no fine-tuning on MASSIVE) but language-level exposure is built in. Our result is language-level zero-shot: the model transferred from isiZulu alone.
We want to be precise about what "zero-shot" means. In LLM parlance, "zero-shot" usually means no task-specific fine-tuning, but the model has still read the target language during its web-scale pretraining. GPT-4o ingested billions of Swahili tokens; InkubaLM was explicitly pretrained on Swahili as one of its five African pretraining languages (alongside English and French). Our Sozisi result is a stricter claim: the pretraining corpus was isiZulu and nothing else. Swahili appeared for the first time at inference, through a small MLP probe on a frozen backbone.
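Concretely, the protocol looks like this. A minimal sketch in Python, where `encode` is a stand-in for the frozen Sozisi encoder returning a sentence embedding; the function name and probe size are illustrative, not a published API:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def probe_accuracy(encode, train_texts, train_labels, test_texts, test_labels):
    # The backbone stays frozen: `encode` only ever runs inference.
    X_train = np.stack([encode(t) for t in train_texts])
    X_test = np.stack([encode(t) for t in test_texts])
    probe = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
    probe.fit(X_train, train_labels)         # only the probe's weights are trained
    return probe.score(X_test, test_labels)  # accuracy on the held-out test split
```

The backbone never receives a gradient from the target language; everything the probe can exploit has to already be present in the frozen representation.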
So read the table honestly. InkubaLM is 28× larger than us, pretrained on Swahili plus general web text, and wins the head-to-head by six points. That is the expected result for a specialist model at 28× the scale. What isn't expected: a 15M model that has never seen Swahili at all lands within six points of it, and beats GPT-4o (1.8T parameters, with Swahili throughout its pretraining) by 2.6 points. That gap is a function of training budget, not an architectural ceiling. More data closes it.
| Model | Parameters | Score | Method |
|---|---|---|---|
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of five African pretraining languages) + web |
Injongo — 8 Bantu languages, head-to-head
We pretrained on one language (isiZulu). AfroXLMR-76L (270M params) was fine-tuned in-language on each target. We run a frozen backbone plus a light probe, the same recipe sketched above. On four of eight languages we match or beat a model 18× our size. Where AfroXLMR wins, it does so with that 18× parameter advantage plus target-language fine-tuning we do not do.
Sozisi vs published SOTA · head-to-head per language
Source: Adelani et al. 2025 (ACL Long Paper, arXiv:2502.09814). Baselines fine-tuned on Injongo training data per language; Sozisi uses frozen backbone + lightweight probe.
| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
|---|---|---|---|---|---|
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | Behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | Behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | Behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | Behind |
SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across all 8 languages: 90.5%
Full isiZulu leaderboard (incl. LLMs)
| Model | Parameters | Accuracy | Method |
|---|---|---|---|
| Sozisi (Bhala AI) | 15M | 93.1% | Frozen + probe |
| AfroXLMR-76L | 270M | 89.8% | Fine-tuned in-language |
| AfroXLMR | 270M | 89.0% | Fine-tuned in-language |
| mT5-Large | 1.2B | 82.4% | Fine-tuned in-language |
| XLM-R | 270M | 74.7% | Fine-tuned in-language |
| GPT-4o | ≈1.8T | 71.2% | Zero-shot |
| Gemini 1.5 Pro | Undisclosed | 68.7% | Zero-shot |
Global Reach (Zero-Shot)
High accuracy on languages with zero language-specific training data. This is the core evidence for a Universal Morphological Manifold.
| Language | Family | MASSIVE Accuracy |
|---|---|---|
| Swahili | Bantu | 71.0% |
| Urdu | Indo-Aryan | 66.0% |
| Mongolian | Mongolic | 64.0% |
| Tagalog | Austronesian | 62.3% |
| Korean | Koreanic | 61.7% |
| Amharic | Semitic | 60.9% |
| Hindi | Indo-Aryan | 60.3% |
| Javanese | Austronesian | 58.6% |
| Japanese | Japonic | 56.5% |
| Tamil | Dravidian | 56.2% |
| Kannada | Dravidian | 53.6% |
| Telugu | Dravidian | 50.5% |
Sentiment Analysis (Cross-Language)
Our model understands emotion and sentiment in languages it was never trained on, landing near SOTA with a frozen backbone and only a lightweight probe. Scores are support-weighted F1 (wF1); a reference snippet follows the table.
| Language | Sozisi (15M, frozen) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
|---|---|---|---|
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
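For reference, wF1 is per-class F1 averaged with weights proportional to how often each class appears in the gold labels. A minimal way to compute it with scikit-learn, using toy labels for illustration:

```python
from sklearn.metrics import f1_score

# Support-weighted F1 (wF1): per-class F1, weighted by class frequency
# in the gold labels, so common classes count proportionally more.
y_true = ["pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu"]
print(f1_score(y_true, y_pred, average="weighted"))
```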
Named Entity Recognition (MasakhaNER, isiZulu)
Frozen Sozisi backbone + lightweight CRF head on the standard MasakhaNER benchmark. Real entity extraction across people, places, organizations, and dates; a sketch of the head follows the table.
| Entity Type | Description | F1 | Support |
|---|---|---|---|
| PER | People | 83.2% | 888 |
| DATE | Dates | 80.8% | 318 |
| LOC | Locations | 75.8% | 337 |
| ORG | Organizations | 65.4% | 373 |
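A minimal sketch of that head, using the open-source pytorch-crf package. Here `token_embs` stands in for frozen Sozisi token representations of shape (batch, seq, dim); all names and sizes are illustrative, not our released code:

```python
import torch
import torch.nn as nn
from torchcrf import CRF

class CRFHead(nn.Module):
    """Linear emission layer + CRF over tag sequences; backbone stays frozen."""
    def __init__(self, dim: int, num_tags: int):
        super().__init__()
        self.emit = nn.Linear(dim, num_tags)        # per-token emission scores
        self.crf = CRF(num_tags, batch_first=True)  # learns tag transitions

    def loss(self, token_embs, tags, mask):
        emissions = self.emit(token_embs)
        # CRF forward returns log-likelihood; negate it to get a loss.
        return -self.crf(emissions, tags, mask=mask, reduction="mean")

    def decode(self, token_embs, mask):
        # Viterbi decoding of the best tag sequence per sentence.
        return self.crf.decode(self.emit(token_embs), mask=mask)
```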
Cross-Family MASSIVE: Sozisi vs InkubaLM
Apples-to-apples comparison: each model gets the same MLP probe, trained on the target language's training split and evaluated on the held-out test split (the recipe sketched earlier). InkubaLM (422M) is 28× larger and was pretrained on five African languages plus general web text. Sozisi (15M) was pretrained on isiZulu alone. We still win 2 of 17, and the gap shrinks as we scale.
| Language | Family | InkubaLM (422M) | Sozisi (15M) | Δ |
|---|---|---|---|---|
| Urdu | Indo-Aryan | 65.6% | 66.0% | +0.4 |
| Tamil | Dravidian | 54.3% | 56.2% | +1.9 |
| Mongolian | Mongolic | 64.7% | 64.0% | −0.7 |
| Hindi | Indo-Aryan | 65.2% | 60.3% | −4.9 |
| Swahili | Bantu | 79.2% | 71.0% | −8.2 |
| Kannada | Dravidian | 62.9% | 53.6% | −9.3 |
| Amharic | Semitic | 70.9% | 60.9% | −10.0 |
| Telugu | Dravidian | 62.2% | 50.5% | −11.7 |
| Korean | Koreanic | 73.6% | 61.7% | −11.9 |
| Tagalog | Austronesian | 78.0% | 62.3% | −15.7 |
| Japanese | Japonic | 72.8% | 56.5% | −16.3 |
| Javanese | Austronesian | 79.1% | 58.6% | −20.5 |
| Afrikaans | Germanic | 78.9% | 50.1% | −28.8 |
| Finnish | Uralic | 76.5% | 46.0% | −30.5 |
| Turkish | Turkic | 77.1% | 43.2% | −33.9 |
| Azerbaijani | Turkic | 77.8% | 33.7% | −44.1 |
| Hungarian | Uralic | 77.5% | 28.4% | −49.1 |
Sozisi wins Urdu and Tamil via our transliteration pipeline (consonant-vowel mapping; sketched below). On Swahili, InkubaLM was pretrained on the language itself. On the Uralic, Turkic, and Germanic rows, InkubaLM's web-scale pretraining dominates. Sozisi is still improving: more training data and per-language Sozisi caches should narrow these gaps.
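To make "CV mapping" concrete, here is a toy sketch of the idea for one script, Devanagari. The character tables and routing in our actual pipeline are not published; this illustrates only the mechanics, with a deliberately tiny grapheme table:

```python
# Illustrative Devanagari subset: consonants carry an inherent 'a',
# vowel signs (matras) replace it, and the virama deletes it.
CONSONANTS = {"न": "n", "म": "m", "स": "s", "त": "t", "क": "k", "र": "r"}
MATRAS = {"ा": "aa", "ि": "i", "ी": "ii", "ु": "u", "े": "e", "ो": "o"}
VIRAMA = "\u094d"  # marks a bare consonant (cluster member)

def to_cv(text: str) -> str:
    out = []
    for ch in text:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch] + "a")      # consonant + inherent vowel
        elif ch in MATRAS and out:
            out[-1] = out[-1][:-1] + MATRAS[ch]   # replace inherent vowel
        elif ch == VIRAMA and out:
            out[-1] = out[-1][:-1]                # strip inherent vowel
        else:
            out.append(ch)                        # pass through unknown chars
    return "".join(out)

print(to_cv("नमस्ते"))  # -> "namaste"
```

Once a script is rendered as CV sequences like this, the frozen backbone sees input with the same syllable shape it learned from isiZulu, which is why no retraining is needed.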
Cross-Family Next-Word Prediction (Zero-Shot)
Trained only on isiZulu, the model transfers strongly to its sister Nguni languages and to neighboring Bantu groups; the phylogenetic signal is preserved in the embedding space. The metric itself is simple, as sketched below.
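A minimal sketch of the measurement, assuming a hypothetical `model(ids)` call that returns next-token logits (the call signature is illustrative, not our API):

```python
import torch

def next_word_top1(model, ids: torch.Tensor) -> float:
    """Top-1 next-token accuracy over one tokenized sentence of shape (1, T)."""
    with torch.no_grad():
        logits = model(ids)              # (1, T, vocab): logits at each position
    preds = logits[:, :-1].argmax(-1)    # predicted token for positions 1..T-1
    return (preds == ids[:, 1:]).float().mean().item()
```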
Every Script, Every Language
From Arabic script to Hangul, the transliteration pipeline unlocks new writing systems without retraining and without new training data. "Before" is MASSIVE intent accuracy on the raw, unmapped script; "With Sozisi" is the same frozen-backbone setup after the script is routed through the pipeline.
| Language | Script | Before | With Sozisi | Improvement |
|---|---|---|---|---|
| Urdu | Arabic | 10.7% | 66.0% | +55.3pp |
| Mongolian | Cyrillic | 11.7% | 64.0% | +52.3pp |
| Korean | Hangul | 10.5% | 61.7% | +51.2pp |
| Amharic | Ethiopic | 10.1% | 60.9% | +50.8pp |
| Hindi | Devanagari | 11.1% | 60.3% | +49.2pp |
| Tamil | Tamil | 9.4% | 56.2% | +46.8pp |
Emergent Intelligence
Capabilities that emerged naturally from our structural approach, without being explicitly taught.
Thesis
Linguistic inductive bias substitutes for scale.
The dominant approach to AI assumes more is better: more parameters, more data, more compute. Our results suggest a different path. With the right structural priors, a 15M-parameter model captures language understanding that trillion-parameter systems still miss.
Sozisi is built on a phonotactic morphology-first tokenizer paired with architectural choices that exploit the compositional nature of language itself. Full methodology will appear in our forthcoming paper.
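For a flavor of what "phonotactic" means here, consider a toy consonant-vowel syllable splitter for an open-syllable Bantu language. This is not the Sozisi tokenizer (which stays unpublished until the paper); it only illustrates segmenting on syllable structure rather than on byte-pair frequency:

```python
import re

def cv_syllables(word: str) -> list[str]:
    # Greedy onset+nucleus split: any consonant run followed by one vowel,
    # plus a trailing consonant run if the word ends without a vowel.
    return re.findall(r"[^aeiou]*[aeiou]|[^aeiou]+$", word.lower())

print(cv_syllables("ngiyakuthanda"))  # ['ngi', 'ya', 'ku', 'tha', 'nda']
```

Segmenting this way keeps morphemes like the subject marker `ngi-` and the object marker `-ku-` intact, which is exactly the structural signal a frequency-based tokenizer tends to shred.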