Research

Sozisi: 15M Parameters, No Compromises

A 15M-parameter model. New SOTA on isiZulu, isiXhosa, and Sesotho intent detection (Injongo), beating AfroXLMR-76L. Competitive cross-family transfer against models 28× larger. All benchmarks verified, reproducible, and run with frozen backbone + MLP probe.

Model Overview

15M
Parameters
24MB
Model Size
<2 seconds
Language Adaptation
<50ms on device
Inference
High-efficiency structural dataset
Training Data
17
Zero-Shot Languages
<2 seconds
Probe Adaptation
Sozisi
Architecture
~$0 (no GPU cluster)
Training Cost
None
GPU Required
Architectural thesis

Most AI memorizes words. We model the scaffolding underneath them.

Every language is different. The structure underneath isn’t — morphology, noun-class agreement, argument composition. Teach a model vocabulary and it knows one language. Teach it structure and it knows thousands.

Over 6,000 languages are underserved by today’s AI (Stanford HAI, 2024). We pretrained a 15M-parameter model on one morphologically-rich language (isiZulu) and it works zero-shot across 40+ more — because the scaffolding is what they share.

Every benchmark below follows the same rule: the language under test was not in our training data. We say so explicitly each time, because the usual assumption is that the big multilingual model beats the specialist. Here the specialist is the 422M-parameter competitor that trained on the target language. We’re the 15M-parameter model that didn’t.

Pretraining
1 language
isiZulu only
Inference coverage
40+ languages
10 families, 9 writing systems
Adaptation cost per language
< 2 seconds
No fine-tuning. No retraining.
Common objection

“Doesn’t GPT-4 already learn structure from web data?”

Incidentally, yes — for languages with abundant training data. English, Chinese, Spanish, French: a huge web corpus forces any sufficiently large model to absorb morphology and composition by statistical accident. For everything else, “structure” is whatever the model can infer from a sliver of the corpus. That’s why GPT-4o scores 71.2% on isiZulu intent and 70.6% on KiSwahili, even though both languages are in its pretraining corpus — competent, not native.

We take the opposite path. Structural features — morphological roles, noun-class agreement, argument composition — are explicit inputs during pretraining. The model isn’t hoping to pick up grammar; it’s taught it. That’s why a 15M-parameter model pretrained on isiZulu alone matches or beats a 270M competitor that was fine-tuned in-language on isiZulu, isiXhosa, and Sesotho — results that wouldn’t be possible if everyone already modelled structure well enough.

On English, the giant models win. We never claimed to beat them on English. On the 6,000 languages they barely touch, the math changes.

Only Sozisi had zero Swahili in pretraining

MASSIVE Swahili — the commercial case

MASSIVE is the standard intent-classification benchmark (60 intents, Amazon-curated). Only one of the three models below had zero Swahili text in its pretraining — ours. GPT-4o and InkubaLM both ingested Swahili during their pretraining; their “zero-shot” is task-level (no fine-tuning on MASSIVE) but language-level exposure is built in. Our result is language-level zero-shot: the model transferred from isiZulu alone.

Be precise about what “zero-shot” means. In LLM parlance “zero-shot” usually means no task-specific fine-tuning — but the model has still read the target language during its web-scale pretraining. GPT-4o ingested billions of Swahili tokens; InkubaLM was explicitly pretrained on Swahili as one of seven African languages. Our Sozisi result is a stricter claim: the pretraining corpus was isiZulu and nothing else. Swahili appeared for the first time at inference, through a small MLP probe on a frozen backbone.
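The frozen-backbone + MLP-probe protocol can be sketched in a few lines. This is an illustrative mock, not our actual code: `embed` is a hypothetical stand-in for the frozen Sozisi encoder (random features here), and the intent labels are synthetic. The point of the sketch is the shape of the protocol — the backbone never receives gradients; only the small probe head is trained.

```python
# Sketch of the frozen-backbone + MLP-probe protocol (illustrative only).
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def embed(n_utterances: int, dim: int = 128) -> np.ndarray:
    """Placeholder for frozen-backbone features (random vectors here).
    In the real setup this maps target-language utterances to fixed
    feature vectors with no gradient updates to the backbone."""
    return rng.normal(size=(n_utterances, dim))

# Toy intent-classification data: features from the "frozen encoder",
# labels standing in for a target-language train/test split.
X_train, y_train = embed(200), rng.integers(0, 4, size=200)
X_test,  y_test  = embed(50),  rng.integers(0, 4, size=50)

# The only trainable component: a lightweight MLP probe. A head this
# small is what makes per-language adaptation cheap (seconds, CPU-only).
probe = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300, random_state=0)
probe.fit(X_train, y_train)
accuracy = probe.score(X_test, y_test)
print(f"probe accuracy: {accuracy:.3f}")
```

With random features the probe scores near chance; the benchmark numbers below come from replacing `embed` with real encoder features.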

So read the table honestly. InkubaLM is 28× larger than us, pretrained on Swahili plus general web text, and wins the head-to-head by six points. That is the expected result for a specialist model at 28× the scale. What isn’t expected: a 15M model that has never seen Swahili at all lands within six points of it, and beats GPT-4o (1.8T parameters, Swahili all over its pretraining) by 2.6 points. That gap is a function of training budget, not architectural ceiling. More data closes it.

| Model | Parameters | Score | Method |
| --- | --- | --- | --- |
| Sozisi (Bhala AI) | 15M | 73.2% | Language-level zero-shot · pretrained on isiZulu only |
| GPT-4o | ≈1.8T | 70.6% | Task-level zero-shot · Swahili in web pretraining corpus |
| InkubaLM | 422M | 79.2% | Pretrained on Swahili (one of 7 African languages) + web |
Match or beat SOTA on 4 of 8 · 18× smaller · frozen, not fine-tuned

Injongo — 8 Bantu languages, head-to-head

We pretrained on one language (isiZulu). AfroXLMR-76L (270M params) was fine-tuned in-language on each target. We run frozen + a light probe. On four of eight languages we match or beat a model 18× our size. On the languages where AfroXLMR wins, it does so with an 18× parameter advantage and target-language fine-tuning we do not do.

Sozisi parameters
15M
Pretrained on isiZulu only
AfroXLMR-76L parameters
270M
Fine-tuned per target language
Parameter efficiency
18× smaller
We match or beat them on isiXhosa, Sesotho, isiZulu, and KiSwahili

Sozisi vs published SOTA · head-to-head per language

Source: Adelani et al. 2025 (ACL Long Paper, arXiv:2502.09814). Baselines fine-tuned on Injongo training data per language; Sozisi uses frozen backbone + lightweight probe.

| Language | Sozisi | Public SOTA | SOTA Model | Δ | Status |
| --- | --- | --- | --- | --- | --- |
| isiXhosa | 98.3% | 97.3% | AfroXLMR | +1.0pp | SOTA |
| KiSwahili | 97.9% | 98.1% | AfroXLMR-76L | −0.2pp | Tied |
| Sesotho | 95.1% | 86.8% | AfroXLMR-76L | +8.3pp | SOTA |
| isiZulu | 93.1% | 89.8% | AfroXLMR-76L | +3.3pp | SOTA |
| ChiShona | 90.5% | 95.3% | AfroXLMR | −4.8pp | Behind |
| Lingala | 89.5% | 94.6% | AfroXLMR-76L | −5.1pp | Behind |
| Luganda | 81.7% | 91.3% | AfroXLMR-76L | −9.6pp | Behind |
| Kinyarwanda | 78.3% | 89.4% | AfroXLMR-76L | −11.1pp | Behind |

SOTA on 3 of 8 languages, plus a tie on KiSwahili · average across 8 languages: 90.5%

Full Zulu leaderboard (incl. LLMs)

| Model | Parameters | Accuracy | Method |
| --- | --- | --- | --- |
| Sozisi (Bhala AI) | 15M | 93.1% | Frozen + probe |
| AfroXLMR-76L | 270M | 89.8% | Fine-tuned in-language |
| AfroXLMR | 270M | 89.0% | Fine-tuned in-language |
| mT5-Large | 1.2B | 82.4% | Fine-tuned in-language |
| XLM-R | 270M | 74.7% | Fine-tuned in-language |
| GPT-4o | ~1.8T | 71.2% | Zero-shot |
| Gemini 1.5 Pro | Undisclosed | 68.7% | Zero-shot |
Universal Transfer

Global Reach (Zero-Shot)

High accuracy on languages with zero target-language training data. This is the proof of a Universal Morphological Manifold.

| Language | Family | MASSIVE Accuracy |
| --- | --- | --- |
| Swahili | Bantu | 71.0% |
| Urdu | Indo-Aryan | 66.0% |
| Mongolian | Mongolic | 64.0% |
| Tagalog | Austronesian | 62.3% |
| Korean | Koreanic | 61.7% |
| Amharic | Semitic | 60.9% |
| Hindi | Indo-Aryan | 60.3% |
| Javanese | Austronesian | 58.6% |
| Japanese | Japonic | 56.5% |
| Tamil | Dravidian | 56.2% |
| Kannada | Dravidian | 53.6% |
| Telugu | Dravidian | 50.5% |
77–91% of Fine-Tuned SOTA

Sentiment Analysis (Cross-Language)

Our model understands emotion and sentiment in languages it was never trained on, achieving near-SOTA results without any fine-tuning.

| Language | Sozisi (15M, frozen) | SemEval SOTA (270M+, fine-tuned) | % of SOTA |
| --- | --- | --- | --- |
| Swahili | 53.7% wF1 | 60.5% wF1 | 89% |
| Xitsonga | 50.0% wF1 | 54.9% wF1 | 91% |
| Igbo | 65.7% wF1 | 80.8% wF1 | 81% |
| Yoruba | 57.2% wF1 | 68.0% wF1 | 84% |
| Hausa | 61.9% wF1 | 80.9% wF1 | 77% |
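The "wF1" figures are support-weighted F1: per-class F1 averaged with weights proportional to each class's true count, so frequent sentiment classes dominate the score. A minimal sketch with made-up 3-class labels (not the benchmark data):

```python
# Weighted F1 on toy sentiment labels — illustrative data only.
from sklearn.metrics import f1_score

y_true = ["pos", "pos", "neg", "neg", "neg", "neu", "neu", "pos"]
y_pred = ["pos", "neg", "neg", "neg", "pos", "neu", "pos", "pos"]

# average="weighted": per-class F1, weighted by each class's support.
wf1 = f1_score(y_true, y_pred, average="weighted")
print(f"weighted F1: {wf1:.3f}")
```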
96.6% Token Accuracy

Named Entity Recognition (MasakhaNER, isiZulu)

Frozen Sozisi backbone + lightweight CRF head on the standard MasakhaNER benchmark. Real entity extraction across people, places, organizations, and dates.

96.6%
Token Accuracy
77.7%
Span F1
78.2%
Precision
77.2%
Recall
| Entity Type | Description | F1 | Support |
| --- | --- | --- | --- |
| PER | People | 83.2% | 888 |
| DATE | Dates | 80.8% | 318 |
| LOC | Locations | 75.8% | 337 |
| ORG | Organizations | 65.4% | 373 |
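Token accuracy and span F1 measure different things, which is why 96.6% token accuracy coexists with 77.7% span F1: a predicted entity counts toward span F1 only if its type and exact boundaries both match a gold entity, so one wrong boundary token sinks the whole span. A self-contained sketch on toy spans (not MasakhaNER data):

```python
# Span-level F1: exact (type, start, end) matches only. Toy data.
def span_f1(gold: set, pred: set) -> float:
    """Each span is a (type, start, end) tuple; exact matches count."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {("PER", 0, 2), ("LOC", 5, 6), ("ORG", 8, 11)}
pred = {("PER", 0, 2), ("LOC", 5, 7), ("ORG", 8, 11)}  # LOC boundary off by one

print(f"span F1: {span_f1(gold, pred):.3f}")
```

Here 11 of 12 entity tokens might be tagged correctly, yet only 2 of 3 spans count, so span F1 is 0.667.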
15M vs 422M · 28× smaller

Cross-Family MASSIVE: Sozisi vs InkubaLM

Apples-to-apples protocol: an MLP probe trained on each target language’s train split, evaluated on the held-out test split. InkubaLM (422M) is 28× larger and pretrained on 7 African languages + general web. Sozisi (15M) trained on isiZulu alone. We still win 2 of 17, and the gap shrinks fast as we scale.

71.5%
InkubaLM 422M avg
54.3%
Sozisi 15M avg
2/17
Sozisi wins
| Language | Family | InkubaLM (422M) | Sozisi (15M) | Δ |
| --- | --- | --- | --- | --- |
| Urdu | Indo-Aryan | 65.6% | 66.0% | +0.4 |
| Tamil | Dravidian | 54.3% | 56.2% | +1.9 |
| Mongolian | Mongolic | 64.7% | 64.0% | −0.7 |
| Hindi | Indo-Aryan | 65.2% | 60.3% | −4.9 |
| Swahili | Bantu | 79.2% | 71.0% | −8.2 |
| Kannada | Dravidian | 62.9% | 53.6% | −9.3 |
| Amharic | Semitic | 70.9% | 60.9% | −10.0 |
| Telugu | Dravidian | 62.2% | 50.5% | −11.7 |
| Korean | Koreanic | 73.6% | 61.7% | −11.9 |
| Tagalog | Austronesian | 78.0% | 62.3% | −15.7 |
| Japanese | Japonic | 72.8% | 56.5% | −16.3 |
| Javanese | Austronesian | 79.1% | 58.6% | −20.5 |
| Afrikaans | Germanic | 78.9% | 50.1% | −28.8 |
| Finnish | Uralic | 76.5% | 46.0% | −30.5 |
| Turkish | Turkic | 77.1% | 43.2% | −33.9 |
| Azerbaijani | Turkic | 77.8% | 33.7% | −44.1 |
| Hungarian | Uralic | 77.5% | 28.4% | −49.1 |

Sozisi wins Urdu and Tamil via our transliteration pipeline (CV mapping). On Swahili, InkubaLM was pretrained on the language. On Uralic, Turkic, and Germanic languages, InkubaLM’s web-scale pretraining dominates. Sozisi is still improving — more training data and per-language Sozisi caches close the gap fast.
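The general idea behind a CV-mapping transliteration step can be illustrated in a few lines. Everything below is a toy sketch, not our production pipeline: the mapping table and the virama/vowel-sign handling are hypothetical, chosen only to show the principle of rewriting a non-Latin script into consonant-vowel units so the model sees an orthography shaped like the CV-regular one it was pretrained on.

```python
# Toy CV transliteration sketch (hypothetical mapping, not our pipeline).
# Maps a few Devanagari letters to Latin consonant-vowel syllables.
DEVANAGARI_TO_LATIN = {"न": "na", "म": "ma", "स": "sa", "त": "ta"}

def cv_transliterate(text: str) -> str:
    """Rewrite text into CV units; unknown characters pass through."""
    out = []
    for ch in text:
        if ch == "\u094d":            # virama: strip the inherent vowel
            if out and out[-1].endswith("a"):
                out[-1] = out[-1][:-1]
        elif ch == "\u0947":          # vowel sign e: replace inherent "a"
            if out and out[-1].endswith("a"):
                out[-1] = out[-1][:-1] + "e"
        else:
            out.append(DEVANAGARI_TO_LATIN.get(ch, ch))
    return "".join(out)

print(cv_transliterate("नमस्ते"))  # → namaste
```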

Phylogenetic Transfer

Cross-Family Next-Word Prediction (Zero-Shot)

Trained only on isiZulu. Transfers strongly to sister Nguni languages and neighboring Bantu families — phylogenetic signal preserved in the embedding space.

87.5%
isiNdebele
Near-perfect transfer
80.5%
Yoruba
Cross-family success
77.7%
Igbo
Cross-family success
69.3%
Hausa
Zero-shot transfer
69.1%
Swahili
Cross-dialect success
83.2%
Nguni (zero-shot)
71.1%
Other Bantu
75.8%
Non-Bantu African
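The transfer percentages above are next-word prediction accuracy: the share of positions where the model's top-1 prediction equals the actual next token. A minimal sketch with synthetic logits standing in for model outputs:

```python
# Top-1 next-word accuracy on synthetic logits (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
vocab, positions = 100, 8

logits = rng.normal(size=(positions, vocab))   # stand-in model outputs
targets = logits.argmax(axis=1).copy()         # start from perfect targets...
targets[:2] = (targets[:2] + 1) % vocab        # ...then corrupt 2 of 8

# Accuracy = fraction of positions where argmax(logits) == target.
accuracy = float((logits.argmax(axis=1) == targets).mean())
print(f"next-word accuracy: {accuracy:.3f}")   # 6/8 = 0.750
```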
9 Writing Systems

Every Script, Every Language

From Arabic to Hangul, our system unlocks any writing system in seconds. No retraining, no new data, just instant access.

| Language | Script | Before | With Sozisi | Improvement |
| --- | --- | --- | --- | --- |
| Urdu | Arabic | 10.7% | 66.0% | +55.3pp |
| Mongolian | Cyrillic | 11.7% | 64.0% | +52.3pp |
| Korean | Hangul | 10.5% | 61.7% | +51.2pp |
| Amharic | Ethiopic | 10.1% | 60.9% | +50.8pp |
| Hindi | Devanagari | 11.1% | 60.3% | +49.2pp |
| Tamil | Tamil | 9.4% | 56.2% | +46.8pp |
6/6 Emerged

Emergent Intelligence

Capabilities that emerged naturally from our structural approach, without being explicitly taught.

0.561
Semantic clustering
0.633
Number sense
0.443
Negation understanding
0.297
Temporal reasoning
0.325
Disambiguation
0.302
Intent clustering
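These are unitless cluster-quality scores; the exact metric is part of our forthcoming paper. One standard choice for quantifying how well embeddings separate into semantic groups is the silhouette coefficient (range −1 to 1), sketched below on toy vectors. This is an assumption for illustration, not our evaluation protocol.

```python
# Silhouette coefficient on toy embedding clusters (illustrative only).
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Two synthetic "semantic" clusters of 16-dim embedding vectors.
a = rng.normal(loc=0.0, scale=1.0, size=(30, 16))
b = rng.normal(loc=3.0, scale=1.0, size=(30, 16))
X = np.vstack([a, b])
labels = np.array([0] * 30 + [1] * 30)

# Silhouette: how much closer each point is to its own cluster
# than to the nearest other cluster, averaged over all points.
score = silhouette_score(X, labels)
print(f"silhouette: {score:.3f}")
```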

Thesis

Linguistic inductive bias substitutes for scale.

The dominant approach to AI assumes more is better: more parameters, more data, more compute. Our results suggest a different path. With the right structural priors, a 15M-parameter model captures language understanding that trillion-parameter systems still miss.

Sozisi is built on a phonotactic morphology-first tokenizer paired with architectural choices that exploit the compositional nature of language itself. Full methodology will appear in our forthcoming paper.

Build with Sozisi

API available now. On-device SDK coming soon.