Update README.md
README.md
CHANGED
@@ -1,10 +1,45 @@
# LILT

**We build the multilingual layer for English-first AI.**
Custom evals, benchmarks, and RL environments across 200+ languages.

Most agent and coding benchmarks ship in English. We build the audited
non-English counterparts, and the multilingual environments models train
on, so labs and enterprises can measure and improve what their models
actually do in the languages their users speak.

### Why we publish here
Open releases make it easier for the community to stress-test our work,
reproduce our scores, and extend our benchmarks to new languages. Every
artifact is paired with a paper, a scoring script, and explicit limitations.

### What you'll find here
- **Benchmarks & datasets**: multilingual evaluations across coding,
  agents, tool use, long context, instruction following, and domain QA.
  Audited splits across our priority languages, scalable to 200+.
- **RL environments**: multilingual training environments for agentic
  and tool-using models, with reproducible scoring.
- **Leaderboards & scoring**: Gradio Spaces with reproducible submission flows.
- **Baselines**: frontier-model scores published with exact prompts,
  decoding params, and dated snapshots.
- **Papers**: methodology, audit workflow, and findings.

### Currently featured
**GAIA-v2-LILT**: multilingual agent benchmark across AR / DE / HI / KO / PT-BR.
+20.7pp average gain post human-audit on frontier agents. Dataset, paper, and
leaderboard linked in the pinned collection.

**LILTBench Hackathon (Jun 15–21, 2026)**: one-week community challenge to
crowdsource non-English coding tasks that break Claude Opus 4.6 in Terminal-Bench.
Co-hosted with The AI Collective. [Sign up](https://luma.com/55v3wgi9).

### Links
- Website: <https://lilt.com>
- Multilingual benchmarks: <https://lilt.com/products/multilingual-benchmarks>
- AI for Frontier Labs: <https://lilt.com/ai-for-frontier-labs>
- GitHub: <https://github.com/lilt>
- Contact (data services): <https://lilt.com/contact/ai-data-services>

### Citation
If you use one of our datasets or benchmarks, please cite the corresponding paper
linked on each dataset card.