allmalab/aLLMA-2-tokenizer


jafarisbarov committed on Aug 10, 2025

Commit 46a0946 (verified) · 1 Parent(s): ec0fdf1

Update README.md

Files changed (1)

README.md +6 -2


@@ -1,6 +1,10 @@
 ---
 library_name: transformers
-tags: []
+license: apache-2.0
+datasets:
+- HuggingFaceFW/fineweb-2
+language:
+- az
 ---
 # A monolingual tokenizer for Azerbaijani trained on `azj_Latn` subset of FineWeb-2 corpus.
@@ -32,4 +36,4 @@ tags: []
     pages = "18--28",
     abstract = "The emergence of multilingual large language models has enabled the development of language understanding and generation systems in Azerbaijani. However, most of the production-grade systems rely on cloud solutions, such as GPT-4. While there have been several attempts to develop open foundation models for Azerbaijani, these works have not found their way into common use due to a lack of systemic benchmarking. This paper encompasses several lines of work that promote open-source foundation models for Azerbaijani. We introduce (1) a large text corpus for Azerbaijani, (2) a family of encoder-only language models trained on this dataset, (3) labeled datasets for evaluating these models, and (4) extensive evaluation that covers all major open-source models with Azerbaijani support."
 }
-```
+```
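
The metadata this commit adds sits in the YAML front matter between the two `---` lines at the top of README.md. As a rough illustration of what that block resolves to, here is a minimal sketch that handles only the two shapes used in this header (`key: value` and a key followed by `- item` lines). The helper `parse_front_matter` is hypothetical and written for this example; it is not part of the Hub tooling and is not a general YAML parser.

```python
# Front matter as it reads after this commit (copied from the added lines).
README_HEAD = """\
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceFW/fineweb-2
language:
- az
---
"""


def parse_front_matter(text):
    """Parse a tiny YAML subset: scalar values and flat lists of strings.

    Hypothetical helper for illustration only; real YAML needs a real parser.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---'")

    meta = {}
    current_key = None  # key whose list is currently being filled, if any
    for line in lines[1:]:
        stripped = line.strip()
        if stripped == "---":
            return meta  # closing delimiter reached
        if not stripped:
            continue
        if stripped.startswith("- "):
            if current_key is None:
                raise ValueError("list item without a preceding key")
            meta[current_key].append(stripped[2:].strip())
        else:
            key, _, value = stripped.partition(":")
            key, value = key.strip(), value.strip()
            if value:
                meta[key] = value      # scalar entry, e.g. license
                current_key = None
            else:
                meta[key] = []         # start of a list, e.g. datasets
                current_key = key
    raise ValueError("missing closing '---'")


META = parse_front_matter(README_HEAD)
print(META)
```

Under these assumptions, `license` and `library_name` come out as plain strings, while `datasets` and `language` come out as one-element lists, which mirrors how the diff writes them.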