---
license: llama3.1
language:
- uz
- en
base_model: meta-llama/Llama-3.1-8B-Instruct
library_name: transformers
tags:
- llama
- uzbek
- uzbekllm
- uzbeknlp
- text-generation
- translation
- summarization
- question-answering
- tokenizer
datasets:
- HuggingFaceFW/fineweb-2
- tahrirchi/uz-crawl
- yakhyo/uz-wiki
- wikipedia
- tatsu-lab/alpaca
- behbudiy/alpaca-cleaned-uz
- UAzimov/uzbek-instruct-llm
- behbudiy/translation-instruction
metrics:
- bleu
- comet
- accuracy
pipeline_tag: text-generation
---

### Model Description

This is the 8B-parameter version of our Uzbek-optimized Llama series. Also check out our other models:

* **[uzlm/alloma-1B-Instruct](https://huggingface.co/beruniy/Llama-3.2-1B-Instruct-Uz)**
* **[uzlm/alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)**

---

Our **uzlm/alloma-8B-Instruct** model was continually pretrained with a context length of 4096 tokens on 3.6B tokens (67% English, 33% Uzbek) and then instruction-tuned with SFT. Our customized tokenizer averages 1.7 tokens per Uzbek word versus ~3.5 in the original Llama models, which means roughly 2x faster inference and a longer effective context length on Uzbek text.

## Methodology: Efficient Vocabulary Adaptation for Uzbek

The primary motivation behind our technical approach is to create a model with a more efficient tokenizer for the Uzbek language. This yields both faster inference and a longer effective context length when processing Uzbek text, since fewer tokens are needed to represent the same amount of information.

To avoid the prohibitive cost of training from scratch, we adapted the powerful meta-llama/Llama-3.1 base model using an in-place vocabulary replacement strategy. We identified less relevant non-ASCII tokens in the original vocabulary and replaced them with our custom Uzbek tokens. This was done without altering the model's architecture or total vocabulary size, carefully merging the new Uzbek BPE rules while preserving the original English ones.

To give the new tokens a meaningful starting point for training, we initialized their embeddings using subtoken averaging: each new Uzbek token was broken down by the original tokenizer, and its new embedding was created by averaging the embeddings of its subtokens. This method enabled highly efficient continual pretraining on our bilingual dataset, resulting in a model fully optimized for Uzbek.
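To make the subtoken-averaging step concrete, here is a minimal sketch of how such an initialization could look with `transformers`, assuming a vocabulary slot has already been selected for replacement. The helper name, the example token, and the slot ID are hypothetical; this is not our actual training code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-3.1-8B-Instruct"
tok = AutoTokenizer.from_pretrained(base_id)  # the *original* tokenizer
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

emb = model.get_input_embeddings().weight  # (vocab_size, hidden_dim)

def init_replaced_slot(new_token: str, slot_id: int) -> None:
    """Initialize the embedding of a replaced vocabulary slot with the mean
    of the embeddings the original tokenizer assigns to the new token."""
    sub_ids = tok(new_token, add_special_tokens=False)["input_ids"]
    with torch.no_grad():
        emb[slot_id] = emb[sub_ids].mean(dim=0)
    # For models with untied weights, the corresponding row of
    # model.get_output_embeddings().weight can be initialized the same way.

# Hypothetical example: place an Uzbek token into a slot chosen for replacement.
init_replaced_slot("o'zgarish", slot_id=123456)
```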
---

### Benchmarks 1B, 3B

| Model | BLEU Uz→En (zero-shot) | BLEU En→Uz (zero-shot) | COMET Uz→En | COMET En→Uz | Uzbek Sentiment Analysis | Uzbek News Classification | MMLU-uz (zero-shot) | MMLU (English) (zero-shot) |
| --------------------------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| **[Llama-3.2 1B Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)** | 3.62 | 0.44 | 56.72 | 35.52 | 54.77 | 42.16 | 24.37 | 38.15 |
| **[alloma-1B-Instruct](https://huggingface.co/beruniy/Llama-3.2-1B-Instruct-uz)** | 16.64 | 10.20 | 81.42 | 82.73 | 63.49 | 10.75 | 26.27 | 26.29 |
| **[Llama-3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)** | 11.91 | 2.54 | 71.96 | 55.62 | 56.01 | 70.60 | 31.88 | 52.04 |
| **[alloma-3B-Instruct](https://huggingface.co/beruniy/Llama-3.2-3B-Instruct-Uz)** | 25.19 | 14.66 | 85.08 | 86.82 | 81.64 | 41.56 | 39.30 | 45.91 |

### Benchmarks 8B

| Model | BLEU Uz→En (zero-shot) | BLEU En→Uz (zero-shot) | COMET Uz→En | COMET En→Uz | Uzbek Sentiment Analysis | Uzbek News Classification | MMLU-uz (zero-shot) | MMLU (English) (zero-shot) |
| --------------------------------- | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
| **[Llama-3.1 8B Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)** | 24.23 | 8.28 | 83.12 | 82.22 | 69.77 | 73.63 | 40.51 | 60.59 |
| **[Behbudiy Mistral 7B Uz](https://huggingface.co/behbudiy/Mistral-7B-Instruct-Uz)** | 28.09 | 15.96 | 86.26 | 88.42 | 83.41 | 55.51 | 36.56 | 47.09 |
| **[Behbudiy Llama 8B Uz](https://huggingface.co/behbudiy/Llama-3.1-8B-Instruct-Uz)** | 27.08 | 13.29 | 84.76 | 85.62 | 81.66 | 68.22 | 41.28 | 59.18 |
| **[alloma-8B-Instruct](https://huggingface.co/beruniy/Llama-3.1-8B-Instruct-Uz)** | 31.16 | 15.58 | 87.24 | 87.64 | 82.66 | 65.65 | 41.89 | 53.35 |

The results show that our Uzbek-optimized models consistently outperform their base counterparts on the FLORES+ Uz-En / En-Uz translation benchmarks (BLEU and COMET) and on Uzbek sentiment analysis. On the MMLU benchmark, which measures general language understanding across multiple tasks in English, and on the news classification task, our Uzbek-optimized models show a slight decline, which we attribute to catastrophic forgetting of the original English instruction-following behavior. (The official Llama models' MMLU scores may differ from ours due to our evaluation method; see the evaluation details below.)

We're eager to see how these models will contribute to the Uzbek open-source ecosystem and be used by the Uzbek 🇺🇿 community. 🚀

## How to use

The uzlm/alloma-8B-Instruct model can be used with `transformers` as shown below. We recommend preprocessing Uzbek input by replacing apostrophe-like characters (') with the placeholder sequence "APST" to benefit from the model's lower tokenizer fertility; the placeholder is converted back to an apostrophe after generation.
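To see the fertility difference directly, a quick sanity check along these lines can be run (a minimal sketch; the sample sentence is arbitrary and exact token counts may vary):

```python
import re

from transformers import AutoTokenizer

# Rough fertility check: tokens per word on an arbitrary Uzbek sentence.
APOSTROPHES = r"[’‘‚‛ʻʼʽʾʿˈˊˋˌˍ']"
text = "Bugun havo juda yaxshi va biz bog'da sayr qilamiz."
words = len(text.split())

for model_id in ("meta-llama/Llama-3.1-8B-Instruct", "uzlm/alloma-8B-Instruct"):
    tok = AutoTokenizer.from_pretrained(model_id)
    # Apply the recommended APST preprocessing only for the Uzbek-optimized model.
    sample = re.sub(APOSTROPHES, "APST", text) if "alloma" in model_id else text
    n = len(tok(sample, add_special_tokens=False)["input_ids"])
    print(f"{model_id}: {n} tokens ({n / words:.2f} tokens/word)")
```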
### Use with transformers

```python
import re

import langid  # pip install langid
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
DTYPE = torch.bfloat16
MODEL_ID = "uzlm/alloma-8B-Instruct"
# Apostrophe-like characters that are replaced with the "APST" placeholder.
PATTERN = r"[’‘‚‛ʻʼʽʾʿˈˊˋˌˍ']"

tok = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
tok.padding_side = "left"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=DTYPE, device_map="auto"
)

EOT = "<|eot_id|>"
SYSTEM = (
    f"{tok.bos_token}<|start_header_id|>system<|end_header_id|>\n"
    "You are a helpful assistant<|eot_id|>"
)

def prompt(user: str) -> str:
    return (
        SYSTEM
        + "<|start_header_id|>user<|end_header_id|>\n"
        + f"{user}{EOT}"
        + "<|start_header_id|>assistant<|end_header_id|>"
    )

def generate(user: str, max_new: int = 256) -> str:
    # Replace apostrophes with "APST" in non-English (i.e., Uzbek) input.
    lang, _ = langid.classify(user)
    clean_text = re.sub(PATTERN, "APST", user) if lang != "en" else user
    enc = tok(prompt(clean_text), return_tensors="pt").to(DEVICE)
    out = model.generate(
        **enc,
        max_new_tokens=max_new,
        bos_token_id=tok.bos_token_id,
        eos_token_id=tok.convert_tokens_to_ids(EOT),
        pad_token_id=tok.pad_token_id,
        do_sample=False,
    )
    txt = tok.decode(out[0], skip_special_tokens=False)
    txt = txt.split("<|start_header_id|>assistant<|end_header_id|>", 1)[1]
    # Keep only the assistant turn and restore apostrophes.
    return txt.split(EOT, 1)[0].replace("APST", "'").strip()

print(generate("Menga Alisher Navoiy haqida aytib ber."))
```

## Information on Evaluation Method

To evaluate on the translation task, we used the FLORES+ Uz-En / En-Uz datasets. We used the following prompt for zero-shot Uz-En evaluation of both the base model and the Uzbek-optimized model (for En-Uz evaluation, we swapped the words "English" and "Uzbek"):

```python
prompt = (
    f"Input: {clean_text} \n\nYour task is to accurately translate the given Uzbek text into English.\n"
    "Output only the English translation, without any additional comments.\n"
    "\nPlease translate the following Uzbek text into English."
)
```

To assess the model's ability in Uzbek sentiment analysis, we used the **risqaliyevds/uzbek-sentiment-analysis** dataset (see also the **behbudiy/uzbek-sentiment-analysis** dataset). We used the following prompt for the evaluation:

```python
prompt = f'''Input: {clean_text} \n\nGiven the following text, determine the sentiment as either 'Positive' or 'Negative.' Respond with only the word 'Positive' or 'Negative' without any additional text or explanation.
'''
```

For Uzbek news classification, we used the **risqaliyevds/uzbek-zero-shot-classification** dataset and asked the model to predict the category of the news article using the following prompt:

```python
prompt = f'''Input: {clean_text}\n\nClassify the given news article in Uzbek.
0 - Siyosat - If the text is about politics.
1 - Iqtisodiyot - If the text is about the economy.
2 - Texnologiya - If the text is about technology.
3 - Sport - If the text is about sports.
4 - Madaniyat - If the text is about culture.
5 - Salomatlik - If the text is about health.
6 - Oila va Jamiyat - If the text is about family and society.
7 - TaAPSTlim - If the text is about education.
8 - Ekologiya - If the text is about ecology.
9 - Xorijiy Yangiliklar - If the text is about foreign news.
Print only one digit ID of the corresponding class.
'''
```

On MMLU, we performed zero-shot evaluation using the following **template** and extracted the first token generated by the model to measure accuracy:

```python
template = "Given the above question and choices, choose the single best answer (A, B, C, or D). Respond with only one letter."
```
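For illustration, a minimal sketch of this first-token scoring might look as follows, reusing `tok`, `model`, `prompt`, and `DEVICE` from the usage example above and `template` from the snippet just shown; the helper and the question formatting are hypothetical, not our exact evaluation harness.

```python
# Hypothetical first-token MMLU scoring (illustrative sketch, not our harness).
def score_mmlu_item(question: str, choices: list[str], answer: str) -> bool:
    # Lay out the question and lettered choices above the instruction template.
    body = question + "\n" + "\n".join(
        f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)
    )
    enc = tok(prompt(body + "\n" + template), return_tensors="pt").to(DEVICE)
    out = model.generate(**enc, max_new_tokens=1, do_sample=False)
    # Decode only the newly generated token and compare it to the gold letter.
    first_token = tok.decode(out[0, enc["input_ids"].shape[1]:]).strip()
    return first_token.upper().startswith(answer)  # answer is one of "A".."D"
```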
## Acknowledgements

This project was developed by the teams at **[Examy.me](https://examy.me/)** and **[Teamwork.uz](https://teamwork.uz/)**. Their collaboration and resources were essential to the creation and success of the `alloma` model series.

## More

For more details and examples, refer to the base model below:
https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct