israel committed on
Commit 0737409 · verified · 1 Parent(s): 5773706

Update README.md

Files changed (1):
  1. README.md +42 -56

README.md CHANGED
@@ -5,80 +5,66 @@ language:
  pipeline_tag: text-generation
  ---

- # Walia Instruction Dataset for Amharic

- This repository contains instruction-tuning datasets used in [Walia-LLM](https://aclanthology.org/2024.findings-emnlp.25/), a fine-tuned LLaMA-2 model for the Amharic language. The dataset was carefully constructed by integrating task-specific and generative datasets, and it supports a variety of natural language processing tasks in Amharic.

- ## Dataset Summary

- The Walia dataset is designed to enhance large language models for the Amharic language by:

- - Converting existing task-specific datasets (e.g., sentiment analysis, QA, NER) into instruction format.
- - Creating new generative datasets (e.g., poem generation, religious lyrics, story generation).
- - Translating English instruction datasets (e.g., Alpaca, Dolly) into Amharic for comparative studies.

- Each data point follows a structured instruction format with:
- - `"instruction"` – a natural language task description,
- - `"input"` – optional input text for the task,
- - `"output"` – the expected model output in Amharic.

- ## Supported Tasks

- | Task                     | Source/Type | Notes                      |
- |--------------------------|-------------|----------------------------|
- | Sentiment Analysis       | AfriSenti   | 3-class sentiment          |
- | Named Entity Recognition | MasakhaNER  | Personal name extraction   |
- | News Classification      | MasakhaNews | Multilingual topic classes |
- | QA                       | AmharicQA   | Wikipedia-based            |
- | Summarization            | XL-Sum      | Amharic summaries          |
- | Machine Translation      | NLLB, WMT19 | Both directions supported  |
- | Poem/Lyrics/Story Gen    | Custom      | Sourced from the web/Telegram |
- | Spelling Correction      | Synthetic   | Character perturbations    |
- ## Dataset Structure

- ```json
- {
-   "instruction": "Translate the following sentence to Amharic.",
-   "input": "Hello, how are you?",
-   "output": "ሰላም፣ እንዴት ነህ?"
- }
- ```
-
- ## Data Statistics

- - ~122,000 instruction samples for training
- - ~15,000 for validation and test
- - 16+ task types and instruction templates
- - All responses are in Amharic (except source text in MT)
- ## How to Use

- You can load the dataset using the Hugging Face `datasets` library:

  ```python
- from datasets import load_dataset
-
- dataset = load_dataset("EthioNLP/walia-amharic-instructions")
- print(dataset["train"][0])
- ```

- ## Applications

- - Supervised fine-tuning (SFT) of LLMs for Amharic
- - Cross-lingual instruction tuning experiments
- - Evaluation of generative capabilities in low-resource languages
-
- ## Related Models
-
- The dataset is used to fine-tune:
- - [`EthioNLP/walia-llama-2`](https://huggingface.co/EthioNLP/walia-llama-2)
- - Other LLaMA variants for Amharic
  ## Citation

- Please cite the following paper if you use this dataset:
-
  ```bibtex
  @inproceedings{azime-etal-2024-walia,
  title = "Walia-{LLM}: Enhancing {A}mharic-{LL}a{MA} by Integrating Task-Specific and Generative Datasets",
@@ -93,4 +79,4 @@ Please cite the following paper if you use this dataset:
  doi = "10.18653/v1/2024.findings-emnlp.25",
  pages = "432--444"
  }
- ```
 
  pipeline_tag: text-generation
  ---

+ # Walia-LLM: Fine-Tuned LLaMA-2 for Amharic

+ `Walia-LLM` is a fine-tuned LLaMA-2 model for the Amharic language, created by instruction tuning on task-specific and generative datasets. It is part of our effort to adapt and improve LLMs for low-resource languages.

+ This model was introduced in the EMNLP 2024 Findings paper:
+ > [Walia-LLM: Enhancing Amharic-LLaMA by Integrating Task-Specific and Generative Datasets](https://aclanthology.org/2024.findings-emnlp.25/)

+ ## Model Details

+ - Base model: LLaMA-2
+ - Fine-tuning method: supervised fine-tuning (SFT) with LoRA
+ - Language: Amharic
+ - Tasks:
+   - Sentiment analysis
+   - Question answering
+   - Named entity recognition
+   - News classification
+   - Summarization
+   - Machine translation
+   - Poem/story/lyrics generation
+   - Spelling correction
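
The spelling-correction data listed above is synthetic, built by character perturbations. As a hedged sketch only (the `perturb` helper is hypothetical and not the authors' actual procedure), (noisy, clean) training pairs might be generated like this:

```python
import random

def perturb(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit (delete, swap, or duplicate).
    Hypothetical illustration of 'character perturbations'; not the
    exact procedure used to build the Walia data."""
    if len(word) < 2:
        return word  # too short to perturb safely
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "swap", "duplicate"])
    if op == "delete":
        return word[:i] + word[i + 1:]          # drop character i
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # transpose i, i+1
    return word[:i] + word[i] + word[i:]        # duplicate character i

rng = random.Random(0)
clean = "ሰላም"
noisy = perturb(clean, rng)
print((noisy, clean))  # a (corrupted, correct) pair for spelling correction
```

Pairs like `(noisy, clean)` can then be wrapped in an instruction template asking the model to restore the correct spelling.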

+ ## Training Data

+ The model was trained on a custom instruction dataset derived from:
+ - Existing NLP benchmarks (e.g., AfriSenti, AmharicQA, MasakhaNER, MasakhaNews, XL-Sum)
+ - Manually collected generative datasets (e.g., religious lyrics, stories, poems)
+ - Translated instruction datasets (e.g., Alpaca, Dolly)

+ See [EthioNLP/walia-amharic-instructions](https://huggingface.co/datasets/EthioNLP/walia-amharic-instructions) for the dataset used.
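
Records in that dataset use the `instruction` / `input` / `output` schema shown on the dataset card. A minimal sketch of turning one record into a single SFT prompt string (the `to_prompt` template is a hypothetical illustration, not the paper's exact format):

```python
# Sample record in the dataset's schema (adapted from the dataset card).
record = {
    "instruction": "Translate the following sentence to Amharic.",
    "input": "Hello, how are you?",
    "output": "ሰላም፣ እንዴት ነህ?",
}

def to_prompt(rec: dict) -> str:
    """Concatenate one record into a supervised fine-tuning example.
    The concrete template is an assumption for illustration only."""
    if rec.get("input"):
        return f"{rec['instruction']}\n{rec['input']}\n{rec['output']}"
    return f"{rec['instruction']}\n{rec['output']}"

print(to_prompt(record))
```

Records without an `input` field collapse to just the instruction followed by the expected output.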

+ ## Intended Use

+ This model is intended for:
+ - Research on instruction tuning in low-resource languages
+ - Generative NLP tasks in Amharic
+ - Evaluating multilingual LLM capabilities

+ ## Limitations

+ - Some generative outputs may be verbose or imprecise.
+ - Limited understanding of highly specific Amharic poetic or lyrical structures.
+ - Spelling-correction and NER performance is still under exploration.
 
+ ## Example Usage

  ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM

+ model = AutoModelForCausalLM.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")
+ tokenizer = AutoTokenizer.from_pretrained("EthioNLP/Amharic-LLAMA-all-data")

+ prompt = "ስለ አማርኛ ቋንቋ መግለጫ አቅርብ።"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=100)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
 
 
 
 

  ## Citation

  ```bibtex
  @inproceedings{azime-etal-2024-walia,
  title = "Walia-{LLM}: Enhancing {A}mharic-{LL}a{MA} by Integrating Task-Specific and Generative Datasets",

  doi = "10.18653/v1/2024.findings-emnlp.25",
  pages = "432--444"
  }
+ ```