---
library_name: transformers
tags:
- Bangla
- nlp
- decoder-only
- causal-lm
- Lora
- code-generation
- agentic-ai
- from-scratch
metrics:
- pass@k
- task-completion-rate
model_name: Sheikh-ABF
language: bn
license: mit
---

# Sheikh-ABF: Sheikh Artificial Bangla Foundation

## Model Description

**Sheikh-ABF** is a **decoder-only Transformer language model for Bangla NLP**, developed entirely **from scratch**. The project emphasizes a **Bangla-first approach**, focusing on the unique linguistic and cultural aspects of the Bengali language. The name 'Sheikh' refers to the model's origin in **Bangladesh**, with the aim of providing a foundational LLM for the region.

### Goal
The primary objective is to create a robust base language model capable of advanced **internal reasoning**, moving beyond simple pattern matching to understand and process information more deeply. This base model serves as a strong foundation for future fine-tuning and specialized applications.

### Core Principles
* **No Pre-trained Weights**: Trained entirely from scratch, ensuring a truly native Bangla foundation.
* **Bangla-First Approach**: Optimized for Bangla, addressing its specific linguistic nuances.
* **Internal Reasoning**: Designed to learn explicit 'thought processes' during training via interleaved thinking.
* **Base Model Only**: Focused on providing a general-purpose foundation, not end-use applications.

## Model Architecture

The model is a **decoder-only Transformer**, styled after **GPT-2**, with approximately **60 million parameters**:

* **Layers**: 8
* **Hidden Size (Embedding Dimension)**: 512
* **Attention Heads**: 8
* **Context Length (Maximum Sequence Length)**: 1024 tokens
* **Dropout Rate**: 0.1 (applied to residual connections, embeddings, and attention probabilities)

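For concreteness, these hyperparameters map onto a Hugging Face `GPT2Config` roughly as in the sketch below. The card does not specify the exact configuration class or training code, so treat this as an illustrative assumption rather than the project's implementation.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Illustrative configuration only; the actual training run may use a custom config class.
config = GPT2Config(
    vocab_size=32000,   # SentencePiece BPE vocabulary (see Tokenizer Details)
    n_layer=8,          # decoder layers
    n_embd=512,         # hidden size / embedding dimension
    n_head=8,           # attention heads
    n_positions=1024,   # maximum context length
    resid_pdrop=0.1,    # dropout on residual connections
    embd_pdrop=0.1,     # dropout on embeddings
    attn_pdrop=0.1,     # dropout on attention probabilities
)
model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # rough sanity check of model size
```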
## Tokenizer Details

The tokenizer is a **SentencePiece BPE (Byte-Pair Encoding) tokenizer**, trained exclusively on a **Bangla-only corpus**. It features a **vocabulary size of 32,000** unique tokens and incorporates several mandatory special tokens:

* `<bos>`: Beginning of Sentence
* `<eos>`: End of Sentence
* `<pad>`: Padding token
* `<think>`: Start Thinking (for internal reasoning blocks during training)
* `</think>`: End Thinking

These tokens are consistently used for proper parsing, context handling, and enabling advanced training strategies like loss masking.

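The card does not include the tokenizer training script; the sketch below shows how such a tokenizer could be trained with the `sentencepiece` library. The corpus path and model prefix are placeholders, and only the BPE model type, vocabulary size, and special tokens come from the description above.

```python
import sentencepiece as spm

# Hypothetical corpus path and output prefix; vocab size, BPE type, and special
# tokens are taken from the card.
spm.SentencePieceTrainer.train(
    input="bangla_corpus.txt",
    model_prefix="sheikh_abf_bpe",
    model_type="bpe",
    vocab_size=32000,
    bos_piece="<bos>",
    eos_piece="<eos>",
    pad_piece="<pad>",
    pad_id=3,  # SentencePiece disables the pad token by default, so give it an id
    user_defined_symbols=["<think>", "</think>"],  # reserved for interleaved thinking
)
```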
## Dataset and Mixing Ratios

The training corpus is a blend of three distinct dataset types:

* **70% Raw Bangla Text**: For foundational language modeling and fluency.
* **20% Instruction/QA**: For improving instruction following and question answering capabilities.
* **10% Reasoning**: Incorporates interleaved thinking (`<think>...</think>`) patterns to foster internal reasoning processes.

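One way to realize this ratio is probabilistic interleaving with the `datasets` library, as sketched below. The file names are placeholders, since the card does not name the underlying corpora; only the 70/20/10 split is taken from the list above.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder file names; only the mixing ratio comes from the card.
raw_text    = load_dataset("json", data_files="raw_bangla.jsonl", split="train")
instruct_qa = load_dataset("json", data_files="instruction_qa.jsonl", split="train")
reasoning   = load_dataset("json", data_files="reasoning_think.jsonl", split="train")

mixed = interleave_datasets(
    [raw_text, instruct_qa, reasoning],
    probabilities=[0.7, 0.2, 0.1],      # raw text / instruction-QA / reasoning
    seed=42,
    stopping_strategy="all_exhausted",  # keep sampling until every source is consumed
)
```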
## Interleaved Thinking and Loss Weighting

**Interleaved Thinking** is a core training strategy where explicit 'thought processes' (`<think>...</think>`) are included in the training data to teach the model logical reasoning. During inference, the model is expected to internalize this reasoning and produce direct answers without generating the `<think>` blocks.

To facilitate this, a **differential loss weighting strategy** is applied:
* **Normal Tokens**: Loss weight of 1.0 (emphasizing accurate generation of primary content).
* **`<think>` Tokens**: Loss weight of 0.3 (encouraging internalization of reasoning logic without over-prioritizing explicit generation).

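A minimal sketch of how such differential weighting can be applied to the token-level cross-entropy loss is shown below. It assumes the `<think>`/`</think>` token ids are looked up from the tokenizer and is illustrative only, not the project's actual training loop.

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, think_id, think_end_id, think_weight=0.3):
    """Causal-LM cross-entropy where tokens inside <think>...</think> blocks get
    weight 0.3 and all other tokens keep weight 1.0 (illustrative sketch)."""
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()

    # Mark label positions that fall inside a thinking block (tags included).
    inside = torch.zeros_like(shift_labels, dtype=torch.bool)
    for b in range(shift_labels.size(0)):
        in_think = False
        for t in range(shift_labels.size(1)):
            tok = shift_labels[b, t].item()
            if tok == think_id:
                in_think = True
            inside[b, t] = in_think
            if tok == think_end_id:
                in_think = False

    weights = torch.ones_like(shift_labels, dtype=shift_logits.dtype)
    weights[inside] = think_weight

    per_token = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        reduction="none",
    ).view_as(weights)
    return (per_token * weights).sum() / weights.sum()
```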
## Training Configuration

The base model was trained with efficiency and resource optimization in mind:

* **FP16 (Mixed Precision)**: Reduces memory and speeds up computations.
* **Gradient Checkpointing**: Further reduces memory footprint.
* **Gradient Accumulation Steps**: 8 (effective batch size of 16, with micro-batch size of 2).

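Expressed as Hugging Face `TrainingArguments`, these settings would look roughly like the sketch below. This is an assumption for illustration; the original run may have used a custom training loop, and the output directory is a placeholder.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="sheikh-abf-base",     # hypothetical output directory
    fp16=True,                        # mixed-precision training
    gradient_checkpointing=True,      # trade recomputation for memory
    per_device_train_batch_size=2,    # micro-batch size
    gradient_accumulation_steps=8,    # 2 x 8 = effective batch size of 16
)
```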
## LoRA Fine-Tuning for Coding and Agentic Workflows

This model has been conceptually prepared for **LoRA (Low-Rank Adaptation) fine-tuning**, specifically targeting **coding tasks and agentic workflows**. LoRA allows for efficient adaptation by training only a small fraction of parameters while keeping the base model frozen.

### LoRA Strategy
* **Target Modules**: `c_attn` (query, key, and value projections in the attention mechanism).
* **Rank (`r`)**: 8
* **Scaling Coefficient (`lora_alpha`)**: 16
* **Dropout (`lora_dropout`)**: 0.05

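Using the `peft` library (an assumption; the card does not name an adapter framework), this strategy corresponds to a configuration like the following, where `model` is the base model loaded as in the Usage Instructions below.

```python
from peft import LoraConfig, TaskType, get_peft_model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["c_attn"],  # fused query/key/value projection in GPT-2-style blocks
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)
peft_model = get_peft_model(model, lora_config)  # `model` as loaded in Usage Instructions
peft_model.print_trainable_parameters()          # only the adapter weights are trainable
```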
### Adapter Training Configuration (Conceptual)
* **Learning Rate**: `5e-4` (0.0005)
* **Epochs**: 5 (initial)
* **Effective Batch Size**: 16 (micro-batch of 2, 8 gradient accumulation steps)
* **Scheduler**: Linear warmup (10%) and linear decay.

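These adapter hyperparameters map onto `TrainingArguments` roughly as follows; only the values listed above come from the plan, and the output directory is a placeholder.

```python
from transformers import TrainingArguments

adapter_args = TrainingArguments(
    output_dir="sheikh-abf-lora-adapter",  # placeholder
    learning_rate=5e-4,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    warmup_ratio=0.1,                # 10% linear warmup
    lr_scheduler_type="linear",      # linear decay after warmup
)
```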
## Evaluation Benchmarks (Conceptual)

To assess the LoRA fine-tuned model's performance on specialized tasks, hypothetical benchmarks were considered:

### Coding Tasks
* **Benchmarks**: HumanEval-like (Bangla adaptation), LeetCode-style (simplified Bangla), Code Correction/Refactoring.
* **Metrics**: Functional Correctness (Pass@k), Adherence to Problem Constraints, Code Generation Quality, Safety/Security.

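Pass@k would be computed with the standard unbiased estimator (a general formula, not something defined by this card), sketched below.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k sampled completions is
    correct, given c correct completions out of n generated for a problem."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: 200 samples per problem, 37 pass the unit tests.
print(f"pass@10 = {pass_at_k(n=200, c=37, k=10):.3f}")
```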
### Agentic Workflows
* **Benchmarks**: Simulated Environment Tasks, Tool-Use Scenarios, Multi-step Reasoning Chains.
* **Metrics**: Task Completion Rate, Efficiency of Steps Taken, Correct Use of Tools, Adherence to User Intent, Robustness to Ambiguity.

## Conceptual Benchmark Results

Below are hypothetical performance metrics for the LoRA fine-tuned model on coding and agentic tasks. These illustrate the expected types of evaluation results.

### Coding Task Metrics

![Coding Metrics](coding_metrics.png)

### Agentic Task Metrics

![Agentic Metrics](agentic_metrics.png)

## Usage Instructions

To load and use the fine-tuned Bangla Decoder-Only Transformer model and its tokenizer from the Hugging Face Hub, you can use the `transformers` library.

### Loading the Model and Tokenizer

First, ensure you have the `transformers` and `torch` libraries installed. Then, you can load the model and tokenizer using their `from_pretrained` methods, specifying the `repo_id`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the repository ID on Hugging Face Hub
repo_id = "likhonsheikh/bangla-decoder-only-transformer"

# Load the model
model = AutoModelForCausalLM.from_pretrained(repo_id)
# Ensure the model is in evaluation mode and on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()
print(f"Model loaded from {repo_id} and moved to {device}.")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
print(f"Tokenizer loaded from {repo_id}.")

# Set pad_token_id if not already set (important for generation)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '<pad>'})
    # Resize model embeddings if new tokens were added
    model.resize_token_embeddings(len(tokenizer))
```

### Performing Text Generation

Once the model and tokenizer are loaded, you can use the `model.generate()` method to create new text. It is important to prepend the `<bos>` (beginning of sentence) token to your prompt to signal the start of generation, matching how the model was trained. The model was trained with loss masking for `<think>` tokens, meaning it focuses on generating the surrounding context rather than the content within `<think>` blocks. During inference, if the model generates a `<think>` token, it will typically produce an empty thought or move past it, since it was trained not to predict that content explicitly.

```python
# Example prompt for text generation
prompt = "<bos> বাংলাদেশের জাতীয় ফল হলো "  # Bangla for: "The national fruit of Bangladesh is "

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Generate text
# You can adjust parameters like max_new_tokens, num_beams, temperature, top_k, top_p
output_ids = model.generate(
    input_ids,
    max_new_tokens=50,                    # Generate up to 50 new tokens
    num_return_sequences=1,
    do_sample=True,                       # Enable sampling for more diverse outputs
    top_k=50,                             # Sample from the 50 most probable tokens
    top_p=0.95,                           # Nucleus sampling: smallest token set with cumulative probability >= 0.95
    temperature=0.7,                      # Controls randomness: lower means less random
    pad_token_id=tokenizer.pad_token_id,  # Use the pad token ID
    eos_token_id=tokenizer.eos_token_id   # Stop generation at the EOS token
)

# Decode the generated text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)

print("\nGenerated Text:")
print(generated_text)
```
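If the model does emit `<think>...</think>` spans at inference time, they can be stripped before displaying the output. This optional post-processing step is an addition for illustration, not part of the original card.

```python
import re

# Remove any thinking blocks from the decoded output before showing it to users.
clean_text = re.sub(r"<think>.*?</think>", "", generated_text, flags=re.DOTALL).strip()
print(clean_text)
```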
## Future Work and Next Steps

This project provides a foundational decoder-only Transformer model and a custom Bangla BPE tokenizer, trained according to the 'Sheikh-ABF Final Training Plan'. To further enhance its capabilities and utility, the following next steps and future work are suggested:

1. **Dataset Expansion and Diversification**: The current training corpus is a small placeholder. Expanding the dataset significantly with more diverse and high-quality Bangla text, covering various domains (e.g., news, literature, technical, social media), will greatly improve the model's fluency, coherence, and knowledge.

2. **Advanced Benchmarking**: Conduct comprehensive benchmarking against existing state-of-the-art Bangla NLP models across a suite of downstream tasks, such as text summarization, question answering, sentiment analysis, and machine translation. This will provide a clearer understanding of the model's strengths and weaknesses.

3. **Fine-tuning for Specific Tasks**: Fine-tune the base model on task-specific datasets to adapt it for specialized applications. For instance, fine-tuning on a Bangla chatbot dataset for conversational AI, or on a legal document corpus for legal NLP tasks.

4. **Experiment with Loss Weighting**: Further experimentation with the loss weighting strategy for `<think>` tokens is crucial. Different weighting schemes and dynamic adjustment based on training progress could lead to more effective learning of reasoning patterns.

5. **Model Optimization and Scaling**: Explore techniques for model optimization, such as knowledge distillation or quantization, to deploy the model more efficiently on resource-constrained devices. Consider scaling the model up (more layers, larger hidden size) with a larger dataset for improved performance, if computational resources allow.

6. **Integrate More Special Tokens/Structures**: Depending on specific use cases, introduce and train for additional special tokens or structural markers to guide model behavior, similar to the `<think>` tags.

7. **Human Evaluation**: Beyond automated metrics, conduct human evaluations to assess the quality of generated text, particularly focusing on the coherence and correctness of reasoning responses when `<think>` tokens are involved.