Add model card README.md

---
library_name: transformers
tags:
- Bangla
- nlp
- decoder-only
- causal-lm
- lora
- code-generation
- agentic-ai
- from-scratch
metrics:
- pass@k
- task-completion-rate
model_name: Sheikh-ABF
language: bn
license: mit
---

# Sheikh-ABF: Sheikh Artificial Bangla Foundation

## Model Description

**Sheikh-ABF** is a **decoder-only Transformer language model for Bangla NLP**, developed entirely **from scratch**. The project takes a **Bangla-first approach**, focusing on the linguistic and cultural characteristics of the Bengali language. The name 'Sheikh' reflects its origin as a model developed in **Bangladesh**, intended to provide a foundational LLM for the region.

### Goal

The primary objective is to create a robust base language model capable of advanced **internal reasoning**, moving beyond simple pattern matching to understand and process information more deeply. This base model serves as a strong foundation for future fine-tuning and specialized applications.

### Core Principles

* **No Pre-trained Weights**: Trained entirely from scratch, ensuring a truly native Bangla foundation.
* **Bangla-First Approach**: Optimized for Bangla, addressing its specific linguistic nuances.
* **Internal Reasoning**: Designed to learn explicit 'thought processes' during training via interleaved thinking.
* **Base Model Only**: Focused on providing a general-purpose foundation, not end-use applications.

## Model Architecture

The model is a **decoder-only Transformer**, styled after **GPT-2**, with approximately **60 million parameters**. The main hyperparameters are listed below (a configuration sketch follows the list):

* **Layers**: 8
* **Hidden Size (Embedding Dimension)**: 512
* **Attention Heads**: 8
* **Context Length (Maximum Sequence Length)**: 1024 tokens
* **Dropout Rate**: 0.1 (applied to residual connections, embeddings, and attention probabilities)
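
The snippet below is a minimal sketch of how these hyperparameters could be expressed with a GPT-2-style `GPT2Config` from the `transformers` library. This is an assumption for illustration only; the actual implementation used to build Sheikh-ABF may be a custom module rather than this class.

```python
# Hypothetical sketch: mapping the listed hyperparameters onto a GPT-2-style
# configuration. Values mirror the list above; the real training code may differ.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32_000,   # matches the SentencePiece BPE vocabulary size
    n_positions=1024,    # context length
    n_embd=512,          # hidden size / embedding dimension
    n_layer=8,           # decoder blocks
    n_head=8,            # attention heads
    resid_pdrop=0.1,     # dropout on residual connections
    embd_pdrop=0.1,      # dropout on embeddings
    attn_pdrop=0.1,      # dropout on attention probabilities
)

model = GPT2LMHeadModel(config)
print(f"Parameters: {model.num_parameters() / 1e6:.1f}M")  # on the order of tens of millions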

## Tokenizer Details

The tokenizer is a **SentencePiece BPE (Byte-Pair Encoding) tokenizer**, trained exclusively on a **Bangla-only corpus**. It has a **vocabulary size of 32,000** tokens and includes several mandatory special tokens:

* `<bos>`: Beginning of Sentence
* `<eos>`: End of Sentence
* `<pad>`: Padding token
* `<think>`: Start Thinking (for internal reasoning blocks during training)
* `</think>`: End Thinking

These tokens are used consistently for parsing, context handling, and advanced training strategies such as loss masking.
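
As a rough illustration, a SentencePiece BPE tokenizer with these special tokens could be trained as sketched below. The corpus path, `pad_id` assignment, and exact training flags are assumptions, not the published training recipe.

```python
# Hypothetical sketch: training a SentencePiece BPE tokenizer with the special
# tokens listed above. "bangla_corpus.txt" is a placeholder path.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="bangla_corpus.txt",      # Bangla-only text, one sentence per line
    model_prefix="sheikh_abf_bpe",
    model_type="bpe",
    vocab_size=32_000,
    bos_piece="<bos>",
    eos_piece="<eos>",
    pad_piece="<pad>",
    pad_id=3,                        # enable the pad token (disabled by default)
    user_defined_symbols=["<think>", "</think>"],
)
```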

## Dataset and Mixing Ratios

The training corpus is a blend of three distinct dataset types (a sampling sketch follows the list):

* **70% Raw Bangla Text**: For foundational language modeling and fluency.
* **20% Instruction/QA**: For improving instruction-following and question-answering capabilities.
* **10% Reasoning**: Incorporates interleaved thinking (`<think>...</think>`) patterns to foster internal reasoning processes.
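
One way to realize this blend is ratio-based sampling with the `datasets` library, as sketched below. The file names are placeholders and the use of `interleave_datasets` is an assumption; the original pipeline may mix data differently.

```python
# Hypothetical sketch: sampling from the three sources at the stated ratios.
from datasets import load_dataset, interleave_datasets

raw_text = load_dataset("json", data_files="raw_bangla.jsonl", split="train")
instruct = load_dataset("json", data_files="bangla_instruct_qa.jsonl", split="train")
reasoning = load_dataset("json", data_files="bangla_reasoning_think.jsonl", split="train")

mixed = interleave_datasets(
    [raw_text, instruct, reasoning],
    probabilities=[0.7, 0.2, 0.1],   # 70% raw text, 20% instruction/QA, 10% reasoning
    seed=42,
)
```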

## Interleaved Thinking and Loss Weighting

**Interleaved Thinking** is a core training strategy in which explicit 'thought processes' (`<think>...</think>`) are included in the training data to teach the model logical reasoning. During inference, the model is expected to internalize this reasoning and produce direct answers without generating the `<think>` blocks.

To facilitate this, a **differential loss weighting strategy** is applied (a weighting sketch follows the list):

* **Normal Tokens**: Loss weight of 1.0 (emphasizing accurate generation of primary content).
* **`<think>` Tokens**: Loss weight of 0.3 (encouraging internalization of reasoning logic without over-prioritizing explicit generation).
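
The function below is a minimal sketch of how such per-token weighting could be implemented in a PyTorch training loop. It only illustrates the 1.0 / 0.3 scheme described above; the actual Sheikh-ABF training loop is not published in this card, and the helper name and `think_mask` input are assumptions.

```python
# Hypothetical sketch: cross-entropy with down-weighted loss on <think> spans.
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits, labels, think_mask, think_weight=0.3):
    # Standard causal-LM shift: position t predicts token t+1.
    shift_logits = logits[:, :-1, :].reshape(-1, logits.size(-1))
    shift_labels = labels[:, 1:].reshape(-1)
    shift_think = think_mask[:, 1:].reshape(-1)

    # Per-token cross-entropy; ignored positions (-100) contribute zero loss.
    loss = F.cross_entropy(shift_logits, shift_labels,
                           reduction="none", ignore_index=-100)

    # Weight 1.0 for normal tokens, 0.3 for tokens inside <think>...</think>.
    weights = torch.ones_like(loss)
    weights[shift_think] = think_weight

    valid = (shift_labels != -100).float()
    return (loss * weights * valid).sum() / (weights * valid).sum().clamp(min=1.0)
```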

## Training Configuration

The base model was trained with efficiency and resource optimization in mind (a `Trainer` arguments sketch follows the list):

* **FP16 (Mixed Precision)**: Reduces memory use and speeds up computation.
* **Gradient Checkpointing**: Further reduces the memory footprint.
* **Gradient Accumulation Steps**: 8 (effective batch size of 16, with a micro-batch size of 2).
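
Expressed with the `transformers` `TrainingArguments` API, this setup could look like the sketch below. The output path is a placeholder and the use of the `Trainer` stack is an assumption; the original training script is not included in this card.

```python
# Hypothetical sketch: the training setup above as TrainingArguments.
from transformers import TrainingArguments

base_training_args = TrainingArguments(
    output_dir="sheikh-abf-base",
    fp16=True,                       # mixed-precision training
    gradient_checkpointing=True,     # trade compute for lower memory use
    per_device_train_batch_size=2,   # micro-batch size
    gradient_accumulation_steps=8,   # 2 x 8 = effective batch size of 16
)
```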

## LoRA Fine-Tuning for Coding and Agentic Workflows

This model has been conceptually prepared for **LoRA (Low-Rank Adaptation) fine-tuning**, specifically targeting **coding tasks and agentic workflows**. LoRA enables efficient adaptation by training only a small fraction of parameters while keeping the base model frozen.

### LoRA Strategy

The planned adapter settings are listed below (a `peft` configuration sketch follows the list):

* **Target Modules**: `c_attn` (the combined query, key, and value projection in the attention mechanism).
* **Rank (`r`)**: 8
* **Scaling Coefficient (`lora_alpha`)**: 16
* **Dropout (`lora_dropout`)**: 0.05
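
With the `peft` library, these settings translate into a configuration like the one below. This assumes a GPT-2-style module layout where the attention projection is named `c_attn`; adjust `target_modules` if the base model uses different names.

```python
# Hypothetical sketch: the LoRA settings above expressed with peft.
from peft import LoraConfig, get_peft_model, TaskType

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["c_attn"],
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
)

peft_model = get_peft_model(model, lora_config)  # `model` is the loaded base model
peft_model.print_trainable_parameters()          # only the adapter weights are trainable
```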

### Adapter Training Configuration (Conceptual)

The planned adapter training hyperparameters (sketched in code after the list):

* **Learning Rate**: `5e-4` (0.0005)
* **Epochs**: 5 (initial)
* **Effective Batch Size**: 16 (micro-batch of 2, 8 gradient accumulation steps)
* **Scheduler**: Linear warmup (10%) followed by linear decay.
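
As with the base training, these values can be written as `TrainingArguments`; the sketch below is illustrative only and the output path is a placeholder.

```python
# Hypothetical sketch: adapter training arguments matching the list above.
from transformers import TrainingArguments

adapter_training_args = TrainingArguments(
    output_dir="sheikh-abf-lora-adapter",
    learning_rate=5e-4,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    warmup_ratio=0.1,                # 10% linear warmup
    lr_scheduler_type="linear",      # linear decay after warmup
    fp16=True,
)
```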

## Evaluation Benchmarks (Conceptual)

To assess the LoRA fine-tuned model's performance on specialized tasks, the following hypothetical benchmarks were considered:

### Coding Tasks

* **Benchmarks**: HumanEval-like (Bangla adaptation), LeetCode-style (simplified Bangla), Code Correction/Refactoring.
* **Metrics**: Functional Correctness (Pass@k; an estimator sketch follows this list), Adherence to Problem Constraints, Code Generation Quality, Safety/Security.
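
For concreteness, the standard unbiased pass@k estimator (as popularized by the HumanEval benchmark) is shown below; it is included only to make the metric precise, not as a claim about how evaluation was run here.

```python
# pass@k: probability that at least one of k sampled completions passes the tests.
# n = samples generated per problem, c = samples that pass.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 passing, estimate pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 4))
```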

### Agentic Workflows

* **Benchmarks**: Simulated Environment Tasks, Tool-Use Scenarios, Multi-step Reasoning Chains.
* **Metrics**: Task Completion Rate, Efficiency of Steps Taken, Correct Use of Tools, Adherence to User Intent, Robustness to Ambiguity.

## Conceptual Benchmark Results

Below are hypothetical performance metrics for the LoRA fine-tuned model on coding and agentic tasks. These illustrate the expected types of evaluation results.

### Coding Task Metrics



### Agentic Task Metrics



## Usage Instructions

To load and use the fine-tuned Bangla decoder-only Transformer model and its tokenizer from the Hugging Face Hub, you can use the `transformers` library.

### Loading the Model and Tokenizer

First, ensure you have the `transformers` and `torch` libraries installed. Then load the model and tokenizer using their `from_pretrained` methods, specifying the `repo_id`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Define the repository ID on the Hugging Face Hub
repo_id = "likhonsheikh/bangla-decoder-only-transformer"

# Load the model
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Ensure the model is in evaluation mode and on the correct device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device).eval()
print(f"Model loaded from {repo_id} and moved to {device}.")

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(repo_id)
print(f"Tokenizer loaded from {repo_id}.")

# Set the pad token if it is not already set (important for generation)
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '<pad>'})
    # Resize model embeddings because a new token was added
    model.resize_token_embeddings(len(tokenizer))
```

### Performing Text Generation

Once the model and tokenizer are loaded, use the `model.generate()` method to create new text. Prepend the `<bos>` (beginning of sentence) token to your prompt to signal the start of generation, matching how the model was trained. Because the loss on `<think>` tokens was down-weighted (0.3) during training, the model is encouraged to internalize reasoning rather than spell it out; at inference time it typically produces direct answers, and any `<think>` block it does emit is expected to be short or empty.

```python
# Example prompt for text generation
prompt = "<bos> বাংলাদেশের জাতীয় ফল হলো "  # Bangla for: "The national fruit of Bangladesh is "

# Encode the prompt
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)

# Generate text
# You can adjust parameters such as max_new_tokens, num_beams, temperature, top_k, and top_p
output_ids = model.generate(
    input_ids,
    max_new_tokens=50,                    # Generate up to 50 new tokens
    num_return_sequences=1,
    do_sample=True,                       # Enable sampling for more diverse outputs
    top_k=50,                             # Sample from the 50 most probable tokens
    top_p=0.95,                           # Nucleus sampling with a 95% cumulative-probability cutoff
    temperature=0.7,                      # Controls randomness: lower means less random
    pad_token_id=tokenizer.pad_token_id,  # Use the pad token ID
    eos_token_id=tokenizer.eos_token_id,  # Stop generation at the EOS token
)

# Decode the generated text
generated_text = tokenizer.decode(output_ids[0], skip_special_tokens=False)

print("\nGenerated Text:")
print(generated_text)
```

## Future Work and Next Steps

This project provides a foundational decoder-only Transformer model and a custom Bangla BPE tokenizer, trained according to the 'Sheikh-ABF Final Training Plan'. To further enhance its capabilities and utility, the following next steps are suggested:

1. **Dataset Expansion and Diversification**: The current training corpus is a small placeholder. Expanding it significantly with more diverse, high-quality Bangla text across domains (e.g., news, literature, technical writing, social media) will greatly improve the model's fluency, coherence, and knowledge.

2. **Advanced Benchmarking**: Conduct comprehensive benchmarking against existing state-of-the-art Bangla NLP models across a suite of downstream tasks, such as text summarization, question answering, sentiment analysis, and machine translation, to clarify the model's strengths and weaknesses.

3. **Fine-tuning for Specific Tasks**: Fine-tune the base model on task-specific datasets for specialized applications, for instance a Bangla chatbot dataset for conversational AI or a legal document corpus for legal NLP.

4. **Experiment with Loss Weighting**: Further experimentation with the loss weighting strategy for `<think>` tokens is crucial. Different weighting schemes, and dynamic adjustment based on training progress, could lead to more effective learning of reasoning patterns.

5. **Model Optimization and Scaling**: Explore techniques such as knowledge distillation or quantization to deploy the model more efficiently on resource-constrained devices. If computational resources allow, consider scaling the model up (more layers, a larger hidden size) together with a larger dataset for improved performance.

6. **Integrate More Special Tokens/Structures**: Depending on specific use cases, introduce and train additional special tokens or structural markers to guide model behavior, similar to the `<think>` tags.

7. **Human Evaluation**: Beyond automated metrics, conduct human evaluations to assess the quality of generated text, particularly the coherence and correctness of reasoning when `<think>` tokens are involved.