---
library_name: transformers
license: llama3.1
base_model: meta-llama/Llama-3.1-8B
tags:
- generated_from_trainer
model-index:
- name: llama31-8b-halo-bikol-cpt
results: []
datasets:
- sapinsapin/halo-bikol
---
# llama31-8b-halo-bikol-cpt
This model is a continuously pretrained (CPT) version of [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) trained on the [sapinsapin/halo-bikol](https://huggingface.co/sapinsapin/halo-bikol) dataset, a web-scraped Bikol-language corpus.
## Model description
A Llama 3.1 8B base model adapted to the Bikol language through continued pretraining (causal language modeling) on a web-scraped Bikol corpus.
## Intended uses & limitations
Suitable for raw Bikol text generation, data augmentation, and language-modeling research. The model is not instruction-tuned: it does not reliably follow instructions or answer questions in a structured way, may generate uncontrolled or irrelevant content, and has no safety guardrails or preference alignment. It is therefore not ready for chatbots, Q&A systems, or production assistants.
## Training and evaluation data
Trained on [sapinsapin/halo-bikol](https://huggingface.co/sapinsapin/halo-bikol), a web-scraped Bikol-language corpus.
## Training procedure
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- distributed_type: multi-GPU
- num_devices: 16
- gradient_accumulation_steps: 8
- total_train_batch_size: 128
- total_eval_batch_size: 128
- optimizer: adamw_torch_fused with betas=(0.9, 0.95) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.03
- training_steps: 1500
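The hyperparameters above can be collected into a plain configuration dict (a sketch; key names follow `transformers.TrainingArguments` conventions, and the total train batch size is the product of per-device batch size, gradient accumulation steps, and device count):

```python
# Sketch of the training configuration above as a plain dict.
# Key names mirror transformers.TrainingArguments; values are copied from the card.
training_config = {
    "learning_rate": 5e-5,
    "per_device_train_batch_size": 1,
    "per_device_eval_batch_size": 8,
    "seed": 42,
    "gradient_accumulation_steps": 8,
    "num_devices": 16,  # multi-GPU setup (not itself a TrainingArguments field)
    "optim": "adamw_torch_fused",
    "adam_beta1": 0.9,
    "adam_beta2": 0.95,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "cosine",
    "warmup_ratio": 0.03,
    "max_steps": 1500,
}

# Total train batch size = per-device batch * grad accumulation * devices.
effective_batch = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
    * training_config["num_devices"]
)
print(effective_batch)  # 128, matching total_train_batch_size above
```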
### Training results
Example inference:
```
Sending request to http://localhost:8000/v1/completions...
Payload: {
"model": "sapinsapin/llama31-8b-halo-bikol-cpt",
"prompt": "nasa hampang mismo kan dating planta",
"max_tokens": 512,
"temperature": 0.5
}
Status: 200
Response:
{
"id": "cmpl-5b4eb44ef68f48f98b6e950250a40725",
"object": "text_completion",
"created": 1766879142,
"model": "sapinsapin/llama31-8b-halo-bikol-cpt",
"choices": [
{
"index": 0,
"text": " kan lamang | Magbikol Kita Saturday, December 27, , 4:14am nasa hampang mismo kan dating planta kan lamang Published on Sunday, September 13, , 6:49pm by Ernie Verdadero | Midya-Midya An sabi sako kan sarong taga-planta kan lamang, kan makatapos sya nin high school, nagpuon syang magtrabaho sa dating planta kan ilaw sa Tabaco. Planta an sabi nya kan nagkakapirang mga kontrabanderong ginatos na taon na nagtalubo duman. Alas-10 nin aga si baad na pagtanom kan mga bulong. Pigbabantayan nin mga tawong armas an mga makamurumundong bulong tanganing maibitaran an pagluwas nin mga lamang. Sarong kanto kan planta an tinrabaho nya. Duman nagpuon an saiyang pagbibyahe. Primero nang nahiling nya an malain na estado kan tinampo sa Tabaco. Saro sa mga gibo nya na nakatudan nya iyo an pagtanom nin mga kahoy. Nin huli ta haloy nang pan\u00f4 nin mga bulong an planta, dakol na an mga kahoy na nagtatalubo duman. Sabi nya, maski malain an buhay mo, agom mo, kaibahan mo, sa pagtanom ka san\u00e1. Maski magretiro na sya, dai nya mapugulan an pagtanom. Sa harong nya an sarong establisimyento duman sa Tabaco na nagtatanom nin mga orchids. Sabi ko saiya, ikagwapo iyan. Syempre, sabi nya, carampatan man nanggad. Kun ano an bu\u00f3t sabihon kan carampatan, yano man an bu\u00f3t sabihon kan gwapo. Gabos kita mab\u00faot na tawo. Siisay an dai mab\u00faot? Sabi ko saiya, an tawong dai mab\u00faot iyo an tawong dai nakukuntento sa saiyang kapalibutan. Aram an tawo na dai nya mapugulan an paghanap nin paagi na makakatabang saiya tanganing mas",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"token_ids": null,
"prompt_logprobs": null,
"prompt_token_ids": null
}
],
"service_tier": null,
"system_fingerprint": null,
"usage": {
"prompt_tokens": 10,
"total_tokens": 522,
"completion_tokens": 512,
"prompt_tokens_details": null
},
"kv_transfer_params": null
}
```
### Framework versions
- Transformers 4.57.3
- Pytorch 2.9.0a0+145a3a7bda.nv25.10
- Datasets 4.4.2
- Tokenizers 0.22.1
# LLM Completions API Test Example
## Overview
This test validates that **continuous pretraining (CPT)** successfully adapted Llama 3.1 8B to the Bikol language. The completions endpoint test serves as a sense-check to confirm the model has learned the new language's vocabulary, grammar, and patterns.
### What is Continuous Pretraining?
Continuous pretraining extends a foundation model's knowledge by training on domain-specific or language-specific corpora using the same next-token prediction objective as initial pretraining. For this model:
- **Base Model**: `meta-llama/Llama-3.1-8B` (primarily English)
- **Training Data**: `sapinsapin/halo-bikol` (web-scraped Bikol language corpus)
- **Objective**: Causal language modeling (predict next token)
- **Result**: Model learns Bikol vocabulary, syntax, and cultural context while retaining general capabilities
CPT teaches the model to "speak" Bikol fluently by exposing it to thousands of Bikol text examples during training.
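The next-token objective can be illustrated with a minimal label-shifting sketch (the token IDs below are made up for illustration; in practice the shift happens on tensors inside the model's loss computation):

```python
# Minimal illustration of causal language modeling: each position's target
# is the next token in the sequence, i.e. labels are inputs shifted by one.
input_ids = [101, 7, 42, 9, 250]  # hypothetical token IDs for a Bikol sentence

# At step t the model conditions on input_ids[:t+1] and is trained
# to predict input_ids[t+1].
contexts = [input_ids[: t + 1] for t in range(len(input_ids) - 1)]
targets = input_ids[1:]

for ctx, tgt in zip(contexts, targets):
    print(f"context={ctx} -> target={tgt}")
```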
### Next Steps: Post-Training
After validating CPT success, the next phase is **post-training** to make the model useful for real applications:
1. **Supervised Fine-Tuning (SFT)**
- Train on instruction-response pairs in Bikol
- Dataset format: `{"instruction": "Ano ang...", "response": "..."}`
- Teaches the model to follow instructions and answer questions
- Example: Question answering, summarization, translation tasks
2. **Preference Alignment (RLHF/DPO)**
- Align model outputs with human preferences
- Use Direct Preference Optimization (DPO) or RLHF
- Dataset: Preferred vs rejected response pairs
- Improves helpfulness, safety, and cultural appropriateness
3. **Task-Specific Fine-Tuning**
- Specialize for specific use cases (e.g., customer support, education)
- Use LoRA/QLoRA for parameter-efficient adaptation
- Smaller datasets (hundreds to thousands of examples)
**Current Stage**: ✅ CPT Complete → Testing raw language generation
**Next Stage**: → SFT → Instruction-following capabilities
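As a sketch of the SFT data format described in step 1 above, an instruction-response pair might be rendered into a single training string like this (the template and role markers are illustrative, not a format this project has committed to):

```python
# Hypothetical SFT preprocessing: render an instruction/response pair into
# one training string. The template below is illustrative only.
example = {
    "instruction": "Ano ang ...",  # a Bikol question (elided in the card)
    "response": "...",
}

def format_sft(ex: dict) -> str:
    """Join instruction and response with simple role markers."""
    return (
        f"### Instruction:\n{ex['instruction']}\n\n"
        f"### Response:\n{ex['response']}"
    )

text = format_sft(example)
print(text)
```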
## Request
### cURL Command
```bash
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "sapinsapin/llama31-8b-halo-bikol-cpt",
"prompt": "nasa hampang mismo kan dating planta",
"max_tokens": 512,
"temperature": 0.5
}'
```
### Python Script
```python
#!/usr/bin/env python3
import json

import requests


def test_completions(
    base_url: str = "http://localhost:8000",
    model: str = "sapinsapin/llama31-8b-halo-bikol-cpt",
    prompt: str = "nasa hampang mismo kan dating planta",
    max_tokens: int = 512,
    temperature: float = 0.5,
):
    """Test the OpenAI-compatible completions endpoint."""
    url = f"{base_url}/v1/completions"
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    headers = {"Content-Type": "application/json"}
    response = requests.post(url, headers=headers, json=payload)
    print(f"Status: {response.status_code}")
    print(f"Response:\n{json.dumps(response.json(), indent=2)}")
    return response.json()


if __name__ == "__main__":
    test_completions()
```
## Response
### Status
✅ **200 OK**
### Metadata
- **Model**: `sapinsapin/llama31-8b-halo-bikol-cpt`
- **Request ID**: `cmpl-5b4eb44ef68f48f98b6e950250a40725`
- **Finish Reason**: `length` (max_tokens reached)
### Token Usage
| Metric | Count |
|--------|-------|
| Prompt Tokens | 10 |
| Completion Tokens | 512 |
| Total Tokens | 522 |
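The usage numbers obey prompt + completion = total, which a client can sanity-check; a `finish_reason` of `length` means the 512-token budget was exhausted, so the text is truncated rather than naturally terminated:

```python
# Sanity-check the token accounting returned by the endpoint
# (values copied from the response above).
usage = {"prompt_tokens": 10, "completion_tokens": 512, "total_tokens": 522}
finish_reason = "length"

assert usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]

# finish_reason == "length" means generation stopped at max_tokens,
# not at a natural end-of-text.
truncated = finish_reason == "length"
print(truncated)  # True
```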
### Generated Text (Bikol)
```
kan lamang | Magbikol Kita Saturday, December 27, , 4:14am nasa hampang mismo kan dating planta kan lamang Published on Sunday, September 13, , 6:49pm by Ernie Verdadero | Midya-Midya An sabi sako kan sarong taga-planta kan lamang, kan makatapos sya nin high school, nagpuon syang magtrabaho sa dating planta kan ilaw sa Tabaco. Planta an sabi nya kan nagkakapirang mga kontrabanderong ginatos na taon na nagtalubo duman. Alas-10 nin aga si baad na pagtanom kan mga bulong. Pigbabantayan nin mga tawong armas an mga makamurumundong bulong tanganing maibitaran an pagluwas nin mga lamang. Sarong kanto kan planta an tinrabaho nya. Duman nagpuon an saiyang pagbibyahe. Primero nang nahiling nya an malain na estado kan tinampo sa Tabaco. Saro sa mga gibo nya na nakatudan nya iyo an pagtanom nin mga kahoy. Nin huli ta haloy nang panô nin mga bulong an planta, dakol na an mga kahoy na nagtatalubo duman. Sabi nya, maski malain an buhay mo, agom mo, kaibahan mo, sa pagtanom ka sanâ. Maski magretiro na sya, dai nya mapugulan an pagtanom. Sa harong nya an sarong establisimyento duman sa Tabaco na nagtatanom nin mga orchids. Sabi ko saiya, ikagwapo iyan. Syempre, sabi nya, carampatan man nanggad. Kun ano an buót sabihon kan carampatan, yano man an buót sabihon kan gwapo. Gabos kita mabúot na tawo. Siisay an dai mabúot? Sabi ko saiya, an tawong dai mabúot iyo an tawong dai nakukuntento sa saiyang kapalibutan. Aram an tawo na dai nya mapugulan an paghanap nin paagi na makakatabang saiya tanganing mas
```
### Full JSON Response
```json
{
"id": "cmpl-5b4eb44ef68f48f98b6e950250a40725",
"object": "text_completion",
"created": 1766879142,
"model": "sapinsapin/llama31-8b-halo-bikol-cpt",
"choices": [
{
"index": 0,
"text": "kan lamang | Magbikol Kita Saturday, December 27, , 4:14am nasa hampang mismo kan dating planta kan lamang Published on Sunday, September 13, , 6:49pm by Ernie Verdadero | Midya-Midya An sabi sako kan sarong taga-planta kan lamang, kan makatapos sya nin high school, nagpuon syang magtrabaho sa dating planta kan ilaw sa Tabaco. Planta an sabi nya kan nagkakapirang mga kontrabanderong ginatos na taon na nagtalubo duman. Alas-10 nin aga si baad na pagtanom kan mga bulong. Pigbabantayan nin mga tawong armas an mga makamurumundong bulong tanganing maibitaran an pagluwas nin mga lamang. Sarong kanto kan planta an tinrabaho nya. Duman nagpuon an saiyang pagbibyahe. Primero nang nahiling nya an malain na estado kan tinampo sa Tabaco. Saro sa mga gibo nya na nakatudan nya iyo an pagtanom nin mga kahoy. Nin huli ta haloy nang panô nin mga bulong an planta, dakol na an mga kahoy na nagtatalubo duman. Sabi nya, maski malain an buhay mo, agom mo, kaibahan mo, sa pagtanom ka sanâ. Maski magretiro na sya, dai nya mapugulan an pagtanom. Sa harong nya an sarong establisimyento duman sa Tabaco na nagtatanom nin mga orchids. Sabi ko saiya, ikagwapo iyan. Syempre, sabi nya, carampatan man nanggad. Kun ano an buót sabihon kan carampatan, yano man an buót sabihon kan gwapo. Gabos kita mabúot na tawo. Siisay an dai mabúot? Sabi ko saiya, an tawong dai mabúot iyo an tawong dai nakukuntento sa saiyang kapalibutan. Aram an tawo na dai nya mapugulan an paghanap nin paagi na makakatabang saiya tanganing mas",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null
}
],
"usage": {
"prompt_tokens": 10,
"total_tokens": 522,
"completion_tokens": 512
}
}
```
## Analysis
### CPT Validation Results
✅ **Language Adaptation Confirmed**
The model successfully generated coherent Bikol text about:
- A person working at a plant facility in Tabaco
- Their experience after high school working at a power plant
- Planting trees and maintaining orchids
- Philosophical reflections on contentment and good character
**Key Indicators of Successful CPT:**
- Natural Bikol grammar and sentence structure
- Proper use of Bikol-specific particles ("kan", "nin", "sa", "an")
- Cultural context (Tabaco location, local occupations)
- Coherent narrative flow without code-switching to English
- Vocabulary diversity ("nagtatalubo", "makamurumundong", "establisimyento")
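One crude, automatable version of the particle check above is to count how often the listed Bikol function words appear in generated text (a heuristic sketch, not a real language identifier; the threshold and word set are illustrative):

```python
import re

# Heuristic sketch: the fraction of words that are known Bikol particles
# gives a rough signal of language adaptation. Illustrative only.
BIKOL_PARTICLES = {"kan", "nin", "sa", "an"}

def particle_rate(text: str) -> float:
    """Fraction of alphabetic words that are known Bikol particles."""
    words = re.findall(r"[a-zà-ÿ]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in BIKOL_PARTICLES)
    return hits / len(words)

# A sentence taken from the generated text above.
sample = "Sarong kanto kan planta an tinrabaho nya sa Tabaco."
rate = particle_rate(sample)
print(round(rate, 2))  # 0.33 (3 particles out of 9 words)
```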
### Limitations of CPT-Only Model
This model generates fluent Bikol text but:
- ❌ Does not follow instructions (no instruction tuning yet)
- ❌ Cannot answer questions in a structured way
- ❌ May generate uncontrolled or irrelevant content
- ❌ Lacks safety guardrails and preference alignment
**Use Case**: Raw text generation, data augmentation, language modeling research
**Not Ready For**: Chatbots, Q&A systems, production assistants (requires SFT + alignment)