---
library_name: transformers
datasets:
- culturalheritagenus/rumi-correction-v1.1-data-v3
language:
- en
- ms
metrics:
- bleu
base_model:
- aisingapore/Gemma-SEA-LION-v3-9B-IT
---

# Model Card for culturalheritagenus/rumi-correction-v1.1

A QLoRA fine-tune of aisingapore/Gemma-SEA-LION-v3-9B-IT that rewrites messy (shortened or mistyped) Malay Rumi text in correct Rumi spelling.

## Model Details

### Model Description

This model was fine-tuned with QLoRA, with the LoRA rank `r` and `lora_alpha` both set to 4.

- **Developed by:** hyhyhyhyyhyh
- **Model type:** Gemma 2 9B
- **Language(s) (NLP):** Malay, English
- **License:** [More Information Needed]
- **Finetuned from model:** aisingapore/Gemma-SEA-LION-v3-9B-IT

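To make the `r = lora_alpha = 4` setting concrete, here is a minimal sketch of what those two numbers imply, assuming the standard (not rank-stabilized) LoRA formulation in which the low-rank update is scaled by `lora_alpha / r`. The layer shape used below is illustrative only and is not taken from the training script; the helper names are hypothetical.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_in x d_out linear layer.

    The update is B @ A, with A of shape (r, d_in) and B of shape (d_out, r).
    """
    return r * (d_in + d_out)


R = 4           # from the model card
LORA_ALPHA = 4  # from the model card

# Illustrative layer shape (assumption, not read from the training script).
D_IN, D_OUT = 3584, 4096

scaling = lora_scaling(R, LORA_ALPHA)
extra_params = lora_param_count(D_IN, D_OUT, R)
print(f"scaling = {scaling}, extra trainable params per adapted layer = {extra_params}")
```

With `r` equal to `lora_alpha`, the scaling factor is 1, so the adapter output is added to the frozen layer's output unscaled.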
### Model Sources [optional]

- **Repository:** [More Information Needed]
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## How to Get Started with the Model

Use the code below to load the model and tokenizer:

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

# Load the fine-tuned model in bfloat16 and let device_map="auto" place it on the available device(s).
trained_model = AutoModelForCausalLM.from_pretrained(
    "culturalheritagenus/rumi-correction-v1.1",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
trained_tokenizer = AutoTokenizer.from_pretrained("culturalheritagenus/rumi-correction-v1.1")
```

To perform inference:

```
messages = [
    {"role": "user", "content": "You are a Malay language spelling corrector. I will give you some text written in messy Rumi (shortened or mistyped). Rewrite it in correct Malay Rumi spelling.\naurng ank. yngdim dimn anm aurngdan"},
]

# Apply the chat template and move the token IDs to the model's device.
inputs = trained_tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,  # Must add for generation
    return_tensors="pt",
).to(trained_model.device)

# Stream the corrected text to stdout as it is generated.
text_streamer = TextStreamer(trained_tokenizer)
_ = trained_model.generate(input_ids=inputs, streamer=text_streamer, max_new_tokens=128, use_cache=True)
```
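
The prompt in the inference example is a fixed instruction followed by a newline and the text to correct. A small helper can build that message for arbitrary input. This is a sketch only: `INSTRUCTION` copies the wording from the example above, while `build_messages` is a hypothetical name and not part of the released code.

```python
# Instruction text copied from the inference example in this model card.
INSTRUCTION = (
    "You are a Malay language spelling corrector. "
    "I will give you some text written in messy Rumi (shortened or mistyped). "
    "Rewrite it in correct Malay Rumi spelling."
)


def build_messages(messy_text: str) -> list[dict[str, str]]:
    """Return a single-turn chat message: instruction, newline, then the input text."""
    return [{"role": "user", "content": f"{INSTRUCTION}\n{messy_text.strip()}"}]


messages = build_messages("aurng ank. yngdim dimn anm aurngdan")
print(messages[0]["content"])
```

The returned list can be passed to `apply_chat_template` in place of the hand-written `messages` list.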

## Training Details

### Training Data

The model was trained on [culturalheritagenus/rumi-correction-v1.1-data-v3](https://huggingface.co/datasets/culturalheritagenus/rumi-correction-v1.1-data-v3).

### Training Procedure

To replicate this model, refer to the provided training script and the details below. Make sure your Python, CUDA, and library versions match those listed under Technical Specifications.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** 1x GH200 (96 GB)
- **Hours used:** ~12
- **Cloud Provider:** Lambda
- **Compute Region:** US-East (Lambda Labs)

## Technical Specifications

### Software

- Python version: 3.10.12
- CUDA version: 12.8
- Torch version: 2.7.1+cu128

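Since replication depends on matching these versions, a quick check before training can catch mismatches early. The sketch below compares the local environment against the versions listed above; `find_mismatches` and `installed_versions` are hypothetical helper names, and the Torch and CUDA entries are only collected if `torch` is importable.

```python
import platform

# Versions listed in this model card.
EXPECTED = {"python": "3.10.12", "torch": "2.7.1+cu128", "cuda": "12.8"}


def find_mismatches(expected: dict, actual: dict) -> dict:
    """Map each mismatching component to a (wanted, found) pair; found is None if absent."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


def installed_versions() -> dict:
    """Collect local versions; Torch and CUDA are skipped when torch is not installed."""
    versions = {"python": platform.python_version()}
    try:
        import torch
        versions["torch"] = str(torch.__version__)
        versions["cuda"] = torch.version.cuda  # None on CPU-only builds
    except ImportError:
        pass
    return versions


if __name__ == "__main__":
    for name, (want, got) in find_mismatches(EXPECTED, installed_versions()).items():
        print(f"{name}: expected {want}, found {got}")
```
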
## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Model Card Authors [optional]

hyhyhyhyyhyh