# Model Card for llama-carvalho-scansion-gl-cx
Nos-PT/Llama-Carvalho-PT-GL fine-tuned for scansion (lexical-to-metrical syllabification).
The checkpoint was uploaded with HfApi.upload_folder() because pushing the LoRA adapters to the Hub failed in every other format tested so far.
For the same reason, it needs to be downloaded with huggingface_hub.snapshot_download.
We tested the download as follows, using the local_dir parameter rather than HF's local cache, so that the checkpoint lands in a local directory
(~/models/llama-carvalho-scansion-gl-cx in the example):
```python
from pathlib import Path

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="compellit/llama-carvalho-scansion-gl-cx",
    local_dir=Path("~/models/llama-carvalho-scansion-gl-cx").expanduser(),
)
```
## Quick start
Requires a prompt (the one below was used for fine-tuning) and an input text in this format:

```
PREV: sin / *fe / nin / cre- / *en- / zas
CUR: *ten / *cen- / tos / de / al- / *ta- / res
NEXT: *che- / os / de / ri- / *que- / zas
OUTPUT:
```
For that input, the model outputs the prediction for the CUR line: `*ten / *cen- / tos / de al- / *ta- / res` (note that "de" and "al-" are merged into a single metrical syllable).
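The output format described above can be parsed mechanically. A minimal sketch (the helper name `parse_scansion` is illustrative, not part of the model's API):

```python
def parse_scansion(line: str) -> list[tuple[str, bool]]:
    """Split a scanned line into (syllable, is_stressed) pairs."""
    return [(s.lstrip("*"), s.startswith("*")) for s in line.split(" / ")]

parsed = parse_scansion("*ten / *cen- / tos / de al- / *ta- / res")
print(len(parsed))  # → 6 metrical syllables
print([i for i, (_, stressed) in enumerate(parsed) if stressed])  # → [0, 1, 4]
```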
The code below performs inference.
```python
from pathlib import Path

import torch
from unsloth import FastLanguageModel

model_name = str(Path("~/models/llama-carvalho-scansion-gl-cx").expanduser().resolve())
max_seq_length = 512
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,  # auto-detect
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)

instruction = """
I need to scan some lines (scansion is the syllabic division of poetic lines and identification of stresses).
I will give you the lexical syllabification of each line as input. Based on this, you need to identify metrical syllables. This will require merging some syllables (erasing syllable boundaries) and splitting others (adding syllable boundaries).
The input format is:
1. Syllables separated by " / ".
2. Each lexically stressed syllable is preceded by "*".
The desired output format is:
1. Syllables separated by " / ".
2. Each metrically stressed syllable is preceded by "*".
Do not carry out any other modifications to the input, just modify syllable boundaries if needed.
Do not repeat the scanned line more than once.
Output only the scanned version of the line in CUR. Do not output PREV or NEXT.
PREV: previous line or [EMPTY]
CUR: line to scan
NEXT: following line or [EMPTY]
OUTPUT:
"""

example = (
    "PREV: sin / *fe / nin / cre- / *en- / zas\n"
    "CUR: *ten / *cen- / tos / de / al- / *ta- / res\n"
    "NEXT: *che- / os / de / ri- / *que- / zas\n"
    "OUTPUT:\n\n"
)

messages = [
    {"role": "system", "content": instruction.strip()},
    {"role": "user", "content": example},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=128,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,  # greedy decoding
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
generated_tokens = outputs[0][prompt_len:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True).strip())
```
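Since the model is only supposed to move syllable boundaries and stress marks, a quick sanity check on a prediction is that input and output contain the same letters once separators, stress marks, and spaces are stripped. A minimal sketch (the helper name `same_text` is illustrative, not part of the model's API):

```python
def same_text(lexical: str, metrical: str) -> bool:
    """True if two scansions differ only in boundaries, stresses, and spacing."""
    def norm(s: str) -> str:
        return s.replace("*", "").replace("/", "").replace(" ", "")
    return norm(lexical) == norm(metrical)

print(same_text(
    "*ten / *cen- / tos / de / al- / *ta- / res",  # lexical input (CUR)
    "*ten / *cen- / tos / de al- / *ta- / res",    # model output
))  # → True
```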
## Framework versions
This model was trained with SFT.
- PEFT: 0.18.0
- TRL: 0.19.1
- Transformers: 4.57.3
- Pytorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.1