# Model Card for llama-carvalho-scansion-gl-cx
Nos-PT/Llama-Carvalho-PT-GL fine-tuned for scansion (lexical-to-metrical syllabification).
The checkpoint was uploaded with HfApi.upload_folder() because pushing the LoRA adapters to the Hub failed in every other format tested so far.
For the same reason, it needs to be downloaded with huggingface_hub.snapshot_download.
We tested the download as follows, using the local_dir parameter rather than HF's local cache, so that the checkpoint lands in a local directory
(~/models/llama-carvalho-scansion-gl-cx in the example):
```python
from pathlib import Path

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="compellit/llama-carvalho-scansion-gl-cx",
    local_dir=Path("~/models/llama-carvalho-scansion-gl-cx").expanduser(),
)
```
## Quick start
Requires a prompt (the one below was used for fine-tuning) and an input text in this format:

```
PREV: sin / *fe / nin / cre- / *en- / zas
CUR: *ten / *cen- / tos / de / al- / *ta- / res
NEXT: *che- / os / de / ri- / *que- / zas
OUTPUT:
```
For that input, the model outputs the prediction for the CUR line: `*ten / *cen- / tos / de al- / *ta- / res` (note that "de" and "al-" are merged into a single metrical syllable).
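The output format described above can be parsed mechanically. A minimal sketch (the helper name `parse_scansion` is illustrative, not part of the model's API):

```python
def parse_scansion(line: str) -> list[tuple[str, bool]]:
    """Split a scanned line into (syllable, is_stressed) pairs."""
    return [(s.lstrip("*"), s.startswith("*")) for s in line.split(" / ")]

parsed = parse_scansion("*ten / *cen- / tos / de al- / *ta- / res")
print(len(parsed))  # → 6 metrical syllables
print([i for i, (_, stressed) in enumerate(parsed) if stressed])  # → [0, 1, 4]
```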
The code below performs inference.
```python
from pathlib import Path

import torch
from unsloth import FastLanguageModel

model_name = str(Path("~/models/llama-carvalho-scansion-gl-cx").expanduser().resolve())
max_seq_length = 512
load_in_4bit = True

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=None,  # auto-detect
    load_in_4bit=load_in_4bit,
)
FastLanguageModel.for_inference(model)

instruction = """
I need to scan some lines (scansion is the syllabic division of poetic lines and identification of stresses).
I will give you the lexical syllabification of each line as input. Based on this, you need to identify metrical syllables. This will require merging some syllables (erasing syllable boundaries) and splitting others (adding syllable boundaries).
The input format is:
1. Syllables separated by " / ".
2. Each lexically stressed syllable is preceded by "*".
The desired output format is:
1. Syllables separated by " / ".
2. Each metrically stressed syllable is preceded by "*".
Do not carry out any other modifications to the input, just modify syllable boundaries if needed.
Do not repeat the scanned line more than once.
Output only the scanned version of the line in CUR. Do not output PREV or NEXT.
PREV: previous line or [EMPTY]
CUR: line to scan
NEXT: following line or [EMPTY]
OUTPUT:
"""

example = (
    "PREV: sin / *fe / nin / cre- / *en- / zas\n"
    "CUR: *ten / *cen- / tos / de / al- / *ta- / res\n"
    "NEXT: *che- / os / de / ri- / *que- / zas\n"
    "OUTPUT:\n\n"
)

messages = [
    {"role": "system", "content": instruction.strip()},
    {"role": "user", "content": example},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.inference_mode():
    outputs = model.generate(
        inputs["input_ids"],
        max_new_tokens=128,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=False,  # greedy decoding
    )

# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
generated_tokens = outputs[0][prompt_len:]
print(tokenizer.decode(generated_tokens, skip_special_tokens=True).strip())
```
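Since the model is only supposed to move syllable boundaries and stress marks, a quick sanity check on a prediction is that input and output contain the same letters once separators, stress marks, and spaces are stripped. A minimal sketch (the helper name `same_text` is illustrative, not part of the model's API):

```python
def same_text(lexical: str, metrical: str) -> bool:
    """True if two scansions differ only in boundaries, stresses, and spacing."""
    def norm(s: str) -> str:
        return s.replace("*", "").replace("/", "").replace(" ", "")
    return norm(lexical) == norm(metrical)

print(same_text(
    "*ten / *cen- / tos / de / al- / *ta- / res",  # lexical input (CUR)
    "*ten / *cen- / tos / de al- / *ta- / res",    # model output
))  # → True
```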
## Framework versions
This model was trained with SFT.
- PEFT: 0.18.0
- TRL: 0.19.1
- Transformers: 4.57.3
- Pytorch: 2.9.1
- Datasets: 4.3.0
- Tokenizers: 0.22.1