# banking77-llama-1b-lora

Fine-tuned `unsloth/Llama-3.2-1B-Instruct` on the `mteb/banking77` 77-class banking intent classification benchmark using QLoRA + rsLoRA + NEFTune.

**90.21% exact-match accuracy**, up from 0% zero-shot on the same task. Single T4 GPU · 3 epochs · ~50 minutes.
## Evaluation Results
| Setting | Accuracy | Correct / Total |
|---|---|---|
| Base model (zero-shot) | 0.00% | 0 / 3,076 |
| Fine-tuned — free generation | 90.21% | 2,775 / 3,076 |
| Fine-tuned — constrained decoding (trie) | 90.28% | 2,777 / 3,076 |
Evaluated on the full Banking77 test split (3,076 samples, 77 classes, ~40 samples per class). Constrained decoding uses a token-level prefix trie over all 77 valid label names, so the model cannot emit an invalid label. It fixed 2 samples and broke 0 (zero regressions).
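The prefix-trie constraint can be sketched as follows. This is a simplified illustration using toy token-id sequences, not the exact code from the training repo: in practice each label name is tokenized once with the model's tokenizer, and the trie is walked alongside generation.

```python
# Sketch of token-level prefix-trie constrained decoding. At every decoding
# step, only token ids that extend some valid label are permitted.

def build_trie(token_sequences):
    """Nested-dict trie over token-id sequences (one sequence per label)."""
    root = {}
    for seq in token_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(trie, prefix):
    """Token ids that may follow `prefix` without leaving the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix does not match any valid label
    return list(node.keys())

# Toy example: pretend three labels tokenize to these id sequences.
label_token_ids = [[5, 8, 2], [5, 9], [7, 1]]
trie = build_trie(label_token_ids)

print(allowed_next_tokens(trie, []))      # → [5, 7]
print(allowed_next_tokens(trie, [5]))     # → [8, 9]
print(allowed_next_tokens(trie, [5, 8]))  # → [2]
```

With `transformers`, a callback like this can be plugged into `model.generate(..., prefix_allowed_tokens_fn=...)`, which passes the ids generated so far and expects the list of allowed next tokens.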
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "rajo0113/banking77-llama-1b-lora")
model = model.merge_and_unload()  # optional: merge LoRA weights for faster inference
tokenizer = AutoTokenizer.from_pretrained("rajo0113/banking77-llama-1b-lora")
```
### Inference
The model was trained with Llama's native chat template. Use it the same way at inference:
```python
INSTRUCTION = (
    "Classify the following banking customer query into one of 77 intent "
    "categories. Output only the intent label name (snake_case)."
)

def classify(query: str) -> str:
    messages = [{"role": "user", "content": f"{INSTRUCTION}\n\n{query}"}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=20, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

print(classify("I lost my card, how do I get a new one?"))
# → card_arrival
print(classify("Why was my international transfer declined?"))
# → declined_transfer
print(classify("Can I use Apple Pay with my account?"))
# → apple_pay_or_google_pay
```
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | unsloth/Llama-3.2-1B-Instruct |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| LoRA rank | 16 |
| LoRA alpha | 32 (2 × rank) |
| rsLoRA | ✓ (1/√rank output scaling) |
| NEFTune noise alpha | 5 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4 (cosine + 3% warmup) |
| Effective batch size | 16 (per_device=8, grad_accum=2) |
| Epochs | 3 |
| Sequence packing | ✓ |
| Trainable parameters | ~1.2% of total |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
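The table above maps to roughly the following PEFT/TRL configuration. This is a sketch reconstructed from the table, not the exact training script; `use_rslora` in `LoraConfig` and `neftune_noise_alpha` / `packing` in `SFTConfig` are the relevant knobs.

```python
from peft import LoraConfig
from trl import SFTConfig

# Approximate reconstruction of the configuration table above (a sketch,
# not the exact script from the training repo).
lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,                # 2 x rank
    use_rslora=True,              # rsLoRA: scale by alpha / sqrt(r) instead of alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="banking77-llama-1b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
    neftune_noise_alpha=5,           # NEFTune embedding noise
    packing=True,                    # sequence packing
)
```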
### Why the native chat template?
Using Llama's native chat format rather than Alpaca-style `### Instruction` / `### Response` markers was the single most impactful design decision. It matches the format the model was RLHF-trained on, so SFT does not have to fight the model's learned prompt-format prior.
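To make the contrast concrete, the two prompt styles look roughly like this for the same query. The Llama 3 string below is an approximation of what `tokenizer.apply_chat_template` emits (the 3.2 template may also insert a default system block with a date), not output captured from the tokenizer itself.

```python
# Alpaca-style prompt (NOT used for this model):
alpaca_prompt = (
    "### Instruction:\n"
    "Classify the following banking customer query ...\n\n"
    "### Response:\n"
)

# Llama 3 native chat format (roughly what apply_chat_template produces
# with add_generation_prompt=True):
llama_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Classify the following banking customer query ...<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```

The special header tokens in the second form are exactly the ones the instruct model saw during its own alignment training.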
## Dataset
`mteb/banking77` contains 13,069 samples across 77 banking intent classes.
Train split: 9,993 samples. Test split: 3,076 samples (held out entirely during training).
Labels are integer IDs (0–76) resolved to snake_case strings (e.g. `card_arrival`, `transfer_not_received_by_recipient`).
The task is framed as generative classification: the model outputs the label string directly, leveraging its text generation capability without requiring a classification head.
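Under this framing, evaluation reduces to exact string match between the generated label and the gold label name. A minimal sketch (the lowercase/strip normalization here is an assumption, not necessarily the repo's exact logic):

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match the gold label string."""
    normalize = lambda s: s.strip().lower()
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["card_arrival", "top_up_by_card", "apple_pay_or_google_pay"]
golds = ["card_arrival", "topping_up_by_card", "apple_pay_or_google_pay"]
print(exact_match_accuracy(preds, golds))  # → 0.6666666666666666
```

Note the second prediction counts as wrong even though it is semantically close: exact match gives no partial credit, which is why invalid-label hallucinations hurt so much.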
## Worst-Class Analysis
| Class | Accuracy | Root cause |
|---|---|---|
| `topping_up_by_card` | 0/40 (0%) | Near-identical surface form to `top_up_by_card_charge`; the model hallucinated the invalid label `top_up_by_card` on 9/40 samples |
| `card_arrival` | 21/40 (52.5%) | Dataset-level ambiguity with `card_delivery_estimate`; 18 errors come from this single confused pair |
| `transfer_not_received_by_recipient` | 27/39 (69.2%) | Overlapping intent descriptions |
Constrained decoding eliminates invalid-label hallucinations but cannot fix semantic confusion the model never learned to resolve during training.
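Per-class numbers like those in the table can be computed with a simple counter over (prediction, gold) pairs. A hedged sketch, not the repo's evaluation suite:

```python
from collections import Counter, defaultdict

def per_class_accuracy(predictions, references):
    """Per-gold-class accuracy plus each class's most frequent confusion."""
    totals, correct = Counter(), Counter()
    confusions = defaultdict(Counter)
    for pred, gold in zip(predictions, references):
        totals[gold] += 1
        if pred == gold:
            correct[gold] += 1
        else:
            confusions[gold][pred] += 1
    report = {}
    for cls in totals:
        top = confusions[cls].most_common(1)
        report[cls] = (
            correct[cls] / totals[cls],       # accuracy for this class
            top[0][0] if top else None,       # most frequent wrong prediction
        )
    return report

preds = ["top_up_by_card", "card_arrival", "card_arrival"]
golds = ["topping_up_by_card", "card_arrival", "card_delivery_estimate"]
print(per_class_accuracy(preds, golds))
```

Sorting the report by accuracy ascending surfaces the worst classes and their dominant confusion partner, which is how pair-level issues like `card_arrival` vs `card_delivery_estimate` show up.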
## Training Code & Full Methodology

Full training notebook (Google Colab T4) with the evaluation suite: [github.com/rajo69/Finetuning-Experiment-1](https://github.com/rajo69/Finetuning-Experiment-1)
## Citation
```bibtex
@inproceedings{casanueva2020efficient,
  title     = {Efficient Intent Detection with Dual Sentence Encoders},
  author    = {Casanueva, I{\~n}igo and others},
  booktitle = {Proceedings of the 2nd Workshop on NLP for ConvAI},
  year      = {2020}
}

@article{hu2021lora,
  title  = {LoRA: Low-Rank Adaptation of Large Language Models},
  author = {Hu, Edward J and others},
  year   = {2021}
}

@article{dettmers2023qlora,
  title  = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author = {Dettmers, Tim and others},
  year   = {2023}
}
```