# banking77-llama-1b-lora

Fine-tuned `unsloth/Llama-3.2-1B-Instruct` on the `mteb/banking77` 77-class banking intent classification benchmark using QLoRA + rsLoRA + NEFTune.

**90.21% exact-match accuracy**, up from 0% zero-shot on the same task. Single T4 GPU · 3 epochs · ~50 minutes.
## Evaluation Results
| Setting | Accuracy | Correct / Total |
|---|---|---|
| Base model (zero-shot) | 0.00% | 0 / 3,076 |
| Fine-tuned — free generation | 90.21% | 2,775 / 3,076 |
| Fine-tuned — constrained decoding (trie) | 90.28% | 2,777 / 3,076 |
Evaluated on the full Banking77 test split (3,076 samples, 77 classes, ~40 samples per class). Constrained decoding uses a token-level prefix trie over all 77 valid label names, so the model cannot emit an invalid label. It fixed 2 samples and broke 0 (zero regressions).
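The prefix-trie constraint can be sketched as follows. This is a simplified illustration using toy token-id sequences, not the exact code from the training repo: in practice each label name is tokenized once with the model's tokenizer, and the trie is walked alongside generation.

```python
# Sketch of token-level prefix-trie constrained decoding. At every decoding
# step, only token ids that extend some valid label are permitted.

def build_trie(token_sequences):
    """Nested-dict trie over token-id sequences (one sequence per label)."""
    root = {}
    for seq in token_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next_tokens(trie, prefix):
    """Token ids that may follow `prefix` without leaving the trie."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return []  # prefix does not match any valid label
    return list(node.keys())

# Toy example: pretend three labels tokenize to these id sequences.
label_token_ids = [[5, 8, 2], [5, 9], [7, 1]]
trie = build_trie(label_token_ids)

print(allowed_next_tokens(trie, []))      # → [5, 7]
print(allowed_next_tokens(trie, [5]))     # → [8, 9]
print(allowed_next_tokens(trie, [5, 8]))  # → [2]
```

With `transformers`, a callback like this can be plugged into `model.generate(..., prefix_allowed_tokens_fn=...)`, which passes the ids generated so far and expects the list of allowed next tokens.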
## How to Use
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "rajo0113/banking77-llama-1b-lora")
model = model.merge_and_unload()  # optional: merge LoRA weights for faster inference
tokenizer = AutoTokenizer.from_pretrained("rajo0113/banking77-llama-1b-lora")
```
### Inference
The model was trained with Llama's native chat template. Use it the same way at inference:
```python
INSTRUCTION = (
    "Classify the following banking customer query into one of 77 intent "
    "categories. Output only the intent label name (snake_case)."
)

def classify(query: str) -> str:
    messages = [{"role": "user", "content": f"{INSTRUCTION}\n\n{query}"}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=20, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens, skipping the prompt.
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

print(classify("I lost my card, how do I get a new one?"))
# → card_arrival
print(classify("Why was my international transfer declined?"))
# → declined_transfer
print(classify("Can I use Apple Pay with my account?"))
# → apple_pay_or_google_pay
```
## Training Configuration
| Parameter | Value |
|---|---|
| Base model | unsloth/Llama-3.2-1B-Instruct |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| LoRA rank | 16 |
| LoRA alpha | 32 (2 × rank) |
| rsLoRA | ✓ (1/√rank output scaling) |
| NEFTune noise alpha | 5 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4 (cosine + 3% warmup) |
| Effective batch size | 16 (per_device=8, grad_accum=2) |
| Epochs | 3 |
| Sequence packing | ✓ |
| Trainable parameters | ~1.2% of total |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
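The table above maps to roughly the following PEFT/TRL configuration. This is a sketch reconstructed from the table, not the exact training script; `use_rslora` in `LoraConfig` and `neftune_noise_alpha` / `packing` in `SFTConfig` are the relevant knobs.

```python
from peft import LoraConfig
from trl import SFTConfig

# Approximate reconstruction of the configuration table above (a sketch,
# not the exact script from the training repo).
lora_config = LoraConfig(
    r=16,                         # LoRA rank
    lora_alpha=32,                # 2 x rank
    use_rslora=True,              # rsLoRA: scale by alpha / sqrt(r) instead of alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="banking77-llama-1b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,   # effective batch size 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
    neftune_noise_alpha=5,           # NEFTune embedding noise
    packing=True,                    # sequence packing
)
```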
### Why the native chat template?
Using Llama's native chat format rather than Alpaca-style `### Instruction` / `### Response` markers was the single most impactful design decision. It matches the format the model was RLHF-trained on, so SFT does not have to fight the model's learned prompt-format prior.
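To make the contrast concrete, the two prompt styles look roughly like this for the same query. The Llama 3 string below is an approximation of what `tokenizer.apply_chat_template` emits (the 3.2 template may also insert a default system block with a date), not output captured from the tokenizer itself.

```python
# Alpaca-style prompt (NOT used for this model):
alpaca_prompt = (
    "### Instruction:\n"
    "Classify the following banking customer query ...\n\n"
    "### Response:\n"
)

# Llama 3 native chat format (roughly what apply_chat_template produces
# with add_generation_prompt=True):
llama_prompt = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Classify the following banking customer query ...<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
)
```

The special header tokens in the second form are exactly the ones the instruct model saw during its own alignment training.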
## Dataset
`mteb/banking77` contains 13,069 samples across 77 banking intent classes.
Train split: 9,993 samples. Test split: 3,076 samples (held out entirely during training).
Labels are integer IDs (0–76) resolved to snake_case strings (e.g. `card_arrival`, `transfer_not_received_by_recipient`).
The task is framed as generative classification: the model outputs the label string directly, leveraging its text generation capability without requiring a classification head.
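Under this framing, evaluation reduces to exact string match between the generated label and the gold label name. A minimal sketch (the lowercase/strip normalization here is an assumption, not necessarily the repo's exact logic):

```python
def exact_match_accuracy(predictions, references):
    """Share of predictions that exactly match the gold label string."""
    normalize = lambda s: s.strip().lower()
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)

preds = ["card_arrival", "top_up_by_card", "apple_pay_or_google_pay"]
golds = ["card_arrival", "topping_up_by_card", "apple_pay_or_google_pay"]
print(exact_match_accuracy(preds, golds))  # → 0.6666666666666666
```

Note the second prediction counts as wrong even though it is semantically close: exact match gives no partial credit, which is why invalid-label hallucinations hurt so much.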
## Worst-Class Analysis
| Class | Accuracy | Root cause |
|---|---|---|
| `topping_up_by_card` | 0/40 (0%) | Near-identical surface form to `top_up_by_card_charge`; the model hallucinated the invalid label `top_up_by_card` on 9/40 samples |
| `card_arrival` | 21/40 (52.5%) | Dataset-level ambiguity with `card_delivery_estimate`; 18 errors come from this single confused pair |
| `transfer_not_received_by_recipient` | 27/39 (69.2%) | Overlapping intent descriptions |
Constrained decoding eliminates invalid-label hallucinations but cannot fix semantic confusion the model never learned to resolve during training.
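Per-class numbers like those in the table can be computed with a simple counter over (prediction, gold) pairs. A hedged sketch, not the repo's evaluation suite:

```python
from collections import Counter, defaultdict

def per_class_accuracy(predictions, references):
    """Per-gold-class accuracy plus each class's most frequent confusion."""
    totals, correct = Counter(), Counter()
    confusions = defaultdict(Counter)
    for pred, gold in zip(predictions, references):
        totals[gold] += 1
        if pred == gold:
            correct[gold] += 1
        else:
            confusions[gold][pred] += 1
    report = {}
    for cls in totals:
        top = confusions[cls].most_common(1)
        report[cls] = (
            correct[cls] / totals[cls],       # accuracy for this class
            top[0][0] if top else None,       # most frequent wrong prediction
        )
    return report

preds = ["top_up_by_card", "card_arrival", "card_arrival"]
golds = ["topping_up_by_card", "card_arrival", "card_delivery_estimate"]
print(per_class_accuracy(preds, golds))
```

Sorting the report by accuracy ascending surfaces the worst classes and their dominant confusion partner, which is how pair-level issues like `card_arrival` vs `card_delivery_estimate` show up.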
## Training Code & Full Methodology

Full training notebook (Google Colab T4) with the evaluation suite: [github.com/rajo69/Finetuning-Experiment-1](https://github.com/rajo69/Finetuning-Experiment-1)
## Citation
```bibtex
@inproceedings{casanueva2020efficient,
  title     = {Efficient Intent Detection with Dual Sentence Encoders},
  author    = {Casanueva, I{\~n}igo and others},
  booktitle = {Proceedings of the 2nd Workshop on NLP for ConvAI},
  year      = {2020}
}

@article{hu2021lora,
  title  = {LoRA: Low-Rank Adaptation of Large Language Models},
  author = {Hu, Edward J and others},
  year   = {2021}
}

@article{dettmers2023qlora,
  title  = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author = {Dettmers, Tim and others},
  year   = {2023}
}
```