# banking77-llama-1b-lora

Fine-tuned `unsloth/Llama-3.2-1B-Instruct` on `mteb/banking77`, the 77-class banking intent classification benchmark, using QLoRA + rsLoRA + NEFTune.

**90.21% exact-match accuracy**, up from 0% zero-shot on the same task. Single T4 GPU · 3 epochs · ~50 minutes.


## Evaluation Results

| Setting | Accuracy | Correct / Total |
|---|---|---|
| Base model (zero-shot) | 0.00% | 0 / 3,076 |
| Fine-tuned (free generation) | 90.21% | 2,775 / 3,076 |
| Fine-tuned (constrained decoding, trie) | 90.28% | 2,777 / 3,076 |

Evaluated on the full Banking77 test split (3,076 samples, 77 classes, ~40 samples/class). Constrained decoding uses a token-level prefix trie over all 77 valid label names, making it impossible for the model to generate an invalid label. It fixed 2 samples and broke none (zero regressions).
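The trie mechanism can be sketched as follows. This is an illustrative reconstruction, not the repository's actual implementation: each label name is tokenized into a token-ID sequence (here, small integers stand in for real Llama token IDs), the sequences are inserted into a nested-dict trie, and at each decoding step only the children of the current prefix are permitted, which is exactly the shape expected by the `prefix_allowed_tokens_fn` hook of `transformers`' `generate`.

```python
# Sketch of token-level trie-constrained decoding (illustrative; not the
# repository's implementation). Real usage would tokenize the 77 label
# names with the Llama tokenizer; small integer IDs stand in here.

EOS = -1  # stand-in for tokenizer.eos_token_id

def build_trie(token_sequences):
    """Nested-dict trie over token-ID sequences, terminated by EOS."""
    trie = {}
    for seq in token_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
        node[EOS] = {}  # mark the end of a complete label
    return trie

def allowed_next_tokens(trie, prefix):
    """Tokens that may legally follow `prefix` (for prefix_allowed_tokens_fn)."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []  # prefix is not part of any valid label
        node = node[tok]
    return list(node.keys())

# Hypothetical token IDs for three labels, e.g. "card_arrival" -> [3, 7]
labels = [[3, 7], [3, 9, 2], [5, 1]]
trie = build_trie(labels)

print(allowed_next_tokens(trie, []))      # first tokens of any label: [3, 5]
print(allowed_next_tokens(trie, [3]))     # [7, 9]
print(allowed_next_tokens(trie, [3, 7]))  # [-1] -> only EOS: label complete
```

With the real tokenizer, passing `prefix_allowed_tokens_fn=lambda batch_id, ids: allowed_next_tokens(trie, ids[prompt_len:].tolist())` to `model.generate` restricts decoding to the 77 valid labels.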


## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "rajo0113/banking77-llama-1b-lora")
model = model.merge_and_unload()  # optional: merge weights for faster inference
tokenizer = AutoTokenizer.from_pretrained("rajo0113/banking77-llama-1b-lora")
```

### Inference

The model was trained with Llama's native chat template. Use it the same way at inference:

```python
INSTRUCTION = (
    "Classify the following banking customer query into one of 77 intent "
    "categories. Output only the intent label name (snake_case)."
)

def classify(query: str) -> str:
    messages = [{"role": "user", "content": f"{INSTRUCTION}\n\n{query}"}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model.generate(
            **inputs, max_new_tokens=20, do_sample=False,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(
        out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()

print(classify("I lost my card, how do I get a new one?"))
# → card_arrival
print(classify("Why was my international transfer declined?"))
# → declined_transfer
print(classify("Can I use Apple Pay with my account?"))
# → apple_pay_or_google_pay
```

## Training Configuration

| Parameter | Value |
|---|---|
| Base model | `unsloth/Llama-3.2-1B-Instruct` |
| Quantisation | 4-bit NF4 (bitsandbytes) |
| LoRA rank | 16 |
| LoRA alpha | 32 (2 × rank) |
| rsLoRA | ✓ (1/√rank output scaling) |
| NEFTune noise alpha | 5 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4 (cosine + 3% warmup) |
| Effective batch size | 16 (per_device=8, grad_accum=2) |
| Epochs | 3 |
| Sequence packing | ✓ |
| Trainable parameters | ~1.2% of total |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
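The table above maps roughly onto a standard `peft`/`transformers` configuration. This is a sketch, not the actual training script: the run used Unsloth's wrappers, whose argument names differ slightly, and `output_dir` is a placeholder; but `use_rslora=True` and `neftune_noise_alpha` are the standard knobs for rsLoRA and NEFTune.

```python
# Approximate configuration matching the table above (sketch only; the
# actual run used Unsloth's wrappers, whose argument names differ).
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=16,                       # LoRA rank
    lora_alpha=32,              # 2 x rank
    use_rslora=True,            # scale by alpha/sqrt(r) instead of alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = TrainingArguments(
    output_dir="banking77-lora",    # placeholder path
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,  # effective batch size 16
    num_train_epochs=3,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_8bit",
    neftune_noise_alpha=5,          # NEFTune embedding noise
)
```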

## Why the native chat template?

Using Llama's native chat format rather than the Alpaca-style `### Instruction` / `### Response` layout was the single most impactful design decision. It matches the format the model was RLHF-trained on, so supervised fine-tuning reinforces the model's prior instead of fighting it.


## Dataset

`mteb/banking77`: 13,069 samples across 77 banking intent classes. Train split: 9,993 samples. Test split: 3,076 samples (held out entirely during training). Labels are integer IDs (0–76) resolved to snake_case strings (e.g. `card_arrival`, `transfer_not_received_by_recipient`).

The task is framed as generative classification: the model outputs the label string directly, leveraging its text generation capability without requiring a classification head.
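Concretely, each training example pairs the instruction-plus-query user turn with the label string as the assistant turn. A minimal sketch of that construction; the helper name `build_example` is illustrative, and the integer IDs shown are not necessarily the dataset's actual codes:

```python
# Sketch: how a Banking77 row becomes a chat-format SFT example
# (build_example is an illustrative helper, not from the training code).

# Illustrative subset of the 77 snake_case label names, indexed by ID
# (IDs here are placeholders, not the dataset's actual integer codes).
ID_TO_LABEL = {11: "card_arrival", 46: "topping_up_by_card"}

INSTRUCTION = (
    "Classify the following banking customer query into one of 77 intent "
    "categories. Output only the intent label name (snake_case)."
)

def build_example(query, label_id):
    """Return a chat-format example; the assistant turn is the target string."""
    return [
        {"role": "user", "content": f"{INSTRUCTION}\n\n{query}"},
        {"role": "assistant", "content": ID_TO_LABEL[label_id]},
    ]

example = build_example("I still haven't received my new card.", 11)
print(example[1]["content"])  # -> card_arrival
```

Passing such message lists through `tokenizer.apply_chat_template` yields the packed sequences the trainer consumes, so inference with the same template sees exactly the distribution it was trained on.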


## Worst-Class Analysis

| Class | Accuracy | Root cause |
|---|---|---|
| `topping_up_by_card` | 0/40 (0%) | Near-identical surface form to `top_up_by_card_charge`; model hallucinated the invalid label `top_up_by_card` on 9/40 samples |
| `card_arrival` | 21/40 (52.5%) | Dataset-level ambiguity with `card_delivery_estimate`; 18 errors from a single confused pair |
| `transfer_not_received_by_recipient` | 27/39 (69.2%) | Overlapping intent descriptions |

Constrained decoding eliminates invalid-label hallucinations but cannot fix semantic confusion the model never learned to resolve during training.
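Per-class numbers like those above can be derived from a prediction log with a few lines of stdlib Python. A sketch (the helper name `worst_classes` and the toy prediction pairs are illustrative):

```python
# Sketch: per-class accuracy from (gold_label, predicted_label) pairs,
# listing the worst classes as in the table above.
from collections import defaultdict

def worst_classes(pairs, k=3):
    """pairs: iterable of (gold, pred) label strings; return the k worst classes."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for gold, pred in pairs:
        total[gold] += 1
        correct[gold] += int(gold == pred)
    scores = {c: correct[c] / total[c] for c in total}
    return sorted(scores, key=scores.get)[:k]

# Toy prediction log for illustration
pairs = [
    ("topping_up_by_card", "top_up_by_card"),         # invalid-label hallucination
    ("topping_up_by_card", "top_up_by_card_charge"),  # near-identical class
    ("card_arrival", "card_arrival"),
    ("card_arrival", "card_delivery_estimate"),       # confused pair
    ("visa_or_mastercard", "visa_or_mastercard"),
]
print(worst_classes(pairs, k=2))
# -> ['topping_up_by_card', 'card_arrival']
```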


## Training Code &amp; Full Methodology

Full training notebook (Google Colab T4) with evaluation suite: github.com/rajo69/Finetuning-Experiment-1


## Citation

```bibtex
@inproceedings{casanueva2020efficient,
  title     = {Efficient Intent Detection with Dual Sentence Encoders},
  author    = {Casanueva, I{\~n}igo and others},
  booktitle = {Proceedings of the 2nd Workshop on NLP for ConvAI},
  year      = {2020}
}
@article{hu2021lora,
  title  = {LoRA: Low-Rank Adaptation of Large Language Models},
  author = {Hu, Edward J and others},
  year   = {2021}
}
@article{dettmers2023qlora,
  title  = {QLoRA: Efficient Finetuning of Quantized LLMs},
  author = {Dettmers, Tim and others},
  year   = {2023}
}
```