Qwen2.5-14B-Instruct, CaseHOLD LoRA

A LoRA adapter for Qwen2.5-14B-Instruct fine-tuned on CaseHOLD (the legal holding-selection task from LexGLUE). This is a capability probe, not a product. The point is to show that a small LoRA on a 14B model you can actually own lifts a targeted domain task and still deploys on consumer hardware. It is not a legal product and not legal advice. Read the limitations section first.

Result

First-token logit scoring over the answer letters, on a held-out validation split, n=500, identical questions for the base model and the adapter (apples to apples). CaseHOLD is a five-option task.

	Accuracy
Base Qwen2.5-14B-Instruct	0.738 (369/500)
Base + this adapter	0.840 (420/500)
Lift	+10.2 points (+13.8% relative)

The lift is roughly 4 to 5 standard errors above baseline at n=500, so it is not sampling noise.

Training

Base: Qwen/Qwen2.5-14B-Instruct
Data: CaseHOLD (LexGLUE), 8000 training examples
Method: LoRA (r=16, alpha=32, dropout=0.05) across all attention and MLP projections, bf16
400 steps, effective batch 16, lr 2e-4, cosine schedule
Trainable parameters: 68.8M (0.46% of the model)
Training loss: 2.03 down to 1.27
Hardware: one H100 SXM 80GB, about 21 minutes

Deploy

The adapter merges into the base and follows the same quantize-and-serve path proven for the sibling medical adapter (merge, f16 GGUF, q4_K_M at about 8.4 GB, runs on a 24GB or 12GB card). The legal deploy loop was not separately closed in this probe, so treat the deploy path as demonstrated-by-analogy rather than separately measured for this adapter.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = "Qwen/Qwen2.5-14B-Instruct"
tok = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, "ArgusForge/qwen2.5-14b-casehold-lora")

Limitations and intended use

This is a research capability probe. It is not a legal product and not legal advice. Do not use it to make or inform legal decisions. The accuracy figure is a single run, single seed, on an n=500 held-out slice. A product-grade claim would need multi-seed confirmation, the full validation split, an external held-out set, and a domain validation regime before any advice-facing use.

Reproduction

Public base model, public dataset (CaseHOLD via LexGLUE). The eval is first-token logit scoring over the option letters on a held-out n=500 split, with identical prompts for base and adapter.

Downloads last month: 27

Inference Providers NEW

Multiple Choice

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ArgusForge/qwen2.5-14b-casehold-lora

Base model

Qwen/Qwen2.5-14B

Finetuned

Qwen/Qwen2.5-14B-Instruct

Adapter

(351)

this model

Dataset used to train ArgusForge/qwen2.5-14b-casehold-lora

Evaluation results

Accuracy (held-out, n=500) on CaseHOLD (LexGLUE)
self-reported

0.840