# Jina v5 H256: Universal Classification LoRA
A single LoRA adapter (405KB) that improves classification performance across all MTEB Classification tasks, built on top of the compressed gomyk/jina-v5-h256-distilled-conv embedding model.
Unlike per-task adapters, this is one universal adapter that enhances the embedding space for classification in general; no task-specific fine-tuning is needed at inference time.
## LoRA Specification
| Property | Value |
|---|---|
| Base model | gomyk/jina-v5-h256-distilled-conv |
| Base architecture | EuroBERT (6L / 256d / 41,778 vocab / 16.9M params) |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA scaling (alpha/rank) | 2.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| LoRA layers | 6 layers × 4 projections = 24 LoRA pairs (A/B) |
| LoRA A shape | [256, 8] per projection |
| LoRA B shape | [8, 256] per projection |
| Total LoRA params | 98,304 (0.58% of base model) |
| Adapter file size | 405KB |
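The parameter count in the table follows directly from the shapes above; a quick arithmetic check:

```python
# Sanity check of the adapter's parameter count from the spec table.
# Each of the 24 LoRA pairs has A: [256, 8] and B: [8, 256].
hidden, rank = 256, 8
layers, projections = 6, 4

params_per_pair = hidden * rank + rank * hidden   # A + B
total = layers * projections * params_per_pair
print(total)                                      # 98304 LoRA parameters

base_params = 16_900_000                          # ~16.9M base model params
print(f"{total / base_params:.2%}")               # ~0.58% of the base model
```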
## MTEB Classification Results
Average: **73.20% → 82.98% (+9.78%p)**
| Task | Classes | Baseline | + LoRA | Delta |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 2 | 76.93% | 91.57% | +14.64%p |
| Banking77Classification | 77 | 77.83% | 84.81% | +6.98%p |
| ImdbClassification | 2 | 73.03% | 79.54% | +6.51%p |
| MTOPDomainClassification | 11 | 90.63% | 96.18% | +5.55%p |
| MassiveIntentClassification | 60 | 67.90% | 78.08% | +10.18%p |
| MassiveScenarioClassification | 18 | 72.97% | 85.68% | +12.71%p |
| ToxicConversationsClassification | 2 | 61.83% | 77.99% | +16.16%p |
| TweetSentimentExtractionClassification | 3 | 64.46% | 70.01% | +5.55%p |
| **Average** | | 73.20% | 82.98% | **+9.78%p** |
- **Baseline**: MTEB default evaluation (logistic regression on frozen embeddings)
- **+ LoRA**: same evaluation, but with the universal LoRA adapter merged into the embedding model
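The baseline protocol can be sketched as follows; the arrays here are random stand-ins for real embeddings, and MTEB runs this probe internally:

```python
# Sketch of the evaluation protocol: a logistic regression probe trained on
# frozen embeddings. Random data stands in for encoder output here.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by the (optionally LoRA-merged) encoder.
train_emb = rng.normal(size=(200, 256))
train_y = rng.integers(0, 2, size=200)
test_emb = rng.normal(size=(50, 256))

clf = LogisticRegression(max_iter=1000)
clf.fit(train_emb, train_y)
preds = clf.predict(test_emb)
print(preds.shape)  # (50,)
```

Only the embeddings change between the two rows of the table; the probe itself is identical.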
## Training Method
Multi-task classification with a shared LoRA backbone:

```
Input → [EuroBERT + LoRA] → pool ─┬─ Head_Amazon(256→2)      → CE loss
                                  ├─ Head_Banking(256→77)    → CE loss
                                  ├─ Head_IMDB(256→2)        → CE loss
                                  ├─ Head_MTOP(256→11)       → CE loss
                                  ├─ Head_Massive_I(256→60)  → CE loss
                                  ├─ Head_Massive_S(256→18)  → CE loss
                                  ├─ Head_Toxic(256→2)       → CE loss
                                  └─ Head_Tweet(256→3)       → CE loss
```
All heads share the same LoRA backbone.
After training, the heads are discarded; only the LoRA adapter is kept.
The LoRA-enhanced embeddings are universally better for classification.
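A minimal sketch of this setup, assuming a pooled 256-d embedding from the LoRA backbone (the encoder itself and the real data loaders are omitted; the task names and `step` helper are illustrative):

```python
# Sketch of the multi-task head setup: one shared encoder (with LoRA) feeding
# eight task-specific linear heads, each with its own cross-entropy loss.
import torch
import torch.nn as nn

HIDDEN = 256
tasks = {"amazon": 2, "banking": 77, "imdb": 2, "mtop": 11,
         "massive_intent": 60, "massive_scenario": 18, "toxic": 2, "tweet": 3}

heads = nn.ModuleDict({t: nn.Linear(HIDDEN, c) for t, c in tasks.items()})
criterion = nn.CrossEntropyLoss()

def step(pooled: torch.Tensor, labels: torch.Tensor, task: str) -> torch.Tensor:
    """One loss term: pooled embeddings [B, 256] through that task's head."""
    logits = heads[task](pooled)
    return criterion(logits, labels)

# Toy batch: random "pooled embeddings" stand in for the LoRA backbone output.
pooled = torch.randn(4, HIDDEN)
labels = torch.randint(0, tasks["banking"], (4,))
loss = step(pooled, labels, "banking")
print(loss.dim())  # 0 (scalar); gradients would flow into the shared LoRA params
```

Because every task's loss backpropagates through the same LoRA matrices, the adapter is pushed toward an embedding space that separates classes across all eight tasks at once.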
### Training Configuration

| Parameter | Value |
|---|---|
| Training data | 112,716 samples from 8 MTEB Classification tasks |
| Total classes | 175 (across all tasks) |
| Optimizer | AdamW (lr=2e-4, weight_decay=0.01) |
| Scheduler | CosineAnnealingLR |
| Epochs | 10 |
| Batch size | 32 |
| Max sequence length | 128 |
| Gradient clipping | max_norm=1.0 |
| Loss function | Cross-Entropy (per task head) |
| Final training loss | 0.189 |
| Final training accuracy | ~93% |
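The optimizer and scheduler wiring from the table can be sketched as below; `lora_params` is a placeholder for the trainable A/B matrices, and the loss in the loop is a stand-in for the real forward/backward pass over the 112,716 samples:

```python
# Sketch of the training loop configuration from the table above.
import torch

# Placeholder for the 24 trainable LoRA A matrices (B matrices analogous).
lora_params = [torch.nn.Parameter(torch.zeros(256, 8)) for _ in range(24)]

optimizer = torch.optim.AdamW(lora_params, lr=2e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    # Stand-in for the real per-batch forward/backward over all task heads.
    loss = sum((p ** 2).sum() for p in lora_params)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(lora_params, max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()  # cosine decay from 2e-4 toward 0 over 10 epochs
```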
## Usage
### Option 1: Merge LoRA into the base model
```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load base model
model = AutoModel.from_pretrained(
    "gomyk/jina-v5-h256-distilled-conv", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "gomyk/jina-v5-h256-distilled-conv", trust_remote_code=True)

# Download and merge LoRA
lora_path = hf_hub_download(
    "gomyk/jina-v5-h256-lora-classification-universal",
    "lora_adapter.pt")
lora_state = torch.load(lora_path, map_location="cpu", weights_only=True)

idx = 0
for name, module in model.named_modules():
    for target in ["q_proj", "k_proj", "v_proj", "o_proj"]:
        child = getattr(module, target, None)
        if child is not None and isinstance(child, nn.Linear):
            A = lora_state[f"lora_{idx}_A"]
            B = lora_state[f"lora_{idx}_B"]
            scaling = lora_state[f"lora_{idx}_scaling"].item()
            child.weight.data += (scaling * (A @ B)).T
            idx += 1

# Now use as a normal embedding model
model.eval()
inputs = tokenizer("This movie was great!", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
hidden = outputs.last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)
print(embedding.shape)  # [1, 256]
```
### Option 2: Use with a downstream classifier
```python
# After merging LoRA (see above), add your own classifier
from sklearn.linear_model import LogisticRegression

# Encode your training data with the merged model
train_embeddings = [...]  # one 256-d embedding per training text
train_labels = [...]

clf = LogisticRegression(max_iter=1000)
clf.fit(train_embeddings, train_labels)

# Predict (scikit-learn expects a 2D array: one row per embedding)
test_embeddings = [...]  # encode your test texts
predictions = clf.predict(test_embeddings)
```
## File Structure
```
.
├── README.md          # This file
├── lora_adapter.pt    # LoRA A/B matrices (405KB)
├── meta.json          # Training metadata
└── task_info.json     # Per-task class counts
```
## How LoRA Works
For each of the 24 attention projections (4 per layer × 6 layers):

```
Original:   y = W @ x                          W: [256, 256], frozen
With LoRA:  y = W @ x + (16/8) * (x @ A) @ B   A: [256, 8], learned down-projection
                                               B: [8, 256], learned up-projection
```

At merge time, `W_new = W + 2.0 * (A @ B)^T`, so the adapter adds zero overhead at inference.
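A quick numeric check of the merge identity, using random weights rather than the real adapter:

```python
# Verify that applying LoRA on the fly equals folding 2.0 * (A @ B)^T into
# the frozen weight once. Random tensors stand in for the trained matrices.
import torch

torch.manual_seed(0)
d, r, scaling = 256, 8, 16 / 8   # hidden size, rank, alpha/rank

W = torch.randn(d, d)            # frozen projection weight, y = W @ x
A = torch.randn(d, r) * 0.01     # down-projection
B = torch.randn(r, d) * 0.01     # up-projection
x = torch.randn(d)

y_lora = W @ x + scaling * ((x @ A) @ B)   # LoRA applied at runtime
W_merged = W + scaling * (A @ B).T         # LoRA folded into the weight
y_merged = W_merged @ x

print(torch.allclose(y_lora, y_merged, atol=1e-3))  # True
```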
## License
Apache 2.0
## Model Lineage

EuroBERT/EuroBERT-210m → jinaai/jina-embeddings-v5-text-nano → gomyk/jina-v5-h256-distilled-conv → this adapter