Jina v5 H256: Universal Classification LoRA

A single LoRA adapter (405KB) that improves classification performance across all MTEB Classification tasks, built on top of the compressed gomyk/jina-v5-h256-distilled-conv embedding model.

Unlike per-task adapters, this is one universal adapter that enhances the embedding space for classification in general; no task-specific fine-tuning is needed at inference time.

LoRA Specification

| Property | Value |
|---|---|
| Base model | gomyk/jina-v5-h256-distilled-conv |
| Base architecture | EuroBERT (6 layers / 256 dim / 41,778 vocab / 16.9M params) |
| LoRA rank | 8 |
| LoRA alpha | 16 |
| LoRA scaling (alpha/rank) | 2.0 |
| Target modules | q_proj, k_proj, v_proj, o_proj |
| LoRA matrices | 6 layers x 4 projections = 24 |
| LoRA A shape | [256, 8] per projection |
| LoRA B shape | [8, 256] per projection |
| Total LoRA params | 98,304 (0.58% of base model) |
| Adapter file size | 405 KB |
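The parameter count follows directly from the shapes above; a quick sanity check:

```python
layers, projections = 6, 4        # EuroBERT layers x (q, k, v, o)
d_model, rank = 256, 8

params_per_projection = d_model * rank + rank * d_model   # A + B
total = layers * projections * params_per_projection
print(total)                      # 98304
print(total * 4 / 1024)           # 384.0 KiB of raw fp32 weights; the
                                  # 405 KB file adds serialization metadata
```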

MTEB Classification Results

Average: 73.20% → 82.98% (+9.78 pp)

| Task | Classes | Baseline | + LoRA | Delta |
|---|---|---|---|---|
| AmazonCounterfactualClassification | 2 | 76.93% | 91.57% | +14.64 pp |
| Banking77Classification | 77 | 77.83% | 84.81% | +6.98 pp |
| ImdbClassification | 2 | 73.03% | 79.54% | +6.51 pp |
| MTOPDomainClassification | 11 | 90.63% | 96.18% | +5.55 pp |
| MassiveIntentClassification | 60 | 67.90% | 78.08% | +10.18 pp |
| MassiveScenarioClassification | 18 | 72.97% | 85.68% | +12.71 pp |
| ToxicConversationsClassification | 2 | 61.83% | 77.99% | +16.16 pp |
| TweetSentimentExtractionClassification | 3 | 64.46% | 70.01% | +5.55 pp |
| **Average** | | **73.20%** | **82.98%** | **+9.78 pp** |

- Baseline: MTEB's default evaluation (logistic regression on frozen embeddings)
- \+ LoRA: the same evaluation, with the universal LoRA adapter merged into the embedding model
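The evaluation protocol amounts to fitting a linear probe on frozen embeddings. A sketch of that protocol, with random vectors standing in for real model outputs (the actual harness encodes the MTEB train/test splits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Random 256-dim "embeddings" and binary labels stand in for real data;
# only the protocol shape (frozen features + logistic regression) matters here.
rng = np.random.default_rng(0)
train_X, test_X = rng.normal(size=(200, 256)), rng.normal(size=(50, 256))
train_y, test_y = rng.integers(0, 2, 200), rng.integers(0, 2, 50)

clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
acc = accuracy_score(test_y, clf.predict(test_X))
print(f"accuracy: {acc:.2%}")
```

With real embeddings, the only difference between the two rows of the table above is whether the LoRA adapter was merged before encoding.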

Training Method

Multi-Task Classification with shared LoRA backbone:

```
                    ┌─ Head_Amazon(256→2)     ─→ CE loss
                    ├─ Head_Banking(256→77)   ─→ CE loss
Input → [EuroBERT   ├─ Head_IMDB(256→2)      ─→ CE loss
         + LoRA]    ├─ Head_MTOP(256→11)     ─→ CE loss
         → pool ───►├─ Head_Massive_I(256→60) ─→ CE loss
                    ├─ Head_Massive_S(256→18) ─→ CE loss
                    ├─ Head_Toxic(256→2)     ─→ CE loss
                    └─ Head_Tweet(256→3)     ─→ CE loss
```

All heads share the same LoRA-adapted backbone.
After training, the heads are discarded; only the LoRA adapter is kept.
The LoRA-enhanced embeddings are broadly better for classification, as the results above show.
| Parameter | Value |
|---|---|
| Training data | 112,716 samples from 8 MTEB Classification tasks |
| Total classes | 175 (across all tasks) |
| Optimizer | AdamW (lr=2e-4, weight_decay=0.01) |
| Scheduler | CosineAnnealingLR |
| Epochs | 10 |
| Batch size | 32 |
| Max sequence length | 128 |
| Gradient clipping | max_norm=1.0 |
| Loss function | Cross-entropy (per task head) |
| Final training loss | 0.189 |
| Final training accuracy | ~93% |
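The multi-task setup can be sketched as follows. This is illustrative only: a toy linear layer stands in for the pooled EuroBERT+LoRA encoder, and the batches are random, but the head wiring, shared backbone, summed cross-entropy losses, and gradient clipping match the recipe above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 256
task_classes = {"amazon": 2, "banking": 77, "imdb": 2, "mtop": 11,
                "massive_i": 60, "massive_s": 18, "toxic": 2, "tweet": 3}

backbone = nn.Linear(d, d)   # stand-in for the pooled EuroBERT+LoRA encoder
heads = nn.ModuleDict({t: nn.Linear(d, c) for t, c in task_classes.items()})
params = list(backbone.parameters()) + list(heads.parameters())
opt = torch.optim.AdamW(params, lr=2e-4, weight_decay=0.01)
ce = nn.CrossEntropyLoss()

for step in range(3):                       # a few toy steps
    opt.zero_grad()
    loss = torch.zeros(())
    for task, n_classes in task_classes.items():
        x = torch.randn(8, d)               # stands in for a pooled batch
        y = torch.randint(0, n_classes, (8,))
        loss = loss + ce(heads[task](backbone(x)), y)
    loss.backward()
    nn.utils.clip_grad_norm_(params, max_norm=1.0)
    opt.step()

# After training, the heads would be discarded; only the backbone
# (in the real setup, just its LoRA matrices) is kept.
```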

Usage

Option 1: Merge LoRA into base model

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download

# Load the base model
model = AutoModel.from_pretrained(
    "gomyk/jina-v5-h256-distilled-conv", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(
    "gomyk/jina-v5-h256-distilled-conv", trust_remote_code=True)

# Download the LoRA adapter
lora_path = hf_hub_download(
    "gomyk/jina-v5-h256-lora-classification-universal",
    "lora_adapter.pt")
lora_state = torch.load(lora_path, map_location="cpu", weights_only=True)

# Merge: walk the attention projections in named_modules() order, which
# must match the order in which the adapter matrices were saved.
idx = 0
for name, module in model.named_modules():
    for target in ["q_proj", "k_proj", "v_proj", "o_proj"]:
        child = getattr(module, target, None)
        if child is not None and isinstance(child, nn.Linear):
            A = lora_state[f"lora_{idx}_A"]           # [256, 8]
            B = lora_state[f"lora_{idx}_B"]           # [8, 256]
            scaling = lora_state[f"lora_{idx}_scaling"].item()
            # W_new = W + scaling * (A @ B)^T  (nn.Linear stores [out, in])
            child.weight.data += (scaling * (A @ B)).T
            idx += 1
assert idx == 24, f"expected 24 LoRA matrices, merged {idx}"

# Now use as a normal embedding model with mean pooling
model.eval()
inputs = tokenizer("This movie was great!", return_tensors="pt",
                   truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    hidden = outputs.last_hidden_state
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    embedding = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

print(embedding.shape)  # torch.Size([1, 256])
```

Option 2: Use with downstream classifier

```python
# After merging the LoRA adapter (see above), add your own classifier
from sklearn.linear_model import LogisticRegression

# Encode your training data with the merged model
train_embeddings = [...]  # one 256-dim vector per training text
train_labels = [...]

clf = LogisticRegression(max_iter=1000)
clf.fit(train_embeddings, train_labels)

# Predict (note: predict expects a 2-D array, i.e. a batch of embeddings)
test_embedding = ...  # 256-dim vector for the test text
prediction = clf.predict([test_embedding])
```

File Structure

```
.
├── README.md              # This file
├── lora_adapter.pt        # LoRA A/B matrices (405 KB)
├── meta.json              # Training metadata
└── task_info.json         # Per-task class counts
```

How LoRA Works

For each of the 24 attention projections (4 per layer x 6 layers):

```
Original:    y = x @ W^T                          W: [256, 256], frozen
With LoRA:   y = x @ W^T + (16/8) * (x @ A) @ B

A: [256, 8]  -- learned down-projection
B: [8, 256]  -- learned up-projection

At merge time: W_new = W + 2.0 * (A @ B)^T   (zero extra cost at inference)
```
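Merging is exact, not approximate: folding the adapter into W gives the same outputs as applying it at runtime. A small numerical check with random matrices (shapes as above; the initialization scale is illustrative):

```python
import torch

torch.manual_seed(0)
d, r, scaling = 256, 8, 2.0
W = torch.randn(d, d)            # frozen nn.Linear weight, [out, in]
A = torch.randn(d, r) * 0.01     # down-projection
B = torch.randn(r, d) * 0.01     # up-projection
x = torch.randn(4, d)

y_runtime = x @ W.T + scaling * (x @ A) @ B     # adapter applied on the fly
y_merged = x @ (W + scaling * (A @ B).T).T      # adapter folded into W

assert torch.allclose(y_runtime, y_merged, atol=1e-4)
```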

License

Apache 2.0
