---
language:
  - ko
  - en
license: apache-2.0
tags:
  - merge
  - slerp
  - best
  - instruction-tuned
  - alignment
  - korean
  - llm
pipeline_tag: text-generation
---

# EVAFRILL-Mo 3B — SLERP Merge (Recommended)

Spherical linear interpolation (SLERP) merge of SFT v2 and DPO R2. This is the recommended variant for general use.

## Training Stage

Model merge — SLERP interpolation between SFT v2 (50%) and DPO R2 (50%). No additional training was performed; this is a post-hoc weight interpolation.

## Key Details

- **Merge method:** SLERP (spherical linear interpolation)
- **Sources:** SFT v2 (50%) + DPO R2 (50%)
- **Inference:** `temp=0.7`, `repetition_penalty=1.2` recommended
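The recommended settings above correspond to a CTRL-style repetition penalty applied before temperature sampling. As a minimal sketch (the function name is illustrative, not part of the EVAFRILL-Mo codebase; the full decoding loop appears under Usage):

```python
import torch

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Penalize tokens already generated: divide positive logits by
    `penalty`, multiply negative ones, so repeats become less likely."""
    out = logits.clone()
    for tok_id in set(generated_ids):
        if out[tok_id] > 0:
            out[tok_id] /= penalty
        else:
            out[tok_id] *= penalty
    return out

# Temperature sampling then uses softmax(penalized_logits / 0.7).
```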

## Metrics

| Metric          | Value                            |
|-----------------|----------------------------------|
| Repetition rate | 74.5% (lowest among all variants) |
| HellaSwag       | 34.6%                            |
| ARC-Easy        | 32.0%                            |

## Why SLERP

SLERP merging interpolates weights along the unit sphere, better preserving the learned representations from both checkpoints compared to naive linear averaging. The 50/50 split between SFT v2 and DPO R2 achieves the best trade-off between instruction-following quality and repetition reduction across all evaluated variants.
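As a sketch of the idea (not the exact merge script used for this checkpoint), SLERP between two weight tensors interpolates along the arc connecting their directions, falling back to linear interpolation when they are nearly parallel:

```python
import torch

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two same-shaped weight tensors."""
    a = w_a.flatten().float()
    b = w_b.flatten().float()
    # Angle between the two weight directions
    a_n = a / (a.norm() + eps)
    b_n = b / (b.norm() + eps)
    omega = torch.arccos(torch.clamp(a_n @ b_n, -1.0, 1.0))
    so = torch.sin(omega)
    if so.abs() < eps:
        # Nearly parallel checkpoints: plain lerp is numerically safer
        out = (1 - t) * a + t * b
    else:
        out = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return out.reshape(w_a.shape)
```

A 50/50 merge applies `slerp(sft_tensor, dpo_tensor, t=0.5)` to each matching parameter of the two checkpoints.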

## Main Model Card

See the main README for full project details, architecture, and training history.

## Usage

**Note:** This is a custom Mamba-2 hybrid architecture — `AutoModelForCausalLM` is not supported. Use direct safetensors loading with the EVAFRILL-Mo source code.

```bash
# Prerequisites
git clone https://github.com/pathcosmos/EVAFRILL-Mo
pip install torch safetensors tokenizers PyYAML
```

```python
import json
import torch
from model.config import LMConfig
from model.transformer import LLM
from tokenizers import Tokenizer
from safetensors.torch import load_file as load_safetensors

CKPT = "path/to/slerp"  # this directory

# Load the config, dropping HF metadata keys that LMConfig does not accept
with open(f"{CKPT}/config.json") as f:
    data = json.load(f)
for k in ("model_type", "architectures", "_variant", "_description"):
    data.pop(k, None)
cfg = LMConfig(**data)
cfg.use_flash_attn = False

# Build the model and load the merged weights
model = LLM(cfg)
state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
model.load_state_dict(state, strict=False)
model = model.to(device="cuda:0", dtype=torch.bfloat16).eval()

tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")
prompt = "<|user|>\n질문을 여기에 입력하세요\n<|assistant|>\n"  # "Enter your question here"
ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

with torch.no_grad():
    for _ in range(512):
        logits, _ = model(ids)
        logits = logits[:, -1, :].float()
        # Repetition penalty (1.2): shrink positive logits and amplify
        # negative ones for tokens already generated
        for prev_id in set(ids[0].tolist()):
            if logits[0, prev_id] > 0:
                logits[0, prev_id] /= 1.2
            else:
                logits[0, prev_id] *= 1.2
        # Temperature sampling at 0.7
        probs = torch.softmax(logits / 0.7, dim=-1)
        next_id = torch.multinomial(probs, 1)
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == tok.token_to_id("</s>"):  # stop at EOS
            break

print(tok.decode(ids[0].tolist()))
```

Alternatively, use the wrapped runner from `frankenstallm_test`:

```python
from eval_framework.evafrill_runner import generate

result = generate("한국어로 인사해주세요.")  # "Please greet me in Korean."
print(result["response"])
```