E-Star-12B-v2-Base

⚠️ Non-Commercial Use Only. This model is distributed under the CC BY-NC 4.0 license. Commercial use (integration into products or services, paid API offerings, deployment in internal operational systems, etc.) is not permitted. Commercial use requires a separate license agreement, arranged through [Selectstar official channel].

  • Owner: Selectstar Eval team
  • Date written: 2026-05-22
  • Status: active

1. Model Description

Architecture / Parameters

| Item | Details |
| --- | --- |
| Base model | Gemma-3-12B-IT |
| Parameter count | 12B |
| Training method | Full Fine-Tuning (SFT) |
| Output structure | feedback → highlight → decision |

Version Information

| Version | Description |
| --- | --- |
| v0.1 | Initial Base version. Trained on 6,311 examples obtained by three-stage filtering of K2-Feedback |

Purpose (Use Cases)

An SLM-based evaluator that reliably performs rubric-based evaluation in Korean. It is specialized for assessing the quality of RAG pipelines in the finance and legal domains, and supports the following evaluation axes.

  • Faithfulness: judges whether the response is grounded in the provided documents (hallucination diagnosis)
  • Context Relevancy: judges whether the retrieved documents are relevant to the query (retrieval quality)
  • Response Relevancy: judges whether the response appropriately addresses the query (overall response adequacy)

Potential Applications

  • Can be extended to RAG evaluation in domains beyond finance and law (requires domain-specific rubric design)
  • General-purpose, rubric-based quality evaluation of LLM outputs (rubric-following ability validated on Ko Feedback Bench)
  • A cost-efficient alternative to frontier models when automating evaluation pipelines

2. How to Run the Model

Training Code Snippet

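from trl import SFTTrainer, SFTConfig
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-3-12b-it"
# GemmaForCausalLM is the first-generation Gemma class and does not match a
# Gemma 3 checkpoint; the Auto class resolves the architecture from the config.
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)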

sft_config = SFTConfig(
    output_dir="./eval-estar-base-v0.1",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    learning_rate=1e-5,
    num_train_epochs=5,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,          # early-stopping criterion: validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    bf16=True,
    logging_steps=10,
)

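# train_dataset and eval_dataset are assumed to be prepared beforehand
# (e.g. splits derived from datumo/E-Star-Train-6K).
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    args=sft_config,
)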

trainer.train()

※ In the actual training run, early stopping was applied at epoch 2 based on validation loss.

Inference Code Snippet

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "datumo/E-Star-12B-v2-Base"
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_path)

# ── System Prompt ──
system_prompt = """You are a rubric evaluator.
Your task is to evaluate a response strictly and only according to the provided pass criteria and scoring rubric. 
In your output, return the final evaluation (the three output tags: <feedback>, <highlight>, and <decision>).

# Evaluation Procedure (must follow all steps):

1. First, carefully read the Data to Evaluate, the pass criteria, and the scoring rubric to fully understand the requirements.
2. Evaluate the response only against the given criteria: do not introduce external standards, do not reward style unless the rubric explicitly allows it, and judge by absolute rubric definitions rather than relative comparisons.
3. Re-check fine-grained details in the response and the rubric, ensuring any tags (if present) are correctly mapped to the pass criteria and that small deviations are not overlooked.
4. Write criterion-focused feedback that explicitly references the rubric, quoting exact words or phrases from the response when they are decisive, and clearly stating which criteria are satisfied and which are violated.
5. Finally, extract the key verbatim spans that most influenced your judgment and assign the final score according to the scoring rubric.
"""

# ── User Prompt μ˜ˆμ‹œ (Reasoning / Problem Solving) ──
user_prompt = """You MUST write ALL output (<feedback>, <highlight>, <decision>) in the SAME language as the input question and response being evaluated. If the input is in Korean, your entire output MUST be in Korean.
# Output Format:
<feedback>
Write detailed feedback (reasons) that strictly evaluates the quality of the response using only the given scoring rubric. Do not explicitly state the score in a sentence (e.g., "Therefore, the score is …").
</feedback>
<highlight>
List of words or phrases that you believe are the most important in determining the score.
</highlight>
<decision>
Provide the final integer score assigned based on the scoring rubric.
</decision>

# Data to Evaluate

### Problem
한 공장에서 하루에 120개의 제품을 생산한다. 불량률이 5%일 때, 일주일(7일) 동안 생산되는 정상 제품의 수는?

### Model Response
하루 생산량: 120개
불량률: 5% → 불량품: 120 × 0.05 = 6개
하루 정상 제품: 120 - 6 = 114개
일주일 정상 제품: 114 × 7 = 798개

### Optional Ground Truth
798개

# Rubric
Evaluate whether the model correctly solves the problem and provides reasoning that is logically consistent with the final answer. Prioritize correctness of the conclusion, then soundness of the reasoning.

Score 1: The final answer is wrong and the reasoning is invalid, irrelevant, or missing.
Score 2: The response shows limited progress but contains major reasoning flaws leading to an incorrect or unreliable answer.
Score 3: The response demonstrates partial reasoning ability but is incomplete, contains mistakes, or reaches an uncertain result.
Score 4: The response is mostly correct with generally sound reasoning, though minor errors or gaps may remain.
Score 5: The response reaches the correct answer through clear, consistent, and logically valid reasoning appropriate to the problem."""

# ── Inference ──
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=2048,
    temperature=0.0,
    do_sample=False,
)

response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
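
Downstream code usually needs the three tagged sections as separate fields. The following is a minimal sketch of a parser for the `<feedback>`, `<highlight>`, and `<decision>` tags requested in the prompt above. The function name `parse_evaluation` and the sample output string are hypothetical and only illustrate the expected shape; real model output may deviate from the format and should be checked.

```python
import re

TAGS = ("feedback", "highlight", "decision")


def parse_evaluation(text: str) -> dict:
    """Extract the three tagged sections and an integer score (hypothetical helper)."""
    parsed = {}
    for tag in TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None

    # The decision is expected to hold an integer score; keep None if unreadable.
    score = None
    if parsed["decision"] is not None:
        digits = re.search(r"-?\d+", parsed["decision"])
        score = int(digits.group()) if digits else None
    parsed["score"] = score
    return parsed


# Invented sample output, used only to exercise the parser.
sample_output = """<feedback>
The response reaches 798 through consistent steps.
</feedback>
<highlight>
114 × 7 = 798
</highlight>
<decision>
5
</decision>"""

result = parse_evaluation(sample_output)
print(result["score"])
```

Returning `None` for a missing tag, rather than raising, lets a batch evaluation job count malformed outputs instead of stopping on the first one.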

3. Training Dataset

HuggingFace Dataset

🤗 datumo/E-Star-Train-6K

| Item | Details |
| --- | --- |
| Seed data | K2-Feedback (HAERAEHUB, 2024), 99.7K examples |
| Final training data | 6,311 examples (after three-stage filtering) |

Filtering Pipeline Summary

| Stage | Size change | Method |
| --- | --- | --- |
| Stage 1 | 99.7K → 26K | Initial agreement between Qwen3-30B-A3B and Qwen3-Next-80B-A3B |
| Stage 2 | 26K → 8K | Balancing of agreement/disagreement cases relative to the Gemma base model |
| Stage 3 | 8K → 6K | GPT-5.2 single evaluation + cross-validation via small frontier-model debate |
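
The Stage 1 step keeps only samples on which two judge models agree. The sketch below shows one way such an agreement filter could look. The card does not state the actual criterion, so exact score equality (`max_gap=0`), the `JudgedSample` type, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    sample_id: str
    score_a: int  # score from the first judge model (hypothetical field)
    score_b: int  # score from the second judge model (hypothetical field)


def keep_agreed(samples, max_gap=0):
    """Keep samples whose two judge scores differ by at most max_gap."""
    return [s for s in samples if abs(s.score_a - s.score_b) <= max_gap]


# Invented scores, used only to show the filter's behavior.
samples = [
    JudgedSample("a", 4, 4),
    JudgedSample("b", 2, 5),
    JudgedSample("c", 3, 3),
]
agreed = keep_agreed(samples)
print([s.sample_id for s in agreed])  # ['a', 'c']
```

A looser threshold such as `max_gap=1` would retain more data at the cost of noisier labels; which trade-off the authors chose is not documented here.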

Evaluation Benchmarks


4. Training Configuration

Key Training Parameters

# SFT Config
learning_rate: 1e-5
num_train_epochs: 5 (early stopping at epoch 2)
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
eval_strategy: epoch
metric_for_best_model: eval_loss
load_best_model_at_end: true

# Fine-tuning
method: Full Fine-Tuning (no LoRA)
framework: TRL SFTTrainer (von Werra et al., 2020)

GPUs Used / Training Time

| Item | Details |
| --- | --- |
| GPU | 4 |
| Training time | 1.2 hr |
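
The settings above imply an effective batch size, which the short calculation below works out. It assumes that "GPU 4" means four GPUs used in data-parallel fashion and that all 6,311 filtered examples go into the training split. Neither assumption is confirmed by the card, and the exact step count also depends on the validation split and the trainer's rounding.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 4         # assumption: "GPU 4" is a device count
num_examples = 6311  # assumption: every filtered example is used for training

# Sequences contributing to one optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)

# Ceiling division: optimizer steps needed to see every example once.
steps_per_epoch = -(-num_examples // effective_batch_size)

print(effective_batch_size)  # 32
print(steps_per_epoch)       # 198
```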

5. Evaluation Results

5.1 Feedback Bench (English, Rubric Following)

| Type | Models | Pearson | Kendall τ | Spearman |
| --- | --- | --- | --- | --- |
| Frontier | GPT-5.2 | 0.916 | 0.865 | 0.911 |
| Frontier | Sonnet-4.6 | 0.840 | 0.776 | 0.847 |
| Instruct SLM | Gemma-3-12B-IT | 0.810 | 0.725 | 0.794 |
| Instruct SLM | oss-20b | 0.844 | 0.762 | 0.839 |
| Evaluator LM | Prometheus-8x7B-v2.0 | 0.823 | 0.736 | 0.806 |
| Evaluator LM | GLIDER 3.8B | 0.678 | 0.595 | 0.688 |
| Ours | E-Star-12B-Base | 0.856 | 0.778 | 0.847 |

5.2 Ko Feedback Bench (Korean, Rubric Following)

| Type | Models | Pearson | Kendall τ | Spearman |
| --- | --- | --- | --- | --- |
| Frontier | GPT-5.2 | 0.929 | 0.886 | 0.925 |
| Frontier | Sonnet-4.6 | 0.820 | 0.758 | 0.833 |
| Instruct SLM | Gemma-3-12B-IT | 0.653 | 0.593 | 0.661 |
| Instruct SLM | oss-20b | 0.778 | 0.704 | 0.779 |
| Evaluator LM | Prometheus-8x7B-v2.0 | 0.377 | 0.441 | 0.501 |
| Evaluator LM | GLIDER 3.8B | 0.523 | 0.487 | 0.563 |
| Ours | E-Star-12B-Base | 0.826 | 0.754 | 0.819 |
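
The Pearson, Kendall τ, and Spearman columns in 5.1 and 5.2 measure how closely an evaluator's scores track the reference scores. As a rough illustration, the dependency-free sketch below computes all three for two short score lists. The tau-b variant of Kendall's coefficient is an assumption, since the card does not say which variant or library was used, and the score lists are invented.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Linear correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which corrects the denominator for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    pairs = concordant + discordant
    return (concordant - discordant) / math.sqrt((pairs + ties_x) * (pairs + ties_y))


# Invented scores on a 1-5 rubric: one adjacent pair is swapped.
reference = [1, 2, 3, 4, 5]
predicted = [1, 3, 2, 4, 5]
print(round(pearson(reference, predicted), 3))        # 0.9
print(round(spearman(reference, predicted), 3))       # 0.9
print(round(kendall_tau_b(reference, predicted), 3))  # 0.8
```

All three functions divide by zero when one list is constant, so a production version would need to guard that case.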

5.3 RAG Quality Bench (Finance and Legal, Domain Adaptation)

| Models | LAW(CR) | LAW(FF) | LAW(RR) | FIN(CR) | FIN(FF) | FIN(RR) | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-5.2 | 0.846 | 0.785 | 0.941 | 0.882 | 0.740 | 0.970 | 0.861 |
| Sonnet-4.6 | 0.910 | 0.786 | 0.872 | 0.932 | 0.845 | 0.925 | 0.878 |
| Gemma-3-12B-IT | 0.620 | 0.742 | 0.742 | 0.830 | 0.713 | 0.821 | 0.745 |
| oss-20b | 0.846 | 0.722 | 0.870 | 0.793 | 0.752 | 0.900 | 0.813 |
| Prometheus-8x7B-v2.0 | 0.392 | 0.477 | 0.772 | 0.386 | 0.240 | 0.806 | 0.512 |
| GLIDER 3.8B | 0.657 | 0.670 | 0.680 | 0.432 | 0.415 | 0.548 | 0.567 |
| E-Star-12B-Base | 0.853 | 0.730 | 0.816 | 0.835 | 0.720 | 0.880 | 0.806 |

CR = Context Relevancy, FF = Faithfulness, RR = Response Relevancy
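
The Average column appears to be the unweighted mean of the six per-axis scores. This is inferred from the reported numbers rather than stated in the card. A quick check on the E-Star-12B-Base row:

```python
from statistics import mean

# E-Star-12B-Base row: LAW(CR), LAW(FF), LAW(RR), FIN(CR), FIN(FF), FIN(RR)
e_star_scores = [0.853, 0.730, 0.816, 0.835, 0.720, 0.880]

average = mean(e_star_scores)
print(round(average, 3))  # 0.806
```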


6. Limitations

  • Ko Feedback Bench was built through machine translation, so translation noise may affect the measured evaluation performance
  • The training data and the benchmark labels were built with the same debate-based procedure, so the results may reflect alignment with the consensus-based labeling standard rather than absolute evaluation quality (no human evaluation by domain experts was conducted)
  • The model was trained and evaluated in a reference-free setting, so performance in settings that include a reference requires separate validation
  • As a 12B-scale SLM, the model may be limited in interpreting complex rubrics compared to frontier models
  • How performance changes as the number of input documents grows in RAG evaluation has not been verified

7. License

CC BY-NC 4.0: Non-Commercial Use Only

This model is distributed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

✅ Permitted

  • Use in academic research and paper writing
  • Use for non-profit educational purposes
  • Use for personal learning and experimentation
  • Modification and redistribution under the conditions above (attribution to the original source and application of the same license are required)

❌ Prohibited (violations of the non-commercial clause)

  • Integrating the model into paid products or services, or offering it as an API
  • Applying the model, directly or indirectly, to internal operational systems, customer-facing services, or revenue-generating pipelines
  • Redistributing or selling the model weights for commercial purposes
  • Indirect commercial use, such as generating fine-tuning data for training commercial models

    Commercial inquiries: if you need a commercial license, please contact us through Selectstar's official channels.

Licenses by Component

| Component | License |
| --- | --- |
| Base model (Gemma-3-12B-IT) | Gemma Terms of Use (Google) |
| Training data (K2-Feedback) | Subject to the license policy of K2-Feedback (HAERAEHUB, 2024) |
| Fine-tuned weights (this model) | CC BY-NC 4.0 |

Important: you must also comply with the terms of use of Gemma, the base model. Any use separately prohibited by the Gemma Terms of Use is equally prohibited for this model.
