Chat evaluation mid
timestamp: 2025-10-15 13:24:50
- source: mid
- task_name: None
- dtype: bfloat16
- temperature: 0.0000
- max_new_tokens: 512
- num_samples: 1
- top_k: 50
- batch_size: 8
- model_tag: None
- step: None
- max_problems: None
- ARC-Easy: 0.3906
- ARC-Challenge: 0.2739
- MMLU: 0.3094
- GSM8K: 0.0273
- HumanEval: 0.0671
- ChatCORE metric: 0.0786
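
The ChatCORE value above is consistent with a mean of baseline-centered accuracies over the five tasks. The sketch below shows that arithmetic under stated assumptions: the chance-level baselines (0.25 for the multiple-choice tasks, 0.0 for the generative tasks) and the helper names `centered` and `chatcore` are illustrative and do not appear in this report.

```python
# Accuracies copied from the report above.
accuracies = {
    "ARC-Easy": 0.3906,
    "ARC-Challenge": 0.2739,
    "MMLU": 0.3094,
    "GSM8K": 0.0273,
    "HumanEval": 0.0671,
}

# ASSUMPTION: chance-level baselines, not stated in the report.
# Four-way multiple choice -> 0.25; open-ended generation -> 0.0.
baselines = {
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "MMLU": 0.25,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
}


def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score to 1."""
    return (acc - baseline) / (1.0 - baseline)


def chatcore(accs: dict, bases: dict) -> float:
    """Mean of centered accuracies across all tasks (hypothetical helper)."""
    values = [centered(accs[name], bases[name]) for name in accs]
    return sum(values) / len(values)


score = chatcore(accuracies, baselines)
print(f"ChatCORE (reconstructed): {score:.4f}")  # prints 0.0786
```

Under these assumptions the reconstructed value matches the reported 0.0786. Most of the score comes from ARC-Easy, which sits furthest above chance, while ARC-Challenge contributes the least once the 0.25 baseline is subtracted.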