---
title: GPQA Leaderboard Handling
updated: 2026-06-16
---
# GPQA Leaderboard Handling
This document summarizes the principles behind the GPQA Diamond leaderboard shown in `README.md` and the conditions for an actual Hugging Face leaderboard submission.
This repository currently exists for parody and upload practice.
## Recent changes (2026-06-16)
- Removed the previous fictional score (91.6) and honestly marked the score of our model `gwangju no1 llm` as **0 (not evaluated)**.
- Comparison rows are built from **sourced, measured GPQA Diamond scores** (public aggregator snapshot, 2026-06), and each row carries a source link.
- The README states that figures may differ between aggregators (AI Stats / Artificial Analysis).
- **(Explicit owner request) Added `.eval_results/gpqa.yaml`.** Following the HF eval-results spec, it is submitted with `dataset.id: Idavidrein/gpqa`, `task_id: gpqa_diamond`, and `value: 0.01` (an owner-specified placeholder, not a measured value). The fact that no evaluation was performed is stated in `notes`.
- Caution: `Idavidrein/gpqa` is an actually registered Benchmark, so this file is **aggregated into and submitted to the public GPQA Diamond leaderboard**. This conflicts with the existing policy of "do not create values without a measurement", but the owner acknowledged the warning and decided to proceed.
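
To make the shape of that entry concrete, here is a minimal Python sketch that assembles it from the field names quoted above. The nesting (placing `task_id` under `dataset`) and the helper name `build_placeholder_entry` are assumptions for illustration, not confirmed by this document; check the current Hugging Face eval-results documentation before relying on it. The entry is printed as JSON only to avoid a YAML dependency.

```python
import json


def build_placeholder_entry(value: float, note: str) -> dict:
    """Assemble one eval-results entry from the field names quoted above.

    Assumption: task_id is nested under dataset. The source only lists the
    keys, so the exact layout may differ from the real spec.
    """
    return {
        "dataset": {"id": "Idavidrein/gpqa", "task_id": "gpqa_diamond"},
        "value": value,
        "notes": note,
    }


# 0.01 is the owner-specified placeholder from the changelog, not a measurement.
entry = build_placeholder_entry(
    0.01, "Owner-specified placeholder; no evaluation was performed."
)
print(json.dumps([entry], indent=2))
```

Keeping the "not measured" statement in `notes` next to the value means anyone who reads the raw file sees the caveat together with the number.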
## Task specification
- What: Show the GPQA Diamond leaderboard in the README, with our model at 0 points (not evaluated) and other models filled in with measured, sourced values.
- Why: To satisfy the user's request ("show the leaderboard + our score at 0") without polluting the Hugging Face leaderboard with a false measured score.
- How: Do not create `model-index` or `.eval_results/`; attach a source to each measured row of the README table. State our model's score as 0.
- Done when: From the README alone, a reader can tell that (1) our model is unevaluated at 0 points and (2) the other models carry sourced, measured values.
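
The "How" and "Done when" items can be sketched as a small table renderer that refuses a measured row without a source and labels our model as not evaluated. This is only an illustration: the `Row` type, the `render_table` helper, the column names, and the comparison row (`example-model-a`, its score, and its URL) are hypothetical placeholders, not data from the README.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    model: str
    score: float
    source_url: Optional[str] = None  # required when evaluated is True
    evaluated: bool = True


def render_table(rows: List[Row]) -> str:
    """Render a Markdown table; measured rows must carry a source link."""
    lines = ["| Model | GPQA Diamond | Source |", "| --- | --- | --- |"]
    for row in rows:
        if row.evaluated:
            if not row.source_url:
                raise ValueError(f"measured row {row.model!r} has no source")
            score, source = f"{row.score:.1f}", f"[link]({row.source_url})"
        else:
            # Unevaluated models are shown as 0 with an explicit label.
            score, source = "0 (not evaluated)", "n/a"
        lines.append(f"| {row.model} | {score} | {source} |")
    return "\n".join(lines)


rows = [
    # Hypothetical comparison row; the score and URL are placeholders.
    Row("example-model-a", 12.3, "https://example.com/source-a"),
    Row("gwangju no1 llm", 0.0, evaluated=False),
]
table = render_table(rows)
print(table)
```

Raising on a missing source, rather than printing an empty cell, keeps an unsourced number from reaching the README unnoticed.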
## Conditions before an actual submission
An actual GPQA leaderboard submission proceeds only when all of the following conditions are met.
1. The model files and configuration under evaluation must be reproducible from the Hugging Face Hub.
2. There must be a GPQA evaluation run log, an evaluation date, and an evaluation script or a traceable source URL.
3. The score must be a measured result; the fictional benchmark numbers in the README must not be reused.
4. Immediately before submission, choose only one of `model-index` in `README.md` or `.eval_results/gpqa.yaml`, and add it according to the current Hugging Face documentation.
## References
- Hugging Face Model Cards: https://huggingface.co/docs/hub/model-cards
- GPQA Dataset: https://huggingface.co/datasets/Idavidrein/gpqa