Upload source/README.md with huggingface_hub
#30
by somebody-to-love - opened
- source/README.md +1849 -0
source/README.md
ADDED
|
@@ -0,0 +1,1849 @@
| 1 |
+
# FRANKENSTALLM
|
| 2 |
+
|
| 3 |
+

|
| 4 |
+

|
| 5 |
+

|
| 6 |
+

|
| 7 |
+

|
| 8 |
+

|
| 9 |
+
|
| 10 |
+
> **Building a Korean 3B LLM from scratch on 8× NVIDIA B200 GPUs.**
|
| 11 |
+
> Stitched together piece by piece like Frankenstein, and tempered as hard as steel.
|
| 12 |
+
|
| 13 |
+
GitHub: [`pathcosmos/FRANKENSTALLM`](https://github.com/pathcosmos/FRANKENSTALLM)
|
| 14 |
+
|
| 15 |
+
---
|
| 16 |
+
|
| 17 |
+
## Table of Contents
|
| 18 |
+
|
| 19 |
+
1. [์ ์ด ํ๋ก์ ํธ์ธ๊ฐ](#1-์-์ด-ํ๋ก์ ํธ์ธ๊ฐ)
|
| 20 |
+
2. [ํ์ฌ ์ํ โ ํ๋์ ๋ณด๊ธฐ](#2-ํ์ฌ-์ํ--ํ๋์-๋ณด๊ธฐ)
|
| 21 |
+
3. [ํ๋์จ์ด ํ๊ฒฝ](#3-ํ๋์จ์ด-ํ๊ฒฝ)
|
| 22 |
+
4. [ํ๋ก์ ํธ ๊ตฌ์กฐ](#4-ํ๋ก์ ํธ-๊ตฌ์กฐ)
|
| 23 |
+
5. [ํ๋ก์ ํธ ์ฌ์ ํ์๋ผ์ธ](#5-ํ๋ก์ ํธ-์ฌ์ -ํ์๋ผ์ธ)
|
| 24 |
+
6. [๋ชจ๋ธ ์ํคํ
์ฒ](#6-๋ชจ๋ธ-์ํคํ
์ฒ)
|
| 25 |
+
7. [ํ์ต ๋ฐ์ดํฐ](#7-ํ์ต-๋ฐ์ดํฐ)
|
| 26 |
+
8. [ํ์ต ์ค์ ๋ฐ ์ต์ ํ](#8-ํ์ต-์ค์ -๋ฐ-์ต์ ํ)
|
| 27 |
+
9. [์คํ ๊ฒฐ๊ณผ โ 1B ๋ฒ ์ด์ค๋ผ์ธ](#9-์คํ-๊ฒฐ๊ณผ--1b-๋ฒ ์ด์ค๋ผ์ธ)
|
| 28 |
+
10. [์คํ ๊ฒฐ๊ณผ โ 3B Base ์ข
ํฉ ํ๊ฐ (v2)](#10-์คํ-๊ฒฐ๊ณผ--3b-base-์ข
ํฉ-ํ๊ฐ-v2)
|
| 29 |
+
- [10.1 ํ์ต ์ปค๋ธ](#101-ํ์ต-์ปค๋ธ)
|
| 30 |
+
- [10.2 PPL (Perplexity) โ 19๊ฐ ๋ฐ์ดํฐ์
](#102-ppl-perplexity--19๊ฐ-๋ฐ์ดํฐ์
)
|
| 31 |
+
- [10.3 ํ๊ตญ์ด ๋ฒค์น๋งํฌ](#103-ํ๊ตญ์ด-๋ฒค์น๋งํฌ)
|
| 32 |
+
- [10.4 ์์ด ๋ฒค์น๋งํฌ](#104-์์ด-๋ฒค์น๋งํฌ)
|
| 33 |
+
- [10.5 Calibration](#105-calibration)
|
| 34 |
+
- [10.6 0-shot vs 5-shot ๋น๊ต](#106-0-shot-vs-5-shot-๋น๊ต)
|
| 35 |
+
- [10.7 ์ฐธ๊ณ ๋ชจ๋ธ ๋น๊ต](#107-์ฐธ๊ณ -๋ชจ๋ธ-๋น๊ต)
|
| 36 |
+
- [10.8 ์์ฑ ํ์ง ๋ฐ ํ๋ผ๋ฏธํฐ ๊ทธ๋ฆฌ๋ ์์น](#108-์์ฑ-ํ์ง-๋ฐ-ํ๋ผ๋ฏธํฐ-๊ทธ๋ฆฌ๋-์์น)
|
| 37 |
+
- [10.9 ํ๊ฐ ํ์ดํ๋ผ์ธ](#109-ํ๊ฐ-ํ์ดํ๋ผ์ธ)
|
| 38 |
+
11. [์คํ ๊ฒฐ๊ณผ โ 3B SFT ์ข
ํฉ ํ๊ฐ](#11-์คํ-๊ฒฐ๊ณผ--3b-sft-์ข
ํฉ-ํ๊ฐ)
|
| 39 |
+
- [11.1 SFT ํ์ต ๊ฒฐ๊ณผ](#111-sft-ํ์ต-๊ฒฐ๊ณผ)
|
| 40 |
+
- [11.2 6์ฐจ์ ํ๊ฐ ์์ฝ](#112-6์ฐจ์-ํ๊ฐ-์์ฝ)
|
| 41 |
+
- [11.3 Base vs SFT ๋น๊ต](#113-base-vs-sft-๋น๊ต)
|
| 42 |
+
- [11.4 ์ฝ๋ ๊ฐ์ ์ฌํญ](#114-์ฝ๋-๊ฐ์ -์ฌํญ)
|
| 43 |
+
- [11.5 ORPO ์งํ ํ์ ](#115-orpo-์งํ-ํ์ )
|
| 44 |
+
12. [Phase 3 โ ORPO (์ ํธ๋ ์ ๋ ฌ)](#12-phase-3--orpo-์ ํธ๋-์ ๋ ฌ)
|
| 45 |
+
- [12.1 ORPO ์ ํ ๋ฐฐ๊ฒฝ](#121-orpo-์ ํ-๋ฐฐ๊ฒฝ)
|
| 46 |
+
- [12.2 ๋ฐ์ดํฐ](#122-๋ฐ์ดํฐ)
|
| 47 |
+
- [12.3 HP Sweep ์ค๊ณ](#123-hp-sweep-์ค๊ณ-6-config)
|
| 48 |
+
- [12.4 ์๋ ์ด๋ ฅ](#124-์๋-์ด๋ ฅ--5๋ฒ์-์คํจ)
|
| 49 |
+
- [12.5 ์ค์ ๊ฒฐ๊ณผ](#125-์ค์-๊ฒฐ๊ณผ-์งํ-์ค)
|
| 50 |
+
- [12.7 ORPO ๋ณธ ํ์ต](#127-orpo-๋ณธ-ํ์ต-์งํ-์ค-2026-03-09)
|
| 51 |
+
- [12.8 ORPO ์ข
ํฉ ํ๊ฐ ํ์ดํ๋ผ์ธ](#128-orpo-์ข
ํฉ-ํ๊ฐ-ํ์ดํ๋ผ์ธ)
|
| 52 |
+
13. [์คํ ๋ฐฉ๋ฒ](#13-์คํ-๋ฐฉ๋ฒ)
|
| 53 |
+
14. [๋ก๋๋งต](#14-๋ก๋๋งต)
|
| 54 |
+
15. [์ฐธ๊ณ ๋ฌธ์](#15-์ฐธ๊ณ -๋ฌธ์)
|
| 55 |
+
16. [๊ธฐ์ ์คํ ์์ฝ](#16-๊ธฐ์ -์คํ-์์ฝ)
|
| 56 |
+
17. [๊ด๋ จ ํ๋ก์ ํธ](#๊ด๋ จ-ํ๋ก์ ํธ)
|
| 57 |
+
18. [๋ค์ ์ต์ ํ ๊ณํ](#18-๋ค์-์ต์ ํ-๊ณํ--mfu-335--47-๋ชฉํ)
|
| 58 |
+
19. [GPU ํ๋์จ์ด & ๋น์ฉ ๋ถ์](#19-gpu-ํ๋์จ์ด--๋น์ฉ-๋ถ์--3b--60b-ํ๋ฆฌํธ๋ ์ธ)
|
| 59 |
+
|
| 60 |
+
---
|
| 61 |
+
|
| 62 |
+
## 1. Why This Project
|
| 63 |
+
|
| 64 |
+
The Korean LLM ecosystem is growing fast. Yet most public models either layer Korean fine-tuning on top of English-centric pretraining, or cannot be reproduced because their training process is undisclosed.
|
| 65 |
+
|
| 66 |
+
This project is different.
|
| 67 |
+
|
| 68 |
+
- **From scratch**: every stage is implemented directly, from tokenizer training through pretraining, SFT, and preference alignment.
|
| 69 |
+
- **Fully open build log**: successes are not the only thing recorded. Bugs, failures, misjudgments, and their root-cause analyses are all written down.
|
| 70 |
+
- **Practical scale**: neither a toy model for academic papers (125M) nor a 70B that only a research lab could reproduce; the goal is a practical Korean model at the **3B scale**.
|
| 71 |
+
- **B200 optimization**: makes maximal use of the NVIDIA B200's FP8 Tensor Cores, NVLink 5.0, and FlashAttention-2. Squeezing the most out of the latest hardware is itself part of the learning.
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
This README is not the announcement of a finished result; it is **the log of a build that is still in progress**.
|
| 75 |
+
|
| 76 |
+
---
|
| 77 |
+
|
| 78 |
+
## 2. Current Status at a Glance
|
| 79 |
+
|
| 80 |
+
```
|
| 81 |
+
As of 2026-03-09
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
| Phase | Status | Details |
|
| 85 |
+
|------|------|-----------|
|
| 86 |
+
| Phase 0: Foundation | ✅ Done | OOM fix, GQA FA optimization, NCCL NVLS, pipeline setup |
|
| 87 |
+
| Phase 1: 3B Pretrain | ✅ Done | 57,000 steps, loss 1.466, ~63 hours |
|
| 88 |
+
| Phase 2: SFT | ✅ Done | 25,500 steps (early stop), val_loss 1.8851, ~15.5 hours |
|
| 89 |
+
| Phase 2.5: SFT evaluation | ✅ Done | 6-dimension evaluation, 4/6 PASS; decision to proceed to ORPO |
|
| 90 |
+
| Phase 3: ORPO Sweep | ✅ Done | 6-config sweep complete, best: lr=1.2e-5, beta=0.25 |
|
| 91 |
+
| **Phase 3: ORPO main training** | **In progress** | **630K pairs, 2 epochs, ~9,840 steps, ~4.8 hours** |
|
| 92 |
+
| Phase 4: Deployment | Pending | GGUF conversion → Ollama serving |
|
| 93 |
+
|
| 94 |
+
### Phase 2 (SFT) Final Results
|
| 95 |
+
|
| 96 |
+
| Item | Value |
|
| 97 |
+
|------|-----|
|
| 98 |
+
| Final step | **25,500 / 33,000** (77.3%, early stopping) |
|
| 99 |
+
| **Val loss (best)** | **1.8851** (step 23,000) |
|
| 100 |
+
| Training time | **~15 h 41 min** (2026-03-05 22:15 ~ 2026-03-06 13:56) |
|
| 101 |
+
| VRAM usage | **24.2GB** / 183GB per GPU (13.2%) |
|
| 102 |
+
| Base model | checkpoint-0057000 (pretrain loss 1.466) |
|
| 103 |
+
| SFT data | **2,439,397 samples** (24 sources, 7.48 GB) |
|
| 104 |
+
| Incidents | 0 (no OOM, NCCL, or NaN issues) |
|
| 105 |
+
|
| 106 |
+
**SFT val loss over the full run**:
|
| 107 |
+
```
|
| 108 |
+
Step 500: 2.073
|
| 109 |
+
Step 2,000: 1.956 (-0.117)
|
| 110 |
+
Step 5,000: 1.911 (-0.045)
|
| 111 |
+
Step 10,000: 1.892 (-0.019)
|
| 112 |
+
Step 15,000: 1.886 (-0.006)
|
| 113 |
+
Step 20,000: 1.885 (-0.001)
|
| 114 |
+
Step 23,000: 1.8851 ← BEST
|
| 115 |
+
Step 25,500: 1.8851 ← Early Stop (patience 5/5)
|
| 116 |
+
```
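The early stop recorded above is consistent with a patience counter checked at each evaluation: the best value at step 23,000 followed by five non-improving evaluations 500 steps apart lands on step 25,500. A minimal sketch of that logic follows. The `EarlyStopper` class, the 500-step evaluation interval, and the loss values other than 1.8851 are assumptions for illustration, not the project's trainer code.

```python
class EarlyStopper:
    """Patience-based early stopping on validation loss (illustrative only)."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_step = step
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def run_demo():
    """Replay hypothetical evaluations around the best checkpoint."""
    stopper = EarlyStopper(patience=5)
    # Assumed values: only 1.8851 at step 23,000 comes from the log above.
    history = [(22500, 1.8853), (23000, 1.8851)]
    history += [(23000 + 500 * i, 1.8852) for i in range(1, 8)]
    for step, loss in history:
        if stopper.update(step, loss):
            return step, stopper.best_step
    return None, stopper.best_step
```

Under these assumed values, `run_demo()` stops at the fifth non-improving evaluation after step 23,000.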
|
| 117 |
+
|
| 118 |
+
### SFT 6-Dimension Evaluation Summary
|
| 119 |
+
|
| 120 |
+
| Dimension | Result | Key metric |
|
| 121 |
+
|------|------|-----------|
|
| 122 |
+
| Perplexity (knowledge retention) | **PASS** | forgetting 0.9% |
|
| 123 |
+
| Generation quality | **FAIL** | Greedy repetition rate 72.97% |
|
| 124 |
+
| Korean benchmarks | **FAIL** | KoBEST average 43.26% |
|
| 125 |
+
| English benchmarks | **PASS** | All tasks above the lower bound |
|
| 126 |
+
| Calibration | **PASS** | Top-1 68.59% |
|
| 127 |
+
| SFT chat ability | **PASS** | EOS termination rate 60% (Base 0%) |
|
| 128 |
+
|
| 129 |
+
> **Verdict: proceed to ORPO.** Knowledge retention is excellent (0.9% forgetting); the repetition rate is to be resolved through preference alignment.
|
| 130 |
+
> Details: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`
|
| 131 |
+
|
| 132 |
+
---
|
| 133 |
+
|
| 134 |
+
## 3. Hardware Environment
|
| 135 |
+
|
| 136 |
+
### GPU
|
| 137 |
+
|
| 138 |
+
| Item | Spec |
|
| 139 |
+
|------|------|
|
| 140 |
+
| Model | 8× NVIDIA B200 |
|
| 141 |
+
| VRAM | 183GB HBM3e per GPU (~1.47TB total) |
|
| 142 |
+
| FP8 Tensor Core | 2,250 TFLOPS/GPU (18,000 TFLOPS total) |
|
| 143 |
+
| BF16 | 1,125 TFLOPS/GPU |
|
| 144 |
+
| HBM3e bandwidth | ~7.67 TB/s per GPU |
|
| 145 |
+
| Interconnect | NVLink 5.0 (900 GB/s bidirectional per GPU) |
|
| 146 |
+
| Topology | NVSwitch: single-hop all-to-all mesh between every GPU pair |
|
| 147 |
+
| Power | 940W measured / 1000W cap |
|
| 148 |
+
|
| 149 |
+
The B200 supports FP8 natively. Training combines `torch.float8_e4m3fn` with TransformerEngine's MXFP8 recipe. Theoretical compute throughput is 2× that of BF16, and memory efficiency improves as well.
|
| 150 |
+
|
| 151 |
+
### CPU and System Memory
|
| 152 |
+
|
| 153 |
+
| Item | Spec |
|
| 154 |
+
|------|------|
|
| 155 |
+
| CPU | 2× AMD EPYC 9365 (Turin / Zen 5) |
|
| 156 |
+
| Physical cores | 72 (36 cores × 2 sockets) |
|
| 157 |
+
| NUMA layout | 2 nodes: node0 (core 0-35) / node1 (core 36-71) |
|
| 158 |
+
| GPU↔NUMA mapping | GPU 0-3 → NUMA node 0, GPU 4-7 → NUMA node 1 |
|
| 159 |
+
| RAM | 2.21TB DDR5 (~2.03TB free) |
|
| 160 |
+
| L3 cache | 384MB (12 CCX × 32MB) |
|
| 161 |
+
|
| 162 |
+
**NUMA caveat**: in the initial DDP launch, 5 of 8 ranks ran on the wrong NUMA node, and 69% of DataLoader workers were cross-NUMA. NUMA affinity optimization has not been applied yet (roadmap item).
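One way the pending NUMA affinity work could look is sketched below, using the core ranges and GPU-to-node mapping from the table above. The helper names are hypothetical, and the sketch pins only CPU scheduling with the standard-library `os.sched_setaffinity` (Linux only); it does not set a memory allocation policy.

```python
import os

# Core ranges per NUMA node, taken from the hardware table above.
NUMA_CORES = {0: range(0, 36), 1: range(36, 72)}
GPUS_PER_NODE = 4  # GPU 0-3 -> node 0, GPU 4-7 -> node 1


def cores_for_local_rank(local_rank: int) -> set:
    """Return the CPU cores on the NUMA node attached to this rank's GPU."""
    node = local_rank // GPUS_PER_NODE
    return set(NUMA_CORES[node])


def pin_to_numa_node(local_rank: int) -> None:
    """Pin the current process to its GPU-local cores (Linux only)."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, cores_for_local_rank(local_rank))
```

Calling `pin_to_numa_node(int(os.environ["LOCAL_RANK"]))` before the DataLoader is created would let forked worker processes inherit the same affinity, assuming the launcher sets `LOCAL_RANK` as `torchrun` does.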
|
| 163 |
+
|
| 164 |
+
### Storage
|
| 165 |
+
|
| 166 |
+
| Path | Purpose | Free space |
|
| 167 |
+
|------|------|-----------|
|
| 168 |
+
| `/PROJECT/0325120031_A/ghong/taketimes/llm-bang/` | Main workspace (checkpoints, data) | 2.2TB |
|
| 169 |
+
| `/home/ghong/` | Small code only | 5GB (quota) |
|
| 170 |
+
|
| 171 |
+
> **Note**: checkpoints (tens of GB), training data (82GB+), and intermediate artifacts are all stored under `/PROJECT/...`. Otherwise the home directory risks exceeding its quota.
|
| 172 |
+
|
| 173 |
+
### Software Environment
|
| 174 |
+
|
| 175 |
+
| Package | Version |
|
| 176 |
+
|--------|------|
|
| 177 |
+
| PyTorch | `2.10.0a0+b4e4ee81d3.nv25.12` (NVIDIA custom build) |
|
| 178 |
+
| FlashAttention | 2.7.4.post1+25.12 |
|
| 179 |
+
| TransformerEngine | 2.10.0 |
|
| 180 |
+
| NCCL | 2.28.9 |
|
| 181 |
+
| Triton | 3.5.1 |
|
| 182 |
+
| CUDA | 13.1 |
|
| 183 |
+
| Driver | 580.95.05 |
|
| 184 |
+
|
| 185 |
+
> **Warning**: PyTorch here is a custom NVIDIA build optimized for B200. Reinstalling it with `pip install torch` breaks the B200 optimizations. **Never reinstall it.**
|
| 186 |
+
|
| 187 |
+
---
|
| 188 |
+
|
| 189 |
+
## 4. Project Structure
|
| 190 |
+
|
| 191 |
+
```
|
| 192 |
+
llm-bang/
|
| 193 |
+
โโโ CLAUDE.md # Claude Code ๊ฐ์ด๋
|
| 194 |
+
โโโ README.md # ์ด ํ์ผ
|
| 195 |
+
โโโ PROGRESS.md # ์งํ ๊ธฐ๋ก (๋ ์ง๋ณ ๋ก๊ทธ)
|
| 196 |
+
โโโ Modelfile.3b # Ollama ๋ชจ๋ธ ํ์ผ
|
| 197 |
+
โ
|
| 198 |
+
โโโ configs/
|
| 199 |
+
โ โโโ korean_3b_fp8.yaml # 3B FP8 ํ์ต ์ค์ (ํ์ฌ ์ฌ์ฉ ์ค)
|
| 200 |
+
โ โโโ 3b_pretrain.yaml # 3B ํ๋ฆฌํธ๋ ์ธ ์ค์ (๋์ฒด)
|
| 201 |
+
โ โโโ korean_1b_fp8.yaml # 1B FP8 ์ค์ (์์นด์ด๋ธ)
|
| 202 |
+
โ โโโ korean_3b_sft.yaml # 3B SFT v1 ์ค์ (์๋ฃ)
|
| 203 |
+
โ โโโ korean_3b_sft_v2.yaml # 3B SFT v2 ์ค์ (lr=5e-5, data mixing)
|
| 204 |
+
โ โโโ korean_3b_orpo.yaml # 3B ORPO ์ค์ (lr=5e-6, beta=0.1)
|
| 205 |
+
โ โโโ hybrid_3b.yaml # Hybrid 3B (Mamba-2 + Attention)
|
| 206 |
+
โ โโโ small_fp8.yaml # 125M FP8 ๊ฒ์ฆ์ฉ
|
| 207 |
+
โ โโโ medium.yaml # ์คํ ๋ชจ๋ธ ์ค์
|
| 208 |
+
โ โโโ small.yaml # ์ํ ๋ชจ๋ธ ์ค์
|
| 209 |
+
โ
|
| 210 |
+
โโโ data/
|
| 211 |
+
โ โโโ 3b_train.bin # ํ๋ฆฌํธ๋ ์ธ ํ์ต ๋ฐ์ดํฐ (82GB, 41.12B tokens)
|
| 212 |
+
โ โโโ 3b_val.bin # ๊ฒ์ฆ ๋ฐ์ดํฐ (151MB)
|
| 213 |
+
โ โโโ cc100_ko_train.bin # CC100 ํ๊ตญ์ด (4.5GB)
|
| 214 |
+
โ โโโ cosmo_auto_math_text_train.bin # ์ํ ํ
์คํธ (2.6GB)
|
| 215 |
+
โ โโโ build scripts, __init__.py
|
| 216 |
+
โ
|
| 217 |
+
โโโ model/
|
| 218 |
+
โ โโโ attention.py # GQA FlashAttention (Phase 0 ์ต์ ํ ์ ์ฉ)
|
| 219 |
+
โ โโโ transformer.py # ํธ๋์คํฌ๋จธ ๋ฉ์ธ ์ํคํ
์ฒ
|
| 220 |
+
โ โโโ config.py # ๋ชจ๋ธ ์ค์ dataclass
|
| 221 |
+
โ โโโ layers.py # ์ปค์คํ
๋ ์ด์ด (RMSNorm, SwiGLU ๋ฑ)
|
| 222 |
+
โ
|
| 223 |
+
โโโ train/
|
| 224 |
+
โ โโโ pretrain.py # ํ๋ฆฌํธ๋ ์ธ ์คํฌ๋ฆฝํธ (DDP ์ต์ ํ)
|
| 225 |
+
โ โโโ sft.py # SFT ํ์ต
|
| 226 |
+
โ โโโ orpo.py # ORPO ํ์ต
|
| 227 |
+
โ โโโ trainer.py # ํตํฉ ํธ๋ ์ด๋ (loss sync ์ต์ ํ)
|
| 228 |
+
โ โโโ utils.py # ์ ํธ๋ฆฌํฐ (NCCL 7200s timeout ๋ฑ)
|
| 229 |
+
โ
|
| 230 |
+
โโโ scripts/
|
| 231 |
+
โ โโโ launch_3b_pretrain.sh # 3B ํ๋ฆฌํธ๋ ์ธ ๋ฐ์ฒ (NCCL ํ๊ฒฝ๋ณ์ ํฌํจ)
|
| 232 |
+
โ โโโ launch_3b_sft.sh # 3B SFT v1 ๋ฐ์ฒ
|
| 233 |
+
โ โโโ launch_3b_sft_v2.sh # 3B SFT v2 ๋ฐ์ฒ (data mixing)
|
| 234 |
+
โ โโโ launch_3b_orpo.sh # 3B ORPO ๋ฐ์ฒ
|
| 235 |
+
โ โโโ monitor_3b.sh # ์ค์๊ฐ ํ์ต ๋ชจ๋ํฐ
|
| 236 |
+
โ โโโ training_watchdog.sh # ์์น๋
(10๋ถ ๊ฐ๊ฒฉ, ํฌ๋ก )
|
| 237 |
+
โ โโโ convert_3b_gguf.sh # GGUF ๋ณํ ์คํฌ๋ฆฝํธ
|
| 238 |
+
โ โโโ deploy_3b_ollama.sh # Ollama ๋ฐฐํฌ
|
| 239 |
+
โ โโโ quality_gate.sh # ๋ฐฐํฌ ์ ํ์ง ๊ฒ์ดํธ
|
| 240 |
+
โ โโโ telegram_notify.py # ํ
๋ ๊ทธ๋จ ์๋ฆผ (urllib ์ฌ์ฉ, curl ์ฐจ๋จ)
|
| 241 |
+
โ โโโ hourly_status.sh # 1์๊ฐ ๊ฐ๊ฒฉ ์ํ ๋ฆฌํฌํธ
|
| 242 |
+
โ
|
| 243 |
+
โโโ eval/
|
| 244 |
+
โ โโโ debate/
|
| 245 |
+
โ โ โโโ justice_league_3b_case.md # 3B ์ ํ ๋
ผ์ฆ (์ ์คํฐ์ค๋ฆฌ๊ทธ ๋ฉํฐ์์ด์ ํธ)
|
| 246 |
+
โ โโโ decision/
|
| 247 |
+
โ โ โโโ FINAL_DECISION_REPORT.md # SFT ์ฌ์์ ํ๊ฒฐ๋ฌธ
|
| 248 |
+
โ โโโ plan/
|
| 249 |
+
โ โ โโโ 3B_MASTER_PLAN.md # 3B ๋ง์คํฐ ํ๋
|
| 250 |
+
โ โโโ tasks/ # ๋ชจ๋ํ๋ ํ๊ฐ ํ์คํฌ
|
| 251 |
+
โ โ โโโ task_runner.py # 8-GPU ๋ณ๋ ฌ ํ์คํฌ ์คํ๊ธฐ
|
| 252 |
+
โ โ โโโ ppl_task.py # Perplexity ํ๊ฐ ํ์คํฌ
|
| 253 |
+
โ โ โโโ lm_eval_task.py # lm-evaluation-harness ๋ํผ
|
| 254 |
+
โ โ โโโ calibration_task.py # Calibration ๋ถ์
|
| 255 |
+
โ โ โโโ generation_task.py # ์์ฑ ํ์ง + ํ๋ผ๋ฏธํฐ ๊ทธ๋ฆฌ๋ ์์น
|
| 256 |
+
โ โ โโโ token_nll_task.py # Token NLL ๋ถํฌ ๋ถ์
|
| 257 |
+
โ โโโ outputs/ # ํ๊ฐ ๊ฒฐ๊ณผ (์๋ ์์ฑ, .gitignore)
|
| 258 |
+
โ โโโ full_eval_pipeline.py # v2 ์ข
ํฉ ํ๊ฐ ํ์ดํ๋ผ์ธ (8-GPU ๋ณ๋ ฌ)
|
| 259 |
+
โ โโโ sft_eval_pipeline.py # SFT 6์ฐจ์ ํ๊ฐ ํ์ดํ๋ผ์ธ
|
| 260 |
+
โ โโโ reeval_pipeline.py # ์ฌํ๊ฐ ํ์ดํ๋ผ์ธ (0+5-shot ์ฐ์)
|
| 261 |
+
โ โโโ report_generator.py # ๋งํฌ๋ค์ด ๋ฆฌํฌํธ ์๋ ์์ฑ
|
| 262 |
+
โ โโโ comprehensive_eval.py # v1 ์ข
ํฉ ํ๊ฐ (๋ ๊ฑฐ์)
|
| 263 |
+
โ โโโ test_generation_params.py # ์์ฑ ํ๋ผ๋ฏธํฐ ํ์
|
| 264 |
+
โ
|
| 265 |
+
โโโ tokenizer/
|
| 266 |
+
โ โโโ korean_sp/ # SentencePiece 64K ๋ชจ๋ธ ํ์ผ
|
| 267 |
+
โ โโโ tokenizer.json # HuggingFace ํฌ๋งท (2.4MB)
|
| 268 |
+
โ โโโ train_sp_tokenizer.py # ํ ํฌ๋์ด์ ํ์ต ์คํฌ๋ฆฝํธ
|
| 269 |
+
โ โโโ convert_sp_to_hf.py # SentencePiece โ HF ๋ณํ
|
| 270 |
+
โ
|
| 271 |
+
โโโ checkpoints/ # ๋ชจ๋ธ ์ฒดํฌํฌ์ธํธ (๋์ฉ๋, .gitignore)
|
| 272 |
+
โ
|
| 273 |
+
โโโ docs/
|
| 274 |
+
โ โโโ PROJECT_HISTORY.md # ํ๋ก์ ํธ ์ ์ฒด ์ฌ์ ์์ธ ๊ธฐ๋ก
|
| 275 |
+
โ โโโ 3B_WORKPLAN.md # 3B ์์
๊ณํ
|
| 276 |
+
โ
|
| 277 |
+
โโโ reports/
|
| 278 |
+
โโโ 2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md
|
| 279 |
+
โโโ 2026-03-05_3B_BASE_EVALUATION_REPORT.md
|
| 280 |
+
โโโ 2026-03-05_3B_SFT_PROGRESS_REPORT.md # SFT ํ์ต ๋ณด๊ณ ์ (Phase 2)
|
| 281 |
+
โโโ 2026-03-05_3B_NEXT_STEPS_REFERENCE.md
|
| 282 |
+
โโโ 2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md
|
| 283 |
+
โโโ 2026-03-05_PPL_EVALUATION.md
|
| 284 |
+
โโโ 2026-03-05_BENCHMARK_RESULTS.md
|
| 285 |
+
โโโ 2026-03-05_GENERATION_QUALITY.md
|
| 286 |
+
โโโ 2026-03-06_3B_SFT_EVAL_PLAN.md # SFT 6์ฐจ์ ํ๊ฐ ๊ณํ์
|
| 287 |
+
โโโ 2026-03-06_3B_SFT_EVALUATION_REPORT.md # SFT 6์ฐจ์ ํ๊ฐ ๊ฒฐ๊ณผ
|
| 288 |
+
โโโ 2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md # SFT ์๋ฃ + ์ฝ๋ ๊ฐ์ ์ข
ํฉ
|
| 289 |
+
```
|
| 290 |
+
|
| 291 |
+
---
|
| 292 |
+
|
| 293 |
+
## 5. Project Journey Timeline
|
| 294 |
+
|
| 295 |
+
This section is the heart of this README. It records candidly not just the results but **why** each decision was made and **where** things failed.
|
| 296 |
+
|
| 297 |
+
---
|
| 298 |
+
|
| 299 |
+
### Day 1 (Feb 25): First Spark, 125M FP8 Validation
|
| 300 |
+
|
| 301 |
+
The project started from a small question: does FP8 training actually run stably on the B200?
|
| 302 |
+
|
| 303 |
+
TransformerEngine's MXFP8 recipe was applied to a small 125M model to check this. The conclusion: **it runs stably**. Loss convergence was normal, and VRAM efficiency improved clearly over BF16. This validation was the first green light for the whole pipeline.
|
| 304 |
+
|
| 305 |
+
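On the same day, the infrastructure setup was also completed. The first drafts of the 8-GPU DDP environment, the NCCL environment variables, the checkpoint storage path, and the Telegram notification system were put in place at this point.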
|
| 306 |
+
|
| 307 |
+
---
|
| 308 |
+
|
| 309 |
+
### Day 1~2 (Feb 25~26): 1B Pretraining, 34K Steps, PPL 5.67
|
| 310 |
+
|
| 311 |
+
Right after the 125M validation, pretraining of the 1B model began.
|
| 312 |
+
|
| 313 |
+
- **Architecture**: d_model=2048, 24 layers, GQA 4:1, SwiGLU, RoPE
|
| 314 |
+
- **Data**: based on C4 Korean
|
| 315 |
+
- **Training**: 34,000 steps, FP8, 8× B200 DDP
|
| 316 |
+
|
| 317 |
+
Final results:
|
| 318 |
+
- **Loss: 1.904**
|
| 319 |
+
- **PPL (C4 Korean): 5.67**
|
| 320 |
+
|
| 321 |
+
On the numbers alone, this looked decent enough. But actual text generation exposed problems: repetitive patterns, awkward sentence structure, and drifting off context. That is to be expected from a pretrained-only model. SFT was next.
|
| 322 |
+
|
| 323 |
+
---
|
| 324 |
+
|
| 325 |
+
### Day 2 (Feb 26): SFT v1, the Disaster Called 0.0
|
| 326 |
+
|
| 327 |
+
SFT was launched. As soon as training started, the loss began to fall quickly. At first this looked like a good sign.
|
| 328 |
+
|
| 329 |
+
Then the loss reached **0.0**.
|
| 330 |
+
|
| 331 |
+
The val loss was also 0.0. The generated output was complete garbage.
|
| 332 |
+
|
| 333 |
+
The cause was found: a **label off-by-one bug**. The input tokens and the label tokens were misaligned by one position. Instead of actually predicting the next token, the model was set up to reproduce an answer it could already see. The loss hitting 0 was not "perfect learning" but **data leakage (label leakage)**.
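For reference, a correctly aligned next-token setup can be sketched as follows. This is a generic illustration, not the project's `train/sft.py`: the function name, the prompt-masking choice, and the `-100` ignore value are assumptions (the latter is a common convention for cross-entropy ignore indices).

```python
IGNORE_INDEX = -100  # assumed "do not score this position" label value


def build_inputs_and_labels(token_ids, prompt_len):
    """Build next-token targets: position t is trained to predict token t+1.

    Targets that still belong to the prompt are masked so that only the
    response (including the final EOS) contributes to the loss.
    """
    inputs = list(token_ids[:-1])
    labels = list(token_ids[1:])
    # labels[i] holds the token at original position i + 1, so the first
    # prompt_len - 1 targets are prompt tokens and get masked.
    for i in range(min(max(prompt_len - 1, 0), len(labels))):
        labels[i] = IGNORE_INDEX
    return inputs, labels
```

If the model or loss function already shifts labels internally, shifting again in the dataset reintroduces a misalignment, so the shift should happen in exactly one place.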
|
| 334 |
+
|
| 335 |
+
A full day was lost.
|
| 336 |
+
|
| 337 |
+
---
|
| 338 |
+
|
| 339 |
+
### Day 3 (Feb 27): Five Bugs, Root-Cause Analysis
|
| 340 |
+
|
| 341 |
+
To analyze the failure, a **5-agent root-cause analysis** was carried out. The conclusion was that there was not just one bug. The entire SFT pipeline had problems.
|
| 342 |
+
|
| 343 |
+
The five key bugs that were found:
|
| 344 |
+
|
| 345 |
+
| Bug | Symptom | Impact |
|
| 346 |
+
|------|------|------|
|
| 347 |
+
| Static padding (no packing) | Short samples are also padded to max_len | Wasted GPU, inefficient training |
|
| 348 |
+
| EOS token truncation | No EOS at the end of responses | Model never learns "end of sentence" |
|
| 349 |
+
| Single epoch | Data is seen only once | Underfitting |
|
| 350 |
+
| No validation split | val_loss cannot be measured | Overfitting cannot be detected |
|
| 351 |
+
| Data quality | Noise, duplicates, imbalance | Induces repetitive generation patterns |
|
| 352 |
+
|
| 353 |
+
The EOS truncation bug in particular was subtle. If the model never learns when to end a response, at generation time it keeps repeating the same pattern or appends meaningless tokens. This was one of the causes of the 18% repetition rate.
|
| 354 |
+
|
| 355 |
+
---
|
| 356 |
+
|
| 357 |
+
### Day 3 (Feb 27): SFT v2, a Success but with 18% Repetition
|
| 358 |
+
|
| 359 |
+
All five bugs were fixed and SFT v2 was run.
|
| 360 |
+
|
| 361 |
+
- **val_loss: 2.2062**, a reasonable level
|
| 362 |
+
- **Repetition rate: 18%** (with rep_penalty=1.1 applied)
|
| 363 |
+
|
| 364 |
+
Generation quality improved clearly over v1. But an 18% repetition rate is still high. Raising `rep_penalty` reduces repetition, but it also reduces diversity and makes the output awkward. There is a structural limit to what decoding parameters can fix.
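This section does not spell out how the repetition rate is computed. One common definition, shown here only as an assumption, is the fraction of n-grams in a generated sequence that duplicate another n-gram in the same sequence:

```python
def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that are duplicates within the sequence.

    Returns 0.0 when every n-gram is unique and approaches 1.0 as the
    sequence degenerates into a loop.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)
```

For example, `ngram_repetition_rate("a b c a b c a b c".split())` yields 7 trigrams of which 3 are unique, so the rate is 4/7. The project's own metric may tokenize or normalize differently.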
|
| 365 |
+
|
| 366 |
+
On kobest_copa the score was 0.646. That is a decent number, but it falls short of the target.
|
| 367 |
+
|
| 368 |
+
---
|
| 369 |
+
|
| 370 |
+
### Day 3 (Feb 27): "Justice League vs Avengers", the Decision to Move to 3B
|
| 371 |
+
|
| 372 |
+
The 18% repetition rate set off an internal team debate. There was one central question:
|
| 373 |
+
|
| 374 |
+
> **Can ORPO tame the repetition, or do we have to move to 3B?**
|
| 375 |
+
|
| 376 |
+
To answer this question, a **multi-agent debate** was run (codename: "Justice League vs Avengers"). Each agent took a different position and argued for it.
|
| 377 |
+
|
| 378 |
+
Key findings from the debate:
|
| 379 |
+
|
| 380 |
+
1. **The 18% repetition is a structural limit of 1B parameters.** A 1B model cannot capture long-range dependencies well enough. Preference alignment such as ORPO helps reduce repetition to some extent, but it does not address the root cause (too few parameters).
|
| 381 |
+
|
| 382 |
+
2. **Scaling-law analysis**: based on the Chinchilla law and experimental data, the estimate was that a 3B model could bring the repetition rate down to 5~8% on the same data.
|
| 383 |
+
|
| 384 |
+
3. **Cost-benefit analysis**: investing in 3B pretraining beats investing ORPO effort in the 1B model in terms of final model quality.
|
| 385 |
+
|
| 386 |
+
**Conclusion: move to 3B**. The 1B model is archived and 3B pretraining begins.
|
| 387 |
+
|
| 388 |
+
This decision is recorded, together with the full argument, in `eval/debate/justice_league_3b_case.md`.
|
| 389 |
+
|
| 390 |
+
---
|
| 391 |
+
|
| 392 |
+
### Day 3 (Feb 27): Assembling 640GB+ of Data
|
| 393 |
+
|
| 394 |
+
As soon as the move to 3B was decided, the data pipeline was started. Far more data is needed than for 1B (Chinchilla-optimal ratio: 3B model × 20 = 60B tokens).
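The arithmetic behind that target, and how the 41.12B tokens actually assembled (listed just below) compare with it, can be checked with a few lines. The 20-tokens-per-parameter figure is the rule of thumb cited above, and the nominal 3e9 parameter count is an assumption since the exact count is not stated in this section.

```python
TOKENS_PER_PARAM = 20  # Chinchilla-style rule of thumb cited above


def chinchilla_token_budget(n_params: float) -> float:
    """Token budget suggested by the ~20 tokens-per-parameter heuristic."""
    return n_params * TOKENS_PER_PARAM


def tokens_per_param(n_tokens: float, n_params: float) -> float:
    """Actual ratio of training tokens to parameters."""
    return n_tokens / n_params


budget = chinchilla_token_budget(3e9)    # nominal 3B parameters (assumed)
actual = tokens_per_param(41.12e9, 3e9)  # 41.12B tokens in the final binary
```

With these inputs the budget is 60B tokens, while the assembled dataset corresponds to roughly 13.7 tokens per parameter for a single pass.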
|
| 395 |
+
|
| 396 |
+
The data finally assembled:
|
| 397 |
+
- **Total tokens**: 41.12B tokens (final binary file)
|
| 398 |
+
- **Raw data**: 640GB+ of multilingual text
|
| 399 |
+
- **Sources**: C4 Korean, Namuwiki, Wikipedia Korean, and the korean_extra dataset
|
| 400 |
+
|
| 401 |
+
After preprocessing (tokenization, shuffling, binary conversion), `data/3b_train.bin` is 82GB. The validation set `data/3b_val.bin` is 151MB.
|
| 402 |
+
|
| 403 |
+
---
|
| 404 |
+
|
| 405 |
+
### Mar 2: Phase 0, Defeating OOM and Optimizing
|
| 406 |
+
|
| 407 |
+
The first attempt at 3B training hit OOM (Out of Memory). It seemed odd for a 3B model to run out of memory with 183GB of VRAM, but there was a cause.
|
| 408 |
+
|
| 409 |
+
It was a **problem in the GQA FlashAttention implementation**. In GQA (Grouped-Query Attention), the way the KV cache was expanded copied memory unnecessarily. FlashAttention's native GQA support was not being used properly.
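The cost of that expansion can be estimated with simple arithmetic. FlashAttention accepts K and V with fewer heads than Q, so the expanded copies are avoidable. The sketch below compares per-layer K/V tensor sizes for the two approaches; the head counts, head dimension, and sequence length are hypothetical values chosen for illustration, not the 3B model's actual configuration.

```python
def kv_tensor_bytes(tokens, kv_heads, head_dim, bytes_per_elem=2):
    """Bytes held by one layer's K and V tensors (BF16 by default)."""
    return 2 * tokens * kv_heads * head_dim * bytes_per_elem  # K and V


# Hypothetical configuration: 32 query heads sharing 8 KV heads (4:1 GQA).
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 128
TOKENS = 5 * 4096  # micro-batch of 5 sequences, assumed length 4096

native = kv_tensor_bytes(TOKENS, N_KV_HEADS, HEAD_DIM)
expanded = kv_tensor_bytes(TOKENS, N_Q_HEADS, HEAD_DIM)  # after repeating KV
```

Under these assumptions the expanded tensors are `N_Q_HEADS / N_KV_HEADS` times larger than the native ones in every layer, and the copies are also retained for the backward pass.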
|
| 410 |
+
|
| 411 |
+
Optimizations carried out in Phase 0:
|
| 412 |
+
|
| 413 |
+
| Optimization | Method | Effect |
|
| 414 |
+
|--------|------|------|
|
| 415 |
+
| GQA FA Native | Use the native GQA path of `flash_attn_varlen_func` | VRAM 60.4GB → 48.3GB (**-20%**) |
|
| 416 |
+
| DDP optimization | `gradient_as_bucket_view=True` | GPU-CPU synchronization overhead -87.5% |
|
| 417 |
+
| NCCL NVLS | Ring+Tree topology, NVLS enabled | Improved AllReduce efficiency |
|
| 418 |
+
| Batch size analysis | Identified GPUs 2, 4, 6 as NCCL relay nodes | bs=5 optimal, bs=6 confirmed risky |
|
| 419 |
+
| SIGHUP defense | nohup+setsid + Python signal handler + emergency ckpt | Triple protection |
|
| 420 |
+
| Monitoring | Telegram Bot (B200Bot) + cron | 10-minute watchdog, hourly status report |
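The signal-handler layer of the SIGHUP defense in the table above could look roughly like the sketch below. It uses only the standard-library `signal` module; the `train_loop` and `save_checkpoint` names are placeholders, not the project's trainer API. The handler only sets a flag, and the loop writes the emergency checkpoint at a safe point between steps.

```python
import os
import signal

stop_requested = False


def _request_stop(signum, frame):
    """Only set a flag; the training loop does the actual checkpointing."""
    global stop_requested
    stop_requested = True


def install_handlers():
    """Route SIGHUP and SIGTERM to the flag-setting handler (Unix only)."""
    for sig in (signal.SIGHUP, signal.SIGTERM):
        signal.signal(sig, _request_stop)


def train_loop(max_steps, save_checkpoint, on_step=None):
    """Run steps until done or a stop signal arrives; return the last step."""
    step = 0
    while step < max_steps:
        if on_step is not None:
            on_step(step)  # placeholder for forward/backward/optimizer step
        step += 1
        if stop_requested:
            save_checkpoint(step, emergency=True)
            break
    return step
```

Keeping the handler minimal avoids running checkpoint I/O in the middle of a collective operation or a partially applied optimizer step.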
|
| 421 |
+
|
| 422 |
+
**torch.compile test**: no effect (1.00x). TransformerEngine's opaque kernels caused graph breaks, and the `/tmp` directory carried the noexec flag, so the compiled kernel cache could not be used. The time was effectively wasted, but confirming "no effect" by measurement is also a result.
|
| 423 |
+
|
| 424 |
+
**Why bs=5**: in the NCCL ring topology, GPUs 2, 4, and 6 act as relay nodes. These GPUs use about 11GB more than the others. At bs=5 there is headroom, but raising to bs=6 pushes these relay GPUs too close to the 183GB limit. bs=5 was kept for a safety margin.
|
| 425 |
+
|
| 426 |
+
---
|
| 427 |
+
|
| 428 |
+
### Mar 2~Mar 5 โ Phase 1: 3B ํ๋ฆฌํธ๋ ์ธ ์๋ฃ
|
| 429 |
+
|
| 430 |
+
Phase 0 ์ต์ ํ๊ฐ ์๋ฃ๋ ํ Phase 1์ด ์์๋๋ค.
|
| 431 |
+
Initial metrics (step 3150):
- Loss: 2.38
- Throughput: 36K tok/s per rank
- System-wide: ~292K tok/s (8 GPUs)
- MFU: ~33.5%

An MFU of 33.5% may look low at first, but it was measured with TE MXFP8 already tuned. It is the achieved efficiency relative to the theoretical peak (18,000 TFLOPS). Remaining headroom: QKV fusion (+8~12%), NUMA affinity (+4~9%), and FA2 native RoPE (+3~5%).
|
| 439 |
+
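The MFU figure can be sanity-checked with a rough FLOP count. The sketch below assumes the common approximation of 6·N FLOPs per token for the dense layers plus 12·L·d_model·S for attention, with N ≈ 3.0B and the architecture values from Section 6. The function name and the accounting are illustrative and may differ from the project's own calculation, so the result is only expected to land in the low thirties, near the reported ~33.5%.

```python
def estimate_mfu(n_params, n_layers, d_model, seq_len, tokens_per_sec, peak_flops):
    """Rough MFU: achieved training FLOP/s divided by theoretical peak FLOP/s."""
    dense = 6 * n_params                            # forward + backward matmuls
    attention = 12 * n_layers * d_model * seq_len   # attention score/value matmuls
    return (dense + attention) * tokens_per_sec / peak_flops


mfu = estimate_mfu(
    n_params=3.0e9, n_layers=28, d_model=3072, seq_len=2048,
    tokens_per_sec=292_000,   # system-wide throughput quoted above
    peak_flops=18_000e12,     # 18,000 TFLOPS theoretical peak quoted above
)
print(f"estimated MFU: {mfu:.1%}")
```

The exact value depends on the true parameter count and on whether attention FLOPs are included, which is why it should be read as a cross-check rather than a reproduction.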
|
| 440 |
+
**Phase 1 ์๋ฃ (2026-03-05)**:
|
| 441 |
+
|
| 442 |
+
- **57,000 steps ์๋ฃ**, ์ต์ข
loss **1.466**
|
| 443 |
+
- 41.12B ํ ํฐ ์ฒ๋ฆฌ, ์ด ํ์ต ์๊ฐ ์ฝ 63์๊ฐ
|
| 444 |
+
- ๋ฌด์ฌ๊ณ ์๋ฃ (SIGHUP, OOM, NCCL ์ด์ ์์)
|
| 445 |
+
|
| 446 |
+
์ข
ํฉ ํ๊ฐ ๊ฒฐ๊ณผ ์์ฝ (v2 ์ฌํ๊ฐ ๋ฐ์):
|
| 447 |
+
|
| 448 |
+
| ํญ๋ชฉ | ๊ฒฐ๊ณผ |
|
| 449 |
+
|------|------|
|
| 450 |
+
| PPL (ํตํฉ ๊ฒ์ฆ์
) | 5.2263 (์ด๊ธฐ v1 ํ๊ฐ: 5.709) |
|
| 451 |
+
| PPL (C4 Korean) | 5.717 |
|
| 452 |
+
| KoBEST ํ๊ท (5ํ์คํฌ) | 43.69% |
|
| 453 |
+
| MMLU-KO ํ๊ท (6์นดํ
๊ณ ๋ฆฌ) | 22.75% |
|
| 454 |
+
| HAE-RAE | 19.71% |
|
| 455 |
+
| winogrande / piqa | 50.59% / 52.50% |
|
| 456 |
+
| Calibration Top-1 | 68.75% |
|
| 457 |
+
| Greedy 3-gram ๋ฐ๋ณต๋ฅ | 60.99% (SFT ํ ๊ฐ์ ์์ ) |
|
| 458 |
+
| ์ต์ ์์ฑ ํ๋ผ๋ฏธํฐ | temp=0.7, rep_penalty=1.3 โ ๋ฐ๋ณต๋ฅ 0% |
|
| 459 |
+
|
| 460 |
+
**SFT ์งํ ๊ฒฐ์ **: loss 1.466์ ๊ฑด๊ฐํ ํ์ต ์๋ฃ ์๊ทธ๋. PPL/๋ฐ๋ณต๋ฅ /๋ฒค์น๋งํฌ ๋ชจ๋ SFT๊ฐ ํด๊ฒฐํ ์์ญ. ๋ชจ๋ธ ๊ตฌ์กฐ ๋ฌธ์ ์งํ ์์. โ Phase 2 SFT ์งํ.
|
| 461 |
+
|
| 462 |
+
---
|
| 463 |
+
|
| 464 |
+
### Mar 5~ โ Phase 2: 3B SFT ์์ โ 2.44M ์ํ, val_loss 1.956
|
| 465 |
+
|
| 466 |
+
Phase 1 ์๋ฃ ์งํ, ๋๊ท๋ชจ SFT ๋ฐ์ดํฐ๋ฅผ ์ค๋นํ๊ณ ํ์ต์ ์์ํ๋ค.
|
| 467 |
+
|
| 468 |
+
**๋ฐ์ดํฐ ํ์ดํ๋ผ์ธ**:
|
| 469 |
+
- **24๊ฐ ์์ค**์์ 6.59M raw samples ์์ง
|
| 470 |
+
- `prepare_sft_combined.sh`: ํฌ๋งท ํต์ผ(6๊ฐ์ง ํฌ๋งท โ messages), MD5 ์ค๋ณต ์ ๊ฑฐ, 98:2 split
|
| 471 |
+
- `filter_sft_v2.py`: 5๋จ๊ณ ํ์ง ํํฐ (EOS strip, QA marker ์ ๊ฑฐ, ๊ธธ์ด ํํฐ, 4-gram ๋ฐ๋ณต ํํฐ)
|
| 472 |
+
- ์ต์ข
: **2,439,397 train + 49,801 val** (7.48 GB)
|
| 473 |
+
|
| 474 |
+
๋ฐ์ดํฐ ๊ตฌ์ฑ์ ์ถ๋ก /CoT(38%), ํ๊ตญ์ด ์ง์(22.5%), ์์ด ๋ค๋ชฉ์ (16%), ์ํ(12%), ๋ํ/์ฝ๋(11.5%)๋ก ๊ท ํ์ ๋ง์ท๋ค. 1B SFT์ 161K์์ **15๋ฐฐ ํ๋**ํ ๊ท๋ชจ๋ค.
|
| 475 |
+
**SFT design: applying the lessons from the 1B failure**:

| 1B lesson | Applied in 3B SFT |
|---------|-------------|
| Label off-by-one → loss=0 | Verified loss masking (prompt=-1, train on response only) |
| EOS truncated → generation cannot terminate | Chat template `<\|user\|>...<\|assistant\|>...</s>` includes EOS |
| Static padding → wasted GPU | Dynamic padding (aligned to 64 tokens) |
| No validation → overfitting undetected | 49,801 val samples, eval every 500 steps |
| Data noise | 5-stage quality filter (absent in 1B) |
| Repetition rate 18% | Added **NEFTune alpha=5.0** (embedding noise injection) |
|
| 486 |
+
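The first and third rows of the table can be illustrated with a minimal sketch. It assumes next-token prediction with an ignore index of -1 (the value named in the table) and pads each sequence up to a multiple of 64. The helper names `build_labels` and `pad_to_multiple` are hypothetical and are not taken from `train/sft.py`.

```python
IGNORE_INDEX = -1  # prompt positions are excluded from the loss


def build_labels(prompt_ids, response_ids, eos_id):
    """Return (input_ids, labels) so that only response tokens and EOS are trained."""
    tokens = prompt_ids + response_ids + [eos_id]
    input_ids = tokens[:-1]
    targets = tokens[1:]                      # target at position i is token i+1
    n_prompt_targets = len(prompt_ids) - 1    # targets that are still prompt tokens
    labels = [IGNORE_INDEX] * n_prompt_targets + targets[n_prompt_targets:]
    return input_ids, labels


def pad_to_multiple(ids, multiple=64, pad_id=0):
    """Pad a sequence up to the next multiple of `multiple`."""
    padded_len = -(-len(ids) // multiple) * multiple   # ceiling division
    return ids + [pad_id] * (padded_len - len(ids))


input_ids, labels = build_labels([11, 12, 13], [21, 22], eos_id=2)
padded_labels = pad_to_multiple(labels, pad_id=IGNORE_INDEX)
```

Shifting the targets explicitly, instead of reusing the input IDs as labels, is what guards against the off-by-one error described in the first row.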
|
| 487 |
+
**ํ์ต ์ค์ **:
|
| 488 |
+
- LR: **1e-5** (pretrain์ 1/15 โ catastrophic forgetting ๋ฐฉ์ง)
|
| 489 |
+
- Effective batch: 2 ร 8 GPU ร 4 accum = 64 sequences
|
| 490 |
+
- 33,000 steps (~3.3 epochs)
|
| 491 |
+
- MXFP8, gradient checkpointing, NCCL Ring+Tree
|
| 492 |
+
|
| 493 |
+
**์ด๊ธฐ ๊ฒฐ๊ณผ** (step 2,000, 6%):
|
| 494 |
+
- Val loss: 2.073 โ 2.004 โ 1.975 โ **1.956** (๋จ์กฐ ๊ฐ์)
|
| 495 |
+
- Train-Val ๊ฐญ ~0.1 (์ค๋ฒํผํ
์งํ ์์)
|
| 496 |
+
- VRAM 24.2 GB (13.2%) โ pretrain์ ์ ๋ฐ, ๋งค์ฐ ์์
|
| 497 |
+
- Grad norm 1.0 ์ผ์ (ํ์ต๋ฅ ์ ์ )
|
| 498 |
+
|
| 499 |
+
์์ธ ๋ณด๊ณ ์: `reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md`
|
| 500 |
+
|
| 501 |
+
---
|
| 502 |
+
|
| 503 |
+
### Mar 6 โ Phase 2 ์๋ฃ: SFT Early Stopping (val_loss 1.8851)
|
| 504 |
+
SFT ended through early stopping at **25,500 steps** out of the planned 33,000. Val loss reached 1.8851 at step 23,000, after which five consecutive evaluations showed no improvement and training stopped automatically.
|
| 506 |
+
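The stopping point follows from the evaluation interval and the patience value: 23,000 + 5 × 500 = 25,500. A minimal sketch of patience-based early stopping is shown here. The `EarlyStopper` class and the synthetic loss curve are illustrative assumptions, not the code in `train/trainer.py`.

```python
class EarlyStopper:
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step, val_loss):
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best:
            self.best, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


stopper = EarlyStopper(patience=5)
stop_step = None
for step in range(500, 33_001, 500):   # one evaluation every 500 steps
    # Hypothetical curve: improves until step 23,000, then sits slightly higher.
    val_loss = 1.8851 + max(0, 23_000 - step) * 1e-5 if step <= 23_000 else 1.8900
    if stopper.update(step, val_loss):
        stop_step = step
        break

print(stop_step, stopper.best_step)
```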
**Total training time**: ~15 hours 41 minutes (2026-03-05 22:15 ~ 2026-03-06 13:56)

This result is consistent with the cosine decay from LR 1e-5 having effectively converged to 0 after step 20K. The model learned as much as it could under the given LR schedule.
|
| 510 |
+
|
| 511 |
+
---
|
| 512 |
+
|
| 513 |
+
### Mar 6 โ SFT 6์ฐจ์ ์ข
ํฉ ํ๊ฐ: 4/6 PASS โ ORPO ๊ฒฐ์
|
| 514 |
+
|
| 515 |
+
SFT ์ฒดํฌํฌ์ธํธ(`checkpoint-best`, step 23000)์ ๋ํด 6์ฐจ์ ์ข
ํฉ ํ๊ฐ๋ฅผ ์ํํ๋ค. 49๋ถ 27์ด ์์.
|
| 516 |
+
|
| 517 |
+
**ํต์ฌ ๊ฒฐ๊ณผ**:
|
| 518 |
+
- **Perplexity**: forgetting 0.9% (19๊ฐ ๋ฐ์ดํฐ์
์ ์ฒด PASS) โ ์ง์ ๋ณด์กด ์ฐ์
|
| 519 |
+
- **๋ฐ๋ณต๋ฅ **: greedy 72.97% (Base 60.99%๋ณด๋ค **์
ํ**) โ FAIL
|
| 520 |
+
- **EOS ์ข
๋ฃ์จ**: 0% โ 60% โ ๊ฐ์ ๋์ง๋ง ๋ชฉํ(90%) ๋ฏธ๋ฌ
|
| 521 |
+
- **KoBEST**: 43.26% (Base 43.69%์ ๊ฑฐ์ ๋์ผ) โ FAIL
|
| 522 |
+
- **MMLU-KO**: 22.75% โ 26.00% (+3.2pp) โ ๋ถ๋ถ ๊ฐ์
|
| 523 |
+
- **Calibration**: Top-1 68.59% โ PASS
|
| 524 |
+
|
| 525 |
+
**๊ฒฐ์ **: greedy ๋ฐ๋ณต๋ฅ 72.97%๋ SFT๋ง์ผ๋ก ํด๊ฒฐ ๋ถ๊ฐ. ๊ทธ๋ฌ๋ `rep_penalty=1.2` ์ ์ฉ ์ ๋ฐ๋ณต๋ฅ 0%๊ฐ ๋ฌ์ฑ๋๋ฏ๋ก, ORPO(์ ํธ๋ ์ ๋ ฌ)๋ก ์ด ํ๋์ ๋ด์ฌํํ๋ ๊ฒ์ด ์ฌ๋ฐ๋ฅธ ๊ฒฝ๋ก๋ค.
|
| 526 |
+
|
| 527 |
+
---
|
| 528 |
+
|
| 529 |
+
### Mar 6 โ ์ฝ๋ ๊ฐ์ ๋ฐ ORPO ์ค๋น
|
| 530 |
+
|
| 531 |
+
SFT ํ๊ฐ์ ๋ณํํ์ฌ ๋ค์์ ์ฝ๋ ๊ฐ์ ๋ฐ Phase 3 ์ค๋น๋ฅผ ์๋ฃํ๋ค:
|
| 532 |
+
|
| 533 |
+
| ๋ณ๊ฒฝ | ๋ด์ฉ | ์ํฅ |
|
| 534 |
+
|------|------|------|
|
| 535 |
+
| `train/sft.py` +238์ค | MixingDataLoader (SFT+pretrain ์ธํฐ๋ฆฌ๋น), DDP rank 0 ํ ํฌ๋์ด์ง | forgetting ๋ฐฉ์ง, ๋ฉ๋ชจ๋ฆฌ 8๋ฐฐ ์ ๊ฐ |
|
| 536 |
+
| `train/trainer.py` +17์ค | DDP early stopping broadcast (hang ๋ฐฉ์ง), patience 5โ10 | DDP ์์ ์ฑ |
|
| 537 |
+
| `train/orpo.py` +30์ค | YAML config ์ง์, 3B ๊ธฐ๋ณธ๊ฐ | ORPO ์คํ ์ค๋น |
|
| 538 |
+
| `eval/report_generator.py` +831์ค | Base vs SFT ๋น๊ต ๋ณด๊ณ ์ ์๋ ์์ฑ | ํ๊ฐ ์๋ํ |
|
| 539 |
+
| `eval/sft_eval_pipeline.py` ์ ๊ท | SFT 6์ฐจ์ ํ๊ฐ ํ์ดํ๋ผ์ธ | ์ข
ํฉ ํ๊ฐ |
|
| 540 |
+
| `eval/tasks/generation_task.py` +75์ค | Chat template, ๋ค์์ฑ ๋ฉํธ๋ฆญ | SFT ํ๊ฐ |
|
| 541 |
+
| `configs/korean_3b_sft_v2.yaml` ์ ๊ท | SFT v2 ์ค์ (lr=5e-5, data mixing 70/30) | ๋ฐฑ์
๊ฒฝ๋ก |
|
| 542 |
+
| `configs/korean_3b_orpo.yaml` ์ ๊ท | ORPO ์ค์ (lr=5e-6, beta=0.1) | Phase 3 |
|
| 543 |
+
|
| 544 |
+
์์ธ: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`
|
| 545 |
+
|
| 546 |
+
---
|
| 547 |
+
## 6. Model Architecture

### 1B (archive)

| Item | Value |
|------|-----|
| vocab_size | 64,000 |
| d_model | 2,048 |
| n_layers | 24 |
| n_heads | 16 |
| n_kv_heads | 4 (GQA 4:1) |
| d_ffn | 5,461 (SwiGLU) |
| Parameters | ~1.19B |
| context | 2,048 |
| rope_theta | 500,000 |
|
| 563 |
+
### 3B (current)

| Item | Value |
|------|-----|
| vocab_size | 64,000 |
| d_model | 3,072 |
| n_layers | 28 |
| n_heads | 24 |
| n_kv_heads | 8 (GQA 3:1) |
| d_ffn | 8,192 (SwiGLU) |
| Parameters | ~3.0B |
| context | 2,048 |
| rope_theta | 500,000 |
|
| 577 |
+
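The parameter counts in both tables can be reproduced from the listed dimensions. The sketch below assumes tied input/output embeddings, no bias terms, a head dimension of `d_model / n_heads`, and two RMSNorm weights per layer plus a final norm. These are assumptions chosen because they match the quoted totals, not details confirmed by the model code.

```python
def count_params(vocab_size, d_model, n_layers, n_heads, n_kv_heads, d_ffn, tied=True):
    """Approximate parameter count of a Llama-style decoder with GQA and SwiGLU."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim   # Q and O, plus K and V
    ffn = 3 * d_model * d_ffn                             # gate, up, down projections
    norms = 2 * d_model                                   # two RMSNorm weights per layer
    embed = vocab_size * d_model * (1 if tied else 2)
    return embed + n_layers * (attn + ffn + norms) + d_model  # + final norm


p1b = count_params(64_000, 2048, 24, 16, 4, 5461)
p3b = count_params(64_000, 3072, 28, 24, 8, 8192)
print(f"1B: {p1b / 1e9:.2f}B, 3B: {p3b / 1e9:.2f}B")
```

With untied embeddings each total would grow by `vocab_size × d_model`, which is why the tied assumption is stated explicitly.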
### Common design principles

| Component | Choice | Reason |
|----------|------|------|
| Normalization | Pre-norm RMSNorm | More stable training than post-norm |
| Activation | SwiGLU FFN | Proven choice in the Llama family |
| Positional encoding | RoPE (θ=500K) | Leaves room for long-context extension |
| Attention | GQA (Grouped-Query Attention) | Reduces KV cache memory |
| Implementation | FlashAttention-2 | IO-aware, VRAM efficient |
| Precision | FP8 (MXFP8 via TransformerEngine) | Makes full use of the B200 |
|
| 588 |
+
### Rationale for the GQA ratio

The 1B model uses GQA 4:1 (16 heads, 4 KV heads) and the 3B model uses GQA 3:1 (24 heads, 8 KV heads). The ratio was relaxed slightly at 3B because, as the parameter count grows, sacrificing some attention quality was judged to be a net loss at this scale. Mistral 7B and Llama 3, both of which use GQA with 8 KV heads, served as references.

### What rope_theta=500,000 means

Raising θ from the standard RoPE value of 10,000 to 500,000 is meant to reduce frequency interference in long contexts. Code Llama and Llama 3, among others, adopted the same approach of raising θ. With max_seq_len=2048 the benefit is hard to observe right now, but it lays the groundwork for later context-extension fine-tuning.
|
| 596 |
+
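The effect of θ can be made concrete by looking at the wavelength of each RoPE frequency pair. This is a minimal sketch assuming the standard RoPE parameterization (inverse frequency θ^(-2i/d)) and a head dimension of 128 (3072 / 24). The function name is illustrative.

```python
import math


def rope_wavelengths(head_dim, theta):
    """Wavelength in tokens of each RoPE frequency pair: 2*pi * theta**(2i/d)."""
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


for theta in (10_000, 500_000):
    longest = rope_wavelengths(128, theta)[-1]
    print(f"theta={theta:>7,}: longest wavelength ~ {longest:,.0f} tokens")
```

A larger θ stretches the slowest-rotating dimensions, so positions that are far apart still map to distinct phases when the context is later extended.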
|
| 597 |
+
---
|
| 598 |
+
|
| 599 |
+
## 7. ํ์ต ๋ฐ์ดํฐ
|
| 600 |
+
|
| 601 |
+
### 7.1 ํ ํฌ๋์ด์
|
| 602 |
+
|
| 603 |
+
| ํญ๋ชฉ | ๊ฐ |
|
| 604 |
+
|------|-----|
|
| 605 |
+
| ์ข
๋ฅ | SentencePiece Unigram |
|
| 606 |
+
| ์ดํ ํฌ๊ธฐ | 64,000 |
|
| 607 |
+
| ํ๊ตญ์ด ๋ฌธ์ ์ปค๋ฒ๋ฆฌ์ง | 99.95% |
|
| 608 |
+
| ์์น | `tokenizer/korean_sp/` |
|
| 609 |
+
| HF ํฌ๋งท | `tokenizer/tokenizer.json` (2.4MB) |
|
| 610 |
+
|
| 611 |
+
64K ์ดํ๋ 32K(๋๋ฌด ์์, ํ๊ตญ์ด ์๋ธ์๋ ๋จํธํ ์ฌํจ)์ 128K(๋๋ฌด ํผ, ์๋ฒ ๋ฉ ๋ ์ด์ด ์ค๋ฒํค๋ ์ฆ๊ฐ) ์ฌ์ด์ ๊ท ํ์ด๋ค. Llama 3(128K)์ GPT-4(100K)๊ฐ ํฐ ์ดํ๋ฅผ ์ฌ์ฉํ๋ ์ถ์ธ์ง๏ฟฝ๏ฟฝ๏ฟฝ, 3B ๋ชจ๋ธ์์ 128K ์ดํ๋ ์๋ฒ ๋ฉ ๋ ์ด์ด๋ง์ผ๋ก๋ ํ๋ผ๋ฏธํฐ ๋น์ค์ด ์ง๋์น๊ฒ ์ปค์ง๋ค.
|
| 612 |
+
|
| 613 |
+
### 7.2 ํ๋ฆฌํธ๋ ์ธ ๋ฐ์ดํฐ โ ์ ์ฒด ๊ตฌ์ฑ
|
| 614 |
+
Final training files: `data/3b_train.bin` (77GB, ~38.5B tokens) + `data/3b_val.bin` (145MB)

By the Chinchilla rule of thumb, 3B × 20 = **60B tokens** is optimal. The current run consumes the 38.5B-token set over 57,000 steps (batch 5 × accum 8 × seq 2048 × 8 GPU), a reasonable range for a first 3B training run.
|
| 618 |
+
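The token budget implied by these settings is simple arithmetic, worked through here. The 20-tokens-per-parameter factor is the rule of thumb quoted above, and the variable names are illustrative.

```python
batch_size, grad_accum, seq_len, num_gpus = 5, 8, 2048, 8
max_steps = 57_000
n_params = 3.0e9

tokens_per_step = batch_size * grad_accum * seq_len * num_gpus
tokens_seen = tokens_per_step * max_steps
chinchilla_tokens = 20 * n_params   # ~20 tokens per parameter

print(f"tokens per optimizer step: {tokens_per_step:,}")
print(f"tokens seen in {max_steps:,} steps: {tokens_seen / 1e9:.2f}B")
print(f"share of the 60B Chinchilla target: {tokens_seen / chinchilla_tokens:.0%}")
```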
|
| 619 |
+
#### ํ๊ตญ์ด โ ์นํฌ๋กค (Web Crawl)
|
| 620 |
+
|
| 621 |
+
| ๋ฐ์ดํฐ์
| HuggingFace ID | ํ ํฐํ ํ์ผ | ํฌ๊ธฐ | ์ถ์ ํ ํฐ | ์ค๋ช
|
|
| 622 |
+
|----------|---------------|------------|------|----------|------|
|
| 623 |
+
| C4 Korean | `allenai/c4` (ko subset) | `korean_c4_train.bin` | 15GB | ~7.5B | Google C4 ํ๊ตญ์ด ํํฐ๋ง, ๋๊ท๋ชจ ํด๋ฆฐ ์น ํ
์คํธ |
|
| 624 |
+
| CC-100 Korean | `cc100` (ko subset) | `cc100_ko_train.bin` | 4.3GB | ~2.15B | Common Crawl ๊ธฐ๋ฐ ๋จ์ผ์ธ์ด ์ฝํผ์ค |
|
| 625 |
+
| HPLT Korean | `HPLT/hplt_monolingual_v2` (ko) | `hplt_ko_train.bin` | 15GB | ~7.5B | High Performance Language Technologies ์น ๋ฐ์ดํฐ |
|
| 626 |
+
|
| 627 |
+
#### ํ๊ตญ์ด โ ๋ฐฑ๊ณผ์ฌ์ (Encyclopedia)
|
| 628 |
+
|
| 629 |
+
| ๋ฐ์ดํฐ์
| HuggingFace ID | ํ ํฐํ ํ์ผ | ํฌ๊ธฐ | ์ถ์ ํ ํฐ | ์ค๋ช
|
|
| 630 |
+
|----------|---------------|------------|------|----------|------|
|
| 631 |
+
| ์ํค๋ฐฑ๊ณผ ํ๊ตญ์ด | `wikimedia/wikipedia` (20231101.ko) | `wikipedia_ko_train.bin` | 566MB | ~283M | ํ๊ตญ์ด ์ํค๋ฐฑ๊ณผ ์ ์ฒด, ๊ตฌ์กฐํ๋ ๋ฌธ์ด์ฒด |
|
| 632 |
+
| ์ํค๋ฐฑ๊ณผ ํ๊ตญ์ด (v2) | `wikimedia/wikipedia` (ko) | `korean_wiki_train.bin` | 500MB | ~250M | ์ํค๋ฐฑ๊ณผ ๋ณ๋ ๋ฒ์ |
|
| 633 |
+
| ๋๋ฌด์ํค | `heegyu/namuwiki-extracted` | `korean_namuwiki_train.bin` | 2.1GB | ~1.05B | ๋๋ฌด์ํค ์ถ์ถ๋ณธ, ์๋ธ์ปฌ์ฒยท์์ฌ ํ๋ถ |
|
| 634 |
+
| ๋๋ฌด์ํค 2023b | `heegyu/namuwiki-extracted` (2023b) | `namuwiki_2023b_train.bin` | 2.5GB | ~1.25B | 2023๋
์
๋ฐ์ดํธ ์ค๋
์ท |
|
| 635 |
+
|
| 636 |
+
#### ์์ด/๋ค๊ตญ์ด โ ๊ต์ก (Educational)
|
| 637 |
+
|
| 638 |
+
| ๋ฐ์ดํฐ์
| HuggingFace ID | ํ ํฐํ ํ์ผ | ํฌ๊ธฐ | ์ถ์ ํ ํฐ | ์ค๋ช
|
|
| 639 |
+
|----------|---------------|------------|------|----------|------|
|
| 640 |
+
| Cosmopedia Stories | `HuggingFaceTB/cosmopedia` | `cosmo_stories_train.bin` | 5.9GB | ~2.95B | ํฉ์ฑ ๊ต์ก์ฉ ์คํ ๋ฆฌ |
|
| 641 |
+
| Cosmopedia Web v2 | `HuggingFaceTB/cosmopedia` | `cosmo_web_v2_train.bin` | 2.7GB | ~1.35B | ์น ๊ธฐ๋ฐ ๊ต์ก ํ
์คํธ |
|
| 642 |
+
| Cosmopedia Stanford | `HuggingFaceTB/cosmopedia` | `cosmo_stanford_train.bin` | 2.1GB | ~1.05B | Stanford ๊ฐ์ ๊ธฐ๋ฐ |
|
| 643 |
+
| Cosmopedia WikiHow | `HuggingFaceTB/cosmopedia` | `cosmo_wikihow_train.bin` | 382MB | ~191M | WikiHow ๊ฐ์ด๋ |
|
| 644 |
+
| Cosmopedia OpenStax | `HuggingFaceTB/cosmopedia` | `cosmo_openstax_train.bin` | 224MB | ~112M | ์คํ ๊ต๊ณผ์ |
|
| 645 |
+
| Cosmopedia Khan Academy | `HuggingFaceTB/cosmopedia` | `cosmo_khanacademy_train.bin` | 46MB | ~23M | ์นธ ์์นด๋ฐ๋ฏธ |
|
| 646 |
+
|
| 647 |
+
#### ์์ด/๋ค๊ตญ์ด โ ์ํยท๊ณผํ (Math & Science)
|
| 648 |
+
|
| 649 |
+
| ๋ฐ์ดํฐ์
| HuggingFace ID | ํ ํฐํ ํ์ผ | ํฌ๊ธฐ | ์ถ์ ํ ํฐ | ์ค๋ช
|
|
| 650 |
+
|----------|---------------|------------|------|----------|------|
|
| 651 |
+
| Open Web Math | `open-web-math/open-web-math` | `open_web_math_train.bin` | 4.8GB | ~2.4B | ์น์์ ์ถ์ถํ ์ํ ํ
์คํธ |
|
| 652 |
+
| MathPile | `GAIR/MathPile` | `mathpile_train.bin` | 2.9GB | ~1.45B | ์ํ ๊ต๊ณผ์ยท๋
ผ๋ฌธยทํฌ๋ผ |
|
| 653 |
+
| Cosmopedia AutoMath | `HuggingFaceTB/cosmopedia` | `cosmo_auto_math_text_train.bin` | 2.5GB | ~1.25B | ํฉ์ฑ ์ํ ๋ฌธ์ ยทํ์ด |
|
| 654 |
+
|
| 655 |
+
#### ํ๊ตญ์ด โ ํผํฉ (Legacy Merged)
|
| 656 |
+
|
| 657 |
+
| ๋ฐ์ดํฐ์
| ํ ํฐํ ํ์ผ | ํฌ๊ธฐ | ์ถ์ ํ ํฐ | ์ค๋ช
|
|
| 658 |
+
|----------|------------|------|----------|------|
|
| 659 |
+
| ์ด๊ธฐ ํผํฉ (C4+๋๋ฌด+์ํค) | `korean_train.bin` | 17GB | ~8.5B | 1B ํ์ต์ ์ฌ์ฉ๋ ์๋ณธ ํผํฉ ๋ฐ์ดํฐ |
|
| 660 |
+
| 125M ๊ฒ์ฆ์ฉ | `train.bin` | 1.2GB | ~600M | ์ต์ด FP8 ๊ฒ์ฆ์ ์ฌ์ฉ |
|
| 661 |
+
|
| 662 |
+
#### ๋ฏธ์ฌ์ฉ ์์ง ๋ฐ์ดํฐ (korean_extra/ โ 640GB+)
|
| 663 |
+
|
| 664 |
+
`data/korean_extra/` ์ 39๊ฐ ์๋ธ๋๋ ํ ๋ฆฌ๋ก ์์ง๋์์ผ๋, ํ ํฐํยท๋ณํฉ์ ์ผ๋ถ๋ง ์๋ฃ๋ ๋๊ท๋ชจ ์์ ๋ฐ์ดํฐ:
|
| 665 |
+
|
| 666 |
+
| ๋ถ๋ฅ | ๋ฐ์ดํฐ์
| ์ค๋ช
| ๋น๊ณ |
|
| 667 |
+
|------|----------|------|------|
|
| 668 |
+
| ์นํฌ๋กค | CulturaX Korean | ๋๊ท๋ชจ ๋ค๊ตญ์ด ์น ์ฝํผ์ค ํ๊ตญ์ด | ~50B+ tokens |
|
| 669 |
+
| ์นํฌ๋กค | FineWeb2 Educational Korean | ๊ต์ก์ ํ์ง ํํฐ๋ง ์น ๋ฐ์ดํฐ | 234GB raw |
|
| 670 |
+
| ์นํฌ๋กค | Korean Web Collection | KORMo ์น ์ปฌ๋ ์
| 175GB raw |
|
| 671 |
+
| ์นํฌ๋กค | OSCAR Korean | ๋ค๊ตญ์ด ์น ์ฝํผ์ค ํ๊ตญ์ด | |
|
| 672 |
+
| ๊ต์ก | Korean Textbooks | ํ๊ตญ์ด ๊ต๊ณผ์ ํ
์คํธ | 45๊ฐ ์๋ธ์นดํ
๊ณ ๋ฆฌ |
|
| 673 |
+
| ๊ต์ก | FinePDFs Educational Korean | PDF ๊ธฐ๋ฐ ๊ต์ก ์๋ฃ | |
|
| 674 |
+
| ๋ฒ๋ฅ | Korean Law | ํ๊ตญ ๋ฒ๋ฅ ํ
์คํธ | 15GB |
|
| 675 |
+
| ๋ด์ค | Korean News Archive | ํ๊ตญ์ด ๋ด์ค ์์นด์ด๋ธ | |
|
| 676 |
+
| ๊ณต๊ฐ์ฝํผ์ค | Korean Public Corpus | KORMo ๊ณต๊ฐ ์ฝํผ์ค | 26GB |
|
| 677 |
+
| ์ฝ๋ | Code Pretrain | ํ๋ก๊ทธ๋๋ฐ ์ฝ๋ | |
|
| 678 |
+
| ํ์ | Academic Pretrain | ํ์ ๋
ผ๋ฌธยท๋ฆฌํฌํธ | |
|
| 679 |
+
| ๋ฒ์ฉ | SlimPajama | RedPajama ๊ฒฝ๋ ๋ฒ์ | |
|
| 680 |
+
|
| 681 |
+
> ์ด ๋ฐ์ดํฐ๋ Extended Pretrain (80-100B tokens) ๋จ๊ณ์์ ํ์ฉ ์์ ์ด๋ค.
|
| 682 |
+
|
| 683 |
+
#### ํ๋ฆฌํธ๋ ์ธ ๋ฐ์ดํฐ ๋ถ์ผ๋ณ ๋น์จ
|
| 684 |
+
|
| 685 |
+
```
|
| 686 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 687 |
+
โ 3b_train.bin ํ ํฐ ๊ตฌ์ฑ (~38.5B) โ
|
| 688 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโค
|
| 689 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ํ๊ตญ์ด ์นํฌ๋กค 44.7% โ
|
| 690 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ํผํฉ ๋ ๊ฑฐ์ 22.1% โ
|
| 691 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ๊ต์ก (EN) 14.7% โ
|
| 692 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ์ํยท๊ณผํ 13.2% โ
|
| 693 |
+
โ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ๋ฐฑ๊ณผ์ฌ์ (KO) 5.3% โ
|
| 694 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 695 |
+
```
|
| 696 |
+
|
| 697 |
+
### 7.3 SFT ๋ฐ์ดํฐ โ 2.44M ์ํ (ํ์ฌ ํ์ต ์ค)
|
| 698 |
+
|
| 699 |
+
**24๊ฐ ์์ค**์์ 6.59M raw โ ํตํฉยท์ค๋ณต ์ ๊ฑฐ โ ํ์ง ํํฐ๋ง โ **2,439,397 train + 49,801 val**
|
| 700 |
+
|
| 701 |
+
#### ์ฃผ์ SFT ์์ค (์์ 12, ์ ์ฒด์ 96%)
|
| 702 |
+
|
| 703 |
+
| # | ๋ฐ์ดํฐ์
| ์ํ ์ | ํฌ๊ธฐ | ๋๋ฉ์ธ |
|
| 704 |
+
|---|---------|---------|------|--------|
|
| 705 |
+
| 1 | reasoning_r1_1.4m | 1,400,000 | 14.77 GB | ์ถ๋ก (CoT) |
|
| 706 |
+
| 2 | openhermes_2.5 | 1,001,551 | 1.82 GB | ์์ด ๋ค๋ชฉ์ |
|
| 707 |
+
| 3 | AI-MO_NuminaMath-CoT | 859,494 | 2.51 GB | ์ํ CoT |
|
| 708 |
+
| 4 | korean_instruction_mix | 515,911 | 1.39 GB | ํ๊ตญ์ด ํผํฉ |
|
| 709 |
+
| 5 | lemon-mint_smol-koreantalk | 460,281 | 5.23 GB | ํ๊ตญ์ด ๋ํ |
|
| 710 |
+
| 6 | open_korean_instructions | 375,159 | 0.73 GB | ํ๊ตญ์ด ์ง์ |
|
| 711 |
+
| 7 | magpie_reasoning_v2 | 249,922 | 3.99 GB | ์ถ๋ก (์์ด) |
|
| 712 |
+
| 8 | magpie_reasoning_ko | 224,929 | 3.19 GB | ์ถ๋ก (ํ๊ตญ์ด) |
|
| 713 |
+
| 9 | ultrachat_200k | 207,865 | 1.34 GB | ๋ํ |
|
| 714 |
+
| 10 | kuotient_orca-math-ko | 193,789 | 0.61 GB | ์ํ (ํ๊ตญ์ด) |
|
| 715 |
+
| 11 | data/sft/train.jsonl (์๋ณธ) | 161,848 | 0.27 GB | ์๋ณธ SFT |
|
| 716 |
+
| 12 | kullm_v2 | 152,630 | 0.42 GB | ํ๊ตญ์ด ์ง์ |
|
| 717 |
+
|
| 718 |
+
๊ธฐํ 12๊ฐ ์์ค: DeepMath-103K, Evol-Instruct-Code-80k-ko, ShareGPT-74k-ko, evol-instruct-korean, alpaca-gpt4-korean, ko_wikidata_QA, Ko.WizardLM, KOR-OpenOrca-Platypus-v3, korean-writing-style-instruct, ko_lima, koalpaca_v1_1a, OpenAssistant_oasst1_ko
|
| 719 |
+
|
| 720 |
+
#### ๋ฐ์ดํฐ ์ฒ๋ฆฌ ํ์ดํ๋ผ์ธ
|
| 721 |
+
|
| 722 |
+
```
24 sources (6.59M raw)
    ↓ prepare_sft_combined.sh (format unification, MD5 dedup, 98:2 split)
Combined: 2,559,492 train + 52,234 val (7.95 GB)
    ↓ filter_sft_v2.py (5 stages: EOS strip, QA marker removal, length 50~20K, drop >30% 4-gram repetition)
Final: 2,439,397 train + 49,801 val (7.63 GB), removal rate 4.69%
```
|
| 729 |
+
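The 4-gram repetition stage can be sketched as follows. The exact metric used by `filter_sft_v2.py` is not shown in this document, so this version makes two labeled assumptions: tokens are whitespace-split words, and the repetition ratio is the share of 4-grams that duplicate an earlier 4-gram. Both function names are hypothetical.

```python
def ngram_repetition_ratio(tokens, n=4):
    """Fraction of n-grams that repeat an n-gram already seen in the sequence."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)


def keep_sample(text, n=4, max_ratio=0.30):
    # Whitespace tokens are an assumption; the real filter may use tokenizer IDs.
    return ngram_repetition_ratio(text.split(), n) <= max_ratio


print(keep_sample("the quick brown fox jumps over the lazy dog"))
print(keep_sample("a b c d " * 10))
```

The same ratio with n=3 is one plausible way to compute the greedy 3-gram repetition rate reported in the evaluation tables, though the evaluation code may define it differently.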
|
| 730 |
+
#### ๋๋ฉ์ธ ๋น์จ
|
| 731 |
+
|
| 732 |
+
```
|
| 733 |
+
์ถ๋ก /CoT 38.0% โโโโโโโโโโโโโโโโโโโโโโโโ
|
| 734 |
+
ํ๊ตญ์ด ์ง์ 22.5% โโโโโโโโโโโโโโ
|
| 735 |
+
์์ด ๋ค๋ชฉ์ 16.0% โโโโโโโโโโ
|
| 736 |
+
์ํ 12.0% โโโโโโโโ
|
| 737 |
+
๋ํ/์ฝ๋/๊ธฐํ 11.5% โโโโโโโ
|
| 738 |
+
```
|
| 739 |
+
|
| 740 |
+
### 7.4 ์ ํธ๋ ๋ฐ์ดํฐ (ORPO์ฉ) โ 795K ์
|
| 741 |
+
|
| 742 |
+
์ด **795,468 preference pairs** (7.9GB, `data/preference/combined_preference.jsonl`)
|
| 743 |
+
|
| 744 |
+
| HuggingFace ID | ํฌ๊ธฐ | ๋ถ์ผ | ํฌ๋งท |
|
| 745 |
+
|---------------|------|------|------|
|
| 746 |
+
| `nayohan/preference-collection-ko-full` | 4.9GB | ๋ฒ์ฉ ์ ํธ๋ ํ๊ฐ | instruction + response_A/B + preference |
|
| 747 |
+
| `heegyu/orca-math-korean-preference-cleaned` | 1.6GB | ์ํ ์ถ๋ก | prompt + chosen + rejected |
|
| 748 |
+
| `kuotient/orca-math-korean-dpo-pairs` | 750MB | ์ํ DPO | prompt + chosen + rejected |
|
| 749 |
+
| `maywell/ko_Ultrafeedback_binarized` | 394MB | ํผ๋๋ฐฑ ๊ธฐ๋ฐ ์ ๋ ฌ | prompt + winning/losing response |
|
| 750 |
+
| `tellang/yeji-preference-ko-v1` | 171MB | ๋ฒ์ฉ ์ ํธ๋ | prompt + chosen + rejected |
|
| 751 |
+
| `jojo0217/korean_rlhf_dataset` | 137MB | RLHF ์ | prompt + chosen + rejected |
|
| 752 |
+
| `lemon-mint/korean-realqa-reasoning-v01-preference` | 58MB | QA ์ถ๋ก | prompt + chosen + rejected |
|
| 753 |
+
|
| 754 |
+
**ํํฐ๋ง ๊ธฐ์ค**: ์ต์ ๊ธธ์ด 20์, EOS ์ ๊ฑฐ, ํฌ๋งท ์ ๊ทํ ํ ํตํฉ
|
| 755 |
+
|
| 756 |
+
> ORPO๋ Phase 3์์ ๋ฐ๋ณต๋ฅ ์ด 5% ์ด๊ณผํ ๊ฒฝ์ฐ์๋ง ์คํํ๋ค. 3B ๋ชจ๋ธ์ด 1B์ ๊ตฌ์กฐ์ ๋ฐ๋ณต ๋ฌธ์ ๋ฅผ ์ค์ค๋ก ํด๊ฒฐํ๋ค๋ฉด ORPO ์์ด ๋ฐฐํฌํ ์ ์๋ค.
|
| 757 |
+
|
| 758 |
+
### 7.5 ๋ฐ์ดํฐ ํ์ดํ๋ผ์ธ ์์ฝ
|
| 759 |
+
|
| 760 |
+
```
|
| 761 |
+
[HuggingFace / ์น ์์ง]
|
| 762 |
+
โ
|
| 763 |
+
โผ
|
| 764 |
+
โโโโ ์์ ์์ง โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 765 |
+
โ korean_extra/ (39๊ฐ ๋๋ ํ ๋ฆฌ, 640GB+) โ
|
| 766 |
+
โ sft_extra/ (27๊ฐ ๋๋ ํ ๋ฆฌ, 1.08M ์ํ) โ
|
| 767 |
+
โ preference/ (7๊ฐ JSONL, 795K ์) โ
|
| 768 |
+
๏ฟฝ๏ฟฝโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 769 |
+
โ
|
| 770 |
+
โผ
|
| 771 |
+
โโโโ ํ ํฐํ (SentencePiece 64K) โโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 772 |
+
โ tokenize_extra.py โ ์๋ ํฌ๋งท ๊ฐ์ง (Arrow/Parquet/JSONL) โ
|
| 773 |
+
โ 8 workers ๋ณ๋ ฌ ์ฒ๋ฆฌ, uint16 memmap (.bin) ์ถ๋ ฅ โ
|
| 774 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 775 |
+
โ
|
| 776 |
+
โผ
|
| 777 |
+
โโโโ ์ต์ข
๋ณํฉ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 778 |
+
โ Pretrain: 3b_train.bin (77GB, ~38.5B tokens) โ
|
| 779 |
+
โ SFT: sft_combined/train_filtered.jsonl (7.48GB, 2.44M ์ํ) โ
|
| 780 |
+
โ ORPO: preference/combined_preference.jsonl (7.9GB) โ
|
| 781 |
+
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
|
| 782 |
+
```
|
| 783 |
+
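The "estimated tokens" figures in the tables above follow from the uint16 output format: each token occupies two bytes, so a 77GB file holds about 38.5B tokens. A minimal sketch, with hypothetical helper names and sizes treated as decimal gigabytes:

```python
import os

BYTES_PER_TOKEN = 2  # uint16 token IDs; enough for a 64,000-entry vocabulary


def estimate_tokens(path):
    """Number of uint16 tokens stored in a raw .bin file."""
    return os.path.getsize(path) // BYTES_PER_TOKEN


def tokens_from_gb(size_gb):
    """Same estimate from a size given in decimal gigabytes."""
    return size_gb * 1e9 / BYTES_PER_TOKEN


print(f"{tokens_from_gb(77) / 1e9:.1f}B tokens")
```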
|
| 784 |
+
---
|
| 785 |
+
|
| 786 |
+
## 8. ํ์ต ์ค์ ๋ฐ ์ต์ ํ
|
| 787 |
+
|
| 788 |
+
### ํ์ฌ ํ์ต ์ค์ (`configs/korean_3b_fp8.yaml`)
|
| 789 |
+
|
| 790 |
+
```yaml
model:
  vocab_size: 64000
  d_model: 3072
  n_layers: 28
  n_heads: 24
  n_kv_heads: 8
  d_ffn: 8192
  max_seq_len: 2048
  rope_theta: 500000.0

training:
  batch_size: 5
  gradient_accumulation_steps: 8
  learning_rate: 1.5e-4
  min_lr: 1.5e-5
  warmup_steps: 2000
  max_steps: 57000
  weight_decay: 0.1
  grad_clip: 1.0
  optimizer: adamw
  scheduler: cosine

fp8:
  enabled: true
  recipe: "mxfp8"
  use_transformer_engine: true

distributed:
  strategy: ddp
  gradient_as_bucket_view: true
  find_unused_parameters: false

nccl:
  timeout_seconds: 7200
  nvls_enabled: true
```
|
| 827 |
+
Effective batch size = `batch_size(5) × grad_accum(8) × num_gpus(8)` = **320**

LR schedule: 2,000 warmup steps → cosine decay → min_lr=1.5e-5 (10% of max_lr)
|
| 831 |
+
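The schedule described above can be written out as a function of the step. This sketch uses the values from the YAML config and assumes linear warmup followed by cosine decay over the remaining steps. The trainer's actual implementation is not shown here and may differ in detail.

```python
import math


def lr_at(step, max_lr=1.5e-4, min_lr=1.5e-5, warmup_steps=2000, max_steps=57_000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (1000, 2000, 29_500, 57_000):
    print(f"step {step:>6,}: lr = {lr_at(step):.3e}")
```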
|
| 832 |
+
### Phase 0์์ ๋ฐฐ์ด ์ต์ ํ ๊ตํ
|
| 833 |
+
|
| 834 |
+
#### GQA FlashAttention Native
|
| 835 |
+
The optimization that yielded the largest VRAM saving. The key point is that FlashAttention supports GQA natively. Expanding the KV heads so attention runs like MHA forces a memory copy, whereas the native path handles the head grouping internally.

```python
from flash_attn import flash_attn_func

# Before (inefficient): expand KV heads so attention runs like MHA.
# Tensors are laid out as [B, S, H, D], so the head axis is dim=2.
k_exp = k.repeat_interleave(n_heads // n_kv_heads, dim=2)
v_exp = v.repeat_interleave(n_heads // n_kv_heads, dim=2)
out = flash_attn_func(q, k_exp, v_exp)

# After (native GQA): flash_attn handles the head grouping internally.
out = flash_attn_func(q, k, v)  # q: [B, S, H, D], k/v: [B, S, Hkv, D]
# VRAM 60.4GB → 48.3GB (-20%)
```
|
| 848 |
+
|
| 849 |
+
#### DDP ์ต์ ํ
|
| 850 |
+
|
| 851 |
+
```python
import torch

# gradient_as_bucket_view=True: gradients become views into the DDP bucket memory
# -> removes a redundant memory copy; GPU-CPU synchronization overhead -87.5%
model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    gradient_as_bucket_view=True,
    find_unused_parameters=False,  # every parameter is used
)
```

**Note**: `static_graph=True` is not used. TransformerEngine's `te.Linear` requires a dynamic graph in some cases, and enabling static_graph causes a runtime error.
|
| 863 |
+
|
| 864 |
+
#### NCCL NVLS
|
| 865 |
+
|
| 866 |
+
```bash
export NCCL_ALGO=NVLSTree    # enable NVLink SHARP (NVLS)
export NCCL_PROTO=Simple
export NCCL_P2P_DISABLE=0
export NCCL_TIMEOUT=7200     # timeout headroom for long backward passes
```

Because NVSwitch supports all-to-all communication in a single hop, NVLSTree is more efficient than a Ring topology.
|
| 874 |
+
|
| 875 |
+
#### SIGHUP triple defense

In long training runs, a dropped session (SIGHUP) is fatal. Three layers of protection were put in place:

```bash
# Layer 1: nohup + setsid (new session group)
nohup setsid torchrun --nproc_per_node=8 train/pretrain.py ... &
```

```python
# Layer 2: Python signal handler (ignore SIGHUP at the Python level)
import signal
import sys

signal.signal(signal.SIGHUP, signal.SIG_IGN)

# Layer 3: emergency checkpoint (save a checkpoint on SIGTERM as well)
# save_checkpoint, model, optimizer and step come from the training script.
def emergency_save(signum, frame):
    save_checkpoint(model, optimizer, step, "emergency")
    sys.exit(0)

signal.signal(signal.SIGTERM, emergency_save)
```
|
| 893 |
+
|
| 894 |
+
#### torch.compile: test result, no effect

`torch.compile` was applied in the hope of a speedup, but the measured result was **1.00x (no effect)**. Two reasons:

1. TransformerEngine kernels are opaque, so graph breaks occur. `torch.compile` optimizes the graph of Python-level operations, and the TE kernels sit outside that graph.
2. The `/tmp` directory is mounted with the `noexec` flag, so the compiled kernels could not be cached.

**Lesson**: understanding first whether and why something should help matters more than "let's just try it".
|
| 902 |
+
|
| 903 |
+
### ๋ชจ๋ํฐ๋ง ์์คํ
|
| 904 |
+
|
| 905 |
+
```
|
| 906 |
+
ํ
๋ ๊ทธ๋จ ์๋ฆผ ์์คํ
|
| 907 |
+
โโโ B200Bot (token ๏ฟฝ๏ฟฝ๏ฟฝ์ ๋จ)
|
| 908 |
+
โโโ training_watchdog.sh โ 10๋ถ ๊ฐ๊ฒฉ cron
|
| 909 |
+
โ โโโ loss ์ด์, ํ๋ก์ธ์ค ์ข
๋ฃ ๊ฐ์ง โ ์ฆ์ ์๋ฆผ
|
| 910 |
+
โโโ hourly_status.sh โ 1์๊ฐ ๊ฐ๊ฒฉ cron
|
| 911 |
+
โโโ step, loss, ์๋, VRAM, eta โ ์ ๊ธฐ ๋ฆฌํฌํธ
|
| 912 |
+
```
|
| 913 |
+
|
| 914 |
+
```python
# curl is blocked on this host, so urllib is used instead
import json
import urllib.request

def send_telegram(message):
    # TOKEN and CHAT_ID are defined elsewhere (bot token and target chat)
    url = f"https://api.telegram.org/bot{TOKEN}/sendMessage"
    data = json.dumps({"chat_id": CHAT_ID, "text": message}).encode()
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)
```
|
| 925 |
+
|
| 926 |
+
---
|
| 927 |
+
|
| 928 |
+
## 9. ์คํ ๊ฒฐ๊ณผ โ 1B ๋ฒ ์ด์ค๋ผ์ธ
|
| 929 |
+
|
| 930 |
+
1B ๋ชจ๋ธ์ ์คํ ๊ฒฐ๊ณผ๋ฅผ ์ ์งํ๊ฒ ๊ธฐ๋กํ๋ค. ์ฑ๊ณต๊ณผ ์คํจ ๋ชจ๋.
|
| 931 |
+
|
| 932 |
+
### ํ๋ฆฌํธ๋ ์ธ ๊ฒฐ๊ณผ
|
| 933 |
+
|
| 934 |
+
| ์งํ | ๊ฐ |
|
| 935 |
+
|------|-----|
|
| 936 |
+
| ์ต์ข
Loss | 1.904 |
|
| 937 |
+
| PPL (C4 Korean) | 5.67 |
|
| 938 |
+
| ํ์ต ์คํ
| 34,000 |
|
| 939 |
+
| ํ์ต ์๊ฐ | ~2์ผ |
|
| 940 |
+
|
| 941 |
+
### SFT v1 ๊ฒฐ๊ณผ โ ์คํจ
|
| 942 |
+
|
| 943 |
+
| ์งํ | ๊ฐ |
|
| 944 |
+
|------|-----|
|
| 945 |
+
| val_loss | 0.0 (๋น์ ์) |
|
| 946 |
+
| ์์ธ | label off-by-one ๋ฒ๊ทธ (๋ฐ์ดํฐ ๋์) |
|
| 947 |
+
| ๊ฒฐ๋ก | ์ ๋ฉด ํ๊ธฐ |
|
| 948 |
+
|
| 949 |
+
### SFT v2 ๊ฒฐ๊ณผ โ ๋ถ๋ถ ์ฑ๊ณต
|
| 950 |
+
|
| 951 |
+
| ์งํ | ๊ฐ |
|
| 952 |
+
|------|-----|
|
| 953 |
+
| val_loss | 2.2062 |
|
| 954 |
+
| ๋ฐ๋ณต๋ฅ | 18% (rep_penalty=1.1 ์ ์ฉ) |
|
| 955 |
+
| kobest_copa | 0.646 |
|
| 956 |
+
| ๊ฒฐ๋ก | ๊ธฐ๋ฅํ์ง๋ง ๊ตฌ์กฐ์ ํ๊ณ ์กด์ฌ |
|
| 957 |
+
|
| 958 |
+
### 3B ๊ธฐ๋ ๋ชฉํ์น (์ค์ผ์ผ๋ง ๋ฒ์น ๊ธฐ๋ฐ ์์ธก)
|
| 959 |
+
|
| 960 |
+
| ๋ฒค์น๋งํฌ | 1B ํ์ฌ | 3B ๋ชฉํ |
|
| 961 |
+
|----------|---------|---------|
|
| 962 |
+
| kobest_copa | 0.646 | >0.72 |
|
| 963 |
+
| kobest_hellaswag | ~0.42 | >0.52 |
|
| 964 |
+
| ๋ฐ๋ณต๋ฅ | 18% | <5% |
|
| 965 |
+
| PPL (C4 Korean) | 5.67 | <4.5 |
|
| 966 |
+
|
| 967 |
+
1B์์ 3B๋ก์ ์ค์ผ์ผ์
์ ๋จ์ํ ํ๋ผ๋ฏธํฐ๋ฅผ ๋๋ฆฌ๋ ๊ฒ์ด ์๋๋ค. ๋ชจ๋ธ์ด ๋ ๊ธด ๋งฅ๋ฝ์ ๊ธฐ์ตํ๊ณ , ๋ ๋ค์ํ ํจํด์ ํ์ตํ ์ ์์ด์ผ ๋ฐ๋ณต๋ฅ ์ด ๊ตฌ์กฐ์ ์ผ๋ก ๋ฎ์์ง๋ค. 3B ๋ชฉํ์น๋ Chinchilla ์ค์ผ์ผ๋ง ๊ณก์ ๊ณผ ์ ์ฌ ๊ท๋ชจ ๋ชจ๋ธ๋ค์ ๋ฒค์น๋งํฌ๋ฅผ ์ฐธ๊ณ ํ ์์ธก๊ฐ์ด๋ค.
|
| 968 |
+
|
| 969 |
+
---
|
| 970 |
+
|
| 971 |
+
## 10. ์คํ ๊ฒฐ๊ณผ โ 3B Base ์ข
ํฉ ํ๊ฐ (v2)
|
| 972 |
+
|
| 973 |
+
3B ์ฌ์ ํ์ต ์๋ฃ ํ checkpoint-0057000 ๊ธฐ์ค์ผ๋ก ์ํํ ์ข
ํฉ ํ๊ฐ.
|
| 974 |
+
v2 ์ฌํ๊ฐ๋ 8-GPU ๋ณ๋ ฌ ํ์ดํ๋ผ์ธ์ผ๋ก 13+ ๋ฒค์น๋งํฌ, 0/5-shot ๋น๊ต, calibration, ์ฐธ๊ณ ๋ชจ๋ธ ๋น๊ต๋ฅผ ํฌํจํ๋ค.
|
| 975 |
+
์ด ์์ ์๊ฐ 256.6์ด.
|
| 976 |
+
|
| 977 |
+
> **v1 โ v2 ๋ณ๊ฒฝ์ **: v1(์ด๊ธฐ ํ๊ฐ)์์๋ PPL 3๊ฐ ๋ฐ์ดํฐ์
+ belebele/MMLU 2๊ฐ ๋ฒค์น๋งํฌ๋ง ์ธก์ ํ๋ค. v2๋ PPL 19๊ฐ ๋ฐ์ดํฐ์
, KoBEST 5๊ฐ, HAE-RAE ์ ์ฒด, MMLU-KO 6์นดํ
๊ณ ๋ฆฌ, MMLU-EN 61๊ณผ๋ชฉ, ์์ด 5๋ ๋ฒค์น๋งํฌ, Calibration, 0/5-shot ๋น๊ต, 12์กฐํฉ ํ๋ผ๋ฏธํฐ ๊ทธ๋ฆฌ๋ ์์น๋ฅผ ํฌํจํ๋ค.
|
| 978 |
+
|
| 979 |
+
### 10.1 ํ์ต ์ปค๋ธ
|
| 980 |
+
|
| 981 |
+
| Step | Loss | LR | ๋น๊ณ |
|
| 982 |
+
|------|------|----|------|
|
| 983 |
+
| 10 | 11.657 | 1.50e-06 | ์ด๊ธฐ (warmup ์์) |
|
| 984 |
+
| 500 | 5.047 | 7.50e-05 | warmup ์งํ |
|
| 985 |
+
| 2,000 | 2.851 | 3.00e-04 | warmup ์๋ฃ, peak LR |
|
| 986 |
+
| 10,000 | 2.057 | 2.86e-04 | ์์ ํ๊ฐ |
|
| 987 |
+
| 30,000 | 1.789 | 1.61e-04 | ์ค๋ฐ, epoch 1 ์ง์
|
|
| 988 |
+
| 57,000 | 1.466 | 3.00e-05 | ์ต์ข
(cosine min) |
|
| 989 |
+
|
| 990 |
+
> ์ฒ๋ฆฌ ์๋๋ ์ ๊ตฌ๊ฐ 36~38K tok/s๋ก ์์ . ์ด ํ์ต ์๊ฐ ์ฝ 63์๊ฐ.
|
| 991 |
+
|
| 992 |
+
### Base Model ๋ฐฑ์
|
| 993 |
+
|
| 994 |
+
| ํญ๋ชฉ | ๊ฐ |
|
| 995 |
+
|------|-----|
|
| 996 |
+
| ์๋ณธ ์ฒดํฌํฌ์ธํธ | `checkpoints/korean_3b_fp8_run1/checkpoint-0057000/` (34GB) |
|
| 997 |
+
| ๋ฐฑ์
| `checkpoints/korean_3b_fp8_run1/checkpoint-0057000_BASE_BACKUP/` |
|
| 998 |
+
| MD5 ๊ฒ์ฆ | `4f493d7bcc843727d32453bb3a4e6b7d` (์ผ์น ํ์ธ) |
|
| 999 |
+
| HF ๋ณํ | `eval/outputs/hf_3b_base/` (11GB safetensors) |
|
| 1000 |
+
|
| 1001 |
+
### 10.2 PPL (Perplexity) โ 19๊ฐ ๋ฐ์ดํฐ์
|
| 1002 |
+
|
| 1003 |
+
**์ฃผ์ PPL (3b_val ํตํฉ): 5.2263** (์ด๊ธฐ v1 ํ๊ฐ: 5.709)
|
| 1004 |
+
| Dataset | PPL | Bits/Token | Eval tokens | Elapsed time |
|---------|-----|-----------|---------|---------|
| korean_namuwiki | 25.88 | 4.694 | 6.5M | 63.7s |
| cc100_ko | 21.78 | 4.445 | 13.6M | 133.2s |
| namuwiki_2023b | 18.92 | 4.242 | 7.7M | 75.1s |
| val | 18.30 | 4.194 | 9.1M | 89.4s |
| korean_wiki | 11.84 | 3.565 | 1.6M | 15.5s |
| wikipedia_ko | 10.71 | 3.420 | 1.8M | 17.4s |
| korean | 7.02 | 2.811 | 53.5M | 521.6s |
| open_web_math | 6.93 | 2.792 | 15.7M | 153.5s |
| **korean_c4** | **5.72** | **2.515** | **45.4M** | **443.1s** |
| **3b (combined)** | **5.23** | **2.386** | **226.9M** | **2227.3s** |
| cosmo_web_v2 | 4.17 | 2.059 | 8.6M | 84.6s |
| cosmo_stories | 3.96 | 1.984 | 18.9M | 185.2s |
| cosmo_openstax | 3.87 | 1.951 | 0.7M | 7.2s |
| cosmo_stanford | 3.36 | 1.750 | 6.6M | 65.3s |
| cosmo_wikihow | 3.31 | 1.727 | 1.2M | 11.8s |
| cosmo_auto_math_text | 3.15 | 1.655 | 7.9M | 77.3s |
| cosmo_khanacademy | 2.93 | 1.552 | 0.1M | 1.5s |
| mathpile | 2.72 | 1.446 | 7.1M | 69.9s |
| hplt_ko | 2.40 | 1.265 | 48.5M | 475.9s |

> **Interpretation**: PPL is low on in-distribution data that was part of training (hplt_ko: 2.40, mathpile: 2.72) and high on OOD data with a low training share (cc100_ko: 21.78, namuwiki: 25.88), which is the expected pattern. The korean_c4 value of 5.72 matches the v1 result of 5.717, confirming that the evaluation is reproducible.
|
| 1028 |
+
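The Bits/Token column is the base-2 logarithm of the PPL column, and PPL itself is the exponential of the mean per-token negative log-likelihood in nats. A short sketch of both conversions, with illustrative function names:

```python
import math


def bits_per_token(ppl):
    """Bits/Token is the base-2 logarithm of perplexity."""
    return math.log2(ppl)


def ppl_from_nll(mean_nll):
    """Perplexity from a mean negative log-likelihood in nats."""
    return math.exp(mean_nll)


for name, ppl in [("3b (combined)", 5.2263), ("korean_c4", 5.72), ("hplt_ko", 2.40)]:
    print(f"{name}: {bits_per_token(ppl):.3f} bits/token")
```

Small differences in the last digit against the table come from the PPL values being rounded to two decimals.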
|
| 1029 |
+
### 10.3 ํ๊ตญ์ด ๋ฒค์น๋งํฌ
|
| 1030 |
+
|
| 1031 |
+
#### KoBEST (0-shot) โ ํ๊ท 43.69%
|
| 1032 |
+
|
| 1033 |
+
| ํ์คํฌ | Accuracy | F1 |
|
| 1034 |
+
|--------|----------|-----|
|
| 1035 |
+
| kobest_boolq | 50.28% | 0.3457 |
|
| 1036 |
+
| kobest_copa | 49.30% | 0.4921 |
|
| 1037 |
+
| kobest_hellaswag | 21.60% | 0.2153 |
|
| 1038 |
+
| kobest_sentineg | 48.61% | 0.4737 |
|
| 1039 |
+
| kobest_wic | 48.65% | 0.3286 |
|
| 1040 |
+
| **ํ๊ท ** | **43.69%** | |
|
| 1041 |
+
|
| 1042 |
+
#### HAE-RAE (0-shot) โ ์ ์ฒด 19.71%
|
| 1043 |
+
|
| 1044 |
+
| ์๋ธํ์คํฌ | Accuracy |
|
| 1045 |
+
|-----------|----------|
|
| 1046 |
+
| haerae_general_knowledge | 21.59% |
|
| 1047 |
+
| haerae_history | 23.40% |
|
| 1048 |
+
| haerae_loan_word | 21.30% |
|
| 1049 |
+
| haerae_rare_word | 18.77% |
|
| 1050 |
+
| haerae_standard_nomenclature | 13.73% |
|
| 1051 |
+
| **์ ์ฒด** | **19.71%** |
|
| 1052 |
+
|
| 1053 |
+
#### MMLU-KO (0-shot) โ 6์นดํ
๊ณ ๋ฆฌ ํ๊ท 22.75%
|
| 1054 |
+
|
| 1055 |
+
| ์นดํ
๊ณ ๋ฆฌ | Accuracy |
|
| 1056 |
+
|----------|----------|
|
| 1057 |
+
| medical | 30.56% |
|
| 1058 |
+
| humanities | 24.51% |
|
| 1059 |
+
| business | 24.14% |
|
| 1060 |
+
| social_sciences | 20.59% |
|
| 1061 |
+
| other | 19.64% |
|
| 1062 |
+
| stem | 19.57% |
|
| 1063 |
+
| **ํ๊ท ** | **22.75%** |
|
| 1064 |
+
|
| 1065 |
+
> Base model์ instruction-following ์์ด 4์ง์ ๋ค ํ์ ๋ฒค์น๋งํฌ๋ฅผ ํ๋๋ก ์ต์ ํ๋์ง ์์. KoBEST boolq/copa/sentineg/wic๋ ~50% ์์ค์ผ๋ก 2์ง/4์ง์ ๋ค ๋๋ค ๊ธฐ์ค ๋ถ๊ทผ์ด๋ฉฐ, SFT ํ ํฅ์ ๊ธฐ๋.
|
| 1066 |
+
|
| 1067 |
+
### 10.4 ์์ด ๋ฒค์น๋งํฌ
|
| 1068 |
+
|
| 1069 |
+
#### ์ฃผ์ ๋ฒค์น๋งํฌ (0-shot)
|
| 1070 |
+
|
| 1071 |
+
| ํ์คํฌ | Accuracy | Acc (norm) |
|
| 1072 |
+
|--------|----------|-----------|
|
| 1073 |
+
| hellaswag | 26.00% | 26.15% |
|
| 1074 |
+
| arc_easy | 25.63% | 26.64% |
|
| 1075 |
+
| arc_challenge | 21.67% | 27.90% |
|
| 1076 |
+
| winogrande | 50.59% | โ |
|
| 1077 |
+
| piqa | 52.50% | 48.31% |
|
| 1078 |
+
|
| 1079 |
+
> winogrande(50.59%)์ piqa(52.50%)๋ 2์ง์ ๋ค๋ก ๋๋ค ๊ธฐ์ค 50%์ ๊ทผ์ . hellaswag/arc๋ 4์ง์ ๋ค๋ก ๋๋ค ๊ธฐ์ค 25%.
|
| 1080 |
+
|
| 1081 |
+
#### MMLU-EN (0-shot): 61-subject average 25.81%

**Top 10 subjects**:

| Subject | Accuracy |
|------|----------|
| college_physics | 37.25% |
| college_computer_science | 34.00% |
| high_school_statistics | 33.80% |
| us_foreign_policy | 32.00% |
| security_studies | 31.43% |
| world_religions | 30.99% |
| professional_medicine | 30.88% |
| high_school_government_and_politics | 30.57% |
| jurisprudence | 30.56% |
| human_sexuality | 30.53% |

**Bottom 5 subjects**:

| Subject | Accuracy |
|------|----------|
| human_aging | 19.73% |
| college_biology | 19.44% |
| anatomy | 17.04% |
| global_facts | 17.00% |
| abstract_algebra | 15.00% |

### 10.5 Calibration

| Metric | Value |
|--------|-----|
| Top-1 Accuracy | 68.75% |
| Top-5 Accuracy | 81.64% |
| Top-10 Accuracy | 85.93% |
| Mean Correct Prob | 0.6152 |
| Mean Entropy | 1.5682 |

**Token NLL distribution**:

| Statistic | Value |
|------|-----|
| Mean NLL | 1.5561 |
| Std dev | 2.4926 |
| Median | 0.1221 |
| p95 | 7.0312 |
| p99 | 10.3125 |
| Share of tokens with NLL > 5 | 10.86% |
| Share of tokens with NLL > 10 | 1.18% |

> Top-1 68.75% means the model's most confident prediction is correct about 69% of the time. The median NLL of 0.12 (≈ e^0.12 = 1.13 PPL) shows that most tokens are predicted with very high confidence, while a small number of hard tokens pull the mean NLL up. This is a typical distribution.

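The link between the NLL figures above and perplexity can be checked with a few lines of Python. This is a minimal sketch that only reuses the numbers from the Token NLL table; the helper name `nll_to_ppl` is an illustrative assumption and is not part of the evaluation code.

```python
import math

def nll_to_ppl(nll: float) -> float:
    """Convert a natural-log negative log-likelihood into perplexity."""
    return math.exp(nll)

# Values copied from the Token NLL distribution table.
mean_nll = 1.5561
median_nll = 0.1221

mean_ppl = nll_to_ppl(mean_nll)      # perplexity implied by the mean NLL
median_ppl = nll_to_ppl(median_nll)  # perplexity of a "typical" token

print(f"PPL from mean NLL:   {mean_ppl:.2f}")
print(f"PPL from median NLL: {median_ppl:.2f}")
```

The gap between the two values shows how strongly the tail of hard tokens dominates the mean.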
### 10.6 0-shot vs 5-shot Comparison

0-shot and 5-shot performance is compared on 18 Korean tasks.

| Task | 0-shot | 5-shot | Change |
|--------|--------|--------|------|
| global_mmlu_ko | 22.75% | 26.75% | **+4.00pp** |
| global_mmlu_ko_business | 24.14% | 31.03% | **+6.90pp** |
| global_mmlu_ko_humanities | 24.51% | 28.43% | +3.92pp |
| global_mmlu_ko_medical | 30.56% | 36.11% | **+5.56pp** |
| global_mmlu_ko_other | 19.64% | 23.21% | +3.57pp |
| global_mmlu_ko_social_sciences | 20.59% | 23.53% | +2.94pp |
| global_mmlu_ko_stem | 19.57% | 21.74% | +2.17pp |
| haerae | 19.71% | 20.26% | +0.55pp |
| haerae_general_knowledge | 21.59% | 22.73% | +1.14pp |
| haerae_history | 23.40% | 14.89% | -8.51pp |
| haerae_loan_word | 21.30% | 24.26% | +2.96pp |
| haerae_rare_word | 18.77% | 18.02% | -0.74pp |
| haerae_standard_nomenclature | 13.73% | 25.49% | **+11.76pp** |
| kobest_boolq | 50.28% | 50.21% | -0.07pp |
| kobest_copa | 49.30% | 46.80% | -2.50pp |
| kobest_hellaswag | 21.60% | 20.80% | -0.80pp |
| kobest_sentineg | 48.61% | 47.86% | -0.76pp |
| kobest_wic | 48.65% | 48.97% | +0.32pp |

**Average change: +1.80pp** | Improved: 12 | Declined: 6

> MMLU-KO improves consistently under 5-shot (+2 to +7pp), confirming that in-context learning is working. KoBEST barely moves or drops slightly: the model is already pattern matching at 0-shot, so the few-shot examples appear to get in the way. The +11.76pp on haerae_standard_nomenclature comes from picking up this task's unusual format from the few-shot examples.

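The summary line (average change and improved/declined counts) can be reproduced from the table with a short script. This is a sketch that recomputes the deltas from the rounded percentages shown above, so individual deltas may differ from the table by 0.01pp; the variable names are illustrative.

```python
# (task, 0-shot %, 5-shot %) copied from the comparison table.
results = [
    ("global_mmlu_ko", 22.75, 26.75),
    ("global_mmlu_ko_business", 24.14, 31.03),
    ("global_mmlu_ko_humanities", 24.51, 28.43),
    ("global_mmlu_ko_medical", 30.56, 36.11),
    ("global_mmlu_ko_other", 19.64, 23.21),
    ("global_mmlu_ko_social_sciences", 20.59, 23.53),
    ("global_mmlu_ko_stem", 19.57, 21.74),
    ("haerae", 19.71, 20.26),
    ("haerae_general_knowledge", 21.59, 22.73),
    ("haerae_history", 23.40, 14.89),
    ("haerae_loan_word", 21.30, 24.26),
    ("haerae_rare_word", 18.77, 18.02),
    ("haerae_standard_nomenclature", 13.73, 25.49),
    ("kobest_boolq", 50.28, 50.21),
    ("kobest_copa", 49.30, 46.80),
    ("kobest_hellaswag", 21.60, 20.80),
    ("kobest_sentineg", 48.61, 47.86),
    ("kobest_wic", 48.65, 48.97),
]

# Change in percentage points for each task.
deltas = [five_shot - zero_shot for _, zero_shot, five_shot in results]

mean_change = sum(deltas) / len(deltas)
improved = sum(1 for d in deltas if d > 0)
declined = sum(1 for d in deltas if d < 0)

print(f"mean change: {mean_change:+.2f}pp, improved: {improved}, declined: {declined}")
```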
### 10.7 Reference Model Comparison

| Model | Parameters | MMLU-KO | MMLU-EN | KoBEST avg | PPL |
|------|---------|---------|---------|------------|-----|
| **FRANKENSTALLM 3B** | **3B** | **22.75%** | **25.81%** | **43.69%** | **5.2263** |
| Llama-3.2-3B | 3B | ~42% | ~58% | ~55% | - |
| Qwen2.5-3B | 3B | ~48% | ~65% | ~60% | - |
| EXAONE-3.5-2.4B | 2.4B | ~35% | ~50% | ~50% | - |

> The reference models are the product of training data on the scale of trillions of tokens and thousands of GPU-hours. FRANKENSTALLM 3B was trained on 41.12B tokens (~68% of Chinchilla-optimal) in 63 hours on 8 GPUs, which should be kept in mind when reading the table. The gap is expected to narrow after SFT plus extended pretraining (80-100B tokens).

### 10.8 Generation Quality and Parameter Grid Search

#### Repetition Rate Summary

| Setting | 3-gram repetition rate | 4-gram repetition rate |
|------|--------------|--------------|
| greedy (temp=0.0) | 60.99% | 57.02% |
| temp=0.5 | 60.12% | 58.68% |
| temp=0.7 | 47.69% | 43.40% |
| temp=1.0 | 3.58% | 2.81% |

> The 71.1% greedy repetition rate from the initial v1 evaluation was measured with `no_repeat_ngram_size=3` applied. v2 standardizes on the raw setting (not applied) and records 60.99%.

#### 12-Combination Parameter Grid Search Results

| Setting | Temp | Rep Pen | 3-gram | 4-gram | Notes |
|------|------|---------|--------|--------|------|
| **t0.7_rep1.3** | **0.70** | **1.30** | **0.00%** | **0.00%** | **Best** |
| t0.9_rep1.2 | 0.90 | 1.20 | 0.00% | 0.00% | Runner-up |
| t0.7_rep1.2 | 0.70 | 1.20 | 0.88% | 0.00% | |
| t0.9_rep1.1 | 0.90 | 1.10 | 0.94% | 0.13% | |
| t1.0_rep1.1 | 1.00 | 1.10 | 1.21% | 0.48% | |
| t0.5_rep1.1 | 0.50 | 1.10 | 1.92% | 1.19% | |
| t1.0 | 1.00 | 1.00 | 3.58% | 2.81% | |
| t0.9 | 0.90 | 1.00 | 8.39% | 4.64% | |
| t0.7_rep1.1 | 0.70 | 1.10 | 8.51% | 5.51% | |
| t0.7 | 0.70 | 1.00 | 47.69% | 43.40% | |
| t0.5 | 0.50 | 1.00 | 60.12% | 58.68% | |
| greedy | 0.00 | 1.00 | 60.99% | 57.02% | |

#### Recommended Inference Parameters (for base-model experiments)

```python
# Best values from the v2 grid search
best = dict(temp=0.7, repetition_penalty=1.3)
# Alternative (more diverse generations)
diverse = dict(temp=0.9, repetition_penalty=1.2)
```

> Raised from the initial v1 recommendation (`temp=0.9, top_p=0.9, no_repeat_ngram=3, repetition_penalty=1.1`) to `repetition_penalty=1.3`. `no_repeat_ngram_size` is no longer needed, because the grid search confirmed that `repetition_penalty` alone removes repetition sufficiently.

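The README does not spell out the exact formula behind the 3-gram and 4-gram repetition rates. One common definition, assumed here purely for illustration, is the share of n-grams in a generation that duplicate an n-gram seen earlier in the same text. The function name `ngram_repetition_rate` is hypothetical.

```python
def ngram_repetition_rate(tokens: list[str], n: int) -> float:
    """Share of n-grams that duplicate an n-gram already seen in the text."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens: nothing can repeat
    return 1.0 - len(set(ngrams)) / len(ngrams)

# A looping generation versus a non-repeating one (whitespace tokens for simplicity).
looping = "a b c a b c a b c".split()
varied = "a b c d e f g h i".split()

print(f"looping: {ngram_repetition_rate(looping, 3):.2%}")
print(f"varied:  {ngram_repetition_rate(varied, 3):.2%}")
```

Under this definition a value near 60% means more than half of the n-grams are repeats, which matches the looping behaviour the greedy rows describe.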
### 10.9 Evaluation Pipeline

The v2 re-evaluation was run with a modular 8-GPU parallel pipeline (`eval/reeval_pipeline.py`).

#### Architecture

```
reeval_pipeline.py
├── Load model once (HF model on GPU 0)
├── Phase 1: PPL evaluation (19 datasets, sequential)
├── Phase 2: Calibration + Token NLL
├── Phase 3: Generation quality + parameter grid search (12 combinations)
├── Phase 4: lm-evaluation-harness (0-shot, 8-GPU parallel)
├── Phase 5: lm-evaluation-harness (5-shot, 8-GPU parallel)
└── Phase 6: Automatic report generation (5 individual + 1 summary)
```

#### Pipeline Mode

The model is loaded once and the 0-shot and 5-shot runs execute back to back. Compared with the previous approach (two separate processes), this halves model loading time.

#### Task Assignment per GPU

| GPU | 0-shot tasks | 5-shot tasks |
|-----|--------------|--------------|
| 0 | kobest_boolq, kobest_copa, kobest_hellaswag | same |
| 1 | kobest_sentineg, kobest_wic | same |
| 2 | haerae (overall + 5 subtasks) | same |
| 3 | global_mmlu_ko (6 categories) | same |
| 4 | hellaswag, arc_easy | same |
| 5 | arc_challenge, winogrande | same |
| 6 | piqa, global_mmlu_en (61 subjects) | same |
| 7 | (spare; dedicated to PPL/calibration) | - |

NUMA affinity applied: GPUs 0-3 on NUMA node 0 (cores 0-35), GPUs 4-7 on NUMA node 1 (cores 36-71).

**Total elapsed time: 256.6 seconds** (including model load)

### SFT Go/No-Go Decision

**Conclusion: proceed with SFT.** Loss 1.466 is a healthy completion signal and there are no structural problems. → **Phase 2 SFT started (2026-03-05)**

Detailed reports:
- v2 summary: `eval/outputs/3b_reeval_20260305_1451/reports/` (5 individual reports + summary)
- v1 legacy: `reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md`

---

## 11. Experiment Results: 3B SFT Comprehensive Evaluation

A six-dimension comprehensive evaluation performed after Phase 2 SFT finished via early stopping.

### 11.1 SFT Training Results

| Item | Value |
|------|-----|
| Final step | 25,500 / 33,000 (77.3%, early stopping) |
| Best val_loss | **1.8851** (step 23,000) |
| Training time | ~15 h 41 min |
| Data | 24 sources → 2,439,397 samples (7.48 GB) |
| Settings | LR=1e-5, eff_batch=64, NEFTune alpha=5.0 |

**Val loss trend**:
```
Step    500: 2.0732  (warmup complete)
Step  2,000: 1.9558  (rapid descent)
Step  5,000: 1.9107  (stable convergence)
Step 10,000: 1.8917  (slight decrease)
Step 15,000: 1.8864  (entering plateau)
Step 20,000: 1.8853  (change < 0.001)
Step 23,000: 1.8851  ← BEST (early stopping reference point)
Step 25,500: Early Stop (patience 5/5 exhausted)
```

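The gap between the best step (23,000) and the stop step (25,500) is what a patience counter of 5 produces if validation runs every 500 steps. The sketch below illustrates that mechanism only: the 500-step interval is inferred from the log above, the post-best loss values are hypothetical, and `EarlyStopper` is not the class used in `train/trainer.py`.

```python
class EarlyStopper:
    """Stop once val_loss has failed to improve for `patience` evaluations."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_step = None
        self.bad_evals = 0

    def update(self, step: int, val_loss: float) -> bool:
        """Record one evaluation; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss, self.best_step, self.bad_evals = val_loss, step, 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=5)

# Only the 1.8851 at step 23,000 comes from the log; later values are hypothetical.
history = [(23_000, 1.8851)] + [(23_000 + 500 * k, 1.8852) for k in range(1, 7)]

stopped_at = None
for step, loss in history:
    if stopper.update(step, loss):
        stopped_at = step
        break

print(f"best step: {stopper.best_step}, stopped at: {stopped_at}")
```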
### 11.2 Six-Dimension Evaluation Summary

| # | Dimension | Result | Key figures |
|---|------|------|-----------|
| 1 | Perplexity (knowledge retention) | **PASS** | Max forgetting 0.9%, all 19 datasets PASS |
| 2 | Generation quality | **FAIL** | Greedy repetition rate 72.97% (target <5%), EOS 60% (target >90%) |
| 3 | Korean benchmarks | **FAIL** | KoBEST avg 43.26% (target >55%) |
| 4 | English benchmarks | **PASS** | hellaswag 26.1%, winogrande 50.8%, piqa 52.6% (all above the lower bound) |
| 5 | Calibration | **PASS** | Top-1 68.59%, Top-5 81.55%, Entropy 1.54 |
| 6 | SFT chat ability | **PASS** | EOS termination rate 0%→60%, responds in the chat template |

### 11.3 Base vs SFT Comparison

| Metric | Base | SFT | Change | Verdict |
|------|------|-----|------|------|
| PPL (combined) | 5.2263 | 5.2529 | +0.5% forgetting | PASS |
| Greedy 3-gram repetition rate | 60.99% | 72.97% | +12pp (worse) | FAIL |
| EOS termination rate | 0% | 60% | +60pp (large improvement) | Partial PASS |
| KoBEST avg | 43.69% | 43.26% | -0.4pp | FAIL |
| MMLU-KO | 22.75% | 26.00% | +3.2pp | Partial improvement |
| English benchmarks | - | - | within ±0.3pp | PASS (maintained) |
| Calibration Top-1 | 68.75% | 68.59% | -0.2pp | PASS (maintained) |

**Repetition parameter search** (encouraging):

| Setting | Repetition rate | EOS Rate |
|------|--------|----------|
| t0.7_rep1.2 | **0.00%** | **100%** |
| t1.0_rep1.1 | **0.00%** | **100%** |
| greedy (raw) | 72.97% | 60% |

> With rep_penalty 1.1 to 1.3 applied, the repetition rate reaches 0%, so the model does have the underlying ability to avoid repetition. ORPO can internalize it.

### 11.4 Code Improvements

Main code changes made in this phase:

| File | Change | Lines | Purpose |
|------|------|-------|------|
| `train/sft.py` | MixingDataLoader, DDP rank 0 tokenization | +238 | SFT+pretrain interleaving, 8x memory reduction |
| `train/trainer.py` | DDP early stop broadcast | +17 | Prevents DDP hangs, patience 5→10 |
| `train/orpo.py` | YAML config, 3B defaults | +30 | ORPO run preparation |
| `eval/report_generator.py` | Automatic SFT comparison report generation | +831 | Evaluation automation |
| `eval/sft_eval_pipeline.py` | Six-dimension evaluation pipeline | new | SFT comprehensive evaluation |
| `eval/tasks/generation_task.py` | Chat template, diversity metrics | +75 | SFT evaluation support |

### 11.5 ORPO Go Decision

**Decision: proceed with Phase 3 ORPO**

| Rationale | Details |
|------|------|
| Knowledge retention is good | forgetting 0.9%; SFT did not destroy base knowledge |
| Repetition unresolved | greedy 72.97%; preference alignment is the direct fix |
| Encouraging signal | 0% with rep_penalty applied; ORPO can internalize this |
| Data ready | 795,468 preference pairs (7.9 GB) |
| Code/config in place | `train/orpo.py` + `configs/korean_3b_orpo.yaml` |

**Decision criteria after ORPO**:
- Repetition rate < 5% AND KoBEST > 50% → deploy via GGUF + Ollama
- Repetition rate 5-15% → retry after hyperparameter tuning
- Repetition rate > 15% → retry after SFT v2 (lr=5e-5, data mixing)

Details: `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md`

---

## 12. Phase 3: ORPO (Preference Alignment)

### 12.1 Why ORPO

The six-dimension SFT evaluation revealed critical problems: a greedy repetition rate of 72.97% and an EOS termination rate of only 60% (target > 90%). SFT only learns to imitate good responses, so it carries no signal for suppressing bad ones. Preference optimization is essential for fixing the repetition problem.

**ORPO vs DPO**:
| Item | ORPO | DPO |
|------|------|-----|
| Reference model | Not required | Required (2x VRAM) |
| Implementation complexity | Low | Medium |
| Memory efficiency | High (only one 3B model loaded) | Low (two 3B models loaded) |
| Training stability | Medium | High |

ORPO is the first choice, with DPO as Plan B.

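Why ORPO needs no reference model is easiest to see from its loss, which uses only the policy's own likelihoods. The sketch below follows the published ORPO formulation (NLL on the chosen response plus a weighted odds-ratio term) on scalar inputs. It is not the code in `train/orpo.py`, and it assumes the `beta` hyperparameter plays the role of the weight on the odds-ratio term.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(chosen_logp: float, rejected_logp: float, beta: float) -> float:
    """NLL on the chosen response plus beta times the odds-ratio penalty.

    Inputs are length-normalized (per-token average) log-probabilities
    from the single model being trained.
    """
    nll = -chosen_logp
    ratio = log_odds(chosen_logp) - log_odds(rejected_logp)
    # -log(sigmoid(ratio)) rewritten as log(1 + exp(-ratio))
    odds_ratio_term = math.log1p(math.exp(-ratio))
    return nll + beta * odds_ratio_term

# A wider gap between chosen and rejected lowers the loss.
print(orpo_loss(-0.5, -1.5, beta=0.25))
print(orpo_loss(-0.5, -0.6, beta=0.25))
```

DPO would additionally need the same two log-probabilities from a frozen reference model, which is where the second 3B copy in the table comes from.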
### 12.2 Data

- **Raw**: 683,181 preference pairs (7 sources merged)
- **After filtering**: ~630,000 pairs (NaN-prevention filter applied)
- **Eval split**: 5% (~31,500 pairs, seed=42)
- **Effective batch**: 4 × 8 GPU × 4 accum = 128

### 12.3 HP Sweep Design (6 Configs)

Six combinations were selected across three axes (beta, LR, max_length), varying one axis at a time around a fixed center point:

| Run | Name | Beta | LR | Max Length | Purpose |
|-----|------|------|----|-----------|------|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | Weak beta baseline |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | Medium beta baseline |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | Strong beta: aggressive repetition suppression |
| 4 | fast_lr12e6 | 0.25 | 1.2e-5 | 1536 | High LR: fast convergence |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | Conservative LR: stability |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | Short max_length: VRAM savings |

Each run: 200 steps, eval_steps=100, 8×B200 DDP.

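The one-axis-at-a-time design can be written down as a center configuration plus single-key overrides. This is a sketch of how such a sweep list might be generated; the dictionary layout and key names are assumptions for illustration, not the format of the project's YAML configs.

```python
# Center point shared by every run (Run 2 in the table).
center = {"beta": 0.25, "lr": 8e-6, "max_length": 1536}

# Each run overrides at most one axis of the center point.
variations = [
    ("baseline_b015", {"beta": 0.15}),
    ("baseline_b025", {}),
    ("strong_b035", {"beta": 0.35}),
    ("fast_lr12e6", {"lr": 1.2e-5}),
    ("conserv_lr5e6", {"lr": 5e-6}),
    ("short_1024", {"max_length": 1024}),
]

sweep = [
    {"run": i, "name": name, **center, **override}
    for i, (name, override) in enumerate(variations, start=1)
]

for cfg in sweep:
    print(cfg)
```

Because only one axis moves per run, any difference from Run 2 can be attributed to that axis alone, at the cost of not exploring interactions between axes.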
### 12.4 Attempt History: Five Failures

| # | Problem | Cause | Fix |
|---|------|------|------|
| 1 | NCCL timeout | Tokenization took 30 min, exceeding the 1800 s timeout | ddp_timeout=7200, num_proc=64 |
| 2 | Config conflict | save_steps is not a multiple of eval_steps | --no_load_best --save_steps 200 |
| 3 | Port conflict + missing QKV | Zombie process + fused QKV not split | pkill + QKV split logic |
| 4 | TRL NaN bug | tokenize_row truncates both responses at once | Three-layer patch (clamp, truncation) |
| 5 | Tokenizer compatibility | zip(strict=True) + Korean merge ops | 8 patches to the TRL source |

The most serious was the TRL NaN bug, which set off the chain 0 response tokens → log(0) = -inf → NaN propagation. Details: `reports/2026-03-08_ORPO_TRAINING_JOURNEY.md`

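One way a chain like "0 response tokens → log(0) = -inf → NaN" can arise in an odds-ratio loss is shown below with plain floats. This is an illustration under stated assumptions, not a reproduction of the TRL code path or of the actual patch: `tensor_log` mimics tensor semantics, where log(0) yields -inf instead of raising, and `has_response_tokens` is a hypothetical pre-filter.

```python
import math

def tensor_log(x: float) -> float:
    """Mimic tensor semantics: log(0) gives -inf instead of raising."""
    return math.log(x) if x > 0.0 else float("-inf")

def log_odds(sum_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(sum_logp)."""
    return sum_logp - tensor_log(1.0 - math.exp(sum_logp))

# A response truncated to zero tokens has summed log-probability 0, i.e. p = 1,
# so log(1 - p) = log(0) = -inf and the log-odds become +inf.
empty_response_logp = 0.0

# When both chosen and rejected are empty, the odds ratio is inf - inf = NaN.
ratio = log_odds(empty_response_logp) - log_odds(empty_response_logp)
print("ratio is NaN:", math.isnan(ratio))

def has_response_tokens(n_chosen: int, n_rejected: int) -> bool:
    """Hypothetical pre-filter: keep a pair only if both responses survive truncation."""
    return n_chosen > 0 and n_rejected > 0
```

Dropping such pairs before training, or clamping the probability away from 1, removes the source of the -inf before it can propagate.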
### 12.5 Sweep Final Results

| Run | Name | Beta | LR | MaxLen | Train Loss | Eval Loss | Margin | Status |
|-----|------|------|----|--------|-----------|-----------|--------|--------|
| 1 | baseline_b015 | 0.15 | 8e-6 | 1536 | 1.811 | 1.827 | 0.004 | ✅ |
| 2 | baseline_b025 | 0.25 | 8e-6 | 1536 | 1.890 | 1.906 | 0.009 | ✅ |
| 3 | strong_b035 | 0.35 | 8e-6 | 1536 | 2.055 | 1.985 | 0.007 | ✅ |
| **4** | **fast_lr12e6** | **0.25** | **1.2e-5** | **1536** | **1.917** | **1.862** | **0.009** | **Best** |
| 5 | conserv_lr5e6 | 0.25 | 5e-6 | 1536 | 1.833 | 1.910 | 0.004 | ✅ |
| 6 | short_1024 | 0.25 | 8e-6 | 1024 | 1.664 | 1.695 | 0.007 | ✅ |

**Best config: Run 4** (lowest eval_loss, 1.862, among the beta=0.25 / max_length=1536 runs; margin 0.009, tied for highest; fast convergence). Runs 1 and 6 report lower eval loss, but they use a different beta or max_length, so their losses are not directly comparable.

### 12.6 Throughput Benchmark: Main Training Settings

Before the main run, throughput was measured for each batch/grad_accum combination to pick the best setting:

| batch_size | grad_accum | eff_batch | Throughput | Notes |
|-----------|-----------|----------|-----------|------|
| **4** | **4** | **128** | **80.63 samples/s** | **Selected** |
| 2 | 8 | 128 | 73.14 samples/s | Previous setting |
| 8 | 2 | 128 | OOM | |

### 12.7 ORPO Main Training (in progress, 2026-03-09)

| Parameter | Value |
|---------|-----|
| Beta / LR | 0.25 / 1.2e-5 (Sweep Run 4) |
| Batch / Accum / Eff | 4 / 4 / 128 (benchmark best) |
| Max length | 1536 |
| Epochs | 2 (~9,840 steps) |
| GPU VRAM | ~52GB / 183GB (28%) |
| Speed | ~1.75 s/step |
| Estimated time | ~4.8 hours |

**Training metric trend (as of step ~1,660)**:

| Step | Eval Loss | Pref Accuracy | Reward Margin | NLL Loss |
|-----:|----------:|--------------:|--------------:|---------:|
| ~1,000 | 1.791 | 66.8% | 0.107 | 1.647 |
| ~2,000 | 1.713 | 70.1% | 0.293 | 1.591 |
| ~3,000 | 1.681 | 71.9% | 0.372 | 1.567 |

- Train loss: 2.34 → **1.68** (-0.66)
- rewards/accuracies: 0.43 → **0.74** (sharp rise in the ability to tell chosen from rejected)
- rewards/margins: -0.005 → **0.387** (confirms the preference signal is being learned)
- Speed ~1.76 s/step, GPU utilization 92-100%, running stably

**Automatic evaluation after training**: `scripts/orpo_eval_watchdog.sh` monitors the training process and, on completion, automatically runs the ten-dimension comprehensive evaluation pipeline.

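The effective batch size and the estimated wall-clock time in the parameter table follow from simple arithmetic. The sketch below reuses only the figures from that table; the variable names are illustrative.

```python
# Batch configuration from the parameter table.
per_device_batch = 4
num_gpus = 8
grad_accum = 4
effective_batch = per_device_batch * num_gpus * grad_accum

# Approximate figures from the same table.
total_steps = 9_840       # "~9,840 steps" over 2 epochs
seconds_per_step = 1.75   # "~1.75 s/step"

estimated_hours = total_steps * seconds_per_step / 3600

print(f"effective batch: {effective_batch}")
print(f"estimated time:  {estimated_hours:.1f} h")
```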
### 12.8 ORPO Comprehensive Evaluation Pipeline

A **ten-dimension comprehensive evaluation** that adds four ORPO-specific dimensions to the six dimensions of the SFT v2 evaluation.
When training finishes, `eval/orpo_eval_pipeline.py` runs automatically and generates a Base vs SFT vs ORPO 3-way comparison report.

**Evaluation structure**:

| Phase | Content | GPU | Estimated time |
|-------|------|-----|----------|
| Pre-phase | Extract training curves from train.log | - | ~1 s |
| Phase 1 | Internal evaluation (PPL on 19 sets, Calibration, Generation, Repetition Grid) | 8 GPUs in parallel | ~30 min |
| Phase 2 | Benchmarks (KoBEST, HAE-RAE, MMLU-KO/EN, hellaswag, arc, piqa) | 8 GPUs in parallel | ~1 hour |
| Phase 3 | Automatic 3-way comparison report generation | - | ~10 s |

**Ten evaluation dimensions**:

| # | Dimension | Criterion | SFT v2 result | ORPO target |
|---|------|------|------------|----------|
| 1 | Knowledge retention (PPL) | forgetting < 15% | 0.9% | < 5% |
| 2 | Generation quality | greedy repetition rate < 5%, EOS > 90% | **72.97% / 60%** | **< 5% / > 90%** |
| 3 | Korean benchmarks | KoBEST avg > 55% | 43.26% | ≥ 43% |
| 4 | English benchmarks | Above the lower bound | PASS | Maintain |
| 5 | Calibration | Top-1 ≥ 65% | 68.59% | ≥ 65% |
| 6 | Chat ability | EOS termination rate | 60% | > 90% |
| 7 | Preference Accuracy | > 65% | - | > 65% |
| 8 | Reward Margins | > 0.1 | - | > 0.1 |
| 9 | Repetition parameter sensitivity | < 5% even at rep_penalty=1.0 | - | PASS |
| 10 | SFT→ORPO improvement | Repetition rate ↓ + EOS ↑ | - | PASS |

**Key files**:
- `eval/orpo_eval_pipeline.py`: ORPO evaluation orchestrator
- `eval/report_generator.py`: 3-way comparison report generator (`generate_three_way_report()`)
- `scripts/orpo_eval_watchdog.sh`: detects training completion and launches the evaluation automatically

**Deployment criteria**: greedy repetition rate < 5% AND EOS > 90% AND forgetting < 5% AND KoBEST ≥ 43% → **DEPLOY**

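The deployment criteria amount to a conjunction of four thresholds, which can be expressed as a small gate function. This is a sketch for illustration: `EvalSummary` and `should_deploy` are hypothetical names and are not taken from `eval/orpo_eval_pipeline.py`.

```python
from dataclasses import dataclass

@dataclass
class EvalSummary:
    greedy_repetition_pct: float  # greedy 3-gram repetition rate, in percent
    eos_rate_pct: float           # share of generations ending with EOS, in percent
    forgetting_pct: float         # PPL increase relative to the base model, in percent
    kobest_avg_pct: float         # KoBEST average accuracy, in percent

def should_deploy(s: EvalSummary) -> bool:
    """Apply the four deployment thresholds; all must hold."""
    return (
        s.greedy_repetition_pct < 5.0
        and s.eos_rate_pct > 90.0
        and s.forgetting_pct < 5.0
        and s.kobest_avg_pct >= 43.0
    )

# SFT v2 figures from the table above: fails on repetition and EOS.
sft_v2 = EvalSummary(72.97, 60.0, 0.9, 43.26)
print("deploy SFT v2:", should_deploy(sft_v2))
```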
---

## 13. How to Run

### Prerequisites

```bash
# Do not reinstall PyTorch (NVIDIA custom build)
# Install only the additional packages below
pip install transformers accelerate peft trl deepspeed \
    bitsandbytes sentencepiece wandb
```

### 3B Pretraining

```bash
# Launch 8-GPU training with the NCCL environment variables set
bash scripts/launch_3b_pretrain.sh

# Manual launch (direct control)
torchrun --nproc_per_node=8 \
    --master_port=29500 \
    train/pretrain.py \
    --config configs/korean_3b_fp8.yaml
```

### SFT

```bash
bash scripts/launch_3b_sft.sh

# Or run directly
torchrun --nproc_per_node=8 \
    train/sft.py \
    --config configs/korean_3b_sft.yaml \
    --pretrain_ckpt checkpoints/3b_pretrain_best.pt
```

### ORPO (Preference Alignment)

```bash
# ORPO training
bash scripts/launch_3b_orpo.sh

# Automatic evaluation after training completes (watchdog)
nohup bash scripts/orpo_eval_watchdog.sh \
    > checkpoints/korean_3b_orpo_v1/watchdog.log 2>&1 &
```

### Evaluation

```bash
# Full evaluation of the base model (8 GPUs in parallel)
python eval/full_eval_pipeline.py

# SFT model evaluation (Base vs SFT 2-way comparison)
python eval/sft_eval_pipeline.py --skip-phase0 \
    --hf-model-path eval/outputs/hf_3b_sft_best

# ORPO model evaluation (Base vs SFT vs ORPO 3-way comparison)
python eval/orpo_eval_pipeline.py            # detects the latest checkpoint automatically
python eval/orpo_eval_pipeline.py --dry-run  # only show the execution plan

# Quick evaluation (kobest_copa + PPL)
bash scripts/run_eval_quick.sh

# Generation parameter search
python eval/test_generation_params.py \
    --checkpoint checkpoints/3b_best.pt
```

### Deployment

```bash
# Step 1: GGUF conversion (llama.cpp format)
bash scripts/convert_3b_gguf.sh

# Step 2: Register the model with Ollama and serve it
bash scripts/deploy_3b_ollama.sh

# Test with Ollama (prompt: "Explain Korea's steel industry.")
ollama run frankenstallm-3b "한국의 철강 산업에 대해 설명해줘."
```

### Training Monitoring

```bash
# Live monitor (tail -f style)
bash scripts/monitor_3b.sh

# Check process status
ps aux | grep pretrain

# GPU status
nvidia-smi --query-gpu=index,name,memory.used,memory.total,utilization.gpu \
    --format=csv -l 5
```

### Single-GPU Test (development/debug)

```bash
python train/pretrain.py \
    --config configs/korean_3b_fp8.yaml \
    --device cuda:0 \
    --max_steps 100 \
    --debug
```

---

## 14. Roadmap

### Short Term (March 2026)

| Item | Status | Notes |
|------|------|------|
| Phase 1 (3B Pretrain) complete | ✅ Done | 57K steps, loss 1.466, 2026-03-05 |
| Phase 2 (SFT) complete | ✅ Done | 25.5K steps, val_loss 1.8851, 2026-03-06 |
| SFT six-dimension evaluation | ✅ Done | 4/6 PASS, ORPO decision |
| Phase 3 (ORPO Sweep) | ✅ Done | 6-config sweep finished, best config selected |
| **Phase 3 (ORPO main training)** | **In progress** | **lr=1.2e-5, beta=0.25, 2 epochs, ~9,840 steps** |
| Phase 3.5 (ORPO comprehensive evaluation) | Pending | Ten-dimension evaluation (6 base + 4 ORPO-specific), 3-way comparison report |
| GGUF conversion + Ollama deployment | Pending | Phase 4 (if the ORPO evaluation passes) |

### Medium Term (Q2 2026)

| Item | Notes |
|------|------|
| Extended pretraining (80-100B tokens) | Reach the Chinchilla-optimal point |
| QKV Fusion | +8~12% MFU expected |
| NUMA affinity setup | +4~9% expected |
| FA2 native RoPE | +3~5% expected |
| Context length extension (4096) | Based on RoPE θ=500K |

### Long Term (second half of 2026)

| Item | Notes |
|------|------|
| 7B experiment | FSDP strategy required |
| vLLM serving | PagedAttention-based inference server |
| Domain-specific fine-tuning | Steel/manufacturing domain |
| Public release | Upload to Hugging Face Hub |

### Known Unapplied Optimizations

Optimizations identified in the Phase 0 analysis but not yet applied:

| Optimization | Expected gain | Implementation complexity |
|--------|-----------|-------------|
| QKV Fusion | +8~12% MFU | Medium |
| NUMA Affinity | +4~9% | Low |
| FA2 Native RoPE | +3~5% | Low |
| HugePages | +1~3% (TLB optimization) | Low (sysctl) |

With all of these applied, MFU could rise from the current 33.5% to 45-50%.

---

## 15. Reference Documents

| Document | Location | Contents |
|------|------|------|
| Full project journey | `docs/PROJECT_HISTORY.md` | Detailed day-by-day progress log |
| 3B work plan | `docs/3B_WORKPLAN.md` | Detailed step-by-step 3B work plan |
| Justice League case | `eval/debate/justice_league_3b_case.md` | Full text of the multi-agent debate on the 1B→3B transition |
| SFT restart ruling | `eval/decision/FINAL_DECISION_REPORT.md` | Ruling on the SFT v1 failure → v2 design |
| 3B master plan | `eval/plan/3B_MASTER_PLAN.md` | Master plan for the full training pipeline |
| Phase 0 optimization report | `reports/2026-03-02_0200_FRANKENSTALLM_phase0_optimization_report.md` | Full report on VRAM/MFU optimization |
| 3B base evaluation report (v1) | `reports/2026-03-05_3B_BASE_EVALUATION_REPORT.md` | Initial PPL/benchmark/repetition-rate evaluation |
| PPL evaluation report (v1) | `reports/2026-03-05_PPL_EVALUATION.md` | PPL details for 4 validation sets |
| Benchmark results (v1) | `reports/2026-03-05_BENCHMARK_RESULTS.md` | belebele, MMLU details |
| Generation quality analysis (v1) | `reports/2026-03-05_GENERATION_QUALITY.md` | Repetition rate, decoding parameters |
| SFT training report | `reports/2026-03-05_3B_SFT_PROGRESS_REPORT.md` | Record of the Phase 2 SFT training process |
| **SFT completion summary report** | `reports/2026-03-06_3B_SFT_COMPLETION_AND_EVAL_SUMMARY.md` | **SFT completion + evaluation + code improvements + ORPO decision (latest)** |
| SFT evaluation plan | `reports/2026-03-06_3B_SFT_EVAL_PLAN.md` | Six-dimension evaluation design |
| SFT evaluation results | `reports/2026-03-06_3B_SFT_EVALUATION_REPORT.md` | Detailed six-dimension evaluation results |
| 3B next-steps reference | `reports/2026-03-05_3B_NEXT_STEPS_REFERENCE.md` | Direction after SFT |
| Nemotron Nano feasibility | `reports/2026-03-05_NEMOTRON_NANO_FEASIBILITY_STUDY.md` | Review of the hybrid architecture |
| **v2 comprehensive evaluation report** | `eval/outputs/3b_reeval_20260305_1451/full_eval_report.md` | **Summary of 13+ benchmarks** |
| v2 PPL report | `eval/outputs/3b_reeval_20260305_1451/reports/01_perplexity_report.md` | PPL details for 19 datasets |
| v2 Calibration report | `eval/outputs/3b_reeval_20260305_1451/reports/02_calibration_report.md` | Top-K accuracy, NLL distribution |
| v2 generation quality report | `eval/outputs/3b_reeval_20260305_1451/reports/03_generation_quality.md` | 12-combination parameter grid search |
| v2 benchmark report | `eval/outputs/3b_reeval_20260305_1451/reports/04_benchmark_report.md` | KoBEST, HAE-RAE, MMLU, 0/5-shot |
| Progress log | `PROGRESS.md` | Checkpoints, metrics, and decision log by date |
| **ORPO analysis and plan** | `reports/2026-03-07_ORPO_ANALYSIS_AND_PLAN.md` | **ORPO rationale, HP design, run procedure** |
| **ORPO sweep debug** | `reports/2026-03-08_ORPO_SWEEP_DEBUG_REPORT.md` | **QKV bug, NCCL timeout, TRL patch details** |
| **ORPO training journey** | `reports/2026-03-08_ORPO_TRAINING_JOURNEY.md` | **The full ORPO process: five failures and the HP sweep (latest)** |

---

## 16. Tech Stack Summary

| Area | Technology | Version |
|------|------|------|
| Deep learning framework | PyTorch (NVIDIA custom build) | nv25.12 |
| Attention | FlashAttention-2 | 2.7.4.post1+25.12 |
| FP8 / mixed precision | TransformerEngine (MXFP8) | 2.10.0 |
| Distributed training | DDP + NCCL (NVLS) | NCCL 2.28.9 |
| Kernel compilation | Triton | 3.5.1 |
| Tokenizer | SentencePiece Unigram 64K | - |
| Monitoring | Telegram Bot (B200Bot) + cron watchdog | - |
| Inference serving | GGUF + Ollama | - |
| GPU | 8× NVIDIA B200 (NVLink 5.0, NVSwitch) | CUDA 13.1 |
| CPU | 2× AMD EPYC 9365 (Zen 5) | - |

---

## Related Projects

### [EVAFRILL-Mo](https://github.com/pathcosmos/EVAFRILL-Mo)

**A hybrid Mamba-2 + Transformer language model**, FRANKENSTALLM's sister project.

A 3B hybrid model implemented from scratch, inspired by NVIDIA's [Nemotron-H](https://arxiv.org/abs/2504.03624) architecture. Where FRANKENSTALLM is a pure Transformer, EVAFRILL-Mo adopts a hybrid structure of **Mamba-2 SSM + sparse Transformer attention**.

| Item | FRANKENSTALLM | EVAFRILL-Mo |
|------|:---:|:---:|
| Architecture | Pure Transformer (28L) | Mamba-2 24L + Attention 2L |
| Parameters | 3.17B | 2.94B |
| Key techniques | GQA, FP8, FlashAttention-2 | Selective Scan, SwiGLU FFN in Mamba, GQA |
| Design principle | Proven Transformer architecture | Piecewise adoption of Nemotron-H |
| GPU | 8× B200 | 7× B200 |
| Training strategy | Chinchilla-optimal | Target: reach 93% of Chinchilla |

The two projects share the same tokenizer (64K SentencePiece), training data pipeline, and DDP/FP8 infrastructure. With "same ingredients, different recipe", they make it possible to compare experimentally how architectural differences affect performance.

> *Origin of the name: Bride **Eva** (the Bride of Frankenstein) + **FRI**DAY (Iron Man's AI assistant) + **LL**M + the **Mo** in Nemotron*

---

## 18. ๋ค์ ์ต์ ํ ๊ณํ โ MFU 33.5% โ 47% ๋ชฉํ
|
| 1721 |
+
|
| 1722 |
+
> ์์ธ ๋ฌธ์: [`docs/NEXT_OPTIMIZATION_PLAN.md`](docs/NEXT_OPTIMIZATION_PLAN.md)
|
| 1723 |
+
|
| 1724 |
+
### ํ์ฌ ์ฑ๋ฅ ์ง๋จ
|
| 1725 |
+
|
| 1726 |
+
Phase 1 ํ๋ฆฌํธ๋ ์ธ ์ค์ธก:
|
| 1727 |
+
- **57,000 steps**, ~38.5B tokens, **์ฝ 63์๊ฐ**
|
| 1728 |
+
- ์ฒ๋ฆฌ ์๋: 36~38K tok/s per rank โ ์ ์ฒด **~292K tok/s** (8GPU)
|
| 1729 |
+
- **MFU: ~33.5%**
|
| 1730 |
+
|
| 1731 |
+
### ํต์ฌ ๋ณ๋ชฉ: NUMA Misalignment
|
| 1732 |
+
|
| 1733 |
+
```
|
| 1734 |
+
AMD EPYC 9365 ร 2์์ผ:
|
| 1735 |
+
GPU 0~3 โ NUMA node 0 (core 0-35)
|
| 1736 |
+
GPU 4~7 โ NUMA node 1 (core 36-71)
|
| 1737 |
+
|
| 1738 |
+
์ด๊ธฐ DDP ๋ฐ์นญ ์ 5/8 rank๊ฐ ์๋ชป๋ NUMA ๋
ธ๋์์ ์คํ.
|
| 1739 |
+
69%์ DataLoader worker๊ฐ ํฌ๋ก์ค-NUMA โ ~2๋ฐฐ ์ง์ฐ ๋ฐ์.
|
| 1740 |
+
```
|
| 1741 |
+
|
| 1742 |
+
### Expected Effect per Optimization

| Optimization | Expected MFU improvement | Difficulty |
|--------|-------------|--------|
| Pin NUMA affinity | +4~9% | Low (launch script change) |
| QKV fusion (TransformerEngine) | +8~12% | Medium (model code change) |
| FA2 native RoPE | +3~5% | Medium (depends on FA2 version) |
| NCCL environment variable tuning | +1~2% | Low (one added line) |

### Expected Before/After Comparison

| Item | Current | After optimization |
|------|------|----------|
| MFU | 33.5% | ~45~47% |
| Throughput | 292K tok/s | ~390~410K tok/s |
| 50B-token training | ~47 hours | ~34~36 hours |

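
The projected figures in the table follow from two simple relations: on fixed hardware and a fixed model, throughput scales linearly with MFU, and training time is tokens divided by throughput. The sketch below reproduces that arithmetic from the numbers in the table; the function names are illustrative.

```python
def scaled_throughput(base_tok_s: float, base_mfu: float, target_mfu: float) -> float:
    """Throughput scales linearly with MFU on the same hardware and model."""
    return base_tok_s * target_mfu / base_mfu


def training_hours(tokens: float, tok_s: float) -> float:
    """Wall-clock hours for a token budget at a steady throughput."""
    return tokens / tok_s / 3600.0


base = 292e3  # measured tok/s at 33.5% MFU
print(round(training_hours(50e9, base), 1))
for mfu in (45.0, 47.0):
    tps = scaled_throughput(base, 33.5, mfu)
    print(round(tps / 1e3), round(training_hours(50e9, tps), 1))
```

This yields about 47.6 hours at the current rate, and about 392K to 410K tok/s (35.4 to 33.9 hours) at 45% to 47% MFU, consistent with the rounded values in the table.
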
### Code You Can Apply Immediately

**NUMA affinity (launch script):**

```bash
# One torchrun launcher per NUMA node; CPU and memory are bound to that node.
# Added assumptions: --nnodes=2 joins both launchers into one 8-rank job, and
# CUDA_VISIBLE_DEVICES gives each launcher only the GPUs attached to its node.
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --cpunodebind=0 --membind=0 torchrun \
  --nnodes=2 --nproc_per_node=4 --node_rank=0 train/pretrain.py ... &
CUDA_VISIBLE_DEVICES=4,5,6,7 numactl --cpunodebind=1 --membind=1 torchrun \
  --nnodes=2 --nproc_per_node=4 --node_rank=1 train/pretrain.py ... &
wait
```

**NCCL environment variables:**

```bash
export NCCL_MIN_NCHANNELS=4
export NCCL_SOCKET_NTHREADS=4
export CUDA_DEVICE_MAX_CONNECTIONS=1
```

> After Phase 3 ORPO completes, applying NUMA affinity first, before the next pretraining run, could cut training time by ~30%.

---

## 19. GPU Hardware & Cost Analysis: 3B × 60B Pretraining

> Detailed document: [`docs/GPU_COST_ANALYSIS.md`](docs/GPU_COST_ANALYSIS.md)

### Measured Baseline

```
FRANKENSTALLM Phase 1, measured:
  B200 × 8, MFU 33.5%, 292K tok/s
  38.5B tokens → 63 hours
  60B tokens (extrapolated) → about 98 hours
```

### Cloud Best-Value Top 3 (60B tokens, after optimization)

| Rank | Configuration | Duration | Total cost |
|------|------|---------|--------|
| 1 | H100×8 Cudo | 44.8hr | **$645** (~930K KRW) |
| 2 | H100×8 Vast.ai | 44.8hr | $670 (~970K KRW) |
| 3 | H100×8 RunPod | 44.8hr | $713 (~1.03M KRW) |

> B200 Blackwell is faster, but its cloud unit price is 3× that of H100 → **H100 is 4.3× cheaper in total cost**

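
To compare these quotes with other providers, it helps to back out the per-GPU hourly price each total implies. The sketch below does that using only the totals and the 44.8-hour duration from the table; the derived rates are implied by those numbers, not quoted by the source, and the function name is illustrative.

```python
# (provider, total USD, hours) copied from the table above; every row is 8x H100.
quotes = [("Cudo", 645.0, 44.8), ("Vast.ai", 670.0, 44.8), ("RunPod", 713.0, 44.8)]
N_GPUS = 8


def implied_gpu_hour_rate(total_usd: float, hours: float, n_gpus: int = N_GPUS) -> float:
    """Per-GPU hourly price implied by a total cost and a run duration."""
    return total_usd / (hours * n_gpus)


for name, usd, hours in quotes:
    print(f"{name}: ${implied_gpu_hour_rate(usd, hours):.2f} per GPU-hour")
```

The implied rates come out at roughly $1.80, $1.87, and $1.99 per GPU-hour, so the ranking is driven entirely by the hourly price because the duration is the same in all three rows.
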
### Recommended Personal GPU Configurations

| Configuration | VRAM | NVLink | Price | Recommendation |
|------|------|--------|------|--------|
| A6000 Ada × 2 (used) | 96GB (combined) | ✅ | ~10M KRW | ⭐⭐⭐⭐⭐ |
| L40S × 2 | 96GB (combined) | ✅ | ~14M KRW | ⭐⭐⭐⭐ |
| RTX Pro 6000 Blackwell | 96GB (single) | ❌ | ~12M KRW | ⭐⭐⭐ |

> Consumer GPUs (RTX 5090/4090) do not support NVLink. If you need 80GB+ of combined memory, professional-grade cards are required.

### Recommended Strategy: Local + Cloud Hybrid

```
[Local] RTX 4090 × 4 (8.8M KRW) → data preprocessing, experiments, SFT/ORPO
[Cloud] H100×8 (~1.03M KRW per run) → main pretraining only
```

---

## Closing

This project has a single motto:

> **"We record the failures too."**

The SFT v1 loss=0.0 failure, torch.compile having no effect, the frustration of an 18% repetition rate: all of it is on record. That tradition continues in Phase 3 ORPO. After **five failures** (NCCL timeout, config conflict, QKV conversion bug, port conflict, TRL NaN bug), the 6-config HP sweep is finally running.

Just as Frankenstein stitched pieces together to create life, we are stitching together data and techniques from many sources to build a model that understands and speaks Korean. It is not finished yet, but the process itself is the value of this project.

Phase 1 pretraining finished at 57,000 steps with loss 1.466. Phase 2 SFT stopped early at 25,500 steps (val_loss 1.8851). It passed 4 of 6 dimensions in the 6-dimension comprehensive evaluation.

**Good news**: Knowledge retention was nearly perfect (forgetting 0.9%). SFT did not destroy the base model's knowledge. The EOS termination rate rose from 0% to 60%. MMLU-KO improved by +3.2pp.

**Disappointing news**: The greedy repetition rate is 72.97%. SFT alone did not solve the repetition problem; it actually made it worse (Base 60.99% → SFT 72.97%). However, simply applying `rep_penalty=1.2` brings the repetition rate to 0%. The model is capable of not repeating itself. It just has not learned that as its "default behavior".

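
To make concrete what a repetition penalty of 1.2 does at decode time, here is a minimal sketch of the widely used CTRL-style rule, the same rule Hugging Face `transformers` applies in its repetition-penalty logits processor. The README does not say which implementation backs `rep_penalty`, so treat this as an illustration of the general technique; the function name and the toy logits are hypothetical.

```python
def apply_repetition_penalty(logits: list[float], generated_ids: list[int],
                             penalty: float = 1.2) -> list[float]:
    """Lower the scores of tokens that were already generated (CTRL-style rule)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both reduce it.
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


# Toy vocabulary of three tokens; tokens 0 and 1 have already been generated.
logits = [2.4, -1.0, 0.6]
print(apply_repetition_penalty(logits, generated_ids=[0, 1]))
```

Tokens 0 and 1 are pushed down while the unseen token 2 is untouched, so greedy decoding becomes less likely to pick a token it has already emitted. The penalty acts only at inference time, which is why it can remove the repetition without changing what the model has learned.
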
**Now**: The main Phase 3 ORPO training run is in progress. All six configs of the HP sweep are complete, and the best config by eval_loss (lr=1.2e-5, beta=0.25) has been selected. A throughput benchmark confirmed that batch_size=4 with grad_accum=4 is optimal at 80.63 samples/s, and the main run was started on all 8×B200 GPUs. It covers ~9,840 steps, with an estimated duration of ~4.8 hours. When training completes, a watchdog automatically runs the 10-dimension comprehensive evaluation (a 3-way comparison of Base vs SFT vs ORPO).

> **Can ORPO pull the greedy repetition rate below 5%?**

The answer is coming soon. Once training ends, the 6-dimension re-evaluation will run, and if the model passes, it will be converted to GGUF and run on Ollama. A 3B model that understands and speaks Korean, built from scratch.

---

*Last updated: 2026-03-09*
*Current status: Phase 3 ORPO main training in progress (lr=1.2e-5, beta=0.25, step ~1,660/9,840, 17%); the 10-dimension comprehensive evaluation is queued to run automatically when training completes*