Honest results: MMLU 40%, rank ~#260-300. LoRA can't fix broad knowledge; that requires scale. 0274e91 verified teolm30 committed 2 days ago
Honest benchmark framing - Fox1.3 100% on our test, Opus 95% on standardized benchmarks (not comparable) 1ea971a verified teolm30 committed 2 days ago
Fox1.3 v9: 100% benchmark score, riddle + penguin fixed b1103ca verified teolm30 committed 2 days ago
Stronger web search argument + updated comparison table 25a48e9 verified teolm30 committed 2 days ago
Reorder: Benchmark on top, Leaderboard below, then rest 00227ff verified teolm30 committed 2 days ago
Restore: benchmark comparison Opus vs Fox + full leaderboard 8498652 verified teolm30 committed 2 days ago
Fox1.3 v7: Exception logic fixed + web search integration f67b328 verified teolm30 committed 2 days ago
Update model card: small is better philosophy + latest benchmarks 3b28447 verified teolm30 committed 2 days ago
Add evaluate.py: Benchmark evaluation on HumanEval + MBPP f7a5fb7 verified teolm30 committed 4 days ago
Add train.py: Training script with LoRA fine-tuning on CodeAlpaca_20K 5aa2508 verified teolm30 committed 4 days ago