details about the model

#1
by Roman1111111 - opened

Could you please provide some details, including which dataset you used to fine-tune?

@Roman1111111 I used nohurry/Opus-4.6-Reasoning-3000x-filtered dataset. Updated the model card

Do you plan on testing it on benchmarks to see how well the Opus 4.6 dataset aligns with it?

@Roman1111111 I indeed wanted to, but I am currently busy, so I might not be able to evaluate the model myself. I am not really expecting much from this model because the dataset is small, and there is a possibility I used the wrong hyperparameters, etc. (do note that I am no expert, merely a student). But feel free to evaluate or use the model for any purpose you see fit.

Oh, no worries, I will try to test it; I'm not an expert either :)

@Roman1111111 Thank you! I am also curious about the model's performance, as I was quite interested in the original Qwen3.5's intelligence. I look forward to seeing how much the model's performance was boosted by the distillation.

thank you too for the model itself

@Roman1111111 No worries. Don't forget to share your evaluation results if possible :)

Of course, just waiting for the download to finish

you are everywhere roman

u too❀️

❀️ roman sniffing new data for the gemini collection

Hello! For now I tested only on the GSM8K benchmark, just to confirm everything works, and it showed outstanding results: token use reduced by 40-50% and scores better by 17 points than the base on this bench. Here are the scores (flexible/strict/runtime): original Qwen3.5 35B A3B - 77%/71.5%/35.15 min; this model - 94%/84.5%/21.65 min.
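The reported deltas can be sanity-checked with quick arithmetic (all numbers taken from the post above; the 40-50% figure refers to tokens, which the runtimes only approximate):

```python
# Quick sanity check of the reported GSM8K deltas.
base_flex, base_strict, base_min = 0.77, 0.715, 35.15   # original Qwen3.5 35B A3B
ft_flex, ft_strict, ft_min = 0.94, 0.845, 21.65         # fine-tuned model

flex_gain = ft_flex - base_flex              # absolute gain, flexible-extract
strict_gain = ft_strict - base_strict        # absolute gain, strict-match
time_cut = (base_min - ft_min) / base_min    # relative wall-clock reduction

print(f"flexible-extract gain: {flex_gain:.3f}")   # 0.170 -> the quoted "+17 points"
print(f"strict-match gain:     {strict_gain:.3f}") # 0.130
print(f"wall-clock reduction:  {time_cut:.1%}")    # ~38.4%
```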

here is the Qwen3.5 original model - 77%/71.5% - (base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$ lm_eval --model local-chat-completions --tasks gsm8k --model_args model=qwen/qwen3.5-35b-a3b,base_url=http://172.29.128.1:1234/v1/chat/completions,num_concurrent=4 --apply_chat_template --num_fewshot 5 --gen_kwargs max_tokens=5000 --output_path ./gsm8k_final_results --limit 200
2026-03-02:17:09:16 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-03-02:17:09:16 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-03-02:17:09:19 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-02:17:09:19 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-02:17:09:19 WARNING [evaluator:223] generation_kwargs: {'max_tokens': 5000} specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
2026-03-02:17:09:19 INFO [evaluator:236] Initializing local-chat-completions model, with arguments: {'model': 'qwen/qwen3.5-35b-a3b', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}
2026-03-02:17:09:19 INFO [models.api_models:172] Using max length 2048 - 1
2026-03-02:17:09:19 INFO [models.api_models:193] Using tokenizer None
2026-03-02:17:09:24 INFO [tasks:700] Selected tasks:
2026-03-02:17:09:24 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-02:17:09:24 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0, 'max_tokens': 5000}
2026-03-02:17:09:24 WARNING [evaluator:333] Overwriting default num_fewshot of gsm8k from 5 to 5
2026-03-02:17:09:24 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-03-02:17:09:24 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 200/200 [00:00<00:00, 506.09it/s]
2026-03-02:17:09:24 INFO [evaluator:584] Running generate_until requests
2026-03-02:17:09:24 INFO [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 200/200 [35:11<00:00, 10.56s/it]
2026-03-02:17:44:41 INFO [loggers.evaluation_tracker:247] Saving results aggregated
local-chat-completions ({'model': 'qwen/qwen3.5-35b-a3b', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}), gen_kwargs: ({'max_tokens': 5000}), limit: 200.0, num_fewshot: 5, batch_size: 1

Tasks  Version  Filter            n-shot  Metric         Value  Stderr
gsm8k        3  flexible-extract       5  exact_match ↑  0.770  Β± 0.0298
                strict-match           5  exact_match ↑  0.715  Β± 0.0320
(base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$

and here is your model - (base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$ lm_eval --model local-chat-completions --tasks gsm8k --model_args model=qwen3.5-opus-4.6,base_url=http://172.29.128.1:1234/v1/chat/completions,num_concurrent=4 --apply_chat_template --num_fewshot 5 --gen_kwargs max_tokens=5000 --output_path ./gsm8k_final_results --limit 200
2026-03-02:16:46:05 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-03-02:16:46:05 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-03-02:16:46:08 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-02:16:46:08 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-02:16:46:08 WARNING [evaluator:223] generation_kwargs: {'max_tokens': 5000} specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
2026-03-02:16:46:08 INFO [evaluator:236] Initializing local-chat-completions model, with arguments: {'model': 'qwen3.5-opus-4.6', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}
2026-03-02:16:46:08 INFO [models.api_models:172] Using max length 2048 - 1
2026-03-02:16:46:08 INFO [models.api_models:193] Using tokenizer None
2026-03-02:16:46:12 INFO [tasks:700] Selected tasks:
2026-03-02:16:46:12 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-02:16:46:12 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0, 'max_tokens': 5000}
2026-03-02:16:46:12 WARNING [evaluator:333] Overwriting default num_fewshot of gsm8k from 5 to 5
2026-03-02:16:46:12 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-03-02:16:46:12 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 200/200 [00:00<00:00, 524.23it/s]
2026-03-02:16:46:12 INFO [evaluator:584] Running generate_until requests
2026-03-02:16:46:12 INFO [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 200/200 [21:41<00:00, 6.51s/it]
2026-03-02:17:07:58 INFO [loggers.evaluation_tracker:247] Saving results aggregated
local-chat-completions ({'model': 'qwen3.5-opus-4.6', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}), gen_kwargs: ({'max_tokens': 5000}), limit: 200.0, num_fewshot: 5, batch_size: 1

Tasks  Version  Filter            n-shot  Metric         Value  Stderr
gsm8k        3  flexible-extract       5  exact_match ↑  0.940  Β± 0.0168
                strict-match           5  exact_match ↑  0.845  Β± 0.0257
(base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$
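Since each run used only 200 problems (`--limit 200`), it is worth checking the gap against the reported standard errors. A rough two-sample z-test on the flexible-extract scores (a sketch only, treating the two runs as independent samples):

```python
import math

# Reported GSM8K flexible-extract scores and stderrs (200 samples each, from the logs above).
base, base_se = 0.770, 0.0298    # original Qwen3.5 35B A3B
tuned, tuned_se = 0.940, 0.0168  # fine-tuned model

# Two-sample z statistic: difference over the combined standard error.
z = (tuned - base) / math.sqrt(base_se**2 + tuned_se**2)
print(f"z ~= {z:.1f}")  # roughly 5, i.e. well outside the error bars
```

So even at this small sample size, the gap is unlikely to be noise, though the full-benchmark runs will still be the more reliable comparison.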

Now waiting for all benchmark scores, then we can compare more reliably.

I was running at Q4_K_M precision.
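Since both models ran at the same quantization, this shouldn't bias the comparison. For anyone sizing hardware: Q4_K_M is a mixed-precision GGUF quant, and roughly 4.8 bits per weight is a commonly cited average (an assumption here, not an exact figure; the real average varies per tensor and model). A back-of-the-envelope file-size estimate for a 35B-parameter model:

```python
# Rough GGUF size estimate; 4.8 bits/weight for Q4_K_M is an approximation.
params = 35e9            # total parameters (A3B means ~3B are active per token)
bits_per_weight = 4.8    # ASSUMED average for Q4_K_M, not an exact figure
size_gib = params * bits_per_weight / 8 / 2**30
print(f"~{size_gib:.1f} GiB")  # ~19.6 GiB
```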

@Roman1111111 Fascinating! Thank you for sharing your GSM8K results. This is way better than I initially anticipated for this distillation variant. Training smaller models on frontier-LLM reasoning datasets is indeed useful for boosting reasoning performance. I am especially intrigued by the fact that it used substantially fewer tokens. Again, thank you for sharing this impressive result!

You're welcome! Wait for the upcoming --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k,bbh_cot_fewshot,triviaqa,piqa,sciq,aime25,mmlu_pro results. You actually made an impressive fine-tune.
