details about the model
could you please provide details, including which dataset you used for fine-tuning?
do you plan on testing it on benchmarks to see how well the opus 4.6 dataset aligns with it?
@Roman1111111 I indeed wanted to, but I am currently busy, so I might not be able to evaluate the model myself. I am not really expecting much from this model because the dataset is small, and there is a possibility I messed up the hyperparameters, etc. (do note that I am no expert, merely a student). But feel free to evaluate or use the model for any purpose you see fit.
oh, no worries, i will try to test it, i'm not an expert either))
@Roman1111111 Thank you! I am also curious about the model's performance, as I was quite interested in the original Qwen3.5's intelligence. I look forward to seeing how much the model's performance was boosted by the distillation.
thank you too for the model itself
Of course, just waiting for the download to finish
you are everywhere roman
u too ❤️
❤️ roman sniffing new data for the gemini collection
hello, currently i tested just the gsm8k benchmark to confirm everything works - and it showed outstanding results, with token use reduced by 40-50% and scores better by 17 points than the base on this bench. here are the scores - original qwen3.5 35b a3b: 77% / 71.5% / 35.15 min, this model: 94% / 84.5% / 21.65 min
here is the qwen3.5 original model - 77%/71.5% - (base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$ lm_eval --model local-chat-completions --tasks gsm8k --model_args model=qwen/qwen3.5-35b-a3b,base_url=http://172.29.128.1:1234/v1/chat/completions,num_concurrent=4 --apply_chat_template --num_fewshot 5 --gen_kwargs max_tokens=5000 --output_path ./gsm8k_final_results --limit 200
2026-03-02:17:09:16 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-03-02:17:09:16 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-03-02:17:09:19 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-02:17:09:19 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-02:17:09:19 WARNING [evaluator:223] generation_kwargs: {'max_tokens': 5000} specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
2026-03-02:17:09:19 INFO [evaluator:236] Initializing local-chat-completions model, with arguments: {'model': 'qwen/qwen3.5-35b-a3b', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}
2026-03-02:17:09:19 INFO [models.api_models:172] Using max length 2048 - 1
2026-03-02:17:09:19 INFO [models.api_models:193] Using tokenizer None
2026-03-02:17:09:24 INFO [tasks:700] Selected tasks:
2026-03-02:17:09:24 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-02:17:09:24 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0, 'max_tokens': 5000}
2026-03-02:17:09:24 WARNING [evaluator:333] Overwriting default num_fewshot of gsm8k from 5 to 5
2026-03-02:17:09:24 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-03-02:17:09:24 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 506.09it/s]
2026-03-02:17:09:24 INFO [evaluator:584] Running generate_until requests
2026-03-02:17:09:24 INFO [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|█████████████████████████████████████████████████████████████████| 200/200 [35:11<00:00, 10.56s/it]
2026-03-02:17:44:41 INFO [loggers.evaluation_tracker:247] Saving results aggregated
local-chat-completions ({'model': 'qwen/qwen3.5-35b-a3b', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}), gen_kwargs: ({'max_tokens': 5000}), limit: 200.0, num_fewshot: 5, batch_size: 1
| Tasks |Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k  |      3|flexible-extract|     5|exact_match|↑  |0.770|±  |0.0298|
|       |       |strict-match    |     5|exact_match|↑  |0.715|±  |0.0320|

(base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$
and here is your model - (base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$ lm_eval --model local-chat-completions --tasks gsm8k --model_args model=qwen3.5-opus-4.6,base_url=http://172.29.128.1:1234/v1/chat/completions,num_concurrent=4 --apply_chat_template --num_fewshot 5 --gen_kwargs max_tokens=5000 --output_path ./gsm8k_final_results --limit 200
2026-03-02:16:46:05 WARNING [config.evaluate_config:281] --limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.
2026-03-02:16:46:05 INFO [config.evaluate_config:301] Using default fewshot_as_multiturn=True.
2026-03-02:16:46:08 INFO [_cli.run:376] Selected Tasks: ['gsm8k']
2026-03-02:16:46:08 INFO [evaluator:211] Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234 | Setting fewshot manual seed to 1234
2026-03-02:16:46:08 WARNING [evaluator:223] generation_kwargs: {'max_tokens': 5000} specified through cli, these settings will update set parameters in yaml tasks. Ensure 'do_sample=True' for non-greedy decoding!
2026-03-02:16:46:08 INFO [evaluator:236] Initializing local-chat-completions model, with arguments: {'model': 'qwen3.5-opus-4.6', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}
2026-03-02:16:46:08 INFO [models.api_models:172] Using max length 2048 - 1
2026-03-02:16:46:08 INFO [models.api_models:193] Using tokenizer None
2026-03-02:16:46:12 INFO [tasks:700] Selected tasks:
2026-03-02:16:46:12 INFO [tasks:691] Task: gsm8k (gsm8k/gsm8k.yaml)
2026-03-02:16:46:12 INFO [evaluator:314] gsm8k: Using gen_kwargs: {'until': ['Question:', '', '<|im_end|>'], 'do_sample': False, 'temperature': 0.0, 'max_tokens': 5000}
2026-03-02:16:46:12 WARNING [evaluator:333] Overwriting default num_fewshot of gsm8k from 5 to 5
2026-03-02:16:46:12 WARNING [evaluator:490] Chat template formatting change affects loglikelihood and multiple-choice tasks. See docs/chat-template-readme.md for details.
2026-03-02:16:46:12 INFO [api.task:311] Building contexts for gsm8k on rank 0...
100%|████████████████████████████████████████████████████████████████████████████████| 200/200 [00:00<00:00, 524.23it/s]
2026-03-02:16:46:12 INFO [evaluator:584] Running generate_until requests
2026-03-02:16:46:12 INFO [models.api_models:733] Tokenized requests are disabled. Context + generation length is not checked.
Requesting API: 100%|█████████████████████████████████████████████████████████████████| 200/200 [21:41<00:00, 6.51s/it]
2026-03-02:17:07:58 INFO [loggers.evaluation_tracker:247] Saving results aggregated
local-chat-completions ({'model': 'qwen3.5-opus-4.6', 'base_url': 'http://172.29.128.1:1234/v1/chat/completions', 'num_concurrent': 4}), gen_kwargs: ({'max_tokens': 5000}), limit: 200.0, num_fewshot: 5, batch_size: 1
| Tasks |Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-------|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k  |      3|flexible-extract|     5|exact_match|↑  |0.940|±  |0.0168|
|       |       |strict-match    |     5|exact_match|↑  |0.845|±  |0.0257|

(base) romanyemelyanov@WIN-3KDMBNSD145:~/lm-evaluation-harness$
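for anyone who wants to double-check the headline numbers, here is a quick sketch that recomputes the deltas from the two runs above (scores and wall-clock times are copied straight from the logs; nothing else is measured):

```python
# Sanity-check the reported deltas between the two lm-eval runs above.
base = {"flexible": 0.770, "strict": 0.715, "seconds": 35 * 60 + 11}   # original qwen3.5 35b a3b
tuned = {"flexible": 0.940, "strict": 0.845, "seconds": 21 * 60 + 41}  # distilled model

for metric in ("flexible", "strict"):
    delta_pts = (tuned[metric] - base[metric]) * 100
    print(f"{metric}: +{delta_pts:.1f} points")
# flexible: +17.0 points
# strict: +13.0 points

time_cut = 1 - tuned["seconds"] / base["seconds"]
print(f"wall-clock reduction: {time_cut:.1%}")
# wall-clock reduction: 38.4%
```

so the 17-point claim matches the flexible-extract filter, and the wall-clock reduction (a rough proxy for tokens generated, since both runs used the same server and concurrency) comes out around 38%.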
now waiting for all the benchmark scores, then we can compare more reliably
i was running q4_k_m precision on both
@Roman1111111 Fascinating! Thank you for sharing your results on GSM8K. This is way better than I initially anticipated for this distillation variant. Training smaller models on frontier-LLM reasoning datasets is indeed useful for boosting reasoning performance. I am especially intrigued by the fact that it used substantially fewer tokens. Again, thank you for sharing this impressive result!
you're welcome, wait for the upcoming --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k,bbh_cot_fewshot,triviaqa,piqa,sciq,aime25,mmlu_pro results, and you actually made an impressive fine-tune
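for reference, the full-suite run would presumably just reuse the GSM8K invocation above with the expanded task list (the endpoint, fewshot count, and output path are assumptions carried over from the earlier runs, not a tested command):

```shell
lm_eval --model local-chat-completions \
  --tasks mmlu,hellaswag,arc_challenge,truthfulqa_mc2,winogrande,gsm8k,bbh_cot_fewshot,triviaqa,piqa,sciq,aime25,mmlu_pro \
  --model_args model=qwen3.5-opus-4.6,base_url=http://172.29.128.1:1234/v1/chat/completions,num_concurrent=4 \
  --apply_chat_template \
  --num_fewshot 5 \
  --gen_kwargs max_tokens=5000 \
  --output_path ./full_suite_results
```

note that without --limit this evaluates every task in full, so expect it to take much longer than the 200-sample GSM8K runs.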