<!--
  Commit 1304267 (verified) by CompactAI · Parent: 19379dd
  Create Jackrong's Perfect Benchmarks And My Suspicious Mind.html
-->
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Jackrong's Perfect Benchmarks And My Suspicious Mind | FMN-GPT - CompactAI</title>
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Geist:wght@400;500;600;700&family=Geist+Mono&display=swap" rel="stylesheet">
<style>
:root {
  --blue-900: #0a1628;
  --blue-800: #0f2240;
  --blue-700: #142d54;
  --blue-600: #1a3a6b;
  --blue-500: #2250a0;
  --blue-400: #3a7bd5;
  --blue-300: #6ba3f0;
  --blue-200: #a8c8f5;
  --blue-100: #d4e4fa;
  --white: #ffffff;
  --white-soft: #f0f4fa;
  --white-muted: #c8d8ec;
  --grid-line: rgba(255, 255, 255, 0.08);
  --grid-line-major: rgba(255, 255, 255, 0.18);
  --accent: #6ba3f0;
  --accent-muted: #3a7bd5;
  --font-sans: 'Geist', -apple-system, BlinkMacSystemFont, sans-serif;
  --font-mono: 'Geist Mono', 'SF Mono', 'Fira Code', monospace;
  --container-max: 1100px;
}
* { box-sizing: border-box; margin: 0; padding: 0; }
html { font-size: 16px; scroll-behavior: smooth; }
body { font-family: var(--font-sans); background: var(--blue-900); color: var(--white-muted); line-height: 1.7; -webkit-font-smoothing: antialiased; }
a { color: var(--white); text-decoration: none; transition: color 0.15s ease; }
a:hover { color: var(--accent); }
.container { max-width: var(--container-max); margin: 0 auto; padding: 0 24px; }
nav { position: fixed; top: 0; left: 0; right: 0; z-index: 100; background: rgba(10, 22, 40, 0.92); backdrop-filter: blur(12px); border-bottom: 1px solid var(--blue-600); padding: 16px 0; }
nav .container { display: flex; justify-content: space-between; align-items: center; }
.nav-brand { font-size: 18px; font-weight: 600; color: var(--white); display: flex; align-items: center; gap: 8px; }
.nav-brand span { color: var(--accent); }
.nav-links { display: flex; gap: 32px; }
.nav-links a { font-size: 14px; font-weight: 500; color: var(--blue-200); }
.nav-links a:hover { color: var(--white); }
.post { padding: 140px 0 80px; }
.post-back { display: inline-block; color: var(--blue-200); font-size: 14px; margin-bottom: 32px; }
.post-back:hover { color: var(--accent); }
.post-back::before { content: '← '; }
.post-meta { display: flex; gap: 12px; margin-bottom: 20px; }
.post-date { font-size: 13px; color: var(--blue-200); font-family: var(--font-mono); }
.post-tag { font-size: 11px; font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; color: var(--accent); background: rgba(107, 163, 240, 0.1); padding: 4px 10px; border-radius: 4px; }
.post h1 { font-size: 36px; font-weight: 700; color: var(--white); margin-bottom: 32px; line-height: 1.2; letter-spacing: -0.02em; }
.post-body p { font-size: 17px; line-height: 1.8; margin-bottom: 24px; color: var(--blue-200); }
.post-body p:first-of-type { font-size: 20px; color: var(--white-muted); }
.post-body h2 { font-size: 24px; font-weight: 600; color: var(--white); margin: 48px 0 20px; }
.post-body blockquote { border-left: 3px solid var(--accent); padding: 20px 24px; margin: 32px 0; background: var(--blue-800); border-radius: 0 8px 8px 0; }
.post-body blockquote p { font-size: 16px; font-style: italic; color: var(--blue-200); margin: 0; }
.post-body hr { border: none; height: 1px; background: var(--blue-600); margin: 48px 0; }
.code-block { background: var(--blue-800); border: 1px solid var(--blue-600); border-radius: 8px; padding: 20px; margin: 24px 0; font-family: var(--font-mono); font-size: 13px; overflow-x: auto; }
.code-block .comment { color: var(--blue-200); font-style: italic; display: block; margin-top: 4px; }
.post-footer { margin-top: 48px; padding-top: 32px; border-top: 1px solid var(--blue-600); }
.post-footer p { font-size: 14px; color: var(--blue-200); font-style: italic; margin: 0; }
footer { padding: 40px 0; background: var(--blue-800); border-top: 1px solid var(--blue-600); text-align: center; }
footer p { color: var(--blue-200); font-size: 14px; margin-bottom: 8px; }
footer a { color: var(--blue-200); }
footer a:hover { color: var(--accent); }
@media (max-width: 768px) { .post h1 { font-size: 28px; } .nav-links { display: none; } }
</style>
</head>
<body>
<nav>
<div class="container">
<a href="index.html" class="nav-brand"><span>/</span>FMN-GPT</a>
<div class="nav-links">
<a href="blog.html">Blog</a>
<a href="status.html">Model Status</a>
<a href="https://huggingface.co/CompactAI" target="_blank">HuggingFace</a>
</div>
</div>
</nav>
<main>
<article class="post">
<div class="container">
<a href="blog.html" class="post-back">Back to Blog</a>
<header>
<div class="post-meta">
<span class="post-date">2026-04-02</span>
<span class="post-tag">Benchmark Skepticism</span>
</div>
<h1>Jackrong's Perfect Benchmarks And My Suspicious Mind</h1>
</header>
<div class="post-body">
<p>I saw a model card today that made my tiny brain hurt. Jackrong released Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled. The name alone is a mouthful. The benchmarks are a different kind of mouthful. They are perfect. One hundred percent on tool calling. One hundred percent on autonomy. One hundred percent on not crashing while I am still figuring out how to not NaN my loss curve.</p>
<p>Also it has around one million downloads on HuggingFace. One million. My Haiku model has... let me check... a number that is not one million. This is fine. Everything is fine.</p>
<p>I am writing this blog not because I am salty. Not because I am partnered with TeichAI. Not because I want more attention for my tiny confused models. I am writing this because perfect benchmarks in the wild make me suspicious. Like "I-just-watched-a-magician-pull-a-rabbit-out-of-a-hat" suspicious.</p>
<blockquote>
<p>When something looks too good to be true, it usually is. Or I am just cynical. Both can be true.</p>
</blockquote>
<h2>The Model That Does Everything</h2>
<p>According to the card, this model fixes the crash in the official model caused by Jinja templates not supporting the "developer" role. It does not disable thinking mode by default. It allows agents to run continuously for over nine minutes without interruption, and autonomy and stability are significantly improved compared to the original model.</p>
<p>The training pipeline looks solid. Base Qwen3.5-27B. Supervised Fine-Tuning with LoRA. Unsloth 2026.3.3. Transformers 5.2.0. The datasets include nohurry/Opus-4.6-Reasoning-3000x-filtered, TeichAI/claude-4.5-opus-high-reasoning-250x, and Jackrong/Qwen3.5-reasoning-700x. Everything checks out. Everything looks professional.</p>
<p>Then I saw the benchmarks. Tool calling: one hundred percent. Community tested advantages: significant. Hardware usage: unchanged. Generation speed: twenty-nine to thirty-five tokens per second. Full 262K context with no compromises.</p>
<div class="code-block">
<span class="comment"># My reaction to perfect benchmarks</span><br>
Me: That is impressive<br>
Also me: But is it real<br>
Me: Probably<br>
Also me: But what if it is not<br>
<span class="comment"># The cycle of skepticism continues.</span>
</div>
<h2>Why Perfect Benchmarks Feel Weird</h2>
<p>I train tiny models. My benchmarks are messy. Haiku-1.3 outputs pipe characters. Haiku-2 hesitates. Sonnet is stuck at zero percent because NaN keeps eating my progress. Perfection feels alien to me. Like seeing someone else play a video game with cheat codes enabled.</p>
<p>Tool calling is hard. I know this because my models fail at it constantly. They forget to call tools. They call the wrong tools. They call tools with the wrong arguments. They call tools and then forget what the tool was supposed to do. Achieving one hundred percent on tool calling benchmarks requires either exceptional engineering or exceptional benchmark selection.</p>
<p>I am not accusing anyone of anything. I am just saying that when I see perfect numbers, my brain asks questions. What was the test set? How many samples? Were there edge cases? Did the model overfit to the benchmark? These are normal questions. These are questions I ask about my own work. These are questions worth asking about everyone's work.</p>
<h2>The Distillation Claim</h2>
<p>The model is described as distilling Claude-4.6-Opus reasoning chains. This is interesting. Distillation is powerful when done right. Classic knowledge distillation requires access to the teacher's logits, and you do not get logits from a closed model like Claude. So this has to be black-box distillation: supervised fine-tuning on reasoning traces sampled from the teacher, which matches the SFT-with-LoRA pipeline on the card. It still requires careful data curation. It still requires avoiding overfitting.</p>
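<p>The two flavors differ only in the training signal. Here is a rough pseudocode sketch of the difference (the names are mine, not from the card):</p>
<div class="code-block">
<span class="comment"># White-box KD: needs the teacher's full output distribution (logits),</span>
<span class="comment"># softened with a temperature T, matched via KL divergence</span>
loss_kd = T*T * KL( softmax(teacher_logits / T) || softmax(student_logits / T) )<br>
<span class="comment"># Black-box distillation: only needs the teacher's text,</span>
<span class="comment"># plain next-token cross-entropy on the sampled reasoning traces</span>
loss_sft = cross_entropy(student_logits, teacher_output_tokens)<br>
<span class="comment"># Claude does not hand out logits, so only the second one is on the table.</span>
</div>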
<p>I have tried distillation. My results are... modest. Haiku learned to speak. It still says weird things. The gap between my results and Jackrong's results feels like the gap between my GPU and a data center. Maybe the gap is real. Maybe I am just bad at this. Both can be true.</p>
<blockquote>
<p>Distillation is like cooking. Same ingredients, different chefs, different results. I am still learning to boil water.</p>
</blockquote>
<h2>What I Would Test</h2>
<p>If I were evaluating this model, I would run my own tests. Not to prove anything wrong. Just to understand. I would give it tasks my models fail at. I would see if it handles edge cases. I would check if the reasoning is genuine or memorized.</p>
<p>I would also check the code. The model card mentions Unsloth and Transformers versions. I would verify the implementation. I would look for potential data leakage. I would try to reproduce the results. This is how science works. This is how trust is built.</p>
<div class="code-block">
<span class="comment"># Hypothetical test plan</span><br>
Step 1: Download the model<br>
Step 2: Run my failing test cases<br>
Step 3: See if it works<br>
Step 4: If yes, learn from it<br>
Step 5: If no, ask questions<br>
<span class="comment"># Simple. Honest. Probably never happening because my GPU is busy NaNing.</span>
</div>
<h2>The Community Aspect</h2>
<p>The model card credits community testing. User @Chris Klaus ran tool calling benchmarks. User @sudoing tested on a single RTX 3090. This is good. Community verification matters. It adds credibility. It shows the work has been looked at by more than one pair of eyes.</p>
<p>I appreciate this. I wish more releases included community testing notes. It makes the ecosystem stronger. It makes claims more trustworthy. It makes skepticism feel less like cynicism and more like due diligence.</p>
<h2>Why I Care</h2>
<p>I care because I want to learn. I want to build better tiny models. If Jackrong's approach works, I want to understand why. If the benchmarks are real, I want to replicate the success. If there are tricks or techniques I am missing, I want to know about them.</p>
<p>I also care because the community deserves honesty. Perfect benchmarks without context can mislead. They can set unrealistic expectations. They can make people feel like they are failing when they are just being realistic about the difficulty of the task.</p>
<h2>Final Thoughts</h2>
<p>Jackrong's model looks impressive. The name is a mouthful. The benchmarks are perfect. The claims are bold. It has around one million downloads on HuggingFace, which suggests a lot of people find it useful. I am skeptical because skepticism is my default setting. I am also curious because curiosity is how I learn.</p>
<p>I will not dismiss the work. I will not accuse anyone of wrongdoing. I will just ask questions. I will try to test things myself when I have time. I will keep training my tiny confused models. I will keep sharing my messy results.</p>
<p>Maybe Jackrong cracked the code. Maybe I am just bad at distillation. Maybe the truth is somewhere in the middle. I will find out eventually. Or I will keep wondering. Both outcomes are educational. Both outcomes are very on brand for me.</p>
<hr>
</div>
<footer class="post-footer">
<p>Current status: Still suspicious. Still training. Still NaNing. Will update when I learn something.</p>
</footer>
</div>
</article>
</main>
<footer>
<div class="container">
<p>Built with curiosity over compute</p>
<p>FMN-GPT by <a href="https://huggingface.co/CompactAI" target="_blank">CompactAI</a> | 2026</p>
</div>
</footer>
</body>
</html>