Jellyfish042 and Claude Opus 4.5 committed
Commit 9afeeeb · 0 parents

Initial commit: UncheatableEval LossLens visualization


- Compare Qwen3-1.7B-Base vs RWKV7-G1C-1.5B byte-level predictions
- Interactive HTML visualization with hover tooltips
- Support both CPU and GPU execution
- Gradio web interface for HuggingFace Spaces

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.claude/settings.local.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "permissions": {
+     "allow": [
+       "Bash(git init:*)",
+       "Bash(git add:*)"
+     ]
+   }
+ }
.gitattributes ADDED
@@ -0,0 +1,3 @@
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ title: UncheatableEval Visualization
+ emoji: 🔬
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ # UncheatableEval: Qwen3 vs RWKV7 Byte-Level Comparison
+
+ Compare the byte-level prediction performance of **Qwen3-1.7B-Base** and **RWKV7-G1C-1.5B**.
+
+ ## Features
+
+ - **Byte-level analysis**: See exactly where each model performs better or worse
+ - **Interactive visualization**: Hover over tokens to see detailed predictions
+ - **Color-coded comparison**:
+   - 🟢 Green = Qwen3 predicts better (lower loss)
+   - 🔴 Red = RWKV7 predicts better (lower loss)
+ - **Top-10 predictions**: View each model's top predictions for every token
+ - **Word occurrence linking**: See how repeated words are predicted differently
+
+ ## How to Use
+
+ 1. Enter or paste your text (max 4000 characters)
+ 2. Click "Run Comparison"
+ 3. Explore the interactive visualization
+ 4. Download the HTML file for offline viewing
+
+ ## Models
+
+ | Model | Type | Parameters | Architecture |
+ |-------|------|------------|--------------|
+ | Qwen3-1.7B-Base | Transformer | 1.7B | Dense attention |
+ | RWKV7-G1C-1.5B | RWKV | 1.5B | Linear attention |
+
+ ## Technical Details
+
+ This tool uses the [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) framework to:
+
+ 1. Tokenize the input text with each model's tokenizer
+ 2. Calculate per-token cross-entropy loss
+ 3. Map token losses to byte-level losses
+ 4. Generate an interactive HTML visualization
+
+ ## Local Development
+
+ ```bash
+ # Clone the repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/UncheatableEval-Visualization
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run locally
+ python app.py
+ ```
+
+ ## Requirements
+
+ - Python 3.10+
+ - CUDA-capable GPU (16GB+ VRAM) recommended; CPU execution is supported but slower
+ - See `requirements.txt` for package dependencies
+
+ ## License
+
+ MIT License
+
+ ## Acknowledgments
+
+ - [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) - Original evaluation framework
+ - [Qwen](https://github.com/QwenLM/Qwen) - Qwen model family
+ - [RWKV](https://github.com/BlinkDL/RWKV-LM) - RWKV model family
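Step 3 of "Technical Details" amounts to spreading each token's cross-entropy loss uniformly over that token's bytes. A minimal sketch of the idea (an illustration only, not the actual UncheatableEval code, and assuming every token covers at least one byte):

```python
def token_losses_to_byte_losses(token_losses, per_token_bytes):
    """Spread each token's loss uniformly over the bytes it covers.

    Assumes every token maps to at least one byte.
    """
    byte_losses = []
    for loss, token_bytes in zip(token_losses, per_token_bytes):
        per_byte = loss / len(token_bytes)
        byte_losses.extend([per_byte] * len(token_bytes))
    return byte_losses

# "Hi!" as two tokens: "Hi" (bytes 72, 105; loss 1.0) and "!" (byte 33; loss 0.3)
byte_losses = token_losses_to_byte_losses([1.0, 0.3], [[72, 105], [33]])
```

Because every byte of a token inherits the same share of that token's loss, models with very different tokenizers can still be compared byte by byte.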
app.py ADDED
@@ -0,0 +1,362 @@
+ """
+ UncheatableEval Visualization - Hugging Face Space
+
+ Compare byte-level prediction performance between Qwen3-1.7B-Base and RWKV7-G1C-1.5B.
+ """
+
+ import gc
+ import os
+ import tempfile
+ from pathlib import Path
+
+ import gradio as gr
+ import torch
+
+ # Detect device
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ IS_CPU = DEVICE == "cpu"
+
+ # Model configuration
+ QWEN_MODEL_ID = "Qwen/Qwen3-1.7B-Base"
+ RWKV_MODEL_URL = "https://huggingface.co/BlinkDL/rwkv7-g1/resolve/main/rwkv7-g1c-1.5b-20260110-ctx8192.pth"
+ RWKV_MODEL_FILENAME = "rwkv7-g1c-1.5b-20260110-ctx8192.pth"
+
+ # Get the directory where this script is located
+ SCRIPT_DIR = Path(__file__).parent.absolute()
+ MODELS_DIR = SCRIPT_DIR / "models"
+ SUPPORT_DIR = SCRIPT_DIR / "support"
+
+ # Text length limits
+ MAX_TEXT_LENGTH = 4000
+ MIN_TEXT_LENGTH = 10
+
+ # Example texts
+ EXAMPLE_NEWS = """The rapid advancement of artificial intelligence has sparked both excitement and concern among researchers worldwide. While AI systems demonstrate remarkable capabilities in language understanding and generation, questions remain about their potential impact on employment and society."""
+
+ EXAMPLE_CODE = """def fibonacci(n):
+     if n <= 1:
+         return n
+     return fibonacci(n-1) + fibonacci(n-2)
+
+ # Calculate first 10 Fibonacci numbers
+ for i in range(10):
+     print(f"F({i}) = {fibonacci(i)}")"""
+
+ EXAMPLE_LITERATURE = """It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness."""
+
+
+ def download_rwkv_model(progress=None):
+     """Download the RWKV7 model if it does not already exist."""
+     from huggingface_hub import hf_hub_download
+
+     model_path = MODELS_DIR / RWKV_MODEL_FILENAME
+
+     if model_path.exists():
+         return str(model_path)
+
+     MODELS_DIR.mkdir(parents=True, exist_ok=True)
+
+     if progress:
+         progress(0.1, desc="Downloading RWKV7 model...")
+
+     # Download from HuggingFace Hub
+     downloaded_path = hf_hub_download(
+         repo_id="BlinkDL/rwkv7-g1",
+         filename=RWKV_MODEL_FILENAME,
+         local_dir=str(MODELS_DIR),
+         local_dir_use_symlinks=False
+     )
+
+     return downloaded_path
+
+
+ def load_qwen_model():
+     """Load the Qwen3-1.7B-Base model."""
+     from transformers import AutoTokenizer, AutoModelForCausalLM
+
+     tokenizer = AutoTokenizer.from_pretrained(
+         QWEN_MODEL_ID,
+         trust_remote_code=True
+     )
+
+     # Configure based on device
+     if IS_CPU:
+         model_kwargs = {
+             "torch_dtype": torch.float32,
+             "device_map": None,
+             "trust_remote_code": True,
+             "low_cpu_mem_usage": True
+         }
+         model = AutoModelForCausalLM.from_pretrained(
+             QWEN_MODEL_ID,
+             **model_kwargs
+         ).eval()
+     else:
+         model_kwargs = {
+             "torch_dtype": torch.bfloat16,
+             "device_map": "auto",
+             "trust_remote_code": True
+         }
+         try:
+             # Prefer FlashAttention 2 when available
+             model = AutoModelForCausalLM.from_pretrained(
+                 QWEN_MODEL_ID,
+                 attn_implementation="flash_attention_2",
+                 **model_kwargs
+             ).eval()
+         except Exception:
+             model = AutoModelForCausalLM.from_pretrained(
+                 QWEN_MODEL_ID,
+                 **model_kwargs
+             ).eval()
+
+     return model, tokenizer
+
+
+ def load_rwkv7_model(model_path: str):
+     """Load the RWKV7-G1C-1.5B model."""
+     os.environ["RWKV_JIT_ON"] = "1"
+     os.environ["RWKV_V7_ON"] = "1"
+
+     # Set the CUDA flag based on device
+     if IS_CPU:
+         os.environ["RWKV_CUDA_ON"] = "0"
+     else:
+         os.environ["RWKV_CUDA_ON"] = "1"
+
+     from rwkv.model import RWKV
+     from rwkv.rwkv_tokenizer import TRIE_TOKENIZER
+
+     # Use the appropriate strategy for the device
+     if IS_CPU:
+         strategy = "cpu fp32"
+     else:
+         strategy = "cuda fp16"
+
+     model = RWKV(model=model_path, strategy=strategy)
+
+     vocab_path = str(SUPPORT_DIR / "rwkv_vocab_v20230424.txt")
+     tokenizer = TRIE_TOKENIZER(vocab_path)
+
+     return model, tokenizer
+
+
+ def validate_input(text: str) -> tuple[bool, str]:
+     """Validate the input text."""
+     if not text or not text.strip():
+         return False, "Please enter some text to analyze."
+
+     text = text.strip()
+
+     if len(text) < MIN_TEXT_LENGTH:
+         return False, f"Text is too short. Minimum {MIN_TEXT_LENGTH} characters required."
+
+     if len(text) > MAX_TEXT_LENGTH:
+         return False, f"Text is too long. Maximum {MAX_TEXT_LENGTH} characters allowed. Current: {len(text)}"
+
+     return True, text
+
+
+ def wrap_html_in_iframe(html: str) -> str:
+     """Wrap HTML in an iframe for Gradio display."""
+     escaped = html.replace('"', '&quot;')
+     return f'''
+     <div style="width:100%;height:700px;border:1px solid #ddd;border-radius:8px;overflow:hidden;">
+         <iframe srcdoc="{escaped}"
+                 style="width:100%;height:100%;border:none;"
+                 sandbox="allow-scripts"></iframe>
+     </div>
+     '''
+
+
+ def run_evaluation(text: str, progress=gr.Progress()):
+     """Run evaluation on both models and generate the visualization."""
+     from core.evaluator import evaluate_hf_single_sample, evaluate_rwkv7_single_sample
+     from visualization.html_generator import generate_comparison_html
+
+     # Validate input
+     valid, result = validate_input(text)
+     if not valid:
+         raise gr.Error(result)
+
+     text = result  # Use cleaned text
+
+     try:
+         # Step 1: Download the RWKV model if needed
+         progress(0.05, desc="Checking RWKV7 model...")
+         rwkv_model_path = download_rwkv_model(progress)
+
+         # Step 2: Load the Qwen model
+         progress(0.1, desc="Loading Qwen3-1.7B-Base...")
+         qwen_model, qwen_tokenizer = load_qwen_model()
+
+         # Step 3: Evaluate Qwen
+         progress(0.3, desc="Evaluating with Qwen3...")
+         result_qwen = evaluate_hf_single_sample(
+             qwen_model,
+             qwen_tokenizer,
+             text,
+             bos_mode="add_newline_token"
+         )
+
+         # Step 4: Free Qwen memory
+         progress(0.4, desc="Freeing memory...")
+         del qwen_model
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+
+         # Step 5: Load the RWKV7 model
+         progress(0.5, desc="Loading RWKV7-G1C-1.5B...")
+         rwkv_model, rwkv_tokenizer = load_rwkv7_model(rwkv_model_path)
+
+         # Step 6: Evaluate RWKV7
+         progress(0.7, desc="Evaluating with RWKV7...")
+         result_rwkv = evaluate_rwkv7_single_sample(
+             rwkv_model,
+             rwkv_tokenizer,
+             text
+         )
+
+         # Step 7: Free RWKV memory
+         progress(0.8, desc="Freeing memory...")
+         del rwkv_model
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+
+         # Step 8: Generate the visualization
+         progress(0.9, desc="Generating visualization...")
+         html = generate_comparison_html(
+             text=text,
+             byte_losses_a=result_qwen["byte_wise_losses"],
+             byte_losses_b=result_rwkv["byte_wise_losses"],
+             model_a_name="Qwen3-1.7B-Base",
+             model_b_name="RWKV7-G1C-1.5B",
+             topk_predictions_a=result_qwen["top5_predictions"],
+             topk_predictions_b=result_rwkv["top5_predictions"],
+             tokenizer_a=result_qwen["tokenizer"],
+             tokenizer_b=result_rwkv["tokenizer"],
+             model_type_a="hf",
+             model_type_b="rwkv7"
+         )
+
+         # Wrap HTML for iframe display
+         wrapped_html = wrap_html_in_iframe(html)
+
+         # Save HTML for download
+         temp_file = tempfile.NamedTemporaryFile(
+             mode='w',
+             suffix='.html',
+             delete=False,
+             encoding='utf-8'
+         )
+         temp_file.write(html)
+         temp_file.close()
+
+         progress(1.0, desc="Done!")
+
+         return wrapped_html, temp_file.name
+
+     except torch.cuda.OutOfMemoryError:
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+         raise gr.Error(
+             "GPU memory insufficient. Please try:\n"
+             "1. Use shorter text\n"
+             "2. Wait a moment and try again"
+         )
+     except Exception as e:
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+         raise gr.Error(f"Evaluation failed: {str(e)}")
+
+
+ def clear_inputs():
+     """Clear all inputs and outputs."""
+     return "", None, None
+
+
+ # Build the Gradio UI
+ with gr.Blocks(
+     title="UncheatableEval: Qwen3 vs RWKV7",
+     theme=gr.themes.Soft(),
+     css="""
+     .example-btn {
+         margin: 2px !important;
+     }
+     """
+ ) as demo:
+     gr.Markdown("""
+     # 🔬 UncheatableEval: Qwen3 vs RWKV7 Byte-Level Comparison
+
+     Compare the byte-level prediction performance of **Qwen3-1.7B-Base** and **RWKV7-G1C-1.5B**.
+
+     - **Green** = Qwen3 predicts better (lower loss)
+     - **Red** = RWKV7 predicts better (lower loss)
+     - **Hover** over tokens to see detailed predictions and compression rates
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             text_input = gr.Textbox(
+                 label="Input Text",
+                 placeholder=f"Enter text to analyze (max {MAX_TEXT_LENGTH} characters)...",
+                 lines=10,
+                 max_lines=20,
+             )
+
+             gr.Markdown("**Examples:**")
+             with gr.Row():
+                 news_btn = gr.Button("📰 News", size="sm", elem_classes=["example-btn"])
+                 code_btn = gr.Button("💻 Code", size="sm", elem_classes=["example-btn"])
+                 lit_btn = gr.Button("📚 Literature", size="sm", elem_classes=["example-btn"])
+
+             with gr.Row():
+                 clear_btn = gr.Button("Clear", variant="secondary")
+                 run_btn = gr.Button("▶ Run Comparison", variant="primary")
+
+     gr.Markdown("---")
+
+     with gr.Row():
+         with gr.Column():
+             output_html = gr.HTML(label="Visualization")
+             download_file = gr.File(label="Download HTML", visible=True)
+
+     # Event handlers
+     news_btn.click(fn=lambda: EXAMPLE_NEWS, outputs=[text_input])
+     code_btn.click(fn=lambda: EXAMPLE_CODE, outputs=[text_input])
+     lit_btn.click(fn=lambda: EXAMPLE_LITERATURE, outputs=[text_input])
+
+     clear_btn.click(
+         fn=clear_inputs,
+         outputs=[text_input, output_html, download_file]
+     )
+
+     run_btn.click(
+         fn=run_evaluation,
+         inputs=[text_input],
+         outputs=[output_html, download_file]
+     )
+
+     gr.Markdown("""
+     ---
+     ### About
+
+     This tool uses [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) to compare
+     language model performance at the byte level.
+
+     **Models:**
+     - **Qwen3-1.7B-Base**: Transformer-based model from Alibaba
+     - **RWKV7-G1C-1.5B**: Linear-attention model from the RWKV team
+
+     **How it works:**
+     1. Both models predict each byte in the input text
+     2. Lower prediction loss = better compression = better understanding
+     3. The visualization shows where each model performs better or worse
+     """)
+
+
+ if __name__ == "__main__":
+     demo.launch()
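`run_evaluation` above loads only one model at a time, evaluating and freeing it before loading the next, so peak memory stays at roughly one model. The pattern in isolation, with a hypothetical `FakeModel` standing in for the real Qwen/RWKV loaders:

```python
import gc

class FakeModel:
    """Hypothetical stand-in for a large model; the app loads Qwen/RWKV here."""
    def evaluate(self, text):
        # Pretend the "score" is just the UTF-8 byte length of the text
        return len(text.encode("utf-8"))

def run_sequentially(texts):
    """Load one model at a time, evaluate, then free it before the next load,
    mirroring the app's Step 2-7 flow that keeps peak memory at one model."""
    results = []
    for name in ["model_a", "model_b"]:
        model = FakeModel()
        results.append((name, [model.evaluate(t) for t in texts]))
        del model     # drop the only reference...
        gc.collect()  # ...and reclaim memory before the next load
    return results
```

In the real app the `del` is followed by `torch.cuda.empty_cache()` as well, since PyTorch caches freed GPU blocks rather than returning them to the driver.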
core/__init__.py ADDED
@@ -0,0 +1 @@
+ # Core evaluation modules
core/evaluator.py ADDED
@@ -0,0 +1,270 @@
+ """
+ Evaluator module for UncheatableEval visualization.
+
+ Provides single-sample evaluation functions for Qwen3 and RWKV7 models.
+ """
+
+ import gc
+ import math
+ import os
+ from typing import List, Dict, Any, Optional
+
+ import torch
+ import torch.nn.functional as F
+
+ from .helpers import TokenizerBytesConverter
+
+
+ # Compression rate conversion factor: nats -> bits (1/ln 2) -> bytes (x 0.125) -> percent (x 100)
+ COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0
+
+
+ def get_device():
+     """Get the best available device."""
+     if torch.cuda.is_available():
+         return "cuda"
+     else:
+         return "cpu"
+
+
+ def calculate_log_sum(logits: torch.Tensor, target_token_ids: torch.Tensor) -> torch.Tensor:
+     """Calculate the cross-entropy loss for each token."""
+     # Use bfloat16 on CUDA, float32 for CPU compatibility
+     if logits.device.type == "cuda":
+         return F.cross_entropy(logits[:-1].to(torch.bfloat16), target_token_ids[1:], reduction="none")
+     else:
+         return F.cross_entropy(logits[:-1].float(), target_token_ids[1:], reduction="none")
+
+
+ def extract_topk_predictions(logit: torch.Tensor, target_ids: torch.Tensor, k: int = 10) -> List:
+     """
+     Extract top-k predictions from logits.
+
+     Args:
+         logit: [seq_length, vocab_size] logit tensor
+         target_ids: [seq_length] actual target token IDs
+         k: number of top predictions to extract (default: 10)
+
+     Returns:
+         list: [[actual_id, rank, [[id1, prob1], [id2, prob2], ...]], ...]
+     """
+     probs = F.softmax(logit, dim=-1)
+     top_probs, top_ids = torch.topk(probs, k, dim=-1)
+
+     results = []
+     for pos in range(logit.shape[0]):
+         target_id = target_ids[pos].item()
+         actual_prob = probs[pos, target_id].item()
+         rank = (probs[pos] > actual_prob).sum().item() + 1
+
+         topk_list = [
+             [top_ids[pos, i].item(), round(top_probs[pos, i].item(), 6)]
+             for i in range(k)
+         ]
+         results.append([target_id, rank, topk_list])
+
+     return results
+
+
+ def count_model_parameters_in_billions(model) -> float:
+     """Count model parameters in billions."""
+     total_params = sum(p.numel() for p in model.parameters())
+     return total_params / 1e9
+
+
+ def count_rwkv_parameters_in_billions(rwkv_model) -> float:
+     """Count RWKV model parameters in billions."""
+     total_params = 0
+     if hasattr(rwkv_model, "z"):
+         for param in rwkv_model.z.values():
+             total_params += param.numel()
+     if hasattr(rwkv_model, "w"):
+         for param in rwkv_model.w.values():
+             total_params += param.numel()
+     return total_params / 1e9
+
+
+ @torch.no_grad()
+ def evaluate_hf_single_sample(
+     model,
+     tokenizer,
+     text: str,
+     bos_mode: str = "add_newline_token"
+ ) -> Dict[str, Any]:
+     """
+     Evaluate a HuggingFace model on a single text sample.
+
+     Args:
+         model: HuggingFace model
+         tokenizer: HuggingFace tokenizer
+         text: Input text to evaluate
+         bos_mode: How to handle the BOS token
+
+     Returns:
+         dict with byte_wise_losses, top5_predictions, compression_rate, etc.
+     """
+     # Create a token-to-bytes converter
+     token2bytes_converter = TokenizerBytesConverter(
+         model_name_or_path=tokenizer.name_or_path,
+         tokenizer=tokenizer
+     )
+
+     # Determine the BOS token
+     if bos_mode in ["add_default_bos", "replace_with_bos"]:
+         bos_token = tokenizer.bos_token_id
+     elif bos_mode in ["add_default_eos", "replace_with_eos"]:
+         bos_token = tokenizer.eos_token_id
+     elif bos_mode in ["add_newline_token", "replace_with_newline_token"]:
+         bos_token = tokenizer.encode("\n")[0]
+     else:
+         bos_token = tokenizer.bos_token_id
+
+     bos_tensor = torch.tensor([bos_token], device=model.device).unsqueeze(0)
+
+     # Tokenize the input
+     inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
+     inputs = inputs.to(model.device)
+     seq_length = inputs["input_ids"].shape[-1]
+
+     if seq_length < 2:
+         raise ValueError(f"Text is too short (only {seq_length} tokens)")
+
+     # Forward pass
+     input_chunk = inputs["input_ids"]
+     if bos_mode in ["add_default_bos", "add_default_eos", "add_newline_token"]:
+         input_chunk = torch.cat([bos_tensor, input_chunk], dim=-1)
+     if bos_mode in ["replace_with_bos", "replace_with_eos", "replace_with_newline_token"]:
+         input_chunk[0, 0] = bos_token
+
+     logit = model.forward(input_ids=input_chunk).logits[0, :, :]
+     loss = calculate_log_sum(logit, input_chunk.squeeze(0))
+
+     # Get per-token bytes
+     per_token_bytes = token2bytes_converter.encode_to_bytes(text)
+
+     # Verify that the bytes match
+     all_bytes = [byte for token in per_token_bytes for byte in token]
+     expected_bytes = list(text.encode("utf-8"))
+     if all_bytes != expected_bytes:
+         raise ValueError("Token bytes don't match original text bytes")
+
+     # Extract top-k predictions
+     sample_topk = extract_topk_predictions(
+         logit[:-1], input_chunk.squeeze(0)[1:]
+     )
+
+     # Calculate byte-wise losses
+     byte_wise_losses = []
+     pending_loss = 0.0
+
+     for l, byte_values in zip(loss, per_token_bytes):
+         current_loss = l.item() + pending_loss
+         pending_loss = 0.0
+
+         if len(byte_values) == 0:
+             # Carry the loss of a zero-byte token over to the next token
+             pending_loss = current_loss
+             continue
+
+         per_byte_loss = current_loss / len(byte_values)
+         for _ in range(len(byte_values)):
+             byte_wise_losses.append(per_byte_loss)
+
+     # Calculate overall metrics
+     total_loss = loss.sum().item()
+     num_bytes = len(text.encode("utf-8"))
+     avg_loss = total_loss / seq_length
+     compression_rate = avg_loss * COMPRESSION_RATE_FACTOR
+
+     return {
+         "byte_wise_losses": byte_wise_losses,
+         "top5_predictions": sample_topk,
+         "compression_rate": compression_rate,
+         "total_loss": total_loss,
+         "num_tokens": seq_length,
+         "num_bytes": num_bytes,
+         "model_name": getattr(model.config, "_name_or_path", "unknown"),
+         "tokenizer": tokenizer
+     }
+
+
+ @torch.no_grad()
+ def evaluate_rwkv7_single_sample(
+     model,
+     tokenizer,
+     text: str
+ ) -> Dict[str, Any]:
+     """
+     Evaluate an RWKV7 model on a single text sample.
+
+     Args:
+         model: RWKV7 model
+         tokenizer: RWKV tokenizer (TRIE_TOKENIZER)
+         text: Input text to evaluate
+
+     Returns:
+         dict with byte_wise_losses, top5_predictions, compression_rate, etc.
+     """
+     # Tokenize
+     tokenized = tokenizer.encode(text)
+     if hasattr(tokenized, "ids"):
+         input_seq = tokenized.ids
+     else:
+         input_seq = tokenized
+
+     input_length = len(input_seq)
+
+     if input_length < 2:
+         raise ValueError(f"Text is too short (only {input_length} tokens)")
+
+     # Forward pass with state, in chunks
+     input_chunk = [0] + input_seq  # Add BOS token (0)
+     device = get_device()
+
+     CHUNK_LEN = 1024
+     state = None
+     logit = torch.empty((0, 65536), device=device)
+
+     temp_input = input_chunk.copy()
+     while len(temp_input) > 0:
+         out, state = model.forward(temp_input[:CHUNK_LEN], state, full_output=True)
+         if len(temp_input) == 1:
+             out = out.unsqueeze(0)
+         temp_input = temp_input[CHUNK_LEN:]
+         logit = torch.concat((logit, out.to(device)), dim=0)
+
+     if len(input_chunk) == 1:
+         logit = logit.unsqueeze(0)
+
+     loss = calculate_log_sum(logit, torch.tensor(input_chunk).to(device))
+
+     # Get per-token bytes
+     token_bytes = [tokenizer.decodeBytes([token]) for token in input_chunk[1:]]
+
+     # Extract top-k predictions
+     sample_topk = extract_topk_predictions(
+         logit[:-1], torch.tensor(input_chunk[1:]).to(device)
+     )
+
+     # Calculate byte-wise losses
+     byte_wise_losses = []
+     for l, byte_values in zip(loss.tolist(), token_bytes):
+         per_byte_loss = l / len(byte_values)
+         for _ in range(len(byte_values)):
+             byte_wise_losses.append(per_byte_loss)
+
+     # Calculate overall metrics
+     total_loss = loss.sum().item()
+     num_bytes = len(text.encode("utf-8"))
+     avg_loss = total_loss / input_length
+     compression_rate = avg_loss * COMPRESSION_RATE_FACTOR
+
+     return {
+         "byte_wise_losses": byte_wise_losses,
+         "top5_predictions": sample_topk,
+         "compression_rate": compression_rate,
+         "total_loss": total_loss,
+         "num_tokens": input_length,
+         "num_bytes": num_bytes,
+         "model_name": "RWKV7-G1C-1.5B",
+         "tokenizer": tokenizer
+     }
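The `COMPRESSION_RATE_FACTOR` used above converts an average per-token loss from nats into a percentage of a byte: dividing by ln 2 yields bits, multiplying by 0.125 yields bytes, and multiplying by 100 yields a percentage. A quick sanity check of that conversion:

```python
import math

# Same constant as in core/evaluator.py
COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0

# An average loss of ln(2) nats per token is exactly 1 bit = 0.125 bytes,
# so the compression rate comes out to (approximately) 12.5%
rate = math.log(2.0) * COMPRESSION_RATE_FACTOR
```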
core/helpers.py ADDED
@@ -0,0 +1,266 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Helper utilities for UncheatableEval visualization.
3
+
4
+ Contains TokenizerBytesConverter for mapping tokens to bytes.
5
+ """
6
+
7
+ import json
8
+ import re
9
+ from typing import Dict, List, Optional
10
+
11
+
12
+ def bytes_to_unicode() -> Dict[int, str]:
13
+ """
14
+ GPT-2 style byte-to-unicode mapping.
15
+ Maps byte values 0-255 to printable Unicode characters.
16
+ """
17
+ bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
18
+ cs = bs[:]
19
+ n = 0
20
+ for b in range(2**8):
21
+ if b not in bs:
22
+ bs.append(b)
23
+ cs.append(2**8 + n)
24
+ n += 1
25
+ cs = [chr(n) for n in cs]
26
+ return dict(zip(bs, cs))
27
+
28
+
29
+ class TokenizerBytesConverter:
30
+ """
31
+ Universal Token-to-Bytes Converter for HuggingFace tokenizers.
32
+
33
+ Supports two encoding schemes:
34
+ 1. ByteLevel BPE (Llama 3.x, Qwen, GPT-2 style)
35
+ 2. SentencePiece with ByteFallback (Mistral, early LLaMA)
36
+
37
+ Usage:
38
+ converter = TokenizerBytesConverter("meta-llama/Llama-3.2-1B")
39
+ nested_bytes = converter.encode_to_bytes("Hello world")
40
+ # Returns: [[72, 101, 108, 108, 111], [32, 119, 111, 114, 108, 100]]
41
+ """
42
+
43
+ # Class-level mapping table cache
44
+ _BYTE_TO_UNICODE = bytes_to_unicode()
45
+ _UNICODE_TO_BYTE = {v: k for k, v in _BYTE_TO_UNICODE.items()}
46
+
47
+ def __init__(
48
+ self,
49
+ model_name_or_path: str = None,
50
+ cache_dir: Optional[str] = None,
51
+ trust_remote_code: bool = True,
52
+ tokenizer=None,
53
+ ):
54
+ """
55
+ Initialize the converter.
56
+
57
+ Args:
58
+ model_name_or_path: HuggingFace model name or local path
59
+ cache_dir: Directory to cache the downloaded tokenizer files
60
+ trust_remote_code: Whether to trust remote code for custom tokenizers
61
+ tokenizer: Optional pre-loaded tokenizer instance for encoding.
62
+ If provided, this tokenizer will be used for encode() calls,
63
+ while AutoTokenizer is still used to extract vocab/decoder config.
64
+ """
65
+ from transformers import AutoTokenizer
66
+
67
+ # Always load AutoTokenizer for vocab extraction
68
+ auto_tokenizer = AutoTokenizer.from_pretrained(
69
+ model_name_or_path,
70
+ cache_dir=cache_dir,
71
+ trust_remote_code=trust_remote_code,
72
+ )
73
+
74
+ # Use provided tokenizer for encoding, or fall back to auto_tokenizer
75
+ self._tokenizer = tokenizer if tokenizer is not None else auto_tokenizer
76
+
77
+ # Extract tokenizer.json from the AutoTokenizer's backend
78
+ if hasattr(auto_tokenizer, "backend_tokenizer") and hasattr(auto_tokenizer.backend_tokenizer, "to_str"):
79
+ tokenizer_json = json.loads(auto_tokenizer.backend_tokenizer.to_str())
80
+ else:
81
+ raise ValueError("Tokenizer object is not supported. " "The tokenizer must have a backend_tokenizer with to_str() method.")
82
+
83
+ self._tokenizer_json = tokenizer_json
84
+ self._vocab = tokenizer_json["model"]["vocab"]
85
+ self._id_to_token: Dict[int, str] = {v: k for k, v in self._vocab.items()}
86
+
87
+ # Detect encoding type
88
+ self._decoder_type = self._detect_decoder_type()
89
+
90
+ # Load added_tokens
91
+ self._load_added_tokens()
92
+
93
+ def _detect_decoder_type(self) -> str:
94
+ """Detect the decoder type from tokenizer.json."""
95
+ decoder = self._tokenizer_json.get("decoder", {})
96
+ decoder_type = decoder.get("type", "")
97
+
98
+ if decoder_type == "ByteLevel":
99
+ return "bytelevel"
100
+ elif decoder_type == "Sequence":
101
+ decoders = decoder.get("decoders", [])
102
+ for d in decoders:
103
+ if d.get("type") == "ByteFallback":
104
+ return "sentencepiece"
105
+ for d in decoders:
106
+ if d.get("type") == "ByteLevel":
107
+ return "bytelevel"
108
+
109
+ # Fallback: check model configuration
110
+ model = self._tokenizer_json.get("model", {})
111
+ if model.get("byte_fallback", False):
112
+ return "sentencepiece"
113
+
114
+ # Default to bytelevel
115
+ return "bytelevel"
116
+
117
+ def _load_added_tokens(self):
118
+ """Load added_tokens into the vocabulary."""
119
+ self._special_token_ids = set()
120
+ added_tokens = self._tokenizer_json.get("added_tokens", [])
121
+ for token_info in added_tokens:
122
+ token_id = token_info["id"]
123
+ content = token_info["content"]
124
+ self._id_to_token[token_id] = content
125
+ if token_info.get("special", False):
126
+ self._special_token_ids.add(token_id)
127
+
128
+ @property
129
+ def decoder_type(self) -> str:
130
+ """Return the detected decoder type."""
131
+ return self._decoder_type
132
+
133
+ @property
134
+ def vocab_size(self) -> int:
135
+ """Return the vocabulary size."""
136
+ return len(self._id_to_token)
137
+
138
+ @property
139
+ def tokenizer(self):
140
+ """Return the underlying HuggingFace tokenizer."""
141
+ return self._tokenizer
142
+
143
+ def get_token_string(self, token_id: int) -> Optional[str]:
144
+ """Get the raw string for a token_id."""
145
+ return self._id_to_token.get(token_id)
146
+
147
+ def token_to_bytes(self, token_id: int) -> Optional[List[int]]:
148
+ """
149
+ Map a single token_id to its byte sequence.
150
+
151
+ Args:
152
+ token_id: The token ID
153
+
154
+ Returns:
155
+ List of byte values (0-255) as integers, or None if token_id doesn't exist
156
+ """
157
+ token_str = self._id_to_token.get(token_id)
158
+ if token_str is None:
159
+ return None
160
+
161
+ if self._decoder_type == "bytelevel":
162
+ return self._decode_bytelevel(token_str)
163
+ else:
164
+ return self._decode_sentencepiece(token_str)
165
+
166
+ def _decode_bytelevel(self, token_str: str) -> List[int]:
167
+ """
168
+ ByteLevel decoding: map each Unicode character back to a byte.
169
+ """
170
+ result = []
171
+ for char in token_str:
172
+ if char in self._UNICODE_TO_BYTE:
173
+ result.append(self._UNICODE_TO_BYTE[char])
174
+ else:
175
+ # Characters not in the mapping table are encoded as UTF-8
176
+ result.extend(char.encode("utf-8"))
177
+ return result
178
+
179
+ def _decode_sentencepiece(self, token_str: str) -> List[int]:
180
+ """
181
+ SentencePiece decoding: handle ▁ and <0xXX> format.
182
+ """
183
+ result = []
184
+ i = 0
185
+ while i < len(token_str):
186
+ # Check for <0xXX> format
187
+ match = re.match(r"<0x([0-9A-Fa-f]{2})>", token_str[i:])
188
+ if match:
189
+ byte_val = int(match.group(1), 16)
190
+ result.append(byte_val)
191
+ i += 6
192
+ elif token_str[i] == "▁":
193
+ # Replace ▁ with space
194
+ result.append(0x20)
195
+ i += 1
196
+ else:
197
+ result.extend(token_str[i].encode("utf-8"))
198
+ i += 1
199
+ return result
200
+
201
+ def encode_to_bytes(
202
+ self,
203
+ text: str,
204
+ add_special_tokens: bool = False,
205
+ strip_leading_space: bool = True,
206
+ ) -> List[List[int]]:
207
+ """
208
+ Encode text to a nested list of bytes.
209
+
210
+ Each sub-list contains the byte values (as integers) for one token.
211
+
212
+ Args:
213
+ text: Input text to encode
214
+ add_special_tokens: Whether to add special tokens (BOS, EOS, etc.)
215
+ strip_leading_space: For SentencePiece, whether to strip the leading space
216
+ from the first token
217
+
218
+ Returns:
219
+ Nested list where each inner list contains byte values for one token.
220
+ Example: [[72, 101, 108, 108, 111], [32, 119, 111, 114, 108, 100]]
221
+ """
222
+ token_ids = self._tokenizer.encode(text, add_special_tokens=add_special_tokens)
223
+
224
+ result = []
225
+ for idx, token_id in enumerate(token_ids):
226
+ token_bytes = self.token_to_bytes(token_id)
227
+ if token_bytes is not None:
228
+ # Handle SentencePiece leading space
229
+ if idx == 0 and self._decoder_type == "sentencepiece" and strip_leading_space and token_bytes and token_bytes[0] == 0x20:
230
+ token_bytes = token_bytes[1:]
231
+
232
+ result.append(token_bytes)
233
+
234
+ return result
235
+
236
+ def encode_to_flat_bytes(
237
+ self,
238
+ text: str,
239
+ add_special_tokens: bool = False,
240
+ strip_leading_space: bool = True,
241
+ ) -> bytes:
242
+ """
243
+ Encode text to a flat byte sequence.
244
+
245
+ Args:
246
+ text: Input text to encode
247
+ add_special_tokens: Whether to add special tokens
248
+ strip_leading_space: For SentencePiece, whether to strip the leading space
249
+
250
+ Returns:
251
+ Concatenated bytes from all tokens
252
+ """
253
+ nested = self.encode_to_bytes(text, add_special_tokens, strip_leading_space)
254
+ result = []
255
+ for token_bytes in nested:
256
+ result.extend(token_bytes)
257
+ return bytes(result)
258
+
259
+ def get_all_token_bytes(self) -> Dict[int, List[int]]:
260
+ """
261
+ Get byte mapping for all tokens in the vocabulary.
262
+
263
+ Returns:
264
+ Dictionary mapping token_id to list of byte values
265
+ """
266
+ return {token_id: self.token_to_bytes(token_id) for token_id in self._id_to_token}
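
The `<0xXX>` / `▁` decoding rule implemented by `_decode_sentencepiece` above can be sanity-checked in isolation. The sketch below is a standalone re-statement of that rule, not code from the repo; the function name `decode_sentencepiece_token` is illustrative:

```python
import re
from typing import List

def decode_sentencepiece_token(token_str: str) -> List[int]:
    """Mirror of the SentencePiece rule above: '<0xXX>' hex tokens become one
    byte, '▁' becomes a space (0x20), everything else is UTF-8 encoded."""
    result: List[int] = []
    i = 0
    while i < len(token_str):
        match = re.match(r"<0x([0-9A-Fa-f]{2})>", token_str[i:])
        if match:
            result.append(int(match.group(1), 16))
            i += 6  # len("<0xXX>")
        elif token_str[i] == "▁":
            result.append(0x20)
            i += 1
        else:
            result.extend(token_str[i].encode("utf-8"))
            i += 1
    return result

# "▁Hello" → [0x20, 72, 101, 108, 108, 111]; "<0xE4>" → [0xE4]
```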
examples/sample_texts.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "examples": [
+     {
+       "name": "News",
+       "text": "The rapid advancement of artificial intelligence has sparked both excitement and concern among researchers worldwide. While AI systems demonstrate remarkable capabilities in language understanding and generation, questions remain about their potential impact on employment and society."
+     },
+     {
+       "name": "Code",
+       "text": "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n\n# Calculate first 10 Fibonacci numbers\nfor i in range(10):\n    print(f\"F({i}) = {fibonacci(i)}\")"
+     },
+     {
+       "name": "Literature",
+       "text": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness."
+     },
+     {
+       "name": "Chinese",
+       "text": "人工智能的快速发展在全球研究人员中引发了兴奋和担忧。虽然人工智能系统在语言理解和生成方面展现了非凡的能力,但关于其对就业和社会的潜在影响的问题仍然存在。"
+     },
+     {
+       "name": "Mixed",
+       "text": "The transformer architecture, introduced in the paper \"Attention Is All You Need\" (2017), revolutionized NLP. 这种架构使用自注意力机制来处理序列数据,比传统的RNN和LSTM更加高效。"
+     }
+   ]
+ }
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ torch>=2.0.0
+ transformers>=4.35.0
+ tokenizers>=0.15.0
+ gradio>=4.0.0
+ numpy>=1.24.0
+ tqdm>=4.65.0
+ packaging
+ rwkv>=0.8.0
+ requests
+ huggingface_hub
+ accelerate
support/rwkv_vocab_v20230424.txt ADDED
The diff for this file is too large to render. See raw diff
 
visualization/__init__.py ADDED
@@ -0,0 +1 @@
+ # Visualization modules
visualization/html_generator.py ADDED
@@ -0,0 +1,865 @@
+ """
+ HTML visualization generator for UncheatableEval.
+
+ Generates interactive HTML visualizations comparing byte-level losses between two models.
+ """
+
+ import json
+ import math
+ import re
+ from typing import List, Tuple, Optional, Set
+
+ import numpy as np
+
+ from core.helpers import TokenizerBytesConverter
+
+
+ # Compression rate conversion factor
+ COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0
+
+ # Global tokenizers (lazy loaded)
+ _qwen_tokenizer = None
+ _rwkv_tokenizer = None
+
+
+ def get_qwen_tokenizer():
+     """Lazy load Qwen tokenizer."""
+     global _qwen_tokenizer
+     if _qwen_tokenizer is None:
+         _qwen_tokenizer = TokenizerBytesConverter("Qwen/Qwen3-0.6B-Base")
+     return _qwen_tokenizer
+
+
+ def get_rwkv_tokenizer():
+     """Lazy load RWKV tokenizer."""
+     global _rwkv_tokenizer
+     if _rwkv_tokenizer is None:
+         from rwkv.rwkv_tokenizer import TRIE_TOKENIZER
+         import os
+         script_dir = os.path.dirname(os.path.abspath(__file__))
+         vocab_path = os.path.join(os.path.dirname(script_dir), "support", "rwkv_vocab_v20230424.txt")
+         _rwkv_tokenizer = TRIE_TOKENIZER(vocab_path)
+     return _rwkv_tokenizer
+
+
+ def get_tokenizer_boundaries(text: str, tokenizer, is_rwkv: bool = False) -> Set[int]:
+     """Get token boundaries (byte positions) for a given text."""
+     boundaries = set()
+     boundaries.add(0)
+
+     if is_rwkv:
+         tokenized = tokenizer.encode(text)
+         if hasattr(tokenized, "ids"):
+             token_ids = tokenized.ids
+         else:
+             token_ids = tokenized
+
+         byte_pos = 0
+         for token_id in token_ids:
+             token_bytes = tokenizer.decodeBytes([token_id])
+             byte_pos += len(token_bytes)
+             boundaries.add(byte_pos)
+     else:
+         token_bytes_list = tokenizer.encode_to_bytes(text)
+         byte_pos = 0
+         for token_bytes in token_bytes_list:
+             byte_pos += len(token_bytes)
+             boundaries.add(byte_pos)
+
+     return boundaries
+
+
+ def get_token_info_for_text(text: str) -> dict:
+     """Get detailed token information for each byte position."""
+     qwen_tokenizer = get_qwen_tokenizer()
+     rwkv_tokenizer = get_rwkv_tokenizer()
+
+     # Get Qwen tokens with positions
+     qwen_tokens = []
+     byte_to_qwen = {}
+     qwen_bytes_list = qwen_tokenizer.encode_to_bytes(text)
+     byte_pos = 0
+     for idx, token_bytes in enumerate(qwen_bytes_list):
+         start = byte_pos
+         end = byte_pos + len(token_bytes)
+         try:
+             token_str = bytes(token_bytes).decode("utf-8")
+         except UnicodeDecodeError:
+             token_str = repr(bytes(token_bytes))
+         qwen_tokens.append((start, end, token_str))
+         byte_to_qwen[start] = idx
+         byte_pos = end
+
+     # Get RWKV tokens with positions
+     rwkv_tokens = []
+     byte_to_rwkv = {}
+     tokenized = rwkv_tokenizer.encode(text)
+     if hasattr(tokenized, "ids"):
+         token_ids = tokenized.ids
+     else:
+         token_ids = tokenized
+
+     byte_pos = 0
+     for idx, token_id in enumerate(token_ids):
+         token_bytes = rwkv_tokenizer.decodeBytes([token_id])
+         start = byte_pos
+         end = byte_pos + len(token_bytes)
+         try:
+             token_str = token_bytes.decode("utf-8")
+         except UnicodeDecodeError:
+             token_str = repr(token_bytes)
+         rwkv_tokens.append((start, end, token_str))
+         byte_to_rwkv[start] = idx
+         byte_pos = end
+
+     # Get common boundaries
+     qwen_boundaries = set([0] + [t[1] for t in qwen_tokens])
+     rwkv_boundaries = set([0] + [t[1] for t in rwkv_tokens])
+     common_boundaries = sorted(qwen_boundaries & rwkv_boundaries)
+
+     return {
+         "common_boundaries": common_boundaries,
+         "qwen_tokens": qwen_tokens,
+         "rwkv_tokens": rwkv_tokens,
+         "byte_to_qwen": byte_to_qwen,
+         "byte_to_rwkv": byte_to_rwkv,
+     }
+
+
+ def delta_to_color(delta: float, avg_delta: float, max_deviation: float) -> Tuple[int, int, int]:
+     """Map a delta value to an RGB color based on deviation from average."""
+     if max_deviation == 0:
+         return (255, 255, 255)
+
+     deviation = delta - avg_delta
+     normalized = max(-1, min(1, deviation / max_deviation))
+
+     if normalized < 0:
+         intensity = -normalized
+         r = int(255 * (1 - intensity * 0.7))
+         g = 255
+         b = int(255 * (1 - intensity * 0.7))
+     else:
+         intensity = normalized
+         r = 255
+         g = int(255 * (1 - intensity * 0.7))
+         b = int(255 * (1 - intensity * 0.7))
+
+     return (r, g, b)
+
+
+ def generate_comparison_html(
+     text: str,
+     byte_losses_a: List[float],
+     byte_losses_b: List[float],
+     model_a_name: str,
+     model_b_name: str,
+     topk_predictions_a: Optional[List] = None,
+     topk_predictions_b: Optional[List] = None,
+     tokenizer_a=None,
+     tokenizer_b=None,
+     model_type_a: str = "hf",
+     model_type_b: str = "rwkv7",
+ ) -> str:
+     """
+     Generate an interactive HTML visualization comparing two models.
+
+     Args:
+         text: The input text that was evaluated
+         byte_losses_a: Per-byte losses from model A
+         byte_losses_b: Per-byte losses from model B
+         model_a_name: Display name for model A
+         model_b_name: Display name for model B
+         topk_predictions_a: Top-k predictions from model A
+         topk_predictions_b: Top-k predictions from model B
+         tokenizer_a: Tokenizer for model A
+         tokenizer_b: Tokenizer for model B
+         model_type_a: Type of model A ("hf" or "rwkv7")
+         model_type_b: Type of model B ("hf" or "rwkv7")
+
+     Returns:
+         HTML string with interactive visualization
+     """
+
+     def decode_token(token_id: int, tokenizer, model_type: str) -> str:
+         # HF and RWKV tokenizers both expose decode([id]) here
+         if tokenizer is None:
+             return f"[{token_id}]"
+         try:
+             return tokenizer.decode([token_id])
+         except Exception:
+             return f"[{token_id}]"
+
+     def build_byte_to_token_map(text: str, tokenizer, model_type: str):
+         if tokenizer is None:
+             return []
+
+         token_ranges = []
+
+         try:
+             if model_type in ["rwkv", "rwkv7"]:
+                 tokenized = tokenizer.encode(text)
+                 if hasattr(tokenized, "ids"):
+                     token_ids = tokenized.ids
+                 else:
+                     token_ids = tokenized
+
+                 byte_pos = 0
+                 for idx, token_id in enumerate(token_ids):
+                     try:
+                         token_bytes = tokenizer.decodeBytes([token_id])
+                         token_ranges.append((byte_pos, byte_pos + len(token_bytes), idx))
+                         byte_pos += len(token_bytes)
+                     except Exception:
+                         # Skip tokens that cannot be decoded to bytes
+                         pass
+             else:
+                 tokenizer_name = getattr(tokenizer, "name_or_path", None)
+                 if tokenizer_name:
+                     converter = TokenizerBytesConverter(tokenizer_name, trust_remote_code=True)
+                     token_bytes_list = converter.encode_to_bytes(text)
+                     byte_pos = 0
+                     for idx, token_bytes in enumerate(token_bytes_list):
+                         token_ranges.append((byte_pos, byte_pos + len(token_bytes), idx))
+                         byte_pos += len(token_bytes)
+         except Exception as e:
+             print(f"Warning: Could not build byte-to-token map ({model_type}): {e}")
+             return []
+
+         return token_ranges
+
+     def find_token_for_byte(byte_pos: int, token_ranges):
+         for start, end, idx in token_ranges:
+             if start <= byte_pos < end:
+                 return idx
+         return None
+
+     # Calculate deltas
+     deltas = [a - b for a, b in zip(byte_losses_a, byte_losses_b)]
+     avg_delta = sum(deltas) / len(deltas) if deltas else 0
+
+     # Calculate max deviation
+     deviations = [d - avg_delta for d in deltas]
+     abs_deviations = [abs(dev) for dev in deviations]
+     max_deviation = float(np.percentile(abs_deviations, 100)) if abs_deviations else 0
+     max_deviation = max(max_deviation, 1e-6)
+
+     # Calculate average compression rates
+     avg_compression_a = sum(byte_losses_a) / len(byte_losses_a) * COMPRESSION_RATE_FACTOR if byte_losses_a else 0
+     avg_compression_b = sum(byte_losses_b) / len(byte_losses_b) * COMPRESSION_RATE_FACTOR if byte_losses_b else 0
+     avg_delta_compression = avg_delta * COMPRESSION_RATE_FACTOR
+
+     # Get token info
+     text_bytes = text.encode("utf-8")
+     token_info = get_token_info_for_text(text)
+     common_boundaries = token_info["common_boundaries"]
+     qwen_tokens = token_info["qwen_tokens"]
+     rwkv_tokens = token_info["rwkv_tokens"]
+
+     # Build byte position to token index mapping
+     model_a_token_ranges = build_byte_to_token_map(text, tokenizer_a, model_type_a)
+     model_b_token_ranges = build_byte_to_token_map(text, tokenizer_b, model_type_b)
+
+     def get_tokens_for_range(byte_start, byte_end, token_list):
+         result = []
+         for idx, (t_start, t_end, t_str) in enumerate(token_list):
+             if t_start < byte_end and t_end > byte_start:
+                 result.append((idx, t_str))
+         return result
+
+     # Build tokens based on common boundaries
+     tokens = []
+     for i in range(len(common_boundaries) - 1):
+         start_byte = common_boundaries[i]
+         end_byte = common_boundaries[i + 1]
+         token_bytes = text_bytes[start_byte:end_byte]
+         try:
+             token_text = token_bytes.decode("utf-8")
+         except UnicodeDecodeError:
+             continue
+
+         qwen_toks = get_tokens_for_range(start_byte, end_byte, qwen_tokens)
+         rwkv_toks = get_tokens_for_range(start_byte, end_byte, rwkv_tokens)
+
+         if re.search(r"\w", token_text, re.UNICODE):
+             tokens.append({
+                 "type": "word",
+                 "text": token_text,
+                 "byte_start": start_byte,
+                 "byte_end": end_byte,
+                 "word_lower": token_text.lower(),
+                 "qwen_tokens": qwen_toks,
+                 "rwkv_tokens": rwkv_toks,
+             })
+         else:
+             tokens.append({
+                 "type": "non-word",
+                 "text": token_text,
+                 "byte_start": start_byte,
+                 "byte_end": end_byte,
+                 "qwen_tokens": qwen_toks,
+                 "rwkv_tokens": rwkv_toks,
+             })
+
+     # Track word occurrences
+     word_occurrences = {}
+     word_id_counter = 0
+
+     for i, token in enumerate(tokens):
+         if token["type"] == "word":
+             word_lower = token["word_lower"]
+             if word_lower not in word_occurrences:
+                 word_occurrences[word_lower] = []
+             word_occurrences[word_lower].append(i)
+             token["word_id"] = word_id_counter
+             word_id_counter += 1
+
+     # Build HTML content
+     html_content = []
+
+     def escape_for_attr(s):
+         return s.replace("&", "&amp;").replace('"', "&quot;").replace("<", "&lt;").replace(">", "&gt;")
+
+     for token in tokens:
+         token_text = token["text"]
+         byte_start = token["byte_start"]
+         byte_end = token["byte_end"]
+
+         qwen_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["qwen_tokens"]])
+         rwkv_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["rwkv_tokens"]])
+
+         raw_bytes = list(text_bytes[byte_start:byte_end])
+         losses_a = byte_losses_a[byte_start:byte_end]
+         losses_b = byte_losses_b[byte_start:byte_end]
+
+         bytes_str = " ".join([f"{b:02x}" for b in raw_bytes])
+         compression_a_str = " ".join([f"{l * COMPRESSION_RATE_FACTOR:.2f}%" for l in losses_a])
+         compression_b_str = " ".join([f"{l * COMPRESSION_RATE_FACTOR:.2f}%" for l in losses_b])
+
+         topk_a_json = ""
+         topk_b_json = ""
+         if topk_predictions_a is not None and model_a_token_ranges:
+             model_a_token_idx = find_token_for_byte(byte_start, model_a_token_ranges)
+             if model_a_token_idx is not None and model_a_token_idx < len(topk_predictions_a):
+                 pred = topk_predictions_a[model_a_token_idx]
+                 decoded_pred = [
+                     pred[0],
+                     pred[1],
+                     [[tid, prob, decode_token(tid, tokenizer_a, model_type_a)] for tid, prob in pred[2]],
+                 ]
+                 topk_a_json = json.dumps(decoded_pred, ensure_ascii=False)
+         if topk_predictions_b is not None and model_b_token_ranges:
+             model_b_token_idx = find_token_for_byte(byte_start, model_b_token_ranges)
+             if model_b_token_idx is not None and model_b_token_idx < len(topk_predictions_b):
+                 pred = topk_predictions_b[model_b_token_idx]
+                 decoded_pred = [
+                     pred[0],
+                     pred[1],
+                     [[tid, prob, decode_token(tid, tokenizer_b, model_type_b)] for tid, prob in pred[2]],
+                 ]
+                 topk_b_json = json.dumps(decoded_pred, ensure_ascii=False)
+
+         token_deltas = deltas[byte_start:byte_end]
+         avg_token_delta = sum(token_deltas) / len(token_deltas) if token_deltas else 0
+
+         color = delta_to_color(avg_token_delta, avg_delta, max_deviation)
+         r, g, b = color
+
+         token_html_parts = []
+         for char in token_text:
+             if char == "<":
+                 escaped_char = "&lt;"
+             elif char == ">":
+                 escaped_char = "&gt;"
+             elif char == "&":
+                 escaped_char = "&amp;"
+             elif char == "\n":
+                 escaped_char = "<br>"
+             elif char == " ":
+                 escaped_char = "&nbsp;"
+             elif char == "\t":
+                 escaped_char = "&nbsp;&nbsp;&nbsp;&nbsp;"
+             else:
+                 escaped_char = char
+             token_html_parts.append(escaped_char)
+
+         token_span_content = "".join(token_html_parts)
+         data_attrs = (
+             f'data-qwen="{escape_for_attr(qwen_info)}" '
+             f'data-rwkv="{escape_for_attr(rwkv_info)}" '
+             f'data-bytes="{escape_for_attr(bytes_str)}" '
+             f'data-compression-a="{escape_for_attr(compression_a_str)}" '
+             f'data-compression-b="{escape_for_attr(compression_b_str)}" '
+             f'data-delta="{avg_token_delta * COMPRESSION_RATE_FACTOR:.4f}" '
+             f'data-topk-a="{escape_for_attr(topk_a_json)}" '
+             f'data-topk-b="{escape_for_attr(topk_b_json)}"'
+         )
+         style_attr = f'style="background-color: rgb({r},{g},{b})"'
+
+         if token["type"] == "word":
+             word_lower = token["word_lower"]
+             occurrences = word_occurrences[word_lower]
+             if len(occurrences) > 1:
+                 word_id = token["word_id"]
+                 html_content.append(
+                     f'<span class="token word" {data_attrs} {style_attr} data-word="{word_lower}" data-word-id="{word_id}">'
+                     + token_span_content
+                     + "</span>"
+                 )
+             else:
+                 html_content.append(f'<span class="token" {data_attrs} {style_attr}>{token_span_content}</span>')
+         else:
+             html_content.append(f'<span class="token" {data_attrs} {style_attr}>{token_span_content}</span>')
+
+     delta_color = "#64ff64" if avg_delta < 0 else "#ff6464"
+
+ html = f"""<!DOCTYPE html>
414
+ <html>
415
+ <head>
416
+ <meta charset="UTF-8">
417
+ <title>UncheatableEval - Byte-wise Loss Comparison</title>
418
+ <style>
419
+ body {{
420
+ font-family: Consolas, 'Courier New', monospace;
421
+ margin: 0;
422
+ padding: 0;
423
+ background-color: #f5f5f5;
424
+ }}
425
+ .header {{
426
+ background-color: #333;
427
+ color: white;
428
+ padding: 20px;
429
+ position: sticky;
430
+ top: 0;
431
+ z-index: 100;
432
+ }}
433
+ .header h1 {{
434
+ margin: 0 0 15px 0;
435
+ font-size: 18px;
436
+ }}
437
+ .meta {{
438
+ display: flex;
439
+ flex-wrap: wrap;
440
+ gap: 20px;
441
+ font-size: 12px;
442
+ color: #c8c8c8;
443
+ }}
444
+ .legend {{
445
+ display: flex;
446
+ gap: 15px;
447
+ margin-top: 10px;
448
+ }}
449
+ .legend-item {{
450
+ display: flex;
451
+ align-items: center;
452
+ gap: 5px;
453
+ }}
454
+ .legend-box {{
455
+ width: 20px;
456
+ height: 12px;
457
+ border: 1px solid #666;
458
+ }}
459
+ .content {{
460
+ background-color: white;
461
+ margin: 10px;
462
+ padding: 15px;
463
+ border: 1px solid #ccc;
464
+ font-size: 14px;
465
+ line-height: 1.8;
466
+ word-wrap: break-word;
467
+ position: relative;
468
+ }}
469
+ .content span {{
470
+ padding: 1px 0;
471
+ }}
472
+ .word {{
473
+ cursor: pointer;
474
+ position: relative;
475
+ }}
476
+ .word:hover {{
477
+ outline: 2px solid #007bff;
478
+ outline-offset: 1px;
479
+ }}
480
+ .word.highlighted {{
481
+ outline: 2px solid #ff6b6b;
482
+ outline-offset: 1px;
483
+ }}
484
+ #svg-overlay {{
485
+ position: fixed;
486
+ top: 0;
487
+ left: 0;
488
+ width: 100%;
489
+ height: 100%;
490
+ pointer-events: none;
491
+ z-index: 1000;
492
+ }}
493
+ .link-line {{
494
+ stroke: #007bff;
495
+ stroke-width: 2;
496
+ fill: none;
497
+ opacity: 0.7;
498
+ }}
499
+ .link-dot {{
500
+ fill: #007bff;
501
+ opacity: 0.8;
502
+ }}
503
+ .token {{
504
+ position: relative;
505
+ cursor: help;
506
+ }}
507
+ .token:hover {{
508
+ outline: 1px dashed #666;
509
+ }}
510
+ #tooltip {{
511
+ position: fixed;
512
+ background-color: rgba(0, 0, 0, 0.9);
513
+ color: white;
514
+ padding: 10px 14px;
515
+ border-radius: 6px;
516
+ font-size: 12px;
517
+ max-width: 500px;
518
+ z-index: 2000;
519
+ pointer-events: none;
520
+ display: none;
521
+ line-height: 1.6;
522
+ box-shadow: 0 2px 10px rgba(0,0,0,0.3);
523
+ }}
524
+ #tooltip .label {{
525
+ color: #aaa;
526
+ font-weight: bold;
527
+ }}
528
+ #tooltip .bytes {{
529
+ color: #a5f3fc;
530
+ font-family: monospace;
531
+ }}
532
+ #tooltip .loss-a {{
533
+ color: #86efac;
534
+ font-family: monospace;
535
+ }}
536
+ #tooltip .loss-b {{
537
+ color: #fca5a5;
538
+ font-family: monospace;
539
+ }}
540
+ #tooltip .qwen {{
541
+ color: #7dd3fc;
542
+ }}
543
+ #tooltip .rwkv {{
544
+ color: #fcd34d;
545
+ }}
546
+ #tooltip .topk-section {{
547
+ margin-top: 8px;
548
+ padding-top: 8px;
549
+ border-top: 1px solid #555;
550
+ }}
551
+ #tooltip .topk-container {{
552
+ display: flex;
553
+ gap: 16px;
554
+ }}
555
+ #tooltip .topk-column {{
556
+ flex: 1;
557
+ min-width: 180px;
558
+ }}
559
+ #tooltip .topk-title {{
560
+ color: #aaa;
561
+ font-weight: bold;
562
+ margin-bottom: 4px;
563
+ font-size: 11px;
564
+ }}
565
+ #tooltip .topk-title.model-a {{
566
+ color: #86efac;
567
+ }}
568
+ #tooltip .topk-title.model-b {{
569
+ color: #fca5a5;
570
+ }}
571
+ #tooltip .topk-list {{
572
+ font-size: 11px;
573
+ }}
574
+ #tooltip .topk-item {{
575
+ display: flex;
576
+ gap: 4px;
577
+ padding: 1px 0;
578
+ align-items: center;
579
+ }}
580
+ #tooltip .topk-rank {{
581
+ color: #888;
582
+ min-width: 18px;
583
+ }}
584
+ #tooltip .topk-rank.hit {{
585
+ color: #ffd700;
586
+ }}
587
+ #tooltip .topk-token {{
588
+ color: #a5f3fc;
589
+ max-width: 100px;
590
+ overflow: hidden;
591
+ text-overflow: ellipsis;
592
+ white-space: nowrap;
593
+ font-family: monospace;
594
+ }}
595
+ #tooltip .topk-prob {{
596
+ color: #86efac;
597
+ min-width: 45px;
598
+ text-align: right;
599
+ }}
600
+ #tooltip .topk-hit {{
601
+ color: #22c55e;
602
+ }}
603
+ #tooltip .topk-miss {{
604
+ color: #ef4444;
605
+ font-style: italic;
606
+ }}
607
+ </style>
608
+ </head>
609
+ <body>
+ <svg id="svg-overlay"></svg>
+ <div id="tooltip"></div>
+ <div class="header">
+     <h1>UncheatableEval - Byte-wise Loss Comparison</h1>
+     <div class="meta">
+         <div>Model A: {model_a_name}</div>
+         <div>Model B: {model_b_name}</div>
+         <div>Compression A: {avg_compression_a:.2f}%</div>
+         <div>Compression B: {avg_compression_b:.2f}%</div>
+         <div style="color: {delta_color}">Avg Delta: {avg_delta_compression:+.2f}%</div>
+     </div>
+     <div class="legend">
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(77, 255, 77)"></div>
+             <span>Model A better</span>
+         </div>
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(255, 255, 255)"></div>
+             <span>= Avg delta</span>
+         </div>
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(255, 77, 77)"></div>
+             <span>Model B better</span>
+         </div>
+         <div class="legend-item" style="margin-left: 20px;">
+             <span style="color: #aaa;">Saturation:</span>
+             <input type="range" id="saturation-slider" min="500" max="1000" value="1000" step="1" style="width: 200px; vertical-align: middle;">
+             <span id="saturation-value" style="color: #fff; min-width: 45px; display: inline-block;">100.0%</span>
+         </div>
+     </div>
+ </div>
+ <div class="content">
+ {''.join(html_content)}
+ </div>
+ <script>
+ const svgOverlay = document.getElementById('svg-overlay');
+ const words = document.querySelectorAll('.word');
+
+ const wordGroups = {{}};
+ words.forEach(word => {{
+     const wordText = word.getAttribute('data-word');
+     if (!wordGroups[wordText]) {{
+         wordGroups[wordText] = [];
+     }}
+     wordGroups[wordText].push(word);
+ }});
+
+ function clearLines() {{
+     svgOverlay.innerHTML = '';
+     words.forEach(w => w.classList.remove('highlighted'));
+ }}
+
+ function drawLines(hoveredWord) {{
+     clearLines();
+
+     const wordText = hoveredWord.getAttribute('data-word');
+     const wordId = parseInt(hoveredWord.getAttribute('data-word-id'));
+     const sameWords = wordGroups[wordText] || [];
+
+     const previousWords = sameWords.filter(w => {{
+         const id = parseInt(w.getAttribute('data-word-id'));
+         return id < wordId;
+     }});
+
+     if (previousWords.length === 0) return;
+
+     sameWords.forEach(w => w.classList.add('highlighted'));
+
+     const hoveredRect = hoveredWord.getBoundingClientRect();
+     const hoveredX = hoveredRect.left + hoveredRect.width / 2;
+     const hoveredY = hoveredRect.top + hoveredRect.height / 2;
+
+     previousWords.forEach(prevWord => {{
+         const prevRect = prevWord.getBoundingClientRect();
+         const prevX = prevRect.left + prevRect.width / 2;
+         const prevY = prevRect.top + prevRect.height / 2;
+
+         const midX = (hoveredX + prevX) / 2;
+         const midY = Math.min(hoveredY, prevY) - 30;
+
+         const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+         path.setAttribute('class', 'link-line');
+         path.setAttribute('d', `M ${{prevX}} ${{prevY}} Q ${{midX}} ${{midY}} ${{hoveredX}} ${{hoveredY}}`);
+         svgOverlay.appendChild(path);
+
+         const dot1 = document.createElementNS('http://www.w3.org/2000/svg', 'circle');
+         dot1.setAttribute('class', 'link-dot');
+         dot1.setAttribute('cx', prevX);
+         dot1.setAttribute('cy', prevY);
+         dot1.setAttribute('r', 4);
+         svgOverlay.appendChild(dot1);
+
+         const dot2 = document.createElementNS('http://www.w3.org/2000/svg', 'circle');
+         dot2.setAttribute('class', 'link-dot');
+         dot2.setAttribute('cx', hoveredX);
+         dot2.setAttribute('cy', hoveredY);
+         dot2.setAttribute('r', 4);
+         svgOverlay.appendChild(dot2);
+     }});
+ }}
+
+ words.forEach(word => {{
+     word.addEventListener('mouseenter', () => drawLines(word));
+     word.addEventListener('mouseleave', clearLines);
+ }});
+
+ window.addEventListener('scroll', clearLines);
+
+ const tooltip = document.getElementById('tooltip');
+ const tokenSpans = document.querySelectorAll('.token');
+
+ tokenSpans.forEach(token => {{
+     token.addEventListener('mouseenter', (e) => {{
+         const qwen = token.getAttribute('data-qwen') || 'N/A';
+         const rwkv = token.getAttribute('data-rwkv') || 'N/A';
+         const bytes = token.getAttribute('data-bytes') || '';
+         const compressionA = token.getAttribute('data-compression-a') || '';
+         const compressionB = token.getAttribute('data-compression-b') || '';
+         const top5A = token.getAttribute('data-topk-a') || '';
+         const top5B = token.getAttribute('data-topk-b') || '';
+
+         function formatTopkColumn(topkJson, modelName, titleClass) {{
+             if (!topkJson) return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">N/A</div></div>';
+             try {{
+                 const data = JSON.parse(topkJson);
+                 const [actualId, rank, topkList] = data;
+                 let html = '<div class="topk-column">';
+                 html += '<div class="topk-title ' + titleClass + '">' + modelName + '</div>';
+                 html += '<div class="topk-list">';
+                 topkList.forEach((item, idx) => {{
+                     const [tokenId, prob, tokenText] = item;
+                     const isHit = tokenId === actualId;
+                     const rankClass = isHit ? 'topk-rank hit' : 'topk-rank';
+                     const displayText = tokenText || '[' + tokenId + ']';
+                     const escapedText = displayText.replace(/</g, '&lt;').replace(/>/g, '&gt;');
+                     html += '<div class="topk-item">';
+                     html += '<span class="' + rankClass + '">' + (idx + 1) + '.</span>';
+                     html += '<span class="topk-token" title="ID: ' + tokenId + '">' + escapedText + '</span>';
+                     html += '<span class="topk-prob">' + (prob * 100).toFixed(1) + '%</span>';
+                     if (isHit) html += '<span class="topk-hit">✓</span>';
+                     html += '</div>';
+                 }});
+                 if (rank > 10) {{
+                     html += '<div class="topk-item topk-miss">Actual rank: ' + rank + '</div>';
+                 }}
+                 html += '</div></div>';
+                 return html;
+             }} catch (e) {{
+                 return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">Error</div></div>';
+             }}
+         }}
+
+         let tooltipHtml = `
+             <div><span class="label">Bytes:</span> <span class="bytes">${{bytes || '(empty)'}}</span></div>
+             <div><span class="label">Compression A:</span> <span class="loss-a">${{compressionA || '(empty)'}}</span></div>
+             <div><span class="label">Compression B:</span> <span class="loss-b">${{compressionB || '(empty)'}}</span></div>
+             <hr style="border-color: #555; margin: 6px 0;">
+             <div><span class="label">Qwen:</span> <span class="qwen">${{qwen || '(empty)'}}</span></div>
+             <div><span class="label">RWKV:</span> <span class="rwkv">${{rwkv || '(empty)'}}</span></div>
+         `;
+         if (top5A || top5B) {{
+             tooltipHtml += '<div class="topk-section"><div class="topk-container">';
+             tooltipHtml += formatTopkColumn(top5A, 'Model A Top10', 'model-a');
+             tooltipHtml += formatTopkColumn(top5B, 'Model B Top10', 'model-b');
+             tooltipHtml += '</div></div>';
+         }}
+         tooltip.innerHTML = tooltipHtml;
+         tooltip.style.display = 'block';
+     }});
+
+     token.addEventListener('mousemove', (e) => {{
+         const tooltipRect = tooltip.getBoundingClientRect();
+         const viewportWidth = window.innerWidth;
+         const viewportHeight = window.innerHeight;
+
+         let x = e.clientX + 15;
+         let y = e.clientY + 15;
+
+         if (x + tooltipRect.width > viewportWidth - 10) {{
+             x = e.clientX - tooltipRect.width - 15;
+         }}
+         if (y + tooltipRect.height > viewportHeight - 10) {{
+             y = e.clientY - tooltipRect.height - 15;
+         }}
+         if (x < 10) x = 10;
+         if (y < 10) y = 10;
+
+         tooltip.style.left = x + 'px';
+         tooltip.style.top = y + 'px';
+     }});
+
+     token.addEventListener('mouseleave', () => {{
+         tooltip.style.display = 'none';
+     }});
+ }});
+
+ const avgDelta = {avg_delta_compression};
+ const slider = document.getElementById('saturation-slider');
+ const saturationValue = document.getElementById('saturation-value');
+
+ const allDeltas = [];
+ tokenSpans.forEach(token => {{
+     const delta = parseFloat(token.getAttribute('data-delta'));
+     if (!isNaN(delta)) allDeltas.push(delta);
+ }});
+
+ function percentile(arr, p) {{
+     const sorted = [...arr].sort((a, b) => a - b);
+     const idx = (p / 100) * (sorted.length - 1);
+     const lower = Math.floor(idx);
+     const upper = Math.ceil(idx);
+     if (lower === upper) return sorted[lower];
+     return sorted[lower] + (sorted[upper] - sorted[lower]) * (idx - lower);
+ }}
+
+ function deltaToColor(delta, avgDelta, maxDeviation) {{
+     if (maxDeviation === 0) return 'rgb(255, 255, 255)';
+     const deviation = delta - avgDelta;
+     let normalized = Math.max(-1, Math.min(1, deviation / maxDeviation));
+     let r, g, b;
+     if (normalized < 0) {{
+         const intensity = -normalized;
+         r = Math.round(255 * (1 - intensity * 0.7));
+         g = 255;
+         b = Math.round(255 * (1 - intensity * 0.7));
+     }} else {{
+         const intensity = normalized;
+         r = 255;
+         g = Math.round(255 * (1 - intensity * 0.7));
+         b = Math.round(255 * (1 - intensity * 0.7));
+     }}
+     return `rgb(${{r}}, ${{g}}, ${{b}})`;
+ }}
+
+ function updateColors(percentileValue) {{
+     const deviations = allDeltas.map(d => Math.abs(d - avgDelta));
+     const maxDeviation = Math.max(percentile(deviations, percentileValue), 1e-6);
+     tokenSpans.forEach(token => {{
+         const delta = parseFloat(token.getAttribute('data-delta'));
849
+ if (!isNaN(delta)) {{
850
+ token.style.backgroundColor = deltaToColor(delta, avgDelta, maxDeviation);
851
+ }}
852
+ }});
853
+ }}
854
+
855
+ slider.addEventListener('input', (e) => {{
856
+ const val = parseInt(e.target.value) / 10;
857
+ saturationValue.textContent = val.toFixed(1) + '%';
858
+ updateColors(val);
859
+ }});
860
+ </script>
861
+ </body>
862
+ </html>
863
+ """
864
+
865
+ return html
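
The percentile-clipped coloring driven by the saturation slider can also be checked outside the template. Below is a minimal Python sketch mirroring the JavaScript `percentile` and `deltaToColor` logic above; the names `percentile` and `delta_to_rgb` are illustrative, not part of any exported API:

```python
def percentile(values, p):
    """Linear-interpolated percentile, matching the JS implementation."""
    s = sorted(values)
    idx = (p / 100) * (len(s) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(s) - 1)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)


def delta_to_rgb(delta, avg_delta, max_deviation):
    """White at the average delta; greener below it, redder above it."""
    if max_deviation == 0:
        return (255, 255, 255)
    normalized = max(-1.0, min(1.0, (delta - avg_delta) / max_deviation))
    fade = round(255 * (1 - abs(normalized) * 0.7))
    if normalized < 0:
        return (fade, 255, fade)  # toward green
    return (255, fade, fade)      # toward red


# Clip the color scale at the 95th percentile of absolute deviations,
# so a handful of outlier tokens do not flatten everything else to white.
deltas = [-0.4, -0.1, 0.0, 0.2, 0.9]
avg = sum(deltas) / len(deltas)
deviations = [abs(d - avg) for d in deltas]
max_dev = max(percentile(deviations, 95), 1e-6)
colors = [delta_to_rgb(d, avg, max_dev) for d in deltas]
```

The percentile clip is the whole point of the slider: lowering it saturates more tokens, raising it reserves strong colors for the largest deltas.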