Jellyfish042 and Claude Opus 4.5 committed
Commit 9afeeeb · 0 parents

Initial commit: UncheatableEval LossLens visualization


- Compare Qwen3-1.7B-Base vs RWKV7-G1C-1.5B byte-level predictions
- Interactive HTML visualization with hover tooltips
- Support both CPU and GPU execution
- Gradio web interface for HuggingFace Spaces

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

.claude/settings.local.json ADDED
@@ -0,0 +1,8 @@
+ {
+   "permissions": {
+     "allow": [
+       "Bash(git init:*)",
+       "Bash(git add:*)"
+     ]
+   }
+ }
.gitattributes ADDED
@@ -0,0 +1,3 @@
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,77 @@
+ ---
+ title: UncheatableEval Visualization
+ emoji: 🔬
+ colorFrom: blue
+ colorTo: green
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ # UncheatableEval: Qwen3 vs RWKV7 Byte-Level Comparison
+
+ Compare the byte-level prediction performance of **Qwen3-1.7B-Base** and **RWKV7-G1C-1.5B**.
+
+ ## Features
+
+ - **Byte-level analysis**: See exactly where each model performs better or worse
+ - **Interactive visualization**: Hover over tokens to see detailed predictions
+ - **Color-coded comparison**:
+   - 🟢 Green = Qwen3 predicts better (lower loss)
+   - 🔴 Red = RWKV7 predicts better (lower loss)
+ - **Top-10 predictions**: View each model's top predictions for every token
+ - **Word occurrence linking**: See how repeated words are predicted differently
+
+ ## How to Use
+
+ 1. Enter or paste your text (max 4000 characters)
+ 2. Click "Run Comparison"
+ 3. Explore the interactive visualization
+ 4. Download the HTML file for offline viewing
+
+ ## Models
+
+ | Model | Type | Parameters | Architecture |
+ |-------|------|------------|--------------|
+ | Qwen3-1.7B-Base | Transformer | 1.7B | Dense attention |
+ | RWKV7-G1C-1.5B | RWKV | 1.5B | Linear attention |
+
+ ## Technical Details
+
+ This tool uses the [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) framework to:
+
+ 1. Tokenize the input text with each model's tokenizer
+ 2. Calculate per-token cross-entropy loss
+ 3. Map token losses to byte-level losses
+ 4. Generate an interactive HTML visualization
+
+ ## Local Development
+
+ ```bash
+ # Clone the repository
+ git clone https://huggingface.co/spaces/YOUR_USERNAME/UncheatableEval-Visualization
+
+ # Install dependencies
+ pip install -r requirements.txt
+
+ # Run locally
+ python app.py
+ ```
+
+ ## Requirements
+
+ - Python 3.10+
+ - CUDA-capable GPU (16GB+ VRAM) recommended; CPU execution is supported but slower
+ - See `requirements.txt` for package dependencies
+
+ ## License
+
+ MIT License
+
+ ## Acknowledgments
+
+ - [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) - Original evaluation framework
+ - [Qwen](https://github.com/QwenLM/Qwen) - Qwen model family
+ - [RWKV](https://github.com/BlinkDL/RWKV-LM) - RWKV model family
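Step 3 of "Technical Details" amounts to spreading each token's cross-entropy loss uniformly over that token's bytes. A minimal sketch of the idea (an illustration only, not the actual UncheatableEval code, and assuming every token covers at least one byte):

```python
def token_losses_to_byte_losses(token_losses, per_token_bytes):
    """Spread each token's loss uniformly over the bytes it covers.

    Assumes every token maps to at least one byte.
    """
    byte_losses = []
    for loss, token_bytes in zip(token_losses, per_token_bytes):
        per_byte = loss / len(token_bytes)
        byte_losses.extend([per_byte] * len(token_bytes))
    return byte_losses

# "Hi!" as two tokens: "Hi" (bytes 72, 105; loss 1.0) and "!" (byte 33; loss 0.3)
byte_losses = token_losses_to_byte_losses([1.0, 0.3], [[72, 105], [33]])
```

Because every byte of a token inherits the same share of that token's loss, models with very different tokenizers can still be compared byte by byte.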
app.py ADDED
@@ -0,0 +1,362 @@
+ """
+ UncheatableEval Visualization - Hugging Face Space
+
+ Compare byte-level prediction performance between Qwen3-1.7B-Base and RWKV7-G1C-1.5B.
+ """
+
+ import gc
+ import os
+ import tempfile
+ from pathlib import Path
+
+ import gradio as gr
+ import torch
+
+ # Detect device
+ DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+ IS_CPU = DEVICE == "cpu"
+
+ # Model configuration
+ QWEN_MODEL_ID = "Qwen/Qwen3-1.7B-Base"
+ RWKV_MODEL_URL = "https://huggingface.co/BlinkDL/rwkv7-g1/resolve/main/rwkv7-g1c-1.5b-20260110-ctx8192.pth"
+ RWKV_MODEL_FILENAME = "rwkv7-g1c-1.5b-20260110-ctx8192.pth"
+
+ # Get the directory where this script is located
+ SCRIPT_DIR = Path(__file__).parent.absolute()
+ MODELS_DIR = SCRIPT_DIR / "models"
+ SUPPORT_DIR = SCRIPT_DIR / "support"
+
+ # Text length limits
+ MAX_TEXT_LENGTH = 4000
+ MIN_TEXT_LENGTH = 10
+
+ # Example texts
+ EXAMPLE_NEWS = """The rapid advancement of artificial intelligence has sparked both excitement and concern among researchers worldwide. While AI systems demonstrate remarkable capabilities in language understanding and generation, questions remain about their potential impact on employment and society."""
+
+ EXAMPLE_CODE = """def fibonacci(n):
+     if n <= 1:
+         return n
+     return fibonacci(n-1) + fibonacci(n-2)
+
+ # Calculate first 10 Fibonacci numbers
+ for i in range(10):
+     print(f"F({i}) = {fibonacci(i)}")"""
+
+ EXAMPLE_LITERATURE = """It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness."""
+
+
+ def download_rwkv_model(progress=None):
+     """Download the RWKV7 model if it does not already exist."""
+     from huggingface_hub import hf_hub_download
+
+     model_path = MODELS_DIR / RWKV_MODEL_FILENAME
+
+     if model_path.exists():
+         return str(model_path)
+
+     MODELS_DIR.mkdir(parents=True, exist_ok=True)
+
+     if progress:
+         progress(0.1, desc="Downloading RWKV7 model...")
+
+     # Download from HuggingFace Hub
+     downloaded_path = hf_hub_download(
+         repo_id="BlinkDL/rwkv7-g1",
+         filename=RWKV_MODEL_FILENAME,
+         local_dir=str(MODELS_DIR),
+         local_dir_use_symlinks=False
+     )
+
+     return downloaded_path
+
+
+ def load_qwen_model():
+     """Load the Qwen3-1.7B-Base model."""
+     from transformers import AutoTokenizer, AutoModelForCausalLM
+
+     tokenizer = AutoTokenizer.from_pretrained(
+         QWEN_MODEL_ID,
+         trust_remote_code=True
+     )
+
+     # Configure based on device
+     if IS_CPU:
+         model_kwargs = {
+             "torch_dtype": torch.float32,
+             "device_map": None,
+             "trust_remote_code": True,
+             "low_cpu_mem_usage": True
+         }
+         model = AutoModelForCausalLM.from_pretrained(
+             QWEN_MODEL_ID,
+             **model_kwargs
+         ).eval()
+     else:
+         model_kwargs = {
+             "torch_dtype": torch.bfloat16,
+             "device_map": "auto",
+             "trust_remote_code": True
+         }
+         try:
+             # Prefer FlashAttention 2 when available
+             model = AutoModelForCausalLM.from_pretrained(
+                 QWEN_MODEL_ID,
+                 attn_implementation="flash_attention_2",
+                 **model_kwargs
+             ).eval()
+         except Exception:
+             model = AutoModelForCausalLM.from_pretrained(
+                 QWEN_MODEL_ID,
+                 **model_kwargs
+             ).eval()
+
+     return model, tokenizer
+
+
+ def load_rwkv7_model(model_path: str):
+     """Load the RWKV7-G1C-1.5B model."""
+     os.environ["RWKV_JIT_ON"] = "1"
+     os.environ["RWKV_V7_ON"] = "1"
+
+     # Set the CUDA flag based on device
+     if IS_CPU:
+         os.environ["RWKV_CUDA_ON"] = "0"
+     else:
+         os.environ["RWKV_CUDA_ON"] = "1"
+
+     from rwkv.model import RWKV
+     from rwkv.rwkv_tokenizer import TRIE_TOKENIZER
+
+     # Use the appropriate strategy for the device
+     if IS_CPU:
+         strategy = "cpu fp32"
+     else:
+         strategy = "cuda fp16"
+
+     model = RWKV(model=model_path, strategy=strategy)
+
+     vocab_path = str(SUPPORT_DIR / "rwkv_vocab_v20230424.txt")
+     tokenizer = TRIE_TOKENIZER(vocab_path)
+
+     return model, tokenizer
+
+
+ def validate_input(text: str) -> tuple[bool, str]:
+     """Validate the input text."""
+     if not text or not text.strip():
+         return False, "Please enter some text to analyze."
+
+     text = text.strip()
+
+     if len(text) < MIN_TEXT_LENGTH:
+         return False, f"Text is too short. Minimum {MIN_TEXT_LENGTH} characters required."
+
+     if len(text) > MAX_TEXT_LENGTH:
+         return False, f"Text is too long. Maximum {MAX_TEXT_LENGTH} characters allowed. Current: {len(text)}"
+
+     return True, text
+
+
+ def wrap_html_in_iframe(html: str) -> str:
+     """Wrap HTML in an iframe for Gradio display."""
+     escaped = html.replace('"', '&quot;')
+     return f'''
+     <div style="width:100%;height:700px;border:1px solid #ddd;border-radius:8px;overflow:hidden;">
+         <iframe srcdoc="{escaped}"
+                 style="width:100%;height:100%;border:none;"
+                 sandbox="allow-scripts"></iframe>
+     </div>
+     '''
+
+
+ def run_evaluation(text: str, progress=gr.Progress()):
+     """Run evaluation on both models and generate the visualization."""
+     from core.evaluator import evaluate_hf_single_sample, evaluate_rwkv7_single_sample
+     from visualization.html_generator import generate_comparison_html
+
+     # Validate input
+     valid, result = validate_input(text)
+     if not valid:
+         raise gr.Error(result)
+
+     text = result  # Use cleaned text
+
+     try:
+         # Step 1: Download the RWKV model if needed
+         progress(0.05, desc="Checking RWKV7 model...")
+         rwkv_model_path = download_rwkv_model(progress)
+
+         # Step 2: Load the Qwen model
+         progress(0.1, desc="Loading Qwen3-1.7B-Base...")
+         qwen_model, qwen_tokenizer = load_qwen_model()
+
+         # Step 3: Evaluate Qwen
+         progress(0.3, desc="Evaluating with Qwen3...")
+         result_qwen = evaluate_hf_single_sample(
+             qwen_model,
+             qwen_tokenizer,
+             text,
+             bos_mode="add_newline_token"
+         )
+
+         # Step 4: Free Qwen memory
+         progress(0.4, desc="Freeing memory...")
+         del qwen_model
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+
+         # Step 5: Load the RWKV7 model
+         progress(0.5, desc="Loading RWKV7-G1C-1.5B...")
+         rwkv_model, rwkv_tokenizer = load_rwkv7_model(rwkv_model_path)
+
+         # Step 6: Evaluate RWKV7
+         progress(0.7, desc="Evaluating with RWKV7...")
+         result_rwkv = evaluate_rwkv7_single_sample(
+             rwkv_model,
+             rwkv_tokenizer,
+             text
+         )
+
+         # Step 7: Free RWKV memory
+         progress(0.8, desc="Freeing memory...")
+         del rwkv_model
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+
+         # Step 8: Generate the visualization
+         progress(0.9, desc="Generating visualization...")
+         html = generate_comparison_html(
+             text=text,
+             byte_losses_a=result_qwen["byte_wise_losses"],
+             byte_losses_b=result_rwkv["byte_wise_losses"],
+             model_a_name="Qwen3-1.7B-Base",
+             model_b_name="RWKV7-G1C-1.5B",
+             topk_predictions_a=result_qwen["top5_predictions"],
+             topk_predictions_b=result_rwkv["top5_predictions"],
+             tokenizer_a=result_qwen["tokenizer"],
+             tokenizer_b=result_rwkv["tokenizer"],
+             model_type_a="hf",
+             model_type_b="rwkv7"
+         )
+
+         # Wrap HTML for iframe display
+         wrapped_html = wrap_html_in_iframe(html)
+
+         # Save HTML for download
+         temp_file = tempfile.NamedTemporaryFile(
+             mode='w',
+             suffix='.html',
+             delete=False,
+             encoding='utf-8'
+         )
+         temp_file.write(html)
+         temp_file.close()
+
+         progress(1.0, desc="Done!")
+
+         return wrapped_html, temp_file.name
+
+     except torch.cuda.OutOfMemoryError:
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+         raise gr.Error(
+             "GPU memory insufficient. Please try:\n"
+             "1. Use shorter text\n"
+             "2. Wait a moment and try again"
+         )
+     except Exception as e:
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+         gc.collect()
+         raise gr.Error(f"Evaluation failed: {str(e)}")
+
+
+ def clear_inputs():
+     """Clear all inputs and outputs."""
+     return "", None, None
+
+
+ # Build the Gradio UI
+ with gr.Blocks(
+     title="UncheatableEval: Qwen3 vs RWKV7",
+     theme=gr.themes.Soft(),
+     css="""
+     .example-btn {
+         margin: 2px !important;
+     }
+     """
+ ) as demo:
+     gr.Markdown("""
+     # 🔬 UncheatableEval: Qwen3 vs RWKV7 Byte-Level Comparison
+
+     Compare the byte-level prediction performance of **Qwen3-1.7B-Base** and **RWKV7-G1C-1.5B**.
+
+     - **Green** = Qwen3 predicts better (lower loss)
+     - **Red** = RWKV7 predicts better (lower loss)
+     - **Hover** over tokens to see detailed predictions and compression rates
+     """)
+
+     with gr.Row():
+         with gr.Column(scale=1):
+             text_input = gr.Textbox(
+                 label="Input Text",
+                 placeholder=f"Enter text to analyze (max {MAX_TEXT_LENGTH} characters)...",
+                 lines=10,
+                 max_lines=20,
+             )
+
+             gr.Markdown("**Examples:**")
+             with gr.Row():
+                 news_btn = gr.Button("📰 News", size="sm", elem_classes=["example-btn"])
+                 code_btn = gr.Button("💻 Code", size="sm", elem_classes=["example-btn"])
+                 lit_btn = gr.Button("📚 Literature", size="sm", elem_classes=["example-btn"])
+
+             with gr.Row():
+                 clear_btn = gr.Button("Clear", variant="secondary")
+                 run_btn = gr.Button("▶ Run Comparison", variant="primary")
+
+     gr.Markdown("---")
+
+     with gr.Row():
+         with gr.Column():
+             output_html = gr.HTML(label="Visualization")
+             download_file = gr.File(label="Download HTML", visible=True)
+
+     # Event handlers
+     news_btn.click(fn=lambda: EXAMPLE_NEWS, outputs=[text_input])
+     code_btn.click(fn=lambda: EXAMPLE_CODE, outputs=[text_input])
+     lit_btn.click(fn=lambda: EXAMPLE_LITERATURE, outputs=[text_input])
+
+     clear_btn.click(
+         fn=clear_inputs,
+         outputs=[text_input, output_html, download_file]
+     )
+
+     run_btn.click(
+         fn=run_evaluation,
+         inputs=[text_input],
+         outputs=[output_html, download_file]
+     )
+
+     gr.Markdown("""
+     ---
+     ### About
+
+     This tool uses [UncheatableEval](https://github.com/Jellyfish042/UncheatableEval) to compare
+     language model performance at the byte level.
+
+     **Models:**
+     - **Qwen3-1.7B-Base**: Transformer-based model from Alibaba
+     - **RWKV7-G1C-1.5B**: Linear-attention model from the RWKV team
+
+     **How it works:**
+     1. Both models predict each byte in the input text
+     2. Lower prediction loss = better compression = better understanding
+     3. The visualization shows where each model performs better or worse
+     """)
+
+
+ if __name__ == "__main__":
+     demo.launch()
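`run_evaluation` above loads only one model at a time, evaluating and freeing it before loading the next, so peak memory stays at roughly one model. The pattern in isolation, with a hypothetical `FakeModel` standing in for the real Qwen/RWKV loaders:

```python
import gc

class FakeModel:
    """Hypothetical stand-in for a large model; the app loads Qwen/RWKV here."""
    def evaluate(self, text):
        # Pretend the "score" is just the UTF-8 byte length of the text
        return len(text.encode("utf-8"))

def run_sequentially(texts):
    """Load one model at a time, evaluate, then free it before the next load,
    mirroring the app's Step 2-7 flow that keeps peak memory at one model."""
    results = []
    for name in ["model_a", "model_b"]:
        model = FakeModel()
        results.append((name, [model.evaluate(t) for t in texts]))
        del model     # drop the only reference...
        gc.collect()  # ...and reclaim memory before the next load
    return results
```

In the real app the `del` is followed by `torch.cuda.empty_cache()` as well, since PyTorch caches freed GPU blocks rather than returning them to the driver.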
core/__init__.py ADDED
@@ -0,0 +1 @@
+ # Core evaluation modules
core/evaluator.py ADDED
@@ -0,0 +1,270 @@
+ """
+ Evaluator module for UncheatableEval visualization.
+
+ Provides single-sample evaluation functions for Qwen3 and RWKV7 models.
+ """
+
+ import gc
+ import math
+ import os
+ from typing import List, Dict, Any, Optional
+
+ import torch
+ import torch.nn.functional as F
+
+ from .helpers import TokenizerBytesConverter
+
+
+ # Compression rate conversion factor: nats -> bits (1/ln 2) -> bytes (x 0.125) -> percent (x 100)
+ COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0
+
+
+ def get_device():
+     """Get the best available device."""
+     if torch.cuda.is_available():
+         return "cuda"
+     else:
+         return "cpu"
+
+
+ def calculate_log_sum(logits: torch.Tensor, target_token_ids: torch.Tensor) -> torch.Tensor:
+     """Calculate the cross-entropy loss for each token."""
+     # Use bfloat16 on CUDA, float32 for CPU compatibility
+     if logits.device.type == "cuda":
+         return F.cross_entropy(logits[:-1].to(torch.bfloat16), target_token_ids[1:], reduction="none")
+     else:
+         return F.cross_entropy(logits[:-1].float(), target_token_ids[1:], reduction="none")
+
+
+ def extract_topk_predictions(logit: torch.Tensor, target_ids: torch.Tensor, k: int = 10) -> List:
+     """
+     Extract top-k predictions from logits.
+
+     Args:
+         logit: [seq_length, vocab_size] logit tensor
+         target_ids: [seq_length] actual target token IDs
+         k: number of top predictions to extract (default: 10)
+
+     Returns:
+         list: [[actual_id, rank, [[id1, prob1], [id2, prob2], ...]], ...]
+     """
+     probs = F.softmax(logit, dim=-1)
+     top_probs, top_ids = torch.topk(probs, k, dim=-1)
+
+     results = []
+     for pos in range(logit.shape[0]):
+         target_id = target_ids[pos].item()
+         actual_prob = probs[pos, target_id].item()
+         rank = (probs[pos] > actual_prob).sum().item() + 1
+
+         topk_list = [
+             [top_ids[pos, i].item(), round(top_probs[pos, i].item(), 6)]
+             for i in range(k)
+         ]
+         results.append([target_id, rank, topk_list])
+
+     return results
+
+
+ def count_model_parameters_in_billions(model) -> float:
+     """Count model parameters in billions."""
+     total_params = sum(p.numel() for p in model.parameters())
+     return total_params / 1e9
+
+
+ def count_rwkv_parameters_in_billions(rwkv_model) -> float:
+     """Count RWKV model parameters in billions."""
+     total_params = 0
+     if hasattr(rwkv_model, "z"):
+         for param in rwkv_model.z.values():
+             total_params += param.numel()
+     if hasattr(rwkv_model, "w"):
+         for param in rwkv_model.w.values():
+             total_params += param.numel()
+     return total_params / 1e9
+
+
+ @torch.no_grad()
+ def evaluate_hf_single_sample(
+     model,
+     tokenizer,
+     text: str,
+     bos_mode: str = "add_newline_token"
+ ) -> Dict[str, Any]:
+     """
+     Evaluate a HuggingFace model on a single text sample.
+
+     Args:
+         model: HuggingFace model
+         tokenizer: HuggingFace tokenizer
+         text: Input text to evaluate
+         bos_mode: How to handle the BOS token
+
+     Returns:
+         dict with byte_wise_losses, top5_predictions, compression_rate, etc.
+     """
+     # Create a token-to-bytes converter
+     token2bytes_converter = TokenizerBytesConverter(
+         model_name_or_path=tokenizer.name_or_path,
+         tokenizer=tokenizer
+     )
+
+     # Determine the BOS token
+     if bos_mode in ["add_default_bos", "replace_with_bos"]:
+         bos_token = tokenizer.bos_token_id
+     elif bos_mode in ["add_default_eos", "replace_with_eos"]:
+         bos_token = tokenizer.eos_token_id
+     elif bos_mode in ["add_newline_token", "replace_with_newline_token"]:
+         bos_token = tokenizer.encode("\n")[0]
+     else:
+         bos_token = tokenizer.bos_token_id
+
+     bos_tensor = torch.tensor([bos_token], device=model.device).unsqueeze(0)
+
+     # Tokenize the input
+     inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
+     inputs = inputs.to(model.device)
+     seq_length = inputs["input_ids"].shape[-1]
+
+     if seq_length < 2:
+         raise ValueError(f"Text is too short (only {seq_length} tokens)")
+
+     # Forward pass
+     input_chunk = inputs["input_ids"]
+     if bos_mode in ["add_default_bos", "add_default_eos", "add_newline_token"]:
+         input_chunk = torch.cat([bos_tensor, input_chunk], dim=-1)
+     if bos_mode in ["replace_with_bos", "replace_with_eos", "replace_with_newline_token"]:
+         input_chunk[0, 0] = bos_token
+
+     logit = model.forward(input_ids=input_chunk).logits[0, :, :]
+     loss = calculate_log_sum(logit, input_chunk.squeeze(0))
+
+     # Get per-token bytes
+     per_token_bytes = token2bytes_converter.encode_to_bytes(text)
+
+     # Verify that the bytes match
+     all_bytes = [byte for token in per_token_bytes for byte in token]
+     expected_bytes = list(text.encode("utf-8"))
+     if all_bytes != expected_bytes:
+         raise ValueError("Token bytes don't match original text bytes")
+
+     # Extract top-k predictions
+     sample_topk = extract_topk_predictions(
+         logit[:-1], input_chunk.squeeze(0)[1:]
+     )
+
+     # Calculate byte-wise losses
+     byte_wise_losses = []
+     pending_loss = 0.0
+
+     for l, byte_values in zip(loss, per_token_bytes):
+         current_loss = l.item() + pending_loss
+         pending_loss = 0.0
+
+         if len(byte_values) == 0:
+             # Carry the loss of a zero-byte token over to the next token
+             pending_loss = current_loss
+             continue
+
+         per_byte_loss = current_loss / len(byte_values)
+         for _ in range(len(byte_values)):
+             byte_wise_losses.append(per_byte_loss)
+
+     # Calculate overall metrics
+     total_loss = loss.sum().item()
+     num_bytes = len(text.encode("utf-8"))
+     avg_loss = total_loss / seq_length
+     compression_rate = avg_loss * COMPRESSION_RATE_FACTOR
+
+     return {
+         "byte_wise_losses": byte_wise_losses,
+         "top5_predictions": sample_topk,
+         "compression_rate": compression_rate,
+         "total_loss": total_loss,
+         "num_tokens": seq_length,
+         "num_bytes": num_bytes,
+         "model_name": getattr(model.config, "_name_or_path", "unknown"),
+         "tokenizer": tokenizer
+     }
+
+
+ @torch.no_grad()
+ def evaluate_rwkv7_single_sample(
+     model,
+     tokenizer,
+     text: str
+ ) -> Dict[str, Any]:
+     """
+     Evaluate an RWKV7 model on a single text sample.
+
+     Args:
+         model: RWKV7 model
+         tokenizer: RWKV tokenizer (TRIE_TOKENIZER)
+         text: Input text to evaluate
+
+     Returns:
+         dict with byte_wise_losses, top5_predictions, compression_rate, etc.
+     """
+     # Tokenize
+     tokenized = tokenizer.encode(text)
+     if hasattr(tokenized, "ids"):
+         input_seq = tokenized.ids
+     else:
+         input_seq = tokenized
+
+     input_length = len(input_seq)
+
+     if input_length < 2:
+         raise ValueError(f"Text is too short (only {input_length} tokens)")
+
+     # Forward pass with state, in chunks
+     input_chunk = [0] + input_seq  # Add BOS token (0)
+     device = get_device()
+
+     CHUNK_LEN = 1024
+     state = None
+     logit = torch.empty((0, 65536), device=device)
+
+     temp_input = input_chunk.copy()
+     while len(temp_input) > 0:
+         out, state = model.forward(temp_input[:CHUNK_LEN], state, full_output=True)
+         if len(temp_input) == 1:
+             out = out.unsqueeze(0)
+         temp_input = temp_input[CHUNK_LEN:]
+         logit = torch.concat((logit, out.to(device)), dim=0)
+
+     if len(input_chunk) == 1:
+         logit = logit.unsqueeze(0)
+
+     loss = calculate_log_sum(logit, torch.tensor(input_chunk).to(device))
+
+     # Get per-token bytes
+     token_bytes = [tokenizer.decodeBytes([token]) for token in input_chunk[1:]]
+
+     # Extract top-k predictions
+     sample_topk = extract_topk_predictions(
+         logit[:-1], torch.tensor(input_chunk[1:]).to(device)
+     )
+
+     # Calculate byte-wise losses
+     byte_wise_losses = []
+     for l, byte_values in zip(loss.tolist(), token_bytes):
+         per_byte_loss = l / len(byte_values)
+         for _ in range(len(byte_values)):
+             byte_wise_losses.append(per_byte_loss)
+
+     # Calculate overall metrics
+     total_loss = loss.sum().item()
+     num_bytes = len(text.encode("utf-8"))
+     avg_loss = total_loss / input_length
+     compression_rate = avg_loss * COMPRESSION_RATE_FACTOR
+
+     return {
+         "byte_wise_losses": byte_wise_losses,
+         "top5_predictions": sample_topk,
+         "compression_rate": compression_rate,
+         "total_loss": total_loss,
+         "num_tokens": input_length,
+         "num_bytes": num_bytes,
+         "model_name": "RWKV7-G1C-1.5B",
+         "tokenizer": tokenizer
+     }
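The `COMPRESSION_RATE_FACTOR` used above converts an average per-token loss from nats into a percentage of a byte: dividing by ln 2 yields bits, multiplying by 0.125 yields bytes, and multiplying by 100 yields a percentage. A quick sanity check of that conversion:

```python
import math

# Same constant as in core/evaluator.py
COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0

# An average loss of ln(2) nats per token is exactly 1 bit = 0.125 bytes,
# so the compression rate comes out to (approximately) 12.5%
rate = math.log(2.0) * COMPRESSION_RATE_FACTOR
```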
core/helpers.py ADDED
@@ -0,0 +1,266 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ Helper utilities for UncheatableEval visualization.
3
+
4
+ Contains TokenizerBytesConverter for mapping tokens to bytes.
5
+ """
6
+
7
+ import json
8
+ import re
9
+ from typing import Dict, List, Optional
10
+
11
+
12
+ def bytes_to_unicode() -> Dict[int, str]:
13
+ """
14
+ GPT-2 style byte-to-unicode mapping.
15
+ Maps byte values 0-255 to printable Unicode characters.
16
+ """
17
+ bs = list(range(ord("!"), ord("~") + 1)) + list(range(ord("¡"), ord("¬") + 1)) + list(range(ord("®"), ord("ÿ") + 1))
18
+ cs = bs[:]
19
+ n = 0
20
+ for b in range(2**8):
21
+ if b not in bs:
22
+ bs.append(b)
23
+ cs.append(2**8 + n)
24
+ n += 1
25
+ cs = [chr(n) for n in cs]
26
+ return dict(zip(bs, cs))
27
+
28
+
29
+ class TokenizerBytesConverter:
30
+ """
31
+ Universal Token-to-Bytes Converter for HuggingFace tokenizers.
32
+
33
+ Supports two encoding schemes:
34
+ 1. ByteLevel BPE (Llama 3.x, Qwen, GPT-2 style)
35
+ 2. SentencePiece with ByteFallback (Mistral, early LLaMA)
36
+
37
+ Usage:
38
+ converter = TokenizerBytesConverter("meta-llama/Llama-3.2-1B")
39
+ nested_bytes = converter.encode_to_bytes("Hello world")
40
+ # Returns: [[72, 101, 108, 108, 111], [32, 119, 111, 114, 108, 100]]
41
+ """
42
+
43
+ # Class-level mapping table cache
44
+ _BYTE_TO_UNICODE = bytes_to_unicode()
45
+ _UNICODE_TO_BYTE = {v: k for k, v in _BYTE_TO_UNICODE.items()}
46
+
47
+ def __init__(
48
+ self,
49
+ model_name_or_path: str = None,
50
+ cache_dir: Optional[str] = None,
51
+ trust_remote_code: bool = True,
52
+ tokenizer=None,
53
+ ):
54
+ """
55
+ Initialize the converter.
56
+
57
+ Args:
58
+ model_name_or_path: HuggingFace model name or local path
59
+ cache_dir: Directory to cache the downloaded tokenizer files
60
+ trust_remote_code: Whether to trust remote code for custom tokenizers
61
+ tokenizer: Optional pre-loaded tokenizer instance for encoding.
62
+ If provided, this tokenizer will be used for encode() calls,
63
+ while AutoTokenizer is still used to extract vocab/decoder config.
64
+ """
65
+ from transformers import AutoTokenizer
66
+
67
+ # Always load AutoTokenizer for vocab extraction
68
+ auto_tokenizer = AutoTokenizer.from_pretrained(
69
+ model_name_or_path,
70
+ cache_dir=cache_dir,
71
+ trust_remote_code=trust_remote_code,
72
+ )
73
+
74
+ # Use provided tokenizer for encoding, or fall back to auto_tokenizer
75
+ self._tokenizer = tokenizer if tokenizer is not None else auto_tokenizer
76
+
77
+ # Extract tokenizer.json from the AutoTokenizer's backend
78
+ if hasattr(auto_tokenizer, "backend_tokenizer") and hasattr(auto_tokenizer.backend_tokenizer, "to_str"):
79
+ tokenizer_json = json.loads(auto_tokenizer.backend_tokenizer.to_str())
80
+ else:
81
+ raise ValueError("Tokenizer object is not supported. " "The tokenizer must have a backend_tokenizer with to_str() method.")
82
+
83
+ self._tokenizer_json = tokenizer_json
84
+ self._vocab = tokenizer_json["model"]["vocab"]
85
+ self._id_to_token: Dict[int, str] = {v: k for k, v in self._vocab.items()}
86
+
87
+ # Detect encoding type
88
+ self._decoder_type = self._detect_decoder_type()
89
+
90
+ # Load added_tokens
91
+ self._load_added_tokens()
92
+
93
+ def _detect_decoder_type(self) -> str:
94
+ """Detect the decoder type from tokenizer.json."""
95
+ decoder = self._tokenizer_json.get("decoder", {})
96
+ decoder_type = decoder.get("type", "")
97
+
98
+ if decoder_type == "ByteLevel":
99
+ return "bytelevel"
100
+ elif decoder_type == "Sequence":
101
+ decoders = decoder.get("decoders", [])
102
+ for d in decoders:
103
+ if d.get("type") == "ByteFallback":
104
+ return "sentencepiece"
105
+ for d in decoders:
106
+ if d.get("type") == "ByteLevel":
107
+ return "bytelevel"
108
+
109
+ # Fallback: check model configuration
110
+ model = self._tokenizer_json.get("model", {})
111
+ if model.get("byte_fallback", False):
112
+ return "sentencepiece"
113
+
114
+ # Default to bytelevel
115
+ return "bytelevel"
116
+
117
+ def _load_added_tokens(self):
118
+ """Load added_tokens into the vocabulary."""
119
+ self._special_token_ids = set()
120
+ added_tokens = self._tokenizer_json.get("added_tokens", [])
121
+ for token_info in added_tokens:
122
+ token_id = token_info["id"]
123
+ content = token_info["content"]
124
+ self._id_to_token[token_id] = content
125
+ if token_info.get("special", False):
126
+ self._special_token_ids.add(token_id)
127
+
128
+ @property
129
+ def decoder_type(self) -> str:
130
+ """Return the detected decoder type."""
131
+ return self._decoder_type
132
+
133
+ @property
134
+ def vocab_size(self) -> int:
135
+ """Return the vocabulary size."""
136
+ return len(self._id_to_token)
137
+
138
+ @property
139
+ def tokenizer(self):
140
+ """Return the underlying HuggingFace tokenizer."""
141
+ return self._tokenizer
142
+
143
+ def get_token_string(self, token_id: int) -> Optional[str]:
144
+ """Get the raw string for a token_id."""
145
+ return self._id_to_token.get(token_id)
146
+
147
+ def token_to_bytes(self, token_id: int) -> Optional[List[int]]:
148
+ """
149
+ Map a single token_id to its byte sequence.
150
+
151
+ Args:
152
+ token_id: The token ID
153
+
154
+ Returns:
155
+ List of byte values (0-255) as integers, or None if token_id doesn't exist
156
+ """
157
+ token_str = self._id_to_token.get(token_id)
158
+ if token_str is None:
159
+ return None
160
+
161
+ if self._decoder_type == "bytelevel":
162
+ return self._decode_bytelevel(token_str)
163
+ else:
164
+ return self._decode_sentencepiece(token_str)
165
+
166
+ def _decode_bytelevel(self, token_str: str) -> List[int]:
167
+ """
168
+ ByteLevel decoding: map each Unicode character back to a byte.
169
+ """
170
+ result = []
171
+ for char in token_str:
172
+ if char in self._UNICODE_TO_BYTE:
173
+ result.append(self._UNICODE_TO_BYTE[char])
174
+ else:
175
+ # Characters not in the mapping table are encoded as UTF-8
176
+ result.extend(char.encode("utf-8"))
177
+ return result
178
+
179
+ def _decode_sentencepiece(self, token_str: str) -> List[int]:
180
+ """
181
+ SentencePiece decoding: handle ▁ and <0xXX> format.
182
+ """
183
+ result = []
184
+ i = 0
185
+ while i < len(token_str):
186
+ # Check for <0xXX> format
187
+ match = re.match(r"<0x([0-9A-Fa-f]{2})>", token_str[i:])
188
+ if match:
189
+ byte_val = int(match.group(1), 16)
190
+ result.append(byte_val)
191
+ i += 6
192
+ elif token_str[i] == "▁":
193
+ # Replace ▁ with space
194
+ result.append(0x20)
195
+ i += 1
196
+ else:
197
+ result.extend(token_str[i].encode("utf-8"))
198
+ i += 1
199
+ return result
200
+
201
+ def encode_to_bytes(
202
+ self,
203
+ text: str,
204
+ add_special_tokens: bool = False,
205
+ strip_leading_space: bool = True,
206
+ ) -> List[List[int]]:
207
+ """
208
+ Encode text to a nested list of bytes.
209
+
210
+ Each sub-list contains the byte values (as integers) for one token.
211
+
212
+ Args:
213
+ text: Input text to encode
214
+ add_special_tokens: Whether to add special tokens (BOS, EOS, etc.)
215
+ strip_leading_space: For SentencePiece, whether to strip the leading space
216
+ from the first token
217
+
218
+ Returns:
219
+ Nested list where each inner list contains byte values for one token.
220
+ Example: [[72, 101, 108, 108, 111], [32, 119, 111, 114, 108, 100]]
221
+ """
222
+ token_ids = self._tokenizer.encode(text, add_special_tokens=add_special_tokens)
223
+
224
+ result = []
225
+ for idx, token_id in enumerate(token_ids):
226
+ token_bytes = self.token_to_bytes(token_id)
227
+ if token_bytes is not None:
228
+ # Handle SentencePiece leading space
229
+ if idx == 0 and self._decoder_type == "sentencepiece" and strip_leading_space and token_bytes and token_bytes[0] == 0x20:
230
+ token_bytes = token_bytes[1:]
231
+
232
+ result.append(token_bytes)
233
+
234
+ return result
235
+
236
+ def encode_to_flat_bytes(
237
+ self,
238
+ text: str,
239
+ add_special_tokens: bool = False,
240
+ strip_leading_space: bool = True,
241
+ ) -> bytes:
242
+ """
243
+ Encode text to a flat byte sequence.
244
+
245
+ Args:
246
+ text: Input text to encode
247
+ add_special_tokens: Whether to add special tokens
248
+ strip_leading_space: For SentencePiece, whether to strip the leading space
249
+
250
+ Returns:
251
+ Concatenated bytes from all tokens
252
+ """
253
+ nested = self.encode_to_bytes(text, add_special_tokens, strip_leading_space)
254
+ result = []
255
+ for token_bytes in nested:
256
+ result.extend(token_bytes)
257
+ return bytes(result)
258
+
259
+ def get_all_token_bytes(self) -> Dict[int, List[int]]:
260
+ """
261
+ Get byte mapping for all tokens in the vocabulary.
262
+
263
+ Returns:
264
+ Dictionary mapping token_id to list of byte values
265
+ """
266
+ return {token_id: self.token_to_bytes(token_id) for token_id in self._id_to_token}
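
The `<0xXX>` / `▁` decoding rule implemented by `_decode_sentencepiece` above can be sanity-checked in isolation. The sketch below is a standalone re-statement of that rule, not code from the repo; the function name `decode_sentencepiece_token` is illustrative:

```python
import re
from typing import List

def decode_sentencepiece_token(token_str: str) -> List[int]:
    """Mirror of the SentencePiece rule above: '<0xXX>' hex tokens become one
    byte, '▁' becomes a space (0x20), everything else is UTF-8 encoded."""
    result: List[int] = []
    i = 0
    while i < len(token_str):
        match = re.match(r"<0x([0-9A-Fa-f]{2})>", token_str[i:])
        if match:
            result.append(int(match.group(1), 16))
            i += 6  # len("<0xXX>")
        elif token_str[i] == "▁":
            result.append(0x20)
            i += 1
        else:
            result.extend(token_str[i].encode("utf-8"))
            i += 1
    return result

# "▁Hello" → [0x20, 72, 101, 108, 108, 111]; "<0xE4>" → [0xE4]
```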
examples/sample_texts.json ADDED
@@ -0,0 +1,24 @@
+ {
+   "examples": [
+     {
+       "name": "News",
+       "text": "The rapid advancement of artificial intelligence has sparked both excitement and concern among researchers worldwide. While AI systems demonstrate remarkable capabilities in language understanding and generation, questions remain about their potential impact on employment and society."
+     },
+     {
+       "name": "Code",
+       "text": "def fibonacci(n):\n    if n <= 1:\n        return n\n    return fibonacci(n-1) + fibonacci(n-2)\n\n# Calculate first 10 Fibonacci numbers\nfor i in range(10):\n    print(f\"F({i}) = {fibonacci(i)}\")"
+     },
+     {
+       "name": "Literature",
+       "text": "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness."
+     },
+     {
+       "name": "Chinese",
+       "text": "人工智能的快速发展在全球研究人员中引发了兴奋和担忧。虽然人工智能系统在语言理解和生成方面展现了非凡的能力,但关于其对就业和社会的潜在影响的问题仍然存在。"
+     },
+     {
+       "name": "Mixed",
+       "text": "The transformer architecture, introduced in the paper \"Attention Is All You Need\" (2017), revolutionized NLP. 这种架构使用自注意力机制来处理序列数据,比传统的RNN和LSTM更加高效。"
+     }
+   ]
+ }
requirements.txt ADDED
@@ -0,0 +1,11 @@
+ torch>=2.0.0
+ transformers>=4.35.0
+ tokenizers>=0.15.0
+ gradio>=4.0.0
+ numpy>=1.24.0
+ tqdm>=4.65.0
+ packaging
+ rwkv>=0.8.0
+ requests
+ huggingface_hub
+ accelerate
support/rwkv_vocab_v20230424.txt ADDED
The diff for this file is too large to render. See raw diff
 
visualization/__init__.py ADDED
@@ -0,0 +1 @@
+ # Visualization modules
visualization/html_generator.py ADDED
@@ -0,0 +1,865 @@
+ """
+ HTML visualization generator for UncheatableEval.
+
+ Generates interactive HTML visualizations comparing byte-level losses between two models.
+ """
+
+ import json
+ import math
+ import re
+ from typing import List, Tuple, Optional, Set
+
+ import numpy as np
+
+ from core.helpers import TokenizerBytesConverter
+
+
+ # Compression rate conversion factor
+ COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0
+
+ # Global tokenizers (lazy loaded)
+ _qwen_tokenizer = None
+ _rwkv_tokenizer = None
+
+
+ def get_qwen_tokenizer():
+     """Lazy load Qwen tokenizer."""
+     global _qwen_tokenizer
+     if _qwen_tokenizer is None:
+         _qwen_tokenizer = TokenizerBytesConverter("Qwen/Qwen3-0.6B-Base")
+     return _qwen_tokenizer
+
+
+ def get_rwkv_tokenizer():
+     """Lazy load RWKV tokenizer."""
+     global _rwkv_tokenizer
+     if _rwkv_tokenizer is None:
+         from rwkv.rwkv_tokenizer import TRIE_TOKENIZER
+         import os
+         script_dir = os.path.dirname(os.path.abspath(__file__))
+         vocab_path = os.path.join(os.path.dirname(script_dir), "support", "rwkv_vocab_v20230424.txt")
+         _rwkv_tokenizer = TRIE_TOKENIZER(vocab_path)
+     return _rwkv_tokenizer
+
+
+ def get_tokenizer_boundaries(text: str, tokenizer, is_rwkv: bool = False) -> Set[int]:
+     """Get token boundaries (byte positions) for a given text."""
+     boundaries = set()
+     boundaries.add(0)
+
+     if is_rwkv:
+         tokenized = tokenizer.encode(text)
+         if hasattr(tokenized, "ids"):
+             token_ids = tokenized.ids
+         else:
+             token_ids = tokenized
+
+         byte_pos = 0
+         for token_id in token_ids:
+             token_bytes = tokenizer.decodeBytes([token_id])
+             byte_pos += len(token_bytes)
+             boundaries.add(byte_pos)
+     else:
+         token_bytes_list = tokenizer.encode_to_bytes(text)
+         byte_pos = 0
+         for token_bytes in token_bytes_list:
+             byte_pos += len(token_bytes)
+             boundaries.add(byte_pos)
+
+     return boundaries
+
+
+ def get_token_info_for_text(text: str) -> dict:
+     """Get detailed token information for each byte position."""
+     qwen_tokenizer = get_qwen_tokenizer()
+     rwkv_tokenizer = get_rwkv_tokenizer()
+
+     # Get Qwen tokens with positions
+     qwen_tokens = []
+     byte_to_qwen = {}
+     qwen_bytes_list = qwen_tokenizer.encode_to_bytes(text)
+     byte_pos = 0
+     for idx, token_bytes in enumerate(qwen_bytes_list):
+         start = byte_pos
+         end = byte_pos + len(token_bytes)
+         try:
+             token_str = bytes(token_bytes).decode("utf-8")
+         except UnicodeDecodeError:
+             token_str = repr(bytes(token_bytes))
+         qwen_tokens.append((start, end, token_str))
+         byte_to_qwen[start] = idx
+         byte_pos = end
+
+     # Get RWKV tokens with positions
+     rwkv_tokens = []
+     byte_to_rwkv = {}
+     tokenized = rwkv_tokenizer.encode(text)
+     if hasattr(tokenized, "ids"):
+         token_ids = tokenized.ids
+     else:
+         token_ids = tokenized
+
+     byte_pos = 0
+     for idx, token_id in enumerate(token_ids):
+         token_bytes = rwkv_tokenizer.decodeBytes([token_id])
+         start = byte_pos
+         end = byte_pos + len(token_bytes)
+         try:
+             token_str = token_bytes.decode("utf-8")
+         except UnicodeDecodeError:
+             token_str = repr(token_bytes)
+         rwkv_tokens.append((start, end, token_str))
+         byte_to_rwkv[start] = idx
+         byte_pos = end
+
+     # Get common boundaries
+     qwen_boundaries = set([0] + [t[1] for t in qwen_tokens])
+     rwkv_boundaries = set([0] + [t[1] for t in rwkv_tokens])
+     common_boundaries = sorted(qwen_boundaries & rwkv_boundaries)
+
+     return {
+         "common_boundaries": common_boundaries,
+         "qwen_tokens": qwen_tokens,
+         "rwkv_tokens": rwkv_tokens,
+         "byte_to_qwen": byte_to_qwen,
+         "byte_to_rwkv": byte_to_rwkv,
+     }
+
+
+ def delta_to_color(delta: float, avg_delta: float, max_deviation: float) -> Tuple[int, int, int]:
+     """Map a delta value to an RGB color based on deviation from average."""
+     if max_deviation == 0:
+         return (255, 255, 255)
+
+     deviation = delta - avg_delta
+     normalized = max(-1, min(1, deviation / max_deviation))
+
+     if normalized < 0:
+         intensity = -normalized
+         r = int(255 * (1 - intensity * 0.7))
+         g = 255
+         b = int(255 * (1 - intensity * 0.7))
+     else:
+         intensity = normalized
+         r = 255
+         g = int(255 * (1 - intensity * 0.7))
+         b = int(255 * (1 - intensity * 0.7))
+
+     return (r, g, b)
+
+
+ def generate_comparison_html(
+     text: str,
+     byte_losses_a: List[float],
+     byte_losses_b: List[float],
+     model_a_name: str,
+     model_b_name: str,
+     topk_predictions_a: Optional[List] = None,
+     topk_predictions_b: Optional[List] = None,
+     tokenizer_a=None,
+     tokenizer_b=None,
+     model_type_a: str = "hf",
+     model_type_b: str = "rwkv7",
+ ) -> str:
+     """
+     Generate an interactive HTML visualization comparing two models.
+
+     Args:
+         text: The input text that was evaluated
+         byte_losses_a: Per-byte losses from model A
+         byte_losses_b: Per-byte losses from model B
+         model_a_name: Display name for model A
+         model_b_name: Display name for model B
+         topk_predictions_a: Top-k predictions from model A
+         topk_predictions_b: Top-k predictions from model B
+         tokenizer_a: Tokenizer for model A
+         tokenizer_b: Tokenizer for model B
+         model_type_a: Type of model A ("hf" or "rwkv7")
+         model_type_b: Type of model B ("hf" or "rwkv7")
+
+     Returns:
+         HTML string with interactive visualization
+     """
+
+     def decode_token(token_id: int, tokenizer, model_type: str) -> str:
+         # HF and RWKV tokenizers both expose decode([id]) here
+         if tokenizer is None:
+             return f"[{token_id}]"
+         try:
+             return tokenizer.decode([token_id])
+         except Exception:
+             return f"[{token_id}]"
+
+     def build_byte_to_token_map(text: str, tokenizer, model_type: str):
+         if tokenizer is None:
+             return []
+
+         token_ranges = []
+
+         try:
+             if model_type in ["rwkv", "rwkv7"]:
+                 tokenized = tokenizer.encode(text)
+                 if hasattr(tokenized, "ids"):
+                     token_ids = tokenized.ids
+                 else:
+                     token_ids = tokenized
+
+                 byte_pos = 0
+                 for idx, token_id in enumerate(token_ids):
+                     try:
+                         token_bytes = tokenizer.decodeBytes([token_id])
+                         token_ranges.append((byte_pos, byte_pos + len(token_bytes), idx))
+                         byte_pos += len(token_bytes)
+                     except Exception:
+                         # Skip tokens that cannot be decoded to bytes
+                         pass
+             else:
+                 tokenizer_name = getattr(tokenizer, "name_or_path", None)
+                 if tokenizer_name:
+                     converter = TokenizerBytesConverter(tokenizer_name, trust_remote_code=True)
+                     token_bytes_list = converter.encode_to_bytes(text)
+                     byte_pos = 0
+                     for idx, token_bytes in enumerate(token_bytes_list):
+                         token_ranges.append((byte_pos, byte_pos + len(token_bytes), idx))
+                         byte_pos += len(token_bytes)
+         except Exception as e:
+             print(f"Warning: Could not build byte-to-token map ({model_type}): {e}")
+             return []
+
+         return token_ranges
+
+     def find_token_for_byte(byte_pos: int, token_ranges):
+         for start, end, idx in token_ranges:
+             if start <= byte_pos < end:
+                 return idx
+         return None
+
+     # Calculate deltas
+     deltas = [a - b for a, b in zip(byte_losses_a, byte_losses_b)]
+     avg_delta = sum(deltas) / len(deltas) if deltas else 0
+
+     # Calculate max deviation
+     deviations = [d - avg_delta for d in deltas]
+     abs_deviations = [abs(dev) for dev in deviations]
+     max_deviation = float(np.percentile(abs_deviations, 100)) if abs_deviations else 0
+     max_deviation = max(max_deviation, 1e-6)
+
+     # Calculate average compression rates
+     avg_compression_a = sum(byte_losses_a) / len(byte_losses_a) * COMPRESSION_RATE_FACTOR if byte_losses_a else 0
+     avg_compression_b = sum(byte_losses_b) / len(byte_losses_b) * COMPRESSION_RATE_FACTOR if byte_losses_b else 0
+     avg_delta_compression = avg_delta * COMPRESSION_RATE_FACTOR
+
+     # Get token info
+     text_bytes = text.encode("utf-8")
+     token_info = get_token_info_for_text(text)
+     common_boundaries = token_info["common_boundaries"]
+     qwen_tokens = token_info["qwen_tokens"]
+     rwkv_tokens = token_info["rwkv_tokens"]
+
+     # Build byte position to token index mapping
+     model_a_token_ranges = build_byte_to_token_map(text, tokenizer_a, model_type_a)
+     model_b_token_ranges = build_byte_to_token_map(text, tokenizer_b, model_type_b)
+
+     def get_tokens_for_range(byte_start, byte_end, token_list):
+         result = []
+         for idx, (t_start, t_end, t_str) in enumerate(token_list):
+             if t_start < byte_end and t_end > byte_start:
+                 result.append((idx, t_str))
+         return result
+
+     # Build tokens based on common boundaries
+     tokens = []
+     for i in range(len(common_boundaries) - 1):
+         start_byte = common_boundaries[i]
+         end_byte = common_boundaries[i + 1]
+         token_bytes = text_bytes[start_byte:end_byte]
+         try:
+             token_text = token_bytes.decode("utf-8")
+         except UnicodeDecodeError:
+             continue
+
+         qwen_toks = get_tokens_for_range(start_byte, end_byte, qwen_tokens)
+         rwkv_toks = get_tokens_for_range(start_byte, end_byte, rwkv_tokens)
+
+         if re.search(r"\w", token_text, re.UNICODE):
+             tokens.append({
+                 "type": "word",
+                 "text": token_text,
+                 "byte_start": start_byte,
+                 "byte_end": end_byte,
+                 "word_lower": token_text.lower(),
+                 "qwen_tokens": qwen_toks,
+                 "rwkv_tokens": rwkv_toks,
+             })
+         else:
+             tokens.append({
+                 "type": "non-word",
+                 "text": token_text,
+                 "byte_start": start_byte,
+                 "byte_end": end_byte,
+                 "qwen_tokens": qwen_toks,
+                 "rwkv_tokens": rwkv_toks,
+             })
+
+     # Track word occurrences
+     word_occurrences = {}
+     word_id_counter = 0
+
+     for i, token in enumerate(tokens):
+         if token["type"] == "word":
+             word_lower = token["word_lower"]
+             if word_lower not in word_occurrences:
+                 word_occurrences[word_lower] = []
+             word_occurrences[word_lower].append(i)
+             token["word_id"] = word_id_counter
+             word_id_counter += 1
+
+     # Build HTML content
+     html_content = []
+
+     def escape_for_attr(s):
+         return s.replace("&", "&amp;").replace('"', "&quot;").replace("<", "&lt;").replace(">", "&gt;")
+
+     for token in tokens:
+         token_text = token["text"]
+         byte_start = token["byte_start"]
+         byte_end = token["byte_end"]
+
+         qwen_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["qwen_tokens"]])
+         rwkv_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["rwkv_tokens"]])
+
+         raw_bytes = list(text_bytes[byte_start:byte_end])
+         losses_a = byte_losses_a[byte_start:byte_end]
+         losses_b = byte_losses_b[byte_start:byte_end]
+
+         bytes_str = " ".join([f"{b:02x}" for b in raw_bytes])
+         compression_a_str = " ".join([f"{l * COMPRESSION_RATE_FACTOR:.2f}%" for l in losses_a])
+         compression_b_str = " ".join([f"{l * COMPRESSION_RATE_FACTOR:.2f}%" for l in losses_b])
+
+         topk_a_json = ""
+         topk_b_json = ""
+         if topk_predictions_a is not None and model_a_token_ranges:
+             model_a_token_idx = find_token_for_byte(byte_start, model_a_token_ranges)
+             if model_a_token_idx is not None and model_a_token_idx < len(topk_predictions_a):
+                 pred = topk_predictions_a[model_a_token_idx]
+                 decoded_pred = [
+                     pred[0],
+                     pred[1],
+                     [[tid, prob, decode_token(tid, tokenizer_a, model_type_a)] for tid, prob in pred[2]],
+                 ]
+                 topk_a_json = json.dumps(decoded_pred, ensure_ascii=False)
+         if topk_predictions_b is not None and model_b_token_ranges:
+             model_b_token_idx = find_token_for_byte(byte_start, model_b_token_ranges)
+             if model_b_token_idx is not None and model_b_token_idx < len(topk_predictions_b):
+                 pred = topk_predictions_b[model_b_token_idx]
+                 decoded_pred = [
+                     pred[0],
+                     pred[1],
+                     [[tid, prob, decode_token(tid, tokenizer_b, model_type_b)] for tid, prob in pred[2]],
+                 ]
+                 topk_b_json = json.dumps(decoded_pred, ensure_ascii=False)
+
+         token_deltas = deltas[byte_start:byte_end]
+         avg_token_delta = sum(token_deltas) / len(token_deltas) if token_deltas else 0
+
+         color = delta_to_color(avg_token_delta, avg_delta, max_deviation)
+         r, g, b = color
+
+         token_html_parts = []
+         for char in token_text:
+             if char == "<":
+                 escaped_char = "&lt;"
+             elif char == ">":
+                 escaped_char = "&gt;"
+             elif char == "&":
+                 escaped_char = "&amp;"
+             elif char == "\n":
+                 escaped_char = "<br>"
+             elif char == " ":
+                 escaped_char = "&nbsp;"
+             elif char == "\t":
+                 escaped_char = "&nbsp;&nbsp;&nbsp;&nbsp;"
+             else:
+                 escaped_char = char
+             token_html_parts.append(escaped_char)
+
+         token_span_content = "".join(token_html_parts)
+         data_attrs = (
+             f'data-qwen="{escape_for_attr(qwen_info)}" '
+             f'data-rwkv="{escape_for_attr(rwkv_info)}" '
+             f'data-bytes="{escape_for_attr(bytes_str)}" '
+             f'data-compression-a="{escape_for_attr(compression_a_str)}" '
+             f'data-compression-b="{escape_for_attr(compression_b_str)}" '
+             f'data-delta="{avg_token_delta * COMPRESSION_RATE_FACTOR:.4f}" '
+             f'data-topk-a="{escape_for_attr(topk_a_json)}" '
+             f'data-topk-b="{escape_for_attr(topk_b_json)}"'
+         )
+         style_attr = f'style="background-color: rgb({r},{g},{b})"'
+
+         if token["type"] == "word":
+             word_lower = token["word_lower"]
+             occurrences = word_occurrences[word_lower]
+             if len(occurrences) > 1:
+                 word_id = token["word_id"]
+                 html_content.append(
+                     f'<span class="token word" {data_attrs} {style_attr} data-word="{word_lower}" data-word-id="{word_id}">'
+                     + token_span_content
+                     + "</span>"
+                 )
+             else:
+                 html_content.append(f'<span class="token" {data_attrs} {style_attr}>{token_span_content}</span>')
+         else:
+             html_content.append(f'<span class="token" {data_attrs} {style_attr}>{token_span_content}</span>')
+
+     delta_color = "#64ff64" if avg_delta < 0 else "#ff6464"
+
+ html = f"""<!DOCTYPE html>
414
+ <html>
415
+ <head>
416
+ <meta charset="UTF-8">
417
+ <title>UncheatableEval - Byte-wise Loss Comparison</title>
418
+ <style>
419
+ body {{
420
+ font-family: Consolas, 'Courier New', monospace;
421
+ margin: 0;
422
+ padding: 0;
423
+ background-color: #f5f5f5;
424
+ }}
425
+ .header {{
426
+ background-color: #333;
427
+ color: white;
428
+ padding: 20px;
429
+ position: sticky;
430
+ top: 0;
431
+ z-index: 100;
432
+ }}
433
+ .header h1 {{
434
+ margin: 0 0 15px 0;
435
+ font-size: 18px;
436
+ }}
437
+ .meta {{
438
+ display: flex;
439
+ flex-wrap: wrap;
440
+ gap: 20px;
441
+ font-size: 12px;
442
+ color: #c8c8c8;
443
+ }}
444
+ .legend {{
445
+ display: flex;
446
+ gap: 15px;
447
+ margin-top: 10px;
448
+ }}
449
+ .legend-item {{
450
+ display: flex;
451
+ align-items: center;
452
+ gap: 5px;
453
+ }}
454
+ .legend-box {{
455
+ width: 20px;
456
+ height: 12px;
457
+ border: 1px solid #666;
458
+ }}
459
+ .content {{
460
+ background-color: white;
461
+ margin: 10px;
462
+ padding: 15px;
463
+ border: 1px solid #ccc;
464
+ font-size: 14px;
465
+ line-height: 1.8;
466
+ word-wrap: break-word;
467
+ position: relative;
468
+ }}
469
+ .content span {{
470
+ padding: 1px 0;
471
+ }}
472
+ .word {{
473
+ cursor: pointer;
474
+ position: relative;
475
+ }}
476
+ .word:hover {{
477
+ outline: 2px solid #007bff;
478
+ outline-offset: 1px;
479
+ }}
480
+ .word.highlighted {{
481
+ outline: 2px solid #ff6b6b;
482
+ outline-offset: 1px;
483
+ }}
484
+ #svg-overlay {{
485
+ position: fixed;
486
+ top: 0;
487
+ left: 0;
488
+ width: 100%;
489
+ height: 100%;
490
+ pointer-events: none;
491
+ z-index: 1000;
492
+ }}
493
+ .link-line {{
494
+ stroke: #007bff;
495
+ stroke-width: 2;
496
+ fill: none;
497
+ opacity: 0.7;
498
+ }}
499
+ .link-dot {{
500
+ fill: #007bff;
501
+ opacity: 0.8;
502
+ }}
503
+ .token {{
504
+ position: relative;
505
+ cursor: help;
506
+ }}
507
+ .token:hover {{
508
+ outline: 1px dashed #666;
509
+ }}
510
+ #tooltip {{
511
+ position: fixed;
512
+ background-color: rgba(0, 0, 0, 0.9);
513
+ color: white;
514
+ padding: 10px 14px;
515
+ border-radius: 6px;
516
+ font-size: 12px;
517
+ max-width: 500px;
518
+ z-index: 2000;
519
+ pointer-events: none;
520
+ display: none;
521
+ line-height: 1.6;
522
+ box-shadow: 0 2px 10px rgba(0,0,0,0.3);
523
+ }}
524
+ #tooltip .label {{
525
+ color: #aaa;
526
+ font-weight: bold;
527
+ }}
528
+ #tooltip .bytes {{
529
+ color: #a5f3fc;
530
+ font-family: monospace;
531
+ }}
532
+ #tooltip .loss-a {{
533
+ color: #86efac;
534
+ font-family: monospace;
535
+ }}
536
+ #tooltip .loss-b {{
537
+ color: #fca5a5;
538
+ font-family: monospace;
539
+ }}
540
+ #tooltip .qwen {{
541
+ color: #7dd3fc;
542
+ }}
543
+ #tooltip .rwkv {{
544
+ color: #fcd34d;
545
+ }}
546
+ #tooltip .topk-section {{
547
+ margin-top: 8px;
548
+ padding-top: 8px;
549
+ border-top: 1px solid #555;
550
+ }}
551
+ #tooltip .topk-container {{
552
+ display: flex;
553
+ gap: 16px;
554
+ }}
555
+ #tooltip .topk-column {{
556
+ flex: 1;
557
+ min-width: 180px;
558
+ }}
559
+ #tooltip .topk-title {{
560
+ color: #aaa;
561
+ font-weight: bold;
562
+ margin-bottom: 4px;
563
+ font-size: 11px;
564
+ }}
565
+ #tooltip .topk-title.model-a {{
566
+ color: #86efac;
567
+ }}
568
+ #tooltip .topk-title.model-b {{
569
+ color: #fca5a5;
570
+ }}
571
+ #tooltip .topk-list {{
572
+ font-size: 11px;
573
+ }}
574
+ #tooltip .topk-item {{
575
+ display: flex;
576
+ gap: 4px;
577
+ padding: 1px 0;
578
+ align-items: center;
579
+ }}
580
+ #tooltip .topk-rank {{
581
+ color: #888;
582
+ min-width: 18px;
583
+ }}
584
+ #tooltip .topk-rank.hit {{
585
+ color: #ffd700;
586
+ }}
587
+ #tooltip .topk-token {{
588
+ color: #a5f3fc;
589
+ max-width: 100px;
590
+ overflow: hidden;
591
+ text-overflow: ellipsis;
592
+ white-space: nowrap;
593
+ font-family: monospace;
594
+ }}
595
+ #tooltip .topk-prob {{
596
+ color: #86efac;
597
+ min-width: 45px;
598
+ text-align: right;
599
+ }}
600
+ #tooltip .topk-hit {{
601
+ color: #22c55e;
602
+ }}
603
+ #tooltip .topk-miss {{
604
+ color: #ef4444;
605
+ font-style: italic;
606
+ }}
607
+ </style>
608
+ </head>
609
+ <body>
+ <svg id="svg-overlay"></svg>
+ <div id="tooltip"></div>
+ <div class="header">
+     <h1>UncheatableEval - Byte-wise Loss Comparison</h1>
+     <div class="meta">
+         <div>Model A: {model_a_name}</div>
+         <div>Model B: {model_b_name}</div>
+         <div>Compression A: {avg_compression_a:.2f}%</div>
+         <div>Compression B: {avg_compression_b:.2f}%</div>
+         <div style="color: {delta_color}">Avg Delta: {avg_delta_compression:+.2f}%</div>
+     </div>
+     <div class="legend">
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(77, 255, 77)"></div>
+             <span>Model A better</span>
+         </div>
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(255, 255, 255)"></div>
+             <span>= Avg delta</span>
+         </div>
+         <div class="legend-item">
+             <div class="legend-box" style="background-color: rgb(255, 77, 77)"></div>
+             <span>Model B better</span>
+         </div>
+         <div class="legend-item" style="margin-left: 20px;">
+             <span style="color: #aaa;">Saturation:</span>
+             <input type="range" id="saturation-slider" min="500" max="1000" value="1000" step="1" style="width: 200px; vertical-align: middle;">
+             <span id="saturation-value" style="color: #fff; min-width: 45px; display: inline-block;">100.0%</span>
+         </div>
+     </div>
+ </div>
+ <div class="content">
+ {''.join(html_content)}
+ </div>
+ <script>
+ const svgOverlay = document.getElementById('svg-overlay');
+ const words = document.querySelectorAll('.word');
+
+ const wordGroups = {{}};
+ words.forEach(word => {{
+     const wordText = word.getAttribute('data-word');
+     if (!wordGroups[wordText]) {{
+         wordGroups[wordText] = [];
+     }}
+     wordGroups[wordText].push(word);
+ }});
+
+ function clearLines() {{
+     svgOverlay.innerHTML = '';
+     words.forEach(w => w.classList.remove('highlighted'));
+ }}
+
+ function drawLines(hoveredWord) {{
+     clearLines();
+
+     const wordText = hoveredWord.getAttribute('data-word');
+     const wordId = parseInt(hoveredWord.getAttribute('data-word-id'));
+     const sameWords = wordGroups[wordText] || [];
+
+     const previousWords = sameWords.filter(w => {{
+         const id = parseInt(w.getAttribute('data-word-id'));
+         return id < wordId;
+     }});
+
+     if (previousWords.length === 0) return;
+
+     sameWords.forEach(w => w.classList.add('highlighted'));
+
+     const hoveredRect = hoveredWord.getBoundingClientRect();
+     const hoveredX = hoveredRect.left + hoveredRect.width / 2;
+     const hoveredY = hoveredRect.top + hoveredRect.height / 2;
+
+     previousWords.forEach(prevWord => {{
+         const prevRect = prevWord.getBoundingClientRect();
+         const prevX = prevRect.left + prevRect.width / 2;
+         const prevY = prevRect.top + prevRect.height / 2;
+
+         const midX = (hoveredX + prevX) / 2;
+         const midY = Math.min(hoveredY, prevY) - 30;
+
+         const path = document.createElementNS('http://www.w3.org/2000/svg', 'path');
+         path.setAttribute('class', 'link-line');
+         path.setAttribute('d', `M ${{prevX}} ${{prevY}} Q ${{midX}} ${{midY}} ${{hoveredX}} ${{hoveredY}}`);
+         svgOverlay.appendChild(path);
+
+         const dot1 = document.createElementNS('http://www.w3.org/2000/svg', 'circle');
+         dot1.setAttribute('class', 'link-dot');
+         dot1.setAttribute('cx', prevX);
+         dot1.setAttribute('cy', prevY);
+         dot1.setAttribute('r', 4);
+         svgOverlay.appendChild(dot1);
+
+         const dot2 = document.createElementNS('http://www.w3.org/2000/svg', 'circle');
+         dot2.setAttribute('class', 'link-dot');
+         dot2.setAttribute('cx', hoveredX);
+         dot2.setAttribute('cy', hoveredY);
+         dot2.setAttribute('r', 4);
+         svgOverlay.appendChild(dot2);
+     }});
+ }}
+
+ words.forEach(word => {{
+     word.addEventListener('mouseenter', () => drawLines(word));
+     word.addEventListener('mouseleave', clearLines);
+ }});
+
+ window.addEventListener('scroll', clearLines);
+
+ const tooltip = document.getElementById('tooltip');
+ const tokenSpans = document.querySelectorAll('.token');
+
+ tokenSpans.forEach(token => {{
+     token.addEventListener('mouseenter', (e) => {{
+         const qwen = token.getAttribute('data-qwen') || 'N/A';
+         const rwkv = token.getAttribute('data-rwkv') || 'N/A';
+         const bytes = token.getAttribute('data-bytes') || '';
+         const compressionA = token.getAttribute('data-compression-a') || '';
+         const compressionB = token.getAttribute('data-compression-b') || '';
+         const top5A = token.getAttribute('data-topk-a') || '';
+         const top5B = token.getAttribute('data-topk-b') || '';
+
+         function formatTopkColumn(topkJson, modelName, titleClass) {{
+             if (!topkJson) return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">N/A</div></div>';
+             try {{
+                 const data = JSON.parse(topkJson);
+                 const [actualId, rank, topkList] = data;
+                 let html = '<div class="topk-column">';
+                 html += '<div class="topk-title ' + titleClass + '">' + modelName + '</div>';
+                 html += '<div class="topk-list">';
+                 topkList.forEach((item, idx) => {{
+                     const [tokenId, prob, tokenText] = item;
+                     const isHit = tokenId === actualId;
+                     const rankClass = isHit ? 'topk-rank hit' : 'topk-rank';
+                     const displayText = tokenText || '[' + tokenId + ']';
+                     const escapedText = displayText.replace(/</g, '&lt;').replace(/>/g, '&gt;');
+                     html += '<div class="topk-item">';
+                     html += '<span class="' + rankClass + '">' + (idx + 1) + '.</span>';
+                     html += '<span class="topk-token" title="ID: ' + tokenId + '">' + escapedText + '</span>';
+                     html += '<span class="topk-prob">' + (prob * 100).toFixed(1) + '%</span>';
+                     if (isHit) html += '<span class="topk-hit">✓</span>';
+                     html += '</div>';
+                 }});
+                 if (rank > 10) {{
+                     html += '<div class="topk-item topk-miss">Actual rank: ' + rank + '</div>';
+                 }}
+                 html += '</div></div>';
+                 return html;
+             }} catch (e) {{
+                 return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">Error</div></div>';
+             }}
+         }}
+
+         let tooltipHtml = `
+             <div><span class="label">Bytes:</span> <span class="bytes">${{bytes || '(empty)'}}</span></div>
+             <div><span class="label">Compression A:</span> <span class="loss-a">${{compressionA || '(empty)'}}</span></div>
+             <div><span class="label">Compression B:</span> <span class="loss-b">${{compressionB || '(empty)'}}</span></div>
+             <hr style="border-color: #555; margin: 6px 0;">
+             <div><span class="label">Qwen:</span> <span class="qwen">${{qwen || '(empty)'}}</span></div>
+             <div><span class="label">RWKV:</span> <span class="rwkv">${{rwkv || '(empty)'}}</span></div>
+         `;
+         if (top5A || top5B) {{
+             tooltipHtml += '<div class="topk-section"><div class="topk-container">';
+             tooltipHtml += formatTopkColumn(top5A, 'Model A Top10', 'model-a');
+             tooltipHtml += formatTopkColumn(top5B, 'Model B Top10', 'model-b');
+             tooltipHtml += '</div></div>';
+         }}
+         tooltip.innerHTML = tooltipHtml;
+         tooltip.style.display = 'block';
+     }});
+
+     token.addEventListener('mousemove', (e) => {{
+         const tooltipRect = tooltip.getBoundingClientRect();
+         const viewportWidth = window.innerWidth;
+         const viewportHeight = window.innerHeight;
+
+         let x = e.clientX + 15;
+         let y = e.clientY + 15;
+
+         if (x + tooltipRect.width > viewportWidth - 10) {{
+             x = e.clientX - tooltipRect.width - 15;
+         }}
+         if (y + tooltipRect.height > viewportHeight - 10) {{
+             y = e.clientY - tooltipRect.height - 15;
+         }}
+         if (x < 10) x = 10;
+         if (y < 10) y = 10;
+
+         tooltip.style.left = x + 'px';
+         tooltip.style.top = y + 'px';
+     }});
+
+     token.addEventListener('mouseleave', () => {{
+         tooltip.style.display = 'none';
+     }});
+ }});
+
+ const avgDelta = {avg_delta_compression};
+ const slider = document.getElementById('saturation-slider');
+ const saturationValue = document.getElementById('saturation-value');
+
+ const allDeltas = [];
+ tokenSpans.forEach(token => {{
+     const delta = parseFloat(token.getAttribute('data-delta'));
+     if (!isNaN(delta)) allDeltas.push(delta);
+ }});
+
+ function percentile(arr, p) {{
+     const sorted = [...arr].sort((a, b) => a - b);
+     const idx = (p / 100) * (sorted.length - 1);
+     const lower = Math.floor(idx);
+     const upper = Math.ceil(idx);
+     if (lower === upper) return sorted[lower];
+     return sorted[lower] + (sorted[upper] - sorted[lower]) * (idx - lower);
+ }}
+
+ function deltaToColor(delta, avgDelta, maxDeviation) {{
+     if (maxDeviation === 0) return 'rgb(255, 255, 255)';
+     const deviation = delta - avgDelta;
+     let normalized = Math.max(-1, Math.min(1, deviation / maxDeviation));
+     let r, g, b;
+     if (normalized < 0) {{
+         const intensity = -normalized;
+         r = Math.round(255 * (1 - intensity * 0.7));
+         g = 255;
+         b = Math.round(255 * (1 - intensity * 0.7));
+     }} else {{
+         const intensity = normalized;
+         r = 255;
+         g = Math.round(255 * (1 - intensity * 0.7));
+         b = Math.round(255 * (1 - intensity * 0.7));
+     }}
+     return `rgb(${{r}}, ${{g}}, ${{b}})`;
+ }}
+
+ function updateColors(percentileValue) {{
+     const deviations = allDeltas.map(d => Math.abs(d - avgDelta));
+     const maxDeviation = Math.max(percentile(deviations, percentileValue), 1e-6);
+     tokenSpans.forEach(token => {{
+         const delta = parseFloat(token.getAttribute('data-delta'));
849
+ if (!isNaN(delta)) {{
850
+ token.style.backgroundColor = deltaToColor(delta, avgDelta, maxDeviation);
851
+ }}
852
+ }});
853
+ }}
854
+
855
+ slider.addEventListener('input', (e) => {{
856
+ const val = parseInt(e.target.value) / 10;
857
+ saturationValue.textContent = val.toFixed(1) + '%';
858
+ updateColors(val);
859
+ }});
860
+ </script>
861
+ </body>
862
+ </html>
863
+ """
864
+
865
+ return html
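
The percentile-clipped coloring driven by the saturation slider can also be checked outside the template. Below is a minimal Python sketch mirroring the JavaScript `percentile` and `deltaToColor` logic above; the names `percentile` and `delta_to_rgb` are illustrative, not part of any exported API:

```python
def percentile(values, p):
    """Linear-interpolated percentile, matching the JS implementation."""
    s = sorted(values)
    idx = (p / 100) * (len(s) - 1)
    lo = int(idx)
    hi = min(lo + 1, len(s) - 1)
    if lo == hi:
        return s[lo]
    return s[lo] + (s[hi] - s[lo]) * (idx - lo)


def delta_to_rgb(delta, avg_delta, max_deviation):
    """White at the average delta; greener below it, redder above it."""
    if max_deviation == 0:
        return (255, 255, 255)
    normalized = max(-1.0, min(1.0, (delta - avg_delta) / max_deviation))
    fade = round(255 * (1 - abs(normalized) * 0.7))
    if normalized < 0:
        return (fade, 255, fade)  # toward green
    return (255, fade, fade)      # toward red


# Clip the color scale at the 95th percentile of absolute deviations,
# so a handful of outlier tokens do not flatten everything else to white.
deltas = [-0.4, -0.1, 0.0, 0.2, 0.9]
avg = sum(deltas) / len(deltas)
deviations = [abs(d - avg) for d in deltas]
max_dev = max(percentile(deviations, 95), 1e-6)
colors = [delta_to_rgb(d, avg, max_dev) for d in deltas]
```

The percentile clip is the whole point of the slider: lowering it saturates more tokens, raising it reserves strong colors for the largest deltas.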