Jellyfish042 committed on
Commit
44c2c6d
·
1 Parent(s): 257183f

bug fix and improvements

Browse files
app.py CHANGED
@@ -12,7 +12,8 @@ import gradio as gr
12
  import torch
13
 
14
  # Detect device
15
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
16
  IS_CPU = DEVICE == "cpu"
17
 
18
  # Model configuration
 
12
  import torch
13
 
14
  # Detect device
15
+ # DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
16
+ DEVICE = "cpu"
17
  IS_CPU = DEVICE == "cpu"
18
 
19
  # Model configuration
core/evaluator.py CHANGED
@@ -47,7 +47,7 @@ def extract_topk_predictions(logit: torch.Tensor, target_ids: torch.Tensor, k: i
47
  k: number of top predictions to extract (default: 10)
48
 
49
  Returns:
50
- list: [[actual_id, rank, [[id1, prob1], [id2, prob2], ...]], ...]
51
  """
52
  probs = F.softmax(logit, dim=-1)
53
  top_probs, top_ids = torch.topk(probs, k, dim=-1)
@@ -59,7 +59,7 @@ def extract_topk_predictions(logit: torch.Tensor, target_ids: torch.Tensor, k: i
59
  rank = (probs[pos] > actual_prob).sum().item() + 1
60
 
61
  topk_list = [[top_ids[pos, i].item(), round(top_probs[pos, i].item(), 6)] for i in range(k)]
62
- results.append([target_id, rank, topk_list])
63
 
64
  return results
65
 
 
47
  k: number of top predictions to extract (default: 10)
48
 
49
  Returns:
50
+ list: [[actual_id, rank, actual_prob, [[id1, prob1], [id2, prob2], ...]], ...]
51
  """
52
  probs = F.softmax(logit, dim=-1)
53
  top_probs, top_ids = torch.topk(probs, k, dim=-1)
 
59
  rank = (probs[pos] > actual_prob).sum().item() + 1
60
 
61
  topk_list = [[top_ids[pos, i].item(), round(top_probs[pos, i].item(), 6)] for i in range(k)]
62
+ results.append([target_id, rank, actual_prob, topk_list])
63
 
64
  return results
65
 
precompute_example.py CHANGED
@@ -26,7 +26,8 @@ QWEN_MODEL_ID = "Qwen/Qwen3-1.7B-Base"
26
  RWKV_MODEL_FILENAME = "rwkv7-g1c-1.5b-20260110-ctx8192.pth"
27
 
28
  # Detect device
29
- DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
 
30
  IS_CPU = DEVICE == "cpu"
31
 
32
 
 
26
  RWKV_MODEL_FILENAME = "rwkv7-g1c-1.5b-20260110-ctx8192.pth"
27
 
28
  # Detect device
29
+ # DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
30
+ DEVICE = "cpu"
31
  IS_CPU = DEVICE == "cpu"
32
 
33
 
precomputed/example_metadata.json CHANGED
@@ -1,7 +1,7 @@
1
  {
2
  "example_text": "The Bitter Lesson\nRich Sutton\nMarch 13, 2019\nThe biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.\n\nIn computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. 
They said that ``brute force\" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.\n\nA similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.\n\nIn speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. 
The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researcher's time, when, through Moore's law, massive computation became available and a means was found to put it to good use.\n\nIn computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.\n\nThis is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. 
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.\n\nOne thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.\n\nThe second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.\n",
3
- "qwen_inference_time": 21.767822980880737,
4
- "rwkv_inference_time": 33.561607122421265,
5
  "qwen_compression_rate": 48.14428559434192,
6
- "rwkv_compression_rate": 47.624574152536056
7
  }
 
1
  {
2
  "example_text": "The Bitter Lesson\nRich Sutton\nMarch 13, 2019\nThe biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.\n\nIn computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. 
They said that ``brute force\" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.\n\nA similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.\n\nIn speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. 
The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researcher's time, when, through Moore's law, massive computation became available and a means was found to put it to good use.\n\nIn computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.\n\nThis is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. 
The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.\n\nOne thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.\n\nThe second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.\n",
3
+ "qwen_inference_time": 20.516680479049683,
4
+ "rwkv_inference_time": 31.14354944229126,
5
  "qwen_compression_rate": 48.14428559434192,
6
+ "rwkv_compression_rate": 47.62502588510778
7
  }
precomputed/example_visualization.html CHANGED
The diff for this file is too large to render. See raw diff
 
visualization/html_generator.py CHANGED
@@ -4,6 +4,7 @@ HTML visualization generator for UncheatableEval.
4
  Generates interactive HTML visualizations comparing byte-level losses between two models.
5
  """
6
 
 
7
  import json
8
  import math
9
  import re
@@ -20,6 +21,7 @@ COMPRESSION_RATE_FACTOR = (1.0 / math.log(2.0)) * 0.125 * 100.0
20
  # Global tokenizers (lazy loaded)
21
  _qwen_tokenizer = None
22
  _rwkv_tokenizer = None
 
23
 
24
 
25
  def get_qwen_tokenizer():
@@ -83,12 +85,9 @@ def get_token_info_for_text(text: str) -> dict:
83
  byte_pos = 0
84
  for idx, (token_id, token_bytes) in enumerate(qwen_id_and_bytes):
85
  start = byte_pos
86
- end = byte_pos + len(token_bytes)
87
- try:
88
- token_str = bytes(token_bytes).decode("utf-8")
89
- except UnicodeDecodeError:
90
- token_str = repr(bytes(token_bytes))
91
- qwen_tokens.append((start, end, token_id, token_str))
92
  byte_to_qwen[start] = idx
93
  byte_pos = end
94
 
@@ -106,18 +105,24 @@ def get_token_info_for_text(text: str) -> dict:
106
  token_bytes = rwkv_tokenizer.decodeBytes([token_id])
107
  start = byte_pos
108
  end = byte_pos + len(token_bytes)
109
- try:
110
- token_str = token_bytes.decode("utf-8")
111
- except UnicodeDecodeError:
112
- token_str = repr(token_bytes)
113
- rwkv_tokens.append((start, end, token_id, token_str))
114
  byte_to_rwkv[start] = idx
115
  byte_pos = end
116
 
117
- # Get common boundaries
118
  qwen_boundaries = set([0] + [t[1] for t in qwen_tokens])
119
  rwkv_boundaries = set([0] + [t[1] for t in rwkv_tokens])
120
- common_boundaries = sorted(qwen_boundaries & rwkv_boundaries)
 
 
 
 
 
 
 
 
 
 
121
 
122
  return {
123
  "common_boundaries": common_boundaries,
@@ -163,16 +168,58 @@ def generate_comparison_html(
163
 
164
  def decode_token(token_id: int, tokenizer, model_type: str) -> str:
165
  """Decode a single token ID to text using the appropriate tokenizer."""
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
166
  if tokenizer is None:
167
  return f"[{token_id}]"
168
  try:
169
  if model_type in ["rwkv", "rwkv7"]:
170
- # RWKV tokenizer uses decode method
171
- decoded = tokenizer.decode([token_id])
172
- return decoded if decoded else f"[{token_id}]"
 
 
 
 
 
 
173
  else:
174
- # HuggingFace tokenizer
 
 
 
 
 
 
 
 
 
 
 
 
 
 
175
  decoded = tokenizer.decode([token_id])
 
 
176
  return decoded if decoded else f"[{token_id}]"
177
  except Exception as e:
178
  print(f"Warning: Failed to decode token {token_id} ({model_type}): {e}")
@@ -250,9 +297,9 @@ def generate_comparison_html(
250
 
251
  def get_tokens_for_range(byte_start, byte_end, token_list):
252
  result = []
253
- for t_start, t_end, token_id, t_str in token_list:
254
  if t_start < byte_end and t_end > byte_start:
255
- result.append((token_id, t_str))
256
  return result
257
 
258
  # Build tokens based on common boundaries
@@ -262,15 +309,18 @@ def generate_comparison_html(
262
  start_byte = common_boundaries[i]
263
  end_byte = common_boundaries[i + 1]
264
  token_bytes = text_bytes[start_byte:end_byte]
 
265
  try:
266
  token_text = token_bytes.decode("utf-8")
267
  except UnicodeDecodeError:
268
- continue
 
 
269
 
270
  qwen_toks = get_tokens_for_range(start_byte, end_byte, qwen_tokens)
271
  rwkv_toks = get_tokens_for_range(start_byte, end_byte, rwkv_tokens)
272
 
273
- if re.search(r"\w", token_text, re.UNICODE):
274
  tokens.append(
275
  {
276
  "type": "word",
@@ -334,11 +384,31 @@ def generate_comparison_html(
334
  model_b_token_idx = find_token_for_byte(byte_start, model_b_token_ranges)
335
 
336
  # Build token info strings showing all tokens in this byte range
337
- # Model A (RWKV7) - show all tokens that overlap with this byte range
338
- model_a_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["rwkv_tokens"]])
339
-
340
- # Model B (Qwen3) - show all tokens that overlap with this byte range
341
- model_b_info = ", ".join([f"[{idx}] {repr(s)}" for idx, s in token["qwen_tokens"]])
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
342
 
343
  raw_bytes = list(text_bytes[byte_start:byte_end])
344
  losses_a = byte_losses_a[byte_start:byte_end]
@@ -359,14 +429,20 @@ def generate_comparison_html(
359
  if model_a_token_idx is not None and model_a_token_idx < len(topk_predictions_a):
360
  pred = topk_predictions_a[model_a_token_idx]
361
  try:
362
- decoded_pred = [
363
- pred[0],
364
- pred[1],
365
- [[tid, prob, decode_token(tid, tokenizer_a, model_type_a)] for tid, prob in pred[2]],
366
- ]
367
- # Use base64 encoding to avoid escaping issues
368
- import base64
369
-
 
 
 
 
 
 
370
  topk_a_json = base64.b64encode(json.dumps(decoded_pred, ensure_ascii=False).encode("utf-8")).decode("ascii")
371
  except Exception as e:
372
  pass
@@ -375,10 +451,16 @@ def generate_comparison_html(
375
  if model_b_token_idx is not None and model_b_token_idx < len(topk_predictions_b):
376
  pred = topk_predictions_b[model_b_token_idx]
377
  try:
378
- decoded_pred = [pred[0], pred[1], [[tid, prob, decode_token(tid, tokenizer_b, model_type_b)] for tid, prob in pred[2]]]
379
- # Use base64 encoding to avoid escaping issues
380
- import base64
381
-
 
 
 
 
 
 
382
  topk_b_json = base64.b64encode(json.dumps(decoded_pred, ensure_ascii=False).encode("utf-8")).decode("ascii")
383
  except Exception as e:
384
  pass
@@ -607,7 +689,31 @@ def generate_comparison_html(
607
  display: flex;
608
  gap: 4px;
609
  padding: 1px 0;
 
 
 
 
 
610
  align-items: center;
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
611
  }}
612
  #tooltip .topk-rank {{
613
  color: #888;
@@ -618,11 +724,15 @@ def generate_comparison_html(
618
  }}
619
  #tooltip .topk-token {{
620
  color: #a5f3fc;
621
- max-width: 100px;
622
- overflow: hidden;
623
- text-overflow: ellipsis;
624
- white-space: nowrap;
625
  font-family: monospace;
 
 
 
 
 
626
  }}
627
  #tooltip .topk-prob {{
628
  color: #86efac;
@@ -751,8 +861,8 @@ def generate_comparison_html(
751
 
752
  tokenSpans.forEach(token => {{
753
  token.addEventListener('mouseenter', (e) => {{
754
- const modelA = token.getAttribute('data-model-a') || 'N/A';
755
- const modelB = token.getAttribute('data-model-b') || 'N/A';
756
  const bytes = token.getAttribute('data-bytes') || '';
757
  const compressionA = token.getAttribute('data-compression-a') || '';
758
  const compressionB = token.getAttribute('data-compression-b') || '';
@@ -761,18 +871,52 @@ def generate_comparison_html(
761
  const top5A = token.getAttribute('data-topk-a') || '';
762
  const top5B = token.getAttribute('data-topk-b') || '';
763
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
764
  function formatTopkColumn(topkBase64, modelName, titleClass) {{
765
  if (!topkBase64) return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">N/A</div></div>';
766
  try {{
767
- // Decode base64 to UTF-8 string (atob() doesn't support UTF-8, need proper decoding)
768
- const binaryString = atob(topkBase64);
769
- const bytes = new Uint8Array(binaryString.length);
770
- for (let i = 0; i < binaryString.length; i++) {{
771
- bytes[i] = binaryString.charCodeAt(i);
 
 
 
 
772
  }}
773
- const topkJson = new TextDecoder('utf-8').decode(bytes);
774
- const data = JSON.parse(topkJson);
775
- const [actualId, rank, topkList] = data;
776
  let html = '<div class="topk-column">';
777
  html += '<div class="topk-title ' + titleClass + '">' + modelName + '</div>';
778
  html += '<div class="topk-list">';
@@ -780,8 +924,13 @@ def generate_comparison_html(
780
  const [tokenId, prob, tokenText] = item;
781
  const isHit = tokenId === actualId;
782
  const rankClass = isHit ? 'topk-rank hit' : 'topk-rank';
783
- const displayText = tokenText || '[' + tokenId + ']';
784
- const escapedText = displayText.replace(/</g, '&lt;').replace(/>/g, '&gt;');
 
 
 
 
 
785
  html += '<div class="topk-item">';
786
  html += '<span class="' + rankClass + '">' + (idx + 1) + '.</span>';
787
  html += '<span class="topk-token" title="ID: ' + tokenId + '">' + escapedText + '</span>';
@@ -790,7 +939,12 @@ def generate_comparison_html(
790
  html += '</div>';
791
  }});
792
  if (rank > 10) {{
793
- html += '<div class="topk-item topk-miss">Actual rank: ' + rank + '</div>';
 
 
 
 
 
794
  }}
795
  html += '</div></div>';
796
  return html;
@@ -801,13 +955,45 @@ def generate_comparison_html(
801
  }}
802
  }}
803
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
804
  let tooltipHtml = `
805
  <div><span class="label">Bytes:</span> <span class="bytes">${{bytes || '(empty)'}}</span></div>
806
  <div><span class="label">RWKV Compression Rate:</span> <span class="loss-a">${{compressionA || '(empty)'}}${{avgCompressionA ? ' (avg: ' + avgCompressionA + '%)' : ''}}</span></div>
807
  <div><span class="label">Qwen Compression Rate:</span> <span class="loss-b">${{compressionB || '(empty)'}}${{avgCompressionB ? ' (avg: ' + avgCompressionB + '%)' : ''}}</span></div>
808
  <hr style="border-color: #555; margin: 6px 0;">
809
- <div><span class="label">RWKV:</span> <span class="model-a">${{modelA || '(empty)'}}</span></div>
810
- <div><span class="label">Qwen:</span> <span class="model-b">${{modelB || '(empty)'}}</span></div>
811
  `;
812
  if (top5A || top5B) {{
813
  tooltipHtml += '<div class="topk-section"><div class="topk-container">';
 
4
  Generates interactive HTML visualizations comparing byte-level losses between two models.
5
  """
6
 
7
+ import base64
8
  import json
9
  import math
10
  import re
 
21
  # Global tokenizers (lazy loaded)
22
  _qwen_tokenizer = None
23
  _rwkv_tokenizer = None
24
+ _token_bytes_converter_cache = {}
25
 
26
 
27
  def get_qwen_tokenizer():
 
85
  byte_pos = 0
86
  for idx, (token_id, token_bytes) in enumerate(qwen_id_and_bytes):
87
  start = byte_pos
88
+ token_bytes_blob = bytes(token_bytes)
89
+ end = byte_pos + len(token_bytes_blob)
90
+ qwen_tokens.append((start, end, token_id, token_bytes_blob))
 
 
 
91
  byte_to_qwen[start] = idx
92
  byte_pos = end
93
 
 
105
  token_bytes = rwkv_tokenizer.decodeBytes([token_id])
106
  start = byte_pos
107
  end = byte_pos + len(token_bytes)
108
+ rwkv_tokens.append((start, end, token_id, token_bytes))
 
 
 
 
109
  byte_to_rwkv[start] = idx
110
  byte_pos = end
111
 
112
+ # Get common boundaries, but keep only UTF-8 codepoint boundaries
113
  qwen_boundaries = set([0] + [t[1] for t in qwen_tokens])
114
  rwkv_boundaries = set([0] + [t[1] for t in rwkv_tokens])
115
+ utf8_boundaries = set([0])
116
+ byte_pos = 0
117
+ for ch in text:
118
+ byte_pos += len(ch.encode("utf-8"))
119
+ utf8_boundaries.add(byte_pos)
120
+ common_boundaries = sorted(qwen_boundaries & rwkv_boundaries & utf8_boundaries)
121
+ # Ensure we always include the end boundary
122
+ text_end = len(text.encode("utf-8"))
123
+ if text_end not in common_boundaries:
124
+ common_boundaries.append(text_end)
125
+ common_boundaries = sorted(common_boundaries)
126
 
127
  return {
128
  "common_boundaries": common_boundaries,
 
168
 
169
  def decode_token(token_id: int, tokenizer, model_type: str) -> str:
170
  """Decode a single token ID to text using the appropriate tokenizer."""
171
+ def bytes_to_hex_str(byte_values) -> str:
172
+ return "".join([f"\\x{b:02x}" for b in byte_values])
173
+
174
+ def get_bytes_converter(tokenizer):
175
+ if tokenizer is None:
176
+ return None
177
+ key = getattr(tokenizer, "name_or_path", None)
178
+ if not key:
179
+ key = str(id(tokenizer))
180
+ if key not in _token_bytes_converter_cache:
181
+ try:
182
+ _token_bytes_converter_cache[key] = TokenizerBytesConverter(
183
+ model_name_or_path=getattr(tokenizer, "name_or_path", None),
184
+ tokenizer=tokenizer,
185
+ trust_remote_code=True,
186
+ )
187
+ except Exception:
188
+ _token_bytes_converter_cache[key] = None
189
+ return _token_bytes_converter_cache.get(key)
190
+
191
  if tokenizer is None:
192
  return f"[{token_id}]"
193
  try:
194
  if model_type in ["rwkv", "rwkv7"]:
195
+ # RWKV tokenizer provides raw bytes
196
+ token_bytes = tokenizer.decodeBytes([token_id])
197
+ if token_bytes:
198
+ try:
199
+ decoded = token_bytes.decode("utf-8")
200
+ return decoded if decoded else f"[{token_id}]"
201
+ except UnicodeDecodeError:
202
+ return bytes_to_hex_str(token_bytes)
203
+ return f"[{token_id}]"
204
  else:
205
+ # HuggingFace tokenizer: prefer raw bytes when possible
206
+ converter = get_bytes_converter(tokenizer)
207
+ token_bytes = None
208
+ if converter is not None:
209
+ try:
210
+ token_bytes = converter.token_to_bytes(token_id)
211
+ except Exception:
212
+ token_bytes = None
213
+ if token_bytes:
214
+ try:
215
+ decoded = bytes(token_bytes).decode("utf-8")
216
+ return decoded if decoded else f"[{token_id}]"
217
+ except UnicodeDecodeError:
218
+ return bytes_to_hex_str(token_bytes)
219
+
220
  decoded = tokenizer.decode([token_id])
221
+ if decoded and "�" not in decoded:
222
+ return decoded
223
  return decoded if decoded else f"[{token_id}]"
224
  except Exception as e:
225
  print(f"Warning: Failed to decode token {token_id} ({model_type}): {e}")
 
297
 
298
  def get_tokens_for_range(byte_start, byte_end, token_list):
299
  result = []
300
+ for t_start, t_end, token_id, t_bytes in token_list:
301
  if t_start < byte_end and t_end > byte_start:
302
+ result.append((token_id, t_bytes))
303
  return result
304
 
305
  # Build tokens based on common boundaries
 
309
  start_byte = common_boundaries[i]
310
  end_byte = common_boundaries[i + 1]
311
  token_bytes = text_bytes[start_byte:end_byte]
312
+ decoded_ok = True
313
  try:
314
  token_text = token_bytes.decode("utf-8")
315
  except UnicodeDecodeError:
316
+ # Show raw bytes when UTF-8 decoding fails
317
+ token_text = "".join([f"\\x{b:02x}" for b in token_bytes])
318
+ decoded_ok = False
319
 
320
  qwen_toks = get_tokens_for_range(start_byte, end_byte, qwen_tokens)
321
  rwkv_toks = get_tokens_for_range(start_byte, end_byte, rwkv_tokens)
322
 
323
+ if decoded_ok and re.search(r"\w", token_text, re.UNICODE):
324
  tokens.append(
325
  {
326
  "type": "word",
 
384
  model_b_token_idx = find_token_for_byte(byte_start, model_b_token_ranges)
385
 
386
  # Build token info strings showing all tokens in this byte range
387
+ def token_bytes_to_display_text(token_bytes: bytes) -> str:
388
+ if token_bytes is None:
389
+ return ""
390
+ if isinstance(token_bytes, list):
391
+ token_bytes = bytes(token_bytes)
392
+ if isinstance(token_bytes, str):
393
+ return token_bytes
394
+ if len(token_bytes) == 0:
395
+ return ""
396
+ try:
397
+ return token_bytes.decode("utf-8")
398
+ except UnicodeDecodeError:
399
+ return "".join([f"\\x{b:02x}" for b in token_bytes])
400
+
401
+ # Model A (RWKV7) - tokens overlapping this byte range
402
+ model_a_info = ""
403
+ if token["rwkv_tokens"]:
404
+ model_a_list = [[tid, token_bytes_to_display_text(tb)] for tid, tb in token["rwkv_tokens"]]
405
+ model_a_info = base64.b64encode(json.dumps(model_a_list, ensure_ascii=False).encode("utf-8")).decode("ascii")
406
+
407
+ # Model B (Qwen3) - tokens overlapping this byte range
408
+ model_b_info = ""
409
+ if token["qwen_tokens"]:
410
+ model_b_list = [[tid, token_bytes_to_display_text(tb)] for tid, tb in token["qwen_tokens"]]
411
+ model_b_info = base64.b64encode(json.dumps(model_b_list, ensure_ascii=False).encode("utf-8")).decode("ascii")
412
 
413
  raw_bytes = list(text_bytes[byte_start:byte_end])
414
  losses_a = byte_losses_a[byte_start:byte_end]
 
429
  if model_a_token_idx is not None and model_a_token_idx < len(topk_predictions_a):
430
  pred = topk_predictions_a[model_a_token_idx]
431
  try:
432
+ if len(pred) >= 4:
433
+ actual_id, rank, actual_prob, topk_list = pred[0], pred[1], pred[2], pred[3]
434
+ decoded_pred = [
435
+ actual_id,
436
+ rank,
437
+ actual_prob,
438
+ [[tid, prob, decode_token(tid, tokenizer_a, model_type_a)] for tid, prob in topk_list],
439
+ ]
440
+ else:
441
+ decoded_pred = [
442
+ pred[0],
443
+ pred[1],
444
+ [[tid, prob, decode_token(tid, tokenizer_a, model_type_a)] for tid, prob in pred[2]],
445
+ ]
446
  topk_a_json = base64.b64encode(json.dumps(decoded_pred, ensure_ascii=False).encode("utf-8")).decode("ascii")
447
  except Exception as e:
448
  pass
 
451
  if model_b_token_idx is not None and model_b_token_idx < len(topk_predictions_b):
452
  pred = topk_predictions_b[model_b_token_idx]
453
  try:
454
+ if len(pred) >= 4:
455
+ actual_id, rank, actual_prob, topk_list = pred[0], pred[1], pred[2], pred[3]
456
+ decoded_pred = [
457
+ actual_id,
458
+ rank,
459
+ actual_prob,
460
+ [[tid, prob, decode_token(tid, tokenizer_b, model_type_b)] for tid, prob in topk_list],
461
+ ]
462
+ else:
463
+ decoded_pred = [pred[0], pred[1], [[tid, prob, decode_token(tid, tokenizer_b, model_type_b)] for tid, prob in pred[2]]]
464
  topk_b_json = base64.b64encode(json.dumps(decoded_pred, ensure_ascii=False).encode("utf-8")).decode("ascii")
465
  except Exception as e:
466
  pass
 
689
  display: flex;
690
  gap: 4px;
691
  padding: 1px 0;
692
+ align-items: flex-start;
693
+ }}
694
+ #tooltip .token-block {{
695
+ margin-top: 6px;
696
+ display: flex;
697
  align-items: center;
698
+ gap: 6px;
699
+ white-space: nowrap;
700
+ }}
701
+ #tooltip .token-chips {{
702
+ display: flex;
703
+ flex-wrap: nowrap;
704
+ gap: 4px;
705
+ }}
706
+ #tooltip .token-chip-group {{
707
+ display: inline-flex;
708
+ align-items: center;
709
+ gap: 4px;
710
+ }}
711
+ #tooltip .token-id {{
712
+ color: #888;
713
+ font-family: monospace;
714
+ }}
715
+ #tooltip .token-chip {{
716
+ max-width: 100%;
717
  }}
718
  #tooltip .topk-rank {{
719
  color: #888;
 
724
  }}
725
  #tooltip .topk-token {{
726
  color: #a5f3fc;
727
+ white-space: pre-wrap;
728
+ overflow-wrap: anywhere;
729
+ word-break: break-word;
 
730
  font-family: monospace;
731
+ background-color: rgba(255, 255, 255, 0.08);
732
+ padding: 0 4px;
733
+ border-radius: 3px;
734
+ display: inline-block;
735
+ max-width: 100%;
736
  }}
737
  #tooltip .topk-prob {{
738
  color: #86efac;
 
861
 
862
  tokenSpans.forEach(token => {{
863
  token.addEventListener('mouseenter', (e) => {{
864
+ const modelA = token.getAttribute('data-model-a') || '';
865
+ const modelB = token.getAttribute('data-model-b') || '';
866
  const bytes = token.getAttribute('data-bytes') || '';
867
  const compressionA = token.getAttribute('data-compression-a') || '';
868
  const compressionB = token.getAttribute('data-compression-b') || '';
 
871
  const top5A = token.getAttribute('data-topk-a') || '';
872
  const top5B = token.getAttribute('data-topk-b') || '';
873
 
874
+ function decodeBase64Json(base64Str) {{
875
+ const binaryString = atob(base64Str);
876
+ const bytes = new Uint8Array(binaryString.length);
877
+ for (let i = 0; i < binaryString.length; i++) {{
878
+ bytes[i] = binaryString.charCodeAt(i);
879
+ }}
880
+ const jsonStr = new TextDecoder('utf-8').decode(bytes);
881
+ return JSON.parse(jsonStr);
882
+ }}
883
+
884
+ function escapeControlChars(text) {{
885
+ if (!text) return text;
886
+ let out = '';
887
+ for (let i = 0; i < text.length; i++) {{
888
+ const ch = text[i];
889
+ const code = text.charCodeAt(i);
890
+ if (ch === '\\\\') {{
891
+ out += '\\\\\\\\';
892
+ }} else if (ch === '\\n') {{
893
+ out += '\\\\n';
894
+ }} else if (ch === '\\r') {{
895
+ out += '\\\\r';
896
+ }} else if (ch === '\\t') {{
897
+ out += '\\\\t';
898
+ }} else if (code < 32 || code === 127) {{
899
+ out += '\\\\x' + code.toString(16).padStart(2, '0');
900
+ }} else {{
901
+ out += ch;
902
+ }}
903
+ }}
904
+ return out;
905
+ }}
906
+
907
  function formatTopkColumn(topkBase64, modelName, titleClass) {{
908
  if (!topkBase64) return '<div class="topk-column"><div class="topk-title ' + titleClass + '">' + modelName + '</div><div class="topk-list">N/A</div></div>';
909
  try {{
910
+ const data = decodeBase64Json(topkBase64);
911
+ let actualId = null;
912
+ let rank = null;
913
+ let actualProb = null;
914
+ let topkList = [];
915
+ if (data.length >= 4) {{
916
+ [actualId, rank, actualProb, topkList] = data;
917
+ }} else {{
918
+ [actualId, rank, topkList] = data;
919
  }}
 
 
 
920
  let html = '<div class="topk-column">';
921
  html += '<div class="topk-title ' + titleClass + '">' + modelName + '</div>';
922
  html += '<div class="topk-list">';
 
924
  const [tokenId, prob, tokenText] = item;
925
  const isHit = tokenId === actualId;
926
  const rankClass = isHit ? 'topk-rank hit' : 'topk-rank';
927
+ const rawText = (tokenText !== undefined && tokenText !== null) ? tokenText : '';
928
+ const visibleText = escapeControlChars(rawText);
929
+ const displayText = (visibleText !== '') ? visibleText : ('[' + tokenId + ']');
930
+ const escapedText = displayText
931
+ .replace(/&/g, '&amp;')
932
+ .replace(/</g, '&lt;')
933
+ .replace(/>/g, '&gt;');
934
  html += '<div class="topk-item">';
935
  html += '<span class="' + rankClass + '">' + (idx + 1) + '.</span>';
936
  html += '<span class="topk-token" title="ID: ' + tokenId + '">' + escapedText + '</span>';
 
939
  html += '</div>';
940
  }});
941
  if (rank > 10) {{
942
+ let probSuffix = '';
943
+ const probVal = parseFloat(actualProb);
944
+ if (!isNaN(probVal)) {{
945
+ probSuffix = ' (' + (probVal * 100).toFixed(4) + '%)';
946
+ }}
947
+ html += '<div class="topk-item topk-miss">Actual rank: ' + rank + probSuffix + '</div>';
948
  }}
949
  html += '</div></div>';
950
  return html;
 
955
  }}
956
  }}
957
 
958
+ function formatTokenChips(modelBase64, label, labelClass) {{
959
+ if (!modelBase64) {{
960
+ return '<div class="token-block"><span class="label ' + labelClass + '">' + label + ':</span> <span class="topk-token token-chip">N/A</span></div>';
961
+ }}
962
+ try {{
963
+ const tokenList = decodeBase64Json(modelBase64);
964
+ let html = '<div class="token-block">';
965
+ html += '<span class="label ' + labelClass + '">' + label + ':</span>';
966
+ html += '<div class="token-chips">';
967
+ tokenList.forEach((item) => {{
968
+ const tokenId = item[0];
969
+ const tokenText = item[1];
970
+ const visible = escapeControlChars(tokenText || '');
971
+ const displayText = (visible !== '') ? visible : '';
972
+ const escapedText = displayText
973
+ .replace(/&/g, '&amp;')
974
+ .replace(/</g, '&lt;')
975
+ .replace(/>/g, '&gt;');
976
+ html += '<span class="token-chip-group" title="ID: ' + tokenId + '">';
977
+ html += '<span class="token-id">[' + tokenId + ']</span>';
978
+ html += '<span class="topk-token token-chip">' + escapedText + '</span>';
979
+ html += '</span>';
980
+ }});
981
+ html += '</div></div>';
982
+ return html;
983
+ }} catch (e) {{
984
+ console.error('Error in formatTokenChips for ' + label + ':', e);
985
+ console.error('modelBase64:', modelBase64);
986
+ return '<div class="token-block"><span class="label ' + labelClass + '">' + label + ':</span> <span class="topk-token token-chip">Error: ' + e.message + '</span></div>';
987
+ }}
988
+ }}
989
+
990
  let tooltipHtml = `
991
  <div><span class="label">Bytes:</span> <span class="bytes">${{bytes || '(empty)'}}</span></div>
992
  <div><span class="label">RWKV Compression Rate:</span> <span class="loss-a">${{compressionA || '(empty)'}}${{avgCompressionA ? ' (avg: ' + avgCompressionA + '%)' : ''}}</span></div>
993
  <div><span class="label">Qwen Compression Rate:</span> <span class="loss-b">${{compressionB || '(empty)'}}${{avgCompressionB ? ' (avg: ' + avgCompressionB + '%)' : ''}}</span></div>
994
  <hr style="border-color: #555; margin: 6px 0;">
995
+ ${{formatTokenChips(modelA, 'RWKV', 'model-a')}}
996
+ ${{formatTokenChips(modelB, 'Qwen', 'model-b')}}
997
  `;
998
  if (top5A || top5B) {{
999
  tooltipHtml += '<div class="topk-section"><div class="topk-container">';