| --- |
| license: apache-2.0 |
| language: |
| - en |
| tags: |
| - tiny |
| - fromzero |
| - regression |
| - api_pricing |
| - research |
| - token_estimation |
| - prediction |
| - token_counting |
| - dataset_labeling_tokens |
| datasets: |
| - fromziro/jetoncount_corpus |
| --- |
| |
| # JetonCount |
|
|
| ## Summary |
| ``` |
| Task: regression |
| Total training time: 111 minutes |
| Params: 7009 |
| Final MAE: 192 (train) |
| Framework: PyTorch |
| Authors: Paul Courneya, Jonathon Ly |
| ``` |
|
|
| ## Description |
|
|
| JetonCount is a 7k-parameter MLP regression model trained to estimate the number of tokens in a piece of text from only seven raw input features, which are expanded into a 19-dimensional engineered input. |
|
|
| ## Model Details |
|
|
| - Architecture: MLP |
| - Input Feature Dimension: 19 |
| - Raw Input Features: 7 |
| - Engineered Features: True |
| - Log1p Engineered Features: True |
| - Hidden Size: 32 |
| - Number of Layers: 8 |
| - Activation: SiLU |
| - Dropout: 0.005 |
| - Total Parameters: 7,009 |
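
The parameter total can be checked by hand from the numbers above. The sketch below assumes that "Number of Layers: 8" counts fully connected layers with biases (input to hidden, six hidden to hidden, hidden to a single output) and that there are no normalization layers; the card does not spell out the layout, so treat this as one layout that is consistent with the reported total rather than a description of the actual module.

```python
# Assumed layout: 8 fully connected layers with biases, no normalization layers.
INPUT_DIM = 19    # "Input Feature Dimension" from the card
HIDDEN = 32       # "Hidden Size" from the card
NUM_LAYERS = 8    # "Number of Layers" from the card, read as linear layers
OUTPUT_DIM = 1    # single regression output (assumption)


def linear_params(fan_in: int, fan_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return fan_in * fan_out + fan_out


def mlp_params(input_dim: int, hidden: int, num_layers: int, output_dim: int) -> int:
    """Total parameters of an MLP with `num_layers` linear layers."""
    # Layer widths: input, then (num_layers - 1) hidden widths, then output.
    widths = [input_dim] + [hidden] * (num_layers - 1) + [output_dim]
    return sum(linear_params(a, b) for a, b in zip(widths, widths[1:]))


total = mlp_params(INPUT_DIM, HIDDEN, NUM_LAYERS, OUTPUT_DIM)
print(total)  # 7009
```

Under these assumptions the count is 640 + 6 × 1,056 + 33 = 7,009, which matches the reported total. SiLU and dropout add no parameters.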
|
|
| ### Input Features |
|
|
| - `chars` |
| - `words` |
| - `avg_chars_per_word` |
| - `longest_word_chars` |
| - `symbol_ratio` |
| - `punctuation_ratio` |
| - `vocab_size` |
| |
| ## Training |
|
|
| ### Dataset |
|
|
| 22M rows. 28 tokenizers. 9 sources. |
|
|
| - Tokenizers: [`tokenizers_used.txt`](https://huggingface.co/fromziro/JetonCount/blob/main/tokenizers_used.txt) |
| - Datasets: [`datasets_used.txt`](https://huggingface.co/fromziro/JetonCount/blob/main/datasets_used.txt) |
|
|
| The sources include math, code, educational text, general web text, social media posts (Reddit), and instruction data. |
| The tokenizers range from GPT-2 to DeepSeek-v4, with vocabulary sizes from 250 to 256,000. |
|
|
| ### Training Details |
|
|
| - Maximum Learning Rate: 6e-3 |
| - Minimum Learning Rate: 3e-6 |
| - Number of Epochs: 3 |
| - Batch Size: 32000 |
| - Eval Split Ratio: 0.005 |
| - Gradient Accumulation Steps: 1 |
| - Gradient Clipping: 1.0 |
| - AdamW Betas: `(0.9, 0.95)` |
| - DType: `float32` |
|
|
| ### Final Eval and Train Results |
|
|
| - Train: |
|   - R²: 0.951257 |
|   - MSE: 938621.647477 |
|   - RMSE: 968.824880 |
|   - MAE: 192.055944 |
|   - MRE: 0.137838 |
|   - Explained Variance: 0.951305 |
|   - Loss: 938621.647477 |
|
|
| - Eval: |
|   - R²: 0.9738627018499254 |
|   - MSE: 480722.18035314884 |
|   - RMSE: 693.3413159138498 |
|   - MAE: 163.19862499318103 |
|   - MRE: 0.10729957442834033 |
|   - Explained Variance: 0.9738670065468997 |
|   - Loss: 480722.18035314884 |
|
|
| - Test: |
|   - R²: 0.9717820439277628 |
|   - MSE: 388793.8423401673 |
|   - RMSE: 623.5333530294649 |
|   - MAE: 157.27345487387672 |
|   - MRE: 0.10509939610167868 |
|   - Explained Variance: 0.9717854117493441 |
|   - Loss: 388793.8423401673 |
|
|
| ### Hardware |
|
|
| - CPU: Ryzen 5 2600 (data preparation and training) |
|
|
| ### Predictions |
|
|
| | Actual Tokens | Model Prediction | |
| | ------------- | ---------------- | |
| | 197 | 239 | |
| | 1333 | 1395 | |
| | 5973 | 6609 | |
| | 18569 | 20423 | |
|
|
| Note: Predictions are rounded to the nearest integer. |
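
To make the error metrics concrete, the sketch below computes the absolute and relative error for the four rows of the table above. It assumes that MRE means the mean of `|prediction - actual| / actual`; the card does not define the abbreviation, so that reading is a guess.

```python
# Rows copied from the predictions table in this card.
actual = [197, 1333, 5973, 18569]
predicted = [239, 1395, 6609, 20423]

# Per-row absolute error and relative error (relative to the true count).
abs_errors = [abs(p - a) for a, p in zip(actual, predicted)]
rel_errors = [e / a for e, a in zip(abs_errors, actual)]

# Sample means over these four rows only.
mae = sum(abs_errors) / len(abs_errors)
mre = sum(rel_errors) / len(rel_errors)  # assumed definition of MRE

for a, p, e, r in zip(actual, predicted, abs_errors, rel_errors):
    print(f"{a:>6} {p:>6} {e:>5} {r:.1%}")
print(f"MAE={mae:.1f} MRE={mre:.3f}")  # MAE=648.5 MRE=0.117
```

Four rows are far too few to compare with the reported train, eval, and test metrics; the sample is also weighted toward long texts, which is why its MAE is much larger than the reported one while its relative error is in the same range.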
|
|
| #### Example |
|
|
| - Input Text (taken from Wikipedia): |
| ``` |
| Artificial intelligence (AI) is the capability of computational systems to perform tasks typically associated with human intelligence, such as learning, reasoning, problem-solving, perception, and decision-making. It is a field of research in engineering, mathematics and computer science that develops and studies methods and software that enable machines to perceive their environment and use learning and intelligence to take actions that maximize their chances of achieving defined goals.[1] |
| ``` |
| - Input vocab size (matching tokenizer): `2560` |
| - Tokenizer (baseline): `fromziro/Er-Tiny-1.3M` |
|
|
| Out: |
| ``` |
| { |
|   "actual_token_count": 139, |
|   "prediction": 190.5435028076172, |
|   "model_latency_ms": 0.24457614858252544, |
|   "tokenizer_latency_ms": 0.3174110000009023 |
| } |
| ``` |
|
|
| ### Is it faster than a tokenizer, though? |
|
|
| This raises an obvious question: why build this model if a tokenizer gives the exact count anyway? |
|
|
| The answer is **speed**. |
|
|
| In our tests the model is faster than the tokenizer at every length measured: about 1.7× faster at 197 tokens and roughly 5× to 8× faster from 1,333 tokens up. |
|
|
| | Tokens | Model Latency (ms) | Tokenizer Latency (ms) | |
| |--------|--------------------|--------------------------| |
| | 197 | 0.2429 | 0.4134 | |
| | 1333 | 0.3409 | 1.8775 | |
| | 5973 | 0.9827 | 7.6504 | |
| | 18569 | 5.2890 | 28.8244 | |
|
|
| Note: Model latency changes based on hardware. |
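
The speedup per row can be read off the latency table directly. The sketch below divides the tokenizer latency by the model latency for each row; the numbers are copied from the table and nothing is re-measured.

```python
# (tokens, model latency in ms, tokenizer latency in ms), copied from the table.
rows = [
    (197, 0.2429, 0.4134),
    (1333, 0.3409, 1.8775),
    (5973, 0.9827, 7.6504),
    (18569, 5.2890, 28.8244),
]

# Speedup = tokenizer latency / model latency for the same text.
speedups = {tokens: tok_ms / model_ms for tokens, model_ms, tok_ms in rows}

for tokens, speedup in speedups.items():
    print(f"{tokens:>6} tokens: {speedup:.2f}x")
```

On these four rows the ratio is lowest for the shortest text and peaks at 5,973 tokens.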
|
|
| ## Use Cases |
|
|
| 1. Educational work and research |
| 2. API Pricing Estimation |
| 3. Dataset labeling |
| 4. Or just for fun. |
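
For the API pricing use case, a predicted token count can be turned into a rough cost band. The sketch below is illustrative only: `estimate_cost` and the price of 2.00 USD per million tokens are hypothetical, and the margin reuses the test-set MRE from this card, which is a mean error and not a guaranteed bound.

```python
TEST_MRE = 0.105  # test-set mean relative error from this card, rounded


def estimate_cost(predicted_tokens: float,
                  usd_per_million_tokens: float,
                  rel_margin: float = TEST_MRE) -> tuple[float, float, float]:
    """Return (low, mid, high) cost estimates in USD for one text."""
    mid = predicted_tokens * usd_per_million_tokens / 1_000_000
    # Widen the point estimate by the relative margin in both directions.
    return mid * (1 - rel_margin), mid, mid * (1 + rel_margin)


# 20423 is the model prediction from the last row of the predictions table;
# the price is a made-up example value.
low, mid, high = estimate_cost(20423, usd_per_million_tokens=2.00)
print(f"{low:.5f} <= {mid:.5f} <= {high:.5f} USD")
```

Individual texts can fall outside this band, so billing-critical paths should still count tokens with the real tokenizer.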
|
|
| ## Limitations |
|
|
| 1. The model is an approximation and can produce errors on out-of-distribution texts. |
| 2. Prediction accuracy heavily depends on the correctness of the input features. |
| 3. It does not perform actual tokenization and therefore is much less accurate than an actual tokenizer. |
| 4. Vocabulary sizes larger than 128k can result in performance degradation. |
| 5. Short text (under 32 tokens) can result in performance degradation. |
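
Limitations 4 and 5 suggest a simple guard: fall back to a real tokenizer when the input is outside the range where the estimate is expected to hold. The sketch below is hypothetical (`estimator_is_reliable` is not part of this repository), it reads "128k" as 128,000, and it applies the 32-token threshold to the model's own prediction because the true count is not known in advance.

```python
MAX_RELIABLE_VOCAB = 128_000   # "128k" from limitation 4, read as 128,000 (assumption)
MIN_RELIABLE_TOKENS = 32       # threshold from limitation 5


def estimator_is_reliable(predicted_tokens: float, vocab_size: int) -> bool:
    """Return True when the prediction falls inside the documented comfort zone."""
    long_enough = predicted_tokens >= MIN_RELIABLE_TOKENS
    vocab_ok = vocab_size <= MAX_RELIABLE_VOCAB
    return long_enough and vocab_ok


print(estimator_is_reliable(190.5, 2560))    # True
print(estimator_is_reliable(20.0, 2560))     # False: text is too short
print(estimator_is_reliable(500.0, 256000))  # False: vocabulary is too large
```

When the check returns False, counting with the actual tokenizer is the safer choice.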
|
|
| ## License |
|
|
| Before using, distributing, selling, or modifying this software, you must read the license [here](https://huggingface.co/fromziro/Er-Tiny-1.3M/blob/main/LICENSE.txt). |
|
|
| ## Inference |
|
|
| ```python |
| from __future__ import annotations |
| |
| import json |
| import re |
| import time |
| from dataclasses import dataclass |
| from typing import Tuple |
| |
| import torch |
| from transformers import AutoModel, AutoTokenizer |
| |
| MODEL_ID = "fromziro/JetonCount" |
| TOKENIZER_ID = "fromziro/Er-Tiny-1.3M" |
| |
| # Optional standardization statistics; when left as None, features are passed through unchanged. |
| FEATURE_MEAN = None |
| FEATURE_STD = None |
| TARGET_OFFSET = 0.0 |
| |
| DEFAULT_VOCAB_SIZE = 2564 |
| |
| TEXT = "Put your text here." |
| TOKENIZER_ROUNDS = 100 |
| MODEL_ROUNDS = 1000 |
| |
| PUNCTUATION_CHARS = set(r""".,!?;:'"`~@#$%^&*()-_=+[]{}<>/\|""") |
| SYMBOL_CHARS = set(r"""@#$%^&*()-_=+[]{}<>/\|~`""") |
| |
| |
| @dataclass |
| class TextStats: |
|     chars: float |
|     words: float |
|     avg_chars_per_word: float |
|     punctuation_ratio: float |
|     symbol_ratio: float |
|     longest_word_chars: float |
|     vocab_size: float |
| |
| |
| def compute_text_stats(text: str, vocab_size: int) -> TextStats: |
|     """Compute the seven raw input features from the text and the tokenizer's vocabulary size.""" |
|     chars = len(text) |
|     words_list = re.findall(r"\b\w+\b", text, flags=re.UNICODE) |
|     words = len(words_list) |
| |
|     total_word_chars = sum(len(w) for w in words_list) |
|     avg_chars_per_word = (total_word_chars / words) if words else 0.0 |
|     longest_word_chars = max((len(w) for w in words_list), default=0) |
| |
|     if chars: |
|         punctuation_count = sum(1 for ch in text if ch in PUNCTUATION_CHARS) |
|         symbol_count = sum(1 for ch in text if ch in SYMBOL_CHARS) |
|         punctuation_ratio = punctuation_count / chars |
|         symbol_ratio = symbol_count / chars |
|     else: |
|         punctuation_ratio = 0.0 |
|         symbol_ratio = 0.0 |
| |
|     return TextStats( |
|         chars=float(chars), |
|         words=float(words), |
|         avg_chars_per_word=float(avg_chars_per_word), |
|         punctuation_ratio=float(punctuation_ratio), |
|         symbol_ratio=float(symbol_ratio), |
|         longest_word_chars=float(longest_word_chars), |
|         vocab_size=float(vocab_size), |
|     ) |
| |
| |
| def build_feature_tensor(stats: TextStats) -> torch.Tensor: |
|     """Return the 19-dimensional input: the 7 raw features followed by 12 engineered ones.""" |
|     base = torch.tensor( |
|         [ |
|             stats.chars, |
|             stats.words, |
|             stats.avg_chars_per_word, |
|             stats.punctuation_ratio, |
|             stats.symbol_ratio, |
|             stats.longest_word_chars, |
|             stats.vocab_size, |
|         ], |
|         dtype=torch.float32, |
|     ) |
| |
|     chars, words, avg_chars_per_word, punctuation_ratio, symbol_ratio, longest_word_chars, vocab_size = base |
|     eps = 1e-6 |
| |
|     # Engineered features: ratios, log1p transforms, and pairwise interactions. |
|     extra = torch.tensor( |
|         [ |
|             chars / max(words.item(), 1.0), |
|             words / max(chars.item(), 1.0), |
|             torch.log1p(torch.clamp(chars, min=0.0)).item(), |
|             torch.log1p(torch.clamp(words, min=0.0)).item(), |
|             torch.log1p(torch.clamp(vocab_size, min=0.0)).item(), |
|             (chars * punctuation_ratio).item(), |
|             (chars * symbol_ratio).item(), |
|             (words * avg_chars_per_word).item(), |
|             (words * punctuation_ratio).item(), |
|             (longest_word_chars * punctuation_ratio).item(), |
|             ((avg_chars_per_word + longest_word_chars) * (1.0 + punctuation_ratio + symbol_ratio)).item(), |
|             ((chars + eps) * (punctuation_ratio + symbol_ratio + eps)).item(), |
|         ], |
|         dtype=torch.float32, |
|     ) |
| |
|     return torch.cat([base, extra], dim=0) |
| |
| |
| def standardize_features(x: torch.Tensor) -> torch.Tensor: |
|     """Standardize x when FEATURE_MEAN and FEATURE_STD are set; otherwise return it unchanged.""" |
|     if FEATURE_MEAN is None or FEATURE_STD is None: |
|         return x |
|     mean = torch.tensor(FEATURE_MEAN, dtype=x.dtype, device=x.device) |
|     std = torch.tensor(FEATURE_STD, dtype=x.dtype, device=x.device) |
|     safe_std = torch.where(torch.isfinite(std) & (std != 0), std, torch.ones_like(std)) |
|     safe_mean = torch.where(torch.isfinite(mean), mean, torch.zeros_like(mean)) |
|     return (x - safe_mean) / safe_std |
| |
| |
| def benchmark_tokenizer(tokenizer, text: str, rounds: int = 100) -> Tuple[int, float]: |
|     """Return the token count and the mean tokenization latency in milliseconds.""" |
|     tokenizer(text)  # warm-up call, excluded from timing |
|     start = time.perf_counter() |
|     actual_count = 0 |
|     for _ in range(rounds): |
|         ids = tokenizer(text, add_special_tokens=False).input_ids |
|         actual_count = len(ids) |
|     elapsed_ms = (time.perf_counter() - start) * 1000.0 / rounds |
|     return actual_count, elapsed_ms |
| |
| |
| @torch.inference_mode() |
| def benchmark_model(model, feature_tensor: torch.Tensor, rounds: int = 1000) -> Tuple[float, float]: |
|     """Return the prediction and the mean forward-pass latency in milliseconds (feature extraction is not timed).""" |
|     x = standardize_features(feature_tensor).unsqueeze(0) |
| |
|     _ = model(input_features=x)  # warm-up call, excluded from timing |
| |
|     start = time.perf_counter() |
|     pred = 0.0 |
|     for _ in range(rounds): |
|         out = model(input_features=x) |
|         pred = float(out.logits.squeeze().item()) |
|     elapsed_ms = (time.perf_counter() - start) * 1000.0 / rounds |
|     return pred, elapsed_ms |
| |
| |
| def main() -> None: |
|     tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID, use_fast=True) |
|     model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True) |
|     model.eval() |
| |
|     stats = compute_text_stats(TEXT, DEFAULT_VOCAB_SIZE) |
|     feature_tensor = build_feature_tensor(stats) |
| |
|     actual_count, tokenizer_latency_ms = benchmark_tokenizer(tokenizer, TEXT, rounds=TOKENIZER_ROUNDS) |
|     prediction, model_latency_ms = benchmark_model(model, feature_tensor, rounds=MODEL_ROUNDS) |
| |
|     result = { |
|         "actual_token_count": actual_count, |
|         "prediction": prediction, |
|         "model_latency_ms": model_latency_ms, |
|         "tokenizer_latency_ms": tokenizer_latency_ms, |
|         "model_id": MODEL_ID, |
|         "tokenizer_id": TOKENIZER_ID, |
|         "vocab_size": DEFAULT_VOCAB_SIZE, |
|         "features": { |
|             "chars": stats.chars, |
|             "words": stats.words, |
|             "avg_chars_per_word": stats.avg_chars_per_word, |
|             "punctuation_ratio": stats.punctuation_ratio, |
|             "symbol_ratio": stats.symbol_ratio, |
|             "longest_word_chars": stats.longest_word_chars, |
|             "vocab_size": stats.vocab_size, |
|         }, |
|     } |
| |
|     print(json.dumps(result, indent=2, ensure_ascii=False)) |
| |
| |
| if __name__ == "__main__": |
|     main() |
| ``` |
|
|
| ### Demo (HF Space) |
|
|
| [JetonCount](https://huggingface.co/spaces/fromziro/JetonCount) |
|
|
| ## Copyright |
|
|
| ``` |
| Copyright (c) 2026 FromZero |
| Copyright (c) 2026 Paul Courneya |
| Copyright (c) 2026 Jonathon LY |
| ``` |
|
|
| ## Citation |
|
|
| ```bibtex |
| @misc{jetoncount, |
|   title        = {JetonCount}, |
|   organization = {FromZero}, |
|   author       = {Paul Courneya and Jonathon LY}, |
|   year         = {2026}, |
|   url          = {https://huggingface.co/fromziro/JetonCount} |
| } |
| ``` |