---
language: multilingual
license: mit
library_name: pytorch
tags:
- text-classification
- language-detection
- byte-level
- multilingual
- english-detection
- cnn
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
model-index:
- name: innit
  results:
  - task:
      type: text-classification
      name: English vs Non-English Detection
    metrics:
    - type: accuracy
      value: 99.94
      name: Validation Accuracy
    - type: accuracy
      value: 100.0
      name: Challenge Set Accuracy
---
# innit: Fast English vs Non-English Text Detection
A lightweight byte-level CNN for fast binary language detection (English vs Non-English).
## Model Details
- **Model Type**: Byte-level Convolutional Neural Network
- **Task**: Binary text classification (English vs Non-English)
- **Architecture**: TinyByteCNN_EN with 6 convolutional blocks
- **Parameters**: 156,642
- **Input**: Raw UTF-8 bytes (max 256 bytes)
- **Output**: Binary classification (0=Non-English, 1=English)
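The input format can be sketched in plain Python. The helper name `encode_bytes` is hypothetical; the 256-byte limit and zero padding mirror the Quick Start examples. Truncation counts bytes rather than characters, so non-ASCII text reaches the limit sooner and a multi-byte character may be cut at the boundary.

```python
from typing import List

MAX_BYTES = 256  # input length stated in Model Details


def encode_bytes(text: str, max_len: int = MAX_BYTES) -> List[int]:
    """Return exactly max_len byte values: the UTF-8 bytes of text, zero-padded."""
    raw = text.encode("utf-8", errors="ignore")[:max_len]  # truncate by bytes
    return list(raw) + [0] * (max_len - len(raw))          # pad with zeros


ids = encode_bytes("Hi 你好")
print(len(ids))  # 256
print(ids[:9])   # [72, 105, 32, 228, 189, 160, 229, 165, 189]
```

Each of the two Chinese characters occupies three bytes, which is why nine values are filled for a five-character string.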
## Performance
- **Validation Accuracy**: 99.94%
- **Challenge Set Accuracy**: 100% (14/14 test cases)
- **Inference Speed**: Sub-millisecond on modern CPUs
- **Model Size**: ~600KB
## Supported Languages
Trained to distinguish English from 52+ languages across diverse scripts:
- **Latin scripts**: Spanish, French, German, Italian, Dutch, Portuguese, etc.
- **CJK scripts**: Chinese (Simplified/Traditional), Japanese, Korean
- **Cyrillic scripts**: Russian, Ukrainian, Bulgarian, Serbian
- **Other scripts**: Arabic, Hindi, Bengali, Thai, Hebrew, etc.
## Architecture
```
TinyByteCNN_EN:
├── Embedding: 257 → 80 dimensions (256 bytes + padding)
├── 6x Convolutional Blocks:
│   ├── Conv1D (kernel=3, residual connections)
│   ├── GELU activation
│   ├── BatchNorm1D
│   └── Dropout (0.15)
├── Enhanced Pooling: mean + max + std
└── Classification Head: 240 → 80 → 2
```
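As a sanity check, the layer sizes in the diagram can be tallied against the stated parameter count. The breakdown below is a sketch under assumptions the card does not confirm: every convolution keeps 80 channels and has a bias, each BatchNorm carries a learnable scale and shift per channel, and both linear layers in the head have biases. Under those assumptions the total matches the stated 156,642.

```python
EMB = 80      # embedding width from the diagram
VOCAB = 257   # 256 byte values + 1 padding index
BLOCKS = 6
KERNEL = 3
CLASSES = 2

embedding = VOCAB * EMB                       # 20,560
conv = EMB * EMB * KERNEL + EMB               # weights + bias (assumed): 19,280
batchnorm = 2 * EMB                           # scale + shift (assumed): 160
blocks = BLOCKS * (conv + batchnorm)          # 116,640
pooled = 3 * EMB                              # mean + max + std -> 240 features
head = (pooled * EMB + EMB) + (EMB * CLASSES + CLASSES)  # 240 -> 80 -> 2: 19,442

total = embedding + blocks + head
print(total)                    # 156642
print(round(total * 4 / 1024))  # 612 (KiB at 4 bytes per FP32 parameter)
```

The FP32 estimate of roughly 612 KiB is in line with the approximate 600 KB model size quoted under Performance.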
## Training Data
- **Total samples**: 17,543 balanced samples
- **English**: 8,772 samples from diverse sources
- **Non-English**: 8,771 samples across 52+ languages
- **Text lengths**: 3-276 characters (optimized for short texts)
- **Special coverage**: Emoji handling, mathematical formulas, scientific notation
## Quick Start
### Option 1: ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("model.onnx")

def predict(text):
    # Prepare input: UTF-8 bytes, truncated and zero-padded to 256
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)

    # Run inference
    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
    logits = outputs[0][0]

    # Apply a numerically stable softmax
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return probs[1]  # English probability

# Examples
print(predict("Hello world!"))        # ~1.0 (English)
print(predict("Bonjour le monde"))    # ~0.0 (French)
print(predict("Check our sale! 🎉"))  # ~1.0 (English with emoji)
```
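The softmax step in `predict` subtracts the largest logit before exponentiating, which leaves the probabilities unchanged but avoids overflow for large logits. Below is a dependency-free sketch of the same computation, with a hypothetical `is_english` helper that applies an assumed 0.5 decision threshold (the card does not specify one).

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: shift by the max logit before exp()."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_english(logits: Sequence[float], threshold: float = 0.5) -> bool:
    """Index 1 is the English class; the 0.5 threshold is an assumption."""
    return softmax(logits)[1] >= threshold


print(softmax([0.0, 0.0]))      # [0.5, 0.5]
print(is_english([-2.0, 3.0]))  # True
print(is_english([3.0, -2.0]))  # False
```

Raising the threshold trades recall for precision on the English class; a suitable value depends on the application.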
### Option 2: Python Package
```bash
# Install the utility package
pip install innit-detector
# CLI usage
innit "Hello world!" # → English (confidence: 0.974)
innit --download # Download model first
innit "Hello" "Bonjour" "你好" # Multiple texts
# Library usage (Python; run in a Python interpreter, not the shell)
from innit_detector import InnitDetector
detector = InnitDetector()
result = detector.predict("Hello world!")
print(result['is_english']) # True
```
### Option 3: PyTorch (Advanced)
```python
import torch
import torch.nn.functional as F
from safetensors.torch import load_file
import numpy as np

# Load model (requires TinyByteCNN_EN class definition)
state_dict = load_file("model.safetensors")
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
model.load_state_dict(state_dict)
model.eval()

def predict(text):
    # UTF-8 bytes, truncated and zero-padded to 256
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)  # np.long was removed in NumPy 1.24
    padded[:len(bytes_data)] = list(bytes_data)

    with torch.no_grad():
        logits = model(torch.tensor(padded).unsqueeze(0))
        probs = F.softmax(logits, dim=1)
    return probs[0][1].item()  # English probability
```
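The snippet above assumes a `TinyByteCNN_EN` class is already defined. The following is a sketch reconstructed from the Architecture summary, not the author's implementation: the class name `TinyByteCNNSketch`, the layer names, the placement of the residual connection, and the GELU inside the head are all assumptions, so the keys in `model.safetensors` may not match it.

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv1d -> GELU -> BatchNorm1d -> Dropout, wrapped in a residual connection."""

    def __init__(self, channels: int, dropout: float):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.norm = nn.BatchNorm1d(channels)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.norm(self.act(self.conv(x))))


class TinyByteCNNSketch(nn.Module):
    """Hypothetical reconstruction; attribute names are not taken from the checkpoint."""

    def __init__(self, emb: int = 80, blocks: int = 6, dropout: float = 0.15):
        super().__init__()
        self.embed = nn.Embedding(257, emb)  # 256 byte values + padding
        self.blocks = nn.Sequential(*[ConvBlock(emb, dropout) for _ in range(blocks)])
        self.head = nn.Sequential(nn.Linear(3 * emb, emb), nn.GELU(), nn.Linear(emb, 2))

    def forward(self, byte_ids):
        x = self.embed(byte_ids).transpose(1, 2)  # (batch, emb, length)
        x = self.blocks(x)
        # Pool over the length axis: mean + max + std -> 3 * emb features
        pooled = torch.cat([x.mean(dim=2), x.amax(dim=2), x.std(dim=2)], dim=1)
        return self.head(pooled)


model = TinyByteCNNSketch().eval()
print(sum(p.numel() for p in model.parameters()))  # 156642
```

The parameter count of this sketch equals the 156,642 stated in Model Details, which suggests the layer sizes are plausible, but it does not confirm the exact layer ordering.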
## ONNX Support
ONNX version available for cross-platform deployment:
- `model.onnx` - Full precision (FP32) for maximum compatibility
## Challenge Set Results
100% accuracy (14/14) on a small challenge set covering:
- Ultra-short texts: "Good morning!" ✅
- Emoji handling: "Check out our sale! 🎉" ✅
- Mathematical formulas: "x = (-b ± √(b²-4ac))/2a" ✅
- Scientific notation: "CO₂ + H₂O → C₆H₁₂O₆" ✅
- Diverse scripts: Arabic, CJK, Cyrillic, Devanagari ✅
- English-like languages: Dutch, German ✅
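A challenge set of this kind can be scored with a few lines of Python. This is a minimal sketch: `challenge_accuracy` and `stub_predict` are hypothetical names, the three cases are illustrative rather than the actual 14-case set, and the stub is a crude ASCII check standing in for a real `predict` function such as the ones in Quick Start.

```python
from typing import Callable, Iterable, Tuple


def challenge_accuracy(
    predict: Callable[[str], float],
    cases: Iterable[Tuple[str, bool]],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[int, int]:
    """Return (correct, total) for (text, is_english) pairs."""
    correct = total = 0
    for text, expected in cases:
        total += 1
        if (predict(text) >= threshold) == expected:
            correct += 1
    return correct, total


def stub_predict(text: str) -> float:
    # Stand-in for the real model: NOT a language detector, just an ASCII check.
    return 1.0 if text.isascii() else 0.0


cases = [("Good morning!", True), ("Привет, мир", False), ("你好", False)]
correct, total = challenge_accuracy(stub_predict, cases)
print(f"{correct}/{total} correct")  # 3/3 correct
```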
## Limitations
- Binary classification only (English vs Non-English)
- Optimized for texts up to 256 UTF-8 bytes; the Quick Start examples truncate longer inputs to their first 256 bytes
- May have reduced accuracy on very rare languages not in training data
- Not suitable for multilingual text (mixed languages in a single input)
## License
MIT License - free for commercial use.