---
language: multilingual
license: mit
library_name: pytorch
tags:
- text-classification
- language-detection
- byte-level
- multilingual
- english-detection
- cnn
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
model-index:
- name: innit
  results:
  - task:
      type: text-classification
      name: English vs Non-English Detection
    metrics:
    - type: accuracy
      value: 99.94
      name: Validation Accuracy
    - type: accuracy  
      value: 100.0
      name: Challenge Set Accuracy
---

# innit: Fast English vs Non-English Text Detection

A lightweight byte-level CNN for fast binary language detection (English vs Non-English).

## Model Details

- **Model Type**: Byte-level Convolutional Neural Network
- **Task**: Binary text classification (English vs Non-English)
- **Architecture**: TinyByteCNN_EN with 6 convolutional blocks
- **Parameters**: 156,642
- **Input**: Raw UTF-8 bytes (max 256 bytes)
- **Output**: Binary classification (0=Non-English, 1=English)

## Performance

- **Validation Accuracy**: 99.94%
- **Challenge Set Accuracy**: 100% (14/14 test cases)
- **Inference Speed**: Sub-millisecond on modern CPUs
- **Model Size**: ~600KB

## Supported Languages

Trained to distinguish English from 52+ languages across diverse scripts:
- **Latin scripts**: Spanish, French, German, Italian, Dutch, Portuguese, etc.
- **CJK scripts**: Chinese (Simplified/Traditional), Japanese, Korean
- **Cyrillic scripts**: Russian, Ukrainian, Bulgarian, Serbian
- **Other scripts**: Arabic, Hindi, Bengali, Thai, Hebrew, etc.

## Architecture

```
TinyByteCNN_EN:
β”œβ”€β”€ Embedding: 257 β†’ 80 dimensions (256 byte values + 1 padding token)
β”œβ”€β”€ 6x Convolutional Blocks:
β”‚   β”œβ”€β”€ Conv1D (kernel=3, residual connections)
β”‚   β”œβ”€β”€ GELU activation
β”‚   β”œβ”€β”€ BatchNorm1D  
β”‚   └── Dropout (0.15)
β”œβ”€β”€ Enhanced Pooling: mean + max + std
└── Classification Head: 240 β†’ 80 β†’ 2
```
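The PyTorch quick-start below loads weights into a `TinyByteCNN_EN` class that is not shipped in this card. The following is a minimal sketch reconstructed from the diagram above; its total parameter count comes out to the stated 156,642, but the layer ordering inside each block and the `state_dict` key names are assumptions and may not match the released `model.safetensors` exactly:

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Conv1D (kernel=3) -> GELU -> BatchNorm1D -> Dropout, with a residual add."""

    def __init__(self, ch, dropout=0.15):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, kernel_size=3, padding=1)
        self.act = nn.GELU()
        self.norm = nn.BatchNorm1d(ch)
        self.drop = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.drop(self.norm(self.act(self.conv(x))))


class TinyByteCNN_EN(nn.Module):
    def __init__(self, emb=80, blocks=6, dropout=0.15):
        super().__init__()
        self.embed = nn.Embedding(257, emb)  # 256 byte values + 1 padding token
        self.blocks = nn.Sequential(*[ConvBlock(emb, dropout) for _ in range(blocks)])
        # Pooling concatenates mean + max + std -> 3 * emb = 240 features
        self.head = nn.Sequential(nn.Linear(emb * 3, emb), nn.GELU(), nn.Linear(emb, 2))

    def forward(self, x):
        h = self.embed(x).transpose(1, 2)  # (batch, emb, length) for Conv1d
        h = self.blocks(h)
        pooled = torch.cat([h.mean(-1), h.amax(-1), h.std(-1)], dim=1)
        return self.head(pooled)  # logits: [non-English, English]
```

If `model.load_state_dict` fails with key mismatches against the released checkpoint, the block internals differ from this sketch and the keys will show where.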

## Training Data

- **Total samples**: 17,543 balanced samples
- **English**: 8,772 samples from diverse sources
- **Non-English**: 8,771 samples across 52+ languages
- **Text lengths**: 3-276 characters (optimized for short texts)
- **Special coverage**: Emoji handling, mathematical formulas, scientific notation

## Quick Start

### Option 1: ONNX Runtime (Recommended)
```python
import onnxruntime as ort
import numpy as np

# Load ONNX model
session = ort.InferenceSession("model.onnx")

def predict(text):
    # Prepare input
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)
    padded[:len(bytes_data)] = list(bytes_data)
    
    # Run inference
    outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)})
    logits = outputs[0][0]
    
    # Apply softmax
    exp_logits = np.exp(logits - np.max(logits))
    probs = exp_logits / np.sum(exp_logits)
    return probs[1]  # English probability

# Examples
print(predict("Hello world!"))           # ~1.0 (English)
print(predict("Bonjour le monde"))       # ~0.0 (French)
print(predict("Check our sale! πŸŽ‰"))     # ~1.0 (English with emoji)
```

### Option 2: Python Package
```bash
# Install the utility package
pip install innit-detector

# CLI usage
innit "Hello world!"                    # β†’ English (confidence: 0.974)
innit --download                        # Download model first
innit "Hello" "Bonjour" "δ½ ε₯½"          # Multiple texts

# Library usage
from innit_detector import InnitDetector
detector = InnitDetector()
result = detector.predict("Hello world!")
print(result['is_english'])  # True
```

### Option 3: PyTorch (Advanced)
```python
import torch
import torch.nn.functional as F
from safetensors.torch import load_file
import numpy as np

# Load model (requires TinyByteCNN_EN class definition)
state_dict = load_file("model.safetensors")
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15)
model.load_state_dict(state_dict)
model.eval()

def predict(text):
    bytes_data = text.encode('utf-8', errors='ignore')[:256]
    padded = np.zeros(256, dtype=np.int64)  # np.long was removed in NumPy 1.24+
    padded[:len(bytes_data)] = list(bytes_data)
    
    with torch.no_grad():
        logits = model(torch.tensor(padded).unsqueeze(0))
        probs = F.softmax(logits, dim=1)
        return probs[0][1].item()
```

## ONNX Support

An ONNX export is available for cross-platform deployment:
- `model.onnx` - Full precision (FP32) for maximum compatibility

## Challenge Set Results

100% accuracy (14/14) on the challenge set, covering:
- Ultra-short texts: "Good morning!" βœ…
- Emoji handling: "Check out our sale! πŸŽ‰" βœ…  
- Mathematical formulas: "x = (-b Β± √(bΒ²-4ac))/2a" βœ…
- Scientific notation: "COβ‚‚ + Hβ‚‚O β†’ C₆H₁₂O₆" βœ…
- Diverse scripts: Arabic, CJK, Cyrillic, Devanagari βœ…
- English-like languages: Dutch, German βœ…

## Limitations

- Binary classification only (English vs Non-English)
- Optimized for texts up to 256 UTF-8 bytes
- May have reduced accuracy on very rare languages not in training data
- Not suitable for multilingual text (mixed languages in single input)
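For inputs longer than 256 bytes, one simple workaround (an illustration, not part of the released package) is to score fixed-size byte windows and average the probabilities. Naive byte splits can bisect a multibyte UTF-8 character; decoding with `errors='ignore'` drops the damaged fragment rather than raising:

```python
def predict_long(text, predict_fn, window_bytes=256):
    """Average English probability over fixed-size byte windows.

    `predict_fn` is any text -> probability scorer, e.g. the ONNX
    predict() shown in the Quick Start.
    """
    data = text.encode('utf-8', errors='ignore')
    if len(data) <= window_bytes:
        return predict_fn(text)
    windows = [data[i:i + window_bytes] for i in range(0, len(data), window_bytes)]
    scores = [predict_fn(w.decode('utf-8', errors='ignore')) for w in windows]
    return sum(scores) / len(scores)
```

Averaging treats all windows equally; it will not rescue genuinely mixed-language inputs, which remain out of scope per the limitations above.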

## License

MIT License - free for commercial use.