|
|
--- |
|
|
language: multilingual |
|
|
license: mit |
|
|
library_name: pytorch |
|
|
tags: |
|
|
- text-classification |
|
|
- language-detection |
|
|
- byte-level |
|
|
- multilingual |
|
|
- english-detection |
|
|
- cnn |
|
|
pipeline_tag: text-classification |
|
|
datasets: |
|
|
- custom |
|
|
metrics: |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: innit |
|
|
results: |
|
|
- task: |
|
|
type: text-classification |
|
|
name: English vs Non-English Detection |
|
|
metrics: |
|
|
- type: accuracy |
|
|
value: 99.94 |
|
|
name: Validation Accuracy |
|
|
- type: accuracy |
|
|
value: 100.0 |
|
|
name: Challenge Set Accuracy |
|
|
--- |
|
|
|
|
|
# innit: Fast English vs Non-English Text Detection |
|
|
|
|
|
A lightweight byte-level CNN for fast binary language detection (English vs Non-English). |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Type**: Byte-level Convolutional Neural Network |
|
|
- **Task**: Binary text classification (English vs Non-English) |
|
|
- **Architecture**: TinyByteCNN_EN with 6 convolutional blocks |
|
|
- **Parameters**: 156,642 |
|
|
- **Input**: Raw UTF-8 bytes (max 256 bytes) |
|
|
- **Output**: Binary classification (0=Non-English, 1=English) |
|
|
|
|
|
## Performance |
|
|
|
|
|
- **Validation Accuracy**: 99.94% |
|
|
- **Challenge Set Accuracy**: 100% (14/14 test cases) |
|
|
- **Inference Speed**: Sub-millisecond on modern CPUs |
|
|
- **Model Size**: ~600KB |
|
|
|
|
|
## Supported Languages |
|
|
|
|
|
Trained to distinguish English from 52+ languages across diverse scripts: |
|
|
- **Latin scripts**: Spanish, French, German, Italian, Dutch, Portuguese, etc. |
|
|
- **CJK scripts**: Chinese (Simplified/Traditional), Japanese, Korean |
|
|
- **Cyrillic scripts**: Russian, Ukrainian, Bulgarian, Serbian |
|
|
- **Other scripts**: Arabic, Hindi, Bengali, Thai, Hebrew, etc. |
|
|
|
|
|
## Architecture |
|
|
|
|
|
``` |
|
|
TinyByteCNN_EN: |
|
|
βββ Embedding: 257 β 80 dimensions (256 bytes + padding) |
|
|
βββ 6x Convolutional Blocks: |
|
|
β βββ Conv1D (kernel=3, residual connections) |
|
|
β βββ GELU activation |
|
|
β βββ BatchNorm1D |
|
|
β βββ Dropout (0.15) |
|
|
βββ Enhanced Pooling: mean + max + std |
|
|
βββ Classification Head: 240 β 80 β 2 |
|
|
``` |
|
|
|
|
|
## Training Data |
|
|
|
|
|
- **Total samples**: 17,543 balanced samples |
|
|
- **English**: 8,772 samples from diverse sources |
|
|
- **Non-English**: 8,771 samples across 52+ languages |
|
|
- **Text lengths**: 3-276 characters (optimized for short texts) |
|
|
- **Special coverage**: Emoji handling, mathematical formulas, scientific notation |
|
|
|
|
|
## Quick Start |
|
|
|
|
|
### Option 1: ONNX Runtime (Recommended) |
|
|
```python |
|
|
import onnxruntime as ort |
|
|
import numpy as np |
|
|
|
|
|
# Load ONNX model |
|
|
session = ort.InferenceSession("model.onnx") |
|
|
|
|
|
def predict(text): |
|
|
# Prepare input |
|
|
bytes_data = text.encode('utf-8', errors='ignore')[:256] |
|
|
padded = np.zeros(256, dtype=np.int64) |
|
|
padded[:len(bytes_data)] = list(bytes_data) |
|
|
|
|
|
# Run inference |
|
|
outputs = session.run(['logits'], {'input_bytes': padded.reshape(1, -1)}) |
|
|
logits = outputs[0][0] |
|
|
|
|
|
# Apply softmax |
|
|
exp_logits = np.exp(logits - np.max(logits)) |
|
|
probs = exp_logits / np.sum(exp_logits) |
|
|
return probs[1] # English probability |
|
|
|
|
|
# Examples |
|
|
print(predict("Hello world!")) # ~1.0 (English) |
|
|
print(predict("Bonjour le monde")) # ~0.0 (French) |
|
|
print(predict("Check our sale! π")) # ~1.0 (English with emoji) |
|
|
``` |
|
|
|
|
|
### Option 2: Python Package |
|
|
```bash |
|
|
# Install the utility package |
|
|
pip install innit-detector |
|
|
|
|
|
# CLI usage |
|
|
innit "Hello world!" # β English (confidence: 0.974) |
|
|
innit --download # Download model first |
|
|
innit "Hello" "Bonjour" "δ½ ε₯½" # Multiple texts |
|
|
|
|
|
# Library usage |
|
|
from innit_detector import InnitDetector |
|
|
detector = InnitDetector() |
|
|
result = detector.predict("Hello world!") |
|
|
print(result['is_english']) # True |
|
|
``` |
|
|
|
|
|
### Option 3: PyTorch (Advanced) |
|
|
```python |
|
|
import torch |
|
|
import torch.nn.functional as F |
|
|
from safetensors.torch import load_file |
|
|
import numpy as np |
|
|
|
|
|
# Load model (requires TinyByteCNN_EN class definition) |
|
|
state_dict = load_file("model.safetensors") |
|
|
model = TinyByteCNN_EN(emb=80, blocks=6, dropout=0.15) |
|
|
model.load_state_dict(state_dict) |
|
|
model.eval() |
|
|
|
|
|
def predict(text): |
|
|
bytes_data = text.encode('utf-8', errors='ignore')[:256] |
|
|
padded = np.zeros(256, dtype=np.long) |
|
|
padded[:len(bytes_data)] = list(bytes_data) |
|
|
|
|
|
with torch.no_grad(): |
|
|
logits = model(torch.tensor(padded).unsqueeze(0)) |
|
|
probs = F.softmax(logits, dim=1) |
|
|
return probs[0][1].item() |
|
|
``` |
|
|
|
|
|
## ONNX Support |
|
|
|
|
|
ONNX version available for cross-platform deployment: |
|
|
- `model.onnx` - Full precision (FP32) for maximum compatibility |
|
|
|
|
|
## Challenge Set Results |
|
|
|
|
|
Perfect 100% accuracy on comprehensive test cases: |
|
|
- Ultra-short texts: "Good morning!" β
|
|
|
- Emoji handling: "Check out our sale! π" β
|
|
|
- Mathematical formulas: "x = (-b Β± β(bΒ²-4ac))/2a" β
|
|
|
- Scientific notation: "COβ + HβO β CβHββOβ" β
|
|
|
- Diverse scripts: Arabic, CJK, Cyrillic, Devanagari β
|
|
|
- English-like languages: Dutch, German β
|
|
|
|
|
|
## Limitations |
|
|
|
|
|
- Binary classification only (English vs Non-English) |
|
|
- Optimized for texts up to 256 UTF-8 bytes |
|
|
- May have reduced accuracy on very rare languages not in training data |
|
|
- Not suitable for multilingual text (mixed languages in single input) |
|
|
|
|
|
## License |
|
|
|
|
|
MIT License - free for commercial use. |