FineType CharCNN
Precision format detection for text data. Given any string value, FineType classifies it into one of 151 semantic types across 6 domains. Each type is a transformation contract that guarantees a DuckDB cast expression will succeed.
- GitHub: noon-org/finetype
- Project page: noon.sh/projects/finetype
Model Description
FineType uses a character-level CNN (CharCNN) architecture for classifying raw text strings into semantic types. The model operates at the character level (no tokenizer needed), making it effective for structured data formats like dates, IPs, UUIDs, and phone numbers.
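As a rough illustration of tokenizer-free input, the sketch below maps each character of a string to an integer index and pads or truncates to a fixed length. The vocabulary layout (index 0 for padding, 1 for unknown characters, 2 to 96 for printable ASCII) is an assumption chosen to add up to a 97-entry vocabulary; the actual FineType mapping may differ, and `encode` is a hypothetical name.

```rust
/// Fixed input length, matching the 128-character limit described for the model.
const MAX_LEN: usize = 128;
/// Assumed layout: 0 = padding, 1 = unknown, 2..=96 = printable ASCII (0x20..=0x7E).
const PAD: u8 = 0;
const UNK: u8 = 1;

/// Hypothetical encoder: one index per character, truncated or right-padded to MAX_LEN.
fn encode(input: &str) -> [u8; MAX_LEN] {
    let mut ids = [PAD; MAX_LEN];
    for (slot, ch) in ids.iter_mut().zip(input.chars()) {
        *slot = match ch {
            ' '..='~' => (ch as u8) - b' ' + 2, // 95 printable ASCII characters
            _ => UNK,                           // everything else shares one index
        };
    }
    ids
}

fn main() {
    let ids = encode("192.168.1.1");
    // The first 11 slots hold character indices; the rest are padding.
    println!("{:?}", &ids[..12]);
}
```

Because every character gets an index, no word or subword vocabulary has to be learned, which suits values such as IP addresses and UUIDs that have no natural token boundaries.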
Two model configurations are provided:
| Model | Architecture | Accuracy | Size | Description |
|---|---|---|---|---|
| char-cnn-v2 (flat) | Single 151-class CharCNN | 91.97% | 331 KB | Best for single-value classification |
| tiered | 38 hierarchical CharCNNs | 90.00% | 11 MB | Tier 0 → Tier 1 → Tier 2 cascade |
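The tiered configuration can be pictured as label-driven routing: the label predicted at one tier selects the classifier used at the next. The sketch below is a minimal illustration of that idea, not FineType's implementation. `Cascade`, the toy rule-based classifiers, and the labels `technology.internet` and `technology.internet.ip_v6` are all hypothetical stand-ins; in the real cascade each node would be a CharCNN.

```rust
use std::collections::HashMap;

/// Stand-in for one classifier node; in the real cascade each node would be a CharCNN.
type Classifier = fn(&str) -> &'static str;

/// Hypothetical three-tier cascade: each predicted label selects the next classifier.
struct Cascade {
    tier0: Classifier,
    tier1: HashMap<&'static str, Classifier>,
    tier2: HashMap<&'static str, Classifier>,
}

impl Cascade {
    fn classify(&self, value: &str) -> &'static str {
        let broad = (self.tier0)(value);
        // When no deeper classifier is registered for a label, stop at that label.
        let category = match self.tier1.get(broad) {
            Some(next) => next(value),
            None => return broad,
        };
        match self.tier2.get(category) {
            Some(next) => next(value),
            None => category,
        }
    }
}

// Toy rule-based classifiers, used only to make the routing observable.
fn toy_tier0(v: &str) -> &'static str {
    if v.contains('.') || v.contains(':') { "technology" } else { "representation" }
}
fn toy_tier1(_v: &str) -> &'static str {
    "technology.internet"
}
fn toy_tier2(v: &str) -> &'static str {
    if v.contains(':') { "technology.internet.ip_v6" } else { "technology.internet.ip_v4" }
}

fn toy_cascade() -> Cascade {
    Cascade {
        tier0: toy_tier0,
        tier1: HashMap::from([("technology", toy_tier1 as Classifier)]),
        tier2: HashMap::from([("technology.internet", toy_tier2 as Classifier)]),
    }
}

fn main() {
    let cascade = toy_cascade();
    for value in ["192.168.1.1", "::1", "hello"] {
        println!("{value} -> {}", cascade.classify(value));
    }
}
```

One consequence of this design is that an error at an early tier cannot be recovered later, which is a plausible reason a cascade can score slightly below a single flat classifier.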
Architecture Details
Input: raw string (up to 128 characters)
↓
Character embedding: vocab=97, embed_dim=32
↓
Parallel 1D convolutions: kernel_sizes=[2,3,4,5], num_filters=64 each
↓
Max pooling over sequence length (256-dim feature vector)
↓
Dense: 256 → hidden_dim=128 (ReLU)
↓
Output: 128 → n_classes (softmax)
- Parameters: ~56K (flat model)
- Framework: Candle (pure Rust)
- Weights format: SafeTensors
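For a sense of scale, the layer sizes listed above can be turned into a parameter count. The sketch below assumes every convolution and dense layer carries a bias term, which the card does not state. Under that assumption the total comes to 84,407 parameters, or 337,628 bytes as 32-bit floats. That is close to the 331 KB weights file, so the ~56K figure above may follow a different counting convention.

```rust
/// Parameter count for a CharCNN with the given layer sizes,
/// assuming every convolution and dense layer has a bias term.
fn param_count(
    vocab: usize,
    embed: usize,
    kernels: &[usize],
    filters: usize,
    hidden: usize,
    classes: usize,
) -> usize {
    let embedding = vocab * embed;
    // Each 1D convolution has embed * filters * kernel weights plus one bias per filter.
    let convs: usize = kernels.iter().map(|k| embed * filters * k + filters).sum();
    // Max-pooled outputs of all convolutions are concatenated into one feature vector.
    let pooled = kernels.len() * filters;
    let dense = pooled * hidden + hidden;
    let output = hidden * classes + classes;
    embedding + convs + dense + output
}

fn main() {
    let n = param_count(97, 32, &[2, 3, 4, 5], 64, 128, 151);
    println!("{} parameters, {} bytes as f32", n, n * 4);
}
```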
Taxonomy
151 types across 6 domains:
| Domain | Types | Examples |
|---|---|---|
| datetime | 46 | ISO dates, RFC 3339 timestamps, Unix epochs, time zones |
| technology | 34 | IPv4/IPv6, MAC addresses, UUIDs, hashes, URLs |
| identity | 25 | Emails, phone numbers, credit cards, SSNs, names |
| representation | 19 | JSON, CSV, XML, YAML, integers, floats, booleans |
| geography | 16 | Coordinates, postal codes, country codes, addresses |
| container | 11 | Arrays (comma/pipe/space-separated), key-value pairs |
Training
- Data: Synthetically generated using 151 type-specific generators with locale-aware sampling
- Training set: 74,500 examples (v1), balanced at ~500 per type
- Test set: 14,900 examples (v1), 100 per type
- Optimizer: AdamW, learning rate 1e-3
- Epochs: 20 (v2)
- Hardware: CPU-only training (completes in minutes)
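A class-balanced synthetic set of this kind can be built by running one generator per type label the same number of times. The sketch below shows the idea with two hypothetical generators and a small built-in PRNG so it needs no external crates. `XorShift`, `gen_ipv4`, `gen_iso_date`, and `build_dataset` are illustrative names, and the locale-aware sampling mentioned above is not modeled.

```rust
/// Tiny deterministic PRNG (xorshift64) so the sketch needs no external crates.
/// The seed must be non-zero.
struct XorShift(u64);

impl XorShift {
    fn next_u64(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
    fn below(&mut self, n: u64) -> u64 {
        self.next_u64() % n
    }
}

type Generator = fn(&mut XorShift) -> String;

/// Hypothetical generator for dotted-quad IPv4 addresses.
fn gen_ipv4(rng: &mut XorShift) -> String {
    format!("{}.{}.{}.{}", rng.below(256), rng.below(256), rng.below(256), rng.below(256))
}

/// Hypothetical generator for ISO dates; days are capped at 28 so every month is valid.
fn gen_iso_date(rng: &mut XorShift) -> String {
    format!("{:04}-{:02}-{:02}", 1970 + rng.below(80), 1 + rng.below(12), 1 + rng.below(28))
}

/// Produce `per_type` examples from every generator, giving a class-balanced set.
fn build_dataset(per_type: usize, seed: u64) -> Vec<(String, &'static str)> {
    let generators = [
        ("technology.internet.ip_v4", gen_ipv4 as Generator),
        ("datetime.date.iso", gen_iso_date as Generator),
    ];
    let mut rng = XorShift(seed);
    let mut rows = Vec::new();
    for &(label, generate) in generators.iter() {
        for _ in 0..per_type {
            rows.push((generate(&mut rng), label));
        }
    }
    rows
}

fn main() {
    for (value, label) in build_dataset(2, 42) {
        println!("{label}\t{value}");
    }
}
```

Fixing the seed makes the dataset reproducible, which helps when comparing model versions trained on the same generators.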
Post-Processing Rules
6 deterministic format-based corrections are applied after model inference, improving macro F1 from 87.9% to 90.8%:
- RFC 3339 vs ISO 8601: T vs space separator detection
- Hash vs token_hex: standard hash lengths (32/40/64/128)
- Emoji vs gender_symbol: Unicode character identity check
- ISSN vs postal_code: XXXX-XXX[0-9X] pattern matching
- Longitude vs latitude: out-of-range check (|value| > 90)
- Email rescue: @ sign check for hostname/username/slug predictions
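A deterministic correction pass of this kind can be written as a function from the raw value and the predicted label to a possibly corrected label. The sketch below covers four of the six rules; the RFC 3339 and emoji rules are left out because the card does not spell out their direction. `post_process`, the short label strings, and the direction of each correction are assumptions for illustration, not FineType's actual code.

```rust
/// True when `s` looks like an ISSN: four digits, a hyphen, three digits, then a digit or X.
fn looks_like_issn(s: &str) -> bool {
    let b = s.as_bytes();
    b.len() == 9
        && b[..4].iter().all(u8::is_ascii_digit)
        && b[4] == b'-'
        && b[5..8].iter().all(u8::is_ascii_digit)
        && (b[8].is_ascii_digit() || b[8] == b'X')
}

/// Hypothetical post-processing step: returns the predicted label, or a corrected one
/// when a format check contradicts the prediction. Label names are illustrative.
fn post_process<'a>(value: &str, predicted: &'a str) -> &'a str {
    let is_hex = !value.is_empty() && value.chars().all(|c| c.is_ascii_hexdigit());
    let standard_len = [32, 40, 64, 128].contains(&value.len());
    match predicted {
        // A hex string of a standard digest length is treated as a hash, otherwise as a token.
        "token_hex" if is_hex && standard_len => "hash",
        "hash" if is_hex && !standard_len => "token_hex",
        // The ISSN pattern overrides a postal-code prediction.
        "postal_code" if looks_like_issn(value) => "issn",
        // Latitudes cannot exceed 90 degrees in magnitude.
        "latitude" if value.parse::<f64>().map_or(false, |v| v.abs() > 90.0) => "longitude",
        // A value containing '@' is rescued from hostname, username, or slug predictions.
        "hostname" | "username" | "slug" if value.contains('@') => "email",
        other => other,
    }
}

fn main() {
    println!("{}", post_process("2049-3630", "postal_code"));
    println!("{}", post_process("123.45", "latitude"));
    println!("{}", post_process("hello@example.com", "slug"));
}
```

Each rule only fires when the prediction is one of a known confusable pair, so labels outside those pairs pass through unchanged.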
Performance
Per-Domain F1
| Domain | Types | Avg F1 |
|---|---|---|
| container | 11 | 0.983 |
| identity | 25 | 0.937 |
| datetime | 46 | 0.920 |
| technology | 34 | 0.906 |
| geography | 16 | 0.886 |
| representation | 19 | 0.874 |
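For readers who want to reproduce numbers of this shape, the sketch below computes per-class F1 from (true label, predicted label) pairs, then averages it overall (macro F1) and per domain. It assumes the domain is the part of the label before the first dot and that "Avg F1" is an unweighted mean over a domain's types. Neither is confirmed by the card, and the function names are hypothetical.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Per-class F1 from (true label, predicted label) pairs, for every class that occurs as a true label.
fn per_class_f1(pairs: &[(&str, &str)]) -> BTreeMap<String, f64> {
    let classes: BTreeSet<&str> = pairs.iter().map(|&(t, _)| t).collect();
    let mut out = BTreeMap::new();
    for class in classes {
        let tp = pairs.iter().filter(|&&(t, p)| t == class && p == class).count() as f64;
        let fp = pairs.iter().filter(|&&(t, p)| t != class && p == class).count() as f64;
        let fn_ = pairs.iter().filter(|&&(t, p)| t == class && p != class).count() as f64;
        // F1 = 2TP / (2TP + FP + FN); the guard avoids dividing by zero.
        let denom = 2.0 * tp + fp + fn_;
        let f1 = if denom == 0.0 { 0.0 } else { 2.0 * tp / denom };
        out.insert(class.to_string(), f1);
    }
    out
}

/// Macro F1: the unweighted mean of per-class F1 scores.
fn macro_f1(f1: &BTreeMap<String, f64>) -> f64 {
    f1.values().sum::<f64>() / f1.len() as f64
}

/// Mean per-class F1 within each domain, taking the domain as the text before the first '.'.
fn domain_avg_f1(f1: &BTreeMap<String, f64>) -> BTreeMap<String, f64> {
    let mut sums: BTreeMap<String, (f64, usize)> = BTreeMap::new();
    for (label, score) in f1 {
        let domain = label.split('.').next().unwrap_or(label).to_string();
        let entry = sums.entry(domain).or_insert((0.0, 0));
        entry.0 += score;
        entry.1 += 1;
    }
    sums.into_iter().map(|(d, (s, n))| (d, s / n as f64)).collect()
}

fn main() {
    let pairs = [
        ("datetime.date.iso", "datetime.date.iso"),
        ("datetime.date.iso", "technology.internet.ip_v4"),
        ("technology.internet.ip_v4", "technology.internet.ip_v4"),
    ];
    let f1 = per_class_f1(&pairs);
    println!("macro F1 = {:.3}", macro_f1(&f1));
    for (domain, avg) in domain_avg_f1(&f1) {
        println!("{domain}: {avg:.3}");
    }
}
```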
Inference Speed
| Metric | Value |
|---|---|
| Model load (cold) | 66 ms |
| Model load (warm) | 25-30 ms |
| Single inference (p50) | 26 ms |
| Single inference (p95) | 41 ms |
| Batch throughput | 600-750 values/sec |
| Peak memory | 8.5 MB |
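Latency percentiles such as p50 and p95 can be measured by timing repeated calls and reading values off the sorted samples. The sketch below uses the nearest-rank definition and a stand-in workload. How the figures in the table were actually measured is not stated, so this is one possible method, and `percentile` and `measure` are hypothetical helpers.

```rust
use std::time::{Duration, Instant};

/// Nearest-rank percentile over a set of latency samples (p between 0 and 100).
fn percentile(samples: &[Duration], p: f64) -> Duration {
    assert!(!samples.is_empty(), "need at least one sample");
    let mut sorted = samples.to_vec();
    sorted.sort();
    // Rank is 1-based: the smallest sample covering at least p percent of the data.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Time `runs` calls of `f` and return one latency sample per call.
fn measure<F: FnMut()>(runs: usize, mut f: F) -> Vec<Duration> {
    (0..runs)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed()
        })
        .collect()
}

fn main() {
    // Stand-in workload; replace with a call to the classifier being measured.
    let samples = measure(200, || {
        std::hint::black_box("192.168.1.1".chars().filter(|c| *c == '.').count());
    });
    println!(
        "p50 = {:?}, p95 = {:?}",
        percentile(&samples, 50.0),
        percentile(&samples, 95.0)
    );
}
```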
Real-World Validation
Evaluated against the GitTables benchmark (2,363 columns, 883 tables):
- Format-detectable types: 85-100% accuracy (URLs 90%, timestamps 100%, dates 88%)
- Column-mode with disambiguation improves geography by +9.7%
Usage
Rust (CLI)
# Install
cargo install finetype-cli
# Classify a value
finetype infer "192.168.1.1"
# → technology.internet.ip_v4
# Profile a CSV
finetype profile data.csv --model models/default
DuckDB Extension
LOAD finetype;
SELECT finetype('hello@example.com');
-- → identity.person.email
-- Profile a column
SELECT finetype_detail(email_column) FROM my_table;
Rust Library
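use finetype_model::FlatClassifier;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load the flat 151-class model, then classify a single value.
    let model = FlatClassifier::load("models/char-cnn-v2")?;
    let result = model.classify("2024-01-15")?;
    println!("{} ({:.1}%)", result.label, result.confidence * 100.0);
    // → datetime.date.iso (98.5%)
    Ok(())
}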
Files
.
├── char-cnn-v2/              # Flat 151-class model (recommended)
│   ├── model.safetensors     # 331 KB weights
│   ├── config.yaml           # Architecture config
│   ├── labels.json           # 151 class labels
│   └── eval_results.json     # Per-class metrics
├── char-cnn-v1/              # Earlier version (v1 training data)
│   ├── model.safetensors
│   ├── config.yaml
│   └── labels.json
└── tiered/                   # Hierarchical 38-model cascade
    ├── tier0/                # Broad type classifier (15 classes)
    ├── tier1_*/              # Category classifiers
    ├── tier2_*/              # Specific type classifiers
    ├── tier_graph.json       # Routing graph
    └── eval_results.json     # Per-class metrics
Limitations
- Format, not semantics: FineType detects data format, not meaning. A column of years (2020, 2021) may classify as integer_number without column context.
- Synthetic training data: The model is trained on generated examples, which may not cover all real-world formatting variations.
- English-centric: While locale-aware for dates and addresses, the taxonomy is primarily designed for English-language data.
- Ambiguous short values: Single digits, short codes, and PIN-like values are inherently ambiguous without column context.
- Column-mode required for disambiguation: Year detection, coordinate resolution, and postal code disambiguation require multiple values from the same column.
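The value of column context can be sketched as a two-step process: classify each value on its own, take a majority vote, then apply rules that only make sense across a whole column. The example below is a minimal illustration under stated assumptions. `classify_column`, the `year` and `text` labels, the 1900 to 2100 range, and the toy per-value classifier are all hypothetical, not FineType's actual column mode.

```rust
use std::collections::HashMap;

/// Hypothetical column-mode step: combine per-value labels by majority vote, then
/// apply one column-level rule that no single value could justify on its own.
fn classify_column(values: &[&str], classify: fn(&str) -> &'static str) -> &'static str {
    let mut votes: HashMap<&'static str, usize> = HashMap::new();
    for v in values {
        *votes.entry(classify(v)).or_insert(0) += 1;
    }
    // Ties are broken by label name so the result is deterministic.
    let winner = votes
        .into_iter()
        .max_by(|a, b| a.1.cmp(&b.1).then(b.0.cmp(a.0)))
        .map(|(label, _)| label)
        .unwrap_or("unknown");
    // Illustrative rule: a column of integers that all fall in 1900..=2100 is read as years.
    let all_years = !values.is_empty()
        && values
            .iter()
            .all(|v| v.parse::<u32>().map_or(false, |n| (1900..=2100).contains(&n)));
    if winner == "integer_number" && all_years {
        "year"
    } else {
        winner
    }
}

/// Toy per-value classifier standing in for the model.
fn toy_classify(v: &str) -> &'static str {
    if v.parse::<i64>().is_ok() { "integer_number" } else { "text" }
}

fn main() {
    println!("{}", classify_column(&["2020", "2021", "2019"], toy_classify));
    println!("{}", classify_column(&["7", "2021", "42"], toy_classify));
}
```

A single value like 2021 stays ambiguous, but a column in which every value falls in a plausible year range gives the rule enough evidence to fire.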
Citation
@software{finetype2026,
title = {FineType: Precision Format Detection for Text Data},
author = {Cameron, Hugh},
year = {2026},
url = {https://github.com/noon-org/finetype},
license = {MIT}
}
Evaluation results
- Test Accuracy (flat model) on finetype-training (synthetic), self-reported: 0.920
- Test Accuracy (tiered model) on finetype-training (synthetic), self-reported: 0.900
- Macro F1 (flat + post-processing) on finetype-training (synthetic), self-reported: 0.908