FineType CharCNN

Precision format detection for text data. Given any string value, FineType classifies it into one of 151 semantic types across 6 domains; each type is a transformation contract that guarantees a DuckDB cast expression will succeed.

Model Description

FineType uses a character-level CNN (CharCNN) to classify raw text strings into semantic types. The model operates at the character level, so no tokenizer is needed, which makes it effective for structured data formats such as dates, IPs, UUIDs, and phone numbers.

Two model configurations are provided:

| Model | Architecture | Accuracy | Size | Description |
|---|---|---|---|---|
| char-cnn-v2 (flat) | Single 151-class CharCNN | 91.97% | 331 KB | Best for single-value classification |
| tiered | 38 hierarchical CharCNNs | 90.00% | 11 MB | Tier 0 → Tier 1 → Tier 2 cascade |
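The tiered configuration routes each value through a cascade: the Tier 0 prediction selects which Tier 1 classifier to run, and so on until a leaf type is reached. A minimal sketch of that routing loop is below; the tier names, the toy graph, and the always-pick-first-child classifier are illustrative stand-ins, not the shipped `tier_graph.json` or model.

```rust
use std::collections::HashMap;

// Sketch of the tiered cascade: each tier's prediction selects the next
// classifier to run until a leaf type is reached. The tier names and the
// routing graph below are illustrative stand-ins for tier_graph.json.
fn route(
    graph: &HashMap<&str, Vec<&str>>,
    classify: impl Fn(&str, &str) -> String,
    value: &str,
) -> String {
    let mut node = String::from("tier0");
    // Internal nodes appear as keys in the graph; leaves do not.
    while graph.contains_key(node.as_str()) {
        node = classify(&node, value);
    }
    node // leaf: a concrete semantic type
}

fn main() {
    let mut graph: HashMap<&str, Vec<&str>> = HashMap::new();
    graph.insert("tier0", vec!["tier1_technology"]);
    graph.insert("tier1_technology", vec!["technology.internet.ip_v4"]);
    // Toy per-tier classifier: always routes to the first child.
    let classify = |node: &str, _value: &str| graph[node][0].to_string();
    let label = route(&graph, classify, "192.168.1.1");
    assert_eq!(label, "technology.internet.ip_v4");
    println!("{label}");
}
```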

Architecture Details

Input: raw string (up to 128 characters)
  ↓
Character embedding: vocab=97, embed_dim=32
  ↓
Parallel 1D convolutions: kernel_sizes=[2,3,4,5], num_filters=64 each
  ↓
Max pooling over sequence length (256-dim feature vector)
  ↓
Dense: 256 → hidden_dim=128 (ReLU)
  ↓
Output: 128 → n_classes (softmax)

  • Parameters: ~56K (flat model)
  • Framework: Candle (pure Rust)
  • Weights format: SafeTensors

Taxonomy

151 types across 6 domains:

| Domain | Types | Examples |
|---|---|---|
| datetime | 46 | ISO dates, RFC 3339 timestamps, Unix epochs, time zones |
| technology | 34 | IPv4/IPv6, MAC addresses, UUIDs, hashes, URLs |
| identity | 25 | Emails, phone numbers, credit cards, SSNs, names |
| representation | 19 | JSON, CSV, XML, YAML, integers, floats, booleans |
| geography | 16 | Coordinates, postal codes, country codes, addresses |
| container | 11 | Arrays (comma/pipe/space-separated), key-value pairs |
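For the container domain, the core problem is spotting which delimiter an array-like string uses. A toy heuristic illustrating the idea (this is not FineType's actual detection logic) is to pick the candidate delimiter that occurs most often:

```rust
// Toy heuristic for the container domain: guess the delimiter of an
// array-like string. Illustration only, not FineType's actual logic.
fn guess_delimiter(s: &str) -> Option<char> {
    [',', '|', ' ']
        .into_iter()
        .map(|d| (d, s.matches(d).count()))
        .filter(|&(_, n)| n > 0) // delimiter must actually appear
        .max_by_key(|&(_, n)| n)
        .map(|(d, _)| d)
}

fn main() {
    assert_eq!(guess_delimiter("a,b,c"), Some(','));
    assert_eq!(guess_delimiter("1|2|3|4"), Some('|'));
    assert_eq!(guess_delimiter("plain"), None);
}
```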

Training

  • Data: Synthetically generated using 151 type-specific generators with locale-aware sampling
  • Training set: 74,500 examples (v1), balanced at ~500 per type
  • Test set: 14,900 examples (v1), 100 per type
  • Optimizer: AdamW, learning rate 1e-3
  • Epochs: 20 (v2)
  • Hardware: CPU-only training (completes in minutes)

Post-Processing Rules

Six deterministic, format-based corrections are applied after model inference, raising macro F1 from 87.9% to 90.8%:

  1. RFC 3339 vs ISO 8601: T vs space separator detection
  2. Hash vs token_hex: standard hash lengths (32/40/64/128)
  3. Emoji vs gender_symbol: Unicode character identity check
  4. ISSN vs postal_code: XXXX-XXX[0-9X] pattern matching
  5. Longitude vs latitude: out-of-range check (|value| > 90)
  6. Email rescue: @ sign check for hostname/username/slug predictions
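Rules 5 and 6 can be sketched as a pure relabeling pass over the model's prediction. The label strings other than `identity.person.email` are illustrative guesses at the taxonomy's naming, and confidence handling is omitted:

```rust
// Sketch of two of the deterministic post-processing corrections
// (rules 5 and 6). Label names other than identity.person.email are
// hypothetical; the real rules also consider model confidence.
fn fix_label(pred: &str, value: &str) -> String {
    // Rule 5: a coordinate with |value| > 90 cannot be a latitude.
    if pred == "geography.coordinate.latitude" {
        if let Ok(v) = value.parse::<f64>() {
            if v.abs() > 90.0 {
                return "geography.coordinate.longitude".to_string();
            }
        }
    }
    // Rule 6: rescue values containing '@' that were predicted as
    // hostname, username, or slug.
    if value.contains('@')
        && matches!(
            pred,
            "technology.internet.hostname"
                | "identity.person.username"
                | "representation.text.slug"
        )
    {
        return "identity.person.email".to_string();
    }
    pred.to_string()
}

fn main() {
    assert_eq!(
        fix_label("geography.coordinate.latitude", "120.5"),
        "geography.coordinate.longitude"
    );
    assert_eq!(
        fix_label("technology.internet.hostname", "a@b.com"),
        "identity.person.email"
    );
    assert_eq!(fix_label("datetime.date.iso", "2024-01-15"), "datetime.date.iso");
}
```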

Performance

Per-Domain F1

| Domain | Types | Avg F1 |
|---|---|---|
| container | 11 | 0.983 |
| identity | 25 | 0.937 |
| datetime | 46 | 0.920 |
| technology | 34 | 0.906 |
| geography | 16 | 0.886 |
| representation | 19 | 0.874 |

Inference Speed

| Metric | Value |
|---|---|
| Model load (cold) | 66 ms |
| Model load (warm) | 25-30 ms |
| Single inference (p50) | 26 ms |
| Single inference (p95) | 41 ms |
| Batch throughput | 600-750 values/sec |
| Peak memory | 8.5 MB |

Real-World Validation

Evaluated against the GitTables benchmark (2,363 columns, 883 tables):

  • Format-detectable types: 85-100% accuracy (URLs 90%, timestamps 100%, dates 88%)
  • Column-mode with disambiguation improves geography by +9.7%

Usage

Rust (CLI)

# Install
cargo install finetype-cli

# Classify a value
finetype infer "192.168.1.1"
# → technology.internet.ip_v4

# Profile a CSV
finetype profile data.csv --model models/default

DuckDB Extension

LOAD finetype;

SELECT finetype('hello@example.com');
-- → identity.person.email

-- Profile a column
SELECT finetype_detail(email_column) FROM my_table;

Rust Library

use finetype_model::FlatClassifier;

let model = FlatClassifier::load("models/char-cnn-v2")?;
let result = model.classify("2024-01-15")?;
println!("{} ({:.1}%)", result.label, result.confidence * 100.0);
// → datetime.date.iso (98.5%)

Files

.
├── char-cnn-v2/           # Flat 151-class model (recommended)
│   ├── model.safetensors  # 331 KB weights
│   ├── config.yaml        # Architecture config
│   ├── labels.json        # 151 class labels
│   └── eval_results.json  # Per-class metrics
├── char-cnn-v1/           # Earlier version (v1 training data)
│   ├── model.safetensors
│   ├── config.yaml
│   └── labels.json
└── tiered/                # Hierarchical 38-model cascade
    ├── tier0/             # Broad type classifier (15 classes)
    ├── tier1_*/           # Category classifiers
    ├── tier2_*/           # Specific type classifiers
    ├── tier_graph.json    # Routing graph
    └── eval_results.json  # Per-class metrics

Limitations

  • Format, not semantics: FineType detects data format, not meaning. A column of years (2020, 2021) may classify as integer_number without column context.
  • Synthetic training data: The model is trained on generated examples, which may not cover all real-world formatting variations.
  • English-centric: While locale-aware for dates and addresses, the taxonomy is primarily designed for English-language data.
  • Ambiguous short values: Single digits, short codes, and PIN-like values are inherently ambiguous without column context.
  • Column-mode required for disambiguation: Year detection, coordinate resolution, and postal code disambiguation require multiple values from the same column.

Citation

@software{finetype2026,
  title = {FineType: Precision Format Detection for Text Data},
  author = {Cameron, Hugh},
  year = {2026},
  url = {https://github.com/noon-org/finetype},
  license = {MIT}
}