FineType CharCNN

Precision format detection for text data. Given any string value, FineType classifies it into one of 151 semantic types across 6 domains; each type is a transformation contract that guarantees a DuckDB cast expression will succeed.

Model Description

FineType uses a character-level CNN (CharCNN) to classify raw text strings into semantic types. The model operates at the character level, so no tokenizer is needed, which makes it effective for structured data formats such as dates, IPs, UUIDs, and phone numbers.

Two model configurations are provided:

| Model | Architecture | Accuracy | Size | Description |
|---|---|---|---|---|
| char-cnn-v2 (flat) | Single 151-class CharCNN | 91.97% | 331 KB | Best for single-value classification |
| tiered | 38 hierarchical CharCNNs | 90.00% | 11 MB | Tier 0 → Tier 1 → Tier 2 cascade |
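The tiered configuration routes each value through a cascade: the Tier 0 prediction selects which Tier 1 classifier to run, and so on until a leaf type is reached. A minimal sketch of that routing loop is below; the tier names, the toy graph, and the always-pick-first-child classifier are illustrative stand-ins, not the shipped `tier_graph.json` or model.

```rust
use std::collections::HashMap;

// Sketch of the tiered cascade: each tier's prediction selects the next
// classifier to run until a leaf type is reached. The tier names and the
// routing graph below are illustrative stand-ins for tier_graph.json.
fn route(
    graph: &HashMap<&str, Vec<&str>>,
    classify: impl Fn(&str, &str) -> String,
    value: &str,
) -> String {
    let mut node = String::from("tier0");
    // Internal nodes appear as keys in the graph; leaves do not.
    while graph.contains_key(node.as_str()) {
        node = classify(&node, value);
    }
    node // leaf: a concrete semantic type
}

fn main() {
    let mut graph: HashMap<&str, Vec<&str>> = HashMap::new();
    graph.insert("tier0", vec!["tier1_technology"]);
    graph.insert("tier1_technology", vec!["technology.internet.ip_v4"]);
    // Toy per-tier classifier: always routes to the first child.
    let classify = |node: &str, _value: &str| graph[node][0].to_string();
    let label = route(&graph, classify, "192.168.1.1");
    assert_eq!(label, "technology.internet.ip_v4");
    println!("{label}");
}
```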

Architecture Details

Input: raw string (up to 128 characters)
  ↓
Character embedding: vocab=97, embed_dim=32
  ↓
Parallel 1D convolutions: kernel_sizes=[2,3,4,5], num_filters=64 each
  ↓
Max pooling over sequence length (256-dim feature vector)
  ↓
Dense: 256 → hidden_dim=128 (ReLU)
  ↓
Output: 128 → n_classes (softmax)

  • Parameters: ~56K (flat model)
  • Framework: Candle (pure Rust)
  • Weights format: SafeTensors

Taxonomy

151 types across 6 domains:

| Domain | Types | Examples |
|---|---|---|
| datetime | 46 | ISO dates, RFC 3339 timestamps, Unix epochs, time zones |
| technology | 34 | IPv4/IPv6, MAC addresses, UUIDs, hashes, URLs |
| identity | 25 | Emails, phone numbers, credit cards, SSNs, names |
| representation | 19 | JSON, CSV, XML, YAML, integers, floats, booleans |
| geography | 16 | Coordinates, postal codes, country codes, addresses |
| container | 11 | Arrays (comma/pipe/space-separated), key-value pairs |
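For the container domain, the core problem is spotting which delimiter an array-like string uses. A toy heuristic illustrating the idea (this is not FineType's actual detection logic) is to pick the candidate delimiter that occurs most often:

```rust
// Toy heuristic for the container domain: guess the delimiter of an
// array-like string. Illustration only, not FineType's actual logic.
fn guess_delimiter(s: &str) -> Option<char> {
    [',', '|', ' ']
        .into_iter()
        .map(|d| (d, s.matches(d).count()))
        .filter(|&(_, n)| n > 0) // delimiter must actually appear
        .max_by_key(|&(_, n)| n)
        .map(|(d, _)| d)
}

fn main() {
    assert_eq!(guess_delimiter("a,b,c"), Some(','));
    assert_eq!(guess_delimiter("1|2|3|4"), Some('|'));
    assert_eq!(guess_delimiter("plain"), None);
}
```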

Training

  • Data: Synthetically generated using 151 type-specific generators with locale-aware sampling
  • Training set: 74,500 examples (v1), balanced at ~500 per type
  • Test set: 14,900 examples (v1), 100 per type
  • Optimizer: AdamW, learning rate 1e-3
  • Epochs: 20 (v2)
  • Hardware: CPU-only training (completes in minutes)

Post-Processing Rules

Six deterministic, format-based corrections are applied after model inference, raising macro F1 from 87.9% to 90.8%:

  1. RFC 3339 vs ISO 8601: T vs space separator detection
  2. Hash vs token_hex: standard hash lengths (32/40/64/128)
  3. Emoji vs gender_symbol: Unicode character identity check
  4. ISSN vs postal_code: XXXX-XXX[0-9X] pattern matching
  5. Longitude vs latitude: out-of-range check (|value| > 90)
  6. Email rescue: @ sign check for hostname/username/slug predictions
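Rules 5 and 6 can be sketched as a pure relabeling pass over the model's prediction. The label strings other than `identity.person.email` are illustrative guesses at the taxonomy's naming, and confidence handling is omitted:

```rust
// Sketch of two of the deterministic post-processing corrections
// (rules 5 and 6). Label names other than identity.person.email are
// hypothetical; the real rules also consider model confidence.
fn fix_label(pred: &str, value: &str) -> String {
    // Rule 5: a coordinate with |value| > 90 cannot be a latitude.
    if pred == "geography.coordinate.latitude" {
        if let Ok(v) = value.parse::<f64>() {
            if v.abs() > 90.0 {
                return "geography.coordinate.longitude".to_string();
            }
        }
    }
    // Rule 6: rescue values containing '@' that were predicted as
    // hostname, username, or slug.
    if value.contains('@')
        && matches!(
            pred,
            "technology.internet.hostname"
                | "identity.person.username"
                | "representation.text.slug"
        )
    {
        return "identity.person.email".to_string();
    }
    pred.to_string()
}

fn main() {
    assert_eq!(
        fix_label("geography.coordinate.latitude", "120.5"),
        "geography.coordinate.longitude"
    );
    assert_eq!(
        fix_label("technology.internet.hostname", "a@b.com"),
        "identity.person.email"
    );
    assert_eq!(fix_label("datetime.date.iso", "2024-01-15"), "datetime.date.iso");
}
```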

Performance

Per-Domain F1

| Domain | Types | Avg F1 |
|---|---|---|
| container | 11 | 0.983 |
| identity | 25 | 0.937 |
| datetime | 46 | 0.920 |
| technology | 34 | 0.906 |
| geography | 16 | 0.886 |
| representation | 19 | 0.874 |

Inference Speed

| Metric | Value |
|---|---|
| Model load (cold) | 66 ms |
| Model load (warm) | 25-30 ms |
| Single inference (p50) | 26 ms |
| Single inference (p95) | 41 ms |
| Batch throughput | 600-750 values/sec |
| Peak memory | 8.5 MB |

Real-World Validation

Evaluated against the GitTables benchmark (2,363 columns, 883 tables):

  • Format-detectable types: 85-100% accuracy (URLs 90%, timestamps 100%, dates 88%)
  • Column-mode with disambiguation improves geography by +9.7%

Usage

Rust (CLI)

# Install
cargo install finetype-cli

# Classify a value
finetype infer "192.168.1.1"
# → technology.internet.ip_v4

# Profile a CSV
finetype profile data.csv --model models/default

DuckDB Extension

LOAD finetype;

SELECT finetype('hello@example.com');
-- → identity.person.email

-- Profile a column
SELECT finetype_detail(email_column) FROM my_table;

Rust Library

use finetype_model::FlatClassifier;

let model = FlatClassifier::load("models/char-cnn-v2")?;
let result = model.classify("2024-01-15")?;
println!("{} ({:.1}%)", result.label, result.confidence * 100.0);
// → datetime.date.iso (98.5%)

Files

.
├── char-cnn-v2/           # Flat 151-class model (recommended)
│   ├── model.safetensors  # 331 KB weights
│   ├── config.yaml        # Architecture config
│   ├── labels.json        # 151 class labels
│   └── eval_results.json  # Per-class metrics
├── char-cnn-v1/           # Earlier version (v1 training data)
│   ├── model.safetensors
│   ├── config.yaml
│   └── labels.json
└── tiered/                # Hierarchical 38-model cascade
    ├── tier0/             # Broad type classifier (15 classes)
    ├── tier1_*/           # Category classifiers
    ├── tier2_*/           # Specific type classifiers
    ├── tier_graph.json    # Routing graph
    └── eval_results.json  # Per-class metrics

Limitations

  • Format, not semantics: FineType detects data format, not meaning. A column of years (2020, 2021) may classify as integer_number without column context.
  • Synthetic training data: The model is trained on generated examples, which may not cover all real-world formatting variations.
  • English-centric: While locale-aware for dates and addresses, the taxonomy is primarily designed for English-language data.
  • Ambiguous short values: Single digits, short codes, and PIN-like values are inherently ambiguous without column context.
  • Column-mode required for disambiguation: Year detection, coordinate resolution, and postal code disambiguation require multiple values from the same column.

Citation

@software{finetype2026,
  title = {FineType: Precision Format Detection for Text Data},
  author = {Cameron, Hugh},
  year = {2026},
  url = {https://github.com/noon-org/finetype},
  license = {MIT}
}