---
language:
- en
license: mit
library_name: fasttext
tags:
- fasttext
- text-classification
- register-classification
- text-type
- content-filtering
- data-curation
pipeline_tag: text-classification
datasets:
- TurkuNLP/register_oscar
metrics:
- precision
- recall
- f1
---

# Text Register FastText Classifier

A FastText classifier that detects the **communicative register** (text type) of any English text at ~500k predictions/sec on CPU.

## Labels

| Code | Register | Description | Example |
|------|----------|-------------|---------|
| `IN` | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| `NA` | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| `OP` | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| `IP` | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| `HI` | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| `ID` | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| `SP` | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| `LY` | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |

Based on the Biber & Egbert (2018) register taxonomy. Multi-label prediction is supported (a text can be, e.g., both Informational and Narrative).

## Quick Start

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download model (quantized, 151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin",
)
model = fasttext.load_model(model_path)

# Predict the top-3 registers
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# labels -> ('__label__IP', ...); probs -> array([1., ...])
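# The raw (labels, probs) pair is awkward downstream. A small helper
# (illustrative only -- not part of fasttext or this repo) turns it into
# a clean {register: probability} dict above a confidence threshold:
def to_registers(labels, probs, threshold=0.3):
    return {l.replace("__label__", ""): round(float(p), 3)
            for l, p in zip(labels, probs) if p >= threshold}

# to_registers(("__label__IP", "__label__OP"), (0.98, 0.12)) -> {'IP': 0.98}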
# IP = Persuasion
```

> **Note**: If you get a numpy error, pin `numpy<2`: `pip install "numpy<2"`

## Performance

Trained on 10 English shards from [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (~1.9M documents), balanced via oversampling/undersampling to median class size.

### Overall Metrics

| Metric | Full Model | Quantized |
|--------|-----------|-----------|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |

### Per-Class F1 (threshold=0.3, k=2)

| Register | Precision | Recall | F1 | Test Support |
|----------|-----------|--------|-----|-------------|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |

### Example Predictions

```
"The company reported revenue of $4.2 billion..."  -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..."           -> Narrative
"I honestly think this movie is terrible..."       -> Opinion (1.00)
"To install the package, first run pip install..." -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..."      -> Persuasion (1.00)
"So like, I was telling her yesterday..."          -> Spoken (1.00)
"I've been walking these streets alone..."         -> Lyrical (1.00)
"Hey everyone! What do you think about..."         -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..."
    -> Persuasion (1.00)
```

## Use Cases

- **Data curation**: Filter pretraining corpora by register (e.g., keep only Informational + HowTo)
- **Content routing**: Route incoming text to different processing pipelines
- **Boilerplate removal**: Flag Persuasion/Marketing text in document corpora
- **Signal extraction**: Identify which paragraphs in a document carry factual vs opinion content
- **RAG preprocessing**: Score chunks by register before feeding to LLMs

## Reproduce from Scratch

### 1. Download data

```bash
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```

### 2. Prepare balanced training data

```bash
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```

### 3. Train

```bash
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```

### 4. Predict

```bash
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"
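
# Batch mode presumably expects one text per line in texts.txt (an
# assumption about scripts/predict.py, not verified) -- e.g.:
printf '%s\n' "Buy now! 50% off!" "Once upon a time in a small village" > texts.txt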
# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```

## Training Details

- **Source data**: [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (English, 10 shards, ~1.9M labeled documents)
- **Balancing**: Minority classes oversampled, majority classes undersampled to median class size (~129k per class)
- **Architecture**: FastText supervised with bigrams, 100-dim embeddings, one-vs-all loss
- **Hyperparameters**: lr=0.5, epoch=25, wordNgrams=2, dim=100, loss=ova, bucket=2M
- **Text preprocessing**: Whitespace collapsed, documents truncated to 500 words

## Limitations

- **Spoken** and **Lyrical** classes have lower F1 (~0.53) due to limited unique training data, even after oversampling
- Trained on web text only; may not generalize well to domain-specific text (legal, medical)
- Bag-of-words model (with bigrams): it does not capture word order beyond bigrams or deep semantics
- English only (the source dataset has other languages that could be used for multilingual training)

## Citation

If you use this model, please cite the source dataset:

```bibtex
@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}
```

## License

The model weights inherit the license of the source dataset ([TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar)). Scripts are released under MIT.
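## Example: Filtering by Register

The data-curation use case above can be sketched end to end. This is a minimal sketch, not a released script: `filter_corpus`, `predict_fn`, and `KEEP` are illustrative names, and `predict_fn` stands in for `model.predict` from the Quick Start.

```python
# Keep only Informational (IN) and HowTo (HI) documents from a corpus.
# `predict_fn` stands in for model.predict; names here are illustrative.
KEEP = {"__label__IN", "__label__HI"}

def filter_corpus(docs, predict_fn, threshold=0.3):
    kept = []
    for doc in docs:
        # fastText's predict() rejects embedded newlines, so collapse whitespace;
        # k=-1 returns every label scoring above the threshold
        labels, _probs = predict_fn(" ".join(doc.split()), k=-1, threshold=threshold)
        if any(label in KEEP for label in labels):
            kept.append(doc)
    return kept
```

With the real model this becomes `filter_corpus(docs, model.predict)`; swap `KEEP` for whichever registers your pipeline should retain.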