---
language:
- en
license: mit
library_name: fasttext
tags:
- fasttext
- text-classification
- register-classification
- text-type
- content-filtering
- data-curation
pipeline_tag: text-classification
datasets:
- TurkuNLP/register_oscar
metrics:
- precision
- recall
- f1
---

# Text Register FastText Classifier

A FastText classifier that detects the **communicative register** (text type) of any English text at ~500k predictions/sec on CPU.

## Labels

| Code | Register | Description | Example |
|------|----------|-------------|---------|
| `IN` | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| `NA` | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| `OP` | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| `IP` | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| `HI` | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| `ID` | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| `SP` | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| `LY` | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |

Based on the Biber & Egbert (2018) register taxonomy. Multi-label prediction is supported (a text can be, e.g., both Informational and Narrative).

## Quick Start

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download model (quantized, 151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin",
)
model = fasttext.load_model(model_path)

# Predict the top-3 registers
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# labels -> ('__label__IP', ...); probs -> array([1., ...])
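# The raw (labels, probs) pair is awkward downstream. A small helper
# (illustrative only -- not part of fasttext or this repo) turns it into
# a clean {register: probability} dict above a confidence threshold:
def to_registers(labels, probs, threshold=0.3):
    return {l.replace("__label__", ""): round(float(p), 3)
            for l, p in zip(labels, probs) if p >= threshold}

# to_registers(("__label__IP", "__label__OP"), (0.98, 0.12)) -> {'IP': 0.98}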
# IP = Persuasion
```

> **Note**: If you get a numpy error, pin `numpy<2`: `pip install "numpy<2"`

## Performance

Trained on 10 English shards from [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (~1.9M documents), balanced via oversampling/undersampling to median class size.

### Overall Metrics

| Metric | Full Model | Quantized |
|--------|-----------|-----------|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |

### Per-Class F1 (threshold=0.3, k=2)

| Register | Precision | Recall | F1 | Test Support |
|----------|-----------|--------|-----|-------------|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |

### Example Predictions

```
"The company reported revenue of $4.2 billion..."  -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..."           -> Narrative
"I honestly think this movie is terrible..."       -> Opinion (1.00)
"To install the package, first run pip install..." -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..."      -> Persuasion (1.00)
"So like, I was telling her yesterday..."          -> Spoken (1.00)
"I've been walking these streets alone..."         -> Lyrical (1.00)
"Hey everyone! What do you think about..."         -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..."
    -> Persuasion (1.00)
```

## Use Cases

- **Data curation**: Filter pretraining corpora by register (e.g., keep only Informational + HowTo)
- **Content routing**: Route incoming text to different processing pipelines
- **Boilerplate removal**: Flag Persuasion/Marketing text in document corpora
- **Signal extraction**: Identify which paragraphs in a document carry factual vs opinion content
- **RAG preprocessing**: Score chunks by register before feeding to LLMs

## Reproduce from Scratch

### 1. Download data

```bash
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```

### 2. Prepare balanced training data

```bash
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```

### 3. Train

```bash
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```

### 4. Predict

```bash
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"
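
# Batch mode presumably expects one text per line in texts.txt (an
# assumption about scripts/predict.py, not verified) -- e.g.:
printf '%s\n' "Buy now! 50% off!" "Once upon a time in a small village" > texts.txt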
# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```

## Training Details

- **Source data**: [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (English, 10 shards, ~1.9M labeled documents)
- **Balancing**: Minority classes oversampled, majority classes undersampled to median class size (~129k per class)
- **Architecture**: FastText supervised with bigrams, 100-dim embeddings, one-vs-all loss
- **Hyperparameters**: lr=0.5, epoch=25, wordNgrams=2, dim=100, loss=ova, bucket=2M
- **Text preprocessing**: Whitespace collapsed, documents truncated to 500 words

## Limitations

- **Spoken** and **Lyrical** classes have lower F1 (~0.53) due to limited unique training data, even after oversampling
- Trained on web text only; may not generalize well to domain-specific text (legal, medical)
- Bag-of-words model (with bigrams): it does not capture word order beyond bigrams or deep semantics
- English only (the source dataset has other languages that could be used for multilingual training)

## Citation

If you use this model, please cite the source dataset:

```bibtex
@inproceedings{register_oscar,
  title={Multilingual register classification on the full OSCAR data},
  author={R{\"o}nnqvist, Samuel and others},
  year={2023},
  note={TurkuNLP, University of Turku}
}

@article{biber2018register,
  title={Register as a predictor of linguistic variation},
  author={Biber, Douglas and Egbert, Jesse},
  journal={Corpus Linguistics and Linguistic Theory},
  year={2018}
}
```

## License

The model weights inherit the license of the source dataset ([TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar)). Scripts are released under MIT.
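## Example: Filtering by Register

The data-curation use case above can be sketched end to end. This is a minimal sketch, not a released script: `filter_corpus`, `predict_fn`, and `KEEP` are illustrative names, and `predict_fn` stands in for `model.predict` from the Quick Start.

```python
# Keep only Informational (IN) and HowTo (HI) documents from a corpus.
# `predict_fn` stands in for model.predict; names here are illustrative.
KEEP = {"__label__IN", "__label__HI"}

def filter_corpus(docs, predict_fn, threshold=0.3):
    kept = []
    for doc in docs:
        # fastText's predict() rejects embedded newlines, so collapse whitespace;
        # k=-1 returns every label scoring above the threshold
        labels, _probs = predict_fn(" ".join(doc.split()), k=-1, threshold=threshold)
        if any(label in KEEP for label in labels):
            kept.append(doc)
    return kept
```

With the real model this becomes `filter_corpus(docs, model.predict)`; swap `KEEP` for whichever registers your pipeline should retain.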