Add text register FastText classifier with training scripts
Files changed:

- README.md (+168, -0)
- register_fasttext.bin (+3, -0)
- register_fasttext_q.bin (+3, -0)
- scripts/predict.py (+114, -0)
- scripts/prepare_data.py (+173, -0)
- scripts/train.py (+91, -0)
README.md
ADDED
# Text Register FastText Classifier

A FastText classifier that detects the **communicative register** (text type) of any English text at ~500k predictions/sec on CPU.

## Labels

| Code | Register | Description | Example |
|------|----------|-------------|---------|
| `IN` | Informational | Factual, encyclopedic, descriptive | Wikipedia articles, reports |
| `NA` | Narrative | Story-like, temporal sequence of events | News stories, fiction, blog posts |
| `OP` | Opinion | Subjective evaluation, personal views | Reviews, editorials, comments |
| `IP` | Persuasion | Attempts to convince or sell | Marketing copy, ads, fundraising |
| `HI` | HowTo | Instructions, procedures, recipes | Tutorials, manuals, FAQs |
| `ID` | Discussion | Interactive, forum-style dialogue | Forum threads, Q&A, comments |
| `SP` | Spoken | Transcribed or spoken-style text | Interviews, podcasts, speeches |
| `LY` | Lyrical | Poetic, artistic, song-like | Poetry, song lyrics, creative prose |

Based on the Biber & Egbert (2018) register taxonomy. Multi-label classification is supported: a text can be both Informational and Narrative.
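Because the model is multi-label, it is often more useful to keep every register whose probability clears a cutoff than to take only the top-1 label. A minimal sketch of that post-processing (the `registers_above` helper is illustrative, not part of this repo):

```python
def registers_above(labels, probs, threshold=0.5):
    """Map fastText's (labels, probs) output to {register_code: probability},
    keeping every label at or above the threshold."""
    return {
        label.replace("__label__", ""): round(float(prob), 3)
        for label, prob in zip(labels, probs)
        if prob >= threshold
    }

# With a loaded model (see Quick Start), k=-1 requests scores for all labels:
#   labels, probs = model.predict("The company reported revenue of...", k=-1)
#   registers_above(labels, probs)
```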

## Quick Start

```python
import fasttext
from huggingface_hub import hf_hub_download

# Download model (quantized, 151 MB)
model_path = hf_hub_download(
    "oneryalcin/text-register-fasttext-classifier",
    "register_fasttext_q.bin"
)
model = fasttext.load_model(model_path)

# Predict
labels, probs = model.predict("Buy now and save 50%! Limited time offer!", k=3)
# -> [('__label__IP', 1.0), ...]  # IP = Persuasion
```

> **Note**: If you get a numpy error, pin `numpy<2`: `pip install "numpy<2"`

## Performance

Trained on 10 English shards from [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (~1.9M documents), balanced by oversampling/undersampling each class to the median class size.

### Overall Metrics

| Metric | Full Model | Quantized |
|--------|-----------|-----------|
| Precision@1 | 0.831 | 0.796 |
| Recall@1 | 0.759 | 0.727 |
| Precision@2 | 0.491 | — |
| Recall@2 | 0.898 | — |
| Speed | ~500k pred/s | ~500k pred/s |
| Size | 1.1 GB | 151 MB |

### Per-Class F1 (threshold=0.3, k=2)

| Register | Precision | Recall | F1 | Test Support |
|----------|-----------|--------|-----|-------------|
| Informational | 0.910 | 0.666 | 0.769 | 108,672 |
| Narrative | 0.764 | 0.766 | 0.765 | 44,238 |
| Discussion | 0.640 | 0.774 | 0.701 | 7,420 |
| Persuasion | 0.553 | 0.794 | 0.652 | 19,193 |
| Opinion | 0.567 | 0.736 | 0.640 | 20,014 |
| HowTo | 0.515 | 0.766 | 0.616 | 7,281 |
| Spoken | 0.551 | 0.513 | 0.531 | 831 |
| Lyrical | 0.657 | 0.442 | 0.529 | 251 |

### Example Predictions

```
"The company reported revenue of $4.2 billion..."  -> Informational (1.00), Narrative (0.99)
"Once upon a time in a small village..."           -> Narrative
"I honestly think this movie is terrible..."       -> Opinion (1.00)
"To install the package, first run pip install..." -> HowTo (1.00)
"Buy now and save 50%! Limited time offer..."      -> Persuasion (1.00)
"So like, I was telling her yesterday..."          -> Spoken (1.00)
"I've been walking these streets alone..."         -> Lyrical (1.00)
"Hey everyone! What do you think about..."         -> Discussion (1.00)
"Introducing the revolutionary SkinGlow Pro..."    -> Persuasion (1.00)
```

## Use Cases

- **Data curation**: Filter pretraining corpora by register (e.g., keep only Informational + HowTo)
- **Content routing**: Route incoming text to different processing pipelines
- **Boilerplate removal**: Flag Persuasion/Marketing text in document corpora
- **Signal extraction**: Identify which paragraphs in a document carry factual vs opinion content
- **RAG preprocessing**: Score chunks by register before feeding to LLMs
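For the data-curation use case, the keep/drop decision reduces to a small predicate over the model's output. A sketch, assuming a document is kept when any wanted register clears a probability cutoff (`keep_document` is hypothetical, not one of the repo's scripts):

```python
WANTED = {"IN", "HI"}  # e.g. keep only Informational + HowTo

def keep_document(labels, probs, wanted=WANTED, threshold=0.5):
    """True if any wanted register is predicted at or above the threshold."""
    for label, prob in zip(labels, probs):
        if label.replace("__label__", "") in wanted and prob >= threshold:
            return True
    return False

# labels, probs = model.predict(doc_text.replace("\n", " "), k=-1)
# if keep_document(labels, probs):
#     corpus_out.append(doc_text)
```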

## Reproduce from Scratch

### 1. Download data

```bash
pip install huggingface_hub

# Download 10 English shards (~4 GB)
for i in $(seq 0 9); do
  hf download TurkuNLP/register_oscar \
    $(printf "en/en_%05d.jsonl.gz" $i) \
    --repo-type dataset --local-dir ./data
done
```

### 2. Prepare balanced training data

```bash
python scripts/prepare_data.py --data-dir ./data/en --output-dir ./prepared
```

### 3. Train

```bash
pip install fasttext-wheel "numpy<2"
python scripts/train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model
```

### 4. Predict

```bash
# Interactive
python scripts/predict.py --model ./model/register_fasttext_q.bin

# Single text
python scripts/predict.py --model ./model/register_fasttext_q.bin --text "Buy now! 50% off!"

# Batch
python scripts/predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output out.jsonl
```

## Training Details

- **Source data**: [TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar) (English, 10 shards, ~1.9M labeled documents)
- **Balancing**: Minority classes oversampled and majority classes undersampled to the median class size (~129k per class)
- **Architecture**: FastText supervised classifier with bigrams, 100-dim embeddings, one-vs-all loss
- **Hyperparameters**: lr=0.5, epoch=25, wordNgrams=2, dim=100, loss=ova, bucket=2M
- **Text preprocessing**: Whitespace collapsed, documents truncated to 500 words
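The preprocessing step above is tiny but worth replicating exactly at inference time, since fastText is sensitive to tokenization mismatches. It mirrors `clean_text` in `scripts/prepare_data.py`:

```python
import re

def clean_text(text: str, max_words: int = 500) -> str:
    """Collapse all whitespace to single spaces and truncate to max_words."""
    text = re.sub(r"\s+", " ", text).strip()
    return " ".join(text.split()[:max_words])
```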

## Limitations

- **Spoken & Lyrical** classes have lower F1 (~0.53) due to limited unique training data, even after oversampling
- Trained on web text only; may not generalize well to domain-specific text (legal, medical)
- Bag-of-words model (with bigrams); does not capture long-range word order or deep semantics
- English only (the source dataset covers other languages that could be used for multilingual training)

## Citation

If you use this model, please cite the source dataset:

```bibtex
@inproceedings{register_oscar,
  title  = {Multilingual register classification on the full OSCAR data},
  author = {R{\"o}nnqvist, Samuel and others},
  year   = {2023},
  note   = {TurkuNLP, University of Turku}
}

@article{biber2018register,
  title   = {Register as a predictor of linguistic variation},
  author  = {Biber, Douglas and Egbert, Jesse},
  journal = {Corpus Linguistics and Linguistic Theory},
  year    = {2018}
}
```

## License

The model weights inherit the license of the source dataset ([TurkuNLP/register_oscar](https://huggingface.co/datasets/TurkuNLP/register_oscar)). The scripts are released under MIT.
register_fasttext.bin
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:3e76a01fa9946bd26ab2eeeea1842ff643cc486634c0f4db4dbe85b6b7c78017
size 1156314566
register_fasttext_q.bin
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:7b1d55be8d490dbbcb17773592e367d58fb857dfe0a603c322246aef6855de86
size 158362937
scripts/predict.py
ADDED

```python
"""
Predict text register using the trained FastText model.

Usage:
    # Interactive mode
    python predict.py --model ./model/register_fasttext_q.bin

    # Single text
    python predict.py --model ./model/register_fasttext_q.bin --text "Buy now! Limited offer!"

    # File mode (one text per line)
    python predict.py --model ./model/register_fasttext_q.bin --input texts.txt --output predictions.jsonl
"""

import argparse
import json
import sys
import time

import fasttext


REGISTER_LABELS = {
    "IN": "Informational",
    "NA": "Narrative",
    "OP": "Opinion",
    "IP": "Persuasion",
    "HI": "HowTo",
    "ID": "Discussion",
    "SP": "Spoken",
    "LY": "Lyrical",
}


def predict_one(model, text: str, k: int = 3, threshold: float = 0.1):
    """Predict register labels for a single text."""
    labels, probs = model.predict(text.replace("\n", " "), k=k, threshold=threshold)
    results = []
    for label, prob in zip(labels, probs):
        code = label.replace("__label__", "")
        results.append({
            "label": code,
            "name": REGISTER_LABELS.get(code, code),
            "score": round(float(prob), 4),
        })
    return results


def main():
    parser = argparse.ArgumentParser(description="Predict text register")
    parser.add_argument("--model", required=True, help="Path to FastText .bin model")
    parser.add_argument("--text", help="Single text to classify")
    parser.add_argument("--input", help="Input file (one text per line)")
    parser.add_argument("--output", help="Output JSONL file")
    parser.add_argument("--k", type=int, default=3, help="Top-k labels to return")
    parser.add_argument("--threshold", type=float, default=0.1, help="Min probability threshold")
    args = parser.parse_args()

    # Suppress fastText's load_model warning on stderr
    try:
        fasttext.FastText.eprint = lambda x: None
    except Exception:
        pass

    model = fasttext.load_model(args.model)

    if args.text:
        # Single prediction
        results = predict_one(model, args.text, args.k, args.threshold)
        for r in results:
            print(f"  {r['name']:<15} ({r['label']}) {r['score']:.3f}")

    elif args.input:
        # Batch mode
        out_f = open(args.output, "w") if args.output else sys.stdout
        count = 0
        start = time.time()

        with open(args.input) as f:
            for line in f:
                text = line.strip()
                if not text:
                    continue
                results = predict_one(model, text, args.k, args.threshold)
                record = {"text": text[:200], "predictions": results}
                out_f.write(json.dumps(record) + "\n")
                count += 1

        elapsed = time.time() - start
        if args.output:
            out_f.close()
        print(f"Processed {count} texts in {elapsed:.2f}s ({count / elapsed:.0f}/sec)", file=sys.stderr)

    else:
        # Interactive mode
        print("Text Register Classifier (type 'quit' to exit)")
        print(f"Labels: {', '.join(f'{k}={v}' for k, v in REGISTER_LABELS.items())}")
        print()
        while True:
            try:
                text = input("> ").strip()
            except (EOFError, KeyboardInterrupt):
                break
            if text.lower() in ("quit", "exit", "q"):
                break
            if not text:
                continue
            results = predict_one(model, text, args.k, args.threshold)
            for r in results:
                print(f"  {r['name']:<15} ({r['label']}) {r['score']:.3f}")
            print()


if __name__ == "__main__":
    main()
```
scripts/prepare_data.py
ADDED

```python
"""
Prepare balanced FastText training data from the TurkuNLP/register_oscar dataset.

Reads downloaded English shards, extracts labeled documents, and creates a
balanced training set by oversampling minority classes and undersampling
majority classes to the median class size.

Requirements:
    pip install huggingface_hub

Usage:
    # Download shards first:
    for i in $(seq 0 9); do
        hf download TurkuNLP/register_oscar \
            $(printf "en/en_%05d.jsonl.gz" $i) \
            --repo-type dataset --local-dir ./data
    done

    # Then run:
    python prepare_data.py --data-dir ./data/en --output-dir ./prepared
"""

import argparse
import glob
import gzip
import json
import random
import re
from collections import Counter, defaultdict
from pathlib import Path


REGISTER_LABELS = {
    "IN": "Informational",
    "NA": "Narrative",
    "OP": "Opinion",
    "IP": "Persuasion",
    "HI": "HowTo",
    "ID": "Discussion",
    "SP": "Spoken",
    "LY": "Lyrical",
}


def clean_text(text: str, max_words: int = 500) -> str:
    """Collapse whitespace and truncate to max_words."""
    text = re.sub(r"\s+", " ", text).strip()
    words = text.split()[:max_words]
    return " ".join(words)


def main():
    parser = argparse.ArgumentParser(description="Prepare balanced FastText training data")
    parser.add_argument("--data-dir", default="./data/en", help="Directory with .jsonl.gz shards")
    parser.add_argument("--output-dir", default="./prepared", help="Output directory for train/test files")
    parser.add_argument("--max-words", type=int, default=500, help="Max words per document")
    parser.add_argument("--min-text-len", type=int, default=50, help="Min character length to keep")
    parser.add_argument("--test-ratio", type=float, default=0.1, help="Fraction held out for test")
    parser.add_argument("--seed", type=int, default=42)
    args = parser.parse_args()

    random.seed(args.seed)
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Collect all labeled docs grouped by primary label
    by_label = defaultdict(list)
    total = 0
    skipped_nolabel = 0
    skipped_short = 0

    shard_files = sorted(glob.glob(f"{args.data_dir}/*.jsonl.gz"))
    if not shard_files:
        raise FileNotFoundError(f"No .jsonl.gz files found in {args.data_dir}")

    print(f"Found {len(shard_files)} shard(s)")

    for shard_file in shard_files:
        print(f"  Processing {Path(shard_file).name}...")
        with gzip.open(shard_file, "rt") as f:
            for line in f:
                d = json.loads(line)
                labels = d.get("labels", [])
                text = d.get("text", "")

                if not labels:
                    skipped_nolabel += 1
                    continue
                if len(text) < args.min_text_len:
                    skipped_short += 1
                    continue

                cleaned = clean_text(text, args.max_words)
                if not cleaned:
                    continue

                label_str = " ".join(f"__label__{l}" for l in labels)
                ft_line = f"{label_str} {cleaned}\n"

                primary = labels[0]
                by_label[primary].append(ft_line)
                total += 1

    print(f"\nTotal labeled docs: {total}")
    print(f"Skipped (no label): {skipped_nolabel}")
    print(f"Skipped (too short): {skipped_short}")

    # Raw distribution
    print("\nRaw distribution:")
    for label in sorted(by_label.keys()):
        name = REGISTER_LABELS.get(label, label)
        print(f"  {label} ({name}): {len(by_label[label])}")

    # Balance: oversample minority to median, undersample majority to median
    sizes = {k: len(v) for k, v in by_label.items()}
    sorted_sizes = sorted(sizes.values())
    target = sorted_sizes[len(sorted_sizes) // 2]

    print(f"\nBalancing target (median): {target}")

    train_lines = []
    test_lines = []

    for label, lines in by_label.items():
        random.shuffle(lines)

        # Hold out --test-ratio of each class (at least 50 docs) for the test set
        n_test = max(int(len(lines) * args.test_ratio), 50)
        test_pool = lines[:n_test]
        train_pool = lines[n_test:]

        test_lines.extend(test_pool)
        n_train = len(train_pool)

        if n_train >= target:
            sampled = random.sample(train_pool, target)
            train_lines.extend(sampled)
            print(f"  {label}: {n_train} -> {target} (undersampled)")
        else:
            train_lines.extend(train_pool)
            n_needed = target - n_train
            oversampled = random.choices(train_pool, k=n_needed)
            train_lines.extend(oversampled)
            print(f"  {label}: {n_train} -> {target} (oversampled +{n_needed})")

    random.shuffle(train_lines)
    random.shuffle(test_lines)

    train_path = output_dir / "train.txt"
    test_path = output_dir / "test.txt"

    with open(train_path, "w") as f:
        f.writelines(train_lines)
    with open(test_path, "w") as f:
        f.writelines(test_lines)

    print(f"\nTrain: {len(train_lines)} -> {train_path}")
    print(f"Test:  {len(test_lines)} -> {test_path}")

    # Verify balance
    c = Counter()
    for line in train_lines:
        for tok in line.split():
            if tok.startswith("__label__"):
                c[tok] += 1
    print("\nFinal train label distribution:")
    for l, cnt in c.most_common():
        name = REGISTER_LABELS.get(l.replace("__label__", ""), l)
        print(f"  {l} ({name}): {cnt}")


if __name__ == "__main__":
    main()
```
scripts/train.py
ADDED

```python
"""
Train a FastText text register classifier.

Usage:
    python train.py --train ./prepared/train.txt --test ./prepared/test.txt --output ./model

This produces:
    - model/register_fasttext.bin (full model)
    - model/register_fasttext_q.bin (quantized, ~7x smaller)
"""

import argparse
import os
import time
from pathlib import Path

import fasttext


def main():
    parser = argparse.ArgumentParser(description="Train FastText register classifier")
    parser.add_argument("--train", default="./prepared/train.txt", help="Training data file")
    parser.add_argument("--test", default="./prepared/test.txt", help="Test data file")
    parser.add_argument("--output", default="./model", help="Output directory")
    parser.add_argument("--lr", type=float, default=0.5, help="Learning rate")
    parser.add_argument("--epoch", type=int, default=25, help="Number of epochs")
    parser.add_argument("--dim", type=int, default=100, help="Embedding dimension")
    parser.add_argument("--wordNgrams", type=int, default=2, help="Max n-gram length")
    parser.add_argument("--bucket", type=int, default=2000000, help="Hash bucket size")
    parser.add_argument("--thread", type=int, default=8, help="Number of threads")
    parser.add_argument("--min-count", type=int, default=5, help="Min word count")
    args = parser.parse_args()

    output_dir = Path(args.output)
    output_dir.mkdir(parents=True, exist_ok=True)

    print("=== Training FastText register classifier ===")
    start = time.time()

    model = fasttext.train_supervised(
        input=args.train,
        lr=args.lr,
        epoch=args.epoch,
        wordNgrams=args.wordNgrams,
        dim=args.dim,
        loss="ova",  # one-vs-all loss for multi-label classification
        minCount=args.min_count,
        bucket=args.bucket,
        thread=args.thread,
        verbose=2,
    )

    train_time = time.time() - start
    print(f"Training time: {train_time:.1f}s")

    # Save full model
    full_path = output_dir / "register_fasttext.bin"
    model.save_model(str(full_path))
    size_mb = os.path.getsize(full_path) / 1024 / 1024
    print(f"\nFull model: {full_path} ({size_mb:.1f} MB)")

    # Evaluate
    print("\n=== Evaluation ===")
    for k in [1, 2]:
        r = model.test(args.test, k=k)
        print(f"  k={k}: Precision={r[1]:.4f} Recall={r[2]:.4f} (n={r[0]})")

    # Quantize
    print("\nQuantizing...")
    model.quantize(input=args.train, retrain=True)
    q_path = output_dir / "register_fasttext_q.bin"
    model.save_model(str(q_path))
    size_q = os.path.getsize(q_path) / 1024 / 1024
    print(f"Quantized model: {q_path} ({size_q:.1f} MB)")

    r = model.test(args.test, k=1)
    print(f"  Quantized k=1: Precision={r[1]:.4f} Recall={r[2]:.4f}")

    # Speed test
    print("\n=== Speed Test ===")
    test_text = "The algorithm processes data in O(n log n) time complexity."
    start = time.time()
    for _ in range(100000):
        model.predict(test_text)
    elapsed = time.time() - start
    print(f"{100000 / elapsed:.0f} predictions/sec")

    print("\nDone!")


if __name__ == "__main__":
    main()
```