Instructions to use FlowRank/mailSort with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use FlowRank/mailSort with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="FlowRank/mailSort")

# Load model directly
from transformers import AutoModel

model = AutoModel.from_pretrained("FlowRank/mailSort", dtype="auto")
```
- Notebooks
- Google Colab
- Kaggle
feat(training): pipeline minimal train/test + artefacts HF
- Trains a classifier from FlowRank/labeled_emails
- Adds evaluation and artifact-preparation scripts under model/
- Documents usage (train/test/publish) in the README
Co-authored-by: Hilarion Lefuneste <hilarionlefuneste@tutamail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- .gitignore +15 -0
- .python-version +1 -0
- README.md +111 -0
- main.py +7 -0
- model/config.json +52 -0
- model/model.safetensors +3 -0
- model/tokenizer.json +0 -0
- model/tokenizer_config.json +15 -0
- model/training_args.bin +3 -0
- pyproject.toml +21 -0
- src/mailsort/__init__.py +2 -0
- src/mailsort/eval.py +112 -0
- src/mailsort/prepare_model.py +61 -0
- src/mailsort/train.py +171 -0
- uv.lock +0 -0
.gitignore
ADDED
```
.venv/
__pycache__/
*.pyc
.DS_Store

# training outputs (local only)
outputs/
outputs_smoke/

# HF caches
.cache/
**/.cache/

# build metadata
*.egg-info/
```
.python-version
ADDED
```
3.11
```
README.md
ADDED

---
library_name: transformers
pipeline_tag: text-classification
tags:
- email
- text-classification
language:
- en
---

## mailSort

Minimal Python repo to **train, evaluate, and publish** a multi-class email classification model from the Hugging Face dataset [`FlowRank/labeled_emails`](https://huggingface.co/datasets/FlowRank/labeled_emails).

The main script is `mailsort.train` (Transformers `Trainer`).

### Prerequisites

- Python managed by `uv` (this repo is meant to be run with `uv run`)
- (Optional) a CUDA GPU to speed up training
- (Optional) an HF token to publish to the Hub

### Installation

`uv` installs/syncs the dependencies automatically on first run.

```bash
uv sync
```

### Train + evaluate (train + test)

By default, the script loads the dataset **from the Hub** and uses its `train` and `test` splits.
Evaluation runs at each epoch, followed by a final evaluation at the end.
Artifacts (model, tokenizer) are saved to `outputs/` (or the directory passed via `--output-dir`).

```bash
uv run python -m mailsort.train \
  --dataset-id FlowRank/labeled_emails \
  --model-name distilbert-base-uncased \
  --hub-model-id FlowRank/mailSort \
  --num-train-epochs 2
```

### Test / evaluate only

The script has no "eval-only" mode (yet).
The **minimal** way to run only a quick pass is to set `--num-train-epochs 0` (which skips training) while keeping the `evaluate` phase.

```bash
uv run python -m mailsort.train --num-train-epochs 0
```

### Evaluate on the dataset's `test` split (recommended)

After training into `outputs/`, you can evaluate cleanly on the **`test` split** of `FlowRank/labeled_emails`:

```bash
uv run python -m mailsort.eval --model outputs --dataset-id FlowRank/labeled_emails --split test
```

(Optional) For a quick test:

```bash
uv run python -m mailsort.eval --model outputs --split test --max-samples 200
```

### Publish to the Hub (FlowRank/mailSort)

The push happens automatically if the `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`) environment variable is set.

```bash
export HF_TOKEN="..."
uv run python -m mailsort.train --hub-model-id FlowRank/mailSort
```

### Publish via Git (README at the root + artifacts in `model/`)

To get a "complete" Hugging Face repo (docs + weights) while keeping a clean structure, we put:

- `README.md` at the root (documentation + model card)
- the artifacts (config, weights, tokenizer) in `model/`

Hugging Face can then load the model via `subfolder="model"`.

1) Prepare the `model/` folder from `outputs/`:

```bash
uv run python -m mailsort.prepare_model --outputs-dir outputs --model-dir model
```

2) Commit + push to the Hugging Face repo `FlowRank/mailSort`:

```bash
git add README.md model
git commit -m "Add model artifacts under model/ + docs"
git push
```

### Inference (use the published model)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("FlowRank/mailSort", subfolder="model")
model = AutoModelForSequenceClassification.from_pretrained("FlowRank/mailSort", subfolder="model")
clf = pipeline("text-classification", model=model, tokenizer=tok, truncation=True)
text = "Subject: Insurance claim\n\nBody: Hello, I need to update my policy..."
print(clf(text))
```
main.py
ADDED
```python
from __future__ import annotations

from mailsort.train import main


if __name__ == "__main__":
    raise SystemExit(main())
```
model/config.json
ADDED
```json
{
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "bos_token_id": null,
  "dim": 768,
  "dropout": 0.1,
  "dtype": "float32",
  "eos_token_id": null,
  "hidden_dim": 3072,
  "id2label": {
    "0": "family",
    "1": "finance",
    "2": "games",
    "3": "human resources",
    "4": "medical",
    "5": "pets",
    "6": "school",
    "7": "software engineering",
    "8": "sport",
    "9": "work/airbus"
  },
  "initializer_range": 0.02,
  "label2id": {
    "family": 0,
    "finance": 1,
    "games": 2,
    "human resources": 3,
    "medical": 4,
    "pets": 5,
    "school": 6,
    "software engineering": 7,
    "sport": 8,
    "work/airbus": 9
  },
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "tie_word_embeddings": true,
  "transformers_version": "5.8.0",
  "use_cache": false,
  "vocab_size": 30522
}
```
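The `id2label`/`label2id` tables in this config are what turn raw classifier scores into category names at inference time. A minimal sketch of that mapping in plain Python (no transformers dependency; the logit values below are made up for illustration):

```python
# id2label as stored in model/config.json (note: JSON object keys are strings)
id2label = {
    "0": "family", "1": "finance", "2": "games", "3": "human resources",
    "4": "medical", "5": "pets", "6": "school", "7": "software engineering",
    "8": "sport", "9": "work/airbus",
}

def predict_label(logits: list[float]) -> str:
    """Pick the highest-scoring class index and map it to its label name."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return id2label[str(best)]

# Toy logits: index 1 ("finance") has the largest score.
toy_logits = [0.1, 3.2, -1.0, 0.0, 0.5, -2.1, 0.3, 1.1, -0.4, 0.2]
print(predict_label(toy_logits))  # finance
```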
model/model.safetensors
ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:e88532e8b0d0583f7a96aa686f19f0ab2256ee741f6b55773cbb8f36f520c276
size 267857176
```
model/tokenizer.json
ADDED

The diff for this file is too large to render. See raw diff.
model/tokenizer_config.json
ADDED
```json
{
  "backend": "tokenizers",
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "is_local": false,
  "local_files_only": false,
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
```
model/training_args.bin
ADDED
```
version https://git-lfs.github.com/spec/v1
oid sha256:e5f07cdfe1546cecc7b90ed8a2b923d1de3ed2925d0c1de004f011b09e643764
size 5265
```
pyproject.toml
ADDED
```toml
[project]
name = "mailsort"
version = "0.1.0"
description = "Add your description here"
requires-python = ">=3.11"
dependencies = [
    "accelerate>=1.13.0",
    "datasets>=4.8.5",
    "huggingface-hub>=1.13.0",
    "safetensors>=0.7.0",
    "torch>=2.11.0",
    "transformers>=5.8.0",
]

[project.scripts]
mailsort-train = "mailsort.train:main"
mailsort-eval = "mailsort.eval:main"
mailsort-prepare-model = "mailsort.prepare_model:main"

[tool.uv]
package = true
```
src/mailsort/__init__.py
ADDED
```python
__all__ = []
```
src/mailsort/eval.py
ADDED
```python
from __future__ import annotations

import argparse
from collections import Counter, defaultdict
from dataclasses import dataclass

from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline


@dataclass(frozen=True)
class Config:
    dataset_id: str
    model_id_or_path: str
    subfolder: str | None
    split: str
    max_samples: int | None


def _build_text(subject: str, body: str) -> str:
    subject = "" if subject is None else str(subject)
    body = "" if body is None else str(body)
    if subject and body:
        return f"Subject: {subject}\n\nBody: {body}"
    return subject or body


def _parse_args() -> Config:
    p = argparse.ArgumentParser(description="Evaluate a model (local or Hub) against a HF dataset split.")
    p.add_argument("--dataset-id", default="FlowRank/labeled_emails")
    p.add_argument("--model", default="outputs", help="Local path OR Hugging Face repo id (e.g. FlowRank/mailSort).")
    p.add_argument("--subfolder", default=None, help="Optional subfolder (e.g. model).")
    p.add_argument("--split", default="test", help="Which split to evaluate (e.g. test).")
    p.add_argument("--max-samples", type=int, default=None, help="Limit evaluation to N samples.")
    a = p.parse_args()
    return Config(
        dataset_id=a.dataset_id,
        model_id_or_path=a.model,
        subfolder=a.subfolder,
        split=a.split,
        max_samples=a.max_samples,
    )


def main() -> int:
    cfg = _parse_args()

    ds = load_dataset(cfg.dataset_id)
    if cfg.split not in ds:
        raise SystemExit(f"Split '{cfg.split}' not found. Available: {list(ds.keys())}")

    rows = ds[cfg.split]
    if cfg.max_samples is not None:
        rows = rows.select(range(min(cfg.max_samples, len(rows))))

    kwargs = {}
    if cfg.subfolder:
        kwargs["subfolder"] = cfg.subfolder

    tokenizer = AutoTokenizer.from_pretrained(cfg.model_id_or_path, **kwargs)
    model = AutoModelForSequenceClassification.from_pretrained(cfg.model_id_or_path, **kwargs)
    clf = pipeline(
        "text-classification",
        model=model,
        tokenizer=tokenizer,
        truncation=True,
    )

    correct = 0
    total = 0
    per_label = Counter()
    per_label_ok = Counter()
    confusion = defaultdict(Counter)  # true -> pred -> count

    for ex in rows:
        text = _build_text(ex.get("subject"), ex.get("body"))
        true_label = str(ex["label"])
        pred = clf(text, top_k=1)[0]["label"]

        total += 1
        per_label[true_label] += 1
        confusion[true_label][pred] += 1

        if pred == true_label:
            correct += 1
            per_label_ok[true_label] += 1

    acc = correct / total if total else 0.0
    print(f"dataset={cfg.dataset_id} split={cfg.split} samples={total}")
    print(f"accuracy={acc:.4f} ({correct}/{total})")
    print("\nper-label accuracy:")
    for label in sorted(per_label.keys()):
        denom = per_label[label]
        num = per_label_ok[label]
        print(f"- {label}: {num}/{denom} = {num/denom:.4f}")

    # print top confusions per label (lightweight)
    print("\ncommon confusions (top-3 per true label):")
    for true_label in sorted(confusion.keys()):
        most = confusion[true_label].most_common(3)
        # skip perfect-only rows
        if len(most) == 1 and most[0][0] == true_label:
            continue
        top = ", ".join([f"{pred}:{cnt}" for pred, cnt in most])
        print(f"- {true_label}: {top}")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
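The aggregation at the heart of `eval.py` (overall accuracy, per-label counts, and the true-to-predicted confusion counters) can be exercised standalone with toy predictions; the label pairs below are made up for illustration:

```python
from collections import Counter, defaultdict

# Toy (true_label, predicted_label) pairs; values are illustrative only.
pairs = [
    ("finance", "finance"),
    ("finance", "work/airbus"),
    ("pets", "pets"),
    ("pets", "pets"),
]

correct = 0
per_label = Counter()
per_label_ok = Counter()
confusion = defaultdict(Counter)  # true -> pred -> count

for true_label, pred in pairs:
    per_label[true_label] += 1
    confusion[true_label][pred] += 1
    if pred == true_label:
        correct += 1
        per_label_ok[true_label] += 1

accuracy = correct / len(pairs)
print(accuracy)                                 # 0.75
print(per_label_ok["finance"], per_label["finance"])  # 1 2
print(confusion["finance"]["work/airbus"])      # 1
```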
src/mailsort/prepare_model.py
ADDED
```python
from __future__ import annotations

import argparse
import shutil
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Config:
    outputs_dir: Path
    model_dir: Path


def _parse_args() -> Config:
    p = argparse.ArgumentParser(description="Prepare model/ folder from an outputs/ training directory.")
    p.add_argument("--outputs-dir", default="outputs", help="Training output directory (from mailsort.train).")
    p.add_argument("--model-dir", default="model", help="Target folder to commit/push to Hugging Face.")
    a = p.parse_args()
    return Config(outputs_dir=Path(a.outputs_dir), model_dir=Path(a.model_dir))


def main() -> int:
    cfg = _parse_args()

    if not cfg.outputs_dir.exists():
        raise SystemExit(f"outputs-dir not found: {cfg.outputs_dir}")

    cfg.model_dir.mkdir(parents=True, exist_ok=True)

    # clean target (keep it explicit and predictable)
    for p in cfg.model_dir.iterdir():
        if p.is_dir():
            shutil.rmtree(p)
        else:
            p.unlink()

    # Copy only final artifacts (root files), ignore trainer checkpoints.
    for p in cfg.outputs_dir.iterdir():
        if p.is_dir():
            # ignore checkpoint-* dirs
            continue
        shutil.copy2(p, cfg.model_dir / p.name)

    # sanity: expected minimum files
    expected_any = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
    ]
    missing = [n for n in expected_any if not (cfg.model_dir / n).exists()]
    if missing:
        raise SystemExit(f"Missing expected files in {cfg.model_dir}: {missing}")

    print(f"Prepared {cfg.model_dir} from {cfg.outputs_dir}")
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
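The core rule in `prepare_model.py` — copy files at the root of `outputs/`, skip trainer `checkpoint-*` directories — can be demonstrated in isolation with a throwaway directory tree (all paths below are temporary and hypothetical):

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp) / "outputs"
    model = Path(tmp) / "model"
    (outputs / "checkpoint-500").mkdir(parents=True)  # trainer checkpoint dir: must be skipped
    (outputs / "config.json").write_text("{}")
    (outputs / "tokenizer.json").write_text("{}")
    model.mkdir()

    # Same rule as prepare_model.py: copy root files, ignore directories.
    for p in outputs.iterdir():
        if p.is_dir():
            continue
        shutil.copy2(p, model / p.name)

    copied = sorted(x.name for x in model.iterdir())
    print(copied)  # ['config.json', 'tokenizer.json']
```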
src/mailsort/train.py
ADDED
```python
from __future__ import annotations

import argparse
import os
from dataclasses import dataclass

import numpy as np
from datasets import DatasetDict, load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)


@dataclass(frozen=True)
class Config:
    dataset_id: str
    model_name: str
    hub_model_id: str
    output_dir: str
    max_length: int
    num_train_epochs: float
    per_device_train_batch_size: int
    per_device_eval_batch_size: int
    learning_rate: float
    weight_decay: float
    seed: int


def _build_text(subject: str, body: str) -> str:
    subject = "" if subject is None else str(subject)
    body = "" if body is None else str(body)
    if subject and body:
        return f"Subject: {subject}\n\nBody: {body}"
    return subject or body


def _parse_args() -> Config:
    p = argparse.ArgumentParser(description="Train & push email classifier to Hugging Face Hub.")
    p.add_argument("--dataset-id", default="FlowRank/labeled_emails")
    p.add_argument("--model-name", default="distilbert-base-uncased")
    p.add_argument("--hub-model-id", default="FlowRank/mailSort")
    p.add_argument("--output-dir", default="outputs")
    p.add_argument("--max-length", type=int, default=256)
    p.add_argument("--num-train-epochs", type=float, default=2)
    p.add_argument("--per-device-train-batch-size", type=int, default=16)
    p.add_argument("--per-device-eval-batch-size", type=int, default=32)
    p.add_argument("--learning-rate", type=float, default=2e-5)
    p.add_argument("--weight-decay", type=float, default=0.01)
    p.add_argument("--seed", type=int, default=42)
    a = p.parse_args()
    return Config(
        dataset_id=a.dataset_id,
        model_name=a.model_name,
        hub_model_id=a.hub_model_id,
        output_dir=a.output_dir,
        max_length=a.max_length,
        num_train_epochs=a.num_train_epochs,
        per_device_train_batch_size=a.per_device_train_batch_size,
        per_device_eval_batch_size=a.per_device_eval_batch_size,
        learning_rate=a.learning_rate,
        weight_decay=a.weight_decay,
        seed=a.seed,
    )


def _load_ds(dataset_id: str, seed: int) -> DatasetDict:
    ds = load_dataset(dataset_id)
    if "train" in ds and "test" in ds:
        return ds  # already split
    # fallback: split if only a single split exists
    if "train" in ds and "test" not in ds:
        return ds["train"].train_test_split(test_size=0.1, seed=seed)
    # if weird structure, just return as-is and let Trainer fail loudly
    return ds


def _prepare(ds: DatasetDict, tokenizer: AutoTokenizer, label2id: dict[str, int], max_length: int) -> DatasetDict:
    def preprocess(ex):
        text = _build_text(ex.get("subject"), ex.get("body"))
        out = tokenizer(text, truncation=True, max_length=max_length)
        out["labels"] = label2id[str(ex["label"])]
        return out

    cols_to_remove = [c for c in ["subject", "body", "label"] if c in ds["train"].column_names]
    return ds.map(preprocess, remove_columns=cols_to_remove)


def _compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    acc = (preds == labels).astype(np.float32).mean().item()
    return {"accuracy": acc}


def main() -> int:
    cfg = _parse_args()

    ds = _load_ds(cfg.dataset_id, seed=cfg.seed)
    train_split = "train" if "train" in ds else list(ds.keys())[0]
    test_split = "test" if "test" in ds else ("validation" if "validation" in ds else None)

    if test_split is None:
        raise SystemExit(f"Dataset must have a test/validation split. Found: {list(ds.keys())}")

    tokenizer = AutoTokenizer.from_pretrained(cfg.model_name, use_fast=True)

    labels = sorted({str(x) for x in ds[train_split]["label"]})
    label2id = {l: i for i, l in enumerate(labels)}
    id2label = {i: l for l, i in label2id.items()}

    encoded = _prepare(ds, tokenizer, label2id=label2id, max_length=cfg.max_length)
    data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.model_name,
        num_labels=len(labels),
        label2id=label2id,
        id2label=id2label,
    )

    push_to_hub = bool(os.getenv("HF_TOKEN")) or bool(os.getenv("HUGGINGFACE_HUB_TOKEN"))

    args = TrainingArguments(
        output_dir=cfg.output_dir,
        num_train_epochs=cfg.num_train_epochs,
        learning_rate=cfg.learning_rate,
        per_device_train_batch_size=cfg.per_device_train_batch_size,
        per_device_eval_batch_size=cfg.per_device_eval_batch_size,
        weight_decay=cfg.weight_decay,
        eval_strategy="epoch",
        save_strategy="epoch",
        logging_strategy="steps",
        logging_steps=50,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        seed=cfg.seed,
        report_to="none",
        push_to_hub=push_to_hub,
        hub_model_id=cfg.hub_model_id if push_to_hub else None,
        hub_strategy="end" if push_to_hub else "every_save",
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=encoded[train_split],
        eval_dataset=encoded[test_split],
        processing_class=tokenizer,
        data_collator=data_collator,
        compute_metrics=_compute_metrics,
    )

    trainer.train()
    trainer.evaluate()

    trainer.save_model(cfg.output_dir)
    tokenizer.save_pretrained(cfg.output_dir)

    if args.push_to_hub:
        trainer.push_to_hub()

    return 0


if __name__ == "__main__":
    raise SystemExit(main())
```
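Two small pieces of `train.py` are easy to sanity-check in isolation: the `Subject:`/`Body:` text template and the deterministic label-to-id mapping (ids are assigned by sorted label name, so the mapping is stable regardless of row order in the dataset). A standalone sketch with toy labels:

```python
def build_text(subject, body):
    # Mirrors _build_text in train.py: join subject and body, tolerate missing fields.
    subject = "" if subject is None else str(subject)
    body = "" if body is None else str(body)
    if subject and body:
        return f"Subject: {subject}\n\nBody: {body}"
    return subject or body

# Same construction as train.py: sort the unique label strings, then enumerate.
labels = sorted({"pets", "finance", "family"})
label2id = {l: i for i, l in enumerate(labels)}
print(label2id)             # {'family': 0, 'finance': 1, 'pets': 2}
print(build_text("Hi", None))  # Hi
```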
uv.lock
ADDED

The diff for this file is too large to render. See raw diff.