Upload folder using huggingface_hub


This view is limited to 50 files because it contains too many changes.

Files changed (50)

.gitattributes +6 -0
README.md +287 -0
REPORT.md +140 -0
baselines/embedding-lightgbm/embedding-lightgbm.joblib +3 -0
baselines/embedding-lightgbm/test_predictions.csv +0 -0
baselines/embedding-lightgbm/validation_predictions.csv +0 -0
baselines/embedding-logistic/embedding-logistic.joblib +3 -0
baselines/embedding-logistic/test_predictions.csv +0 -0
baselines/embedding-logistic/validation_predictions.csv +0 -0
baselines/embedding-svm/embedding-svm.joblib +3 -0
baselines/embedding-svm/test_predictions.csv +0 -0
baselines/embedding-svm/validation_predictions.csv +0 -0
baselines/logistic/logistic_tfidf.joblib +3 -0
baselines/logistic/test_predictions.csv +0 -0
baselines/logistic/validation_predictions.csv +0 -0
baselines/xgboost/test_predictions.csv +0 -0
baselines/xgboost/validation_predictions.csv +0 -0
baselines/xgboost/xgboost_tfidf.joblib +3 -0
report.json +704 -0
transformer/checkpoint-1220/config.json +39 -0
transformer/checkpoint-1220/model.safetensors +3 -0
transformer/checkpoint-1220/optimizer.pt +3 -0
transformer/checkpoint-1220/rng_state.pth +3 -0
transformer/checkpoint-1220/scaler.pt +3 -0
transformer/checkpoint-1220/scheduler.pt +3 -0
transformer/checkpoint-1220/tokenizer.json +3 -0
transformer/checkpoint-1220/tokenizer_config.json +15 -0
transformer/checkpoint-1220/trainer_state.json +431 -0
transformer/checkpoint-1220/training_args.bin +3 -0
transformer/checkpoint-1525/config.json +39 -0
transformer/checkpoint-1525/model.safetensors +3 -0
transformer/checkpoint-1525/optimizer.pt +3 -0
transformer/checkpoint-1525/rng_state.pth +3 -0
transformer/checkpoint-1525/scaler.pt +3 -0
transformer/checkpoint-1525/scheduler.pt +3 -0
transformer/checkpoint-1525/tokenizer.json +3 -0
transformer/checkpoint-1525/tokenizer_config.json +15 -0
transformer/checkpoint-1525/trainer_state.json +535 -0
transformer/checkpoint-1525/training_args.bin +3 -0
transformer/checkpoint-305/config.json +39 -0
transformer/checkpoint-305/model.safetensors +3 -0
transformer/checkpoint-305/optimizer.pt +3 -0
transformer/checkpoint-305/rng_state.pth +3 -0
transformer/checkpoint-305/scaler.pt +3 -0
transformer/checkpoint-305/scheduler.pt +3 -0
transformer/checkpoint-305/tokenizer.json +3 -0
transformer/checkpoint-305/tokenizer_config.json +15 -0
transformer/checkpoint-305/trainer_state.json +140 -0
transformer/checkpoint-305/training_args.bin +3 -0
transformer/checkpoint-610/config.json +39 -0

.gitattributes CHANGED

@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+transformer/checkpoint-1220/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+transformer/checkpoint-1525/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+transformer/checkpoint-305/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+transformer/checkpoint-610/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+transformer/checkpoint-915/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+transformer/tokenizer.json filter=lfs diff=lfs merge=lfs -text

README.md ADDED

	@@ -0,0 +1,287 @@

+---
+library_name: transformers
+pipeline_tag: text-classification
+base_model: FacebookAI/xlm-roberta-base
+tags:
+- text-classification
+- binary-classification
+- amis
+- agriculture
+language: multilingual
+---
+# AMIS Commodity Classifier
+This model repository contains artifacts from an AMIS commodity relevance classifier training run.
+It includes the Transformer model, any configured TF-IDF or sentence-embedding baselines, prediction files, and the training report.
+- Dataset: `faodl/amis-agri-utilization`
+- Dataset subset: ``
+- Text column: `chunk_text`
+- Label column: `label`
+- Transformer: `FacebookAI/xlm-roberta-base`
+- Generated at: `2026-05-25T19:23:29.605062+00:00`
+## Dataset Summary
+| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| train | 4877 | 4347 | 530 | 2513 | 696.6 |
+| validation | 978 | 899 | 79 | 538 | 690.6 |
+| test | 1016 | 904 | 112 | 539 | 690.7 |
+## Threshold Comparison on Test Split
+| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| logistic_tfidf | 0.500 | 0.926 | 0.691 | 0.598 | 0.641 | 0.899 | 0.726 |
+| logistic_tfidf | 0.608 | 0.930 | 0.902 | 0.411 | 0.564 | 0.899 | 0.726 |
+| xgboost_tfidf | 0.500 | 0.924 | 1.000 | 0.312 | 0.476 | 0.892 | 0.692 |
+| xgboost_tfidf | 0.177 | 0.918 | 0.663 | 0.527 | 0.587 | 0.892 | 0.692 |
+| embedding-logistic_sentence_embeddings | 0.500 | 0.899 | 0.524 | 0.866 | 0.653 | 0.952 | 0.759 |
+| embedding-logistic_sentence_embeddings | 0.616 | 0.929 | 0.632 | 0.857 | 0.727 | 0.952 | 0.759 |
+| embedding-svm_sentence_embeddings | 0.500 | 0.941 | 0.771 | 0.661 | 0.712 | 0.952 | 0.743 |
+| embedding-svm_sentence_embeddings | 0.276 | 0.935 | 0.667 | 0.821 | 0.736 | 0.952 | 0.743 |
+| embedding-lightgbm_sentence_embeddings | 0.500 | 0.946 | 0.788 | 0.696 | 0.739 | 0.959 | 0.801 |
+| embedding-lightgbm_sentence_embeddings | 0.052 | 0.933 | 0.657 | 0.821 | 0.730 | 0.959 | 0.801 |
+| transformer | 0.500 | 0.950 | 0.748 | 0.821 | 0.783 | 0.951 | 0.785 |
+| transformer | 0.616 | 0.950 | 0.748 | 0.821 | 0.783 | 0.951 | 0.785 |
+## Confusion Matrices on Test Split
+Rows are true labels and columns are predicted labels.
+### logistic_tfidf at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 874 | 30 |
+| RELEVANT | 45 | 67 |
+### logistic_tfidf at threshold 0.608
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 899 | 5 |
+| RELEVANT | 66 | 46 |
+### xgboost_tfidf at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 904 | 0 |
+| RELEVANT | 77 | 35 |
+### xgboost_tfidf at threshold 0.177
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 874 | 30 |
+| RELEVANT | 53 | 59 |
+### embedding-logistic_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 816 | 88 |
+| RELEVANT | 15 | 97 |
+### embedding-logistic_sentence_embeddings at threshold 0.616
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 848 | 56 |
+| RELEVANT | 16 | 96 |
+### embedding-svm_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 882 | 22 |
+| RELEVANT | 38 | 74 |
+### embedding-svm_sentence_embeddings at threshold 0.276
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 858 | 46 |
+| RELEVANT | 20 | 92 |
+### embedding-lightgbm_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 883 | 21 |
+| RELEVANT | 34 | 78 |
+### embedding-lightgbm_sentence_embeddings at threshold 0.052
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 856 | 48 |
+| RELEVANT | 20 | 92 |
+### transformer at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 873 | 31 |
+| RELEVANT | 20 | 92 |
+### transformer at threshold 0.616
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 873 | 31 |
+| RELEVANT | 20 | 92 |
+## Validation-Tuned Thresholds
+- `logistic_tfidf`: threshold `0.608` (validation F1 `0.578`); test F1 change vs 0.5: `-0.077`.
+- `xgboost_tfidf`: threshold `0.177` (validation F1 `0.581`); test F1 change vs 0.5: `+0.111`.
+- `embedding-logistic_sentence_embeddings`: threshold `0.616` (validation F1 `0.728`); test F1 change vs 0.5: `+0.074`.
+- `embedding-svm_sentence_embeddings`: threshold `0.276` (validation F1 `0.731`); test F1 change vs 0.5: `+0.024`.
+- `embedding-lightgbm_sentence_embeddings`: threshold `0.052` (validation F1 `0.739`); test F1 change vs 0.5: `-0.009`.
+- `transformer`: threshold `0.616` (validation F1 `0.807`); test F1 change vs 0.5: `+0.000`.
+## Artifacts
+- `logistic_tfidf`: `/content/agri-utilization-classifier/baselines/logistic`
+- `xgboost_tfidf`: `/content/agri-utilization-classifier/baselines/xgboost`
+- `embedding-logistic_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-logistic`
+- `embedding-svm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-svm`
+- `embedding-lightgbm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-lightgbm`
+- `transformer`: `/content/agri-utilization-classifier/transformer`
+## Inference
+Install the runtime dependencies:
+```bash
+pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost sentence-transformers lightgbm
+```
+### Transformer
+```python
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+MODEL_ID = "faodl/agri-utilization-classifier"
+texts = [
+    "Rice export prices increased after new procurement rules were announced.",
+    "The finance ministry released its monthly fuel tax bulletin.",
+]
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
+model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")
+threshold = float(getattr(model.config, "threshold", 0.5))
+encoded = tokenizer(
+    texts,
+    truncation=True,
+    padding=True,
+    max_length=256,
+    return_tensors="pt",
+)
+with torch.no_grad():
+    logits = model(**encoded).logits
+    probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()
+for text, probability in zip(texts, probabilities):
+    label = model.config.id2label[int(probability >= threshold)]
+    print({"text": text, "probability_positive": probability, "label": label})
+```
+### TF-IDF Baselines
+Available baseline names in this run: "logistic", "xgboost".
+```python
+import json
+import joblib
+from huggingface_hub import hf_hub_download
+MODEL_ID = "faodl/agri-utilization-classifier"
+BASELINE = "logistic"
+texts = [
+    "Maize production forecasts were revised after delayed rains.",
+    "The central bank published new exchange rate statistics.",
+]
+model_path = hf_hub_download(
+    repo_id=MODEL_ID,
+    repo_type="model",
+    filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
+)
+report_path = hf_hub_download(
+    repo_id=MODEL_ID,
+    repo_type="model",
+    filename="report.json",
+)
+pipeline = joblib.load(model_path)
+with open(report_path, encoding="utf-8") as handle:
+    report = json.load(handle)
+threshold = next(
+    result["validation_best_threshold"]["threshold"]
+    for result in report["results"]
+    if result["model_type"] == f"{BASELINE}_tfidf"
+)
+probabilities = pipeline.predict_proba(texts)[:, 1]
+for text, probability in zip(texts, probabilities):
+    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
+    print({"text": text, "probability_positive": float(probability), "label": label})
+```
+### Sentence-Embedding Baselines
+Available embedding baseline names in this run: "embedding-logistic", "embedding-svm", "embedding-lightgbm".
+```python
+import joblib
+from huggingface_hub import hf_hub_download
+from sentence_transformers import SentenceTransformer
+MODEL_ID = "faodl/agri-utilization-classifier"
+BASELINE = "embedding-logistic"
+texts = [
+    "Wheat export inspections rose as demand from importers increased.",
+    "The sports ministry announced a new stadium renovation plan.",
+]
+model_path = hf_hub_download(
+    repo_id=MODEL_ID,
+    repo_type="model",
+    filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
+)
+artifact = joblib.load(model_path)
+embedding_model = SentenceTransformer(artifact["embedding_model_name"])
+embeddings = embedding_model.encode(
+    texts,
+    batch_size=artifact.get("embedding_batch_size", 64),
+    convert_to_numpy=True,
+    normalize_embeddings=artifact.get("normalize_embeddings", True),
+)
+probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
+threshold = artifact["validation_best_threshold"]["threshold"]
+for text, probability in zip(texts, probabilities):
+    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
+    print({"text": text, "probability_positive": float(probability), "label": label})
+```
+## Files
+- `REPORT.md`: Markdown report for this training run.
+- `report.json`: Machine-readable report containing metrics and thresholds.
+- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
+- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
+- `*/validation_predictions.csv` and `*/test_predictions.csv`: Split-level predictions.
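
The rows of the README's "Threshold Comparison on Test Split" table can be cross-checked against its confusion matrices. The following is a minimal sketch, not part of the committed README: the helper name `binary_metrics` is hypothetical, and the counts are copied from the `transformer at threshold 0.500` matrix, treating `RELEVANT` as the positive class.

```python
def binary_metrics(tn, fp, fn, tp):
    """Return accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts from the README's "transformer at threshold 0.500" confusion matrix:
# row NOT_RELEVANT = (873, 31), row RELEVANT = (20, 92).
transformer = binary_metrics(tn=873, fp=31, fn=20, tp=92)
print({name: round(value, 3) for name, value in transformer.items()})
```

Rounded to three decimals, these values agree with the `transformer` row at threshold 0.500 (accuracy 0.950, precision 0.748, recall 0.821, F1 0.783). The same check can be repeated for any other matrix in the README.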

REPORT.md ADDED

	@@ -0,0 +1,140 @@

+# AMIS Commodity Classifier Training Report
+- Dataset: `faodl/amis-agri-utilization`
+- Dataset subset: ``
+- Text column: `chunk_text`
+- Label column: `label`
+- Transformer: `FacebookAI/xlm-roberta-base`
+- Generated at: `2026-05-25T19:23:29.605062+00:00`
+## Dataset Summary
+| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
+| --- | ---: | ---: | ---: | ---: | ---: |
+| train | 4877 | 4347 | 530 | 2513 | 696.6 |
+| validation | 978 | 899 | 79 | 538 | 690.6 |
+| test | 1016 | 904 | 112 | 539 | 690.7 |
+## Threshold Comparison on Test Split
+| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
+| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
+| logistic_tfidf | 0.500 | 0.926 | 0.691 | 0.598 | 0.641 | 0.899 | 0.726 |
+| logistic_tfidf | 0.608 | 0.930 | 0.902 | 0.411 | 0.564 | 0.899 | 0.726 |
+| xgboost_tfidf | 0.500 | 0.924 | 1.000 | 0.312 | 0.476 | 0.892 | 0.692 |
+| xgboost_tfidf | 0.177 | 0.918 | 0.663 | 0.527 | 0.587 | 0.892 | 0.692 |
+| embedding-logistic_sentence_embeddings | 0.500 | 0.899 | 0.524 | 0.866 | 0.653 | 0.952 | 0.759 |
+| embedding-logistic_sentence_embeddings | 0.616 | 0.929 | 0.632 | 0.857 | 0.727 | 0.952 | 0.759 |
+| embedding-svm_sentence_embeddings | 0.500 | 0.941 | 0.771 | 0.661 | 0.712 | 0.952 | 0.743 |
+| embedding-svm_sentence_embeddings | 0.276 | 0.935 | 0.667 | 0.821 | 0.736 | 0.952 | 0.743 |
+| embedding-lightgbm_sentence_embeddings | 0.500 | 0.946 | 0.788 | 0.696 | 0.739 | 0.959 | 0.801 |
+| embedding-lightgbm_sentence_embeddings | 0.052 | 0.933 | 0.657 | 0.821 | 0.730 | 0.959 | 0.801 |
+| transformer | 0.500 | 0.950 | 0.748 | 0.821 | 0.783 | 0.951 | 0.785 |
+| transformer | 0.616 | 0.950 | 0.748 | 0.821 | 0.783 | 0.951 | 0.785 |
+## Confusion Matrices on Test Split
+Rows are true labels and columns are predicted labels.
+### logistic_tfidf at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 874 | 30 |
+| RELEVANT | 45 | 67 |
+### logistic_tfidf at threshold 0.608
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 899 | 5 |
+| RELEVANT | 66 | 46 |
+### xgboost_tfidf at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 904 | 0 |
+| RELEVANT | 77 | 35 |
+### xgboost_tfidf at threshold 0.177
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 874 | 30 |
+| RELEVANT | 53 | 59 |
+### embedding-logistic_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 816 | 88 |
+| RELEVANT | 15 | 97 |
+### embedding-logistic_sentence_embeddings at threshold 0.616
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 848 | 56 |
+| RELEVANT | 16 | 96 |
+### embedding-svm_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 882 | 22 |
+| RELEVANT | 38 | 74 |
+### embedding-svm_sentence_embeddings at threshold 0.276
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 858 | 46 |
+| RELEVANT | 20 | 92 |
+### embedding-lightgbm_sentence_embeddings at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 883 | 21 |
+| RELEVANT | 34 | 78 |
+### embedding-lightgbm_sentence_embeddings at threshold 0.052
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 856 | 48 |
+| RELEVANT | 20 | 92 |
+### transformer at threshold 0.500
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 873 | 31 |
+| RELEVANT | 20 | 92 |
+### transformer at threshold 0.616
+| True / Predicted | NOT_RELEVANT | RELEVANT |
+| --- | ---: | ---: |
+| NOT_RELEVANT | 873 | 31 |
+| RELEVANT | 20 | 92 |
+## Validation-Tuned Thresholds
+- `logistic_tfidf`: threshold `0.608` (validation F1 `0.578`); test F1 change vs 0.5: `-0.077`.
+- `xgboost_tfidf`: threshold `0.177` (validation F1 `0.581`); test F1 change vs 0.5: `+0.111`.
+- `embedding-logistic_sentence_embeddings`: threshold `0.616` (validation F1 `0.728`); test F1 change vs 0.5: `+0.074`.
+- `embedding-svm_sentence_embeddings`: threshold `0.276` (validation F1 `0.731`); test F1 change vs 0.5: `+0.024`.
+- `embedding-lightgbm_sentence_embeddings`: threshold `0.052` (validation F1 `0.739`); test F1 change vs 0.5: `-0.009`.
+- `transformer`: threshold `0.616` (validation F1 `0.807`); test F1 change vs 0.5: `+0.000`.
+## Artifacts
+- `logistic_tfidf`: `/content/agri-utilization-classifier/baselines/logistic`
+- `xgboost_tfidf`: `/content/agri-utilization-classifier/baselines/xgboost`
+- `embedding-logistic_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-logistic`
+- `embedding-svm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-svm`
+- `embedding-lightgbm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-lightgbm`
+- `transformer`: `/content/agri-utilization-classifier/transformer`
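
The report lists validation-tuned thresholds but does not show how they were selected. One common approach, sketched here as an assumption rather than as the procedure this run used, is to scan candidate thresholds on the validation split and keep the one with the highest F1. The function names and the toy labels and probabilities below are hypothetical.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 for the positive class when predicting 1 where probability >= threshold."""
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(labels, probabilities):
    """Return (threshold, f1), trying each distinct predicted probability as a cut-off.

    Ties on F1 are broken in favour of the lower threshold.
    """
    candidates = sorted(set(probabilities))
    best = max(
        candidates,
        key=lambda t: (f1_at_threshold(labels, probabilities, t), -t),
    )
    return best, f1_at_threshold(labels, probabilities, best)


# Hypothetical validation labels and positive-class probabilities.
val_labels = [0, 0, 1, 0, 1, 1]
val_probabilities = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]

threshold, f1 = best_f1_threshold(val_labels, val_probabilities)
print({"threshold": threshold, "validation_f1": round(f1, 3)})
```

A threshold chosen this way is fitted to the validation split, so it does not always help on the test split. The report shows this: the tuned threshold lowers test F1 for `logistic_tfidf` (`-0.077`) and raises it for `xgboost_tfidf` (`+0.111`).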

baselines/embedding-lightgbm/embedding-lightgbm.joblib ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:3a14be333902e726d49155cf98ec689843edfa4320b39724da54a187bea078e8
+size 1467460
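
The three added lines are a Git LFS pointer, not the model binary itself: the diff records only the spec version, the SHA-256 object ID, and the size in bytes. As an illustration, a pointer of this shape can be parsed with a few lines of Python. The helper name `parse_lfs_pointer` is hypothetical, and the sketch assumes the simple `key value` layout shown above.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer made of 'key value' lines into a dictionary."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer content copied from the diff above (without the leading '+').
pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:3a14be333902e726d49155cf98ec689843edfa4320b39724da54a187bea078e8
size 1467460
"""

pointer = parse_lfs_pointer(pointer_text)
print(pointer["hash_algorithm"], pointer["size_bytes"])
```

The recorded size (1467460 bytes, roughly 1.4 MiB) is that of the stored `.joblib` file. A downloaded copy could be checked against the digest with `hashlib.sha256`.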

baselines/embedding-lightgbm/test_predictions.csv ADDED