---
library_name: transformers
pipeline_tag: text-classification
base_model: FacebookAI/xlm-roberta-base
tags:
- text-classification
- binary-classification
- amis
- agriculture
language: multilingual
---

# AMIS Commodity Classifier

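This repository contains the artifacts from one training run of an AMIS commodity relevance classifier for maize (corn), a binary model that labels a text chunk `RELEVANT` or `NOT_RELEVANT`.
It includes the fine-tuned Transformer, the TF-IDF baselines (`logistic`, `xgboost`), the sentence-embedding baselines (`embedding-logistic`, `embedding-svm`, `embedding-lightgbm`), prediction files, and the training report.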

- Dataset: `faodl/amis-agri-maize_corn`
- Dataset subset: none specified
- Dataset revision: `main`
- Text column: `chunk_text`
- Label column: `label`
- Transformer: `FacebookAI/xlm-roberta-base`
- Generated at: `2026-06-05T20:32:51.106165+00:00`

## Dataset Summary

| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
| --- | ---: | ---: | ---: | ---: | ---: |
| train | 4724 | 3822 | 902 | 2226 | 702.9 |
| validation | 1060 | 843 | 217 | 477 | 708.3 |
| test | 1054 | 819 | 235 | 478 | 711.9 |
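
The label counts above imply a positive-class share of roughly one in five in every split, which matters when reading the accuracy columns further down. A minimal sketch of that arithmetic follows, using only the counts from the table; the `SPLITS` dictionary and the `positive_rate` helper are illustrative names, not part of the training code.

```python
# Label counts copied from the Dataset Summary table: (label 0, label 1).
SPLITS = {
    "train": (3822, 902),
    "validation": (843, 217),
    "test": (819, 235),
}


def positive_rate(negatives: int, positives: int) -> float:
    """Share of rows that carry label 1."""
    return positives / (negatives + positives)


rates = {name: positive_rate(*counts) for name, counts in SPLITS.items()}

for name, rate in rates.items():
    negatives, positives = SPLITS[name]
    # A classifier that always predicts label 0 would reach this accuracy.
    majority_accuracy = 1.0 - rate
    print(
        f"{name}: rows={negatives + positives} "
        f"positive_rate={rate:.3f} "
        f"majority_baseline_accuracy={majority_accuracy:.3f}"
    )
```

On the test split, always predicting label 0 would already score about 0.777 accuracy, so precision, recall, and F1 are the more informative columns in the comparison tables.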

## Threshold Comparison on Validation Split

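Each model is listed twice: first at the default threshold `0.500`, then at the threshold tuned on the validation split. ROC AUC and average precision do not depend on the threshold, so they are identical within each pair. Validation metrics document threshold selection and tuning behavior; test metrics remain the primary estimate of out-of-sample performance.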

| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.896 | 0.700 | 0.862 | 0.773 | 0.929 | 0.841 |
| logistic_tfidf | 0.586 | 0.915 | 0.777 | 0.820 | 0.798 | 0.929 | 0.841 |
| xgboost_tfidf | 0.500 | 0.957 | 0.913 | 0.871 | 0.892 | 0.967 | 0.914 |
| xgboost_tfidf | 0.379 | 0.958 | 0.902 | 0.894 | 0.898 | 0.967 | 0.914 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.881 | 0.649 | 0.912 | 0.759 | 0.959 | 0.867 |
| embedding-logistic_sentence_embeddings | 0.744 | 0.923 | 0.808 | 0.816 | 0.812 | 0.959 | 0.867 |
| embedding-svm_sentence_embeddings | 0.500 | 0.913 | 0.849 | 0.700 | 0.768 | 0.955 | 0.858 |
| embedding-svm_sentence_embeddings | 0.401 | 0.914 | 0.789 | 0.793 | 0.791 | 0.955 | 0.858 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.916 | 0.791 | 0.802 | 0.796 | 0.963 | 0.878 |
| embedding-lightgbm_sentence_embeddings | 0.145 | 0.916 | 0.746 | 0.894 | 0.813 | 0.963 | 0.878 |
| transformer | 0.500 | 0.958 | 0.913 | 0.876 | 0.894 | 0.973 | 0.943 |
| transformer | 0.328 | 0.959 | 0.907 | 0.894 | 0.900 | 0.973 | 0.943 |

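## Threshold Comparison on Test Split

Each model is evaluated at the same two thresholds as in the validation table: the default `0.500` and its validation-tuned value.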

| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.910 | 0.787 | 0.817 | 0.802 | 0.953 | 0.863 |
| logistic_tfidf | 0.586 | 0.915 | 0.857 | 0.740 | 0.795 | 0.953 | 0.863 |
| xgboost_tfidf | 0.500 | 0.942 | 0.914 | 0.817 | 0.863 | 0.968 | 0.920 |
| xgboost_tfidf | 0.379 | 0.948 | 0.895 | 0.868 | 0.881 | 0.968 | 0.920 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.884 | 0.704 | 0.830 | 0.762 | 0.936 | 0.844 |
| embedding-logistic_sentence_embeddings | 0.744 | 0.896 | 0.831 | 0.668 | 0.741 | 0.936 | 0.844 |
| embedding-svm_sentence_embeddings | 0.500 | 0.894 | 0.892 | 0.596 | 0.714 | 0.932 | 0.842 |
| embedding-svm_sentence_embeddings | 0.401 | 0.902 | 0.851 | 0.681 | 0.757 | 0.932 | 0.842 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.907 | 0.870 | 0.685 | 0.767 | 0.953 | 0.873 |
| embedding-lightgbm_sentence_embeddings | 0.145 | 0.901 | 0.784 | 0.770 | 0.777 | 0.953 | 0.873 |
| transformer | 0.500 | 0.935 | 0.892 | 0.804 | 0.846 | 0.953 | 0.890 |
| transformer | 0.328 | 0.935 | 0.881 | 0.817 | 0.848 | 0.953 | 0.890 |

## Confusion Matrices on Test Split

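Rows are true labels and columns are predicted labels. `NOT_RELEVANT` is label 0 and `RELEVANT` is label 1, so each row sums to that label's count in the test split (819 and 235).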
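
Every figure in the test comparison table can be recomputed from these matrices. The sketch below does so for `xgboost_tfidf` at both thresholds, treating `RELEVANT` as the positive class; the `Confusion` class is an illustrative helper, not part of the training code, and it does not guard against empty rows or columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix with RELEVANT as the positive class."""

    tn: int  # true NOT_RELEVANT, predicted NOT_RELEVANT
    fp: int  # true NOT_RELEVANT, predicted RELEVANT
    fn: int  # true RELEVANT, predicted NOT_RELEVANT
    tp: int  # true RELEVANT, predicted RELEVANT

    @property
    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.tn + self.fp + self.fn)

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Counts read row by row from the xgboost_tfidf matrices on the test split.
xgb_default = Confusion(tn=801, fp=18, fn=43, tp=192)  # threshold 0.500
xgb_tuned = Confusion(tn=795, fp=24, fn=31, tp=204)  # threshold 0.379

for name, cm in [("0.500", xgb_default), ("0.379", xgb_tuned)]:
    print(
        f"threshold {name}: accuracy={cm.accuracy:.3f} "
        f"precision={cm.precision:.3f} recall={cm.recall:.3f} f1={cm.f1:.3f}"
    )

# Difference between the tuned and default F1 scores on the test split.
f1_change = xgb_tuned.f1 - xgb_default.f1
print(f"test F1 change vs 0.5: {f1_change:+.3f}")
```

The same four counts from any other matrix can be passed to `Confusion` to check its row in the test table.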

### logistic_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 767 | 52 |
| RELEVANT | 43 | 192 |

### logistic_tfidf at threshold 0.586

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 790 | 29 |
| RELEVANT | 61 | 174 |

### xgboost_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 801 | 18 |
| RELEVANT | 43 | 192 |

### xgboost_tfidf at threshold 0.379

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 795 | 24 |
| RELEVANT | 31 | 204 |

### embedding-logistic_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 737 | 82 |
| RELEVANT | 40 | 195 |

### embedding-logistic_sentence_embeddings at threshold 0.744

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 787 | 32 |
| RELEVANT | 78 | 157 |

### embedding-svm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 802 | 17 |
| RELEVANT | 95 | 140 |

### embedding-svm_sentence_embeddings at threshold 0.401

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 791 | 28 |
| RELEVANT | 75 | 160 |

### embedding-lightgbm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 795 | 24 |
| RELEVANT | 74 | 161 |

### embedding-lightgbm_sentence_embeddings at threshold 0.145

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 769 | 50 |
| RELEVANT | 54 | 181 |

### transformer at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 796 | 23 |
| RELEVANT | 46 | 189 |

### transformer at threshold 0.328

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 793 | 26 |
| RELEVANT | 43 | 192 |


## Validation-Tuned Thresholds

- `logistic_tfidf`: threshold `0.586` (validation F1 `0.798`); test F1 change vs 0.5: `-0.007`.
- `xgboost_tfidf`: threshold `0.379` (validation F1 `0.898`); test F1 change vs 0.5: `+0.018`.
- `embedding-logistic_sentence_embeddings`: threshold `0.744` (validation F1 `0.812`); test F1 change vs 0.5: `-0.021`.
- `embedding-svm_sentence_embeddings`: threshold `0.401` (validation F1 `0.791`); test F1 change vs 0.5: `+0.042`.
- `embedding-lightgbm_sentence_embeddings`: threshold `0.145` (validation F1 `0.813`); test F1 change vs 0.5: `+0.010`.
- `transformer`: threshold `0.328` (validation F1 `0.900`); test F1 change vs 0.5: `+0.002`.
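
This card does not spell out the tuning procedure. The sketch below assumes each threshold was picked by maximizing F1 over candidate cut-offs on validation scores, which is consistent with the validation F1 values listed above but is not confirmed by the report. The function names and the toy labels and scores are illustrative and are not taken from the training code or data.

```python
from typing import Sequence, Tuple


def f1_at(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 of the positive class when predicting 1 iff score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(
    y_true: Sequence[int], scores: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) maximizing F1 over the observed scores.

    Candidates are the distinct scores in ascending order, so ties in F1
    resolve to the lowest threshold.
    """
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at(y_true, scores, t))
    return best, f1_at(y_true, scores, best)


# Toy validation data (hypothetical): 4 positives, 4 negatives.
val_labels = [0, 0, 1, 0, 1, 1, 0, 1]
val_scores = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.10, 0.90]

tuned_threshold, tuned_f1 = best_f1_threshold(val_labels, val_scores)
default_f1 = f1_at(val_labels, val_scores, 0.5)

print(f"tuned threshold={tuned_threshold:.2f} f1={tuned_f1:.3f}")
print(f"default threshold=0.50 f1={default_f1:.3f}")
```

A threshold chosen this way is fitted to the validation split, which is why the list above reports the resulting F1 change on the test split separately; for two of the six models the tuned threshold lowers test F1.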

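## Artifacts

The paths below are the local output directories recorded during the training run. In this repository the same artifacts sit under `baselines/<name>/` and `transformer/`, which are the locations the inference examples load from.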

- `logistic_tfidf`: `/content/agri-maize_corn-classifier/baselines/logistic`
- `xgboost_tfidf`: `/content/agri-maize_corn-classifier/baselines/xgboost`
- `embedding-logistic_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-logistic`
- `embedding-svm_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-svm`
- `embedding-lightgbm_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-lightgbm`
- `transformer`: `/content/agri-maize_corn-classifier/transformer`

## Inference

Install the runtime dependencies:

```bash
pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost lightgbm
```

### Transformer

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "faodl/agri-maize_corn-classifier"

texts = [
    "Rice export prices increased after new procurement rules were announced.",
    "The finance ministry released its monthly fuel tax bulletin.",
]

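# Tokenizer and fine-tuned weights are stored in the repository's `transformer/` subfolder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")
# Read the decision threshold from the model config if present; otherwise fall back to 0.5.
threshold = float(getattr(model.config, "threshold", 0.5))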

encoded = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=256,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**encoded).logits
    # Column 1 of the softmax output is the probability of the positive class.
    probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()

for text, probability in zip(texts, probabilities):
    label = model.config.id2label[int(probability >= threshold)]
    print({"text": text, "probability_positive": probability, "label": label})
```

### TF-IDF Baselines

Available baseline names in this run: `logistic`, `xgboost`.

```python
import json
import joblib
from huggingface_hub import hf_hub_download

MODEL_ID = "faodl/agri-maize_corn-classifier"
BASELINE = "logistic"

texts = [
    "Maize production forecasts were revised after delayed rains.",
    "The central bank published new exchange rate statistics.",
]

model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
)
report_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename="report.json",
)

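# The joblib file holds a fitted pipeline that takes raw strings as input.
pipeline = joblib.load(model_path)
with open(report_path, encoding="utf-8") as handle:
    report = json.load(handle)

# Look up this baseline's validation-tuned threshold in the report.
threshold = next(
    result["validation_best_threshold"]["threshold"]
    for result in report["results"]
    if result["model_type"] == f"{BASELINE}_tfidf"
)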

probabilities = pipeline.predict_proba(texts)[:, 1]
for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

### Sentence-Embedding Baselines

Available embedding baseline names in this run: `embedding-logistic`, `embedding-svm`, `embedding-lightgbm`.

```python
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "faodl/agri-maize_corn-classifier"
BASELINE = "embedding-logistic"

texts = [
    "Wheat export inspections rose as demand from importers increased.",
    "The sports ministry announced a new stadium renovation plan.",
]

model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
)
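# The artifact is a dict bundling the fitted classifier, the embedding model name,
# the embedding settings, and the validation-tuned threshold.
artifact = joblib.load(model_path)
tokenizer = AutoTokenizer.from_pretrained(artifact["embedding_model_name"])
encoder = AutoModel.from_pretrained(artifact["embedding_model_name"])
encoder.eval()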

encoded_batches = []
batch_size = artifact.get("embedding_batch_size", 64)
for start in range(0, len(texts), batch_size):
    batch_texts = texts[start : start + batch_size]
    inputs = tokenizer(
        batch_texts,
        padding=True,
        truncation=True,
        max_length=artifact.get("embedding_max_length", 256),
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool the token embeddings, using the attention mask to ignore padding positions.
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    embeddings = (token_embeddings * attention_mask).sum(dim=1)
    embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
    # Optionally L2-normalize so every embedding has unit length.
    if artifact.get("normalize_embeddings", True):
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    encoded_batches.append(embeddings)
embeddings = torch.cat(encoded_batches).numpy()
probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
threshold = artifact["validation_best_threshold"]["threshold"]

for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

## Files

- `REPORT.md`: Markdown report for this training run.
- `report.json`: Machine-readable report containing metrics and thresholds.
- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
- `*/validation_predictions.csv` and `*/test_predictions.csv`: Split-level predictions.
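
`report.json` can be queried for the tuned thresholds in one pass. The sketch below assumes each entry of `results` carries `model_type` and `validation_best_threshold.threshold`, which is the structure the TF-IDF inference example relies on; other fields of the report are not assumed. The `tuned_thresholds` helper and the inline sample are illustrative, and only the two threshold values are copied from this card.

```python
import json
from typing import Dict


def tuned_thresholds(report: dict) -> Dict[str, float]:
    """Map each model_type in the report to its validation-tuned threshold."""
    return {
        result["model_type"]: float(result["validation_best_threshold"]["threshold"])
        for result in report.get("results", [])
        if "validation_best_threshold" in result
    }


# Hypothetical excerpt shaped like the fields the inference example reads.
sample = json.loads(
    """
    {"results": [
      {"model_type": "logistic_tfidf",
       "validation_best_threshold": {"threshold": 0.586}},
      {"model_type": "xgboost_tfidf",
       "validation_best_threshold": {"threshold": 0.379}}
    ]}
    """
)

thresholds = tuned_thresholds(sample)
print(thresholds)
```

With the real file, replace `sample` with the dictionary loaded from the downloaded `report.json`.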