---
library_name: transformers
pipeline_tag: text-classification
base_model: FacebookAI/xlm-roberta-base
tags:
- text-classification
- binary-classification
- amis
- agriculture
language: multilingual
---

# AMIS Commodity Classifier

This model repository contains artifacts from an AMIS commodity relevance classifier training run. It includes the Transformer model, any configured TF-IDF or sentence-embedding baselines, prediction files, and the training report.

- Dataset: `faodl/amis-agri-wheat`
- Dataset subset: ``
- Dataset revision: `main`
- Text column: `chunk_text`
- Label column: `label`
- Transformer: `FacebookAI/xlm-roberta-base`
- Generated at: `2026-05-29T18:13:08.384805+00:00`

## Dataset Summary

| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
| --- | ---: | ---: | ---: | ---: | ---: |
| train | 3622 | 2163 | 1459 | 1850 | 644.8 |
| validation | 759 | 486 | 273 | 396 | 636.7 |
| test | 762 | 470 | 292 | 397 | 643.3 |

## Threshold Comparison on Validation Split

Validation metrics document threshold selection and tuning behavior; test metrics remain the primary estimate of out-of-sample performance.


| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.818 | 0.718 | 0.813 | 0.763 | 0.907 | 0.867 |
| logistic_tfidf | 0.470 | 0.823 | 0.709 | 0.864 | 0.779 | 0.907 | 0.867 |
| xgboost_tfidf | 0.500 | 0.868 | 0.808 | 0.832 | 0.819 | 0.935 | 0.892 |
| xgboost_tfidf | 0.520 | 0.871 | 0.816 | 0.828 | 0.822 | 0.935 | 0.892 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.783 | 0.658 | 0.824 | 0.732 | 0.862 | 0.780 |
| embedding-logistic_sentence_embeddings | 0.521 | 0.791 | 0.673 | 0.813 | 0.736 | 0.862 | 0.780 |
| embedding-svm_sentence_embeddings | 0.500 | 0.804 | 0.714 | 0.758 | 0.735 | 0.869 | 0.792 |
| embedding-svm_sentence_embeddings | 0.473 | 0.805 | 0.704 | 0.791 | 0.745 | 0.869 | 0.792 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.791 | 0.694 | 0.747 | 0.720 | 0.868 | 0.786 |
| embedding-lightgbm_sentence_embeddings | 0.433 | 0.800 | 0.693 | 0.795 | 0.741 | 0.868 | 0.786 |
| transformer | 0.500 | 0.925 | 0.894 | 0.897 | 0.896 | 0.956 | 0.914 |
| transformer | 0.203 | 0.926 | 0.883 | 0.916 | 0.899 | 0.956 | 0.914 |

## Threshold Comparison on Test Split

| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.803 | 0.715 | 0.808 | 0.759 | 0.888 | 0.827 |
| logistic_tfidf | 0.470 | 0.797 | 0.688 | 0.860 | 0.764 | 0.888 | 0.827 |
| xgboost_tfidf | 0.500 | 0.835 | 0.773 | 0.805 | 0.789 | 0.910 | 0.831 |
| xgboost_tfidf | 0.520 | 0.835 | 0.777 | 0.798 | 0.787 | 0.910 | 0.831 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.782 | 0.699 | 0.757 | 0.727 | 0.877 | 0.821 |
| embedding-logistic_sentence_embeddings | 0.521 | 0.789 | 0.713 | 0.750 | 0.731 | 0.877 | 0.821 |
| embedding-svm_sentence_embeddings | 0.500 | 0.818 | 0.778 | 0.733 | 0.755 | 0.883 | 0.824 |
| embedding-svm_sentence_embeddings | 0.473 | 0.812 | 0.758 | 0.750 | 0.754 | 0.883 | 0.824 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.798 | 0.740 | 0.729 | 0.734 | 0.892 | 0.847 |
| embedding-lightgbm_sentence_embeddings | 0.433 | 0.806 | 0.735 | 0.771 | 0.753 | 0.892 | 0.847 |
| transformer | 0.500 | 0.885 | 0.862 | 0.832 | 0.847 | 0.943 | 0.915 |
| transformer | 0.203 | 0.890 | 0.854 | 0.860 | 0.857 | 0.943 | 0.915 |

## Confusion Matrices on Test Split

Rows are true labels and columns are predicted labels.

### logistic_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 376 | 94 |
| RELEVANT | 56 | 236 |

### logistic_tfidf at threshold 0.470

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 356 | 114 |
| RELEVANT | 41 | 251 |

### xgboost_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 401 | 69 |
| RELEVANT | 57 | 235 |

### xgboost_tfidf at threshold 0.520

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 403 | 67 |
| RELEVANT | 59 | 233 |

### embedding-logistic_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 375 | 95 |
| RELEVANT | 71 | 221 |

### embedding-logistic_sentence_embeddings at threshold 0.521

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 382 | 88 |
| RELEVANT | 73 | 219 |

### embedding-svm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 409 | 61 |
| RELEVANT | 78 | 214 |

### embedding-svm_sentence_embeddings at threshold 0.473

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 400 | 70 |
| RELEVANT | 73 | 219 |

### embedding-lightgbm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 395 | 75 |
| RELEVANT | 79 | 213 |

### embedding-lightgbm_sentence_embeddings at threshold 0.433

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 389 | 81 |
| RELEVANT | 67 | 225 |

### transformer at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 431 | 39 |
| RELEVANT | 49 | 243 |

### transformer at threshold 0.203

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 427 | 43 |
| RELEVANT | 41 | 251 |

## Validation-Tuned Thresholds

- `logistic_tfidf`: threshold `0.470` (validation F1 `0.779`); test F1 change vs 0.5: `+0.005`.
- `xgboost_tfidf`: threshold `0.520` (validation F1 `0.822`); test F1 change vs 0.5: `-0.001`.
- `embedding-logistic_sentence_embeddings`: threshold `0.521` (validation F1 `0.736`); test F1 change vs 0.5: `+0.004`.
- `embedding-svm_sentence_embeddings`: threshold `0.473` (validation F1 `0.745`); test F1 change vs 0.5: `-0.001`.
- `embedding-lightgbm_sentence_embeddings`: threshold `0.433` (validation F1 `0.741`); test F1 change vs 0.5: `+0.018`.
- `transformer`: threshold `0.203` (validation F1 `0.899`); test F1 change vs 0.5: `+0.010`.

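
The report does not spell out how the validation-tuned thresholds were chosen. The sketch below shows one plausible approach, assuming the threshold is picked by a grid search that maximizes F1 on validation predictions. The function names, the candidate grid, and the toy data are all illustrative assumptions, not the training pipeline's actual code.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 for the positive class when predicting 1 iff probability >= threshold."""
    tp = fp = fn = 0
    for label, probability in zip(labels, probabilities):
        predicted = int(probability >= threshold)
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(labels, probabilities, candidates=None):
    """Return (threshold, f1) for the candidate with the highest F1.

    Starts from the 0.5 default and only moves when a candidate is strictly
    better, so ties keep the default. The grid of 0.001 steps is an assumption.
    """
    if candidates is None:
        candidates = [step / 1000 for step in range(1, 1000)]
    best_threshold = 0.5
    best_f1 = f1_at_threshold(labels, probabilities, best_threshold)
    for threshold in candidates:
        score = f1_at_threshold(labels, probabilities, threshold)
        if score > best_f1:
            best_threshold, best_f1 = threshold, score
    return best_threshold, best_f1


# Toy validation data, made up purely to exercise the functions.
toy_labels = [0, 0, 1, 1, 1]
toy_probabilities = [0.10, 0.45, 0.40, 0.60, 0.90]
tuned_threshold, tuned_f1 = best_f1_threshold(toy_labels, toy_probabilities)
print({"threshold": tuned_threshold, "validation_f1": round(tuned_f1, 3)})
```

Whatever the selection rule, the threshold should be chosen on the validation split only and then applied unchanged to the test split, which is how the test F1 changes listed above are meant to be read.
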
## Artifacts

Paths are the local output directories from the training run:

- `logistic_tfidf`: `/content/agri-wheat-classifier/baselines/logistic`
- `xgboost_tfidf`: `/content/agri-wheat-classifier/baselines/xgboost`
- `embedding-logistic_sentence_embeddings`: `/content/agri-wheat-classifier/baselines/embedding-logistic`
- `embedding-svm_sentence_embeddings`: `/content/agri-wheat-classifier/baselines/embedding-svm`
- `embedding-lightgbm_sentence_embeddings`: `/content/agri-wheat-classifier/baselines/embedding-lightgbm`
- `transformer`: `/content/agri-wheat-classifier/transformer`

## Inference

Install the runtime dependencies:

```bash
pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost lightgbm
```

### Transformer

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "YOUR_USERNAME/YOUR_MODEL_REPO"
texts = [
    "Rice export prices increased after new procurement rules were announced.",
    "The finance ministry released its monthly fuel tax bulletin.",
]

# The fine-tuned model and tokenizer live in the `transformer/` subfolder of the repo.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")

# Read the decision threshold from the model config if present; otherwise use 0.5.
threshold = float(getattr(model.config, "threshold", 0.5))

encoded = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=256,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**encoded).logits

# Probability of the positive class (index 1).
probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()

for text, probability in zip(texts, probabilities):
    label = model.config.id2label[int(probability >= threshold)]
    print({"text": text, "probability_positive": probability, "label": label})
```

### TF-IDF Baselines

Available baseline names in this run: "logistic", "xgboost".

```python
import json

import joblib
from huggingface_hub import hf_hub_download

MODEL_ID = "YOUR_USERNAME/YOUR_MODEL_REPO"
BASELINE = "logistic"
texts = [
    "Maize production forecasts were revised after delayed rains.",
    "The central bank published new exchange rate statistics.",
]

# Download the serialized baseline and the machine-readable report.
model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
)
report_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename="report.json",
)

# The loaded pipeline accepts raw text directly.
pipeline = joblib.load(model_path)
with open(report_path, encoding="utf-8") as handle:
    report = json.load(handle)

# Look up the validation-tuned threshold for this baseline in the report.
threshold = next(
    result["validation_best_threshold"]["threshold"]
    for result in report["results"]
    if result["model_type"] == f"{BASELINE}_tfidf"
)

probabilities = pipeline.predict_proba(texts)[:, 1]

for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

### Sentence-Embedding Baselines

Available embedding baseline names in this run: "embedding-logistic", "embedding-svm", "embedding-lightgbm".

```python
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "YOUR_USERNAME/YOUR_MODEL_REPO"
BASELINE = "embedding-logistic"
texts = [
    "Wheat export inspections rose as demand from importers increased.",
    "The sports ministry announced a new stadium renovation plan.",
]

model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
)
# The artifact is a dict holding the classifier, the embedding settings, and the threshold.
artifact = joblib.load(model_path)

tokenizer = AutoTokenizer.from_pretrained(artifact["embedding_model_name"])
encoder = AutoModel.from_pretrained(artifact["embedding_model_name"])
encoder.eval()

encoded_batches = []
batch_size = artifact.get("embedding_batch_size", 64)
for start in range(0, len(texts), batch_size):
    batch_texts = texts[start : start + batch_size]
    inputs = tokenizer(
        batch_texts,
        padding=True,
        truncation=True,
        max_length=artifact.get("embedding_max_length", 256),
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    embeddings = (token_embeddings * attention_mask).sum(dim=1)
    embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
    if artifact.get("normalize_embeddings", True):
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    encoded_batches.append(embeddings)

embeddings = torch.cat(encoded_batches).numpy()
probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
threshold = artifact["validation_best_threshold"]["threshold"]

for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

## Files

- `REPORT.md`: Markdown report for this training run.
- `report.json`: Machine-readable report containing metrics and thresholds.
- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
- `*/validation_predictions.csv` and `*/test_predictions.csv`: Split-level predictions.
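
The split-level prediction files make it possible to recompute the reported metrics independently. The sketch below is a minimal example using only the standard library. The CSV column names `label` and `probability_positive` are assumptions (the actual schema of the prediction files is not documented here), and the helper names are hypothetical. The final check uses the transformer's test confusion matrix at threshold 0.500 from this report.

```python
import csv


def load_predictions(path, label_column="label", probability_column="probability_positive"):
    """Read true labels and positive-class probabilities from a predictions CSV.

    The default column names are assumptions; adjust them to the real file header.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    labels = [int(row[label_column]) for row in rows]
    probabilities = [float(row[probability_column]) for row in rows]
    return labels, probabilities


def confusion_counts(labels, probabilities, threshold):
    """Return (tn, fp, fn, tp), predicting positive iff probability >= threshold."""
    tn = fp = fn = tp = 0
    for label, probability in zip(labels, probabilities):
        predicted = int(probability >= threshold)
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def binary_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "f1": 2 * tp / (2 * tp + fp + fn) if 2 * tp + fp + fn else 0.0,
    }


# Transformer, test split, threshold 0.500 (counts taken from the confusion matrix above).
metrics = binary_metrics(tn=431, fp=39, fn=49, tp=243)
print({name: round(value, 3) for name, value in metrics.items()})
# → {'accuracy': 0.885, 'precision': 0.862, 'recall': 0.832, 'f1': 0.847}
```

With real prediction files, the same check would be `binary_metrics(*confusion_counts(*load_predictions(path), threshold))`, once the column names have been confirmed against the CSV header.
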