---
library_name: transformers
pipeline_tag: text-classification
base_model: FacebookAI/xlm-roberta-base
tags:
- text-classification
- binary-classification
- amis
- agriculture
language: multilingual
---

# AMIS Commodity Classifier

This model repository contains artifacts from an AMIS commodity relevance classifier training run.
It includes the Transformer model, any configured TF-IDF or sentence-embedding baselines, prediction files, and the training report.

- Dataset: `faodl/amis-agri-maize_corn`
- Dataset subset: ``
- Dataset revision: `main`
- Text column: `chunk_text`
- Label column: `label`
- Transformer: `FacebookAI/xlm-roberta-base`
- Generated at: `2026-06-05T20:32:51.106165+00:00`

## Dataset Summary

| Split | Rows | Label 0 | Label 1 | Unique groups | Mean text length |
| --- | ---: | ---: | ---: | ---: | ---: |
| train | 4724 | 3822 | 902 | 2226 | 702.9 |
| validation | 1060 | 843 | 217 | 477 | 708.3 |
| test | 1054 | 819 | 235 | 478 | 711.9 |
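
The label counts above imply a class imbalance that is worth quantifying before reading the accuracy figures. The following is a minimal sketch that uses only the counts from the table; the helper name `positive_share` is illustrative and not part of the training code.

```python
# Row counts copied from the Dataset Summary table: (label 0, label 1).
splits = {
    "train": (3822, 902),
    "validation": (843, 217),
    "test": (819, 235),
}


def positive_share(negatives: int, positives: int) -> float:
    """Return the fraction of rows that carry label 1."""
    return positives / (negatives + positives)


for name, (neg, pos) in splits.items():
    share = positive_share(neg, pos)
    # A classifier that always predicts label 0 reaches an accuracy of 1 - share.
    print(f"{name}: positive share {share:.3f}, majority-class accuracy {1 - share:.3f}")
```

On the test split the positive share is about 0.223, so always predicting label 0 would already reach roughly 0.777 accuracy. Precision, recall, and F1 are therefore more informative than accuracy alone for this task.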

## Threshold Comparison on Validation Split

Validation metrics document threshold selection and tuning behavior; test metrics remain the primary estimate of out-of-sample performance.

| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.896 | 0.700 | 0.862 | 0.773 | 0.929 | 0.841 |
| logistic_tfidf | 0.586 | 0.915 | 0.777 | 0.820 | 0.798 | 0.929 | 0.841 |
| xgboost_tfidf | 0.500 | 0.957 | 0.913 | 0.871 | 0.892 | 0.967 | 0.914 |
| xgboost_tfidf | 0.379 | 0.958 | 0.902 | 0.894 | 0.898 | 0.967 | 0.914 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.881 | 0.649 | 0.912 | 0.759 | 0.959 | 0.867 |
| embedding-logistic_sentence_embeddings | 0.744 | 0.923 | 0.808 | 0.816 | 0.812 | 0.959 | 0.867 |
| embedding-svm_sentence_embeddings | 0.500 | 0.913 | 0.849 | 0.700 | 0.768 | 0.955 | 0.858 |
| embedding-svm_sentence_embeddings | 0.401 | 0.914 | 0.789 | 0.793 | 0.791 | 0.955 | 0.858 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.916 | 0.791 | 0.802 | 0.796 | 0.963 | 0.878 |
| embedding-lightgbm_sentence_embeddings | 0.145 | 0.916 | 0.746 | 0.894 | 0.813 | 0.963 | 0.878 |
| transformer | 0.500 | 0.958 | 0.913 | 0.876 | 0.894 | 0.973 | 0.943 |
| transformer | 0.328 | 0.959 | 0.907 | 0.894 | 0.900 | 0.973 | 0.943 |

## Threshold Comparison on Test Split

| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| logistic_tfidf | 0.500 | 0.910 | 0.787 | 0.817 | 0.802 | 0.953 | 0.863 |
| logistic_tfidf | 0.586 | 0.915 | 0.857 | 0.740 | 0.795 | 0.953 | 0.863 |
| xgboost_tfidf | 0.500 | 0.942 | 0.914 | 0.817 | 0.863 | 0.968 | 0.920 |
| xgboost_tfidf | 0.379 | 0.948 | 0.895 | 0.868 | 0.881 | 0.968 | 0.920 |
| embedding-logistic_sentence_embeddings | 0.500 | 0.884 | 0.704 | 0.830 | 0.762 | 0.936 | 0.844 |
| embedding-logistic_sentence_embeddings | 0.744 | 0.896 | 0.831 | 0.668 | 0.741 | 0.936 | 0.844 |
| embedding-svm_sentence_embeddings | 0.500 | 0.894 | 0.892 | 0.596 | 0.714 | 0.932 | 0.842 |
| embedding-svm_sentence_embeddings | 0.401 | 0.902 | 0.851 | 0.681 | 0.757 | 0.932 | 0.842 |
| embedding-lightgbm_sentence_embeddings | 0.500 | 0.907 | 0.870 | 0.685 | 0.767 | 0.953 | 0.873 |
| embedding-lightgbm_sentence_embeddings | 0.145 | 0.901 | 0.784 | 0.770 | 0.777 | 0.953 | 0.873 |
| transformer | 0.500 | 0.935 | 0.892 | 0.804 | 0.846 | 0.953 | 0.890 |
| transformer | 0.328 | 0.935 | 0.881 | 0.817 | 0.848 | 0.953 | 0.890 |

## Confusion Matrices on Test Split

Rows are true labels and columns are predicted labels.

### logistic_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 767 | 52 |
| RELEVANT | 43 | 192 |

### logistic_tfidf at threshold 0.586

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 790 | 29 |
| RELEVANT | 61 | 174 |

### xgboost_tfidf at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 801 | 18 |
| RELEVANT | 43 | 192 |

### xgboost_tfidf at threshold 0.379

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 795 | 24 |
| RELEVANT | 31 | 204 |

### embedding-logistic_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 737 | 82 |
| RELEVANT | 40 | 195 |

### embedding-logistic_sentence_embeddings at threshold 0.744

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 787 | 32 |
| RELEVANT | 78 | 157 |

### embedding-svm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 802 | 17 |
| RELEVANT | 95 | 140 |

### embedding-svm_sentence_embeddings at threshold 0.401

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 791 | 28 |
| RELEVANT | 75 | 160 |

### embedding-lightgbm_sentence_embeddings at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 795 | 24 |
| RELEVANT | 74 | 161 |

### embedding-lightgbm_sentence_embeddings at threshold 0.145

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 769 | 50 |
| RELEVANT | 54 | 181 |

### transformer at threshold 0.500

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 796 | 23 |
| RELEVANT | 46 | 189 |

### transformer at threshold 0.328

| True / Predicted | NOT_RELEVANT | RELEVANT |
| --- | ---: | ---: |
| NOT_RELEVANT | 793 | 26 |
| RELEVANT | 43 | 192 |
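
Each row of the test-split comparison can be re-derived from its confusion matrix. As a minimal sketch, the hypothetical helper `metrics_from_confusion` below recomputes accuracy, precision, recall, and F1 from the four cells, treating RELEVANT as the positive class. The counts are those of the transformer at threshold 0.328.

```python
def metrics_from_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Compute binary classification metrics from confusion-matrix cells."""
    total = tn + fp + fn + tp
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Cells read row by row: true NOT_RELEVANT first, then true RELEVANT.
metrics = metrics_from_confusion(tn=793, fp=26, fn=43, tp=192)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.935, 'precision': 0.881, 'recall': 0.817, 'f1': 0.848}
```

The rounded values agree with the `transformer` row at threshold 0.328 in the test-split table.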

## Validation-Tuned Thresholds

- `logistic_tfidf`: threshold `0.586` (validation F1 `0.798`); test F1 change vs 0.5: `-0.007`.
- `xgboost_tfidf`: threshold `0.379` (validation F1 `0.898`); test F1 change vs 0.5: `+0.018`.
- `embedding-logistic_sentence_embeddings`: threshold `0.744` (validation F1 `0.812`); test F1 change vs 0.5: `-0.021`.
- `embedding-svm_sentence_embeddings`: threshold `0.401` (validation F1 `0.791`); test F1 change vs 0.5: `+0.042`.
- `embedding-lightgbm_sentence_embeddings`: threshold `0.145` (validation F1 `0.813`); test F1 change vs 0.5: `+0.010`.
- `transformer`: threshold `0.328` (validation F1 `0.900`); test F1 change vs 0.5: `+0.002`.
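
This card does not state how the tuned thresholds were selected. The reported validation F1 values suggest that each threshold was chosen to maximize F1 on the validation split, and the sketch below illustrates that idea under this assumption. The function names and the toy labels and probabilities are made up for illustration and do not come from the training run.

```python
def f1_at_threshold(labels, probabilities, threshold):
    """F1 for the positive class when predicting 1 if probability >= threshold."""
    pairs = list(zip(labels, probabilities))
    tp = sum(1 for y, p in pairs if y == 1 and p >= threshold)
    fp = sum(1 for y, p in pairs if y == 0 and p >= threshold)
    fn = sum(1 for y, p in pairs if y == 1 and p < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(labels, probabilities):
    """Try every observed probability as a cut-off and keep the best F1.

    Ties are resolved in favor of the lowest candidate threshold.
    """
    candidates = sorted(set(probabilities))
    return max(candidates, key=lambda t: f1_at_threshold(labels, probabilities, t))


# Toy validation data (illustrative only).
val_labels = [0, 0, 1, 0, 1, 1]
val_probabilities = [0.05, 0.20, 0.35, 0.40, 0.70, 0.90]

threshold = best_f1_threshold(val_labels, val_probabilities)
print(threshold, round(f1_at_threshold(val_labels, val_probabilities, threshold), 3))
print(round(f1_at_threshold(val_labels, val_probabilities, 0.5), 3))
```

A threshold tuned this way is fitted to the validation split, which is consistent with some models above losing test F1 relative to the 0.5 default.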

## Artifacts

- `logistic_tfidf`: `/content/agri-maize_corn-classifier/baselines/logistic`
- `xgboost_tfidf`: `/content/agri-maize_corn-classifier/baselines/xgboost`
- `embedding-logistic_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-logistic`
- `embedding-svm_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-svm`
- `embedding-lightgbm_sentence_embeddings`: `/content/agri-maize_corn-classifier/baselines/embedding-lightgbm`
- `transformer`: `/content/agri-maize_corn-classifier/transformer`

## Inference

Install the runtime dependencies:

```bash
pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost lightgbm
```

### Transformer

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "faodl/agri-maize_corn-classifier"
texts = [
    "Rice export prices increased after new procurement rules were announced.",
    "The finance ministry released its monthly fuel tax bulletin.",
]

# The fine-tuned model and tokenizer are stored in the `transformer/` subfolder.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")

# Use the threshold stored in the config when present; otherwise fall back to 0.5.
threshold = float(getattr(model.config, "threshold", 0.5))

encoded = tokenizer(
    texts,
    truncation=True,
    padding=True,
    max_length=256,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**encoded).logits

# Probability of the positive class (index 1) for each text.
probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()

for text, probability in zip(texts, probabilities):
    label = model.config.id2label[int(probability >= threshold)]
    print({"text": text, "probability_positive": probability, "label": label})
```

### TF-IDF Baselines

Available baseline names in this run: `logistic`, `xgboost`.

```python
import json

import joblib
from huggingface_hub import hf_hub_download

MODEL_ID = "faodl/agri-maize_corn-classifier"
BASELINE = "logistic"
texts = [
    "Maize production forecasts were revised after delayed rains.",
    "The central bank published new exchange rate statistics.",
]

# Download the fitted pipeline and the training report from the model repo.
model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
)
report_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename="report.json",
)

pipeline = joblib.load(model_path)
with open(report_path, encoding="utf-8") as handle:
    report = json.load(handle)

# Look up the validation-tuned threshold for the selected baseline.
threshold = next(
    result["validation_best_threshold"]["threshold"]
    for result in report["results"]
    if result["model_type"] == f"{BASELINE}_tfidf"
)

probabilities = pipeline.predict_proba(texts)[:, 1]
for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

### Sentence-Embedding Baselines

Available embedding baseline names in this run: `embedding-logistic`, `embedding-svm`, `embedding-lightgbm`.

```python
import joblib
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "faodl/agri-maize_corn-classifier"
BASELINE = "embedding-logistic"
texts = [
    "Wheat export inspections rose as demand from importers increased.",
    "The sports ministry announced a new stadium renovation plan.",
]

model_path = hf_hub_download(
    repo_id=MODEL_ID,
    repo_type="model",
    filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
)
# The artifact bundles the classifier with the embedding settings used in training.
artifact = joblib.load(model_path)

tokenizer = AutoTokenizer.from_pretrained(artifact["embedding_model_name"])
encoder = AutoModel.from_pretrained(artifact["embedding_model_name"])
encoder.eval()

encoded_batches = []
batch_size = artifact.get("embedding_batch_size", 64)
for start in range(0, len(texts), batch_size):
    batch_texts = texts[start : start + batch_size]
    inputs = tokenizer(
        batch_texts,
        padding=True,
        truncation=True,
        max_length=artifact.get("embedding_max_length", 256),
        return_tensors="pt",
    )
    with torch.no_grad():
        outputs = encoder(**inputs)
    # Mean-pool token embeddings, ignoring padding positions via the attention mask.
    token_embeddings = outputs.last_hidden_state
    attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
    embeddings = (token_embeddings * attention_mask).sum(dim=1)
    embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
    if artifact.get("normalize_embeddings", True):
        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
    encoded_batches.append(embeddings)

embeddings = torch.cat(encoded_batches).numpy()
probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
threshold = artifact["validation_best_threshold"]["threshold"]

for text, probability in zip(texts, probabilities):
    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
    print({"text": text, "probability_positive": float(probability), "label": label})
```

## Files

- `REPORT.md`: Markdown report for this training run.
- `report.json`: Machine-readable report containing metrics and thresholds.
- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
- `*/validation_predictions.csv` and `*/test_predictions.csv`: Split-level predictions.
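
Since `report.json` carries the tuned thresholds, it can be convenient to collect them into a single mapping. The sketch below assumes only the fields that the TF-IDF inference example already reads (`results`, `model_type`, and `validation_best_threshold.threshold`). The helper name `thresholds_by_model` is hypothetical, and `sample_report` is a hand-written stand-in rather than the real file, which may contain further fields and entries.

```python
def thresholds_by_model(report: dict) -> dict:
    """Map each model type in the report to its validation-tuned threshold."""
    return {
        result["model_type"]: result["validation_best_threshold"]["threshold"]
        for result in report["results"]
    }


# Hand-written stand-in shaped like the fields used in the TF-IDF example;
# the threshold values are the ones listed on this card.
sample_report = {
    "results": [
        {"model_type": "logistic_tfidf", "validation_best_threshold": {"threshold": 0.586}},
        {"model_type": "xgboost_tfidf", "validation_best_threshold": {"threshold": 0.379}},
    ]
}

print(thresholds_by_model(sample_report))
```

With the real file, the dictionary returned by `json.load` on the downloaded `report.json` would be passed in place of `sample_report`.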