	---
	library_name: transformers
	pipeline_tag: text-classification
	base_model: FacebookAI/xlm-roberta-base
	tags:
	- text-classification
	- binary-classification
	- amis
	- agriculture
	language: multilingual
	---

	# AMIS Commodity Classifier

	This model repository contains artifacts from an AMIS commodity relevance classifier training run.
	It includes the Transformer model, any configured TF-IDF or sentence-embedding baselines, prediction files, and the training report.

	- Dataset: `faodl/amis-agri-utilization`
	- Dataset subset: none specified
	- Dataset revision: `ada4a04088a98f8f64bc7485c57d4c7f422c2151`
	- Text column: `chunk_text`
	- Label column: `label`
	- Transformer: `FacebookAI/xlm-roberta-base`
	- Generated at: `2026-05-27T10:50:45.867038+00:00`

	## Dataset Summary

	| Split | Rows | Label 0 (NOT_RELEVANT) | Label 1 (RELEVANT) | Unique groups | Mean text length |
	| --- | ---: | ---: | ---: | ---: | ---: |
	| train | 4877 | 4347 | 530 | 2513 | 696.6 |
	| validation | 978 | 899 | 79 | 538 | 690.6 |
	| test | 1016 | 904 | 112 | 539 | 690.7 |
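
The label counts above imply a strong class imbalance. As a rough illustration, the sketch below recomputes the positive rate and the accuracy of a trivial "always predict the majority class" rule from the table. The counts are copied from the table; the helper name `class_balance` is illustrative and not part of the training code.

```python
# Counts copied from the Dataset Summary table: (label 0, label 1) per split.
SPLIT_COUNTS = {
    "train": (4347, 530),
    "validation": (899, 79),
    "test": (904, 112),
}


def class_balance(negatives, positives):
    """Return the positive rate and the accuracy of always predicting the majority class."""
    total = negatives + positives
    return {
        "positive_rate": positives / total,
        "majority_accuracy": max(negatives, positives) / total,
    }


balance = {name: class_balance(*counts) for name, counts in SPLIT_COUNTS.items()}
for name, stats in balance.items():
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

On the test split, always predicting label 0 would already reach about 0.890 accuracy, so precision, recall, F1, and average precision are more informative than accuracy for this task.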

	## Threshold Comparison on Validation Split

	Validation metrics document threshold selection and tuning behavior; test metrics remain the primary estimate of out-of-sample performance.

	| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
	| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
	| logistic_tfidf | 0.500 | 0.912 | 0.465 | 0.582 | 0.517 | 0.872 | 0.594 |
	| logistic_tfidf | 0.608 | 0.942 | 0.696 | 0.494 | 0.578 | 0.872 | 0.594 |
	| xgboost_tfidf | 0.500 | 0.945 | 0.931 | 0.342 | 0.500 | 0.823 | 0.588 |
	| xgboost_tfidf | 0.177 | 0.934 | 0.592 | 0.570 | 0.581 | 0.823 | 0.588 |
	| embedding-logistic_sentence_embeddings | 0.500 | 0.912 | 0.476 | 0.861 | 0.613 | 0.953 | 0.762 |
	| embedding-logistic_sentence_embeddings | 0.722 | 0.957 | 0.703 | 0.810 | 0.753 | 0.953 | 0.762 |
	| embedding-svm_sentence_embeddings | 0.500 | 0.955 | 0.807 | 0.582 | 0.676 | 0.952 | 0.754 |
	| embedding-svm_sentence_embeddings | 0.310 | 0.957 | 0.713 | 0.785 | 0.747 | 0.952 | 0.754 |
	| embedding-lightgbm_sentence_embeddings | 0.500 | 0.954 | 0.750 | 0.646 | 0.694 | 0.948 | 0.782 |
	| embedding-lightgbm_sentence_embeddings | 0.042 | 0.952 | 0.670 | 0.797 | 0.728 | 0.948 | 0.782 |
	| transformer | 0.500 | 0.970 | 0.798 | 0.848 | 0.822 | 0.966 | 0.854 |
	| transformer | 0.471 | 0.971 | 0.800 | 0.861 | 0.829 | 0.966 | 0.854 |

	## Threshold Comparison on Test Split

	| Model | Threshold | Accuracy | Precision | Recall | F1 | ROC AUC | Average precision |
	| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
	| logistic_tfidf | 0.500 | 0.926 | 0.691 | 0.598 | 0.641 | 0.899 | 0.726 |
	| logistic_tfidf | 0.608 | 0.930 | 0.902 | 0.411 | 0.564 | 0.899 | 0.726 |
	| xgboost_tfidf | 0.500 | 0.924 | 1.000 | 0.312 | 0.476 | 0.892 | 0.692 |
	| xgboost_tfidf | 0.177 | 0.918 | 0.663 | 0.527 | 0.587 | 0.892 | 0.692 |
	| embedding-logistic_sentence_embeddings | 0.500 | 0.891 | 0.503 | 0.884 | 0.641 | 0.955 | 0.710 |
	| embedding-logistic_sentence_embeddings | 0.722 | 0.935 | 0.689 | 0.750 | 0.718 | 0.955 | 0.710 |
	| embedding-svm_sentence_embeddings | 0.500 | 0.930 | 0.741 | 0.562 | 0.640 | 0.956 | 0.704 |
	| embedding-svm_sentence_embeddings | 0.310 | 0.934 | 0.686 | 0.741 | 0.712 | 0.956 | 0.704 |
	| embedding-lightgbm_sentence_embeddings | 0.500 | 0.937 | 0.740 | 0.661 | 0.698 | 0.960 | 0.791 |
	| embedding-lightgbm_sentence_embeddings | 0.042 | 0.929 | 0.639 | 0.821 | 0.719 | 0.960 | 0.791 |
	| transformer | 0.500 | 0.951 | 0.777 | 0.777 | 0.777 | 0.968 | 0.817 |
	| transformer | 0.471 | 0.950 | 0.770 | 0.777 | 0.773 | 0.968 | 0.817 |

	## Confusion Matrices on Test Split

	Rows are true labels and columns are predicted labels.

	### logistic_tfidf at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 874 | 30 |
	| RELEVANT | 45 | 67 |

	### logistic_tfidf at threshold 0.608

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 899 | 5 |
	| RELEVANT | 66 | 46 |

	### xgboost_tfidf at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 904 | 0 |
	| RELEVANT | 77 | 35 |

	### xgboost_tfidf at threshold 0.177

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 874 | 30 |
	| RELEVANT | 53 | 59 |

	### embedding-logistic_sentence_embeddings at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 806 | 98 |
	| RELEVANT | 13 | 99 |

	### embedding-logistic_sentence_embeddings at threshold 0.722

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 866 | 38 |
	| RELEVANT | 28 | 84 |

	### embedding-svm_sentence_embeddings at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 882 | 22 |
	| RELEVANT | 49 | 63 |

	### embedding-svm_sentence_embeddings at threshold 0.310

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 866 | 38 |
	| RELEVANT | 29 | 83 |

	### embedding-lightgbm_sentence_embeddings at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 878 | 26 |
	| RELEVANT | 38 | 74 |

	### embedding-lightgbm_sentence_embeddings at threshold 0.042

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 852 | 52 |
	| RELEVANT | 20 | 92 |

	### transformer at threshold 0.500

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 879 | 25 |
	| RELEVANT | 25 | 87 |

	### transformer at threshold 0.471

	| True / Predicted | NOT_RELEVANT | RELEVANT |
	| --- | ---: | ---: |
	| NOT_RELEVANT | 878 | 26 |
	| RELEVANT | 25 | 87 |
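
The test-split metrics can be recomputed from any of the confusion matrices above. The sketch below does this for the transformer at threshold 0.500, using only the four cell counts from that matrix. The function name `metrics_from_confusion` is illustrative and not part of the training code.

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    predicted_positive = tp + fp
    actual_positive = tp + fn
    f1_denominator = 2 * tp + fp + fn
    return {
        "accuracy": (tn + tp) / total if total else 0.0,
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        "recall": tp / actual_positive if actual_positive else 0.0,
        "f1": 2 * tp / f1_denominator if f1_denominator else 0.0,
    }


# Rows are true labels, columns are predicted labels (transformer, threshold 0.500).
metrics = metrics_from_confusion(tn=879, fp=25, fn=25, tp=87)
print({name: round(value, 3) for name, value in metrics.items()})
# {'accuracy': 0.951, 'precision': 0.777, 'recall': 0.777, 'f1': 0.777}
```

The printed values match the `transformer` row at threshold 0.500 in the test-split comparison table.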


	## Validation-Tuned Thresholds

	- `logistic_tfidf`: threshold `0.608` (validation F1 `0.578`); test F1 change vs 0.5: `-0.077`.
	- `xgboost_tfidf`: threshold `0.177` (validation F1 `0.581`); test F1 change vs 0.5: `+0.111`.
	- `embedding-logistic_sentence_embeddings`: threshold `0.722` (validation F1 `0.753`); test F1 change vs 0.5: `+0.077`.
	- `embedding-svm_sentence_embeddings`: threshold `0.310` (validation F1 `0.747`); test F1 change vs 0.5: `+0.073`.
	- `embedding-lightgbm_sentence_embeddings`: threshold `0.042` (validation F1 `0.728`); test F1 change vs 0.5: `+0.021`.
	- `transformer`: threshold `0.471` (validation F1 `0.829`); test F1 change vs 0.5: `-0.003`.
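
The report does not spell out how these thresholds were chosen beyond listing the validation F1 they achieve. One plausible reading is a sweep over candidate thresholds that keeps the one with the highest validation F1. The sketch below illustrates that idea on made-up toy scores; the function names, the candidate set (every distinct score), and the tie-breaking rule are assumptions, not the actual training code.

```python
def f1_at_threshold(labels, scores, threshold):
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_f1_threshold(labels, scores):
    """Return the candidate threshold with the highest F1 (lowest threshold wins ties)."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: (f1_at_threshold(labels, scores, t), -t))


# Toy validation data, invented for illustration only.
toy_labels = [0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.90]

tuned = best_f1_threshold(toy_labels, toy_scores)
print("tuned threshold:", tuned)
print("F1 at tuned threshold:", round(f1_at_threshold(toy_labels, toy_scores, tuned), 3))
print("F1 at 0.5:", round(f1_at_threshold(toy_labels, toy_scores, 0.5), 3))
```

Because the threshold is fitted to the validation split, it may not transfer to new data, which is consistent with the negative test F1 changes listed for `logistic_tfidf` and `transformer`.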

	## Artifacts

	- `logistic_tfidf`: `/content/agri-utilization-classifier/baselines/logistic`
	- `xgboost_tfidf`: `/content/agri-utilization-classifier/baselines/xgboost`
	- `embedding-logistic_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-logistic`
	- `embedding-svm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-svm`
	- `embedding-lightgbm_sentence_embeddings`: `/content/agri-utilization-classifier/baselines/embedding-lightgbm`
	- `transformer`: `/content/agri-utilization-classifier/transformer`
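
The paths above are local to the machine that ran the training. The inference snippets in this card address the same artifacts by repository-relative names such as `transformer` and `baselines/logistic`, which suggests that the repository mirrors the local output folder. Under that assumption, a local path can be mapped to its repository-relative form as sketched below; `LOCAL_ROOT` and `repo_relative` are illustrative names.

```python
from pathlib import PurePosixPath

# Assumption: the repository root corresponds to this local output folder.
LOCAL_ROOT = PurePosixPath("/content/agri-utilization-classifier")


def repo_relative(local_path):
    """Strip the local output folder prefix to get a repository-relative path."""
    return PurePosixPath(local_path).relative_to(LOCAL_ROOT).as_posix()


print(repo_relative("/content/agri-utilization-classifier/transformer"))
print(repo_relative("/content/agri-utilization-classifier/baselines/embedding-svm"))
```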

	## Inference

	Install the runtime dependencies:

	```bash
	pip install transformers torch huggingface_hub pandas joblib scikit-learn xgboost lightgbm
	```

	### Transformer

	```python
	import torch
	from transformers import AutoModelForSequenceClassification, AutoTokenizer

	MODEL_ID = "faodl/agri-utilization-classifier"

	texts = [
	    "Rice export prices increased after new procurement rules were announced.",
	    "The finance ministry released its monthly fuel tax bulletin.",
	]

	tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, subfolder="transformer")
	model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, subfolder="transformer")
	# Use the threshold stored in the model config if present; otherwise fall back to 0.5.
	threshold = float(getattr(model.config, "threshold", 0.5))

	encoded = tokenizer(
	    texts,
	    truncation=True,
	    padding=True,
	    max_length=256,
	    return_tensors="pt",
	)

	with torch.no_grad():
	    logits = model(**encoded).logits
	# Probability of the positive class (index 1) for each text.
	probabilities = torch.softmax(logits, dim=-1)[:, 1].tolist()

	for text, probability in zip(texts, probabilities):
	    label = model.config.id2label[int(probability >= threshold)]
	    print({"text": text, "probability_positive": probability, "label": label})
	```
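
The last two steps of the snippet above (softmax over two logits, then a comparison against the threshold) can be illustrated without downloading the model. The sketch below uses only the standard library. The helper names are illustrative, the logits are made up, and the default label mapping assumes index 0 is `NOT_RELEVANT` and index 1 is `RELEVANT`, as the confusion matrices in this card suggest.

```python
import math


def positive_probability(logit_negative, logit_positive):
    """Softmax probability of class 1 for a two-class logit pair.

    For two classes, softmax reduces to a sigmoid of the logit difference;
    the branch keeps the exponent non-positive to avoid overflow.
    """
    difference = logit_positive - logit_negative
    if difference >= 0:
        return 1.0 / (1.0 + math.exp(-difference))
    exponent = math.exp(difference)
    return exponent / (1.0 + exponent)


def label_for(probability, threshold=0.5, id2label=None):
    """Map a positive-class probability to a label name using a decision threshold."""
    id2label = id2label or {0: "NOT_RELEVANT", 1: "RELEVANT"}  # assumed mapping
    return id2label[int(probability >= threshold)]


# Made-up logits that land between the two thresholds reported for the transformer.
probability = positive_probability(logit_negative=0.1, logit_positive=0.0)
print(round(probability, 3))
print(label_for(probability, threshold=0.5))
print(label_for(probability, threshold=0.471))
```

A borderline example like this one changes label when the threshold moves from 0.500 to 0.471, which is the only kind of input on which the two transformer rows in the comparison tables can differ.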

	### TF-IDF Baselines

	Set `BASELINE` to one of the TF-IDF baseline names available in this run: `logistic` or `xgboost`.

	```python
	import json
	import joblib
	from huggingface_hub import hf_hub_download

	MODEL_ID = "faodl/agri-utilization-classifier"
	BASELINE = "logistic"

	texts = [
	    "Maize production forecasts were revised after delayed rains.",
	    "The central bank published new exchange rate statistics.",
	]

	# Download the fitted scikit-learn pipeline and the run report from the Hub.
	model_path = hf_hub_download(
	    repo_id=MODEL_ID,
	    repo_type="model",
	    filename=f"baselines/{BASELINE}/{BASELINE}_tfidf.joblib",
	)
	report_path = hf_hub_download(
	    repo_id=MODEL_ID,
	    repo_type="model",
	    filename="report.json",
	)

	pipeline = joblib.load(model_path)
	with open(report_path, encoding="utf-8") as handle:
	    report = json.load(handle)

	# Look up the validation-tuned threshold recorded for this baseline.
	threshold = next(
	    result["validation_best_threshold"]["threshold"]
	    for result in report["results"]
	    if result["model_type"] == f"{BASELINE}_tfidf"
	)

	probabilities = pipeline.predict_proba(texts)[:, 1]
	for text, probability in zip(texts, probabilities):
	    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
	    print({"text": text, "probability_positive": float(probability), "label": label})
	```
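
The threshold lookup above raises `StopIteration` when the report has no entry for the requested baseline. A more forgiving variant is sketched below. The field names (`results`, `model_type`, `validation_best_threshold`, `threshold`) are taken from the snippet above; the helper name `threshold_for`, the 0.5 fallback, and the in-memory `sample_report` (which stands in for the real `report.json` and carries only the thresholds listed in this card) are assumptions for illustration.

```python
def threshold_for(report, model_type, default=0.5):
    """Return the validation-tuned threshold for a model type, or a default if missing."""
    for result in report.get("results", []):
        if result.get("model_type") == model_type:
            best = result.get("validation_best_threshold") or {}
            return float(best.get("threshold", default))
    return default


# Minimal stand-in for report.json; the real file contains more fields.
sample_report = {
    "results": [
        {"model_type": "logistic_tfidf", "validation_best_threshold": {"threshold": 0.608}},
        {"model_type": "xgboost_tfidf", "validation_best_threshold": {"threshold": 0.177}},
    ]
}

print(threshold_for(sample_report, "logistic_tfidf"))
print(threshold_for(sample_report, "xgboost_tfidf"))
print(threshold_for(sample_report, "not_in_report"))
```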

	### Sentence-Embedding Baselines

	Set `BASELINE` to one of the embedding baseline names available in this run: `embedding-logistic`, `embedding-svm`, or `embedding-lightgbm`.

	```python
	import joblib
	import torch
	from huggingface_hub import hf_hub_download
	from transformers import AutoModel, AutoTokenizer

	MODEL_ID = "faodl/agri-utilization-classifier"
	BASELINE = "embedding-logistic"

	texts = [
	    "Wheat export inspections rose as demand from importers increased.",
	    "The sports ministry announced a new stadium renovation plan.",
	]

	# The joblib artifact bundles the classifier with the embedding settings used in training.
	model_path = hf_hub_download(
	    repo_id=MODEL_ID,
	    repo_type="model",
	    filename=f"baselines/{BASELINE}/{BASELINE}.joblib",
	)
	artifact = joblib.load(model_path)
	tokenizer = AutoTokenizer.from_pretrained(artifact["embedding_model_name"])
	encoder = AutoModel.from_pretrained(artifact["embedding_model_name"])
	encoder.eval()

	encoded_batches = []
	batch_size = artifact.get("embedding_batch_size", 64)
	for start in range(0, len(texts), batch_size):
	    batch_texts = texts[start : start + batch_size]
	    inputs = tokenizer(
	        batch_texts,
	        padding=True,
	        truncation=True,
	        max_length=artifact.get("embedding_max_length", 256),
	        return_tensors="pt",
	    )
	    with torch.no_grad():
	        outputs = encoder(**inputs)
	    # Mean-pool token embeddings, ignoring padding positions via the attention mask.
	    token_embeddings = outputs.last_hidden_state
	    attention_mask = inputs["attention_mask"].unsqueeze(-1).to(token_embeddings.dtype)
	    embeddings = (token_embeddings * attention_mask).sum(dim=1)
	    embeddings = embeddings / attention_mask.sum(dim=1).clamp(min=1e-9)
	    if artifact.get("normalize_embeddings", True):
	        embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
	    encoded_batches.append(embeddings)
	embeddings = torch.cat(encoded_batches).numpy()
	probabilities = artifact["classifier"].predict_proba(embeddings)[:, 1]
	threshold = artifact["validation_best_threshold"]["threshold"]

	for text, probability in zip(texts, probabilities):
	    label = "RELEVANT" if probability >= threshold else "NOT_RELEVANT"
	    print({"text": text, "probability_positive": float(probability), "label": label})
	```
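
The pooling step in the snippet above averages token embeddings while ignoring padding, then optionally rescales the result to unit length. The sketch below reproduces that arithmetic for a single sequence with plain Python lists, so the effect of the attention mask is easy to see. The function names and the toy vectors are illustrative only.

```python
import math


def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dimension = len(token_embeddings[0])
    kept = max(sum(attention_mask), 1e-9)  # guard against an all-zero mask
    return [
        sum(vector[i] * mask for vector, mask in zip(token_embeddings, attention_mask)) / kept
        for i in range(dimension)
    ]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]


# Three token vectors; the last one is padding and must not affect the result.
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]

pooled = masked_mean_pool(toy_tokens, toy_mask)
unit = l2_normalize(pooled)
print(pooled)
print([round(value, 4) for value in unit])
```

Without the mask, the padding vector would dominate the average, so texts of different lengths in the same batch would get embeddings that depend on how much padding they received.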

	## Files

	- `REPORT.md`: Markdown report for this training run.
	- `report.json`: Machine-readable report containing metrics and thresholds.
	- `transformer/`: Fine-tuned Transformer artifacts, when Transformer training is enabled.
	- `baselines/`: TF-IDF and sentence-embedding baseline artifacts, when baseline training is enabled.
	- `/validation_predictions.csv` and `/test_predictions.csv`: Split-level predictions.