Cheese Origin Classification with DistilBERT

Model Description

This repository contains a fine-tuned DistilBERT model for classifying cheese descriptions into their country/region of origin. The model was trained on the aslan-ng/cheese-text dataset as part of Homework 1 for CMU 24-679 (Designing and Deploying AI/ML).

  • Base model: distilbert-base-uncased
  • Task: Multiclass text classification
  • Labels (21 classes): Belgium/Germany, Bulgaria, Cyprus, Denmark, England, France, Germany, Greece, India, Italy, Levant, Mexico, Netherlands, Norway, Peru, Philippines, Poland, Portugal, Spain, Switzerland, USA

Training and Evaluation

  • Train/Val/Test split (augmented data): 640 / 160 / 200
  • External validation (original data): 100
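
The card does not say how the 640 / 160 / 200 split was produced. As a minimal sketch, assuming a plain random shuffle, a 20% test hold-out, and a further 20% of the remainder for validation (the ratios, the seed, and the helper name `split_indices` are illustrative assumptions, not the author's procedure), the same counts can be reproduced for 1,000 examples:

```python
import random

def split_indices(n_examples, test_frac=0.2, val_frac=0.2, seed=0):
    """Shuffle indices, hold out a test set, then carve validation out of the rest."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable

    n_test = int(round(n_examples * test_frac))
    test, rest = indices[:n_test], indices[n_test:]

    n_val = int(round(len(rest) * val_frac))
    val, train = rest[:n_val], rest[n_val:]
    return train, val, test

# Hypothetical usage: 1,000 augmented examples
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 640 160 200
```

A stratified split would be a reasonable alternative with 21 classes, since a purely random split can leave rare labels out of the validation or test portion.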

| Dataset | Accuracy | F1 (Weighted) | Precision | Recall |
|---|---|---|---|---|
| Augmented Test | 0.9300 | 0.9123 | 0.9168 | 0.9300 |
| External Validation | 0.9500 | 0.9323 | 0.9214 | 0.9500 |

The model holds up on the manually curated data as well: it reaches 95% accuracy on the 100 original examples, compared with 93% on the augmented test split.
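
In both rows the reported recall equals the accuracy, which is what support-weighted averaging produces. The sketch below computes accuracy and support-weighted precision, recall, and F1 in plain Python. It assumes the reported precision and recall are weighted like the F1 score, which the card does not state, and the labels are toy values, not the model's actual predictions.

```python
from collections import Counter

def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)  # number of true examples per class
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total

    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        n_pred = sum(p == label for p in y_pred)
        p_c = tp / n_pred if n_pred else 0.0  # zero when the class is never predicted
        r_c = tp / n_true
        f_c = 2 * p_c * r_c / (p_c + r_c) if p_c + r_c else 0.0
        weight = n_true / total  # each class counts in proportion to its support
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Toy labels for illustration only
y_true = ["Cyprus", "Greece", "Greece", "Norway", "Denmark", "Denmark"]
y_pred = ["Greece", "Greece", "Greece", "Denmark", "Denmark", "Denmark"]
m = weighted_metrics(y_true, y_pred)
print(m)
```

Weighted recall sums the correct predictions per class and divides by the total, so it matches accuracy by construction. Weighted F1 can fall below both precision and recall, as it does in the table.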


Error Analysis

Common confusions occur between geographically or culturally close regions:

  • Germany vs Switzerland (e.g., Limburger → Switzerland)
  • Norway vs Denmark (e.g., Jarlsberg → Denmark)
  • Cyprus vs Greece (e.g., Halloumi → Greece)
  • Philippines vs Spain (e.g., Queso de Bola → Spain)

These errors reflect real-world overlaps in cheese naming and history.
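
One way to surface confusions like these is to count (true, predicted) pairs among the misclassified examples. This is a minimal sketch with toy labels; the helper name `top_confusions` is hypothetical, and the card does not describe how the error analysis was actually carried out.

```python
from collections import Counter

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent (true, predicted) pairs among misclassified examples."""
    errors = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return errors.most_common(k)

# Toy labels for illustration only, not the model's real outputs
y_true = ["Cyprus", "Cyprus", "Norway", "Greece", "Philippines", "Italy"]
y_pred = ["Greece", "Greece", "Denmark", "Greece", "Spain", "Italy"]

confusions = top_confusions(y_true, y_pred)
for (true_label, pred_label), count in confusions:
    print(f"{true_label} -> {pred_label}: {count}")
```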


How to Use

Install dependencies

```bash
pip install transformers torch datasets
```

Load model and tokenizer

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Load the fine-tuned checkpoint and its tokenizer from the Hub
model_id = "cassieli226/cheese-text-distilbert-predictor"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "Halloumi is a semi-hard Cypriot cheese known for its high melting point."
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients and take the highest-scoring class
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class = logits.argmax(dim=-1).item()

print("Predicted class:", model.config.id2label[predicted_class])
```
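
Because neighbouring regions are easy to confuse, it can help to look at the top few candidates and not only the argmax. The sketch below turns a list of logits into softmax probabilities and ranks the labels. It is written in plain Python so it runs without the model; the helper name `top_k_labels`, the three-class logits, and the label map are hypothetical.

```python
import math

def top_k_labels(logits, id2label, k=3):
    """Convert raw logits to softmax probabilities and return the k most likely labels."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]

# Hypothetical logits and label map for illustration only
toy_logits = [2.0, 1.5, -1.0]
toy_id2label = {0: "Cyprus", 1: "Greece", 2: "Norway"}

ranked = top_k_labels(toy_logits, toy_id2label, k=2)
for label, prob in ranked:
    print(f"{label}: {prob:.3f}")
```

With the real model, the same function should accept `logits[0].tolist()` and `model.config.id2label` from the snippet above.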


Dataset

  • Source: aslan-ng/cheese-text
  • Structure:
    • `original`: 100 manually curated examples
    • `augmented`: 1000 synthetic examples (paraphrased and simplified)

Intended Use

  • Educational demonstration of fine-tuning DistilBERT for multiclass classification.
  • Baseline for exploring text augmentation and error analysis in NLP coursework.

Limitations

  • Dataset is small (≈1,100 examples), so predictions may be sensitive to phrasing.
  • Cultural and regional overlaps in cheese descriptions can lead to ambiguities.

Acknowledgments
