DistilBERT Fine-tuned for News Article Classification (Agriculture vs. Non-Agriculture)

Model Description

This model is a fine-tuned version of distilbert-base-cased optimized for binary classification of news articles as agriculture-related or non-agriculture-related. The relevance criteria align with broader, publicly-facing agricultural topics.

  • Developed by: Wadhwani AI
  • Model Type: Text Classification (Binary Classifier)
  • Base Model: distilbert-base-cased
  • Model Version: relevance_classifier_v2.0
  • Language(s): English

Training Data

This model was fine-tuned on a custom dataset of news articles.

  • Dataset Description: The dataset consists of news articles labeled as either agriculture (Relevant) or non-agriculture (Irrelevant). The relevance criteria cover broader, publicly-facing agricultural topics.
  • Data Sources: A total of 40,295 articles were sourced from Common Crawl and Google News.
  • Text Fields Used: Only the title and short description from the news source metadata were used for training and evaluation. Full article text was not included.
  • Data Collection/Annotation: The dataset was annotated using a hybrid pipeline combining LLM-based preprocessing with detailed human annotation. First, an LLM (gpt-4.1-mini) was prompted to identify articles that were obviously relevant to agriculture, such as reports on aphid infestations in cotton or farmer suicides related to loan defaults. These articles were considered "obvious" because they directly and explicitly describe agricultural events; the LLM was instructed to select only high-confidence cases, to list the keywords that informed its decision, and to return an empty output for articles it did not deem relevant at this stage. Next, all remaining articles were passed through a second LLM prompt designed to flag articles that were obviously irrelevant. Any articles still unclassified after both rounds were sent to human annotators, who labeled them according to the detailed guidelines provided in annotation.md. In total, the LLM labeled 9,297 obvious cases, while 30,998 articles went through detailed manual annotation following the established guidelines.
  • Class Distribution:
    • Agriculture: 17,183 articles (43%)
    • Non-Agriculture: 23,114 articles (57%)
  • Data Splits (80:10:10):
    • Train: 32,236 articles (13,746 Agriculture, 18,490 Non-Agriculture)
    • Validation: 4,029 articles (1,718 Agriculture, 2,311 Non-Agriculture)
    • Test: 4,030 articles (1,719 Agriculture, 2,311 Non-Agriculture)
  • Preprocessing: Text was tokenized using the DistilBERT tokenizer and truncated to a maximum sequence length of 512 tokens.
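The two-stage annotation pipeline described above can be sketched as a simple routing rule. This is an illustrative stand-in, not the actual pipeline code: the function name and its inputs are hypothetical, with the two boolean-ish arguments representing the outputs of the two LLM prompts.

```python
def route_label(stage1_keywords: list[str], stage2_irrelevant: bool) -> str:
    """Decide where an article's label comes from (hypothetical sketch).

    - Stage 1 returns supporting keywords only for obviously relevant
      articles, and an empty list otherwise.
    - Stage 2 flags obviously irrelevant articles among the remainder.
    - Everything still unclassified goes to human annotators.
    """
    if stage1_keywords:
        return "llm:agriculture"
    if stage2_irrelevant:
        return "llm:non-agriculture"
    return "human"
```

Under this scheme, 9,297 articles resolved at one of the two LLM stages and 30,998 fell through to manual annotation.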

Training Procedure

  • Framework: Hugging Face Transformers (PyTorch)
  • Hardware: 1x NVIDIA T4 GPU
  • Architecture: The base DistilBERT encoder layers (embedding layer and transformer layers 0-4) were frozen to preserve pretrained representations, while the final transformer layer (layer 5) and the classification head (pre-classifier and classifier layers) were kept trainable to enable task-specific adaptation for binary classification.
  • Key Hyperparameters:
    • Learning Rate: 3e-5
    • Batch Size: 16
    • Number of Epochs: Trained for 4 of a planned 8 epochs; the lowest validation loss was observed at epoch 2
    • Optimizer: AdamW
    • Max Sequence Length: 512
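The freezing scheme above (embeddings and transformer layers 0-4 frozen; layer 5 and the classification head trainable) can be expressed as a predicate over DistilBERT parameter names. The helper below is a minimal sketch, assuming the standard parameter naming of Hugging Face's DistilBertForSequenceClassification; in practice you would iterate model.named_parameters() and set requires_grad accordingly.

```python
# Frozen parts: embeddings and transformer layers 0-4.
FROZEN_PREFIXES = tuple(
    ["distilbert.embeddings"]
    + [f"distilbert.transformer.layer.{i}" for i in range(5)]
)

def is_trainable(param_name: str) -> bool:
    """True if the parameter should receive gradient updates
    (layer 5, pre_classifier, classifier); False if frozen."""
    return not param_name.startswith(FROZEN_PREFIXES)
```

Applying it would look like: `for name, p in model.named_parameters(): p.requires_grad = is_trainable(name)`.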

Evaluation

The model was evaluated on a held-out test set of 4,030 articles.

  • Evaluation Metrics at a 0.5 decision threshold (Test Set):
    • Accuracy: 0.896
    • Precision: 0.878
    • Recall: 0.878
    • F1-score: 0.878
    • AUPRC: 0.97
    • AUROC: 0.97

Usage Guidelines

You can use this model with the Hugging Face transformers library.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "wadhwani-ai/agriculture-news-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for deterministic inference

def classify_article(text):
    """Return the predicted label for a news title/short description."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    return "Agriculture" if prediction == 1 else "Non-Agriculture"

# Example usage
article_1 = "Scientists discover a new drought-resistant crop variety."
article_2 = "The stock market saw a significant rebound today."
print(f"Article 1: '{article_1}' -> Classification: {classify_article(article_1)}")
print(f"Article 2: '{article_2}' -> Classification: {classify_article(article_2)}")

def get_probabilities(text):
    """Return class probabilities instead of a hard label."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True,
                       padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    probabilities = torch.softmax(outputs.logits, dim=-1)[0]
    return {
        "non_agriculture_prob": probabilities[0].item(),
        "agriculture_prob": probabilities[1].item(),
    }

print(f"\nProbabilities for Article 1: {get_probabilities(article_1)}")
print(f"Probabilities for Article 2: {get_probabilities(article_2)}")
```
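Since the reported metrics use a 0.5 decision threshold, the probability output also lets you trade precision against recall by moving that cutoff. A minimal sketch (the function name is ours, not part of the model's API):

```python
def classify_with_threshold(agri_prob: float, threshold: float = 0.5) -> str:
    """Label an article from its agriculture probability.

    Raising the threshold above 0.5 favors precision;
    lowering it favors recall."""
    return "Agriculture" if agri_prob >= threshold else "Non-Agriculture"
```

For example, feeding it `get_probabilities(text)["agriculture_prob"]` with `threshold=0.8` would keep only high-confidence agriculture predictions.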

Intended Uses

This model is intended for:

  • Classifying news articles into "agriculture" or "non-agriculture" categories based on broader, publicly-facing agricultural topics.
  • Automated content moderation or topic categorization in news feeds.
  • Filtering or routing news articles based on agricultural relevance.

Limitations and Biases

  • Domain Specificity: The model is optimized for news articles. Its performance may degrade substantially when applied to text from other domains such as academic literature, social media, or informal content.
  • Data Bias: Model behavior and potential biases are intrinsically linked to the training data. The relevance criteria follow broader, publicly-facing agricultural topics, which may introduce geographic or topical biases. Users are encouraged to conduct independent bias assessments on their target datasets.
  • Binary Nature: The model performs binary classification (agriculture vs. non-agriculture) and does not support finer-grained categorization within agriculture or across other sectors.
  • Factual Accuracy: The model identifies topical relevance only and does not assess or verify the factual correctness of the content.
  • Input Limitation: Training was conducted on news headlines and short descriptions. Applying the model to full-length articles or other text formats may yield inconsistent results.

Disclaimer

The models in this repository were developed using news sources and annotation guidelines tailored to the agricultural context of India. As a result, this model is unlikely to exhibit reliable performance outside this region. Differences in agricultural practices, terminology, socioeconomic conditions, journalistic conventions, and language use can lead to significant declines in classification accuracy or systematic mislabeling.

These resources should therefore not be assumed to generalize to other countries, domains, or linguistic environments without additional validation, adaptation, or retraining. Users intending to apply this model beyond the original context should conduct their own evaluation and perform appropriate domain adaptation.

Contact

For any queries, please reach out to us at agri-testers@wadhwaniai.org
