
# BERT Spanish Clickbait Classifier

Fine-tuned BERT model for detecting clickbait in Spanish news articles.
The model performs binary text classification to determine whether a news item uses clickbait techniques.


## Model Details

### Model Description

This model is a fine-tuned version of dccuchile/bert-base-spanish-wwm-cased, adapted for the task of clickbait detection in Spanish news articles.

Classification is based on the news title, allowing the model to capture both lexical and contextual cues commonly associated with clickbait content.

- **Developed by:** Julen Neila
- **Shared by:** Julen Neila
- **Model type:** Transformer-based text classifier (BERT)
- **Language(s):** Spanish
- **License:** Apache 2.0
- **Finetuned from model:** dccuchile/bert-base-spanish-wwm-cased


### Model Sources

- **Base model:** dccuchile/bert-base-spanish-wwm-cased
- **Framework:** Hugging Face Transformers

## Uses

### Direct Use

The model can be directly used to:

- Detect clickbait in Spanish news headlines and articles
- Support media analysis and journalism studies
- Assist in content moderation and media monitoring pipelines

### Downstream Use

The model can be integrated into:

- News aggregation systems
- Media bias and clickbait analysis
- Academic NLP research projects
- Larger information extraction or classification pipelines

### Out-of-Scope Use

The model is not recommended for:

- Social media posts or informal text
- Non-Spanish content
- Legal, medical, or high-stakes decision-making systems

## Bias, Risks, and Limitations

- The model reflects biases present in the training data.
- It may underperform on very short texts or headlines without sufficient context.
- It may not generalize well to domains outside traditional digital journalism.

### Recommendations

Users should be aware of these limitations and avoid deploying the model in high-impact decision-making contexts without additional validation.


## How to Get Started with the Model

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="JJNeila/bert-spanish-clickbait-oss",
    tokenizer="JJNeila/bert-spanish-clickbait-oss"
)

classifier("Estados Unidos entrena a 25.000 militares (1.400 españoles) para defender el este de Europa")
```
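The pipeline returns a list of dicts with a `label` string and a `score`. A minimal sketch for turning that into a readable verdict; note that the `LABEL_0`/`LABEL_1` strings are an assumption, so check `model.config.id2label` for the actual mapping of this checkpoint:

```python
# Map pipeline output labels to the classes documented in this card.
# NOTE: the "LABEL_0"/"LABEL_1" strings are an assumption; inspect
# model.config.id2label to confirm the real mapping for this checkpoint.
LABEL_NAMES = {"LABEL_0": "non-clickbait", "LABEL_1": "clickbait"}

def interpret(prediction: dict) -> str:
    """Turn one pipeline result dict into a readable verdict string."""
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name} ({prediction['score']:.2f})"

# Example with a fabricated, pipeline-shaped result:
print(interpret({"label": "LABEL_1", "score": 0.97}))  # clickbait (0.97)
```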

## Training Details

### Training Data

The model was trained on a curated dataset of Spanish news articles annotated for clickbait presence.

- **Size:** ~3,163 labeled samples  
- **Labels:**  
  - `0` → Non-clickbait 
  - `1` → Clickbait  

The input format used during training was:

  *title*

### Training Procedure

#### Preprocessing 

- Removal of unlabeled samples  
- Concatenation of title and article text  
- Tokenization using the base BERT Spanish tokenizer  
- Maximum sequence length: **512 tokens**
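A minimal sketch of these steps, assuming the raw dataset is a dict of lists with `title` and `text` fields (the field names are assumptions about the dataset schema; adapt them to the actual columns):

```python
def preprocess(batch, tokenizer):
    """Concatenate title and article text, then tokenize to at most 512 tokens.

    `batch` is assumed to be a dict of lists with "title" and "text" fields;
    unlabeled rows are assumed to have been dropped upstream.
    """
    joined = [f"{title} {body}" for title, body in zip(batch["title"], batch["text"])]
    return tokenizer(joined, truncation=True, max_length=512)

# With the real tokenizer (requires access to the Hugging Face Hub):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
# encoded = preprocess({"title": ["Titular"], "text": ["Cuerpo del artículo"]}, tokenizer)
```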


#### Training Hyperparameters

- **Training regime:** fp16 mixed precision  
- **Optimizer:** AdamW  
- **Learning rate:** 2e-5  
- **Batch size:** 8  
- **Epochs:** 3  
- **Weight decay:** 0.01  
- **Model selection metric:** `eval_loss`
- **Early stopping:** enabled via `EarlyStoppingCallback`
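The hyperparameters above map onto a Hugging Face `Trainer` setup roughly as follows; this is a sketch, and the `output_dir`, evaluation cadence, and early-stopping patience are assumptions not stated in this card:

```python
# Hyperparameters from the table above, using standard TrainingArguments keys.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "fp16": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
    "load_best_model_at_end": True,  # required for EarlyStoppingCallback
}

# Sketch of the training loop (assumes model and datasets are prepared):
# from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
# args = TrainingArguments(output_dir="out", eval_strategy="epoch",
#                          save_strategy="epoch", **HPARAMS)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
# trainer.train()
```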

#### Speeds, Sizes, Times

- **Training time:** ~0.5 hours  
- **Hardware:** NVIDIA T4 GPU  
- **Final model size:** ~440 MB  

## Evaluation


### Testing Data, Factors & Metrics

#### Testing Data

A held-out validation set (20%) stratified by class labels.


#### Metrics

Because of class imbalance considerations, the following metrics were reported:

- Accuracy  
- Precision  
- Recall  
- F1-score  
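These metrics can be computed in a `Trainer`-compatible `compute_metrics` function. The sketch below uses scikit-learn and is not necessarily the exact evaluation code used for this model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute the metrics above from Trainer-style (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Wire it into the Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)
```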

### Results

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.86  |
| Precision | 0.84  |
| Recall    | 0.88  |
| F1-score  | 0.86  |

#### Summary

The model achieves a strong balance between precision and recall, making it particularly effective at identifying clickbait content without excessive false positives.

---

## Environmental Impact

- **Hardware Type:** NVIDIA T4 GPU  
- **Hours used:** ~0.5 hours  
- **Cloud Provider:** Google Colab  
- **Compute Region:** Europe  
- **Carbon Emitted:** Not explicitly measured  

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
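As a back-of-the-envelope illustration of that calculator's method, using the ~0.5 h runtime from this card; every other constant is an assumption (~70 W T4 board power, PUE of 1.1, ~0.3 kg CO2eq/kWh for a European grid mix):

```python
# Rough CO2 estimate: energy = power x time x PUE, emissions = energy x grid intensity.
# All constants except the 0.5 h runtime are assumptions, not measured values.
gpu_power_kw = 0.070       # assumed NVIDIA T4 board power (~70 W)
hours = 0.5                # training time from this card
pue = 1.1                  # assumed datacenter power usage effectiveness
grid_kg_per_kwh = 0.3      # assumed European grid carbon intensity

energy_kwh = gpu_power_kw * hours * pue        # ~0.04 kWh
co2_kg = energy_kwh * grid_kg_per_kwh          # ~0.01 kg CO2eq
```

Even with generous assumptions, a half-hour fine-tune on a single T4 is on the order of tens of grams of CO2eq.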

## Technical Specifications 

### Model Architecture and Objective

- **Architecture:** BERT-base (12 layers, ~110M parameters)  
- **Objective:** Binary cross-entropy loss for text classification  

### Compute Infrastructure


#### Hardware

- NVIDIA T4 GPU (16 GB VRAM)

#### Software

- Python 3.12  
- PyTorch  
- Transformers  
- Hugging Face Datasets  

## Citation

**BibTeX:**

```bibtex
@misc{neila2025clickbait,
  title={BERT Spanish Clickbait Classifier},
  author={Neila, Julen},
  year={2025},
  publisher={Hugging Face}
}
```

## Model Card Authors

**Julen Neila Garcia**

## Model Card Contact

https://huggingface.co/JJNeila