
# BERT Spanish Clickbait Classifier

Fine-tuned BERT model for detecting clickbait in Spanish news articles.
The model performs binary text classification to determine whether a news item uses clickbait techniques.


## Model Details

### Model Description

This model is a fine-tuned version of dccuchile/bert-base-spanish-wwm-cased, adapted for the task of clickbait detection in Spanish news articles.

Classification is based on the news title, allowing the model to capture both lexical and contextual cues commonly associated with clickbait content.

- **Developed by:** Julen Neila
- **Shared by:** Julen Neila
- **Model type:** Transformer-based text classifier (BERT)
- **Language(s):** Spanish
- **License:** Apache 2.0
- **Finetuned from model:** dccuchile/bert-base-spanish-wwm-cased


### Model Sources

- **Base model:** dccuchile/bert-base-spanish-wwm-cased
- **Framework:** Hugging Face Transformers

## Uses

### Direct Use

The model can be directly used to:

- Detect clickbait in Spanish news headlines and articles
- Support media analysis and journalism studies
- Assist in content moderation and media monitoring pipelines

### Downstream Use

The model can be integrated into:

- News aggregation systems
- Media bias and clickbait analysis
- Academic NLP research projects
- Larger information extraction or classification pipelines

### Out-of-Scope Use

The model is not recommended for:

- Social media posts or informal text
- Non-Spanish content
- Legal, medical, or high-stakes decision-making systems

## Bias, Risks, and Limitations

- The model reflects biases present in the training data.
- It may underperform on very short texts or headlines without sufficient context.
- It may not generalize well to domains outside traditional digital journalism.

### Recommendations

Users should be aware of these limitations and avoid deploying the model in high-impact decision-making contexts without additional validation.


## How to Get Started with the Model

```python
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="JJNeila/bert-spanish-clickbait-oss",
    tokenizer="JJNeila/bert-spanish-clickbait-oss"
)

classifier("Estados Unidos entrena a 25.000 militares (1.400 españoles) para defender el este de Europa")
```
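The pipeline returns a list of dicts with a `label` string and a `score`. A minimal sketch for turning that into a readable verdict; note that the `LABEL_0`/`LABEL_1` strings are an assumption, so check `model.config.id2label` for the actual mapping of this checkpoint:

```python
# Map pipeline output labels to the classes documented in this card.
# NOTE: the "LABEL_0"/"LABEL_1" strings are an assumption; inspect
# model.config.id2label to confirm the real mapping for this checkpoint.
LABEL_NAMES = {"LABEL_0": "non-clickbait", "LABEL_1": "clickbait"}

def interpret(prediction: dict) -> str:
    """Turn one pipeline result dict into a readable verdict string."""
    name = LABEL_NAMES.get(prediction["label"], prediction["label"])
    return f"{name} ({prediction['score']:.2f})"

# Example with a fabricated, pipeline-shaped result:
print(interpret({"label": "LABEL_1", "score": 0.97}))  # clickbait (0.97)
```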

## Training Details

### Training Data

The model was trained on a curated dataset of Spanish news articles annotated for clickbait presence.

- **Size:** ~3,163 labeled samples  
- **Labels:**  
  - `0` → Non-clickbait 
  - `1` → Clickbait  

The input format used during training was:

  *title*

### Training Procedure

#### Preprocessing 

- Removal of unlabeled samples  
- Concatenation of title and article text  
- Tokenization using the base BERT Spanish tokenizer  
- Maximum sequence length: **512 tokens**
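A minimal sketch of these steps, assuming the raw dataset is a dict of lists with `title` and `text` fields (the field names are assumptions about the dataset schema; adapt them to the actual columns):

```python
def preprocess(batch, tokenizer):
    """Concatenate title and article text, then tokenize to at most 512 tokens.

    `batch` is assumed to be a dict of lists with "title" and "text" fields;
    unlabeled rows are assumed to have been dropped upstream.
    """
    joined = [f"{title} {body}" for title, body in zip(batch["title"], batch["text"])]
    return tokenizer(joined, truncation=True, max_length=512)

# With the real tokenizer (requires access to the Hugging Face Hub):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("dccuchile/bert-base-spanish-wwm-cased")
# encoded = preprocess({"title": ["Titular"], "text": ["Cuerpo del artículo"]}, tokenizer)
```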


#### Training Hyperparameters

- **Training regime:** fp16 mixed precision  
- **Optimizer:** AdamW  
- **Learning rate:** 2e-5  
- **Batch size:** 8  
- **Epochs:** 3  
- **Weight decay:** 0.01  
- **Model selection metric:** `eval_loss`
- **Early stopping:** enabled via `EarlyStoppingCallback`
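The hyperparameters above map onto a Hugging Face `Trainer` setup roughly as follows; this is a sketch, and the `output_dir`, evaluation cadence, and early-stopping patience are assumptions not stated in this card:

```python
# Hyperparameters from the table above, using standard TrainingArguments keys.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 3,
    "weight_decay": 0.01,
    "fp16": True,
    "metric_for_best_model": "eval_loss",
    "greater_is_better": False,
    "load_best_model_at_end": True,  # required for EarlyStoppingCallback
}

# Sketch of the training loop (assumes model and datasets are prepared):
# from transformers import TrainingArguments, Trainer, EarlyStoppingCallback
# args = TrainingArguments(output_dir="out", eval_strategy="epoch",
#                          save_strategy="epoch", **HPARAMS)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   callbacks=[EarlyStoppingCallback(early_stopping_patience=2)])
# trainer.train()
```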

#### Speeds, Sizes, Times

- **Training time:** ~0.5 hours  
- **Hardware:** NVIDIA T4 GPU  
- **Final model size:** ~440 MB  

## Evaluation


### Testing Data, Factors & Metrics

#### Testing Data

A held-out validation set (20%) stratified by class labels.


#### Metrics

Because of class imbalance considerations, the following metrics were reported:

- Accuracy  
- Precision  
- Recall  
- F1-score  
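These metrics can be computed in a `Trainer`-compatible `compute_metrics` function. The sketch below uses scikit-learn and is not necessarily the exact evaluation code used for this model:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Compute the metrics above from Trainer-style (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Wire it into the Trainer:
# trainer = Trainer(..., compute_metrics=compute_metrics)
```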

### Results

| Metric    | Value |
|-----------|-------|
| Accuracy  | 0.86  |
| Precision | 0.84  |
| Recall    | 0.88  |
| F1-score  | 0.86  |

#### Summary

The model achieves a strong balance between precision and recall, making it particularly effective at identifying clickbait content without excessive false positives.

---

## Environmental Impact

- **Hardware Type:** NVIDIA T4 GPU  
- **Hours used:** ~0.5 hours  
- **Cloud Provider:** Google Colab  
- **Compute Region:** Europe  
- **Carbon Emitted:** Not explicitly measured  

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
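As a back-of-the-envelope illustration of that calculator's method, using the ~0.5 h runtime from this card; every other constant is an assumption (~70 W T4 board power, PUE of 1.1, ~0.3 kg CO2eq/kWh for a European grid mix):

```python
# Rough CO2 estimate: energy = power x time x PUE, emissions = energy x grid intensity.
# All constants except the 0.5 h runtime are assumptions, not measured values.
gpu_power_kw = 0.070       # assumed NVIDIA T4 board power (~70 W)
hours = 0.5                # training time from this card
pue = 1.1                  # assumed datacenter power usage effectiveness
grid_kg_per_kwh = 0.3      # assumed European grid carbon intensity

energy_kwh = gpu_power_kw * hours * pue        # ~0.04 kWh
co2_kg = energy_kwh * grid_kg_per_kwh          # ~0.01 kg CO2eq
```

Even with generous assumptions, a half-hour fine-tune on a single T4 is on the order of tens of grams of CO2eq.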

## Technical Specifications 

### Model Architecture and Objective

- **Architecture:** BERT-base (12 layers, ~110M parameters)  
- **Objective:** Binary cross-entropy loss for text classification  

### Compute Infrastructure


#### Hardware

- NVIDIA T4 GPU (16 GB VRAM)

#### Software

- Python 3.12  
- PyTorch  
- Transformers  
- Hugging Face Datasets  

## Citation

**BibTeX:**

```bibtex
@misc{neila2025clickbait,
  title={BERT Spanish Clickbait Classifier},
  author={Neila, Julen},
  year={2025},
  publisher={Hugging Face}
}
```

## Model Card Authors

**Julen Neila Garcia**

## Model Card Contact

https://huggingface.co/JJNeila