Lobby NER Spain — Spanish News

A spaCy NER model for detecting mentions of officially registered lobbying organisations in Spanish news text. Fine-tuned from es_core_news_lg (spaCy 3.8.7) with a domain-specific LOBBY entity label grounded in four official Spanish transparency registers.

Model Description

This model extends spaCy's large Spanish pipeline with a LOBBY entity label that identifies organisations formally registered as interest groups in one of four official Spanish transparency registers. Unlike the base model's ORG label, which captures any organisation regardless of its role in the political process, LOBBY is restricted to entities that have formally declared their lobbying or representational activities to a Spanish regulatory body.

The model retains the base pipeline's original entity labels (PER, ORG, LOC, MISC) alongside the new LOBBY label. It was fine-tuned class-incrementally with resume_training(), which continues from the existing model weights instead of reinitialising them and so reduces the risk of catastrophic forgetting.

Training Data

The model was trained on a silver-labelled corpus of 287,238 sentences drawn from 19 Spanish news outlets covering the period May 2023 to January 2025. Silver labels were generated by applying a gazetteer of 8,419 registered organisations to the sentence corpus using spaCy's PhraseMatcher, with a capitalisation filter that discards matches whose first character is lowercase.
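The silver-labelling step described above can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' script: the three gazetteer entries and the example sentence are placeholders, the blank Spanish tokenizer stands in for whatever pipeline was actually used, and matching on the LOWER attribute is an assumption (it would explain why a capitalisation filter is needed afterwards).

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Tokenizer-only pipeline; no trained model is needed for phrase matching.
nlp = spacy.blank("es")

# Placeholder entries; the real gazetteer holds 8,419 organisations.
gazetteer = ["CEOE", "CEPYME", "Foment del Treball"]

# Assumption: case-insensitive matching on the LOWER token attribute.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("LOBBY", [nlp.make_doc(name) for name in gazetteer])


def silver_label(text):
    """Return (start_char, end_char, label) tuples for gazetteer matches."""
    doc = nlp.make_doc(text)
    spans = matcher(doc, as_spans=True)
    # Capitalisation filter: discard matches whose first character is lowercase.
    spans = [span for span in spans if not span.text[0].islower()]
    # Resolve overlapping matches by keeping the longest span.
    spans = filter_spans(spans)
    return [(span.start_char, span.end_char, span.label_) for span in spans]


example = "La CEOE y Cepyme piden cambios; el foment del treball no figura."
print(silver_label(example))
```

In this sketch the lowercase "foment del treball" is matched by the gazetteer but then dropped by the capitalisation filter, while "CEOE" and "Cepyme" are kept.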

Outlet distribution (sentences with LOBBY match):

Outlet            Sentences
El Español           62,091
El Confidencial      37,259
OKDiario             30,831
El País              28,274
El Periódico         28,034
20minutos            21,902
La Vanguardia        21,245
El Diario            14,388
Europa Press         13,603
Público              11,602
El Economista        11,190
Cinco Días           11,006
Libertad Digital      8,280
Huffington Post       7,770
El Mundo              7,558
La Razón              7,324
El Salto              4,683
ABC                   3,774
Expansión             3,375

Gazetteer Sources

The entity list was compiled from four official Spanish transparency registers:

Register                                     Entries  Access method
Catalan Lobby Register (Decret Llei 1/2017)    5,930  Public API
CNMC Register (Art. 37, Ley 3/2013)              562  Web scraping
Community of Madrid (Ley 10/2019)              1,833  Web scraping
Valencian Community REGIA (Ley 25/2018)        1,361  Direct CSV download

Entries were manually reviewed to remove acronym-only entries, corporate legal suffixes (S.A., S.L., S.L.U.) without accompanying organisation names, and ambiguous single-token terms that generate systematic false positives in news text.
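The review itself was manual, but a simple pre-filter can surface candidates for a reviewer. The sketch below is an assumption about how such a helper might look, not part of the documented pipeline: the function name, the reason strings, and the contents of AMBIGUOUS_SINGLE_TOKENS are hypothetical, and only the three legal suffixes named above are checked.

```python
# Legal suffixes named in the model card, compared in lowercase.
LEGAL_SUFFIXES = {"s.a.", "s.l.", "s.l.u."}

# Hypothetical examples of single-token names that are also common words.
AMBIGUOUS_SINGLE_TOKENS = {"Progreso", "Unidos"}


def review_reason(name):
    """Return a reason to flag a gazetteer entry for manual review, or None."""
    tokens = name.split()
    if not tokens:
        return "empty"
    # Entry consists only of corporate legal suffixes, with no organisation name.
    if all(token.lower() in LEGAL_SUFFIXES for token in tokens):
        return "legal suffix only"
    # Single all-caps alphabetic token, e.g. a bare acronym.
    if len(tokens) == 1 and tokens[0].isalpha() and tokens[0].isupper():
        return "acronym only"
    # Single token on a hand-maintained list of ambiguous terms.
    if len(tokens) == 1 and tokens[0] in AMBIGUOUS_SINGLE_TOKENS:
        return "ambiguous single token"
    return None


for entry in ["S.L.U.", "UGT", "Progreso", "Foment del Treball Nacional"]:
    print(entry, "->", review_reason(entry))
```

A helper like this only proposes entries; whether a flagged entry is removed remains a human decision.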

Training Procedure

  • Base model: es_core_news_lg (spaCy 3.8.7)
  • New label: LOBBY added via add_label()
  • Initialisation: resume_training(), which continues from existing weights to mitigate catastrophic forgetting
  • Disabled pipes: all non-NER components (morphologiser, parser, sentence recogniser, attribute ruler, lemmatiser)
  • Optimiser: Adam, learning rate 0.001, dropout 0.3, batch size 32
  • Iterations: 15
  • Hardware: NVIDIA GeForce RTX 4070 Ti SUPER (16.7 GB VRAM)
  • Train/test split: 80/20, seed 42 → 229,790 train / 57,448 test
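The procedure listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original training script: the function names, the (text, entity offsets) data layout, and the shuffle-then-cut split are hypothetical, and the optimiser returned by resume_training() is used with its default settings instead of setting the learning rate explicitly.

```python
import random

from spacy.training import Example
from spacy.util import minibatch


def split_80_20(items, seed=42):
    """Shuffle a copy of `items` and cut it 80/20 (splitting method assumed)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 8 // 10
    return shuffled[:cut], shuffled[cut:]


def fine_tune_lobby(nlp, train_data, n_iter=15, batch_size=32, dropout=0.3, seed=42):
    """Add the LOBBY label to an existing NER component and continue training.

    `train_data` is a list of (text, [(start_char, end_char, label), ...]) pairs.
    Returns the accumulated NER loss per iteration.
    """
    ner = nlp.get_pipe("ner")
    ner.add_label("LOBBY")
    # Continue from the existing weights instead of reinitialising them.
    optimizer = nlp.resume_training()
    rng = random.Random(seed)
    data = list(train_data)
    history = []
    # Only the NER component is updated; all other pipes are disabled.
    with nlp.select_pipes(enable="ner"):
        for _ in range(n_iter):
            rng.shuffle(data)
            losses = {}
            for batch in minibatch(data, size=batch_size):
                examples = [
                    Example.from_dict(nlp.make_doc(text), {"entities": ents})
                    for text, ents in batch
                ]
                nlp.update(examples, sgd=optimizer, drop=dropout, losses=losses)
            history.append(losses.get("ner", 0.0))
    return history
```

With the base pipeline this would be called as fine_tune_lobby(spacy.load("es_core_news_lg"), train_data). Applied to 287,238 sentences, split_80_20 yields 229,790 training and 57,448 test items, which matches the reported split sizes.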

Evaluation

Evaluated on the held-out 20% test split (57,448 sentences, seed 42):

Metric                    Value
Precision                0.9964
Recall                   0.9968
F₁                       0.9966
True Positives (spans)   66,560
False Positives (spans)     238
False Negatives (spans)     216

TP/FP/FN refer to entity span counts; multiple spans may occur per sentence.
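As a consistency check, the reported precision, recall and F₁ can be recomputed from the span counts. The sketch below also shows one way span-level counts could be obtained; exact matching on (sentence id, start, end, label) is an assumption, since the matching criterion is not stated here.

```python
def span_counts(gold, predicted):
    """Count TP/FP/FN over sets of (sentence_id, start, end, label) tuples.

    Assumes exact-match scoring: a predicted span counts as correct only if
    its boundaries and label are identical to a gold span.
    """
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from span counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Counts taken from the evaluation table.
p, r, f1 = precision_recall_f1(tp=66560, fp=238, fn=216)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.9964 0.9968 0.9966
```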

Intended Use

This model is intended for large-scale, longitudinal measurement of the media visibility of registered interest groups in Spanish news. It identifies entity spans whose surface form matches organisations present in the four transparency registers. It is not a detector of lobbying activity or influence attempts — the pragmatic role of a mention in context is outside the model's scope.

Suitable research applications include:

  • Tracking media salience of specific interest groups over time
  • Cross-outlet comparisons of lobbying organisation coverage
  • Combining lobby detection with topic classification for issue-specific analyses
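For the first two applications, model output typically ends up as counts per organisation, outlet and time period. The following sketch assumes mentions have already been extracted into (ISO date, outlet, entity text) records; the record layout, the function name and the sample values are all hypothetical.

```python
from collections import Counter


def monthly_salience(mentions):
    """Count LOBBY mentions per (year-month, outlet, entity).

    `mentions` is an iterable of (iso_date, outlet, entity_text) tuples,
    where iso_date looks like "2024-03-05".
    """
    counts = Counter()
    for iso_date, outlet, entity in mentions:
        month = iso_date[:7]  # "YYYY-MM"
        counts[(month, outlet, entity)] += 1
    return counts


# Made-up records for illustration only.
sample = [
    ("2024-03-05", "El País", "CEOE"),
    ("2024-03-17", "El País", "CEOE"),
    ("2024-04-01", "ABC", "CEOE"),
]
for key, n in sorted(monthly_salience(sample).items()):
    print(key, n)
```

Summing over the outlet field gives a per-organisation time series, and summing over the month field gives a cross-outlet comparison.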

Limitations

  • Coverage is bounded by the four Spanish transparency registers used to construct the gazetteer. Organisations absent from these registers (e.g. large multinationals or EU-level associations that have not registered) are outside the model's reliable detection capacity.
  • Evaluation is on silver-labelled data derived from the same gazetteer; no manually annotated gold set is available. The reported scores therefore measure how well the model reproduces gazetteer matches, not its accuracy against human annotation.
  • The model requires text pre-processing to remove diacritics before inference (see usage example below).

Usage

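import spacy
import unicodedata
from huggingface_hub import snapshot_download

def quitar_tildes(text):
    """Strip diacritics: decompose to NFD, then drop combining marks (category Mn).

    Note that this also maps ñ to n and ü to u.
    """
    return "".join(
        c for c in unicodedata.normalize("NFD", str(text))
        if unicodedata.category(c) != "Mn"
    )

# Download the model repository from the Hugging Face Hub and load it with spaCy.
model_path = snapshot_download("lobby-ner-spain/lobby-ner-spain-final")
nlp = spacy.load(model_path)

text = "Telefónica ha pedido al Gobierno que revise la regulación del 5G. La CEOE y CEPYME han convocado una reunión con el Ministerio de Economía."
# The model expects input without diacritics, so normalise before inference.
doc = nlp(quitar_tildes(text))

# Print only LOBBY entities; doc.ents may also contain PER, ORG, LOC and MISC.
for ent in doc.ents:
    if ent.label_ == "LOBBY":
        print(f"{ent.text} [{ent.label_}]")

# Output:
# Telefonica [LOBBY]
# CEOE [LOBBY]
# CEPYME [LOBBY]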
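Because the model sees text without diacritics, the printed entity text loses its accents ("Telefonica"). If the original spelling is needed, character offsets can be mapped back, since stripping combining marks from NFC-normalised Spanish text keeps one output character per input character. The helper below is a sketch under that assumption; original_span is a hypothetical name, and it raises an error if the one-to-one alignment does not hold.

```python
import unicodedata


def quitar_tildes(text):
    """Same normalisation as in the usage example: NFD, then drop Mn marks."""
    return "".join(
        c for c in unicodedata.normalize("NFD", str(text))
        if unicodedata.category(c) != "Mn"
    )


def original_span(original, start_char, end_char):
    """Recover the accented text for offsets computed on the stripped text."""
    nfc = unicodedata.normalize("NFC", original)
    # Guard: offsets only carry over if stripping preserved the length.
    if len(quitar_tildes(nfc)) != len(nfc):
        raise ValueError("offsets do not align one-to-one for this text")
    return nfc[start_char:end_char]


text = "Telefónica ha pedido al Gobierno que revise la regulación del 5G."
# Offsets 0 and 10 correspond to the span "Telefonica" in the stripped text.
print(original_span(text, 0, 10))
```

With spaCy entities, the offsets would come from ent.start_char and ent.end_char on the document created from the stripped text.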

Interactive Demo

An interactive demo is available at https://huggingface.co/spaces/lobby-ner-spain/lobby-ner-spain-demo.

Citation

If you use this model, please cite the accompanying paper (forthcoming):

@article{lobbynerspain2026,
  title   = {Who Gets into the News? A Gazetteer-Based {NER} System for Tracking Lobbying Organisations in {Spanish} Media},
  journal = {Computational Communication Research},
  year    = {2026},
  note    = {Under review}
}

License

MIT
