Lobby NER Spain — Spanish News

A spaCy NER model for detecting mentions of officially registered lobbying organisations in Spanish news text. Fine-tuned from es_core_news_lg (spaCy 3.8.7) with a domain-specific LOBBY entity label grounded in four official Spanish transparency registers.

Model Description

This model extends spaCy's large Spanish pipeline with a LOBBY entity label that identifies organisations formally registered as interest group actors in one of four official Spanish transparency registers. The base model's ORG label captures any organisation regardless of its role in the political process. LOBBY, by contrast, is restricted to entities that have formally declared their lobbying or representational activities to a Spanish regulatory body.

The model keeps all original entity labels (PER, ORG, LOC, MISC) alongside the new LOBBY label. LOBBY was added through class-incremental fine-tuning with resume_training(), which continues from the existing model weights instead of re-initialising them, in order to limit catastrophic forgetting.

Training Data

The model was trained on a silver-labelled corpus of 287,238 sentences drawn from 19 Spanish news outlets covering the period May 2023 to January 2025. Silver labels were generated by applying a gazetteer of 8,419 registered organisations to the sentence corpus using spaCy's PhraseMatcher, with a capitalisation filter that discards matches whose first character is lowercase.
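
The labelling step described above can be sketched as follows. This is a minimal illustration rather than the project's actual pipeline: the three-entry gazetteer, the function name silver_label, and the choice of case-insensitive matching via attr="LOWER" are all assumptions made for the example.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# A blank Spanish pipeline is enough: phrase matching only needs the tokenizer.
nlp = spacy.blank("es")

# Hypothetical entries for illustration; the real gazetteer has 8,419 organisations.
gazetteer = ["Foment del Treball", "Mesa del Turismo"]

# Assumption: matching is case-insensitive (attr="LOWER"), which is why a
# separate capitalisation filter is applied to the matches afterwards.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")
matcher.add("LOBBY", [nlp.make_doc(name) for name in gazetteer])


def silver_label(text):
    """Return (start_char, end_char, label) offsets for gazetteer matches."""
    doc = nlp.make_doc(text)
    spans = [
        span
        for span in matcher(doc, as_spans=True)
        if not span.text[0].islower()  # capitalisation filter
    ]
    # When matches overlap, keep the longest span.
    return [(s.start_char, s.end_char, s.label_) for s in filter_spans(spans)]


print(silver_label("Foment del Treball pide cambios, segun la mesa del turismo."))
# [(0, 18, 'LOBBY')]
```

The lowercase mention "mesa del turismo" is matched by the PhraseMatcher but discarded by the filter, so only the capitalised mention is labelled.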

Outlet distribution (sentences with LOBBY match):

Outlet	Sentences
El Español	62,091
El Confidencial	37,259
OKDiario	30,831
El País	28,274
El Periódico	28,034
20minutos	21,902
La Vanguardia	21,245
El Diario	14,388
Europa Press	13,603
Público	11,602
El Economista	11,190
Cinco Días	11,006
Libertad Digital	8,280
Huffington Post	7,770
El Mundo	7,558
La Razón	7,324
El Salto	4,683
ABC	3,774
Expansión	3,375

Gazetteer Sources

The entity list was compiled from four official Spanish transparency registers:

Register	Entries	Access method
Catalan Lobby Register (Decret Llei 1/2017)	5,930	Public API
CNMC Register (Art. 37, Ley 3/2013)	562	Web scraping
Community of Madrid (Ley 10/2019)	1,833	Web scraping
Valencian Community REGIA (Ley 25/2018)	1,361	Direct CSV download

Entries were manually reviewed to remove acronym-only entries, corporate legal suffixes (S.A., S.L., S.L.U.) without accompanying organisation names, and ambiguous single-token terms that generate systematic false positives in news text.
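
The review itself was manual, but a pre-screening helper can surface candidates for it. The sketch below is hypothetical and not part of the released code: the function name review_flag, the regular expression, and the rule of flagging every single-token entry are assumptions chosen to mirror the three criteria above.

```python
import re

# Legal-form suffixes named in the text (S.A., S.L., S.L.U.), with or without dots.
LEGAL_SUFFIX = re.compile(r"^(S\.?\s?A\.?|S\.?\s?L\.?(\s?U\.?)?)$", re.IGNORECASE)


def review_flag(entry):
    """Return a reason to inspect the entry by hand, or None if none applies."""
    name = entry.strip()
    tokens = name.split()
    if LEGAL_SUFFIX.match(name):
        return "legal suffix only"
    if len(tokens) == 1 and name.isupper():
        return "acronym only"
    if len(tokens) == 1:
        # Ambiguity cannot be judged automatically, so every single-token
        # entry is flagged and the decision is left to the reviewer.
        return "single token"
    return None


for entry in ["S.L.U.", "CEOE", "Fomento", "Foment del Treball"]:
    print(entry, "->", review_flag(entry))
```

A helper like this only narrows down what a reviewer has to look at; whether a flagged entry is removed remains a human decision.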

Training Procedure

Base model: es_core_news_lg (spaCy 3.8.7)
New label: LOBBY added via add_label()
Initialisation: resume_training(), which continues from the existing weights to limit catastrophic forgetting
Disabled pipes: all non-NER components (morphologiser, parser, sentence recogniser, attribute ruler, lemmatiser)
Optimiser: Adam, learning rate 0.001, dropout 0.3, batch size 32
Iterations: 15
Hardware: NVIDIA GeForce RTX 4070 Ti SUPER (16.7 GB VRAM)
Train/test split: 80/20, seed 42 → 229,790 train / 57,448 test
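
The procedure above can be sketched with spaCy's public training API. This is an illustration under assumptions, not the original training script: the function names, the (text, {"entities": [...]}) data format, and the use of int(0.8 * n) as the cut point are choices made for the example. The sketch also keeps the optimiser returned by resume_training() at its defaults instead of setting the learning rate explicitly.

```python
import random

import spacy
from spacy.training import Example
from spacy.util import minibatch


def split_corpus(samples, train_frac=0.8, seed=42):
    """Shuffle with a fixed seed and cut at int(train_frac * n)."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(train_frac * len(samples))
    return samples[:cut], samples[cut:]


def fine_tune_lobby(nlp, train_data, n_iter=15, batch_size=32, dropout=0.3):
    """Add LOBBY to an existing NER component and continue training it."""
    data = list(train_data)  # copy so the caller's list is not reordered
    nlp.get_pipe("ner").add_label("LOBBY")
    # Keeps the current weights and returns a fresh default optimiser.
    optimizer = nlp.resume_training()
    # Only the NER component is updated; every other pipe is disabled.
    with nlp.select_pipes(enable="ner"):
        for _ in range(n_iter):
            random.shuffle(data)
            losses = {}
            for batch in minibatch(data, size=batch_size):
                examples = [
                    Example.from_dict(nlp.make_doc(text), annotations)
                    for text, annotations in batch
                ]
                nlp.update(examples, sgd=optimizer, drop=dropout, losses=losses)
    return nlp
```

With the base model installed, a call would look like fine_tune_lobby(spacy.load("es_core_news_lg"), train_data). Applied to 287,238 sentences, split_corpus yields the 229,790 / 57,448 sizes reported above.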

Evaluation

Evaluated on the held-out 20% test split (57,448 sentences, seed 42), scored against the gazetteer-derived silver labels:

Metric	Value
Precision	0.9964
Recall	0.9968
F₁	0.9966
True Positives (spans)	66,560
False Positives (spans)	238
False Negatives (spans)	216

TP/FP/FN refer to entity span counts; multiple spans may occur per sentence.
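
The three scores follow directly from the span counts in the table. The short check below recomputes them; the function name prf is only illustrative.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from span-level counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = prf(tp=66_560, fp=238, fn=216)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
# P=0.9964 R=0.9968 F1=0.9966
```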

Intended Use

This model is intended for large-scale, longitudinal measurement of the media visibility of registered interest groups in Spanish news. It identifies entity spans whose surface form matches organisations present in the four transparency registers. It is not a detector of lobbying activity or influence attempts — the pragmatic role of a mention in context is outside the model's scope.

Suitable research applications include:

Tracking media salience of specific interest groups over time
Cross-outlet comparisons of lobbying organisation coverage
Combining lobby detection with topic classification for issue-specific analyses
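
As an illustration of the first two applications, detected mentions can be aggregated by organisation, outlet and month. The records below are invented for the example, and the (outlet, date, entity) layout is an assumption about how a user might store the model's output.

```python
from collections import Counter

# Hypothetical records: (outlet, ISO date, detected LOBBY entity text).
mentions = [
    ("El País", "2023-05-14", "CEOE"),
    ("El País", "2023-05-20", "CEOE"),
    ("El Mundo", "2023-05-02", "CEPYME"),
    ("El País", "2023-06-01", "CEOE"),
]


def monthly_salience(records):
    """Count mentions per (entity, outlet, YYYY-MM)."""
    return Counter((entity, outlet, date[:7]) for outlet, date, entity in records)


counts = monthly_salience(mentions)
for key, n in sorted(counts.items()):
    print(key, n)
```

Summing the same counter over outlets gives a per-organisation time series, and summing over months gives a cross-outlet comparison.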

Limitations

Coverage is bounded by the four Spanish transparency registers used to construct the gazetteer. Organisations that do not appear in these registers (for example large multinationals or EU-level associations) are outside the model's reliable detection capacity.
Evaluation is on silver-labelled data derived from the same gazetteer; no manually annotated gold set is available. The reported scores therefore measure agreement with the gazetteer-based labels, not accuracy against human annotation.
The model requires text pre-processing to remove diacritics before inference (see usage example below).

Usage

import spacy
import unicodedata
from huggingface_hub import snapshot_download

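def quitar_tildes(text):
    """Strip diacritics ("quitar tildes" means "remove accents"). Note that ñ becomes n."""
    return "".join(
        c for c in unicodedata.normalize("NFD", str(text))
        if unicodedata.category(c) != "Mn"  # drop combining marks
    )

# Download the model repository and load it as a spaCy pipeline.
model_path = snapshot_download("lobby-ner-spain/lobby-ner-spain-final")
nlp = spacy.load(model_path)

text = "Telefónica ha pedido al Gobierno que revise la regulación del 5G. La CEOE y CEPYME han convocado una reunión con el Ministerio de Economía."
# Diacritics must be removed before the text is passed to the model.
doc = nlp(quitar_tildes(text))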

for ent in doc.ents:
    if ent.label_ == "LOBBY":
        print(f"{ent.text} [{ent.label_}]")

# Output:
# Telefonica [LOBBY]
# CEOE [LOBBY]
# CEPYME [LOBBY]

Interactive Demo

An interactive demo is available at https://huggingface.co/spaces/lobby-ner-spain/lobby-ner-spain-demo.

Citation

If you use this model, please cite the accompanying paper (forthcoming):

@article{lobbynerspain2026,
  title   = {Who Gets into the News? A Gazetteer-Based {NER} System for Tracking Lobbying Organisations in {Spanish} Media},
  journal = {Computational Communication Research},
  year    = {2026},
  note    = {Under review}
}

License

MIT
