Lobby NER Spain — Spanish News
A spaCy NER model for detecting mentions of officially registered lobbying organisations in Spanish news text. Fine-tuned from es_core_news_lg (spaCy 3.8.7) with a domain-specific LOBBY entity label grounded in four official Spanish transparency registers.
Model Description
This model extends spaCy's large Spanish pipeline with a LOBBY entity label that identifies organisations formally registered as interest group actors in one of four Spanish regional transparency registers. Unlike the base model's ORG label — which captures any organisation regardless of its role in the political process — LOBBY is restricted to entities that have formally declared their lobbying or representational activities to a Spanish regulatory body.
The model preserves all original entity labels (PER, ORG, LOC, MISC) alongside the new LOBBY label through class-incremental fine-tuning using resume_training(), which avoids catastrophic forgetting by continuing from existing model weights.
Training Data
The model was trained on a silver-labelled corpus of 287,238 sentences drawn from 19 Spanish news outlets covering the period May 2023 to January 2025. Silver labels were generated by applying a gazetteer of 8,419 registered organisations to the sentence corpus using spaCy's PhraseMatcher, with a capitalisation filter that discards matches whose first character is lowercase.
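The effect of the capitalisation filter can be illustrated without spaCy. The sketch below is a plain-Python stand-in, not the authors' pipeline: it swaps PhraseMatcher for a case-insensitive regular expression, uses a two-entry hypothetical gazetteer, and introduces a hypothetical helper `silver_spans`. It also assumes that matching was case-insensitive, which the description above does not state.

```python
import re

# Hypothetical mini-gazetteer; the real one holds 8,419 registered organisations.
GAZETTEER = ["CEOE", "Foment del Treball"]

# Case-insensitive alternation, longest names first so longer names take priority.
_names = sorted(GAZETTEER, key=len, reverse=True)
_pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in _names) + r")\b",
    flags=re.IGNORECASE,
)


def silver_spans(sentence):
    """Return (start, end, text) for matches that do not start with a lowercase letter."""
    spans = []
    for match in _pattern.finditer(sentence):
        if match.group(0)[0].islower():
            continue  # capitalisation filter: discard lowercase-initial matches
        spans.append((match.start(), match.end(), match.group(0)))
    return spans


sentence = "La CEOE pidio cambios, pero la ceoe no respondio."
print(silver_spans(sentence))  # [(3, 7, 'CEOE')]
```

The lowercase "ceoe" is matched by the pattern but dropped by the filter, which is the behaviour the filter is meant to provide for common-noun collisions.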
Outlet distribution (sentences with LOBBY match):
| Outlet | Sentences |
|---|---|
| El Español | 62,091 |
| El Confidencial | 37,259 |
| OKDiario | 30,831 |
| El País | 28,274 |
| El Periódico | 28,034 |
| 20minutos | 21,902 |
| La Vanguardia | 21,245 |
| El Diario | 14,388 |
| Europa Press | 13,603 |
| Público | 11,602 |
| El Economista | 11,190 |
| Cinco Días | 11,006 |
| Libertad Digital | 8,280 |
| Huffington Post | 7,770 |
| El Mundo | 7,558 |
| La Razón | 7,324 |
| El Salto | 4,683 |
| ABC | 3,774 |
| Expansión | 3,375 |
Gazetteer Sources
The entity list was compiled from four official Spanish transparency registers:
| Register | Entries | Access method |
|---|---|---|
| Catalan Lobby Register (Decret Llei 1/2017) | 5,930 | Public API |
| CNMC Register (Art. 37, Ley 3/2013) | 562 | Web scraping |
| Community of Madrid (Ley 10/2019) | 1,833 | Web scraping |
| Valencian Community REGIA (Ley 25/2018) | 1,361 | Direct CSV download |
Entries were manually reviewed to remove acronym-only entries, corporate legal suffixes (S.A., S.L., S.L.U.) without accompanying organisation names, and ambiguous single-token terms that generate systematic false positives in news text.
Training Procedure
- Base model: es_core_news_lg (spaCy 3.8.7)
- New label: LOBBY, added via add_label()
- Initialisation: resume_training(), which continues from existing weights to prevent catastrophic forgetting
- Disabled pipes: all non-NER components (morphologiser, parser, sentence recogniser, attribute ruler, lemmatiser)
- Optimiser: Adam, learning rate 0.001, dropout 0.3, batch size 32
- Iterations: 15
- Hardware: NVIDIA GeForce RTX 4070 Ti SUPER (16.7 GB VRAM)
- Train/test split: 80/20, seed 42 → 229,790 train / 57,448 test
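The split sizes in the last item follow directly from the corpus size. A minimal sketch, assuming a seeded shuffle followed by a single cut (the card does not say which splitting routine was used, so `train_test_split` here is a hypothetical helper and will not reproduce the exact membership of the original split):

```python
import random

N_SENTENCES = 287_238  # corpus size reported under Training Data
SEED = 42
TRAIN_FRACTION = 0.8


def train_test_split(items, train_fraction=TRAIN_FRACTION, seed=SEED):
    """Shuffle a copy of items with a fixed seed, then cut at the train fraction."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Integer ids stand in for the sentences themselves.
train, test = train_test_split(range(N_SENTENCES))
print(len(train), len(test))  # 229790 57448
```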
Evaluation
Evaluated on the held-out 20% test split (57,448 sentences, seed 42):
| Metric | Value |
|---|---|
| Precision | 0.9964 |
| Recall | 0.9968 |
| F₁ | 0.9966 |
| True Positives (spans) | 66,560 |
| False Positives (spans) | 238 |
| False Negatives (spans) | 216 |
TP/FP/FN refer to entity span counts; multiple spans may occur per sentence.
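The headline metrics can be recomputed from the span counts in the table. A short sketch using the standard definitions (the function name is illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Span-level precision, recall and F1 from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation table above.
p, r, f1 = precision_recall_f1(tp=66_560, fp=238, fn=216)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # 0.9964 0.9968 0.9966
```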
Intended Use
This model is intended for large-scale, longitudinal measurement of the media visibility of registered interest groups in Spanish news. It identifies entity spans whose surface form matches organisations present in the four transparency registers. It is not a detector of lobbying activity or influence attempts — the pragmatic role of a mention in context is outside the model's scope.
Suitable research applications include:
- Tracking media salience of specific interest groups over time
- Cross-outlet comparisons of lobbying organisation coverage
- Combining lobby detection with topic classification for issue-specific analyses
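As a rough illustration of the first two applications, detected mentions can be aggregated by month or by outlet. The records below are hypothetical and only show the shape of the computation; in practice they would come from running the model over dated articles.

```python
from collections import Counter

# Hypothetical records: (publication month, outlet, LOBBY entity text).
mentions = [
    ("2024-01", "El País", "CEOE"),
    ("2024-01", "ABC", "CEOE"),
    ("2024-01", "El País", "CEPYME"),
    ("2024-02", "El País", "CEOE"),
]

# Salience over time: mentions per (month, entity).
monthly = Counter((month, entity) for month, _outlet, entity in mentions)

# Cross-outlet comparison: mentions per (outlet, entity).
by_outlet = Counter((outlet, entity) for _month, outlet, entity in mentions)

for (month, entity), count in sorted(monthly.items()):
    print(month, entity, count)
```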
Limitations
- Coverage is bounded by the four Spanish transparency registers used to construct the gazetteer. Organisations not registered at the regional level (large multinationals, EU-level associations, etc.) are outside the model's reliable detection capacity.
- Evaluation is on silver-labelled data derived from the same gazetteer; no manually annotated gold set is available.
- The model requires text pre-processing to remove diacritics before inference (see usage example below).
Usage
import spacy
import unicodedata
from huggingface_hub import snapshot_download


def quitar_tildes(text):
    # Strip diacritics: decompose to NFD, then drop combining marks (category "Mn").
    return "".join(
        c for c in unicodedata.normalize("NFD", str(text))
        if unicodedata.category(c) != "Mn"
    )


# Download the model files from the Hub and load them as a spaCy pipeline.
model_path = snapshot_download("lobby-ner-spain/lobby-ner-spain-final")
nlp = spacy.load(model_path)

text = "Telefónica ha pedido al Gobierno que revise la regulación del 5G. La CEOE y CEPYME han convocado una reunión con el Ministerio de Economía."
doc = nlp(quitar_tildes(text))

# Print only the LOBBY entities.
for ent in doc.ents:
    if ent.label_ == "LOBBY":
        print(f"{ent.text} [{ent.label_}]")

# Output:
# Telefonica [LOBBY]
# CEOE [LOBBY]
# CEPYME [LOBBY]
Interactive Demo
An interactive demo is available at https://huggingface.co/spaces/lobby-ner-spain/lobby-ner-spain-demo.
Citation
If you use this model, please cite the accompanying paper (forthcoming):
@article{lobbynerspain2026,
  title = {Who Gets into the News? A Gazetteer-Based {NER} System for Tracking Lobbying Organisations in {Spanish} Media},
  journal = {Computational Communication Research},
  year = {2026},
  note = {Under review}
}
License
MIT