# IdioleX-ES — Style-Aware Spanish Sentence Embeddings
IdioleX-ES is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning — capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 17 Spanish varieties, decoupled from semantic content.
Kantharuban et al., *IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation* (preprint, under review). Code: github.com/AnjaliRuban/IdioleX
## Architecture
```
input_ids → BERTIN (RoBERTa-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```
| Component | Detail |
|---|---|
| Base encoder | bertin-project/bertin-roberta-base-spanish |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding + 12 transformer layers), then mean pool |
| Centering | Running-mean subtraction estimated over the Spanish training corpus |
| Output | L2-normalized vector, 768-dimensional |
The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
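The pooling head described above can be sketched as follows. This is a minimal PyTorch sketch, not the repo's `layer_pool.py`; the class name, `finalize` helper, and exact parameterization are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

class LayerwiseAttentionPool(torch.nn.Module):
    """Scalar-mix over all hidden states, then mean pool over tokens."""

    def __init__(self, num_layers: int = 13):  # embedding + 12 transformer layers
        super().__init__()
        self.layer_logits = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of [batch, seq, hidden], one tensor per layer
        w = F.softmax(self.layer_logits, dim=0)             # [num_layers]
        stacked = torch.stack(hidden_states, dim=0)         # [L, B, T, H]
        mixed = (w[:, None, None, None] * stacked).sum(0)   # [B, T, H]
        mask = attention_mask.unsqueeze(-1).float()         # [B, T, 1]
        # Mean pool over non-padding tokens only
        return (mixed * mask).sum(1) / mask.sum(1).clamp(min=1)

def finalize(pooled, running_mean):
    # Subtract the corpus running mean, then L2-normalize
    return F.normalize(pooled - running_mean, p=2, dim=-1)
```

With the learned logits initialized to zero, the mix starts as a uniform average over layers and the softmax keeps the weights on the simplex throughout training.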
## Training

### Data
Training data consists of Reddit comments from 17 regional Spanish-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~557k authors and ~49.5M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.
| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Argentinian | r/argentina | Mexican | r/mexico |
| Bolivian | r/BOLIVIA | Panamanian | r/Panama |
| Chilean | r/chile | Paraguayan | r/Paraguay |
| Colombian | r/Colombia | Peruvian | r/PERU |
| Cuban | r/cuba | Peninsular | r/spain |
| Dominican | r/Dominican | Uruguayan | r/uruguay |
| Ecuadorian | r/ecuador | Venezuelan | r/vzla |
| Salvadoran | r/ElSalvador | | |
| Guatemalan | r/guatemala | | |
| Honduran | r/Honduras | | |
### Linguistic Features
41 binary dialectal features are extracted sentence-by-sentence using GPT-5-mini,
covering: subject pronoun use (yo, tú, usted, vos, vosotros, ustedes, etc.),
verbal morphology (voseo present/imperative suffixes, 2pl forms -áis/-éis/-ís),
diminutive suffixes (-ito/a, -ico/a, -illo/a, -ino/a, -ete, etc.),
clitic patterns (DOM a, accusative/dative doubling, preverbal clitics, clitic sequences),
and orthographic markers (inverted punctuation, all-caps, repeated punctuation).
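To illustrate the binary feature format only (the actual annotations come from an LLM, not from rules), here is a toy sketch with hypothetical regex rules for four of the features:

```python
import re
import numpy as np

# Hypothetical rules for illustration; the real pipeline is LLM-annotated.
FEATURE_RULES = {
    "pron_vos": r"\bvos\b",            # voseo subject pronoun
    "pron_vosotros": r"\bvosotros\b",  # peninsular 2pl pronoun
    "dim_ico": r"\w+ic[oa]s?\b",       # -ico/-ica diminutive
    "inverted_punct": r"[¿¡]",         # inverted punctuation marks
}

def annotate(sentence: str) -> np.ndarray:
    """Return a binary feature vector (4-dim here; the model uses 41)."""
    return np.array(
        [int(bool(re.search(pat, sentence.lower())))
         for pat in FEATURE_RULES.values()],
        dtype=np.float32,
    )

print(annotate("¿Vos querés un cafecito?"))  # -> [1. 0. 0. 1.]
```

Each sentence thus maps to a fixed-length 0/1 vector, which is what the feature-prediction and feature-similarity losses below consume.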
### Objectives
Training proceeds in two stages:
**Stage 1 — Ranking pre-training (full dataset):** A margin ranking loss encourages sentences with higher hierarchical proximity to be closer in embedding space. Each batch of 16 is structured so every sentence has exactly one same-comment neighbor (r=3), two same-author neighbors (r=2), four same-dialect neighbors (r=1), and eight cross-dialect neighbors (r=0). The margin λ warms up linearly from 0 to 0.5 over 25k steps.
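For one (higher-proximity, lower-proximity) neighbor pair, the ranking objective can be sketched as below. The exact formulation is an assumption; `margin_ranking_loss` is an illustrative name:

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(anchor, pos, neg, margin: float):
    """For a pair where `pos` has higher hierarchical proximity to `anchor`
    than `neg` (e.g. same-author vs. cross-dialect), penalize orderings
    where `pos` is not at least `margin` more similar than `neg`."""
    sim_pos = F.cosine_similarity(anchor, pos, dim=-1)
    sim_neg = F.cosine_similarity(anchor, neg, dim=-1)
    return F.relu(margin - (sim_pos - sim_neg)).mean()

# One anchor, its same-comment neighbor (r=3) vs. a cross-dialect one (r=0):
a, p, n = torch.randn(3, 768).unbind(0)
loss = margin_ranking_loss(a[None], p[None], n[None], margin=0.5)
```

In a full batch, this loss would be summed over all pairs whose proximity ranks differ, with the margin warmed up from 0 to 0.5 as described above.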
**Stage 2 — Feature-aware training (annotated subset, α = 0.5):**
| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking loss | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict 41 linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |
A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
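A sketch of the Stage-2 loss components under the stated weights. These are assumed formulations: function names are illustrative, the Jaccard top-k selection is omitted, and the VICReg constants are guesses consistent with the description above:

```python
import torch
import torch.nn.functional as F

def feature_bce(pred_logits, feats):
    # Predict the 41 binary features from the embedding (probe head).
    return F.binary_cross_entropy_with_logits(pred_logits, feats)

def jaccard_weighted_supcon(emb, feats, tau=0.07):
    """Contrastive loss where positives are weighted by Jaccard similarity
    of their binary feature vectors. `emb` is assumed L2-normalized."""
    sim = emb @ emb.T / tau                                   # cosine / tau
    inter = feats @ feats.T
    union = feats.sum(1, keepdim=True) + feats.sum(1) - inter
    jac = inter / union.clamp(min=1e-8)                       # [B, B] weights
    eye = torch.eye(len(emb), dtype=torch.bool)
    logits = sim.masked_fill(eye, float("-inf"))              # drop self-pairs
    log_p = F.log_softmax(logits, dim=1).masked_fill(eye, 0.0)
    w = jac.masked_fill(eye, 0.0)
    return -(w * log_p).sum(1).div(w.sum(1).clamp(min=1e-8)).mean()

def vicreg_reg(emb):
    """Variance >= 1 per dimension plus off-diagonal covariance penalty."""
    emb = emb - emb.mean(0)
    std = emb.var(0).add(1e-4).sqrt()
    var_loss = F.relu(1.0 - std).mean()                       # hinge at std = 1
    cov = (emb.T @ emb) / (len(emb) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / emb.shape[1]
    return var_loss + cov_loss

# Stage-2 total (α = 0.5):
#   (1 - α) · ranking + 0.25·α · feature_bce + α · supcon + 0.25 · vicreg_reg
```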
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | bertin-project/bertin-roberta-base-spanish |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 41 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |
## Performance

### Dialect Identification — DSL-ML 2024 (multi-label)
| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-ES | 0.85 | 0.62 |
| Finetuned IdioleX-ES | 0.84 | 0.63 |
| Finetuned IdioleX-ES + Lexical | 0.85 | 0.63 |
| Finetuned BERT (baseline) | 0.80 | 0.59 |
| Centroid Clustering w/ BERT | 0.77 | 0.57 |
| Saleva & Palen-Michel, 2024 (shared task winner) | 0.82 | 0.50 |
### Authorship Attribution — PAN 2019 (open-set, cross-domain)
| Model | Accuracy |
|---|---|
| IdioleX-ES | 31% |
| Finetuned IdioleX-ES | 36% |
| Finetuned IdioleX-ES + Lexical | 38% |
| Finetuned BERT (baseline) | 28% |
| Centroid Clustering w/ BERT | 16% |
| Centroid Clustering w/ E5 | 27% |
PAN 2019 is a cross-domain benchmark (test domain not seen during training), so these results demonstrate transfer of stylistic features beyond topical content.
## Semantic Decoupling
Pearson correlation between IdioleX-ES idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs: ρ = 0.09. Stylistic and semantic similarity are largely independent.
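The decoupling check reduces to a Pearson correlation over paired similarity scores. A self-contained sketch with stand-in scores (the arrays here are random placeholders, not actual IdioleX-ES or Multilingual-E5 outputs):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length score arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
style_sims = rng.uniform(-1, 1, 1000)     # stand-in idiolectal similarities
semantic_sims = rng.uniform(-1, 1, 1000)  # stand-in semantic similarities
r = pearson(style_sims, semantic_sims)    # near zero for independent scores
```

A correlation near zero, as reported above, indicates that knowing two sentences are stylistically close says almost nothing about whether they mean the same thing.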
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "your-username/idiolex-bertin-spanish",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/idiolex-bertin-spanish")

sentences = [
    "Che, ¿me tirás una mano con esto?",  # Argentinian
    "Oye, ¿me echas una mano con esto?",  # Peninsular
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)  # [2, 768], L2-normalized

# Cosine similarity (embeddings are L2-normalized, so dot product = cosine sim)
similarity = embeddings @ embeddings.T
print(similarity)
```
## Config
| Parameter | Value |
|---|---|
| `base_model` | `bertin-project/bertin-roberta-base-spanish` |
| `embedding_dim` | 768 |
| `layerwise_pooling` | `True` |
| `num_layers` | 13 (12 transformer layers + embedding layer) |
| `layer_norm` | `False` |
| `layer_dropout` | `None` |
| `mean_center` | `True` |
| `model_len` | 512 |
## Custom files
This model uses `trust_remote_code=True`. The following files are hosted in this repo:

| File | Purpose |
|---|---|
| `configuration_idiolex.py` | `IdioleXConfig` |
| `modeling_idiolex.py` | `IdioleXModel` |
| `centering.py` | `MeanCenterer` — distributed running-mean buffer |
| `layer_pool.py` | `LayerwiseAttention` — scalar-mix of transformer layers |
| `pooling_utils.py` | `last_token_pool`, `average_pool` |
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
            and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```