# IdioleX-ES — Style-Aware Spanish Sentence Embeddings
IdioleX-ES is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning — capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 17 Spanish varieties, decoupled from semantic content.
Kantharuban et al., *IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation* (preprint, under review). Code: github.com/AnjaliRuban/IdioleX
## Architecture
```
input_ids → BERTIN (RoBERTa-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```
| Component | Detail |
|---|---|
| Base encoder | bertin-project/bertin-roberta-base-spanish |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding + 12 transformer layers), then mean pool |
| Centering | Running-mean subtraction estimated over the Spanish training corpus |
| Output | L2-normalized vector, 768-dimensional |
The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
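The pooling head described above can be sketched as follows. This is a minimal PyTorch sketch, not the repo's `layer_pool.py`; the class name, `finalize` helper, and exact parameterization are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

class LayerwiseAttentionPool(torch.nn.Module):
    """Scalar-mix over all hidden states, then mean pool over tokens."""

    def __init__(self, num_layers: int = 13):  # embedding + 12 transformer layers
        super().__init__()
        self.layer_logits = torch.nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of [batch, seq, hidden], one tensor per layer
        w = F.softmax(self.layer_logits, dim=0)             # [num_layers]
        stacked = torch.stack(hidden_states, dim=0)         # [L, B, T, H]
        mixed = (w[:, None, None, None] * stacked).sum(0)   # [B, T, H]
        mask = attention_mask.unsqueeze(-1).float()         # [B, T, 1]
        # Mean pool over non-padding tokens only
        return (mixed * mask).sum(1) / mask.sum(1).clamp(min=1)

def finalize(pooled, running_mean):
    # Subtract the corpus running mean, then L2-normalize
    return F.normalize(pooled - running_mean, p=2, dim=-1)
```

With the learned logits initialized to zero, the mix starts as a uniform average over layers and the softmax keeps the weights on the simplex throughout training.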
## Training

### Data
Training data consists of Reddit comments from 17 regional Spanish-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~557k authors and ~49.5M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.
| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Argentinian | r/argentina | Mexican | r/mexico |
| Bolivian | r/BOLIVIA | Panamanian | r/Panama |
| Chilean | r/chile | Paraguayan | r/Paraguay |
| Colombian | r/Colombia | Peruvian | r/PERU |
| Cuban | r/cuba | Peninsular | r/spain |
| Dominican | r/Dominican | Uruguayan | r/uruguay |
| Ecuadorian | r/ecuador | Venezuelan | r/vzla |
| Salvadoran | r/ElSalvador | | |
| Guatemalan | r/guatemala | | |
| Honduran | r/Honduras | | |
### Linguistic Features
41 binary dialectal features are extracted sentence-by-sentence using GPT-5-mini,
covering: subject pronoun use (yo, tú, usted, vos, vosotros, ustedes, etc.),
verbal morphology (voseo present/imperative suffixes, 2pl forms -áis/-éis/-ís),
diminutive suffixes (-ito/a, -ico/a, -illo/a, -ino/a, -ete, etc.),
clitic patterns (DOM a, accusative/dative doubling, preverbal clitics, clitic sequences),
and orthographic markers (inverted punctuation, all-caps, repeated punctuation).
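To illustrate the binary feature format only (the actual annotations come from an LLM, not from rules), here is a toy sketch with hypothetical regex rules for four of the features:

```python
import re
import numpy as np

# Hypothetical rules for illustration; the real pipeline is LLM-annotated.
FEATURE_RULES = {
    "pron_vos": r"\bvos\b",            # voseo subject pronoun
    "pron_vosotros": r"\bvosotros\b",  # peninsular 2pl pronoun
    "dim_ico": r"\w+ic[oa]s?\b",       # -ico/-ica diminutive
    "inverted_punct": r"[¿¡]",         # inverted punctuation marks
}

def annotate(sentence: str) -> np.ndarray:
    """Return a binary feature vector (4-dim here; the model uses 41)."""
    return np.array(
        [int(bool(re.search(pat, sentence.lower())))
         for pat in FEATURE_RULES.values()],
        dtype=np.float32,
    )

print(annotate("¿Vos querés un cafecito?"))  # -> [1. 0. 0. 1.]
```

Each sentence thus maps to a fixed-length 0/1 vector, which is what the feature-prediction and feature-similarity losses below consume.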
### Objectives
Training proceeds in two stages:
**Stage 1 — Ranking pre-training (full dataset):** A margin ranking loss encourages sentences with higher hierarchical proximity to be closer in embedding space. Each batch of 16 is structured so every sentence has exactly one same-comment neighbor (r=3), two same-author neighbors (r=2), four same-dialect neighbors (r=1), and eight cross-dialect neighbors (r=0). The margin λ warms up linearly from 0 to 0.5 over 25k steps.
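For one (higher-proximity, lower-proximity) neighbor pair, the ranking objective can be sketched as below. The exact formulation is an assumption; `margin_ranking_loss` is an illustrative name:

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(anchor, pos, neg, margin: float):
    """For a pair where `pos` has higher hierarchical proximity to `anchor`
    than `neg` (e.g. same-author vs. cross-dialect), penalize orderings
    where `pos` is not at least `margin` more similar than `neg`."""
    sim_pos = F.cosine_similarity(anchor, pos, dim=-1)
    sim_neg = F.cosine_similarity(anchor, neg, dim=-1)
    return F.relu(margin - (sim_pos - sim_neg)).mean()

# One anchor, its same-comment neighbor (r=3) vs. a cross-dialect one (r=0):
a, p, n = torch.randn(3, 768).unbind(0)
loss = margin_ranking_loss(a[None], p[None], n[None], margin=0.5)
```

In a full batch, this loss would be summed over all pairs whose proximity ranks differ, with the margin warmed up from 0 to 0.5 as described above.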
**Stage 2 — Feature-aware training (annotated subset, α = 0.5):**
| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking loss | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict 41 linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |
A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
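A sketch of the Stage-2 loss components under the stated weights. These are assumed formulations: function names are illustrative, the Jaccard top-k selection is omitted, and the VICReg constants are guesses consistent with the description above:

```python
import torch
import torch.nn.functional as F

def feature_bce(pred_logits, feats):
    # Predict the 41 binary features from the embedding (probe head).
    return F.binary_cross_entropy_with_logits(pred_logits, feats)

def jaccard_weighted_supcon(emb, feats, tau=0.07):
    """Contrastive loss where positives are weighted by Jaccard similarity
    of their binary feature vectors. `emb` is assumed L2-normalized."""
    sim = emb @ emb.T / tau                                   # cosine / tau
    inter = feats @ feats.T
    union = feats.sum(1, keepdim=True) + feats.sum(1) - inter
    jac = inter / union.clamp(min=1e-8)                       # [B, B] weights
    eye = torch.eye(len(emb), dtype=torch.bool)
    logits = sim.masked_fill(eye, float("-inf"))              # drop self-pairs
    log_p = F.log_softmax(logits, dim=1).masked_fill(eye, 0.0)
    w = jac.masked_fill(eye, 0.0)
    return -(w * log_p).sum(1).div(w.sum(1).clamp(min=1e-8)).mean()

def vicreg_reg(emb):
    """Variance >= 1 per dimension plus off-diagonal covariance penalty."""
    emb = emb - emb.mean(0)
    std = emb.var(0).add(1e-4).sqrt()
    var_loss = F.relu(1.0 - std).mean()                       # hinge at std = 1
    cov = (emb.T @ emb) / (len(emb) - 1)
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = off_diag.pow(2).sum() / emb.shape[1]
    return var_loss + cov_loss

# Stage-2 total (α = 0.5):
#   (1 - α) · ranking + 0.25·α · feature_bce + α · supcon + 0.25 · vicreg_reg
```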
### Hyperparameters
| Parameter | Value |
|---|---|
| Base model | bertin-project/bertin-roberta-base-spanish |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 41 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |
## Performance

### Dialect Identification — DSL-ML 2024 (multi-label)
| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-ES | 0.85 | 0.62 |
| Finetuned IdioleX-ES | 0.84 | 0.63 |
| Finetuned IdioleX-ES + Lexical | 0.85 | 0.63 |
| Finetuned BERT (baseline) | 0.80 | 0.59 |
| Centroid Clustering w/ BERT | 0.77 | 0.57 |
| Saleva & Palen-Michel, 2024 (shared task winner) | 0.82 | 0.50 |
### Authorship Attribution — PAN 2019 (open-set, cross-domain)
| Model | Accuracy |
|---|---|
| IdioleX-ES | 31% |
| Finetuned IdioleX-ES | 36% |
| Finetuned IdioleX-ES + Lexical | 38% |
| Finetuned BERT (baseline) | 28% |
| Centroid Clustering w/ BERT | 16% |
| Centroid Clustering w/ E5 | 27% |
PAN 2019 is a cross-domain benchmark (test domain not seen during training), so these results demonstrate transfer of stylistic features beyond topical content.
## Semantic Decoupling
Pearson correlation between IdioleX-ES idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs: ρ = 0.09. Stylistic and semantic similarity are largely independent.
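The decoupling check reduces to a Pearson correlation over paired similarity scores. A self-contained sketch with stand-in scores (the arrays here are random placeholders, not actual IdioleX-ES or Multilingual-E5 outputs):

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation of two equal-length score arrays."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))

rng = np.random.default_rng(0)
style_sims = rng.uniform(-1, 1, 1000)     # stand-in idiolectal similarities
semantic_sims = rng.uniform(-1, 1, 1000)  # stand-in semantic similarities
r = pearson(style_sims, semantic_sims)    # near zero for independent scores
```

A correlation near zero, as reported above, indicates that knowing two sentences are stylistically close says almost nothing about whether they mean the same thing.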
## Usage
```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "your-username/idiolex-bertin-spanish",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/idiolex-bertin-spanish")

sentences = [
    "Che, ¿me tirás una mano con esto?",  # Argentinian
    "Oye, ¿me echas una mano con esto?",  # Peninsular
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)  # [2, 768], L2-normalized

# Cosine similarity (embeddings are L2-normalized, so dot product = cosine sim)
similarity = embeddings @ embeddings.T
print(similarity)
```
## Config
| Parameter | Value |
|---|---|
| `base_model` | `bertin-project/bertin-roberta-base-spanish` |
| `embedding_dim` | 768 |
| `layerwise_pooling` | `True` |
| `num_layers` | 13 (12 transformer layers + embedding layer) |
| `layer_norm` | `False` |
| `layer_dropout` | `None` |
| `mean_center` | `True` |
| `model_len` | 512 |
## Custom files
This model uses `trust_remote_code=True`. The following files are hosted in this repo:

| File | Purpose |
|---|---|
| `configuration_idiolex.py` | `IdioleXConfig` |
| `modeling_idiolex.py` | `IdioleXModel` |
| `centering.py` | `MeanCenterer` — distributed running-mean buffer |
| `layer_pool.py` | `LayerwiseAttention` — scalar-mix of transformer layers |
| `pooling_utils.py` | `last_token_pool`, `average_pool` |
## Citation
```bibtex
@article{kantharuban2025idiolex,
  title  = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
            and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year   = {2025},
  note   = {Preprint, under review}
}
```