IdioleX-ES — Style-Aware Spanish Sentence Embeddings

IdioleX-ES is a sentence encoder trained under the IDIOLEX framework for idiolectal representation learning — capturing how text is expressed rather than what it says. Embeddings encode stylistic and dialectal variation across 17 Spanish varieties, decoupled from semantic content.

Kantharuban et al., IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation (preprint, under review). Code: github.com/AnjaliRuban/IdioleX

Architecture

```
input_ids → BERTIN (RoBERTa-base) → layer-wise attention → mean pool
          → mean centering → L2 normalize → embedding
```

| Component | Detail |
|---|---|
| Base encoder | `bertin-project/bertin-roberta-base-spanish` |
| Pooling | Learnable layer-wise attention over all 13 hidden states (embedding layer + 12 transformer layers), followed by mean pooling |
| Centering | Running-mean subtraction, estimated over the Spanish training corpus |
| Output | L2-normalized, 768-dimensional vector |

The scalar-mix weights are learned jointly with the encoder, following Rei et al. (2020).
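The pooling stage can be sketched as follows. This is a minimal illustration of a learnable scalar mix over hidden states plus masked mean pooling, centering, and normalization; the class and argument names are hypothetical, not the repo's actual `LayerwiseAttention` implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerwiseAttentionPool(nn.Module):
    """Learnable scalar mix over all hidden states, then masked mean pooling."""
    def __init__(self, num_layers: int = 13, hidden_size: int = 768):
        super().__init__()
        # one learnable scalar per hidden state (embedding layer + 12 transformer layers)
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states, attention_mask):
        # hidden_states: tuple of [batch, seq, hidden] tensors, one per layer
        stacked = torch.stack(hidden_states, dim=0)           # [L, B, T, H]
        weights = torch.softmax(self.layer_logits, dim=0)     # [L]
        mixed = (weights[:, None, None, None] * stacked).sum(dim=0)  # [B, T, H]
        # masked mean pool over tokens
        mask = attention_mask.unsqueeze(-1).float()           # [B, T, 1]
        return (mixed * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

def finalize(pooled, running_mean):
    """Subtract the running corpus mean, then L2-normalize."""
    return F.normalize(pooled - running_mean, p=2, dim=-1)
```

The softmax over layer logits keeps the mix a convex combination, so the pooled representation stays on the same scale as the individual hidden states.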

Training

Data

Training data consists of Reddit comments from 17 regional Spanish-language subreddits, collected via the Pushshift archive through December 2024 and filtered for language and quality. Pre-training uses ~557k authors and ~49.5M sentences. Feature-supervised training uses 200 authors per dialect with LLM-annotated linguistic features.

| Variety | Subreddit | Variety | Subreddit |
|---|---|---|---|
| Argentinian | r/argentina | Mexican | r/mexico |
| Bolivian | r/BOLIVIA | Panamanian | r/Panama |
| Chilean | r/chile | Paraguayan | r/Paraguay |
| Colombian | r/Colombia | Peruvian | r/PERU |
| Cuban | r/cuba | Peninsular | r/spain |
| Dominican | r/Dominican | Uruguayan | r/uruguay |
| Ecuadorian | r/ecuador | Venezuelan | r/vzla |
| Salvadoran | r/ElSalvador | | |
| Guatemalan | r/guatemala | | |
| Honduran | r/Honduras | | |

Linguistic Features

41 binary dialectal features are extracted sentence by sentence using GPT-5-mini, covering: subject pronoun use (yo, tú, usted, vos, vosotros, ustedes, etc.), verbal morphology (voseo present/imperative suffixes, 2pl forms -áis/-éis/-ís), diminutive suffixes (-ito/a, -ico/a, -illo/a, -ino/a, -ete, etc.), clitic patterns (DOM a, accusative/dative doubling, preverbal clitics, clitic sequences), and orthographic markers (inverted punctuation, all caps, repeated punctuation).
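For illustration, the Jaccard similarity between two such binary feature vectors (the quantity used to weight the supervised contrastive loss below) is the intersection over union of their active features. A minimal sketch, with hypothetical feature indices:

```python
def jaccard(a, b):
    """Jaccard similarity between two binary feature vectors:
    shared active features / total active features."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 0.0

# hypothetical 41-dim feature vectors for two sentences
f1 = [0] * 41
f2 = [0] * 41
f1[0] = f1[5] = f1[12] = 1   # e.g. voseo suffix, -ito diminutive, preverbal clitic
f2[0] = f2[5] = f2[30] = 1
print(jaccard(f1, f2))        # 2 shared / 4 active → 0.5
```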

Objectives

Training proceeds in two stages:

Stage 1 — Ranking pre-training (full dataset): A margin ranking loss pulls sentences with higher hierarchical proximity closer together in embedding space. Each ranking group of 16 sentences is structured so that every sentence has exactly one same-comment neighbor (r=3), two same-author neighbors (r=2), four same-dialect neighbors (r=1), and eight cross-dialect neighbors (r=0). The margin λ warms up linearly from 0 to 0.5 over 25k steps.
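A minimal sketch of this objective, assuming cosine distance between embeddings (function names are hypothetical, not the paper's code):

```python
import torch
import torch.nn.functional as F

def hierarchical_margin_loss(anchor, closer, farther, margin):
    """Rank pairs by hierarchical proximity: a higher-rank neighbor
    (e.g. same comment, r=3) should sit closer to the anchor than a
    lower-rank neighbor (e.g. cross-dialect, r=0)."""
    d_closer = 1 - F.cosine_similarity(anchor, closer, dim=-1)
    d_farther = 1 - F.cosine_similarity(anchor, farther, dim=-1)
    return F.relu(margin + d_closer - d_farther).mean()

def margin_at(step, max_margin=0.5, warmup_steps=25_000):
    """Linear margin warmup: 0 → 0.5 over the first 25k steps."""
    return max_margin * min(step / warmup_steps, 1.0)
```

Warming the margin up from zero lets the encoder first collapse obviously related pairs before the loss starts demanding a fixed separation between proximity ranks.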

Stage 2 — Feature-aware training (annotated subset, α=0.5):

| Loss | Weight | Purpose |
|---|---|---|
| Margin ranking loss | 1 − α = 0.5 | Proximity-based ranking |
| Feature prediction BCE | 0.25 × α = 0.125 | Predict 41 linguistic features |
| Supervised contrastive (Jaccard-weighted) | α = 0.5 | Feature-similarity alignment |

A VICReg regularizer (weight 0.25) enforces variance ≥ 1 per dimension and decorrelates embedding dimensions to prevent anisotropy.
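A minimal sketch of such a regularizer, using the standard VICReg formulation: a hinge on the per-dimension standard deviation at 1 (equivalent to the variance ≥ 1 constraint at the threshold) plus an off-diagonal covariance penalty. The `eps` value and the exact normalization are assumptions, not the paper's settings.

```python
import torch

def vicreg_regularizer(z, eps=1e-4):
    """Variance + covariance terms in the style of VICReg: hinge the
    per-dimension std at 1 and penalize off-diagonal covariance so
    embedding dimensions stay spread out and decorrelated."""
    z = z - z.mean(dim=0)
    std = torch.sqrt(z.var(dim=0) + eps)
    var_loss = torch.relu(1.0 - std).mean()   # active only when a dim collapses
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)                 # [D, D] covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d
    return var_loss + cov_loss
```

In the Stage 2 total, this term would be added with weight 0.25 alongside the three weighted losses in the table above.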

Hyperparameters

| Parameter | Value |
|---|---|
| Base model | `bertin-project/bertin-roberta-base-spanish` |
| Hidden size | 768 |
| Transformer layers | 12 |
| Max sequence length | 512 |
| Batch size | 32 |
| Ranking group size | 16 |
| Feature vector dimension | 41 |
| Feature loss weight (α) | 0.5 |
| Contrastive temperature (τ) | 0.07 |
| Jaccard top-k | 5 |
| Learning rate | 1 × 10⁻⁵ |
| LR warmup | 25k steps |
| Optimizer | Adam |
| Margin (λ) | 0 → 0.5 (linear warmup) |
| Training GPUs | 4 |
| Max training time | ≤ 48 hrs |

Performance

Dialect Identification — DSL-ML 2024 (multi-label)

| Model | F1 | Exact Match |
|---|---|---|
| IdioleX-ES | 0.85 | 0.62 |
| Finetuned IdioleX-ES | 0.84 | 0.63 |
| Finetuned IdioleX-ES + Lexical | 0.85 | 0.63 |
| Finetuned BERT (baseline) | 0.80 | 0.59 |
| Centroid Clustering w/ BERT | 0.77 | 0.57 |
| Saleva & Palen-Michel, 2024 (shared task winner) | 0.82 | 0.50 |

Authorship Attribution — PAN 2019 (open-set, cross-domain)

| Model | Accuracy |
|---|---|
| IdioleX-ES | 31% |
| Finetuned IdioleX-ES | 36% |
| Finetuned IdioleX-ES + Lexical | 38% |
| Finetuned BERT (baseline) | 28% |
| Centroid Clustering w/ BERT | 16% |
| Centroid Clustering w/ E5 | 27% |

PAN 2019 is a cross-domain benchmark (test domain not seen during training), so these results demonstrate transfer of stylistic features beyond topical content.

Semantic Decoupling

The Pearson correlation between IdioleX-ES idiolectal similarity scores and Multilingual-E5 semantic similarity scores on withheld Reddit test pairs is ρ = 0.09, indicating that stylistic and semantic similarity are largely independent.
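In outline, the decoupling check scores the same sentence pairs under both encoders and correlates the two similarity vectors. A minimal sketch; `style_emb` and `sem_emb` are hypothetical stand-ins for IdioleX-ES and E5 embeddings of the test sentences.

```python
import torch

def pairwise_cosine(emb, pairs):
    """Cosine similarity for (i, j) index pairs over L2-normalized embeddings."""
    i, j = zip(*pairs)
    return (emb[list(i)] * emb[list(j)]).sum(dim=-1)

def pearson(x, y):
    """Pearson correlation between two score vectors."""
    x = x - x.mean()
    y = y - y.mean()
    return (x * y).sum() / (x.norm() * y.norm())

# style_emb, sem_emb: hypothetical [N, 768] L2-normalized embeddings from
# IdioleX-ES and a semantic encoder; `pairs` indexes the withheld test pairs.
# rho = pearson(pairwise_cosine(style_emb, pairs),
#               pairwise_cosine(sem_emb, pairs))
```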

Usage

```python
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained(
    "your-username/idiolex-bertin-spanish",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("your-username/idiolex-bertin-spanish")

sentences = [
    "Che, ¿me tirás una mano con esto?",   # Argentinian
    "Oye, ¿me echas una mano con esto?",   # Peninsular
]

inputs = tokenizer(
    sentences,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=model.config.model_len,  # 512
)

with torch.no_grad():
    embeddings = model(**inputs)   # [2, 768], L2-normalized

# Cosine similarity (embeddings are L2-normalized, so dot product = cosine sim)
similarity = embeddings @ embeddings.T
print(similarity)
```

Config

| Parameter | Value |
|---|---|
| `base_model` | `bertin-project/bertin-roberta-base-spanish` |
| `embedding_dim` | 768 |
| `layerwise_pooling` | `True` |
| `num_layers` | 13 (12 transformer layers + embedding layer) |
| `layer_norm` | `False` |
| `layer_dropout` | `None` |
| `mean_center` | `True` |
| `model_len` | 512 |

Custom files

This model uses `trust_remote_code=True`. The following files are hosted in this repo:

| File | Purpose |
|---|---|
| `configuration_idiolex.py` | `IdioleXConfig` |
| `modeling_idiolex.py` | `IdioleXModel` |
| `centering.py` | `MeanCenterer` — distributed running-mean buffer |
| `layer_pool.py` | `LayerwiseAttention` — scalar mix of transformer layers |
| `pooling_utils.py` | `last_token_pool`, `average_pool` |

Citation

@article{kantharuban2025idiolex,
  title   = {IDIOLEX: Unified and Continuous Representations for Idiolectal and Stylistic Variation},
  author  = {Kantharuban, Anjali and Srivastava, Aarohi and Faisal, Fahim and Ahia, Orevaoghene
             and Anastasopoulos, Antonios and Chiang, David and Tsvetkov, Yulia and Neubig, Graham},
  year    = {2025},
  note    = {Preprint, under review}
}