Neurobiber: Fast and Interpretable Stylistic Feature Extraction
Neurobiber is a transformer-based model that quickly predicts 96 interpretable stylistic features in text. These features are inspired by Biber's multidimensional framework of linguistic style, capturing everything from pronouns and passives to modal verbs and discourse devices. By combining a robust linguistically informed feature set with the speed of neural inference, Neurobiber enables large-scale stylistic analyses that were previously infeasible.
Why Neurobiber?
Extracting Biber-style features typically involves running a full parser or specialized tagger, which can be computationally expensive for large datasets or real-time applications. Neurobiber overcomes these challenges by:
- Operating up to 56x faster than parsing-based approaches.
- Retaining the interpretability of classical Biber-like feature definitions.
- Delivering high accuracy on diverse text genres (e.g., social media, news, literary works).
- Allowing seamless integration with modern deep learning pipelines via Hugging Face.
Example Script
The model now ships the feature names in its config, so you can map each output
dimension to its feature via model.config.id2label; no manual feature list is needed.
import torch
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "Blablablab/neurobiber"
CHUNK_SIZE = 512  # Neurobiber was trained with max_length=512

def load_model_and_tokenizer():
    # Note: this script assumes a CUDA-capable GPU is available.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).to("cuda")
    model.eval()
    return model, tokenizer

def chunk_text(text, chunk_size=CHUNK_SIZE):
    # Split on whitespace, so chunk_size counts words rather than subword tokens.
    tokens = text.strip().split()
    if not tokens:
        return []
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

def get_predictions_chunked_batch(model, tokenizer, texts, chunk_size=CHUNK_SIZE, subbatch_size=32):
    # Split every text into chunks and record which chunk range belongs to which text.
    chunked_texts = []
    chunk_indices = []
    for idx, text in enumerate(texts):
        start = len(chunked_texts)
        text_chunks = chunk_text(text, chunk_size)
        chunked_texts.extend(text_chunks)
        chunk_indices.append({
            'original_idx': idx,
            'chunk_range': (start, start + len(text_chunks))
        })

    # If there are no chunks (empty inputs), return zeros
    if not chunked_texts:
        return np.zeros((len(texts), model.config.num_labels), dtype=np.int32)

    # Score the chunks in sub-batches of at most subbatch_size.
    all_chunk_preds = []
    for i in range(0, len(chunked_texts), subbatch_size):
        batch_chunks = chunked_texts[i : i + subbatch_size]
        encodings = tokenizer(
            batch_chunks,
            return_tensors='pt',
            padding=True,
            truncation=True,
            max_length=chunk_size
        ).to("cuda")
        with torch.no_grad(), torch.amp.autocast("cuda"):
            outputs = model(**encodings)
            probs = torch.sigmoid(outputs.logits)
        all_chunk_preds.append(probs.cpu())
    all_chunk_preds = torch.cat(all_chunk_preds, dim=0) if all_chunk_preds else torch.empty(0)

    predictions = [None] * len(texts)
    for info in chunk_indices:
        start, end = info['chunk_range']
        if start == end:
            # No tokens => no features
            pred = torch.zeros(model.config.num_labels)
        else:
            # Take max across chunks for each feature
            chunk_preds = all_chunk_preds[start:end]
            pred, _ = torch.max(chunk_preds, dim=0)
        # Binarize at 0.5 to get one 0/1 label per feature
        predictions[info['original_idx']] = (pred > 0.5).int().numpy()
    return np.array(predictions)

def predict_batch(model, tokenizer, texts, chunk_size=CHUNK_SIZE, subbatch_size=32):
    return get_predictions_chunked_batch(model, tokenizer, texts, chunk_size, subbatch_size)

def predict_text(model, tokenizer, text, chunk_size=CHUNK_SIZE, subbatch_size=32):
    batch_preds = predict_batch(model, tokenizer, [text], chunk_size, subbatch_size)
    return batch_preds[0]
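One detail of the script worth seeing in isolation is that chunk_text splits on whitespace, so chunk_size counts words rather than tokenizer tokens. The snippet below is a minimal illustration with a deliberately small chunk size; the function is repeated so that the snippet runs on its own.

```python
def chunk_text(text, chunk_size=512):
    # Split on whitespace; chunk_size therefore counts words, not subword tokens.
    tokens = text.strip().split()
    if not tokens:
        return []
    return [" ".join(tokens[i:i + chunk_size]) for i in range(0, len(tokens), chunk_size)]

chunks = chunk_text("one two three four five", chunk_size=2)
print(chunks)              # ['one two', 'three four', 'five']
print(chunk_text("   "))   # [] (whitespace-only input yields no chunks)
```

Because one word can map to several subword tokens, a 512-word chunk may exceed 512 tokens; in that case the truncation=True argument in the tokenizer call drops the overflow.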
Single-Text Usage
model, tokenizer = load_model_and_tokenizer()
sample_text = "This is a sample text demonstrating certain stylistic features."
predictions = predict_text(model, tokenizer, sample_text)
# Map the 96-dim binary vector to feature names straight from the model config.
present = {model.config.id2label[i]: int(v) for i, v in enumerate(predictions)}
print(present) # {'BIN_QUAN': 0, 'BIN_QUPR': 1, ...}
print([f for f, v in present.items() if v]) # just the detected features
Batch Usage
docs = [
    "First text goes here.",
    "Second text, slightly different style."
]
model, tokenizer = load_model_and_tokenizer()
preds = predict_batch(model, tokenizer, docs)
print(preds.shape) # (2, 96)
# Names for any row come from the config:
id2label = model.config.id2label
for row in preds:
    print([id2label[i] for i, v in enumerate(row) if v])
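For corpus-level analysis, the binary matrix can be reduced to the share of documents in which each feature was detected. The following is a rough sketch rather than part of the model's API: feature_prevalence is a hypothetical helper, and the small preds matrix and three-entry id2label are made-up stand-ins for the output of predict_batch and for model.config.id2label.

```python
# Toy stand-ins: a real run would use preds from predict_batch and
# id2label from model.config.id2label (96 entries, not 3).
id2label = {0: "BIN_QUAN", 1: "BIN_QUPR", 2: "BIN_VBD"}
preds = [
    [0, 1, 1],
    [0, 0, 1],
]

def feature_prevalence(preds, id2label):
    """Return the fraction of documents in which each feature was detected."""
    n_docs = len(preds)
    if n_docs == 0:
        return {}
    return {
        id2label[i]: sum(row[i] for row in preds) / n_docs
        for i in range(len(id2label))
    }

prevalence = feature_prevalence(preds, id2label)
print(prevalence)  # {'BIN_QUAN': 0.0, 'BIN_QUPR': 0.5, 'BIN_VBD': 1.0}
```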
How It Works
Neurobiber is a fine-tuned RoBERTa. Given a text:
- The text is split into chunks of up to 512 whitespace-delimited words; the tokenizer then truncates each chunk to at most 512 tokens.
- Each chunk is fed through the model to produce 96 sigmoid probabilities (one per feature).
- The feature probabilities are aggregated across chunks by taking the per-feature
maximum, and each feature is marked as 1 if that maximum exceeds 0.5, that is, if
the model detects it in at least one chunk.
Each row in preds is a 96-element array. The mapping from index to feature name
is published in model.config.id2label (and the reverse in model.config.label2id).
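The aggregation step can be sketched in plain Python as follows. The helper name aggregate_chunks is hypothetical and the probabilities are made-up numbers, not model output. The example script performs the same max-then-threshold operation with torch.max and a 0.5 cutoff.

```python
def aggregate_chunks(chunk_probs, threshold=0.5):
    """Max-pool per-feature probabilities over chunks, then binarize."""
    if not chunk_probs:
        return []  # no chunks, nothing to aggregate
    n_features = len(chunk_probs[0])
    pooled = [max(chunk[j] for chunk in chunk_probs) for j in range(n_features)]
    return [int(p > threshold) for p in pooled]

# Toy probabilities for 2 chunks x 3 features (illustrative values only).
toy_probs = [
    [0.10, 0.80, 0.40],
    [0.30, 0.20, 0.45],
]
labels = aggregate_chunks(toy_probs)
print(labels)  # [0, 1, 0]
```

Taking the maximum means a feature is reported as present as soon as a single chunk crosses the threshold, regardless of how long the rest of the document is.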
Interpreting Outputs
- Each element is a binary label (0 or 1) indicating the model's detection of a
specific linguistic feature (e.g., BIN_VBD for past tense verbs).
- For long texts, segments of up to 512 tokens are scored independently; if a
feature appears in any chunk, the output is 1 for that feature.
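To check a single feature by name, the reverse mapping can be used to find its index. This is a minimal sketch: has_feature is a hypothetical helper, and the three-entry label2id and the prediction vector are made-up stand-ins for model.config.label2id and the output of predict_text.

```python
# Toy stand-ins; the indices here are illustrative, not the model's real ones.
label2id = {"BIN_QUAN": 0, "BIN_QUPR": 1, "BIN_VBD": 2}
prediction = [0, 1, 1]

def has_feature(prediction, label2id, name):
    """Return True if the named feature is marked 1 in the prediction vector."""
    return bool(prediction[label2id[name]])

print(has_feature(prediction, label2id, "BIN_VBD"))   # True
print(has_feature(prediction, label2id, "BIN_QUAN"))  # False
```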
Note on Feature Names
The 96 features and their order are defined by biberplus
(biberplus.tagger.constants.BIBER_PLUS_TAGS, prefixed with BIN_) and match the
training label order. This mapping is embedded in the model config, so prefer
model.config.id2label over any hardcoded list.
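The naming convention described above can be illustrated with a short sketch. The tags list is a three-item stand-in for BIBER_PLUS_TAGS, whose real contents and order are not reproduced here, so the resulting indices are illustrative only.

```python
# Stand-in for biberplus.tagger.constants.BIBER_PLUS_TAGS; this excerpt
# and its order are assumptions made for illustration.
tags = ["QUAN", "QUPR", "VBD"]

# Prefix each tag with BIN_ and number the labels in list order.
id2label = {i: f"BIN_{tag}" for i, tag in enumerate(tags)}
label2id = {label: i for i, label in id2label.items()}

print(id2label)             # {0: 'BIN_QUAN', 1: 'BIN_QUPR', 2: 'BIN_VBD'}
print(label2id["BIN_VBD"])  # 2
```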