IdiomBERT: Joint mBERT for Multilingual Idiom Detection
Fine-tuned google-bert/bert-base-multilingual-cased for joint idiom detection across
English, Spanish, Hindi, and Telugu. One forward pass produces three outputs:
- Classification: literal (0) vs. idiomatic (1)
- Span start: token index where the idiomatic span begins
- Span end: token index where the idiomatic span ends
Trained on the MultiIdiom dataset (EN+ES+HI+TE split).
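The span outputs above are positions in the tokenized input, so they count special tokens and WordPiece sub-tokens rather than whitespace-separated words. The following is a minimal sketch of turning a pair of indices back into readable text. The helper name `span_to_text` is hypothetical, the end index is assumed to be inclusive (the card does not say), and the token list is hand-written for illustration, not actual tokenizer output.

```python
def span_to_text(tokens, start, end):
    """Join the WordPiece tokens in [start, end] into text.

    Assumption: `end` is inclusive. The model card does not state this.
    """
    special = {"[CLS]", "[SEP]", "[PAD]"}
    words = []
    for tok in tokens[start:end + 1]:
        if tok in special:
            continue  # ignore special tokens if the span touches them
        if tok.startswith("##") and words:
            words[-1] += tok[2:]  # continuation piece: glue onto the previous word
        else:
            # a leading "##" at the very start of the span is stripped
            words.append(tok[2:] if tok.startswith("##") else tok)
    return " ".join(words)


# Illustrative token list (hypothetical split, not real tokenizer output).
tokens = ["[CLS]", "He", "kick", "##ed", "the", "bu", "##cket",
          "last", "night", ".", "[SEP]"]
print(span_to_text(tokens, 2, 6))  # kicked the bucket
```

With the real tokenizer, a token list of this shape can be obtained from `tokenizer.convert_ids_to_tokens(enc["input_ids"][0])`.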
Files
- pytorch_model.bin / model.safetensors: fine-tuned mBERT backbone
- task_heads.pt: three linear heads (cls_head, start_head, end_head)
- tokenizer.*: standard mBERT tokenizer
Usage
import torch
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import hf_hub_download
REPO = "Justarandomperson/IdiomBERT-system-e"
backbone = AutoModel.from_pretrained(REPO)
tokenizer = AutoTokenizer.from_pretrained(REPO)
heads = torch.load(hf_hub_download(REPO, 'task_heads.pt'), map_location='cpu', weights_only=True)
# Attach heads
hidden = backbone.config.hidden_size # 768
cls_head = torch.nn.Linear(hidden, 2)
start_head = torch.nn.Linear(hidden, 1)
end_head = torch.nn.Linear(hidden, 1)
cls_head.load_state_dict(heads['cls_head'])
start_head.load_state_dict(heads['start_head'])
end_head.load_state_dict(heads['end_head'])
backbone.eval(); cls_head.eval(); start_head.eval(); end_head.eval()
# Inference
enc = tokenizer("He kicked the bucket last night .", return_tensors='pt')
with torch.no_grad():
    seq = backbone(**enc).last_hidden_state
    label = cls_head(seq[:, 0, :]).argmax(-1).item() # 0=literal, 1=idiomatic
    start = start_head(seq).squeeze(-1).argmax(-1).item()
    end = end_head(seq).squeeze(-1).argmax(-1).item()
print(label, start, end)
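The snippet above takes the argmax of the start and end logits independently, so nothing prevents `end` from landing before `start`, or either index from landing on `[CLS]` or `[SEP]`. One common workaround, borrowed from extractive question answering and not necessarily what the authors used, is to score start/end pairs jointly. The sketch below is written over plain Python lists so it runs without the model. `best_span` and its `max_len` parameter are hypothetical, and it assumes a single unpadded sentence with `[CLS]` first and `[SEP]` last.

```python
def best_span(start_logits, end_logits, max_len=10):
    """Return the (start, end) pair with the highest summed logit, start <= end.

    Assumptions: position 0 is [CLS], the last position is [SEP], and the
    span covers at most `max_len` tokens. Returns None if no position is valid.
    """
    n = len(start_logits)
    best, best_score = None, float("-inf")
    for s in range(1, n - 1):  # skip [CLS] and [SEP]
        for e in range(s, min(s + max_len, n - 1)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Toy logits where the independent argmax would give start=2, end=0.
start_logits = [0.1, 0.2, 3.0, 0.5, 0.1, 0.0]
end_logits = [4.0, 2.5, 0.1, 0.3, 2.0, 0.2]
print(best_span(start_logits, end_logits))  # (2, 4)
```

With the heads from the usage snippet, the inputs would be `start_head(seq).squeeze(-1)[0].tolist()` and the matching end logits. The span is presumably only meaningful when `label` is 1.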