---
license: mit
datasets:
  - cnmoro/LexicalTriplets
language:
  - en
  - pt
pipeline_tag: feature-extraction
library_name: sentence-transformers
---

This model was trained on cnmoro/LexicalTriplets to produce lexical embeddings (not semantic!): texts score as similar when they are spelled alike, regardless of meaning.

This can be used to compute lexical similarity between words or phrases.

Concept: "Some text" will be similar to "Sm txt"

"King" will not be similar to "Queen" or "Royalty"

"Dog" will not be similar to "Animal"

"Doge" will be similar to "Dog"
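To make the contrast concrete, here is a minimal sketch that scores the same pairs with a plain character-matching ratio from Python's standard library (`difflib.SequenceMatcher`). This is not how the model computes similarity; it is only a stand-in baseline, and `lexical_ratio` is a hypothetical helper name introduced for illustration.

```python
from difflib import SequenceMatcher


def lexical_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in baseline, not the model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# The pairs from the concept examples above
pairs = [
    ("Some text", "Sm txt"),
    ("King", "Queen"),
    ("Dog", "Animal"),
    ("Doge", "Dog"),
]

for a, b in pairs:
    print(f"{a!r} vs {b!r}: {lexical_ratio(a, b):.2f}")
```

With this baseline, the pairs that share spelling ("Some text"/"Sm txt", "Doge"/"Dog") score high, while the pairs that share only meaning ("King"/"Queen", "Dog"/"Animal") score low. The model's embeddings are meant to show the same kind of pattern, though its actual scores will differ.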

Training is planned for 2 epochs. The model currently published here is the checkpoint from the first epoch.

import re
import unicodedata

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "cnmoro/LexicalEmbed-Base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
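# trust_remote_code=True lets transformers load the custom model code shipped with the repository
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
model.eval()  # inference mode (disables dropout)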

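def preprocess(text):
    # Decompose accented characters (NFD) and drop the combining marks, e.g. "é" -> "e"
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    # Lowercase, then replace runs of punctuation with a single space
    text = re.sub(r'[^\w\s]+', ' ', text.lower())
    # Collapse repeated whitespace and trim both ends
    return re.sub(r'\s+', ' ', text).strip()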

texts = ["hello world", "hel wor"]
# Apply the same normalization to every input before tokenizing
texts = [preprocess(s) for s in texts]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# No gradients are needed for inference
with torch.no_grad():
    embeddings = model(**inputs)

# embeddings[i] is the vector for texts[i]; compare the two vectors
cosine_sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine Similarity: {cosine_sim.item()}") # 0.8960
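To see what `preprocess` does to the input before it reaches the tokenizer, the sketch below repeats the function so it runs on its own and applies it to a few strings. The sample strings are illustrative and do not come from the training data.

```python
import re
import unicodedata


def preprocess(text):
    # Same steps as the usage snippet: strip accents, lowercase,
    # turn punctuation into spaces, collapse whitespace
    text = unicodedata.normalize('NFD', text)
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    text = re.sub(r'[^\w\s]+', ' ', text.lower())
    return re.sub(r'\s+', ' ', text).strip()


# Illustrative inputs (English and Portuguese)
samples = ["Olá, Mundo!", "Coração  São-Paulo", "  Hello---World  "]

for s in samples:
    print(repr(s), "->", repr(preprocess(s)))
# 'Olá, Mundo!' -> 'ola mundo'
# 'Coração  São-Paulo' -> 'coracao sao paulo'
# '  Hello---World  ' -> 'hello world'
```

Because accents, case, and punctuation are removed up front, "Olá" and "ola" are identical by the time they reach the model, so any text being compared should go through the same function.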