cnmoro/LexicalEmbed-Base

Feature Extraction

sentence-transformers

lexical_embedding


cnmoro committed on Dec 15, 2025

Commit fe73f95 · verified · 1 Parent(s): 489d90a

Update README.md

Files changed (1)

README.md +8 -1

@@ -25,7 +25,7 @@ Concept:
 This will be trained for 2 epochs. The current model here is the first one.
 ```python
-import torch
+import torch, re, unicodedata
 from transformers import AutoModel, AutoTokenizer
 model_name = "cnmoro/LexicalEmbed-Base"
@@ -34,7 +34,14 @@ tokenizer = AutoTokenizer.from_pretrained(model_name)
 model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
 model.eval()
+def preprocess(text):
+    text = unicodedata.normalize('NFD', text)
+    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
+    text = re.sub(r'[^\w\s]+', ' ', text.lower())
+    return re.sub(r'\s+', ' ', text).strip()
 texts = ["hello world", "hel wor"]
+texts = [ preprocess(s) for s in texts ]
 inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
 with torch.no_grad():
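
Note on the added `preprocess` helper: the sketch below copies the function from this diff and runs it on a few sample strings, to show what normalization the tokenizer input now goes through. The sample inputs other than those from the README are illustrative assumptions, and the step-by-step comments are a reading of the code rather than the author's stated intent. It needs only the standard library, so it can be checked without loading the model.

```python
import re
import unicodedata


def preprocess(text):
    # Decompose accented characters into base character + combining mark.
    text = unicodedata.normalize('NFD', text)
    # Drop the combining marks (category 'Mn'), which strips accents.
    text = ''.join(c for c in text if unicodedata.category(c) != 'Mn')
    # Lowercase, then replace runs of punctuation/symbols with a space.
    text = re.sub(r'[^\w\s]+', ' ', text.lower())
    # Collapse repeated whitespace and trim the ends.
    return re.sub(r'\s+', ' ', text).strip()


# The first two strings come from the README; the others are assumed samples.
samples = ["hello world", "hel wor", "Héllo, Wörld!", "  Café---au   lait  "]
for s in samples:
    print(repr(s), "->", repr(preprocess(s)))
```

With these inputs, the README strings pass through unchanged, while `"Héllo, Wörld!"` becomes `"hello world"` and `"  Café---au   lait  "` becomes `"cafe au lait"`. Because `\w` matches underscores and digits, those characters are kept.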