---
license: mit
language:
- en
---
# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.

<!-- The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` represents the blank position for 'a'. -->

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained using a contrastive learning approach where:

- **Positive pairs**: a character and its corresponding word with that character blanked out
- **Negative pairs**: different characters that should have dissimilar embeddings

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 ASCII + blank token)
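The model code itself is not included in this repo, so the exact wiring is unknown; a minimal sketch of an encoder matching the hyperparameters above (the class name `CharEncoder` and all structural details are assumptions, not the actual implementation) could look like:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Hypothetical encoder matching the listed hyperparameters."""
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # 256 ASCII codes + blank
        layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=num_heads,
            dim_feedforward=hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer character codes
        return self.encoder(self.embed(token_ids))

model = CharEncoder()
out = model(torch.randint(0, 257, (2, 12)))
print(out.shape)  # torch.Size([2, 12, 8])
```

With `d_model=8` and `nhead=8`, each attention head works on a single dimension; that is unusual but valid, and consistent with the small embedding size.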
### Training

The model was trained on word-definition pairs from a dictionary corpus using:
- Mixed-precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
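The training objective is not published here; a sketch of what a margin-based contrastive loss of this kind typically looks like (the function name and the margin value `0.5` are assumptions, not the repo's actual settings):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Pull a character toward its blanked-out word; push it away from
    a mismatched character whenever their similarity exceeds the margin."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return (1.0 - pos_sim).mean() + F.relu(neg_sim - margin).mean()

# Toy batch: 4 anchor/positive/negative triples of 8-dim embeddings
a, p, n = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
loss = contrastive_loss(a, p, n)
print(loss.item())  # non-negative scalar
```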
## Installation

```bash
pip install torch numpy
```

## Usage

### Loading the Model

```python
import torch
import numpy as np
import torch.nn.functional as F

# Load the pre-computed character embeddings
char_embeddings = np.load('char_embeddings_best.npy', allow_pickle=True).item()

# Convert to tensors for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
    char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
```
### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}")  # Should be (8,)
```
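The per-character vectors can also be composed into a word-level representation. A sketch (`embed_word` is an illustration, not part of this repo; the small random dict below is a stand-in for the real `char_embedding_tensor` loaded above):

```python
import torch

# Stand-in for the dict built in "Loading the Model"; swap in the real
# `char_embedding_tensor` after loading `char_embeddings_best.npy`.
char_embedding_tensor = {c: torch.randn(8) for c in 'aelp'}

def embed_word(word):
    """Stack per-character embeddings into a (len(word), 8) tensor,
    skipping characters that have no embedding."""
    vecs = [char_embedding_tensor[c] for c in word if c in char_embedding_tensor]
    return torch.stack(vecs) if vecs else None

word_emb = embed_word('apple')
print(word_emb.shape)  # torch.Size([5, 8])
```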
### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    # Normalize the query embedding
    query_embedding = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # Normalize the character embedding
        emb_norm = F.normalize(emb.unsqueeze(0), p=2, dim=-1)
        # Compute cosine similarity
        sim = F.cosine_similarity(query_embedding, emb_norm, dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```
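`decode_embedding` loops over the vocabulary in Python; for repeated queries, the same nearest-character search can be done with a single matrix multiply. A sketch (`decode_embedding_batched` is an illustration, not part of this repo; the random dict is a stand-in for the loaded `char_embedding_tensor`):

```python
import torch
import torch.nn.functional as F

# Stand-in for the loaded `char_embedding_tensor` dict
torch.manual_seed(0)
char_embedding_tensor = {c: torch.randn(8) for c in 'abcde'}

def decode_embedding_batched(query_embedding, top_k=5):
    """Vectorized nearest-character search: one (vocab, 8) matrix
    multiply instead of a per-character Python loop."""
    chars = list(char_embedding_tensor)
    matrix = F.normalize(torch.stack([char_embedding_tensor[c] for c in chars]), dim=-1)
    query = F.normalize(query_embedding, dim=-1)
    sims = matrix @ query  # cosine similarity against every character at once
    scores, idx = sims.topk(min(top_k, len(chars)))
    return [(chars[i], s.item()) for i, s in zip(idx.tolist(), scores)]

matches = decode_embedding_batched(char_embedding_tensor['a'], top_k=3)
print(matches[0])  # ('a', ...): a vector is most similar to itself
```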
## Model File

- `char_embeddings_best.npy`: pre-computed character embeddings, stored as a NumPy dictionary mapping characters to vectors

## Limitations

- The model only supports ASCII characters (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture every nuance of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/character-embedding}}
}
```