---
license: mit
language:
- en
---
# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.

<!-- The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` represents the blank position for 'a'. -->

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained using a contrastive learning approach where:

- **Positive pairs**: a character and its corresponding word with that character blanked out
- **Negative pairs**: different characters that should have dissimilar embeddings

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 ASCII + blank token)
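The model code itself is not included in this repo, so the exact wiring is unknown; a minimal sketch of an encoder matching the hyperparameters above (the class name `CharEncoder` and all structural details are assumptions, not the actual implementation) could look like:

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Hypothetical encoder matching the listed hyperparameters."""
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)  # 256 ASCII codes + blank
        layer = nn.TransformerEncoderLayer(
            d_model=emb_dim, nhead=num_heads,
            dim_feedforward=hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer character codes
        return self.encoder(self.embed(token_ids))

model = CharEncoder()
out = model(torch.randint(0, 257, (2, 12)))
print(out.shape)  # torch.Size([2, 12, 8])
```

With `d_model=8` and `nhead=8`, each attention head works on a single dimension; that is unusual but valid, and consistent with the small embedding size.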
### Training

The model was trained on word-definition pairs from a dictionary corpus using:
- Mixed-precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
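The training objective is not published here; a sketch of what a margin-based contrastive loss of this kind typically looks like (the function name and the margin value `0.5` are assumptions, not the repo's actual settings):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, negative, margin=0.5):
    """Pull a character toward its blanked-out word; push it away from
    a mismatched character whenever their similarity exceeds the margin."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return (1.0 - pos_sim).mean() + F.relu(neg_sim - margin).mean()

# Toy batch: 4 anchor/positive/negative triples of 8-dim embeddings
a, p, n = torch.randn(4, 8), torch.randn(4, 8), torch.randn(4, 8)
loss = contrastive_loss(a, p, n)
print(loss.item())  # non-negative scalar
```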
## Installation

```bash
pip install torch numpy
```

## Usage

### Loading the Model

```python
import torch
import numpy as np
import torch.nn.functional as F

# Load the pre-computed character embeddings
char_embeddings = np.load('char_embeddings_best.npy', allow_pickle=True).item()

# Convert to tensors for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
    char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
```
### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}")  # Should be (8,)
```
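The per-character vectors can also be composed into a word-level representation. A sketch (`embed_word` is an illustration, not part of this repo; the small random dict below is a stand-in for the real `char_embedding_tensor` loaded above):

```python
import torch

# Stand-in for the dict built in "Loading the Model"; swap in the real
# `char_embedding_tensor` after loading `char_embeddings_best.npy`.
char_embedding_tensor = {c: torch.randn(8) for c in 'aelp'}

def embed_word(word):
    """Stack per-character embeddings into a (len(word), 8) tensor,
    skipping characters that have no embedding."""
    vecs = [char_embedding_tensor[c] for c in word if c in char_embedding_tensor]
    return torch.stack(vecs) if vecs else None

word_emb = embed_word('apple')
print(word_emb.shape)  # torch.Size([5, 8])
```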
### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    # Normalize the query embedding
    query_embedding = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # Normalize the character embedding
        emb_norm = F.normalize(emb.unsqueeze(0), p=2, dim=-1)
        # Compute cosine similarity
        sim = F.cosine_similarity(query_embedding, emb_norm, dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```
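`decode_embedding` loops over the vocabulary in Python; for repeated queries, the same nearest-character search can be done with a single matrix multiply. A sketch (`decode_embedding_batched` is an illustration, not part of this repo; the random dict is a stand-in for the loaded `char_embedding_tensor`):

```python
import torch
import torch.nn.functional as F

# Stand-in for the loaded `char_embedding_tensor` dict
torch.manual_seed(0)
char_embedding_tensor = {c: torch.randn(8) for c in 'abcde'}

def decode_embedding_batched(query_embedding, top_k=5):
    """Vectorized nearest-character search: one (vocab, 8) matrix
    multiply instead of a per-character Python loop."""
    chars = list(char_embedding_tensor)
    matrix = F.normalize(torch.stack([char_embedding_tensor[c] for c in chars]), dim=-1)
    query = F.normalize(query_embedding, dim=-1)
    sims = matrix @ query  # cosine similarity against every character at once
    scores, idx = sims.topk(min(top_k, len(chars)))
    return [(chars[i], s.item()) for i, s in zip(idx.tolist(), scores)]

matches = decode_embedding_batched(char_embedding_tensor['a'], top_k=3)
print(matches[0])  # ('a', ...): a vector is most similar to itself
```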
## Model File

- `char_embeddings_best.npy`: pre-computed character embeddings, stored as a NumPy dictionary mapping characters to vectors

## Limitations

- The model only supports ASCII characters (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture every nuance of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/character-embedding}}
}
```