npc0 committed affc78a (verified) · 1 Parent(s): 15adc5f

Update README.md

Files changed (1): README.md (+137 -3)
---
license: mit
language:
- en
---
# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.
<!-- The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the blank position for 'a'. -->

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained using a contrastive learning approach where:

- **Positive pairs**: a character and its corresponding word with that character blanked out
- **Negative pairs**: different characters that should have dissimilar embeddings
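
The pairing scheme above can be sketched as a margin-based contrastive loss. This is an illustrative reading of the description, not the authors' exact training objective; the function name, margin value, and use of cosine similarity are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(char_emb, pos_word_emb, neg_word_emb, margin=0.5):
    """Margin-based contrastive loss (sketch, not the released training code).

    Pulls a character embedding toward the embedding of its blanked-out
    word (positive pair) and pushes it away from an unrelated embedding
    (negative pair) until the gap is at least `margin`.
    """
    pos_sim = F.cosine_similarity(char_emb, pos_word_emb, dim=-1)
    neg_sim = F.cosine_similarity(char_emb, neg_word_emb, dim=-1)
    # Loss is zero once neg_sim sits at least `margin` below pos_sim.
    return torch.clamp(neg_sim - pos_sim + margin, min=0.0).mean()

# Toy check with random 8-dimensional embeddings
torch.manual_seed(0)
a, p, n = torch.randn(3, 8), torch.randn(3, 8), torch.randn(3, 8)
loss = contrastive_loss(a, p, n)
```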

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 ASCII + blank token)
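
One plausible way these hyperparameters fit together is sketched below. The exact wiring of the released encoder (pooling, projection, feed-forward width) is not stated in this README, so everything beyond the five listed numbers is an assumption:

```python
import torch
import torch.nn as nn

# Hyperparameters from the list above; the module structure is a guess.
VOCAB_SIZE, EMB_DIM, HIDDEN_DIM, N_LAYERS, N_HEADS = 257, 8, 64, 2, 8

class CharEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, HIDDEN_DIM)
        layer = nn.TransformerEncoderLayer(
            d_model=HIDDEN_DIM, nhead=N_HEADS, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.project = nn.Linear(HIDDEN_DIM, EMB_DIM)  # final 8-d embedding

    def forward(self, token_ids):
        h = self.encoder(self.embed(token_ids))   # (batch, seq, HIDDEN_DIM)
        return self.project(h.mean(dim=1))        # mean-pool, then project to 8-d

model = CharEncoder()
out = model(torch.randint(0, VOCAB_SIZE, (4, 10)))  # batch of 4 sequences
```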

### Training

The model was trained on word-definition pairs from a dictionary corpus using:

- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best model selection based on a quality score (positive similarity − negative similarity)
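
The quality score used for best-model selection can be sketched directly from its definition. Assuming the similarity metric is cosine similarity (the README does not state it outright):

```python
import torch
import torch.nn.functional as F

def quality_score(char_embs, pos_embs, neg_embs):
    """Quality score as described above: mean positive similarity minus
    mean negative similarity (sketch; assumes cosine similarity)."""
    pos = F.cosine_similarity(char_embs, pos_embs, dim=-1).mean()
    neg = F.cosine_similarity(char_embs, neg_embs, dim=-1).mean()
    return (pos - neg).item()

torch.manual_seed(0)
x = torch.randn(16, 8)
score = quality_score(x, x, -x)  # perfect separation: 1 − (−1) = 2
```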

## Installation

```bash
pip install torch numpy
```

## Usage

### Loading the Model

```python
import torch
import numpy as np
import torch.nn.functional as F

# Load the pre-computed character embeddings
char_embeddings = np.load('char_embeddings_best.npy', allow_pickle=True).item()

# Convert to tensors for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
    char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
```
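
Once loaded, the per-character tensors can also be stacked into a single matrix, which makes batched similarity operations cheap later on. A minimal sketch, using a toy dictionary in place of the real `char_embeddings_best.npy`:

```python
import torch

# Toy stand-in for the loaded embeddings (real values come from the .npy file)
torch.manual_seed(0)
char_embedding_tensor = {c: torch.randn(8) for c in "abc"}

chars = list(char_embedding_tensor)  # index -> character lookup
emb_matrix = torch.stack([char_embedding_tensor[c] for c in chars])  # (V, 8)
```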

### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}")  # Should be (8,)
```
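
Building on the single-character lookup, a word can be embedded as the sequence of its character embeddings. `embed_word` is a hypothetical helper, not part of the released model code, shown here with toy embeddings standing in for the real dictionary:

```python
import torch

# Toy embeddings standing in for the real `char_embedding_tensor`
torch.manual_seed(0)
char_embedding_tensor = {c: torch.randn(8) for c in "abcdefghijklmnopqrstuvwxyz"}

def embed_word(word):
    """Embed a word as a (len(word), 8) tensor of character embeddings,
    skipping characters that have no embedding. Illustrative sketch."""
    vecs = [char_embedding_tensor[c] for c in word if c in char_embedding_tensor]
    return torch.stack(vecs) if vecs else None

apple = embed_word("apple")
```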

### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    # Normalize the query embedding
    query_embedding = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # Normalize the character embedding
        emb_norm = F.normalize(emb.unsqueeze(0), p=2, dim=-1)
        # Compute cosine similarity
        sim = F.cosine_similarity(query_embedding, emb_norm, dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```
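
The per-character loop in `decode_embedding` can be vectorized: stack the normalized embeddings into one matrix, and a single matrix multiply scores every character at once. A sketch with toy embeddings standing in for the real ones (`decode_embedding_fast` is a hypothetical name):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
char_embedding_tensor = {c: torch.randn(8) for c in "abcde"}

chars = list(char_embedding_tensor)
emb_matrix = F.normalize(torch.stack([char_embedding_tensor[c] for c in chars]),
                         p=2, dim=-1)                       # (V, 8), unit rows

def decode_embedding_fast(query, top_k=3):
    """Vectorized equivalent of the decode_embedding loop above."""
    q = F.normalize(query.unsqueeze(0), p=2, dim=-1)         # (1, 8)
    sims = (emb_matrix @ q.T).squeeze(1)                     # cosine sims, (V,)
    scores, idx = sims.topk(min(top_k, len(chars)))
    return [(chars[i], s.item()) for i, s in zip(idx.tolist(), scores)]

matches = decode_embedding_fast(char_embedding_tensor['a'])
```

A character's nearest neighbor under this scheme is itself, with cosine similarity 1.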

## Model File

- `char_embeddings_best.npy`: pre-computed character embeddings (numpy dictionary)

## Limitations

- The model only supports ASCII characters (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/character-embedding}}
}
```