---
license: mit
language:
- en
---
# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary. The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the blanked-out position of the 'a' in 'apple'.

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It's trained using a contrastive learning approach where:

- **Positive pairs**: A character and its corresponding word with that character blanked out
- **Negative pairs**: Different characters that should have dissimilar embeddings
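
In code, such a margin-based contrastive objective can look like the following minimal sketch (the function name, tensor shapes, and margin value are illustrative assumptions, not the actual training code):

```python
import torch.nn.functional as F

def contrastive_loss(char_emb, pos_ctx_emb, neg_ctx_emb, margin=0.5):
    """Pull a character embedding toward its blanked-word context and
    push it at least `margin` away from a mismatched context."""
    pos_sim = F.cosine_similarity(char_emb, pos_ctx_emb, dim=-1)
    neg_sim = F.cosine_similarity(char_emb, neg_ctx_emb, dim=-1)
    # Loss is zero once positives beat negatives by at least the margin
    return F.relu(margin - pos_sim + neg_sim).mean()
```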

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 extended-ASCII byte values + 1 blank token)
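
Taken together, these hyperparameters suggest an encoder along the following lines. This is a hypothetical reconstruction for orientation only (the layer arrangement, pooling, and output projection are assumptions, not the released training code):

```python
import torch.nn as nn

class CharContextEncoder(nn.Module):
    """Hypothetical sketch matching the hyperparameters listed above."""

    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64,
                 n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # project down to 8 dims

    def forward(self, token_ids):                 # (batch, seq_len) ints
        h = self.encoder(self.embed(token_ids))   # (batch, seq_len, 64)
        return self.proj(h.mean(dim=1))           # mean-pool -> (batch, 8)
```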

### Training

The model was trained on word-definition pairs from a dictionary corpus using:
- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity), as sketched below
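
In code, that selection criterion is simply the gap between the two similarity averages (a sketch; the helper name and the checkpointing details are assumptions):

```python
def quality_score(pos_sims, neg_sims):
    """Higher is better: positive pairs close, negative pairs far apart."""
    return pos_sims.mean().item() - neg_sims.mean().item()

# Inside the training loop (illustrative):
#     score = quality_score(pos_sims, neg_sims)
#     if score > best_score:
#         best_score = score
#         np.save("char_embeddings_best.npy", current_embeddings)
```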

## Installation

```bash
pip install torch numpy huggingface_hub
```

## Usage

### Loading the Model

```python
import torch
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# Example for downloading a single file
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")

# Load the pre-computed character embeddings
char_embeddings = np.load(local_path, allow_pickle=True).item()

# Convert to tensor for efficient operations
char_embedding_tensor = {
    char: torch.tensor(emb, dtype=torch.float32)
    for char, emb in char_embeddings.items()
}
```
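
A quick sanity check after loading (this assumes the dictionary maps single characters to 8-dimensional vectors, per the architecture above):

```python
print(f"Loaded {len(char_embedding_tensor)} character embeddings")
sample_char = next(iter(char_embedding_tensor))
print(sample_char, char_embedding_tensor[sample_char].shape)  # torch.Size([8])
```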

### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    else:
        print(f"Warning: Character '{char}' not found in embeddings")
        return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}")  # Should be (8,)
```

### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.
    
    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return
    
    Returns:
        List of (character, similarity_score) tuples
    """
    similarities = []
    for char, emb in char_embedding_tensor.items():
        # F.cosine_similarity normalizes both inputs internally,
        # so no explicit F.normalize step is needed
        sim = F.cosine_similarity(
            query_embedding.unsqueeze(0), emb.unsqueeze(0), dim=-1
        ).item()
        similarities.append((char, sim))
    
    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```

## Model File

- `char_embeddings_best.npy`: Pre-computed character embeddings, stored as a pickled NumPy dictionary mapping each character to its 8-dimensional vector
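
For reference, this storage format round-trips as follows (the values below are placeholders, not the real embeddings):

```python
import numpy as np

# Save a dict of char -> (8,) float32 arrays as a pickled .npy object...
example = {"a": np.zeros(8, dtype=np.float32)}
np.save("example_embeddings.npy", example)

# ...and recover it with allow_pickle=True plus .item(), as in Usage above
loaded = np.load("example_embeddings.npy", allow_pickle=True).item()
assert np.allclose(loaded["a"], example["a"])
```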

## Limitations

- The model only covers byte values 0-255 (extended ASCII) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}
```