---
license: mit
language:
- en
---
# Character Embedding Model
A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.
The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the position of 'a' blanked out of "apple".
## Model Description
This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It's trained using a contrastive learning approach where:
- **Positive pairs**: A character and its corresponding word with that character blanked out
- **Negative pairs**: Different characters that should have dissimilar embeddings
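To make the pairing concrete, here is a minimal sketch of how a positive pair can be constructed by blanking out one character. The `_` placeholder follows the example in the description; the helper name is illustrative, not part of the released code:

```python
def make_positive_pair(word: str, index: int):
    """Blank out the character at `index`: ("apple", 0) -> ('a', '_pple')."""
    return word[index], word[:index] + "_" + word[index + 1:]

char, blanked = make_positive_pair("apple", 0)
print(char, blanked)  # a _pple
```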
### Architecture
- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (character codes 0-255 plus a special blank token)
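The repository ships only the pre-computed embeddings, not the network weights, so the exact wiring is not published. Below is a hypothetical PyTorch reconstruction consistent with the hyperparameters above; in particular, the final projection from the 64-dim hidden state down to the 8-dim embedding and the mean-pooling are assumptions:

```python
import torch
import torch.nn as nn

class CharEmbEncoder(nn.Module):
    """Hypothetical sketch matching the listed hyperparameters."""
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=4 * hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # down to the 8-dim embedding

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.proj(h.mean(dim=1))        # mean-pool -> (batch, emb_dim)
```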
### Training
The model was trained on word-definition pairs from a dictionary corpus using:
- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
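The loss function itself is not included in this repository; the following is a minimal sketch of a margin-based contrastive loss on cosine similarities, consistent with the bullets above (the margin value is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(anchor, positive, negative, margin=0.5):
    """Pull positive pairs together; push negative similarity at least `margin` below."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    # The quality score used for model selection would be (pos_sim - neg_sim).mean()
    return F.relu(neg_sim - pos_sim + margin).mean()
```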
## Installation
```bash
pip install torch numpy huggingface_hub
```
## Usage
### Loading the Model
```python
import torch
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# Example for downloading a single file
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")
# Load the pre-computed character embeddings
char_embeddings = np.load(local_path, allow_pickle=True).item()
# Convert to tensor for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
```
### Inference: Character to Embedding
```python
def get_character_embedding(char):
"""Get the embedding for a single character."""
if char in char_embedding_tensor:
return char_embedding_tensor[char]
else:
print(f"Warning: Character '{char}' not found in embeddings")
return None
# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}") # Should be (8,)
```
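To embed a whole word, the same lookup can be applied per character and stacked (a short usage sketch; it assumes every character of the word is present in the embedding dict):

```python
word = "hello"
word_embs = torch.stack([get_character_embedding(c) for c in word])
print(word_embs.shape)  # torch.Size([5, 8]): one 8-dim embedding per character
```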
### Inference: Embedding to Character (Decoding)
```python
def decode_embedding(query_embedding, top_k=5):
"""
Find the closest character(s) to a given embedding.
Args:
query_embedding: torch.Tensor of shape (8,)
top_k: Number of closest characters to return
Returns:
List of (character, similarity_score) tuples
"""
    # F.cosine_similarity normalizes its inputs internally, so no explicit
    # F.normalize pass is needed before comparing
    query_embedding = query_embedding.unsqueeze(0)
    similarities = []
    for char, emb in char_embedding_tensor.items():
        sim = F.cosine_similarity(query_embedding, emb.unsqueeze(0), dim=-1).item()
        similarities.append((char, sim))
# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)
if test_embedding is not None:
top_matches = decode_embedding(test_embedding, top_k=5)
print(f"\nTop 5 characters similar to '{test_char}':")
for char, sim in top_matches:
print(f" '{char}': {sim:.4f}")
```
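For repeated queries, the per-character loop can be replaced by a single matrix multiplication over a pre-built embedding matrix. A sketch using the same `char_embedding_tensor` dict loaded above (the function name is illustrative):

```python
chars = list(char_embedding_tensor.keys())
# Stack all embeddings into a (vocab, 8) matrix with unit-norm rows
emb_matrix = F.normalize(torch.stack([char_embedding_tensor[c] for c in chars]),
                         p=2, dim=-1)

def decode_embedding_fast(query_embedding, top_k=5):
    q = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)
    sims = (q @ emb_matrix.T).squeeze(0)  # cosine similarities against all chars
    values, indices = sims.topk(top_k)
    return [(chars[i], v) for i, v in zip(indices.tolist(), values.tolist())]
```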
## Model File
- `char_embeddings_best.npy`: Pre-computed character embeddings (a pickled NumPy dictionary mapping each character to an 8-dimensional vector)
## Limitations
- The model only supports ASCII characters (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency
## Citation
```bibtex
@misc{character_embedding_model,
title={Character Embedding Model with Blank-Filling},
author={Yuan Xu},
year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}
``` |