---
license: mit
language:
- en
---

# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.

The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the blank position left by removing 'a'.

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained with a contrastive learning approach where:

- **Positive pairs**: a character and its corresponding word with that character blanked out (see the sketch below)
- **Negative pairs**: different characters, whose embeddings should be dissimilar
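
As an illustration of how such a positive pair might be constructed (a minimal sketch building on the `emb('_pple')` example above; the `_` placeholder and the first-occurrence rule are assumptions, not the released training code):

```python
def make_positive_pair(word: str, char: str):
    """Pair a character with its word, blanking the first occurrence.
    '_' stands in for the model's blank token (an assumption here)."""
    assert char in word
    return char, word.replace(char, "_", 1)

print(make_positive_pair("apple", "a"))  # ('a', '_pple')
```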

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 ASCII code points + 1 blank token)
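
One plausible wiring of these hyperparameters is sketched below. How the 64-dimensional hidden size maps down to the 8-dimensional output is an assumption (here, mean pooling followed by a linear projection); the released weights may be organized differently.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Hypothetical encoder wiring the hyperparameters listed above."""

    def __init__(self, vocab_size=257, hidden_dim=64, emb_dim=8,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # project down to 8-d

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        h = self.encoder(self.token_emb(token_ids))
        return self.proj(h.mean(dim=1))  # mean-pool to one 8-d vector
```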

### Training

The model was trained on word-definition pairs from a dictionary corpus using:

- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling (a sketch follows this list)
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
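
A minimal sketch of what a margin-based contrastive loss of this kind typically looks like (the margin value and the mean reduction are assumptions, not the repository's training code):

```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(anchor, positive, negative, margin=0.5):
    """Pull positive pairs together and push negative pairs apart
    until they are at least `margin` less similar (margin assumed)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(neg_sim - pos_sim + margin, min=0).mean()
```

Under this formulation, the quality score used for best-model selection is simply `pos_sim - neg_sim` averaged over pairs.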

## Installation

```bash
pip install torch numpy huggingface_hub
```

## Usage

### Loading the Model

```python
import numpy as np
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# Download the embeddings file from the Hub
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")

# Load the pre-computed character embeddings (a pickled dict: char -> vector)
char_embeddings = np.load(local_path, allow_pickle=True).item()

# Convert to tensors for efficient operations
char_embedding_tensor = {
    char: torch.tensor(emb, dtype=torch.float32)
    for char, emb in char_embeddings.items()
}
```

### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    print(f"Warning: Character '{char}' not found in embeddings")
    return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
if embedding is not None:
    print(f"Embedding for '{char}': {embedding}")
    print(f"Embedding shape: {embedding.shape}")  # torch.Size([8])
```
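
One simple application of this lookup (purely illustrative; not something the repository ships) is a bag-of-characters word vector built by averaging; `embed_word` here is a hypothetical helper:

```python
def embed_word(word):
    """Average the character embeddings of a word (illustrative only)."""
    vecs = [get_character_embedding(c) for c in word]
    vecs = [v for v in vecs if v is not None]
    return torch.stack(vecs).mean(dim=0) if vecs else None

print(embed_word("apple"))  # a single 8-dimensional vector
```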

### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    query = query_embedding.unsqueeze(0)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # F.cosine_similarity L2-normalizes both inputs internally,
        # so no explicit normalization step is needed
        sim = F.cosine_similarity(query, emb.unsqueeze(0), dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```
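
Because `decode_embedding` is a nearest-neighbour search, it also handles vectors that are not exact table entries. A quick illustrative check (the noise scale is arbitrary) is to perturb a known embedding and confirm it still decodes to the original character:

```python
if test_embedding is not None:
    noisy = test_embedding + 0.05 * torch.randn_like(test_embedding)
    print(decode_embedding(noisy, top_k=3))  # 'e' should rank first
```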

## Model File

- `char_embeddings_best.npy`: pre-computed character embeddings, stored as a pickled NumPy dictionary mapping characters to vectors

## Limitations

- The model only supports ASCII characters (byte values 0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}
```