---
license: mit
language:
- en
---
# Character Embedding Model
A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.
The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the position of 'a' blanked out of "apple".
## Model Description
This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It's trained using a contrastive learning approach where:
- **Positive pairs**: A character and its corresponding word with that character blanked out
- **Negative pairs**: Different characters that should have dissimilar embeddings
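To make the pairing concrete, here is a minimal sketch of how a positive pair can be constructed by blanking out one character. The `_` placeholder follows the example in the description; the helper name is illustrative, not part of the released code:

```python
def make_positive_pair(word: str, index: int):
    """Blank out the character at `index`: ("apple", 0) -> ('a', '_pple')."""
    return word[index], word[:index] + "_" + word[index + 1:]

char, blanked = make_positive_pair("apple", 0)
print(char, blanked)  # a _pple
```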
### Architecture
- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (character codes 0-255 plus a special blank token)
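The repository ships only the pre-computed embeddings, not the network weights, so the exact wiring is not published. Below is a hypothetical PyTorch reconstruction consistent with the hyperparameters above; in particular, the final projection from the 64-dim hidden state down to the 8-dim embedding and the mean-pooling are assumptions:

```python
import torch
import torch.nn as nn

class CharEmbEncoder(nn.Module):
    """Hypothetical sketch matching the listed hyperparameters."""
    def __init__(self, vocab_size=257, emb_dim=8, hidden_dim=64,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads,
            dim_feedforward=4 * hidden_dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # down to the 8-dim embedding

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        h = self.encoder(self.embed(token_ids))
        return self.proj(h.mean(dim=1))        # mean-pool -> (batch, emb_dim)
```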
### Training
The model was trained on word-definition pairs from a dictionary corpus using:
- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
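The loss function itself is not included in this repository; the following is a minimal sketch of a margin-based contrastive loss on cosine similarities, consistent with the bullets above (the margin value is an assumed hyperparameter):

```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(anchor, positive, negative, margin=0.5):
    """Pull positive pairs together; push negative similarity at least `margin` below."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    # The quality score used for model selection would be (pos_sim - neg_sim).mean()
    return F.relu(neg_sim - pos_sim + margin).mean()
```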
## Installation
```bash
pip install torch numpy huggingface_hub
```
## Usage
### Loading the Model
```python
import torch
import numpy as np
import torch.nn.functional as F
from huggingface_hub import hf_hub_download
# Example for downloading a single file
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")
# Load the pre-computed character embeddings
char_embeddings = np.load(local_path, allow_pickle=True).item()
# Convert to tensor for efficient operations
char_embedding_tensor = {}
for char, emb in char_embeddings.items():
char_embedding_tensor[char] = torch.tensor(emb, dtype=torch.float32)
```
### Inference: Character to Embedding
```python
def get_character_embedding(char):
"""Get the embedding for a single character."""
if char in char_embedding_tensor:
return char_embedding_tensor[char]
else:
print(f"Warning: Character '{char}' not found in embeddings")
return None
# Example usage
char = 'a'
embedding = get_character_embedding(char)
print(f"Embedding for '{char}': {embedding}")
print(f"Embedding shape: {embedding.shape}") # Should be (8,)
```
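To embed a whole word, the same lookup can be applied per character and stacked (a short usage sketch; it assumes every character of the word is present in the embedding dict):

```python
word = "hello"
word_embs = torch.stack([get_character_embedding(c) for c in word])
print(word_embs.shape)  # torch.Size([5, 8]): one 8-dim embedding per character
```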
### Inference: Embedding to Character (Decoding)
```python
def decode_embedding(query_embedding, top_k=5):
"""
Find the closest character(s) to a given embedding.
Args:
query_embedding: torch.Tensor of shape (8,)
top_k: Number of closest characters to return
Returns:
List of (character, similarity_score) tuples
"""
    # F.cosine_similarity normalizes its inputs internally, so no explicit
    # F.normalize pass is needed before comparing
    query_embedding = query_embedding.unsqueeze(0)
    similarities = []
    for char, emb in char_embedding_tensor.items():
        sim = F.cosine_similarity(query_embedding, emb.unsqueeze(0), dim=-1).item()
        similarities.append((char, sim))
# Sort by similarity (descending)
similarities.sort(key=lambda x: x[1], reverse=True)
return similarities[:top_k]
# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)
if test_embedding is not None:
top_matches = decode_embedding(test_embedding, top_k=5)
print(f"\nTop 5 characters similar to '{test_char}':")
for char, sim in top_matches:
print(f" '{char}': {sim:.4f}")
```
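For repeated queries, the per-character loop can be replaced by a single matrix multiplication over a pre-built embedding matrix. A sketch using the same `char_embedding_tensor` dict loaded above (the function name is illustrative):

```python
chars = list(char_embedding_tensor.keys())
# Stack all embeddings into a (vocab, 8) matrix with unit-norm rows
emb_matrix = F.normalize(torch.stack([char_embedding_tensor[c] for c in chars]),
                         p=2, dim=-1)

def decode_embedding_fast(query_embedding, top_k=5):
    q = F.normalize(query_embedding.unsqueeze(0), p=2, dim=-1)
    sims = (q @ emb_matrix.T).squeeze(0)  # cosine similarities against all chars
    values, indices = sims.topk(top_k)
    return [(chars[i], v) for i, v in zip(indices.tolist(), values.tolist())]
```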
## Model File
- `char_embeddings_best.npy`: Pre-computed character embeddings (a pickled NumPy dictionary mapping each character to an 8-dimensional vector)
## Limitations
- The model only supports ASCII characters (0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency
## Citation
```bibtex
@misc{character_embedding_model,
title={Character Embedding Model with Blank-Filling},
author={Yuan Xu},
year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}
``` |