---
license: mit
language:
- en
---

# Character Embedding Model

A character-level embedding model for ASCII characters, trained on the Oxford English Dictionary.

The model learns to represent characters such that `emb('a')` is close to `emb('_pple')`, where `_` marks the blank position left by removing 'a'.

## Model Description

This model uses a Transformer-based architecture to create embeddings that capture the contextual relationship between characters and their positions in words. It is trained with a contrastive learning approach where:

- **Positive pairs**: a character and its corresponding word with that character blanked out (see the sketch below)
- **Negative pairs**: different characters, whose embeddings should be dissimilar
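
As an illustration of how such a positive pair might be constructed (a minimal sketch building on the `emb('_pple')` example above; the `_` placeholder and the first-occurrence rule are assumptions, not the released training code):

```python
def make_positive_pair(word: str, char: str):
    """Pair a character with its word, blanking the first occurrence.
    '_' stands in for the model's blank token (an assumption here)."""
    assert char in word
    return char, word.replace(char, "_", 1)

print(make_positive_pair("apple", "a"))  # ('a', '_pple')
```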

### Architecture

- **Embedding Dimension**: 8
- **Hidden Dimension**: 64
- **Transformer Layers**: 2
- **Attention Heads**: 8
- **Vocabulary Size**: 257 (256 ASCII code points + 1 blank token)
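
One plausible wiring of these hyperparameters is sketched below. How the 64-dimensional hidden size maps down to the 8-dimensional output is an assumption (here, mean pooling followed by a linear projection); the released weights may be organized differently.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Hypothetical encoder wiring the hyperparameters listed above."""

    def __init__(self, vocab_size=257, hidden_dim=64, emb_dim=8,
                 num_layers=2, num_heads=8):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.proj = nn.Linear(hidden_dim, emb_dim)  # project down to 8-d

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        h = self.encoder(self.token_emb(token_ids))
        return self.proj(h.mean(dim=1))  # mean-pool to one 8-d vector
```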

### Training

The model was trained on word-definition pairs from a dictionary corpus using:

- Mixed precision training (FP16)
- Contrastive loss with margin-based negative sampling (a sketch follows this list)
- Periodic embedding stabilization
- Best-model selection based on a quality score (positive similarity minus negative similarity)
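
A minimal sketch of what a margin-based contrastive loss of this kind typically looks like (the margin value and the mean reduction are assumptions, not the repository's training code):

```python
import torch
import torch.nn.functional as F

def contrastive_margin_loss(anchor, positive, negative, margin=0.5):
    """Pull positive pairs together and push negative pairs apart
    until they are at least `margin` less similar (margin assumed)."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)
    neg_sim = F.cosine_similarity(anchor, negative, dim=-1)
    return torch.clamp(neg_sim - pos_sim + margin, min=0).mean()
```

Under this formulation, the quality score used for best-model selection is simply `pos_sim - neg_sim` averaged over pairs.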

## Installation

```bash
pip install torch numpy huggingface_hub
```

## Usage

### Loading the Model

```python
import numpy as np
import torch
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# Download the embeddings file from the Hub
local_path = hf_hub_download(repo_id="npc0/CharEmb", filename="char_embeddings_best.npy")

# Load the pre-computed character embeddings (a pickled dict: char -> vector)
char_embeddings = np.load(local_path, allow_pickle=True).item()

# Convert to tensors for efficient operations
char_embedding_tensor = {
    char: torch.tensor(emb, dtype=torch.float32)
    for char, emb in char_embeddings.items()
}
```

### Inference: Character to Embedding

```python
def get_character_embedding(char):
    """Get the embedding for a single character."""
    if char in char_embedding_tensor:
        return char_embedding_tensor[char]
    print(f"Warning: Character '{char}' not found in embeddings")
    return None

# Example usage
char = 'a'
embedding = get_character_embedding(char)
if embedding is not None:
    print(f"Embedding for '{char}': {embedding}")
    print(f"Embedding shape: {embedding.shape}")  # torch.Size([8])
```
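
One simple application of this lookup (purely illustrative; not something the repository ships) is a bag-of-characters word vector built by averaging; `embed_word` here is a hypothetical helper:

```python
def embed_word(word):
    """Average the character embeddings of a word (illustrative only)."""
    vecs = [get_character_embedding(c) for c in word]
    vecs = [v for v in vecs if v is not None]
    return torch.stack(vecs).mean(dim=0) if vecs else None

print(embed_word("apple"))  # a single 8-dimensional vector
```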

### Inference: Embedding to Character (Decoding)

```python
def decode_embedding(query_embedding, top_k=5):
    """
    Find the closest character(s) to a given embedding.

    Args:
        query_embedding: torch.Tensor of shape (8,)
        top_k: Number of closest characters to return

    Returns:
        List of (character, similarity_score) tuples
    """
    query = query_embedding.unsqueeze(0)

    similarities = []
    for char, emb in char_embedding_tensor.items():
        # F.cosine_similarity L2-normalizes both inputs internally,
        # so no explicit normalization step is needed
        sim = F.cosine_similarity(query, emb.unsqueeze(0), dim=-1).item()
        similarities.append((char, sim))

    # Sort by similarity (descending)
    similarities.sort(key=lambda x: x[1], reverse=True)

    return similarities[:top_k]

# Example usage
test_char = 'e'
test_embedding = get_character_embedding(test_char)

if test_embedding is not None:
    top_matches = decode_embedding(test_embedding, top_k=5)
    print(f"\nTop 5 characters similar to '{test_char}':")
    for char, sim in top_matches:
        print(f"  '{char}': {sim:.4f}")
```
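
Because `decode_embedding` is a nearest-neighbour search, it also handles vectors that are not exact table entries. A quick illustrative check (the noise scale is arbitrary) is to perturb a known embedding and confirm it still decodes to the original character:

```python
if test_embedding is not None:
    noisy = test_embedding + 0.05 * torch.randn_like(test_embedding)
    print(decode_embedding(noisy, top_k=3))  # 'e' should rank first
```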

## Model File

- `char_embeddings_best.npy`: pre-computed character embeddings, stored as a pickled NumPy dictionary mapping characters to vectors

## Limitations

- The model only supports ASCII characters (byte values 0-255) plus a special blank token
- Embeddings are context-averaged, so they may not capture all nuances of character usage
- Performance is limited by the diversity and quality of the Oxford English Dictionary training data
- The model uses a relatively small embedding dimension (8) for efficiency

## Citation

```bibtex
@misc{character_embedding_model,
  title={Character Embedding Model with Blank-Filling},
  author={Yuan Xu},
  year={2025},
  howpublished={\url{https://huggingface.co/npc0/CharEmb}}
}
```