sawyerhpowell committed 131d194 (verified) · 1 parent: aabd1cf
Upload README.md with huggingface_hub

Files changed (1): README.md (+95, -0)

README.md (new file):

---
title: "Megashtein: Deep Squared Euclidean Approximation to Levenshtein Distance"
description: "PyTorch implementation of neural network sequence embedding for approximate edit distance computation."
tags:
- pytorch
- neural-network
- levenshtein-distance
- sequence-embedding
- edit-distance
- string-similarity
- deep-learning
license: mit
language: en
---

# megashtein

In their paper ["Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage"](https://arxiv.org/abs/2207.04684), Guo et al. explore techniques for using a neural network to embed sequences such that the squared Euclidean distance between embeddings approximates the Levenshtein distance between the original sequences. This implementation also draws on techniques from ["Levenshtein Distance Embeddings with Poisson Regression for DNA Storage" by Wei et al. (2023)](https://arxiv.org/pdf/2312.07931v1).

This is valuable because there are excellent libraries for fast, GPU-accelerated k-nearest-neighbor search over vectors, such as [faiss](https://github.com/facebookresearch/faiss). Algorithms like [HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) answer these queries in roughly logarithmic time per lookup, whereas a brute-force fuzzy search would have to compute the Levenshtein distance against every string in the collection.
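
For illustration, here is a minimal sketch of how such embeddings could be indexed with faiss; the vectors, dimensionality, and HNSW parameters below are placeholders rather than part of this repo.

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

dim = 80  # placeholder embedding dimensionality
corpus = np.random.rand(10_000, dim).astype("float32")  # stand-in for real embeddings
query = np.random.rand(1, dim).astype("float32")

# HNSW index over (squared) L2 distance
index = faiss.IndexHNSWFlat(dim, 32)  # 32 = graph neighbors per node
index.add(corpus)

# Retrieve the 5 nearest neighbors; with this embedding scheme the squared L2
# distances act as approximate Levenshtein distances.
distances, ids = index.search(query, 5)
print(ids, distances)
```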

This repo contains a PyTorch implementation of the core ideas from Guo's paper, adapted for ASCII sequences rather than DNA sequences. The implementation includes:

- A convolutional neural network architecture for sequence embedding
- Training using Poisson regression loss (PNLL) as described in the paper
- Synthetic data generation with controlled edit distance relationships
- Model saving and loading functionality

The trained model learns to embed ASCII strings such that the squared Euclidean distance between embeddings approximates the true Levenshtein distance between the strings.
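
The repo's own data generator is not reproduced on this card, but the idea behind the synthetic pairs can be sketched as follows: start from a random ASCII string, apply a bounded number of random edits, and record the exact Levenshtein distance (computed here with the standard dynamic-programming recurrence).

```python
import random
import string

def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def random_pair(length: int = 80, max_edits: int = 10):
    """Generate (string, mutated string, true distance) with a controlled number of edits."""
    s = ''.join(random.choices(string.ascii_lowercase, k=length))
    t = list(s)
    for _ in range(random.randint(0, max_edits)):
        op = random.choice(("sub", "ins", "del"))
        pos = random.randrange(len(t)) if t else 0
        if op == "sub" and t:
            t[pos] = random.choice(string.ascii_lowercase)
        elif op == "ins":
            t.insert(pos, random.choice(string.ascii_lowercase))
        elif op == "del" and t:
            del t[pos]
    t = ''.join(t)
    return s, t, levenshtein(s, t)

print(random_pair())
```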

## Model Architecture

- **Base Architecture**: Convolutional Neural Network with embedding layer
- **Input**: ASCII sequences up to 80 characters (padded with null characters)
- **Output**: 80-dimensional dense embeddings
- **Vocab Size**: 128 (ASCII character set)
- **Embedding Dimension**: 140

The model uses a 5-layer CNN with average pooling followed by fully connected layers to produce fixed-size embeddings from variable-length ASCII sequences.
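
The exact layer configuration lives in the repo's `models` module and is not reproduced here; the sketch below shows one plausible shape for such a network (character embedding → stacked `Conv1d` layers → average pooling → fully connected head), with every layer size being an assumption.

```python
import torch
import torch.nn as nn

class EditDistanceModelSketch(nn.Module):
    """Illustrative stand-in for the repo's EditDistanceModel; layer sizes are guesses."""
    def __init__(self, vocab_size: int = 128, char_dim: int = 140,
                 out_dim: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, char_dim)
        convs = []
        for _ in range(5):  # "5-layer CNN"
            convs += [nn.Conv1d(char_dim, char_dim, kernel_size=3, padding=1), nn.ReLU()]
        self.convs = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool1d(1)      # average pooling over the sequence axis
        self.head = nn.Sequential(
            nn.Linear(char_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),             # fixed-size embedding
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len) of character indices in [0, 127]
        h = self.embed(x).transpose(1, 2)        # (batch, char_dim, seq_len) for Conv1d
        h = self.convs(h)
        h = self.pool(h).squeeze(-1)             # (batch, char_dim)
        return self.head(h)                      # (batch, out_dim)

emb = EditDistanceModelSketch()(torch.randint(0, 128, (2, 80)))
print(emb.shape)  # torch.Size([2, 80])
```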

## Usage

```python
import torch
from models import EditDistanceModel

# Load the model
model = EditDistanceModel(embedding_dim=140)
model.load_state_dict(torch.load('megashtein_trained_model.pth'))
model.eval()

# Embed strings
def embed_string(text, max_length=80):
    # Pad/truncate to max_length and convert ASCII codes to a tensor of indices
    padded = (text + '\0' * max_length)[:max_length]
    indices = [min(ord(c), 127) for c in padded]
    tensor = torch.tensor(indices, dtype=torch.long).unsqueeze(0)

    with torch.no_grad():
        embedding = model(tensor)
    return embedding

# Example usage
text1 = "hello world"
text2 = "hello word"

emb1 = embed_string(text1)
emb2 = embed_string(text2)

# Compute approximate edit distance
approx_distance = torch.sum((emb1 - emb2) ** 2).item()
print(f"Approximate edit distance: {approx_distance}")
```

## Training Details

The model is trained using:
- **Loss Function**: Poisson Negative Log-Likelihood (PNLL)
- **Optimizer**: AdamW with learning rate 0.000817
- **Batch Size**: 32
- **Sequence Length**: 80 characters (fixed)
- **Synthetic Data**: Pairs of ASCII strings with known Levenshtein distances
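
A single optimization step might look like the sketch below. This is a hedged illustration using PyTorch's built-in `nn.PoissonNLLLoss` with a stand-in model and random targets; it is not the repo's training loop.

```python
import torch
import torch.nn as nn

# Stand-in model: any module mapping (batch, 80) index sequences to fixed-size embeddings.
model = nn.Sequential(nn.Embedding(128, 140), nn.Flatten(), nn.Linear(80 * 140, 80))
optimizer = torch.optim.AdamW(model.parameters(), lr=0.000817)
pnll = nn.PoissonNLLLoss(log_input=False)  # inputs are rates (distances), not log-rates

seq_a = torch.randint(0, 128, (32, 80))          # batch of 32 length-80 index sequences
seq_b = torch.randint(0, 128, (32, 80))
true_dist = torch.randint(0, 20, (32,)).float()  # stand-in Levenshtein targets

emb_a, emb_b = model(seq_a), model(seq_b)
pred_dist = ((emb_a - emb_b) ** 2).sum(dim=1)    # squared Euclidean distance
loss = pnll(pred_dist, true_dist)                # Poisson regression against true distance

optimizer.zero_grad()
loss.backward()
optimizer.step()
print(loss.item())
```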

## Use Cases

- **Fuzzy String Search**: Find similar strings in large text collections
- **Text Clustering**: Group similar texts based on edit distance
- **Data Deduplication**: Identify near-duplicate text entries
- **Approximate String Matching**: Fast similarity search with controllable accuracy
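
As a concrete sketch of the fuzzy-search use case, the snippet below feeds embeddings from the `embed_string` helper in the Usage section into a flat faiss index; the corpus and query strings are made up for illustration.

```python
import numpy as np
import faiss

corpus = ["hello world", "hello word", "goodbye world", "help wanted"]
corpus_emb = np.vstack([embed_string(s).numpy() for s in corpus]).astype("float32")

index = faiss.IndexFlatL2(corpus_emb.shape[1])  # exact squared-L2 search over embeddings
index.add(corpus_emb)

query_emb = embed_string("helo world").numpy().astype("float32")
distances, ids = index.search(query_emb, 3)      # 3 nearest neighbors
for d, i in zip(distances[0], ids[0]):
    print(f"{corpus[i]!r}: approximate edit distance {d:.1f}")
```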

## Limitations

- **Fixed Length**: Input sequences must be exactly 80 characters (padded/truncated)
- **ASCII Only**: Limited to the ASCII character set (0-127)
- **Approximation**: Provides approximate rather than exact edit distances