# megashtein

In their paper ["Deep Squared Euclidean Approximation to the Levenshtein Distance for DNA Storage"](https://arxiv.org/abs/2207.04684), Guo et al. explore techniques for using a neural network to embed sequences in such a way that the squared Euclidean distance between embeddings approximates the Levenshtein distance between the original sequences.
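
To make the two quantities concrete, here is a minimal pure-Python sketch of both distances. The function names `levenshtein` and `squared_euclidean` are illustrative and are not taken from this repo or from the paper.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b)) time."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def squared_euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Sum of squared coordinate differences between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum((x - y) ** 2 for x, y in zip(u, v))
```

The goal of the embedding is that `squared_euclidean(embed(a), embed(b))` lands close to `levenshtein(a, b)`, where `embed` stands for the trained network.
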

This is valuable because there are excellent libraries, such as [faiss](https://github.com/facebookresearch/faiss), for fast, GPU-accelerated k-nearest-neighbor search over vectors. Algorithms like [HNSW](https://en.wikipedia.org/wiki/Hierarchical_navigable_small_world) answer approximate nearest-neighbor queries in roughly logarithmic time in the number of indexed vectors, whereas a brute-force Levenshtein-based fuzzy search must compare the query against every candidate, which takes time linear in the collection size, with each comparison costing time quadratic in the string length.
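
As a rough illustration of the search step, the sketch below does an exact linear scan over embedding vectors in pure Python. It is only a stand-in: an index from a library such as faiss would replace the scan. The helper `nearest_neighbors` and the toy vectors are hypothetical.

```python
import heapq
from typing import List, Sequence, Tuple


def nearest_neighbors(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k closest vectors as (squared_distance, index), nearest first."""

    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    # Linear scan: one cheap vector comparison per stored embedding.
    scored = ((sq_dist(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-D "embeddings"; real ones would come from the trained model.
vectors = [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]
hits = nearest_neighbors([0.9, 0.1], vectors, k=2)
print([index for _, index in hits])  # [1, 0]
```

Because the squared distances approximate edit distances, the returned scores can be read as estimated Levenshtein distances to each hit.
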

This repo contains a PyTorch implementation of the core ideas from Guo et al.'s paper, adapted for ASCII sequences rather than DNA sequences. The implementation includes:

- A convolutional neural network architecture for sequence embedding
- Training using Poisson regression loss (PNLL) as described in the paper
- Synthetic data generation with controlled edit distance relationships
- Model saving and loading functionality
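
One way synthetic pairs with a controlled edit distance could be produced is sketched below. This is an assumption about the general approach, not the repo's actual generator: `ALPHABET`, `mutate`, and `make_pair` are hypothetical names. Applying `n_edits` random edits gives a true Levenshtein distance of at most `n_edits`, so the label is recomputed exactly.

```python
import random
import string

# Assumed alphabet: printable ASCII letters, digits, punctuation, and space.
ALPHABET = string.ascii_letters + string.digits + string.punctuation + " "


def levenshtein(a: str, b: str) -> int:
    """Exact edit distance via dynamic programming (used to label pairs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def mutate(s: str, n_edits: int, rng: random.Random) -> str:
    """Apply n_edits random insertions, deletions, or substitutions to s."""
    chars = list(s)
    for _ in range(n_edits):
        # Deletion and substitution need at least one character to act on.
        ops = ["insert", "delete", "substitute"] if chars else ["insert"]
        op = rng.choice(ops)
        if op == "insert":
            chars.insert(rng.randint(0, len(chars)), rng.choice(ALPHABET))
        elif op == "delete":
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)


def make_pair(length: int, n_edits: int, rng: random.Random):
    """Return (original, mutated, true_distance) with true_distance <= n_edits."""
    a = "".join(rng.choice(ALPHABET) for _ in range(length))
    b = mutate(a, n_edits, rng)
    return a, b, levenshtein(a, b)
```

Edits can cancel out or overlap (for example, substituting a character with itself), which is why the label comes from `levenshtein` rather than from `n_edits` directly.
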

The trained model learns to embed ASCII strings such that the squared Euclidean distance between embeddings approximates the true Levenshtein distance between the strings.
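
The Poisson regression objective named in the feature list can be sketched as follows, assuming the squared Euclidean distance between two embeddings is treated as the Poisson rate and the true Levenshtein distance as the observed count. That pairing is an assumption made for illustration, and `poisson_nll` and `pair_loss` are hypothetical names. PyTorch provides `torch.nn.PoissonNLLLoss` for the same formula; whether this repo uses it is not stated here.

```python
import math
from typing import Sequence


def poisson_nll(rate: float, target: int, eps: float = 1e-8) -> float:
    """Poisson negative log-likelihood, dropping the constant log(target!) term.

    For a fixed target, the value is smallest when rate == target.
    """
    return rate - target * math.log(rate + eps)


def pair_loss(emb_a: Sequence[float], emb_b: Sequence[float], target: int) -> float:
    """Loss for one pair: squared Euclidean distance used as the Poisson rate."""
    rate = sum((x - y) ** 2 for x, y in zip(emb_a, emb_b))
    return poisson_nll(rate, target)
```

For a pair whose true distance is 3, `poisson_nll(3.0, 3)` is lower than both `poisson_nll(2.0, 3)` and `poisson_nll(4.0, 3)`, so minimizing the loss pulls the squared distance toward the edit distance.
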