Voynich KSimplex Translator
A geometric deep learning system for analyzing and interpreting the Voynich Manuscript using KSimplex similarity assessment trained on Latin Wikipedia.
Model Description
This system combines:
- KSimplex Similarity Assessor: A novel geometric architecture using simplex-based routing for similarity computation
- Dual Embedding Fusion: Combines SBERT semantic embeddings (384-dim) with character-level TF-IDF (30k features)
- Cross-Corpus Transfer: Trained on Latin Wikipedia, applied to Voynich manuscript analysis
- Morphological Translator: Rule-based translation using discovered prefix-stem-suffix patterns
Architecture
Input Text
    │
    ├──► SBERT (all-MiniLM-L6-v2) ──► 384-dim
    │                                    │
    ├──► Char TF-IDF (3-5 grams)  ──► 30k-dim
    │                                    │
    ▼                                    ▼
┌─────────────────────────────────────────────┐
│        KSimplex Similarity Assessor         │
├─────────────────────────────────────────────┤
│  SBERT Projection ──► 256-dim               │
│  TF-IDF Projection ──► 256-dim              │
│              │                              │
│              ▼                              │
│  Fusion Layer ──► 256-dim                   │
│              │                              │
│              ▼                              │
│  SimplexSimilarityLayer × 3 (k=4)           │
│  ┌─────────────────────────────┐            │
│  │ Route Projection (→4 edges) │            │
│  │ Edge Transforms (4×Linear)  │            │
│  │ Weighted Sum + LayerNorm    │            │
│  └─────────────────────────────┘            │
│              │                              │
│              ▼                              │
│  Similarity Head ──► 128-dim                │
│  (L2 normalized)                            │
└─────────────────────────────────────────────┘
               │
               ▼
  128-dim Similarity Embedding
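The SimplexSimilarityLayer box in the diagram (route projection to 4 edges, one linear transform per edge, weighted sum, LayerNorm) can be sketched roughly as follows. This is a dependency-free illustration, not the code in ksimplex_model.py: the class name `SimplexRoutingSketch`, the softmax over route logits, the random initialization, and the absence of a residual connection are all assumptions made for the example.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


class SimplexRoutingSketch:
    """Hypothetical stand-in for one SimplexSimilarityLayer with k=4 edges."""

    def __init__(self, dim, k=4, seed=0):
        rng = random.Random(seed)

        def init(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        # Route projection: maps the input to one logit per edge.
        self.route_w, self.route_b = init(k, dim), [0.0] * k
        # Edge transforms: one dim-to-dim linear map per edge.
        self.edges = [(init(dim, dim), [0.0] * dim) for _ in range(k)]

    def routing_weights(self, x):
        # Assumption: route logits are turned into weights with a softmax.
        return softmax(linear(x, self.route_w, self.route_b))

    def forward(self, x):
        weights = self.routing_weights(x)
        mixed = [0.0] * len(x)
        for w, (edge_w, edge_b) in zip(weights, self.edges):
            out = linear(x, edge_w, edge_b)
            mixed = [m + w * o for m, o in zip(mixed, out)]
        return layer_norm(mixed)


layer = SimplexRoutingSketch(dim=8)
x = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.75, 0.1]
y = layer.forward(x)
```

In a PyTorch implementation the same structure would presumably be expressed with linear modules and a layer-norm module; the pure-Python version is used here only so the routing arithmetic is visible.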
Training
- Corpus: Latin Wikipedia (2000 documents, ~1.09M tokens)
- Windows: 200 tokens, stride 100
- Method: Contrastive learning with bucket classification auxiliary task
- Buckets: Louvain community detection on blended similarity graph (SBERT α=0.6 + TF-IDF)
- Loss: Contrastive margin loss + Cross-entropy bucket classification
- Performance: 100% accuracy; similarity 0.99 for positive pairs vs. 0.25 for negative pairs
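The windowing step (200 tokens, stride 100) can be illustrated with a short helper. This is a minimal sketch: the function name `sliding_windows` is hypothetical, and dropping the trailing partial window is an assumption, since the training code may handle the tail differently.

```python
def sliding_windows(tokens, size=200, stride=100):
    """Split a token list into overlapping windows of `size` tokens."""
    if len(tokens) <= size:
        # Short documents yield a single (possibly shorter) window.
        return [tokens] if tokens else []
    windows = []
    for start in range(0, len(tokens) - size + 1, stride):
        windows.append(tokens[start:start + size])
    return windows


tokens = [f"w{i}" for i in range(450)]
windows = sliding_windows(tokens)
print(len(windows))      # 3 windows, starting at tokens 0, 100 and 200
print(windows[1][0])     # w100
```

With a stride of half the window size, each window shares its second half with the first half of the next one, so every token away from the document edges appears in two windows.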
Key Findings
Manuscript Structure
| Section | Folios | Character | Style Group |
|---|---|---|---|
| Herbal A | f1-f57 | Dense prose, plant descriptions | A |
| Herbal B | f58-f66 | Variant herbal style | A |
| Astronomical | f67-f73 | Zodiac, celestial diagrams | A |
| Biological | f75-f84 | Nymph figures, labels | B |
| Cosmological | f85-f86 | Rosette foldouts | C |
| Pharmaceutical | f87-f102 | Recipe format (p...am) | C |
| Recipes | f103-f116 | Cross-references, star labels | B |
Discovered Patterns
Structural Markers (Greek-derived):
- p = Recipe/paragraph start (π)
- m, g = Line-end markers (μ, γ)
- s, l, o = Label markers (σ, λ, ο)
- -am, -dam, -ram = Recipe terminators (measurement)
Morphological System:
(PREFIX) + STEM + (SUFFIX) + (n)
Prefixes: qok- (the-), ok- (this-), ot- (other-), da- (of-)
Suffixes: -dy (matter), -ey (type), -in (of), -ol (liquid), -ar (part)
Bound 'n': Attaches to -ai- stems (daiin, qokaiin, okaiin)
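The (PREFIX) + STEM + (SUFFIX) + (n) template can be made concrete with a small parser over the affixes listed above. This is only a sketch: the function `parse_word` is hypothetical, and the tie-breaking rules (check the bound n first, prefer the longest prefix, never leave an empty stem) are assumptions rather than the translator's actual logic.

```python
PREFIXES = {"qok": "the-", "ok": "this-", "ot": "other-", "da": "of-"}
SUFFIXES = {"dy": "matter", "ey": "type", "in": "of", "ol": "liquid", "ar": "part"}


def parse_word(word):
    """Split a transliterated word into prefix, stem, suffix and bound-n flag."""
    # Assumption: a final "n" counts as bound when the word contains "ai".
    bound_n = "ai" in word and word.endswith("n")
    body = word[:-1] if bound_n else word

    # Longest matching prefix that still leaves something for the stem.
    prefix = next((p for p in sorted(PREFIXES, key=len, reverse=True)
                   if body.startswith(p) and len(body) > len(p)), "")
    body = body[len(prefix):]

    # First matching suffix that still leaves a non-empty stem.
    suffix = next((s for s in SUFFIXES
                   if body.endswith(s) and len(body) > len(s)), "")
    stem = body[:len(body) - len(suffix)] if suffix else body

    return {"prefix": prefix, "stem": stem, "suffix": suffix, "bound_n": bound_n}


print(parse_word("qokeedy"))
# {'prefix': 'qok', 'stem': 'ee', 'suffix': 'dy', 'bound_n': False}
print(parse_word("qokaiin"))
# {'prefix': 'qok', 'stem': 'aii', 'suffix': '', 'bound_n': True}
```

Because the bound n is stripped before suffix matching, the -in suffix never fires on -aiin words in this sketch; a different rule order would give a different segmentation.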
Section Similarity Matrix:
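| | Herbal A | Astronomical | Biological | Cosmological | Pharmaceutical | Recipes |
|---|---|---|---|---|---|---|
| Herbal A | 1.00 | 0.99 | 0.93 | 0.77 | 0.77 | 0.88 |
| Astronomical | 0.99 | 1.00 | 0.96 | 0.83 | 0.82 | 0.92 |
| Biological | 0.93 | 0.96 | 1.00 | 0.94 | 0.94 | 0.99 |
| Cosmological | 0.77 | 0.83 | 0.94 | 1.00 | 0.98 | 0.97 |
| Pharmaceutical | 0.77 | 0.82 | 0.94 | 0.98 | 1.00 | 0.95 |
| Recipes | 0.88 | 0.92 | 0.99 | 0.97 | 0.95 | 1.00 |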
Three Style Groups:
- Group A: Herbal + Astronomical (0.988 similarity)
- Group B: Biological + Recipes (0.991 similarity)
- Group C: Cosmological + Pharmaceutical (0.975 similarity)
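One way to see how the three groups fall out of the similarity matrix is to pair each section with its most similar other section and keep only the pairs that choose each other. The sketch below uses the rounded values from the matrix; mutual nearest neighbours is an illustrative grouping rule chosen for this example, not necessarily the method used to derive the groups.

```python
SECTIONS = ["Herbal A", "Astronomical", "Biological",
            "Cosmological", "Pharmaceutical", "Recipes"]

SIM = [
    [1.00, 0.99, 0.93, 0.77, 0.77, 0.88],
    [0.99, 1.00, 0.96, 0.83, 0.82, 0.92],
    [0.93, 0.96, 1.00, 0.94, 0.94, 0.99],
    [0.77, 0.83, 0.94, 1.00, 0.98, 0.97],
    [0.77, 0.82, 0.94, 0.98, 1.00, 0.95],
    [0.88, 0.92, 0.99, 0.97, 0.95, 1.00],
]


def nearest_other(i):
    """Index of the section most similar to section i, excluding itself."""
    return max((j for j in range(len(SECTIONS)) if j != i), key=lambda j: SIM[i][j])


def mutual_pairs():
    """Pairs of sections that are each other's nearest neighbour."""
    pairs = []
    for i in range(len(SECTIONS)):
        j = nearest_other(i)
        if i < j and nearest_other(j) == i:
            pairs.append((SECTIONS[i], SECTIONS[j]))
    return pairs


pairs = mutual_pairs()
for a, b in pairs:
    print(f"{a} + {b}")
# Herbal A + Astronomical
# Biological + Recipes
# Cosmological + Pharmaceutical
```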
Usage
Installation
pip install torch sentence-transformers scikit-learn datasets
Quick Start
from voynich_translator import VoynichTranslator
translator = VoynichTranslator()
# Translate text
result = translator.translate("daiin chedy qokeey shedy chol daiin")
print(result['english']) # "the herb bloom leaf stem the"
print(result['section']) # "Herbal A"
print(result['confidence']) # 1.0
# Translate with verbose analysis
result = translator.translate("p ol shy am", verbose=True)
# Returns word-by-word analysis and similar Latin passages
# Translate entire folio
folio = translator.translate_folio('f75r')
print(folio['full_english'])
# Find similar passages
similar = translator.find_similar_voynich("chedy qokeey")
latin = translator.find_similar_latin("chedy qokeey")
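Since the similarity head outputs L2-normalized embeddings, similarity lookups of this kind reduce to ranking by dot product. The following is a minimal sketch of that idea on toy 3-dimensional vectors; `top_k` and `l2_normalize` are hypothetical helpers and do not describe how `find_similar_voynich` or `find_similar_latin` are actually implemented.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def top_k(query, corpus, k=3):
    """Return (index, cosine similarity) for the k closest corpus vectors."""
    q = l2_normalize(query)
    scored = [(i, sum(a * b for a, b in zip(q, l2_normalize(doc))))
              for i, doc in enumerate(corpus)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy stand-ins for the 128-dim passage embeddings.
corpus = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
hits = top_k([0.8, 0.6, 0.0], corpus, k=2)
print([index for index, _ in hits])  # [1, 0]
```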
Full Standalone Setup
# Requires: Latin Wikipedia reload for TF-IDF vocabulary
# See standalone cell in repository for complete setup
from datasets import load_dataset
# 1. Load Latin corpus (same as training)
ds = load_dataset("wikimedia/wikipedia", "20231101.la", split="train", streaming=True)
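# ... build windows, fit TF-IDF (produces the fitted Latin vectorizer, vec_lat)
# 2. Transform Voynich using Latin vectorizer
# voynich_texts: list of transliterated Voynich passages (not defined in this excerpt)
X_voy_tfidf = vec_lat.transform(voynich_texts)
# 3. Encode through KSimplex model
# sbert_emb / tfidf_emb: the SBERT and TF-IDF inputs for the same passages;
# the first return value is the similarity embedding, the second is ignored here
emb, _ = model(sbert_emb, tfidf_emb)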
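Step 2 has a consequence worth spelling out: because the vectorizer is fitted on Latin only, any Voynich character n-gram that never occurs in the Latin windows is silently dropped. The sketch below shows this with a hand-rolled character 3-5-gram TF-IDF on toy strings. It is an illustration written without scikit-learn; the helper names are hypothetical, and the smoothed IDF formula and L2 normalization are chosen to resemble scikit-learn's documented defaults rather than copied from the repository.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """All character n-grams of length n_min..n_max, spaces included."""
    return [text[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def fit_idf(docs):
    """Smoothed IDF per n-gram: ln((1 + N) / (1 + df)) + 1."""
    df = Counter(g for doc in docs for g in set(char_ngrams(doc)))
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + count)) + 1 for g, count in df.items()}


def transform(text, idf):
    """L2-normalized TF-IDF as a sparse dict; n-grams unseen at fit time are dropped."""
    counts = Counter(g for g in char_ngrams(text) if g in idf)
    weights = {g: tf * idf[g] for g, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


# Toy stand-ins for the Latin windows; the real vocabulary has ~30k features.
latin_windows = ["herba in aqua cocta", "radix et folia in oleo"]
idf = fit_idf(latin_windows)

voynich_vec = transform("p ol shy am", idf)
print(sorted(voynich_vec))  # only ' ol' survives; every other n-gram is out of vocabulary
```

With the full Latin vocabulary far more n-grams would overlap, but the principle is the same: the TF-IDF half of the input only sees the part of the Voynich text that looks like Latin at the character level.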
Lexicon
Core vocabulary mappings based on frequency and morphological analysis:
| Voynich | English | Category |
|---|---|---|
| daiin | the | Determiner |
| aiin | this | Determiner |
| qokaiin | the-said | Determiner |
| chedy | herb | Plant |
| shedy | leaf | Plant |
| qokeedy | blossom | Plant |
| chol | stem | Plant |
| ol | oil | Preparation |
| ar | root | Plant part |
| or | seed | Plant part |
| p | ¶ (recipe start) | Marker |
| am | (measure end) | Marker |
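As a rough illustration of how such a table can be applied, the sketch below glosses a passage word by word and brackets anything the lexicon does not cover. It is a plain dictionary lookup, not the translator's pipeline, and the `gloss` function is hypothetical; the two marker rows are left out.

```python
LEXICON = {
    "daiin": "the", "aiin": "this", "qokaiin": "the-said",
    "chedy": "herb", "shedy": "leaf", "qokeedy": "blossom", "chol": "stem",
    "ol": "oil", "ar": "root", "or": "seed",
}


def gloss(text, lexicon=LEXICON):
    """Replace each known word with its gloss; keep unknown words in brackets."""
    return " ".join(lexicon.get(word, f"[{word}]") for word in text.split())


print(gloss("daiin chedy qokeey shedy chol daiin"))
# the herb [qokeey] leaf stem the
```

Bracketing unknown words keeps the gaps visible, which matters here because none of the mappings are verified meanings.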
Files
voynich-ksimplex-translator/
βββ README.md # This file
βββ voynich_translator.py # Complete standalone translator
βββ ksimplex_model.py # Model architecture
βββ ksimplex_similarity_model.pt # Trained weights
βββ similarity_embeddings.npz # Pre-computed embeddings
β βββ voynich_emb # (N_voy, 128) Voynich embeddings
β βββ voynich_labels # Cluster assignments
β βββ latin_emb # (N_lat, 128) Latin embeddings
β βββ latin_labels # Latin bucket assignments
βββ voynich_analysis_results.json # Statistical analysis
Limitations
⚠️ This is interpretive translation, not decipherment.
The Voynich cipher has not been broken. This system provides:
- ✅ Structural analysis (recipe patterns, labels, cross-references)
- ✅ Section classification with high accuracy
- ✅ Morphological interpretation of word patterns
- ✅ Similarity-based retrieval across Latin/Voynich corpora
- ❌ True plaintext recovery
- ❌ Verified word meanings
The lexicon is based on:
- Statistical frequency matching to Latin
- Positional grammar analysis
- Morphological pattern recognition
- Section-context inference
Citation
@software{voynich_ksimplex_2026,
title={Voynich KSimplex Translator: Geometric Deep Learning for Manuscript Analysis},
author={AbstractPhil},
year={2026},
url={https://huggingface.co/AbstractPhil/sbert-voynich-translation}
}
References
- Voynich Manuscript: Beinecke Library MS 408
- IVTFF Transcription: Zandbergen-Landini
- Latin Wikipedia: Wikimedia
License
MIT License - See LICENSE file for details.
"The Voynich appears to be a practical document (recipes, medical prescriptions) using Greek-derived notation for structure, with verbose cipher encoding the content, and a cross-reference system linking sections."