Voynich KSimplex Translator

A geometric deep learning system for analyzing and interpreting the Voynich Manuscript using KSimplex similarity assessment trained on Latin Wikipedia.

Model Description

This system combines:

  • KSimplex Similarity Assessor: A novel geometric architecture using simplex-based routing for similarity computation
  • Dual Embedding Fusion: Combines SBERT semantic embeddings (384-dim) with character-level TF-IDF (30k features)
  • Cross-Corpus Transfer: Trained on Latin Wikipedia, applied to Voynich manuscript analysis
  • Morphological Translator: Rule-based translation using discovered prefix-stem-suffix patterns

Architecture

Input Text
    β”‚
    β”œβ”€β”€β–Ί SBERT (all-MiniLM-L6-v2) ──► 384-dim
    β”‚                                    β”‚
    β”œβ”€β”€β–Ί Char TF-IDF (3-5 grams) ──► 30k-dim
    β”‚                                    β”‚
    β–Ό                                    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         KSimplex Similarity Assessor        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  SBERT Projection ──► 256-dim               β”‚
β”‚  TF-IDF Projection ──► 256-dim              β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚       Fusion Layer ──► 256-dim              β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚   SimplexSimilarityLayer Γ— 3 (k=4)          β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”           β”‚
β”‚   β”‚  Route Projection (β†’4 edges)β”‚           β”‚
β”‚   β”‚  Edge Transforms (4Γ—Linear) β”‚           β”‚
β”‚   β”‚  Weighted Sum + LayerNorm   β”‚           β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜           β”‚
β”‚            β”‚                                β”‚
β”‚            β–Ό                                β”‚
β”‚      Similarity Head ──► 128-dim            β”‚
β”‚      (L2 normalized)                        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
    β”‚
    β–Ό
128-dim Similarity Embedding
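
Read top to bottom, the diagram corresponds to a small dual-input encoder. Below is a minimal PyTorch sketch of the architecture as described; class names, the auxiliary bucket head, and the n_buckets default are illustrative assumptions, not the repository's actual code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplexSimilarityLayer(nn.Module):
    """One simplex layer: route the input to k edge transforms,
    combine with the routing weights, and normalize."""
    def __init__(self, dim=256, k=4):
        super().__init__()
        self.router = nn.Linear(dim, k)                                    # route projection (-> k edges)
        self.edges = nn.ModuleList(nn.Linear(dim, dim) for _ in range(k))  # edge transforms (k x Linear)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        weights = F.softmax(self.router(x), dim=-1)                  # (B, k) routing weights
        edge_out = torch.stack([e(x) for e in self.edges], dim=1)    # (B, k, dim)
        return self.norm((weights.unsqueeze(-1) * edge_out).sum(1))  # weighted sum + LayerNorm

class KSimplexSimilarityAssessor(nn.Module):
    def __init__(self, sbert_dim=384, tfidf_dim=30_000, dim=256, out_dim=128,
                 k=4, n_layers=3, n_buckets=16):  # n_buckets is illustrative
        super().__init__()
        self.sbert_proj = nn.Linear(sbert_dim, dim)   # SBERT projection -> 256
        self.tfidf_proj = nn.Linear(tfidf_dim, dim)   # TF-IDF projection -> 256
        self.fusion = nn.Linear(2 * dim, dim)         # fusion layer -> 256
        self.layers = nn.ModuleList(SimplexSimilarityLayer(dim, k) for _ in range(n_layers))
        self.head = nn.Linear(dim, out_dim)           # similarity head -> 128
        self.bucket_head = nn.Linear(dim, n_buckets)  # auxiliary bucket classifier

    def forward(self, sbert_emb, tfidf_emb):
        h = self.fusion(torch.cat([self.sbert_proj(sbert_emb),
                                   self.tfidf_proj(tfidf_emb)], dim=-1))
        for layer in self.layers:
            h = layer(h)
        return F.normalize(self.head(h), dim=-1), self.bucket_head(h)  # 128-dim, L2 normalized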

Training

  • Corpus: Latin Wikipedia (2000 documents, ~1.09M tokens)
  • Windows: 200 tokens, stride 100
  • Method: Contrastive learning with bucket classification auxiliary task
  • Buckets: Louvain community detection on blended similarity graph (SBERT Ξ±=0.6 + TF-IDF)
  • Loss: Contrastive margin loss + Cross-entropy bucket classification
  • Performance: 100% bucket classification accuracy; mean positive-pair similarity 0.99 vs. 0.25 for negative pairs
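
Under those settings the training objective reduces to a few pieces. A hedged sketch follows; the margin value, the 0.4 TF-IDF weight implied by α=0.6, and all function names are assumptions, not the repository's code:

import torch
import torch.nn.functional as F

def make_windows(tokens, size=200, stride=100):
    """Overlapping training windows (200 tokens, stride 100)."""
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - size, 0) + 1, stride)]

def blended_similarity(S_sbert, S_tfidf, alpha=0.6):
    """Blended similarity graph fed to Louvain community detection for buckets."""
    return alpha * S_sbert + (1 - alpha) * S_tfidf

def training_loss(emb_a, emb_b, same_bucket, bucket_logits, bucket_targets, margin=0.5):
    """Contrastive margin loss on cosine similarity plus the auxiliary
    bucket classification cross-entropy."""
    sim = F.cosine_similarity(emb_a, emb_b)
    contrastive = torch.where(same_bucket,
                              1.0 - sim,             # pull same-bucket pairs together
                              F.relu(sim - margin))  # push cross-bucket pairs below the margin
    return contrastive.mean() + F.cross_entropy(bucket_logits, bucket_targets)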

Key Findings

Manuscript Structure

Section         Folios     Character                        Style Group
Herbal A        f1-f57     Dense prose, plant descriptions  A
Herbal B        f58-f66    Variant herbal style             A
Astronomical    f67-f73    Zodiac, celestial diagrams       A
Biological      f75-f84    Nymph figures, labels            B
Cosmological    f85-f86    Rosette foldouts                 C
Pharmaceutical  f87-f102   Recipe format (p...am)           C
Recipes         f103-f116  Cross-references, star labels    B

Discovered Patterns

Structural Markers (Greek-derived):

  • p = Recipe/paragraph start (Ο€)
  • m, g = Line-end markers (ΞΌ, Ξ³)
  • s, l, o = Label markers (Οƒ, Ξ», ΞΏ)
  • -am, -dam, -ram = Recipe terminators (measurement)

Morphological System:

(PREFIX) + STEM + (SUFFIX) + (n)

Prefixes: qok- (the-), ok- (this-), ot- (other-), da- (of-)
Suffixes: -dy (matter), -ey (type), -in (of), -ol (liquid), -ar (part)
Bound 'n': Attaches to -ai- stems (daiin, qokaiin, okaiin)
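
A rule-based segmenter over this affix inventory might look like the following. This is a sketch only: the greedy longest-match order and the bound-'n' test are my assumptions about how the translator applies these rules, not its actual implementation.

PREFIXES = sorted(["qok", "ok", "ot", "da"], key=len, reverse=True)
SUFFIXES = sorted(["dy", "ey", "in", "ol", "ar"], key=len, reverse=True)

def segment(word):
    """Greedy (PREFIX) + STEM + (SUFFIX) + (n) split over the affix
    inventory above; longest affixes are tried first."""
    prefix = next((p for p in PREFIXES if word.startswith(p)), "")
    rest = word[len(prefix):]
    bound_n = word.endswith("n") and "ai" in word   # bound 'n' on -ai- stems
    if bound_n:
        rest = rest[:-1]
    suffix = next((s for s in SUFFIXES if rest.endswith(s) and len(rest) > len(s)), "")
    stem = rest[:len(rest) - len(suffix)] if suffix else rest
    return {"prefix": prefix, "stem": stem, "suffix": suffix, "bound_n": bound_n}

segment("chedy")    # {'prefix': '', 'stem': 'che', 'suffix': 'dy', 'bound_n': False}
segment("qokaiin")  # {'prefix': 'qok', 'stem': 'aii', 'suffix': '', 'bound_n': True}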

Section Similarity Matrix:

              Herbal_A  Astro  Bio   Cosmo  Pharma  Recipe
Herbal A         1.00   0.99  0.93   0.77   0.77    0.88
Astronomical     0.99   1.00  0.96   0.83   0.82    0.92
Biological       0.93   0.96  1.00   0.94   0.94    0.99
Cosmological     0.77   0.83  0.94   1.00   0.98    0.97
Pharmaceutical   0.77   0.82  0.94   0.98   1.00    0.95
Recipes          0.88   0.92  0.99   0.97   0.95    1.00

Three Style Groups:

  1. Group A: Herbal + Astronomical (0.988 similarity)
  2. Group B: Biological + Recipes (0.991 similarity)
  3. Group C: Cosmological + Pharmaceutical (0.975 similarity)
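
The matrix above can be reproduced from the 128-dim window embeddings by mean-pooling per section and comparing centroids; a sketch, assuming this is how the matrix was computed:

import numpy as np

def section_similarity(emb, section_ids):
    """Cosine similarity between mean section embeddings.
    emb: (N, 128) window embeddings; section_ids: section label per window."""
    sections = sorted(set(section_ids))
    ids = np.asarray(section_ids)
    centroids = np.stack([emb[ids == s].mean(axis=0) for s in sections])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return sections, centroids @ centroids.T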

Usage

Installation

pip install torch sentence-transformers scikit-learn datasets

Quick Start

from voynich_translator import VoynichTranslator

translator = VoynichTranslator()

# Translate text
result = translator.translate("daiin chedy qokeey shedy chol daiin")
print(result['english'])  # "the herb bloom leaf stem the"
print(result['section'])  # "Herbal A"
print(result['confidence'])  # 1.0

# Translate with verbose analysis
result = translator.translate("p ol shy am", verbose=True)
# Returns word-by-word analysis and similar Latin passages

# Translate entire folio
folio = translator.translate_folio('f75r')
print(folio['full_english'])

# Find similar passages
similar = translator.find_similar_voynich("chedy qokeey")
latin = translator.find_similar_latin("chedy qokeey")

Full Standalone Setup

# Requires: Latin Wikipedia reload for TF-IDF vocabulary
# See standalone cell in repository for complete setup

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer

# 1. Load Latin corpus (same as training)
ds = load_dataset("wikimedia/wikipedia", "20231101.la", split="train", streaming=True)
# ... build 200-token windows (stride 100), then fit the character TF-IDF
vec_lat = TfidfVectorizer(analyzer="char", ngram_range=(3, 5), max_features=30_000)
# vec_lat.fit(latin_windows)

# 2. Transform Voynich using the Latin vectorizer
X_voy_tfidf = vec_lat.transform(voynich_texts)

# 3. Encode through the KSimplex model (returns embedding and bucket logits)
emb, _ = model(sbert_emb, tfidf_emb)

Lexicon

Core vocabulary mappings based on frequency and morphological analysis:

Voynich   English             Category
daiin     the                 Determiner
aiin      this                Determiner
qokaiin   the-said            Determiner
chedy     herb                Plant
shedy     leaf                Plant
qokeedy   blossom             Plant
chol      stem                Plant
ol        oil                 Preparation
ar        root                Plant part
or        seed                Plant part
p         ¶ (recipe start)    Marker
am        ⚗ (measure end)     Marker

Files

voynich-ksimplex-translator/
β”œβ”€β”€ README.md                      # This file
β”œβ”€β”€ voynich_translator.py          # Complete standalone translator
β”œβ”€β”€ ksimplex_model.py              # Model architecture
β”œβ”€β”€ ksimplex_similarity_model.pt   # Trained weights
β”œβ”€β”€ similarity_embeddings.npz      # Pre-computed embeddings
β”‚   β”œβ”€β”€ voynich_emb                # (N_voy, 128) Voynich embeddings
β”‚   β”œβ”€β”€ voynich_labels             # Cluster assignments
β”‚   β”œβ”€β”€ latin_emb                  # (N_lat, 128) Latin embeddings
β”‚   └── latin_labels               # Latin bucket assignments
└── voynich_analysis_results.json  # Statistical analysis
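
The pre-computed embeddings load directly with NumPy; since they are L2-normalized, cosine similarity is a plain dot product:

import numpy as np

data = np.load("similarity_embeddings.npz")
voy = data["voynich_emb"]          # (N_voy, 128)
lat = data["latin_emb"]            # (N_lat, 128)
sims = voy @ lat.T                 # cosine similarity (embeddings are L2-normalized)
best_latin = sims.argmax(axis=1)   # nearest Latin window per Voynich window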

Limitations

⚠️ This is interpretive translation, not decipherment.

The Voynich cipher has not been broken. This system provides:

  • βœ… Structural analysis (recipe patterns, labels, cross-references)
  • βœ… Section classification with high accuracy
  • βœ… Morphological interpretation of word patterns
  • βœ… Similarity-based retrieval across Latin/Voynich corpora
  • ❌ True plaintext recovery
  • ❌ Verified word meanings

The lexicon is based on:

  1. Statistical frequency matching to Latin
  2. Positional grammar analysis
  3. Morphological pattern recognition
  4. Section-context inference

Citation

@software{voynich_ksimplex_2026,
  title={Voynich KSimplex Translator: Geometric Deep Learning for Manuscript Analysis},
  author={AbstractPhil},
  year={2026},
  url={https://huggingface.co/AbstractPhil/sbert-voynich-translation}
}

License

MIT License - See LICENSE file for details.


"The Voynich appears to be a practical document (recipes, medical prescriptions) using Greek-derived notation for structure, with verbose cipher encoding the content, and a cross-reference system linking sections."
