Parametric UMAP: English-German Cross-Lingual Embeddings (768D β†’ 2D)

A Parametric UMAP model that projects 768-dimensional semantic embeddings from google/embeddinggemma-300m into a shared 2D cross-lingual space for English and German.

Architecture

  • Input: 768-dimensional embeddings (from embeddinggemma-300m)
  • Encoder:
    • Dense(768 β†’ 256) + ReLU
    • Dense(256 β†’ 128) + ReLU
    • Dense(128 β†’ 2) (linear output)
  • Output: 2D coordinates
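The encoder is a plain three-layer MLP; a minimal NumPy sketch of its forward pass (randomly initialized weights, for shape illustration only, not the trained model) looks like:

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights with shapes matching the encoder layers.
W1, b1 = rng.normal(size=(768, 256)), np.zeros(256)
W2, b2 = rng.normal(size=(256, 128)), np.zeros(128)
W3, b3 = rng.normal(size=(128, 2)), np.zeros(2)

def relu(x):
    return np.maximum(x, 0.0)

def encode(x):
    """Forward pass: 768 -> 256 -> 128 -> 2 (linear output layer)."""
    h = relu(x @ W1 + b1)
    h = relu(h @ W2 + b2)
    return h @ W3 + b3

batch = rng.normal(size=(4, 768))   # four dummy 768D embeddings
coords = encode(batch)
print(coords.shape)                  # (4, 2)
```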

Training Details

  • Base embeddings: google/embeddinggemma-300m
  • Training data: 200,000 English-German sentence pairs from Helsinki-NLP/tatoeba_mt
  • Method: Parametric UMAP with cosine metric
  • Framework: TensorFlow 2.14 + umap-learn 0.5.5
  • Epochs: 10 full epochs over the UMAP edge graph
  • Batch size: 1000 edges
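Training along these lines can be reproduced with umap-learn's `ParametricUMAP` class, which accepts a custom Keras encoder. This is a sketch under stated assumptions: the data-loading path and variable names are illustrative, not part of this repository.

```python
import numpy as np
from tensorflow import keras
from umap.parametric_umap import ParametricUMAP

# Encoder matching the architecture above.
encoder = keras.Sequential([
    keras.layers.Input(shape=(768,)),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(2),
])

# (n_samples, 768) array of embeddinggemma-300m outputs; path is illustrative.
embeddings_768d = np.load("embeddings.npy")

mapper = ParametricUMAP(
    encoder=encoder,
    dims=(768,),
    metric="cosine",        # passed through to the underlying UMAP graph
    n_training_epochs=10,
    batch_size=1000,        # edges per batch
)
coords_2d = mapper.fit_transform(embeddings_768d)

# The trained encoder can be saved and reused as a standalone Keras model.
mapper.encoder.save("encoder.keras")
```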

Primary Use Cases

  • Visualization: Plot bilingual text data in 2D for exploration
  • Similarity analysis: Find semantically similar texts across languages
  • Cross-lingual clustering: Group related content in EN/DE
  • Semantic search: Fast nearest-neighbor search in 2D space
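For the semantic-search use case, nearest-neighbor lookup in the projected space reduces to ordinary Euclidean distance over 2D points. A minimal sketch with made-up coordinates (in practice, use the encoder's output):

```python
import numpy as np

# Illustrative 2D coordinates for a small bilingual corpus.
corpus_coords = np.array([
    [0.11, 7.84],   # "Hello world"
    [0.28, 7.70],   # "Hallo Welt"
    [5.02, 1.13],   # "The weather is nice"
    [4.87, 1.30],   # "Das Wetter ist schoen"
])

query = np.array([0.20, 7.80])  # 2D projection of a query sentence

# Euclidean distances from the query to every corpus point.
dists = np.linalg.norm(corpus_coords - query, axis=1)
nearest = int(np.argmin(dists))
print(nearest)  # -> 0 ("Hello world" is closest)
```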

How to Use

Installation

pip install sentence-transformers tensorflow umap-learn numpy matplotlib

Basic Usage

import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

# Load models
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")

# Your sentences
sentences = [
    "Hello world",
    "Hallo Welt"
]

# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True
)

# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)

print(coords_2d)
# [[0.11, 7.84],
#  [0.28, 7.70]]

Visualization Example

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])

for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))

plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
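For bilingual data it is often useful to color points by language. A sketch assuming you track a language label per sentence (the coordinates and labels below are illustrative stand-ins for the encoder's output; `savefig` is used instead of `show` so it also runs headless):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; no display required
import matplotlib.pyplot as plt

# Illustrative coordinates and per-sentence language labels.
coords_2d = np.array([[0.11, 7.84], [0.28, 7.70], [5.02, 1.13], [4.87, 1.30]])
langs = ["en", "de", "en", "de"]

plt.figure(figsize=(10, 8))
for lang, color in [("en", "tab:blue"), ("de", "tab:orange")]:
    mask = np.array([l == lang for l in langs])
    plt.scatter(coords_2d[mask, 0], coords_2d[mask, 1], c=color, label=lang)

plt.legend()
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space by Language")
plt.savefig("umap_by_language.png")
```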