Helsinki-NLP/tatoeba_mt
How to use tudi2d/parametric-umap-embeddinggemma-en-de-2d with TF-Keras:
# Note: 'keras<3.x' or 'tf_keras' must be installed (legacy)
# See https://github.com/keras-team/tf-keras for more details.
from huggingface_hub import from_pretrained_keras
model = from_pretrained_keras("tudi2d/parametric-umap-embeddinggemma-en-de-2d")
A Parametric UMAP model that projects 768-dimensional semantic embeddings from google/embeddinggemma-300m into a shared 2D cross-lingual space for English and German.
Visualization: Plot bilingual text data in 2D for exploration
Similarity analysis: Find semantically similar texts across languages
Cross-lingual clustering: Group related content in EN/DE
Semantic search: Fast nearest-neighbor search in 2D space
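As a rough illustration of the semantic search use case, the sketch below runs a brute-force nearest-neighbor lookup over 2D points with NumPy. The helper name `nearest_neighbors_2d` and the coordinates are hypothetical placeholders, not output of this model; in practice the points would come from the encoder's 2D projection.

```python
import numpy as np


def nearest_neighbors_2d(query_xy, corpus_xy, k=1):
    """Return indices and distances of the k corpus points closest to query_xy."""
    corpus_xy = np.asarray(corpus_xy, dtype=float)
    query_xy = np.asarray(query_xy, dtype=float)
    # Euclidean distance from the query to every corpus point
    dists = np.linalg.norm(corpus_xy - query_xy, axis=1)
    # Indices of the k smallest distances, closest first
    order = np.argsort(dists)[:k]
    return order, dists[order]


# Placeholder coordinates for illustration only (assumed, not real projections)
german_xy = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 10.0]])
english_query_xy = np.array([2.0, 3.0])

idx, dist = nearest_neighbors_2d(english_query_xy, german_xy, k=2)
print(idx, dist)
```

A brute-force scan like this is usually adequate for a few thousand points; for larger collections a spatial index such as a k-d tree may be a better fit.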
pip install sentence-transformers tensorflow umap-learn numpy matplotlib
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras
# Load models
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")
# Your sentences
sentences = [
    "Hello world",
    "Hallo Welt",
]
# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True,
)
# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)
print(coords_2d)
# [[0.11, 7.84],
# [0.28, 7.70]]
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])
for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
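Besides inspecting the plot, one way to check how closely a translation pair lands is to measure the Euclidean distance between its two projected points. The sketch below is a minimal example that reuses the sample coordinates printed in the usage example rather than calling the model; the variable name `pair_distance` is introduced here for illustration only.

```python
import numpy as np

# Sample coordinates copied from the printed output in the usage example:
# row 0 is "Hello world", row 1 is "Hallo Welt"
coords_2d = np.array([[0.11, 7.84],
                      [0.28, 7.70]])

# Euclidean distance between the English and German points in 2D
pair_distance = float(np.linalg.norm(coords_2d[0] - coords_2d[1]))
print(f"Distance between translation pair: {pair_distance:.3f}")
```

The distance is only meaningful relative to the spread of the rest of the projected data, so it is best compared against distances between unrelated sentences in the same space.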
Base model
google/embeddinggemma-300m