---
license: gemma
datasets:
- Helsinki-NLP/tatoeba_mt
language:
- en
- de
base_model:
- google/embeddinggemma-300m
---
# Parametric UMAP: English-German Cross-Lingual Embeddings (768D → 2D)
A Parametric UMAP model that projects 768-dimensional semantic embeddings from `google/embeddinggemma-300m` into a shared 2D cross-lingual space for English and German.
### Architecture
- **Input**: 768-dimensional embeddings (from embeddinggemma-300m)
- **Encoder**:
  - Dense(768 → 256) + ReLU
  - Dense(256 → 128) + ReLU
  - Dense(128 → 2) (linear output)
- **Output**: 2D coordinates
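
To make the layer shapes concrete, here is a minimal NumPy sketch of the forward pass described above. The weights are random placeholders, not the trained parameters, and the bias terms are an assumption (Keras `Dense` layers include them by default). The function name `encode` is hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the architecture list: 768 -> 256 -> 128 -> 2
layer_sizes = [768, 256, 128, 2]

# Placeholder weights (random, NOT the trained encoder) and zero biases.
# Bias terms are an assumption; the model card does not state them.
weights = [
    rng.normal(0.0, 0.05, size=(n_in, n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def encode(x):
    """Apply ReLU after the two hidden layers and keep the output linear."""
    h = np.asarray(x, dtype=np.float64)
    last = len(weights) - 1
    for i, (w, b) in enumerate(zip(weights, biases)):
        h = h @ w + b
        if i < last:
            h = np.maximum(h, 0.0)  # ReLU on hidden layers only
    return h


batch = rng.normal(size=(4, 768))  # stand-in for four 768D embeddings
coords = encode(batch)
print(coords.shape)  # (4, 2)

# Parameter count under the with-bias assumption
n_params = sum(w.size + b.size for w, b in zip(weights, biases))
print(n_params)  # 230018
```

Under that assumption the encoder has roughly 230k parameters, which is small next to the 300M-parameter embedding model that feeds it.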
### Training Details
- **Base embeddings**: `google/embeddinggemma-300m`
- **Training data**: 200,000 English-German sentence pairs from Helsinki-NLP/tatoeba_mt
- **Method**: Parametric UMAP with cosine metric
- **Framework**: TensorFlow 2.14 + umap-learn 0.5.5
- **Epochs**: 10 full epochs over the UMAP graph
- **Batch size**: 1,000 edges
### Primary Use Cases
- **Visualization**: Plot bilingual text data in 2D for exploration
- **Similarity analysis**: Find semantically similar texts across languages
- **Cross-lingual clustering**: Group related content in EN/DE
- **Semantic search**: Fast nearest-neighbor search in 2D space
## How to Use
### Installation
```bash
pip install sentence-transformers tensorflow umap-learn numpy matplotlib
```
### Basic Usage
```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras
# Load models
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")
# Your sentences
sentences = [
    "Hello world",
    "Hallo Welt"
]
# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True
)
# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)
print(coords_2d)
# [[0.11, 7.84],
# [0.28, 7.70]]
```
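
Once 2D coordinates are available, the similarity and nearest-neighbor use cases listed earlier can be sketched with plain NumPy. This is an illustrative sketch, not part of the released model: the first two rows reuse the example output above, the other two rows and their labels are made-up values, and the helper name `nearest_neighbors` is hypothetical.

```python
import numpy as np

# Rows 0-1 are the example coordinates printed above.
# Rows 2-3 are hypothetical values added only to make the demo non-trivial.
labels = [
    "Hello world",
    "Hallo Welt",
    "hypothetical EN sentence",
    "hypothetical DE sentence",
]
coords_2d = np.array([
    [0.11, 7.84],
    [0.28, 7.70],
    [5.40, -1.20],
    [5.10, -0.90],
])


def nearest_neighbors(coords):
    """Return, for each point, the index of and Euclidean distance to its closest other point."""
    diffs = coords[:, None, :] - coords[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # exclude each point from its own search
    return dists.argmin(axis=1), dists.min(axis=1)


idx, dist = nearest_neighbors(coords_2d)
for i, j in enumerate(idx):
    print(f"{labels[i]!r} -> {labels[j]!r} (distance {dist[i]:.2f})")
```

The brute-force pairwise distance matrix is fine for a few thousand points. For larger collections a spatial index such as a k-d tree would likely be a better fit.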
### Visualization Example
```python
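import matplotlib.pyplot as plt

# Reuses `sentences` and `coords_2d` from the Basic Usage example
plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])
# Label each point with its source sentence
for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))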
plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
```