---
license: gemma
datasets:
- Helsinki-NLP/tatoeba_mt
language:
- en
- de
base_model:
- google/embeddinggemma-300m
---

# Parametric UMAP: English-German Cross-Lingual Embeddings (768D → 2D)

A Parametric UMAP model that projects 768-dimensional semantic embeddings from `google/embeddinggemma-300m` into a shared 2D cross-lingual space for English and German.

### Architecture

- **Input**: 768-dimensional embeddings (from embeddinggemma-300m)
- **Encoder**:
  - Dense(768 → 256) + ReLU
  - Dense(256 → 128) + ReLU
  - Dense(128 → 2) (linear output)
- **Output**: 2D coordinates
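
The encoder is a small MLP, so its forward pass is easy to sketch. The snippet below is a minimal NumPy illustration of the layer shapes listed above; the randomly initialized weights stand in for the trained ones, so the outputs are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Randomly initialized weights standing in for the trained encoder;
# the shapes follow the architecture listed above.
W1, b1 = 0.05 * rng.standard_normal((768, 256)), np.zeros(256)
W2, b2 = 0.05 * rng.standard_normal((256, 128)), np.zeros(128)
W3, b3 = 0.05 * rng.standard_normal((128, 2)), np.zeros(2)

def encode(x):
    """Dense(768→256)+ReLU, Dense(256→128)+ReLU, Dense(128→2) linear."""
    h = np.maximum(x @ W1 + b1, 0)  # first hidden layer
    h = np.maximum(h @ W2 + b2, 0)  # second hidden layer
    return h @ W3 + b3              # linear 2D output

coords = encode(rng.standard_normal((4, 768)))
print(coords.shape)  # (4, 2)
```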

### Training Details

- **Base embeddings**: `google/embeddinggemma-300m`
- **Training data**: 200,000 English-German sentence pairs from Helsinki-NLP/tatoeba_mt
- **Method**: Parametric UMAP with cosine metric
- **Framework**: TensorFlow 2.14 + umap-learn 0.5.5
- **Epochs**: 10 full epochs over the UMAP graph
- **Batch size**: 1000 edges

### Primary Use Cases

- **Visualization**: Plot bilingual text data in 2D for exploration
- **Similarity analysis**: Find semantically similar texts across languages
- **Cross-lingual clustering**: Group related content across English and German
- **Semantic search**: Fast nearest-neighbor search in 2D space
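
For the semantic-search use case, nearest-neighbor lookup in the 2D space reduces to a cheap distance computation. A minimal NumPy sketch (the coordinates here are made up for illustration; in practice they would come from the encoder):

```python
import numpy as np

# Made-up 2D coordinates standing in for encoder output.
coords = np.array([[0.1, 7.8], [0.3, 7.7], [5.2, 1.0], [5.0, 1.2]])
query = np.array([0.15, 7.78])  # 2D projection of a query sentence

# Squared Euclidean distances from the query to every point.
dists = ((coords - query) ** 2).sum(axis=1)
nearest = np.argsort(dists)[:2]  # indices of the two closest points
print(nearest)  # [0 1]
```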

## How to Use

### Installation

```bash
pip install sentence-transformers tensorflow umap-learn numpy
```

### Basic Usage

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from tensorflow import keras

# Load the embedding model and the trained UMAP encoder
embedding_model = SentenceTransformer("google/embeddinggemma-300m")
umap_encoder = keras.models.load_model("path/to/encoder")

# Your sentences (one English, one German)
sentences = [
    "Hello world",
    "Hallo Welt",
]

# Generate 768D embeddings
embeddings_768d = embedding_model.encode(
    sentences,
    prompt_name="Clustering",
    convert_to_numpy=True,
)

# Project to 2D
coords_2d = umap_encoder.predict(embeddings_768d)

print(coords_2d)
# [[0.11, 7.84],
#  [0.28, 7.70]]
```

### Visualization Example

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 8))
plt.scatter(coords_2d[:, 0], coords_2d[:, 1])

# Label each point with its source sentence
for i, sent in enumerate(sentences):
    plt.annotate(sent, (coords_2d[i, 0], coords_2d[i, 1]))

plt.xlabel("UMAP Dimension 1")
plt.ylabel("UMAP Dimension 2")
plt.title("2D Semantic Space")
plt.show()
```