🚀 Setup Guide for Hugging Face Deployment

Prerequisites

Install required packages:

pip install huggingface_hub sentence-transformers

Log in to Hugging Face:

huggingface-cli login

Enter your Hugging Face token when prompted.

📦 Repository Contents

final_repo/
├── README.md                         # Main model documentation
├── USAGE_EXAMPLES.md                 # Comprehensive usage examples
├── SETUP.md                          # This setup guide
├── push_to_hf.py                     # Upload script
├── .gitignore                        # Git ignore rules
├── model.safetensors                 # Model weights
├── config.json                       # Model configuration
├── tokenizer.json                    # Tokenizer
├── vocab.txt                         # Vocabulary
├── sentence_bert_config.json         # Sentence-BERT config
├── modules.json                      # Model modules
├── 1_Pooling/config.json             # Pooling configuration
├── training_metadata.json            # Training information
└── configuration_hf_nomic_bert.py    # Model architecture
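
Before uploading, it can help to confirm that the files the model needs at load time are actually present. The following is a minimal sketch, not part of the repository: the helper name `missing_files` and the `EXPECTED_FILES` list are assumptions for illustration, with the list copied from the tree above.

```python
from pathlib import Path

# File list copied from the tree above; adjust it if your repository differs.
EXPECTED_FILES = [
    "README.md",
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "sentence_bert_config.json",
    "modules.json",
    "1_Pooling/config.json",
    "configuration_hf_nomic_bert.py",
]


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected files that are not present under repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).is_file()]
```

Calling `missing_files("final_repo")` returns an empty list when everything is in place, and otherwise the relative paths that still need to be added.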

🔄 Push to Hugging Face

Option 1: Automated Upload (Recommended)

cd final_repo
python push_to_hf.py
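
The contents of push_to_hf.py are not shown in this guide. As a rough idea of what such a script could do, here is a sketch built on the `create_repo` and `upload_folder` methods of `huggingface_hub.HfApi`. The function name `push_folder` and its parameters are hypothetical, and the real script may work differently.

```python
def push_folder(api, folder, repo_id, message="Upload model files"):
    """Create the model repo if needed, then upload every file in folder.

    `api` is expected to behave like huggingface_hub.HfApi (an assumption;
    the actual push_to_hf.py is not shown in this guide).
    """
    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    return api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="model",
        commit_message=message,
    )


# Typical call, after `huggingface-cli login`:
# from huggingface_hub import HfApi
# push_folder(HfApi(), ".", "asmud/nomic-embed-indonesian")
```

Passing the API object in as an argument keeps the function easy to try against a stand-in object before touching the real Hub.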

Option 2: Manual Upload

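# Run from the directory that contains final_repo/, not from inside it
git lfs install

# Clone the repo (create it first if it does not exist yet:
# huggingface-cli repo create nomic-embed-indonesian)
git clone https://huggingface.co/asmud/nomic-embed-indonesian

# Copy files, including the hidden .gitignore that * would skip
cp -r final_repo/* final_repo/.gitignore nomic-embed-indonesian/
cd nomic-embed-indonesian/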

# Git commands
git add .
git commit -m "Add Indonesian text embedding model

- Fine-tuned from nomic-embed-text-v1.5
- Optimized for Indonesian language
- 6,294 training examples across 17 categories
- Conservative training to prevent embedding collapse
- Maintains base model performance with Indonesian specialization"

git push

✅ Verification Steps

After uploading, verify the model works:

from sentence_transformers import SentenceTransformer

# Load the uploaded model
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)

# Test Indonesian text
texts = [
    "search_query: Apa itu kecerdasan buatan?",
    "search_document: Kecerdasan buatan adalah teknologi yang memungkinkan mesin belajar",
    "classification: Produk ini sangat berkualitas (sentimen: positif)"
]

embeddings = model.encode(texts)
print(f"✅ Model working! Embedding shape: {embeddings.shape}")
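
A shape check only confirms that the model loads and runs. A slightly stronger sanity check is to compare similarities: the query would be expected to sit closer to its matching document than to the unrelated classification text, although that is a heuristic rather than a guarantee. The helper below is an illustrative sketch in plain Python, and `cosine_similarity` is a name introduced here, not something the repository provides.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# With the embeddings from the verification script:
# query_vs_document = cosine_similarity(embeddings[0], embeddings[1])
# query_vs_other = cosine_similarity(embeddings[0], embeddings[2])
# print(query_vs_document, query_vs_other)
```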

📊 Model Information

Base Model: nomic-ai/nomic-embed-text-v1.5
Language: Indonesian (Bahasa Indonesia)
Embedding Dimension: 768
Max Sequence Length: 8192
Training Examples: 6,294 (balanced positive/negative)
Categories: 17 Indonesian content domains
Loss Function: MultipleNegativesRankingLoss
Training: Conservative approach to prevent embedding collapse

🎯 Model Performance

Search Retrieval: Maintains base performance (1.000 precision@1)
Classification: Stable performance (0.667 accuracy)
Clustering: Excellent performance (1.000 accuracy)
Semantic Similarity: High correlation (0.794)
Embedding Health: Healthy diversity range (0.625-0.898)
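
This guide does not define how the "diversity range" is computed. One plausible reading, assumed here purely for illustration, is the minimum and maximum pairwise cosine similarity over a set of embeddings, where values all clustered near 1.0 would suggest collapse. The function name `pairwise_similarity_range` is hypothetical.

```python
import math
from itertools import combinations


def pairwise_similarity_range(vectors):
    """Return (min, max) cosine similarity over all distinct pairs of vectors.

    Assumption: this is only one possible interpretation of the
    "diversity range" figure; the original metric is not documented here.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        return dot / (norm_a * norm_b)

    sims = [cosine(a, b) for a, b in combinations(vectors, 2)]
    if not sims:
        raise ValueError("need at least two vectors")
    return min(sims), max(sims)
```

It accepts any sequence of vectors, so the array returned by `model.encode(texts)` can be passed in directly.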

📝 License & Attribution

This model inherits the license from nomic-ai/nomic-embed-text-v1.5. Please refer to the base model's license terms.

🔗 Links

Model Repository: https://huggingface.co/asmud/nomic-embed-indonesian
Base Model: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
Sentence Transformers: https://www.sbert.net

🐛 Troubleshooting

Common Issues:

Authentication Error:

huggingface-cli login

Large File Upload Issues:

git lfs install
git lfs track "*.safetensors"

Model Loading Error:

# trust_remote_code=True is needed because the repo ships custom model code
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)

Memory Issues:

# Use the CPU if GPU memory is insufficient (trust_remote_code is still required)
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True, device='cpu')
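
If memory is still tight while encoding, a smaller `batch_size` in `model.encode` and processing the input in chunks may help. The sketch below assumes `model` is a loaded SentenceTransformer. The helper name `encode_in_chunks` and the default sizes are illustrative choices, not tuned values.

```python
def encode_in_chunks(model, texts, chunk_size=64, batch_size=8):
    """Yield embeddings chunk by chunk so only one chunk is held at a time.

    `model` only needs an encode(texts, batch_size=...) method, as
    SentenceTransformer provides.
    """
    for start in range(0, len(texts), chunk_size):
        chunk = texts[start:start + chunk_size]
        yield model.encode(chunk, batch_size=batch_size)


# Example: write each chunk to disk instead of keeping everything in memory.
# for i, chunk_embeddings in enumerate(encode_in_chunks(model, texts)):
#     ...  # save chunk_embeddings, then let it be garbage collected
```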