🚀 Setup Guide for Hugging Face Deployment
Prerequisites
- Install required packages:
pip install huggingface_hub sentence-transformers
- Log in to Hugging Face:
huggingface-cli login
Enter your Hugging Face token when prompted.
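In non-interactive environments such as CI, the login step can also be done from Python. The following is a minimal sketch, assuming the token is stored in an environment variable named HF_TOKEN. The helper name `login_from_env` is hypothetical and not part of this repo.

```python
import os


def login_from_env(var_name: str = "HF_TOKEN") -> None:
    """Log in to the Hugging Face Hub with a token read from the environment."""
    token = os.environ.get(var_name, "").strip()
    if not token:
        raise RuntimeError(f"Set {var_name} to a Hugging Face access token first")

    # Imported here so the missing-token check works even before
    # huggingface_hub is installed.
    from huggingface_hub import login

    login(token=token)
```

Call `login_from_env()` once at the start of a script, before any upload.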
📦 Repository Contents
final_repo/
├── README.md # Main model documentation
├── USAGE_EXAMPLES.md # Comprehensive usage examples
├── SETUP.md # This setup guide
├── push_to_hf.py # Upload script
├── .gitignore # Git ignore rules
├── model.safetensors # Model weights
├── config.json # Model configuration
├── tokenizer.json # Tokenizer
├── vocab.txt # Vocabulary
├── sentence_bert_config.json # Sentence-BERT config
├── modules.json # Model modules
├── 1_Pooling/config.json # Pooling configuration
├── training_metadata.json # Training information
└── configuration_hf_nomic_bert.py # Custom model configuration class
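Before uploading, it can be worth confirming that the files the model needs at load time are present. The sketch below uses only the standard library. The file list is copied from the tree above, and the helper name `missing_files` is hypothetical.

```python
from pathlib import Path

# Paths taken from the repository tree above; adjust if your repo differs.
REQUIRED_FILES = (
    "README.md",
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "sentence_bert_config.json",
    "modules.json",
    "1_Pooling/config.json",
    "configuration_hf_nomic_bert.py",
)


def missing_files(repo_dir: str, required: tuple[str, ...] = REQUIRED_FILES) -> list[str]:
    """Return the required paths that are not present under repo_dir."""
    root = Path(repo_dir)
    return [name for name in required if not (root / name).is_file()]
```

An empty list from `missing_files("final_repo")` means every listed file was found.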
🔄 Push to Hugging Face
Option 1: Automated Upload (Recommended)
cd final_repo
python push_to_hf.py
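The contents of push_to_hf.py are not shown in this guide. As a rough sketch of what an automated upload can look like with huggingface_hub (an assumption about the approach, not the actual script), the function below creates the repo if needed and uploads a folder. The function name `upload_model` and the ignore patterns are illustrative.

```python
def upload_model(folder: str = ".", repo_id: str = "asmud/nomic-embed-indonesian") -> None:
    """Create the model repo if needed, then upload every file in `folder`."""
    # Imported inside the function so the sketch can be loaded
    # without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    # exist_ok=True avoids an error when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add Indonesian text embedding model",
        # Hypothetical ignore list: keeps local-only files out of the upload.
        ignore_patterns=[".git/*", "__pycache__/*", "*.pyc"],
    )
```

This requires a prior login and write access to the target repo.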
Option 2: Manual Upload
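# Run from the directory that contains final_repo/ (cloning inside final_repo
# would make cp try to copy the clone into itself)
# If the repo does not exist yet, create it first: huggingface-cli repo create nomic-embed-indonesian
git clone https://huggingface.co/asmud/nomic-embed-indonesian
# Copy files into the clone (the * glob skips dotfiles, so copy .gitignore explicitly)
cp -r final_repo/* nomic-embed-indonesian/
cp final_repo/.gitignore nomic-embed-indonesian/
cd nomic-embed-indonesian/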
# Git commands
git add .
git commit -m "Add Indonesian text embedding model
- Fine-tuned from nomic-embed-text-v1.5
- Optimized for Indonesian language
- 6,294 training examples across 17 categories
- Conservative training to prevent embedding collapse
- Maintains base model performance with Indonesian specialization"
git push
✅ Verification Steps
After uploading, verify the model works:
from sentence_transformers import SentenceTransformer
# Load the uploaded model
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)
# Test Indonesian text
texts = [
"search_query: Apa itu kecerdasan buatan?",
"search_document: Kecerdasan buatan adalah teknologi yang memungkinkan mesin belajar",
"classification: Produk ini sangat berkualitas (sentimen: positif)"
]
embeddings = model.encode(texts)
print(f"✅ Model working! Embedding shape: {embeddings.shape}")
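Beyond printing the shape, a few cheap sanity checks can catch a broken upload early. This is a minimal sketch using only the standard library. The default of 768 comes from the Model Information section, and the helper name `check_embeddings` is hypothetical.

```python
import math


def check_embeddings(embeddings, expected_count: int, expected_dim: int = 768) -> None:
    """Raise ValueError if the embeddings look malformed."""
    # Convert to plain floats so NumPy arrays and nested lists both work.
    rows = [[float(x) for x in row] for row in embeddings]
    if len(rows) != expected_count:
        raise ValueError(f"expected {expected_count} embeddings, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != expected_dim:
            raise ValueError(f"embedding {i} has dimension {len(row)}, expected {expected_dim}")
        if not all(math.isfinite(x) for x in row):
            raise ValueError(f"embedding {i} contains NaN or infinite values")
        if not any(x != 0.0 for x in row):
            raise ValueError(f"embedding {i} is all zeros")
```

With the snippet above, this would be called as `check_embeddings(embeddings, expected_count=len(texts))`.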
📊 Model Information
- Base Model: nomic-ai/nomic-embed-text-v1.5
- Language: Indonesian (Bahasa Indonesia)
- Embedding Dimension: 768
- Max Sequence Length: 8192
- Training Examples: 6,294 (balanced positive/negative)
- Categories: 17 Indonesian content domains
- Loss Function: MultipleNegativesRankingLoss
- Training: Conservative approach to prevent embedding collapse
🎯 Model Performance
- Search Retrieval: Maintains base performance (1.000 precision@1)
- Classification: Stable performance (0.667 accuracy)
- Clustering: Excellent performance (1.000 accuracy)
- Semantic Similarity: High correlation (0.794)
- Embedding Health: Healthy diversity range (0.625-0.898)
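This guide does not define how the diversity range is computed. One common way to look for embedding collapse is to inspect the spread of pairwise cosine similarities across unrelated texts: if every pair scores near 1.0, the embeddings have likely collapsed. The sketch below illustrates that idea under this assumption. It is not necessarily the metric behind the numbers above, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_range(embeddings) -> tuple[float, float]:
    """Return (min, max) cosine similarity over all distinct pairs."""
    rows = [[float(x) for x in row] for row in embeddings]
    sims = [cosine(a, b) for a, b in combinations(rows, 2)]
    if not sims:
        raise ValueError("need at least two embeddings")
    return min(sims), max(sims)
```

A narrow range close to 1.0 on deliberately unrelated inputs would be a warning sign, while a wider range suggests the model still separates topics.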
📝 License & Attribution
This model inherits the license from nomic-ai/nomic-embed-text-v1.5. Please refer to the base model's license terms.
🔗 Links
- Model Repository: https://huggingface.co/asmud/nomic-embed-indonesian
- Base Model: https://huggingface.co/nomic-ai/nomic-embed-text-v1.5
- Sentence Transformers: https://www.sbert.net
🐛 Troubleshooting
Common Issues:
- Authentication Error:
huggingface-cli login
- Large File Upload Issues:
git lfs install
git lfs track "*.safetensors"
- Model Loading Error:
# The repo ships custom model code (configuration_hf_nomic_bert.py), so pass trust_remote_code=True
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)
- Memory Issues:
# Fall back to CPU if GPU memory is insufficient (trust_remote_code=True is still needed)
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True, device="cpu")
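For the large-file upload issue above, it can help to list which files actually need Git LFS before pushing. This is a small sketch that assumes a 10 MB cutoff. Treat that number as an assumption and check the Hub documentation for the current limit. The helper name `files_needing_lfs` is hypothetical.

```python
from pathlib import Path

LFS_THRESHOLD_BYTES = 10 * 1024 * 1024  # assumed cutoff, adjust as needed


def files_needing_lfs(repo_dir: str, threshold: int = LFS_THRESHOLD_BYTES) -> list[str]:
    """List files under repo_dir larger than threshold, skipping .git."""
    root = Path(repo_dir)
    large = []
    for path in root.rglob("*"):
        rel = path.relative_to(root)
        if ".git" in rel.parts or not path.is_file():
            continue
        if path.stat().st_size > threshold:
            large.append(rel.as_posix())
    return sorted(large)
```

Any path it returns should match a pattern passed to `git lfs track` before running `git add`.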