🚀 Setup Guide for Hugging Face Deployment

Prerequisites

  1. Install required packages:
pip install huggingface_hub sentence-transformers
  2. Log in to Hugging Face:
huggingface-cli login

Enter your Hugging Face access token when prompted. The token needs write permission to push the model.

📦 Repository Contents

final_repo/
├── README.md                       # Main model documentation
├── USAGE_EXAMPLES.md               # Comprehensive usage examples
├── SETUP.md                        # This setup guide
├── push_to_hf.py                   # Upload script
├── .gitignore                      # Git ignore rules
├── model.safetensors               # Model weights
├── config.json                     # Model configuration
├── tokenizer.json                  # Tokenizer
├── vocab.txt                       # Vocabulary
├── sentence_bert_config.json       # Sentence-BERT config
├── modules.json                    # Model modules
├── 1_Pooling/config.json           # Pooling configuration
├── training_metadata.json          # Training information
└── configuration_hf_nomic_bert.py  # Model architecture
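
Before uploading, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of the repository: the helper name `missing_files` and the choice of which files count as required are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the repository listing above.
# Which of them are strictly required is an assumption of this sketch.
EXPECTED_FILES = [
    "README.md",
    "model.safetensors",
    "config.json",
    "tokenizer.json",
    "vocab.txt",
    "sentence_bert_config.json",
    "modules.json",
    "1_Pooling/config.json",
    "configuration_hf_nomic_bert.py",
]


def missing_files(repo_dir, expected=EXPECTED_FILES):
    """Return the expected files that are absent from repo_dir."""
    root = Path(repo_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("final_repo")
    print("All expected files present." if not missing else f"Missing: {missing}")
```

An empty result means every listed file was found; anything else is worth fixing before pushing.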

🔄 Push to Hugging Face

Option 1: Automated Upload (Recommended)

cd final_repo
python push_to_hf.py
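
The contents of push_to_hf.py are not shown in this guide. As a rough sketch of what an automated upload can look like, the function below uses `HfApi.create_repo` and `HfApi.upload_folder` from huggingface_hub. The function name `push_folder`, the commit message, and the ignore patterns are assumptions for illustration and may differ from the actual script.

```python
def push_folder(folder_path, repo_id, api=None):
    """Create the model repo if it does not exist, then upload the folder in one commit."""
    if api is None:
        # Imported lazily so the function can also be exercised with a stub client.
        from huggingface_hub import HfApi
        api = HfApi()
    # exist_ok=True makes the call a no-op when the repo already exists.
    api.create_repo(repo_id=repo_id, repo_type="model", exist_ok=True)
    return api.upload_folder(
        folder_path=folder_path,
        repo_id=repo_id,
        repo_type="model",
        commit_message="Add Indonesian text embedding model",  # hypothetical message
        ignore_patterns=[".git/*", "__pycache__/*"],  # hypothetical exclusions
    )


# Usage, run from inside final_repo after `huggingface-cli login`:
# push_folder(".", "asmud/nomic-embed-indonesian")
```

Uploading through the API avoids a local git clone and handles large files such as model.safetensors without a separate Git LFS setup.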

Option 2: Manual Upload

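# Run from the directory that contains final_repo, so the clone does not end up inside it

# Clone/create the repo
git clone https://huggingface.co/asmud/nomic-embed-indonesian
# OR create new: huggingface-cli repo create nomic-embed-indonesian

# Copy files, including dotfiles such as .gitignore
cp -r final_repo/. nomic-embed-indonesian/
cd nomic-embed-indonesian/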

# Git commands
git add .
git commit -m "Add Indonesian text embedding model

- Fine-tuned from nomic-embed-text-v1.5
- Optimized for Indonesian language
- 6,294 training examples across 17 categories
- Conservative training to prevent embedding collapse
- Maintains base model performance with Indonesian specialization"

git push

✅ Verification Steps

After uploading, verify the model works:

from sentence_transformers import SentenceTransformer

# Load the uploaded model
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)

# Test Indonesian text
texts = [
    "search_query: Apa itu kecerdasan buatan?",
    "search_document: Kecerdasan buatan adalah teknologi yang memungkinkan mesin belajar",
    "classification: Produk ini sangat berkualitas (sentimen: positif)"
]

embeddings = model.encode(texts)
print(f"✅ Model working! Embedding shape: {embeddings.shape}")
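
A printed shape only shows that the model loaded. The sketch below adds a few basic checks on the embeddings themselves: expected dimension, no NaN or infinite values, and that the vectors are not all nearly identical. The helper names and the `collapse_threshold` value of 0.999 are assumptions for illustration, not values from the training setup.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def check_embeddings(vectors, expected_dim=768, collapse_threshold=0.999):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    for i, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            problems.append(f"row {i}: dimension {len(vec)}, expected {expected_dim}")
        if not all(math.isfinite(x) for x in vec):
            problems.append(f"row {i}: contains NaN or inf")
    if not problems and len(vectors) > 1:
        sims = [
            cosine(vectors[i], vectors[j])
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
        ]
        # If even the least similar pair is almost identical, the embeddings may have collapsed.
        if min(sims) > collapse_threshold:
            problems.append("all pairs are near-identical (possible embedding collapse)")
    return problems


# Usage with the snippet above:
# print(check_embeddings(embeddings.tolist()) or "embeddings look sane")
```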

📊 Model Information

  • Base Model: nomic-ai/nomic-embed-text-v1.5
  • Language: Indonesian (Bahasa Indonesia)
  • Embedding Dimension: 768
  • Max Sequence Length: 8192
  • Training Examples: 6,294 (balanced positive/negative)
  • Categories: 17 Indonesian content domains
  • Loss Function: MultipleNegativesRankingLoss
  • Training: Conservative approach to prevent embedding collapse
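
For readers unfamiliar with MultipleNegativesRankingLoss, the idea can be sketched in a few lines: each anchor is scored against every positive in the batch, and the loss is the cross-entropy of picking its own positive, so the other positives act as in-batch negatives. This is a plain-Python illustration, not the training code. The `scale` of 20.0 and the use of cosine similarity reflect the sentence-transformers defaults as far as we know; the settings actually used for this model are not stated here.

```python
import math


def mnr_loss(sim, scale=20.0):
    """Mean cross-entropy over a batch.

    sim[i][j] is the cosine similarity between anchor i and positive j;
    the matching pair for anchor i sits on the diagonal (j == i).
    """
    total = 0.0
    for i, row in enumerate(sim):
        logits = [scale * s for s in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(sim)
```

When every anchor is far closer to its own positive than to the others, the loss approaches zero. When all similarities are equal, it equals the log of the batch size.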

🎯 Model Performance

  • Search Retrieval: Maintains base performance (1.000 precision@1)
  • Classification: Stable performance (0.667 accuracy)
  • Clustering: Excellent performance (1.000 accuracy)
  • Semantic Similarity: High correlation (0.794)
  • Embedding Health: Healthy diversity range (0.625-0.898)
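
The evaluation script behind these numbers is not included in this guide. To make the retrieval metric concrete, here is a minimal sketch of how precision@1 is commonly computed: for each query, check whether the highest-scoring document is the relevant one. The function name and input layout are assumptions for illustration.

```python
def precision_at_1(similarities, relevant):
    """Fraction of queries whose top-scoring document is the relevant one.

    similarities[q][d] is the score of document d for query q;
    relevant[q] is the index of the correct document for query q.
    """
    hits = 0
    for q, scores in enumerate(similarities):
        top = max(range(len(scores)), key=scores.__getitem__)
        if top == relevant[q]:
            hits += 1
    return hits / len(similarities)
```

The similarity matrix can be built from `model.encode` outputs for the `search_query:` and `search_document:` texts.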

📝 License & Attribution

This model inherits the license from nomic-ai/nomic-embed-text-v1.5. Please refer to the base model's license terms.

🐛 Troubleshooting

Common Issues:

  1. Authentication Error:
huggingface-cli login
  2. Large File Upload Issues:
git lfs install
git lfs track "*.safetensors"
  3. Model Loading Error:
# The repo ships custom model code (configuration_hf_nomic_bert.py), so pass trust_remote_code=True
model = SentenceTransformer("asmud/nomic-embed-indonesian", trust_remote_code=True)
  4. Memory Issues:
# Use CPU if GPU memory is insufficient
model = SentenceTransformer("asmud/nomic-embed-indonesian", device='cpu', trust_remote_code=True)
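
The memory workaround can also be automated. The sketch below tries a list of devices in order and falls back to the next one when loading fails. The helper name `load_with_fallback` and the injectable `loader` argument are assumptions for illustration. The exception types caught here are also an assumption: CUDA out-of-memory errors are raised as `RuntimeError` subclasses in PyTorch as far as we know, and a missing CUDA build typically surfaces as `AssertionError` or `RuntimeError`.

```python
def load_with_fallback(model_id, devices=("cuda", "cpu"), loader=None):
    """Try each device in order; return (model, device) for the first that works."""
    if loader is None:
        # Imported lazily so the fallback logic can be tested without the library.
        from sentence_transformers import SentenceTransformer

        def loader(name, device):
            return SentenceTransformer(name, device=device, trust_remote_code=True)

    last_error = None
    for device in devices:
        try:
            return loader(model_id, device), device
        except (RuntimeError, AssertionError) as err:
            last_error = err
    raise RuntimeError(f"could not load {model_id} on any of {devices}") from last_error


# Usage:
# model, device = load_with_fallback("asmud/nomic-embed-indonesian")
# print(f"Loaded on {device}")
```

If loading succeeds but encoding still runs out of memory, passing a smaller `batch_size` to `model.encode` is another option.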