---
|
|
title: Test01 |
|
|
emoji: 🐨 |
|
|
colorFrom: gray |
|
|
colorTo: red |
|
|
sdk: docker |
|
|
app_port: 7860 |
|
|
startup_duration_timeout: 60m |
|
|
pinned: false |
|
|
license: apache-2.0 |
|
|
--- |
|
|
# User Embedding Model |
|
|
|
|
|
This repository contains a PyTorch model for generating user embeddings based on DMP (Data Management Platform) data. The model creates dense vector representations of users that can be used for recommendation systems, user clustering, and similarity searches. |
|
|
|
|
|
## Quick Start with Docker |
|
|
|
|
|
To build the image and run the model with Docker:
|
|
|
|
|
```bash
docker build -t user-embedding-model .

docker run -v /path/to/your/data:/app/data \
  -e DATA_PATH=/app/data/users.json \
  -e NUM_EPOCHS=10 \
  -e BATCH_SIZE=32 \
  -v /path/to/output:/app/embeddings_output \
  user-embedding-model
```
|
|
|
|
|
## Pushing to Hugging Face |
|
|
|
|
|
To automatically push the trained model to Hugging Face, provide your repository ID and API token:
|
|
|
|
|
```bash
docker run -v /path/to/your/data:/app/data \
  -e DATA_PATH=/app/data/users.json \
  -e HF_REPO_ID="your-username/your-model-name" \
  -e HF_TOKEN="your-huggingface-token" \
  -v /path/to/output:/app/embeddings_output \
  user-embedding-model
```
|
|
|
|
|
## Input Data Format |
|
|
|
|
|
The model expects user data in JSON format, where each user record carries DMP fields such as:
|
|
|
|
|
```json
{
  "dmp": {
    "city": "milano",
    "domains": ["example.com"],
    "brands": ["brand1", "brand2"],
    "clusters": ["cluster1", "cluster2"],
    "industries": ["industry1"],
    "tags": ["tag1", "tag2"],
    "channels": ["channel1"],
    "~click__host": "host1",
    "~click__domain": "domain1",
    "": {
      "id": "user123"
    }
  }
}
```
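
Note that the user id sits under an empty-string key in the example. A minimal parsing sketch (field subset trimmed for brevity; the split between id and categorical fields is an assumption about how the loader treats this layout):

```python
import json

# Parse one user record: pull the user id out of the empty-string key
# and keep the remaining keys as categorical DMP fields.
raw = """
{
  "dmp": {
    "city": "milano",
    "domains": ["example.com"],
    "brands": ["brand1", "brand2"],
    "": {"id": "user123"}
  }
}
"""
dmp = json.loads(raw)["dmp"]
user_id = dmp[""]["id"]
fields = {k: v for k, v in dmp.items() if k != ""}
print(user_id, sorted(fields))  # user123 ['brands', 'city', 'domains']
```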
|
|
|
|
|
## Environment Variables |
|
|
|
|
|
- `DATA_PATH`: Path to your input JSON file (default: "users.json") |
|
|
- `NUM_EPOCHS`: Number of training epochs (default: 10) |
|
|
- `BATCH_SIZE`: Batch size for training (default: 32) |
|
|
- `LEARNING_RATE`: Learning rate for optimizer (default: 0.001) |
|
|
- `SAVE_INTERVAL`: Save checkpoint every N epochs (default: 2) |
|
|
- `HF_REPO_ID`: Hugging Face repository ID for uploading |
|
|
- `HF_TOKEN`: Hugging Face API token |
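
Inside the container these variables might be read along these lines (a sketch only; the names and defaults mirror the list above, but the actual parsing code is not shown here):

```python
import os

# Read training configuration from the environment, falling back to
# the documented defaults when a variable is unset.
config = {
    "data_path": os.environ.get("DATA_PATH", "users.json"),
    "num_epochs": int(os.environ.get("NUM_EPOCHS", "10")),
    "batch_size": int(os.environ.get("BATCH_SIZE", "32")),
    "learning_rate": float(os.environ.get("LEARNING_RATE", "0.001")),
    "save_interval": int(os.environ.get("SAVE_INTERVAL", "2")),
    "hf_repo_id": os.environ.get("HF_REPO_ID"),  # optional: push target
    "hf_token": os.environ.get("HF_TOKEN"),      # optional: auth token
}
print(config)
```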
|
|
|
|
|
## Output |
|
|
|
|
|
The model generates: |
|
|
|
|
|
1. `embeddings.json`: User embeddings in JSON format |
|
|
2. `embeddings.npz`: User embeddings in NumPy format |
|
|
3. `vocabularies.json`: Vocabulary mappings |
|
|
4. `model.pth`: Trained PyTorch model |
|
|
5. `model_config.json`: Model configuration |
|
|
6. Hugging Face-compatible model files in the "huggingface" subdirectory |
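
The NumPy output can be read back with `np.load`; the array key name used below (`embeddings`) and the 256-dimensional shape are assumptions for illustration (here round-tripped through an in-memory buffer so the snippet is self-contained):

```python
import io

import numpy as np

# Write a small embeddings archive to a buffer, then read it back the
# same way embeddings.npz would be loaded from disk.
buf = io.BytesIO()
np.savez(buf, embeddings=np.random.rand(3, 256).astype(np.float32))
buf.seek(0)

with np.load(buf) as data:
    emb = data["embeddings"]
print(emb.shape)  # (3, 256)
```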
|
|
|
|
|
## Hardware Requirements |
|
|
|
|
|
- **Recommended**: NVIDIA GPU with CUDA support |
|
|
- The code uses parallel processing for triplet generation to utilize all available CPU cores |
|
|
- For L40S GPU, recommended batch size: 32-64 |
|
|
- Memory requirement: At least 8GB RAM |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
The model consists of: |
|
|
|
|
|
- Embedding layers for each user field |
|
|
- Sequential fully connected layers for dimensionality reduction |
|
|
- Output dimension: 256 (configurable) |
|
|
- Training method: Triplet margin loss using similar/dissimilar users |
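
The triplet margin criterion can be sketched in plain NumPy: an anchor user's embedding is pulled toward a similar (positive) user and pushed away from a dissimilar (negative) one. The margin value and dimensions below are illustrative, not the model's actual hyperparameters:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Loss is zero once the negative is at least `margin` farther
    # from the anchor than the positive is.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

rng = np.random.default_rng(0)
a, p, n = rng.normal(size=(3, 256))
print(triplet_margin_loss(a, a + 0.01 * p, n))
```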
|
|
|
|
|
## Performance Optimization |
|
|
|
|
|
The code includes several optimizations: |
|
|
|
|
|
- Parallel triplet generation using all available CPU cores |
|
|
- GPU acceleration for model training |
|
|
- Efficient memory handling for large datasets |
|
|
- TensorBoard integration for monitoring training |
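
One way the triplet-generation work could be divided across cores is to split the user indices into one chunk per worker and hand each chunk to a process pool. This is a hypothetical sketch of the chunking step only; the actual worker function is omitted:

```python
import os

def chunk_indices(n_users, n_workers=None):
    # One chunk of user indices per worker (e.g. for Pool.map).
    n_workers = n_workers or os.cpu_count() or 1
    size = -(-n_users // n_workers)  # ceiling division
    return [range(i, min(i + size, n_users)) for i in range(0, n_users, size)]

chunks = chunk_indices(10, n_workers=4)
print([list(c) for c in chunks])  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```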
|
|
|
|
|
## Troubleshooting |
|
|
|
|
|
If you encounter issues: |
|
|
|
|
|
1. Check that your input data follows the expected format |
|
|
2. Ensure you have sufficient memory for your dataset size |
|
|
3. For GPU errors, try reducing batch size |
|
|
4. Check the logs for detailed error messages |