Deployment Guide

This guide covers local, Docker, cloud, and Kubernetes deployment of the Social Media Topic Modeling System, followed by notes on performance, monitoring, security, and troubleshooting.

Local Development

Quick Start

# Install dependencies
pip install -r requirements.txt

# Run the application
streamlit run streamlit_app.py

Development with Docker

# Build and run with Docker Compose
docker-compose up --build

# Or build and run manually
docker build -t topic-modeling-app .
docker run -p 8501:8501 topic-modeling-app

Production Deployment

Docker Production Setup

  1. Build the production image:
docker build -t topic-modeling-app:latest .
  2. Run with production settings:
docker run -d \
  --name topic-modeling-prod \
  -p 8501:8501 \
  --memory=4g \
  --cpus=2 \
  --restart=unless-stopped \
  topic-modeling-app:latest
  3. Using Docker Compose for production:
version: '3.8'
services:
  topic-modeling-app:
    build: .
    ports:
      - "8501:8501"
    environment:
      - STREAMLIT_SERVER_PORT=8501
      - STREAMLIT_SERVER_ADDRESS=0.0.0.0
    volumes:
      - ./data:/app/data
    restart: unless-stopped
    deploy:
      resources:
        limits:
          memory: 4G
          cpus: '2'
    healthcheck:
      # Requires curl to be installed inside the image
      test: ["CMD", "curl", "-f", "http://localhost:8501/_stcore/health"]
      interval: 30s
      timeout: 10s
      retries: 3

Cloud Deployment Options

1. AWS ECS/Fargate

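# Authenticate Docker to ECR (requires the AWS CLI; replace region and your-account)
aws ecr get-login-password --region region | docker login --username AWS --password-stdin your-account.dkr.ecr.region.amazonaws.com

# Tag for ECR
docker tag topic-modeling-app:latest your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest

# Push to ECR
docker push your-account.dkr.ecr.region.amazonaws.com/topic-modeling-app:latest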

2. Google Cloud Run

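# Deploy a previously pushed image to Cloud Run
# (--port is needed because Cloud Run routes to 8080 by default, while Streamlit listens on 8501)
gcloud run deploy topic-modeling-app \
  --image gcr.io/your-project/topic-modeling-app \
  --platform managed \
  --region us-central1 \
  --memory 4Gi \
  --cpu 2 \
  --port 8501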

3. Azure Container Instances

# Deploy to Azure
az container create \
  --resource-group myResourceGroup \
  --name topic-modeling-app \
  --image your-registry.azurecr.io/topic-modeling-app:latest \
  --cpu 2 \
  --memory 4 \
  --ports 8501

4. Heroku

# Login to Heroku Container Registry
heroku container:login

# Build and push
# (Heroku assigns the listening port via $PORT, so the image must start Streamlit on that port)
heroku container:push web --app your-app-name

# Release
heroku container:release web --app your-app-name

Kubernetes Deployment

Deployment YAML

apiVersion: apps/v1
kind: Deployment
metadata:
  name: topic-modeling-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: topic-modeling-app
  template:
    metadata:
      labels:
        app: topic-modeling-app
    spec:
      containers:
      - name: topic-modeling-app
        image: topic-modeling-app:latest  # replace with a registry path the cluster can pull from
        ports:
        - containerPort: 8501
        resources:
          requests:
            memory: "2Gi"
            cpu: "1"
          limits:
            memory: "4Gi"
            cpu: "2"
        env:
        - name: STREAMLIT_SERVER_PORT
          value: "8501"
        - name: STREAMLIT_SERVER_ADDRESS
          value: "0.0.0.0"
---
apiVersion: v1
kind: Service
metadata:
  name: topic-modeling-service
spec:
  selector:
    app: topic-modeling-app
  ports:
  - port: 80
    targetPort: 8501
  type: LoadBalancer

Performance Optimization

Memory Management

  • Minimum RAM: 4GB for small datasets (< 1000 documents)
  • Recommended RAM: 8GB+ for larger datasets
  • Large datasets: Consider processing in batches
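The batching suggestion above can be sketched as a small generator that yields fixed-size slices of the document list, so only one slice has to be preprocessed or embedded at a time. The helper name `iter_batches` and the batch size of 500 are illustrative assumptions, not part of the application.

```python
from typing import Iterator, List, Sequence


def iter_batches(documents: Sequence[str], batch_size: int = 500) -> Iterator[List[str]]:
    """Yield consecutive slices of `documents`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(documents), batch_size):
        yield list(documents[start:start + batch_size])


if __name__ == "__main__":
    docs = [f"post {i}" for i in range(1200)]  # stand-in for real social media posts
    sizes = [len(batch) for batch in iter_batches(docs, batch_size=500)]
    print(sizes)  # [500, 500, 200]
```

Batching like this mainly limits peak memory during preprocessing and embedding. Whether the topic model itself can be fitted incrementally depends on how it is configured.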

CPU Optimization

  • Minimum: 2 CPU cores
  • Recommended: 4+ CPU cores for faster processing
  • GPU: Optional, can speed up transformer models
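The RAM and CPU minimums in this section can be checked before starting the app. The following is a minimal sketch, assuming a POSIX host where `os.sysconf` reports physical memory; the function names `meets_minimums` and `detect_ram_gb` are hypothetical helpers, and the thresholds are the 2-core and 4 GB figures from this guide.

```python
import os
from typing import Optional


def meets_minimums(cpu_cores: int, ram_gb: float,
                   min_cores: int = 2, min_ram_gb: float = 4.0) -> bool:
    """Return True if the host meets the minimum CPU and RAM figures from this guide."""
    return cpu_cores >= min_cores and ram_gb >= min_ram_gb


def detect_ram_gb() -> Optional[float]:
    """Best-effort total RAM in GiB; returns None where os.sysconf lacks these keys."""
    try:
        return os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 1024 ** 3
    except (AttributeError, ValueError, OSError):
        return None


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    ram = detect_ram_gb()
    if ram is None:
        print(f"{cores} cores detected; RAM could not be determined on this platform")
    else:
        status = "OK" if meets_minimums(cores, ram) else "below the guide's minimums"
        print(f"{cores} cores, {ram:.1f} GiB RAM: {status}")
```

Two caveats apply. A machine sold as "4 GB" often reports slightly less than 4 GiB, so the threshold may need loosening. Inside a container, these calls typically report the host's resources rather than the limits set with `--memory` or `--cpus`.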

Storage Considerations

  • Docker image: ~2GB
  • Temporary files: Varies with dataset size
  • Persistent storage: Optional for saving results

Monitoring and Logging

Health Checks

Streamlit exposes a built-in health endpoint at /_stcore/health:

# Check application health
curl http://localhost:8501/_stcore/health
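For scripted checks, for example in a deployment pipeline, the same request can be made from Python with a few retries. This is a sketch using only the standard library; the function names, the retry count, and the delay are assumptions chosen for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_healthy(url: str, timeout: float = 5.0) -> bool:
    """Return True if `url` answers with HTTP 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeouts, non-2xx responses, and malformed URLs all count as unhealthy
        return False


def wait_until_healthy(url: str, attempts: int = 3, delay: float = 2.0) -> bool:
    """Poll `url` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if is_healthy(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    ok = wait_until_healthy("http://localhost:8501/_stcore/health")
    print("healthy" if ok else "unhealthy")
```

Returning a boolean instead of raising keeps the helper easy to use as a gate: a calling script can exit non-zero when `wait_until_healthy` returns False.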

Logging

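Streamlit writes its logs to the container's standard output, so they are available through Docker. The commands below use the container name from the production setup above; substitute your own container name if it differs:

# View logs
docker logs topic-modeling-prod

# Follow logs
docker logs -f topic-modeling-prod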

Monitoring with Prometheus

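The helper below collects basic system metrics inside the app. It does not by itself expose a Prometheus scrape endpoint; the values still need to be exported in a format Prometheus can read:

# Add to streamlit_app.py for monitoring (requires psutil to be installed)
import time

import psutil
import streamlit as st


# Cache for a few seconds only; without a ttl the first reading would be served indefinitely
@st.cache_data(ttl=5)
def get_system_metrics():
    """Return a snapshot of host CPU and memory usage."""
    return {
        # The first call returns 0.0 because psutil needs a previous sample to compare against
        'cpu_percent': psutil.cpu_percent(),
        'memory_percent': psutil.virtual_memory().percent,
        'timestamp': time.time()
    }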
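To make such a metrics dictionary readable by Prometheus, its values have to be rendered in the Prometheus text exposition format. The sketch below shows only that rendering step for a flat mapping of numbers; the function name `to_prometheus_text` and the `topic_app` metric prefix are hypothetical, and serving the resulting text at a scrape URL is a separate step not covered here.

```python
from typing import Mapping


def to_prometheus_text(metrics: Mapping[str, float], prefix: str = "topic_app") -> str:
    """Render a flat mapping of numeric metrics as Prometheus gauge lines."""
    lines = []
    for name, value in sorted(metrics.items()):
        metric_name = f"{prefix}_{name}"
        lines.append(f"# TYPE {metric_name} gauge")
        lines.append(f"{metric_name} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = {"cpu_percent": 12.5, "memory_percent": 48.0}  # made-up sample values
    print(to_prometheus_text(sample), end="")
    # # TYPE topic_app_cpu_percent gauge
    # topic_app_cpu_percent 12.5
    # # TYPE topic_app_memory_percent gauge
    # topic_app_memory_percent 48.0
```

For anything beyond a quick experiment, a maintained client library is likely a better choice than hand-rendering, since it also handles labels, counters, and the HTTP endpoint.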

Security Considerations

Container Security

  • Run as non-root user (included in Dockerfile)
  • Use minimal base images
  • Regularly update dependencies

Network Security

  • Use HTTPS in production
  • Implement proper firewall rules
  • Consider VPN for internal access

Data Security

  • Encrypt data at rest and in transit
  • Implement proper access controls
  • Run regular security audits

Troubleshooting

Common Issues

  1. Out of Memory Errors

    • Increase container memory limits
    • Process smaller datasets
    • Use batch processing
  2. Slow Performance

    • Increase CPU allocation
    • Use SSD storage
    • Optimize dataset size
  3. Container Won't Start

    • Check logs: docker logs container-name
    • Verify port availability
    • Check resource limits
  4. Model Loading Issues

    • Ensure internet connectivity for model downloads
    • Pre-download models in Docker build
    • Check disk space

Support

For deployment issues:

  1. Check the logs first
  2. Verify system requirements
  3. Test with sample data
  4. Check network connectivity