πŸ€– Production-Ready LLM API Backend

A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with Encore.ts for easy deployment to Encore Cloud or Hugging Face Spaces.

✨ Features

  • 🎯 5 Core Endpoints - Chat, RAG, Analysis, Models, Health
  • πŸ”„ Dual Provider Support - Ollama (local) or Hugging Face (cloud)
  • ⚑ Smart Caching - In-memory cache with TTL and automatic cleanup
  • πŸ›‘οΈ Type-Safe - Full TypeScript support with end-to-end type safety
  • πŸ“¦ Production Ready - Comprehensive error handling, logging, and monitoring
  • πŸš€ Zero Config - Works out of the box on multiple platforms

πŸš€ Quick Start

Local Development

# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434

# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2

# Run locally
encore run

# Test the API
curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain AI in simple terms"}'

Deploy to Encore Cloud

encore deploy

Your API will be live at: https://staging-<your-app>.encr.app

Deploy to Hugging Face Spaces

See README.space.md for complete Hugging Face Spaces deployment instructions.

Quick summary:

  1. Create a new Docker Space on Hugging Face
  2. Push this repository to your Space
  3. Configure secrets in Space settings
  4. Your API is live!

πŸ“‘ API Endpoints

POST /chat

Conversational AI with intelligent caching.

Request:

{
  "message": "Explain quantum computing",
  "model": "llama3",
  "temperature": 0.7,
  "maxTokens": 500,
  "systemPrompt": "You are a helpful assistant"
}

Response:

{
  "response": "Quantum computing is...",
  "model": "llama3",
  "tokensUsed": 150
}
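
The optional fields (model, temperature, maxTokens, systemPrompt) can all be omitted, in which case the backend presumably falls back to the configured DefaultModel. A fuller curl invocation of the request above:

curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing",
    "temperature": 0.7,
    "maxTokens": 500
  }'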

POST /rag

Retrieval-Augmented Generation with source tracking.

Request:

{
  "query": "What is the main topic?",
  "context": [
    "Quantum computing uses qubits...",
    "Classical computers use bits..."
  ],
  "model": "mistral",
  "temperature": 0.5
}

Response:

{
  "response": "Based on [0] and [1], the main topic is...",
  "model": "mistral",
  "tokensUsed": 120,
  "sources": [0, 1]
}
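
The indices in sources point into the context array, so [0] cites the first passage. A minimal sketch of how such a prompt could be assembled (hypothetical helper; the actual construction in backend/rag/rag.ts may differ):

// Hypothetical helper: number each context passage so the model can cite it.
// The real prompt construction in backend/rag/rag.ts may differ.
function buildRagPrompt(query: string, context: string[]): string {
  const numbered = context
    .map((passage, i) => `[${i}] ${passage}`)
    .join("\n");
  return (
    "Answer using only the numbered sources below, citing them by index like [0].\n\n" +
    numbered +
    "\n\nQuestion: " + query
  );
}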

POST /analyze

Text analysis for educational and research use cases.

Request:

{
  "text": "Your long text here...",
  "task": "summarize",
  "model": "llama3",
  "temperature": 0.3
}

Tasks: summarize, evaluate, explain, extract

Response:

{
  "result": "Summary of the text...",
  "task": "summarize",
  "model": "llama3",
  "tokensUsed": 80
}
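
The documented request shape maps cleanly onto a TypeScript string union. A sketch of what backend/lib/types.ts might declare (assumed shape, not the actual code):

// Assumed shape based on the documented request; lib/types.ts may differ.
type AnalysisTask = "summarize" | "evaluate" | "explain" | "extract";

interface AnalyzeRequest {
  text: string;           // the text to analyze
  task: AnalysisTask;     // one of the four supported tasks
  model?: string;         // optional; defaults to the configured model
  temperature?: number;   // optional sampling temperature
}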

GET /models

List all available LLM models.

Response:

{
  "provider": "ollama",
  "models": [
    {
      "name": "llama3",
      "size": "4.7 GB",
      "description": "llama3 - Modified 1/2/2025",
      "provider": "ollama"
    }
  ]
}

GET /health

System health and uptime monitoring.

Response:

{
  "status": "healthy",
  "uptime": 3600,
  "provider": "huggingface",
  "modelsAvailable": true,
  "cache": {
    "chat": {"size": 10, "maxEntries": 100, "ttl": 300},
    "rag": {"size": 5, "maxEntries": 50, "ttl": 600},
    "analysis": {"size": 2, "maxEntries": 30, "ttl": 900}
  }
}
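
A quick liveness check from the command line:

curl http://localhost:4000/health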

πŸ”§ Configuration

Required Secrets

Secret             Description                        Example
LLMProvider        Provider to use                    ollama or huggingface
OllamaBaseURL      Ollama API URL (if using Ollama)   http://localhost:11434
HuggingFaceAPIKey  HF token (if using Hugging Face)   hf_xxxxxxxxxxxxx
DefaultModel       Default model (optional)           llama3 or mistralai/Mistral-7B-Instruct-v0.2

Setting Secrets

Encore Cloud:

encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token

Hugging Face Spaces: Add secrets in Space Settings β†’ Repository secrets
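
Inside the backend, Encore.ts secrets are declared with the secret helper and read by calling the returned accessor. A minimal sketch (the actual wiring lives in backend/lib and may differ):

import { secret } from "encore.dev/config";

// Declare once at module scope; Encore resolves the value per environment.
const llmProvider = secret("LLMProvider");
const huggingFaceAPIKey = secret("HuggingFaceAPIKey");

// Read at runtime by invoking the accessor.
const provider = llmProvider(); // "ollama" or "huggingface"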

πŸ—οΈ Architecture

backend/
β”œβ”€β”€ chat/                    # Conversational AI endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── chat.ts
β”œβ”€β”€ rag/                     # RAG endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── rag.ts
β”œβ”€β”€ analyze/                 # Text analysis endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── analyze.ts
β”œβ”€β”€ models/                  # Model listing endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── models.ts
β”œβ”€β”€ health/                  # Health check endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── health.ts
└── lib/                     # Shared utilities
    β”œβ”€β”€ types.ts            # TypeScript types
    β”œβ”€β”€ cache.ts            # In-memory caching
    β”œβ”€β”€ llm-provider.ts     # Provider abstraction
    β”œβ”€β”€ ollama-client.ts    # Ollama integration
    └── huggingface-client.ts # Hugging Face integration
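
The key seam is lib/llm-provider.ts: endpoints talk to a single interface, and the concrete client (Ollama or Hugging Face) is chosen from configuration, so no endpoint branches on the provider. A hedged sketch of what that abstraction might look like (names are assumptions, not the actual code):

// Hypothetical provider interface; the real lib/llm-provider.ts may differ.
interface CompletionRequest {
  prompt: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
}

interface CompletionResult {
  text: string;
  model: string;
  tokensUsed: number;
}

interface LLMProvider {
  complete(req: CompletionRequest): Promise<CompletionResult>;
  listModels(): Promise<string[]>;
}

// A factory would return an OllamaClient or HuggingFaceClient
// depending on the LLMProvider secret.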

🎯 Use Cases

  • πŸ’¬ Chatbots - Build conversational AI applications
  • πŸ“š RAG Systems - Create context-aware Q&A systems
  • πŸŽ“ Education - Analyze and explain complex texts
  • πŸ”¬ Research - Summarize and extract key information
  • πŸ€– AI Agents - Backend for autonomous AI systems
  • πŸ“Š Content Analysis - Evaluate and process documents

πŸš€ Deployment Options

1. Encore Cloud (Recommended for Production)

encore deploy

  • Automatic scaling
  • Built-in monitoring
  • Type-safe service-to-service calls
  • Zero infrastructure management

2. Hugging Face Spaces (Great for Demos)

  • See README.space.md
  • Free hosting for public projects
  • Easy model integration
  • Community visibility

3. Docker

docker build -t llm-api .
docker run -p 7860:7860 \
  -e LLMProvider=huggingface \
  -e HuggingFaceAPIKey=your_key \
  llm-api

4. Self-Hosted

npm install -g encore.dev
encore run --port 8080

πŸ“Š Performance

  • Caching - Serves repeated identical requests from memory, cutting LLM calls by up to 80% on repetitive workloads
  • Async/Await - Non-blocking concurrent requests
  • Lightweight - Minimal dependencies for fast startup
  • Efficient - Optimized for serverless environments

Cache Configuration:

  • Chat: 300s TTL, 100 max entries
  • RAG: 600s TTL, 50 max entries
  • Analysis: 900s TTL, 30 max entries
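
A minimal sketch of an in-memory TTL cache with the documented limits (assumed shape; the real lib/cache.ts may differ):

// Hypothetical TTL cache; the real lib/cache.ts may differ.
class TTLCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number, private ttlSeconds: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: clean up lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Evict the oldest entry when full (Map preserves insertion order).
    if (this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, {
      value,
      expiresAt: Date.now() + this.ttlSeconds * 1000,
    });
  }
}

// Matching the documented chat configuration:
const chatCache = new TTLCache<string>(100, 300);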

πŸ” Security Best Practices

βœ… API keys stored as secrets, never in code
βœ… No sensitive data in logs
βœ… Type-safe request validation
βœ… Error messages don't leak internals
βœ… CORS configured for frontend integration

πŸ› οΈ Development

# Install Encore
npm install -g encore.dev

# Run with hot reload
encore run

# Run tests
encore test

# Type check
encore build

πŸ“ Example: Frontend Integration

// Auto-generated type-safe client
import backend from '~backend/client';

// Chat
const response = await backend.chat.chat({
  message: "Hello!",
  temperature: 0.7
});

// RAG
const ragResponse = await backend.rag.rag({
  query: "What is this about?",
  context: ["Document 1...", "Document 2..."]
});

// Analysis
const analysis = await backend.analyze.analyze({
  text: "Long text...",
  task: "summarize"
});

🀝 Contributing

Contributions welcome! This is a production-ready foundation that can be extended with:

  • Additional analysis tasks
  • Vector database integration for RAG
  • Streaming responses
  • Rate limiting middleware
  • Authentication
  • Model fine-tuning endpoints

πŸ“„ License

MIT License - feel free to use in your projects!

Built with ❀️ using Encore.ts