Spaces:
Runtime error
π€ Production-Ready LLM API Backend
A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with Encore.ts for easy deployment to Encore Cloud or Hugging Face Spaces.
β¨ Features
- π― 5 Core Endpoints - Chat, RAG, Analysis, Models, Health
- π Dual Provider Support - Ollama (local) or Hugging Face (cloud)
- β‘ Smart Caching - In-memory cache with TTL and automatic cleanup
- π‘οΈ Type-Safe - Full TypeScript support with end-to-end type safety
- π¦ Production Ready - Comprehensive error handling, logging, and monitoring
- π Zero Config - Works out of the box on multiple platforms
π Quick Start
Local Development
# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434
# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2
# Run locally
encore run
# Test the API
curl -X POST http://localhost:4000/chat \
-H "Content-Type: application/json" \
-d '{"message": "Explain AI in simple terms"}'
Deploy to Encore Cloud
encore deploy
Your API will be live at: https://staging-<your-app>.encr.app
Deploy to Hugging Face Spaces
See README.space.md for complete Hugging Face Spaces deployment instructions.
Quick summary:
- Create a new Docker Space on Hugging Face
- Push this repository to your Space
- Configure secrets in Space settings
- Your API is live!
π‘ API Endpoints
POST /chat
Conversational AI with intelligent caching.
Request:
{
"message": "Explain quantum computing",
"model": "llama3",
"temperature": 0.7,
"maxTokens": 500,
"systemPrompt": "You are a helpful assistant"
}
Response:
{
"response": "Quantum computing is...",
"model": "llama3",
"tokensUsed": 150
}
POST /rag
Retrieval-Augmented Generation with source tracking.
Request:
{
"query": "What is the main topic?",
"context": [
"Quantum computing uses qubits...",
"Classical computers use bits..."
],
"model": "mistral",
"temperature": 0.5
}
Response:
{
"response": "Based on [0] and [1], the main topic is...",
"model": "mistral",
"tokensUsed": 120,
"sources": [0, 1]
}
POST /analyze
Text analysis for educational and research use cases.
Request:
{
"text": "Your long text here...",
"task": "summarize",
"model": "llama3",
"temperature": 0.3
}
Tasks: summarize, evaluate, explain, extract
Response:
{
"result": "Summary of the text...",
"task": "summarize",
"model": "llama3",
"tokensUsed": 80
}
GET /models
List all available LLM models.
Response:
{
"provider": "ollama",
"models": [
{
"name": "llama3",
"size": "4.7 GB",
"description": "llama3 - Modified 1/2/2025",
"provider": "ollama"
}
]
}
GET /health
System health and uptime monitoring.
Response:
{
"status": "healthy",
"uptime": 3600,
"provider": "huggingface",
"modelsAvailable": true,
"cache": {
"chat": {"size": 10, "maxEntries": 100, "ttl": 300},
"rag": {"size": 5, "maxEntries": 50, "ttl": 600},
"analysis": {"size": 2, "maxEntries": 30, "ttl": 900}
}
}
π§ Configuration
Required Secrets
| Secret | Description | Example |
|---|---|---|
LLMProvider |
Provider to use | ollama or huggingface |
OllamaBaseURL |
Ollama API URL (if using Ollama) | http://localhost:11434 |
HuggingFaceAPIKey |
HF token (if using Hugging Face) | hf_xxxxxxxxxxxxx |
DefaultModel |
Default model (optional) | llama3 or mistralai/Mistral-7B-Instruct-v0.2 |
Setting Secrets
Encore Cloud:
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token
Hugging Face Spaces: Add secrets in Space Settings β Repository secrets
ποΈ Architecture
backend/
βββ chat/ # Conversational AI endpoint
β βββ encore.service.ts
β βββ chat.ts
βββ rag/ # RAG endpoint
β βββ encore.service.ts
β βββ rag.ts
βββ analyze/ # Text analysis endpoint
β βββ encore.service.ts
β βββ analyze.ts
βββ models/ # Model listing endpoint
β βββ encore.service.ts
β βββ models.ts
βββ health/ # Health check endpoint
β βββ encore.service.ts
β βββ health.ts
βββ lib/ # Shared utilities
βββ types.ts # TypeScript types
βββ cache.ts # In-memory caching
βββ llm-provider.ts # Provider abstraction
βββ ollama-client.ts # Ollama integration
βββ huggingface-client.ts # Hugging Face integration
π― Use Cases
- π¬ Chatbots - Build conversational AI applications
- π RAG Systems - Create context-aware Q&A systems
- π Education - Analyze and explain complex texts
- π¬ Research - Summarize and extract key information
- π€ AI Agents - Backend for autonomous AI systems
- π Content Analysis - Evaluate and process documents
π Deployment Options
1. Encore Cloud (Recommended for Production)
encore deploy
- Automatic scaling
- Built-in monitoring
- Type-safe service-to-service calls
- Zero infrastructure management
2. Hugging Face Spaces (Great for Demos)
- See README.space.md
- Free hosting for public projects
- Easy model integration
- Community visibility
3. Docker
docker build -t llm-api .
docker run -p 7860:7860 \
-e LLMProvider=huggingface \
-e HuggingFaceAPIKey=your_key \
llm-api
4. Self-Hosted
npm install -g encore.dev
encore run --port 8080
π Performance
- Caching - Reduces redundant LLM calls by up to 80%
- Async/Await - Non-blocking concurrent requests
- Lightweight - Minimal dependencies for fast startup
- Efficient - Optimized for serverless environments
Cache Configuration:
- Chat: 300s TTL, 100 max entries
- RAG: 600s TTL, 50 max entries
- Analysis: 900s TTL, 30 max entries
π Security Best Practices
β
API keys stored as secrets, never in code
β
No sensitive data in logs
β
Type-safe request validation
β
Error messages don't leak internals
β
CORS configured for frontend integration
π οΈ Development
# Install Encore
npm install -g encore.dev
# Run with hot reload
encore run
# Run tests
encore test
# Type check
encore build
π Example: Frontend Integration
// Auto-generated type-safe client
import backend from '~backend/client';
// Chat
const response = await backend.chat.chat({
message: "Hello!",
temperature: 0.7
});
// RAG
const ragResponse = await backend.rag.rag({
query: "What is this about?",
context: ["Document 1...", "Document 2..."]
});
// Analysis
const analysis = await backend.analyze.analyze({
text: "Long text...",
task: "summarize"
});
π€ Contributing
Contributions welcome! This is a production-ready foundation that can be extended with:
- Additional analysis tasks
- Vector database integration for RAG
- Streaming responses
- Rate limiting middleware
- Authentication
- Model fine-tuning endpoints
π License
MIT License - feel free to use in your projects!
π Support
Built with β€οΈ using Encore.ts