# πŸ€– Production-Ready LLM API Backend
A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with [Encore.ts](https://encore.dev) for easy deployment to Encore Cloud or Hugging Face Spaces.
## ✨ Features
- 🎯 **5 Core Endpoints** - Chat, RAG, Analysis, Models, Health
- πŸ”„ **Dual Provider Support** - Ollama (local) or Hugging Face (cloud)
- ⚑ **Smart Caching** - In-memory cache with TTL and automatic cleanup
- πŸ›‘οΈ **Type-Safe** - Full TypeScript support with end-to-end type safety
- πŸ“¦ **Production Ready** - Comprehensive error handling, logging, and monitoring
- πŸš€ **Zero Config** - Works out of the box on multiple platforms
## πŸš€ Quick Start
### Local Development
```bash
# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434

# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2

# Run locally
encore run

# Test the API
curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain AI in simple terms"}'
```
### Deploy to Encore Cloud
```bash
encore deploy
```
Your API will be live at: `https://staging-<your-app>.encr.app`
### Deploy to Hugging Face Spaces
See [README.space.md](./README.space.md) for complete Hugging Face Spaces deployment instructions.
**Quick summary:**
1. Create a new Docker Space on Hugging Face
2. Push this repository to your Space
3. Configure secrets in Space settings
4. Your API is live!
## πŸ“‘ API Endpoints
### POST `/chat`
Conversational AI with intelligent caching.
**Request:**
```json
{
  "message": "Explain quantum computing",
  "model": "llama3",
  "temperature": 0.7,
  "maxTokens": 500,
  "systemPrompt": "You are a helpful assistant"
}
```
**Response:**
```json
{
  "response": "Quantum computing is...",
  "model": "llama3",
  "tokensUsed": 150
}
```
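Outside the auto-generated Encore client, this is plain HTTP. A minimal TypeScript sketch of a typed caller (the interfaces mirror the request/response shapes above; the helper names are illustrative, not part of the API):

```typescript
// Shapes taken from the documented /chat request and response.
interface ChatRequest {
  message: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
  systemPrompt?: string;
}

interface ChatResponse {
  response: string;
  model: string;
  tokensUsed: number;
}

// Hypothetical helper: fill in defaults, let callers override.
function buildChatRequest(message: string, opts: Partial<ChatRequest> = {}): ChatRequest {
  return { message, temperature: 0.7, ...opts };
}

// POST the request with the global fetch (Node 18+ or browsers).
async function chat(baseUrl: string, req: ChatRequest): Promise<ChatResponse> {
  const res = await fetch(`${baseUrl}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`chat failed: ${res.status}`);
  return (await res.json()) as ChatResponse;
}
```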
### POST `/rag`
Retrieval-Augmented Generation with source tracking.
**Request:**
```json
{
  "query": "What is the main topic?",
  "context": [
    "Quantum computing uses qubits...",
    "Classical computers use bits..."
  ],
  "model": "mistral",
  "temperature": 0.5
}
```
**Response:**
```json
{
  "response": "Based on [0] and [1], the main topic is...",
  "model": "mistral",
  "tokensUsed": 120,
  "sources": [0, 1]
}
```
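The bracketed indices in the response line up with positions in the `context` array. As an illustration of the idea only (not the actual code in the `rag` service), numbering the chunks and extracting citations could look like:

```typescript
// Number each context chunk so the model can cite it as [i].
function buildRagPrompt(query: string, context: string[]): string {
  const numbered = context.map((c, i) => `[${i}] ${c}`).join("\n");
  return `Context:\n${numbered}\n\nQuestion: ${query}\nCite sources by index, e.g. [0].`;
}

// Collect the in-range indices the model actually cited,
// which is what a "sources" field like the one above reports.
function citedSources(response: string, contextLength: number): number[] {
  const seen = new Set<number>();
  for (const m of response.matchAll(/\[(\d+)\]/g)) {
    const i = Number(m[1]);
    if (i < contextLength) seen.add(i);
  }
  return [...seen].sort((a, b) => a - b);
}
```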
### POST `/analyze`
Text analysis for educational and research use cases.
**Request:**
```json
{
  "text": "Your long text here...",
  "task": "summarize",
  "model": "llama3",
  "temperature": 0.3
}
```
**Tasks:** `summarize`, `evaluate`, `explain`, `extract`
**Response:**
```json
{
  "result": "Summary of the text...",
  "task": "summarize",
  "model": "llama3",
  "tokensUsed": 80
}
```
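Each task effectively selects a different instruction for the model. A sketch of such a mapping (the instruction wording here is invented for illustration, not the service's actual prompts):

```typescript
// The four documented tasks, as a closed union type.
type AnalysisTask = "summarize" | "evaluate" | "explain" | "extract";

// Hypothetical per-task instructions.
const TASK_INSTRUCTIONS: Record<AnalysisTask, string> = {
  summarize: "Summarize the following text concisely.",
  evaluate: "Evaluate the arguments and evidence in the following text.",
  explain: "Explain the following text in simple terms.",
  extract: "Extract the key facts from the following text as a list.",
};

// Prepend the task's instruction to the user's text.
function buildAnalysisPrompt(task: AnalysisTask, text: string): string {
  return `${TASK_INSTRUCTIONS[task]}\n\n${text}`;
}
```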
### GET `/models`
List all available LLM models.
**Response:**
```json
{
  "provider": "ollama",
  "models": [
    {
      "name": "llama3",
      "size": "4.7 GB",
      "description": "llama3 - Modified 1/2/2025",
      "provider": "ollama"
    }
  ]
}
```
### GET `/health`
System health and uptime monitoring.
**Response:**
```json
{
  "status": "healthy",
  "uptime": 3600,
  "provider": "huggingface",
  "modelsAvailable": true,
  "cache": {
    "chat": { "size": 10, "maxEntries": 100, "ttl": 300 },
    "rag": { "size": 5, "maxEntries": 50, "ttl": 600 },
    "analysis": { "size": 2, "maxEntries": 30, "ttl": 900 }
  }
}
```
## πŸ”§ Configuration
### Required Secrets
| Secret | Description | Example |
|--------|-------------|---------|
| `LLMProvider` | Provider to use | `ollama` or `huggingface` |
| `OllamaBaseURL` | Ollama API URL (if using Ollama) | `http://localhost:11434` |
| `HuggingFaceAPIKey` | HF token (if using Hugging Face) | `hf_xxxxxxxxxxxxx` |
| `DefaultModel` | Default model (optional) | `llama3` or `mistralai/Mistral-7B-Instruct-v0.2` |
### Setting Secrets
**Encore Cloud:**
```bash
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token
```
**Hugging Face Spaces:**
Add secrets in Space Settings β†’ Repository secrets
## πŸ—οΈ Architecture
```
backend/
β”œβ”€β”€ chat/                      # Conversational AI endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── chat.ts
β”œβ”€β”€ rag/                       # RAG endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── rag.ts
β”œβ”€β”€ analyze/                   # Text analysis endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── analyze.ts
β”œβ”€β”€ models/                    # Model listing endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── models.ts
β”œβ”€β”€ health/                    # Health check endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── health.ts
└── lib/                       # Shared utilities
    β”œβ”€β”€ types.ts               # TypeScript types
    β”œβ”€β”€ cache.ts               # In-memory caching
    β”œβ”€β”€ llm-provider.ts        # Provider abstraction
    β”œβ”€β”€ ollama-client.ts       # Ollama integration
    └── huggingface-client.ts  # Hugging Face integration
```
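`lib/llm-provider.ts` abstracts over the two backends so the endpoints don't care which one is configured. A hypothetical sketch of what such an interface could look like (names and fields are assumptions, not the actual file):

```typescript
// Assumed common request/result shapes shared by both clients.
interface CompletionRequest {
  prompt: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
}

interface CompletionResult {
  text: string;
  model: string;
  tokensUsed: number;
}

// One interface both ollama-client.ts and huggingface-client.ts
// could implement.
interface LLMProvider {
  name: "ollama" | "huggingface";
  complete(req: CompletionRequest): Promise<CompletionResult>;
  listModels(): Promise<string[]>;
}

// Pick the provider matching the LLMProvider secret's value.
function selectProvider(name: string, providers: LLMProvider[]): LLMProvider {
  const p = providers.find((x) => x.name === name);
  if (!p) throw new Error(`Unknown provider: ${name}`);
  return p;
}
```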
## 🎯 Use Cases
- πŸ’¬ **Chatbots** - Build conversational AI applications
- πŸ“š **RAG Systems** - Create context-aware Q&A systems
- πŸŽ“ **Education** - Analyze and explain complex texts
- πŸ”¬ **Research** - Summarize and extract key information
- πŸ€– **AI Agents** - Backend for autonomous AI systems
- πŸ“Š **Content Analysis** - Evaluate and process documents
## πŸš€ Deployment Options
### 1. Encore Cloud (Recommended for Production)
```bash
encore deploy
```
- Automatic scaling
- Built-in monitoring
- Type-safe service-to-service calls
- Zero infrastructure management
### 2. Hugging Face Spaces (Great for Demos)
- See [README.space.md](./README.space.md)
- Free hosting for public projects
- Easy model integration
- Community visibility
### 3. Docker
```bash
docker build -t llm-api .
docker run -p 7860:7860 \
  -e LLMProvider=huggingface \
  -e HuggingFaceAPIKey=your_key \
  llm-api
```
### 4. Self-Hosted
```bash
npm install -g encore.dev
encore run --port 8080
```
## πŸ“Š Performance
- **Caching** - Repeated identical requests within the TTL are served from memory, skipping the LLM call entirely
- **Async/Await** - Non-blocking concurrent requests
- **Lightweight** - Minimal dependencies for fast startup
- **Efficient** - Optimized for serverless environments
**Cache Configuration:**
- Chat: 300s TTL, 100 max entries
- RAG: 600s TTL, 50 max entries
- Analysis: 900s TTL, 30 max entries
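The behavior above (per-endpoint TTL plus a max-entry cap with cleanup) can be sketched with a small `Map`-based cache. This illustrates the mechanism only; it is not the actual `lib/cache.ts`:

```typescript
// Minimal TTL cache: entries expire after ttlMs, and the oldest
// entry is evicted once maxEntries is reached.
class TTLCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(key); // lazy cleanup on read
      return undefined;
    }
    return e.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxEntries && !this.entries.has(key)) {
      // Map preserves insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Mirroring the documented chat config: 100 entries, 300 s TTL.
const chatCache = new TTLCache<string>(100, 300_000);
```

`Map` iteration order is insertion order, which is what makes the simple oldest-entry eviction work without extra bookkeeping.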
## πŸ” Security Best Practices
- βœ… API keys stored as secrets, never in code
- βœ… No sensitive data in logs
- βœ… Type-safe request validation
- βœ… Error messages don't leak internals
- βœ… CORS configured for frontend integration
## πŸ› οΈ Development
```bash
# Install Encore
npm install -g encore.dev

# Run with hot reload
encore run

# Run tests
encore test

# Type check
encore build
```
## πŸ“ Example: Frontend Integration
```typescript
// Auto-generated type-safe client
import backend from '~backend/client';

// Chat
const response = await backend.chat.chat({
  message: "Hello!",
  temperature: 0.7
});

// RAG
const ragResponse = await backend.rag.rag({
  query: "What is this about?",
  context: ["Document 1...", "Document 2..."]
});

// Analysis
const analysis = await backend.analyze.analyze({
  text: "Long text...",
  task: "summarize"
});
```
## 🀝 Contributing
Contributions welcome! This is a production-ready foundation that can be extended with:
- Additional analysis tasks
- Vector database integration for RAG
- Streaming responses
- Rate limiting middleware
- Authentication
- Model fine-tuning endpoints
## πŸ“„ License
MIT License - feel free to use in your projects!
## πŸ†˜ Support
- [Encore Documentation](https://encore.dev/docs)
- [Hugging Face Spaces Docs](https://huggingface.co/docs/hub/spaces)
- [GitHub Issues](./issues)
---
**Built with** ❀️ using [Encore.ts](https://encore.dev)