πŸ€– Production-Ready LLM API Backend

A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with Encore.ts for easy deployment to Encore Cloud or Hugging Face Spaces.

✨ Features

  • 🎯 5 Core Endpoints - Chat, RAG, Analysis, Models, Health
  • πŸ”„ Dual Provider Support - Ollama (local) or Hugging Face (cloud)
  • ⚑ Smart Caching - In-memory cache with TTL and automatic cleanup
  • πŸ›‘οΈ Type-Safe - Full TypeScript support with end-to-end type safety
  • πŸ“¦ Production Ready - Comprehensive error handling, logging, and monitoring
  • πŸš€ Zero Config - Works out of the box on multiple platforms

πŸš€ Quick Start

Local Development

# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434

# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2

# Run locally
encore run

# Test the API
curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain AI in simple terms"}'

Deploy to Encore Cloud

encore deploy

Your API will be live at: https://staging-<your-app>.encr.app

Deploy to Hugging Face Spaces

See README.space.md for complete Hugging Face Spaces deployment instructions.

Quick summary:

  1. Create a new Docker Space on Hugging Face
  2. Push this repository to your Space
  3. Configure secrets in Space settings
  4. Your API is live!

πŸ“‘ API Endpoints

POST /chat

Conversational AI with intelligent caching.

Request:

{
  "message": "Explain quantum computing",
  "model": "llama3",
  "temperature": 0.7,
  "maxTokens": 500,
  "systemPrompt": "You are a helpful assistant"
}

Response:

{
  "response": "Quantum computing is...",
  "model": "llama3",
  "tokensUsed": 150
}
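
The optional fields (model, temperature, maxTokens, systemPrompt) can all be omitted, in which case the backend presumably falls back to the configured DefaultModel. A fuller curl invocation of the request above:

curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "Explain quantum computing",
    "temperature": 0.7,
    "maxTokens": 500
  }'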

POST /rag

Retrieval-Augmented Generation with source tracking.

Request:

{
  "query": "What is the main topic?",
  "context": [
    "Quantum computing uses qubits...",
    "Classical computers use bits..."
  ],
  "model": "mistral",
  "temperature": 0.5
}

Response:

{
  "response": "Based on [0] and [1], the main topic is...",
  "model": "mistral",
  "tokensUsed": 120,
  "sources": [0, 1]
}
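
The indices in sources point into the context array, so [0] cites the first passage. A minimal sketch of how such a prompt could be assembled (hypothetical helper; the actual construction in backend/rag/rag.ts may differ):

// Hypothetical helper: number each context passage so the model can cite it.
// The real prompt construction in backend/rag/rag.ts may differ.
function buildRagPrompt(query: string, context: string[]): string {
  const numbered = context
    .map((passage, i) => `[${i}] ${passage}`)
    .join("\n");
  return (
    "Answer using only the numbered sources below, citing them by index like [0].\n\n" +
    numbered +
    "\n\nQuestion: " + query
  );
}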

POST /analyze

Text analysis for educational and research use cases.

Request:

{
  "text": "Your long text here...",
  "task": "summarize",
  "model": "llama3",
  "temperature": 0.3
}

Tasks: summarize, evaluate, explain, extract

Response:

{
  "result": "Summary of the text...",
  "task": "summarize",
  "model": "llama3",
  "tokensUsed": 80
}
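
The documented request shape maps cleanly onto a TypeScript string union. A sketch of what backend/lib/types.ts might declare (assumed shape, not the actual code):

// Assumed shape based on the documented request; lib/types.ts may differ.
type AnalysisTask = "summarize" | "evaluate" | "explain" | "extract";

interface AnalyzeRequest {
  text: string;           // the text to analyze
  task: AnalysisTask;     // one of the four supported tasks
  model?: string;         // optional; defaults to the configured model
  temperature?: number;   // optional sampling temperature
}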

GET /models

List all available LLM models.

Response:

{
  "provider": "ollama",
  "models": [
    {
      "name": "llama3",
      "size": "4.7 GB",
      "description": "llama3 - Modified 1/2/2025",
      "provider": "ollama"
    }
  ]
}

GET /health

System health and uptime monitoring.

Response:

{
  "status": "healthy",
  "uptime": 3600,
  "provider": "huggingface",
  "modelsAvailable": true,
  "cache": {
    "chat": {"size": 10, "maxEntries": 100, "ttl": 300},
    "rag": {"size": 5, "maxEntries": 50, "ttl": 600},
    "analysis": {"size": 2, "maxEntries": 30, "ttl": 900}
  }
}
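
A quick liveness check from the command line:

curl http://localhost:4000/health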

πŸ”§ Configuration

Required Secrets

Secret             Description                        Example
LLMProvider        Provider to use                    ollama or huggingface
OllamaBaseURL      Ollama API URL (if using Ollama)   http://localhost:11434
HuggingFaceAPIKey  HF token (if using Hugging Face)   hf_xxxxxxxxxxxxx
DefaultModel       Default model (optional)           llama3 or mistralai/Mistral-7B-Instruct-v0.2

Setting Secrets

Encore Cloud:

encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token

Hugging Face Spaces: Add secrets in Space Settings β†’ Repository secrets
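
Inside the backend, Encore.ts secrets are declared with the secret helper and read by calling the returned accessor. A minimal sketch (the actual wiring lives in backend/lib and may differ):

import { secret } from "encore.dev/config";

// Declare once at module scope; Encore resolves the value per environment.
const llmProvider = secret("LLMProvider");
const huggingFaceAPIKey = secret("HuggingFaceAPIKey");

// Read at runtime by invoking the accessor.
const provider = llmProvider(); // "ollama" or "huggingface"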

πŸ—οΈ Architecture

backend/
β”œβ”€β”€ chat/                    # Conversational AI endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── chat.ts
β”œβ”€β”€ rag/                     # RAG endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── rag.ts
β”œβ”€β”€ analyze/                 # Text analysis endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── analyze.ts
β”œβ”€β”€ models/                  # Model listing endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── models.ts
β”œβ”€β”€ health/                  # Health check endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── health.ts
└── lib/                     # Shared utilities
    β”œβ”€β”€ types.ts            # TypeScript types
    β”œβ”€β”€ cache.ts            # In-memory caching
    β”œβ”€β”€ llm-provider.ts     # Provider abstraction
    β”œβ”€β”€ ollama-client.ts    # Ollama integration
    └── huggingface-client.ts # Hugging Face integration
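
The key seam is lib/llm-provider.ts: endpoints talk to a single interface, and the concrete client (Ollama or Hugging Face) is chosen from configuration, so no endpoint branches on the provider. A hedged sketch of what that abstraction might look like (names are assumptions, not the actual code):

// Hypothetical provider interface; the real lib/llm-provider.ts may differ.
interface CompletionRequest {
  prompt: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
}

interface CompletionResult {
  text: string;
  model: string;
  tokensUsed: number;
}

interface LLMProvider {
  complete(req: CompletionRequest): Promise<CompletionResult>;
  listModels(): Promise<string[]>;
}

// A factory would return an OllamaClient or HuggingFaceClient
// depending on the LLMProvider secret.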

🎯 Use Cases

  • πŸ’¬ Chatbots - Build conversational AI applications
  • πŸ“š RAG Systems - Create context-aware Q&A systems
  • πŸŽ“ Education - Analyze and explain complex texts
  • πŸ”¬ Research - Summarize and extract key information
  • πŸ€– AI Agents - Backend for autonomous AI systems
  • πŸ“Š Content Analysis - Evaluate and process documents

πŸš€ Deployment Options

1. Encore Cloud (Recommended for Production)

encore deploy

  • Automatic scaling
  • Built-in monitoring
  • Type-safe service-to-service calls
  • Zero infrastructure management

2. Hugging Face Spaces (Great for Demos)

  • See README.space.md
  • Free hosting for public projects
  • Easy model integration
  • Community visibility

3. Docker

docker build -t llm-api .
docker run -p 7860:7860 \
  -e LLMProvider=huggingface \
  -e HuggingFaceAPIKey=your_key \
  llm-api

4. Self-Hosted

npm install -g encore.dev
encore run --port 8080

πŸ“Š Performance

  • Caching - Serves repeated identical requests from memory, cutting LLM calls by up to 80% on repetitive workloads
  • Async/Await - Non-blocking concurrent requests
  • Lightweight - Minimal dependencies for fast startup
  • Efficient - Optimized for serverless environments

Cache Configuration:

  • Chat: 300s TTL, 100 max entries
  • RAG: 600s TTL, 50 max entries
  • Analysis: 900s TTL, 30 max entries
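
A minimal sketch of an in-memory TTL cache with the documented limits (assumed shape; the real lib/cache.ts may differ):

// Hypothetical TTL cache; the real lib/cache.ts may differ.
class TTLCache<V> {
  private store = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number, private ttlSeconds: number) {}

  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.store.delete(key); // expired: clean up lazily
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    // Evict the oldest entry when full (Map preserves insertion order).
    if (this.store.size >= this.maxEntries) {
      const oldest = this.store.keys().next().value;
      if (oldest !== undefined) this.store.delete(oldest);
    }
    this.store.set(key, {
      value,
      expiresAt: Date.now() + this.ttlSeconds * 1000,
    });
  }
}

// Matching the documented chat configuration:
const chatCache = new TTLCache<string>(100, 300);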

πŸ” Security Best Practices

βœ… API keys stored as secrets, never in code
βœ… No sensitive data in logs
βœ… Type-safe request validation
βœ… Error messages don't leak internals
βœ… CORS configured for frontend integration

πŸ› οΈ Development

# Install Encore
npm install -g encore.dev

# Run with hot reload
encore run

# Run tests
encore test

# Type check
encore build

πŸ“ Example: Frontend Integration

// Auto-generated type-safe client
import backend from '~backend/client';

// Chat
const response = await backend.chat.chat({
  message: "Hello!",
  temperature: 0.7
});

// RAG
const ragResponse = await backend.rag.rag({
  query: "What is this about?",
  context: ["Document 1...", "Document 2..."]
});

// Analysis
const analysis = await backend.analyze.analyze({
  text: "Long text...",
  task: "summarize"
});

🀝 Contributing

Contributions welcome! This is a production-ready foundation that can be extended with:

  • Additional analysis tasks
  • Vector database integration for RAG
  • Streaming responses
  • Rate limiting middleware
  • Authentication
  • Model fine-tuning endpoints

πŸ“„ License

MIT License - feel free to use in your projects!

Built with ❀️ using Encore.ts