# πŸ€– Production-Ready LLM API Backend
A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with [Encore.ts](https://encore.dev) for easy deployment to Encore Cloud or Hugging Face Spaces.
## ✨ Features
- 🎯 **5 Core Endpoints** - Chat, RAG, Analysis, Models, Health
- πŸ”„ **Dual Provider Support** - Ollama (local) or Hugging Face (cloud)
- ⚑ **Smart Caching** - In-memory cache with TTL and automatic cleanup
- πŸ›‘οΈ **Type-Safe** - Full TypeScript support with end-to-end type safety
- πŸ“¦ **Production Ready** - Comprehensive error handling, logging, and monitoring
- πŸš€ **Zero Config** - Works out of the box on multiple platforms
## πŸš€ Quick Start
### Local Development
```bash
# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434

# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2

# Run locally
encore run

# Test the API
curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain AI in simple terms"}'
```
### Deploy to Encore Cloud
```bash
encore deploy
```
Your API will be live at: `https://staging-<your-app>.encr.app`
### Deploy to Hugging Face Spaces
See [README.space.md](./README.space.md) for complete Hugging Face Spaces deployment instructions.
**Quick summary:**
1. Create a new Docker Space on Hugging Face
2. Push this repository to your Space
3. Configure secrets in Space settings
4. Your API is live!
## πŸ“‘ API Endpoints
### POST `/chat`
Conversational AI with intelligent caching.
**Request:**
```json
{
  "message": "Explain quantum computing",
  "model": "llama3",
  "temperature": 0.7,
  "maxTokens": 500,
  "systemPrompt": "You are a helpful assistant"
}
```
**Response:**
```json
{
  "response": "Quantum computing is...",
  "model": "llama3",
  "tokensUsed": 150
}
```
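Outside the auto-generated Encore client, this is plain HTTP. A minimal TypeScript sketch of a typed caller (the interfaces mirror the request/response shapes above; the helper names are illustrative, not part of the API):

```typescript
// Shapes taken from the documented /chat request and response.
interface ChatRequest {
  message: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
  systemPrompt?: string;
}

interface ChatResponse {
  response: string;
  model: string;
  tokensUsed: number;
}

// Hypothetical helper: fill in defaults, let callers override.
function buildChatRequest(message: string, opts: Partial<ChatRequest> = {}): ChatRequest {
  return { message, temperature: 0.7, ...opts };
}

// POST the request with the global fetch (Node 18+ or browsers).
async function chat(baseUrl: string, req: ChatRequest): Promise<ChatResponse> {
  const res = await fetch(`${baseUrl}/chat`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`chat failed: ${res.status}`);
  return (await res.json()) as ChatResponse;
}
```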
### POST `/rag`
Retrieval-Augmented Generation with source tracking.
**Request:**
```json
{
  "query": "What is the main topic?",
  "context": [
    "Quantum computing uses qubits...",
    "Classical computers use bits..."
  ],
  "model": "mistral",
  "temperature": 0.5
}
```
**Response:**
```json
{
  "response": "Based on [0] and [1], the main topic is...",
  "model": "mistral",
  "tokensUsed": 120,
  "sources": [0, 1]
}
```
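The bracketed indices in the response line up with positions in the `context` array. As an illustration of the idea only (not the actual code in the `rag` service), numbering the chunks and extracting citations could look like:

```typescript
// Number each context chunk so the model can cite it as [i].
function buildRagPrompt(query: string, context: string[]): string {
  const numbered = context.map((c, i) => `[${i}] ${c}`).join("\n");
  return `Context:\n${numbered}\n\nQuestion: ${query}\nCite sources by index, e.g. [0].`;
}

// Collect the in-range indices the model actually cited,
// which is what a "sources" field like the one above reports.
function citedSources(response: string, contextLength: number): number[] {
  const seen = new Set<number>();
  for (const m of response.matchAll(/\[(\d+)\]/g)) {
    const i = Number(m[1]);
    if (i < contextLength) seen.add(i);
  }
  return [...seen].sort((a, b) => a - b);
}
```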
### POST `/analyze`
Text analysis for educational and research use cases.
**Request:**
```json
{
  "text": "Your long text here...",
  "task": "summarize",
  "model": "llama3",
  "temperature": 0.3
}
```
**Tasks:** `summarize`, `evaluate`, `explain`, `extract`
**Response:**
```json
{
  "result": "Summary of the text...",
  "task": "summarize",
  "model": "llama3",
  "tokensUsed": 80
}
```
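Each task effectively selects a different instruction for the model. A sketch of such a mapping (the instruction wording here is invented for illustration, not the service's actual prompts):

```typescript
// The four documented tasks, as a closed union type.
type AnalysisTask = "summarize" | "evaluate" | "explain" | "extract";

// Hypothetical per-task instructions.
const TASK_INSTRUCTIONS: Record<AnalysisTask, string> = {
  summarize: "Summarize the following text concisely.",
  evaluate: "Evaluate the arguments and evidence in the following text.",
  explain: "Explain the following text in simple terms.",
  extract: "Extract the key facts from the following text as a list.",
};

// Prepend the task's instruction to the user's text.
function buildAnalysisPrompt(task: AnalysisTask, text: string): string {
  return `${TASK_INSTRUCTIONS[task]}\n\n${text}`;
}
```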
### GET `/models`
List all available LLM models.
**Response:**
```json
{
  "provider": "ollama",
  "models": [
    {
      "name": "llama3",
      "size": "4.7 GB",
      "description": "llama3 - Modified 1/2/2025",
      "provider": "ollama"
    }
  ]
}
```
### GET `/health`
System health and uptime monitoring.
**Response:**
```json
{
  "status": "healthy",
  "uptime": 3600,
  "provider": "huggingface",
  "modelsAvailable": true,
  "cache": {
    "chat": { "size": 10, "maxEntries": 100, "ttl": 300 },
    "rag": { "size": 5, "maxEntries": 50, "ttl": 600 },
    "analysis": { "size": 2, "maxEntries": 30, "ttl": 900 }
  }
}
```
## πŸ”§ Configuration
### Required Secrets
| Secret | Description | Example |
|--------|-------------|---------|
| `LLMProvider` | Provider to use | `ollama` or `huggingface` |
| `OllamaBaseURL` | Ollama API URL (if using Ollama) | `http://localhost:11434` |
| `HuggingFaceAPIKey` | HF token (if using Hugging Face) | `hf_xxxxxxxxxxxxx` |
| `DefaultModel` | Default model (optional) | `llama3` or `mistralai/Mistral-7B-Instruct-v0.2` |
### Setting Secrets
**Encore Cloud:**
```bash
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token
```
**Hugging Face Spaces:**
Add secrets in Space Settings β†’ Repository secrets
## πŸ—οΈ Architecture
```
backend/
β”œβ”€β”€ chat/                      # Conversational AI endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── chat.ts
β”œβ”€β”€ rag/                       # RAG endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── rag.ts
β”œβ”€β”€ analyze/                   # Text analysis endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── analyze.ts
β”œβ”€β”€ models/                    # Model listing endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── models.ts
β”œβ”€β”€ health/                    # Health check endpoint
β”‚   β”œβ”€β”€ encore.service.ts
β”‚   └── health.ts
└── lib/                       # Shared utilities
    β”œβ”€β”€ types.ts               # TypeScript types
    β”œβ”€β”€ cache.ts               # In-memory caching
    β”œβ”€β”€ llm-provider.ts        # Provider abstraction
    β”œβ”€β”€ ollama-client.ts       # Ollama integration
    └── huggingface-client.ts  # Hugging Face integration
```
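`lib/llm-provider.ts` abstracts over the two backends so the endpoints don't care which one is configured. A hypothetical sketch of what such an interface could look like (names and fields are assumptions, not the actual file):

```typescript
// Assumed common request/result shapes shared by both clients.
interface CompletionRequest {
  prompt: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
}

interface CompletionResult {
  text: string;
  model: string;
  tokensUsed: number;
}

// One interface both ollama-client.ts and huggingface-client.ts
// could implement.
interface LLMProvider {
  name: "ollama" | "huggingface";
  complete(req: CompletionRequest): Promise<CompletionResult>;
  listModels(): Promise<string[]>;
}

// Pick the provider matching the LLMProvider secret's value.
function selectProvider(name: string, providers: LLMProvider[]): LLMProvider {
  const p = providers.find((x) => x.name === name);
  if (!p) throw new Error(`Unknown provider: ${name}`);
  return p;
}
```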
## 🎯 Use Cases
- πŸ’¬ **Chatbots** - Build conversational AI applications
- πŸ“š **RAG Systems** - Create context-aware Q&A systems
- πŸŽ“ **Education** - Analyze and explain complex texts
- πŸ”¬ **Research** - Summarize and extract key information
- πŸ€– **AI Agents** - Backend for autonomous AI systems
- πŸ“Š **Content Analysis** - Evaluate and process documents
## πŸš€ Deployment Options
### 1. Encore Cloud (Recommended for Production)
```bash
encore deploy
```
- Automatic scaling
- Built-in monitoring
- Type-safe service-to-service calls
- Zero infrastructure management
### 2. Hugging Face Spaces (Great for Demos)
- See [README.space.md](./README.space.md)
- Free hosting for public projects
- Easy model integration
- Community visibility
### 3. Docker
```bash
docker build -t llm-api .
docker run -p 7860:7860 \
  -e LLMProvider=huggingface \
  -e HuggingFaceAPIKey=your_key \
  llm-api
```
### 4. Self-Hosted
```bash
npm install -g encore.dev
encore run --port 8080
```
## πŸ“Š Performance
- **Caching** - Repeated identical requests within the TTL are served from memory, skipping the LLM call entirely
- **Async/Await** - Non-blocking concurrent requests
- **Lightweight** - Minimal dependencies for fast startup
- **Efficient** - Optimized for serverless environments
**Cache Configuration:**
- Chat: 300s TTL, 100 max entries
- RAG: 600s TTL, 50 max entries
- Analysis: 900s TTL, 30 max entries
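The behavior above (per-endpoint TTL plus a max-entry cap with cleanup) can be sketched with a small `Map`-based cache. This illustrates the mechanism only; it is not the actual `lib/cache.ts`:

```typescript
// Minimal TTL cache: entries expire after ttlMs, and the oldest
// entry is evicted once maxEntries is reached.
class TTLCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(private maxEntries: number, private ttlMs: number) {}

  get(key: string): V | undefined {
    const e = this.entries.get(key);
    if (!e) return undefined;
    if (Date.now() > e.expiresAt) {
      this.entries.delete(key); // lazy cleanup on read
      return undefined;
    }
    return e.value;
  }

  set(key: string, value: V): void {
    if (this.entries.size >= this.maxEntries && !this.entries.has(key)) {
      // Map preserves insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }

  get size(): number {
    return this.entries.size;
  }
}

// Mirroring the documented chat config: 100 entries, 300 s TTL.
const chatCache = new TTLCache<string>(100, 300_000);
```

`Map` iteration order is insertion order, which is what makes the simple oldest-entry eviction work without extra bookkeeping.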
## πŸ” Security Best Practices
- βœ… API keys stored as secrets, never in code
- βœ… No sensitive data in logs
- βœ… Type-safe request validation
- βœ… Error messages don't leak internals
- βœ… CORS configured for frontend integration
## πŸ› οΈ Development
```bash
# Install Encore
npm install -g encore.dev

# Run with hot reload
encore run

# Run tests
encore test

# Type check
encore build
```
## πŸ“ Example: Frontend Integration
```typescript
// Auto-generated type-safe client
import backend from '~backend/client';

// Chat
const response = await backend.chat.chat({
  message: "Hello!",
  temperature: 0.7
});

// RAG
const ragResponse = await backend.rag.rag({
  query: "What is this about?",
  context: ["Document 1...", "Document 2..."]
});

// Analysis
const analysis = await backend.analyze.analyze({
  text: "Long text...",
  task: "summarize"
});
```
## 🀝 Contributing
Contributions welcome! This is a production-ready foundation that can be extended with:
- Additional analysis tasks
- Vector database integration for RAG
- Streaming responses
- Rate limiting middleware
- Authentication
- Model fine-tuning endpoints
## πŸ“„ License
MIT License - feel free to use in your projects!
## πŸ†˜ Support
- [Encore Documentation](https://encore.dev/docs)
- [Hugging Face Spaces Docs](https://huggingface.co/docs/hub/spaces)
- [GitHub Issues](./issues)
---
**Built with** ❀️ using [Encore.ts](https://encore.dev)