# 🤖 Production-Ready LLM API Backend

A flexible, high-performance REST API for LLM capabilities including conversational AI, RAG, and text analysis. Built with [Encore.ts](https://encore.dev) for easy deployment to Encore Cloud or Hugging Face Spaces.

## ✨ Features

- 🎯 **5 Core Endpoints** - Chat, RAG, Analysis, Models, Health
- 🔄 **Dual Provider Support** - Ollama (local) or Hugging Face (cloud)
- ⚡ **Smart Caching** - In-memory cache with TTL and automatic cleanup
- 🛡️ **Type-Safe** - Full TypeScript support with end-to-end type safety
- 📦 **Production Ready** - Comprehensive error handling, logging, and monitoring
- 🚀 **Zero Config** - Works out of the box on multiple platforms

## 🚀 Quick Start

### Local Development

```bash
# Set up secrets
encore secret set LLMProvider ollama
encore secret set OllamaBaseURL http://localhost:11434

# Or use Hugging Face
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token_here
encore secret set DefaultModel mistralai/Mistral-7B-Instruct-v0.2

# Run locally
encore run

# Test the API
curl -X POST http://localhost:4000/chat \
  -H "Content-Type: application/json" \
  -d '{"message": "Explain AI in simple terms"}'
```

### Deploy to Encore Cloud

```bash
encore deploy
```

Your API will be live at: `https://staging-<app-id>.encr.app`

### Deploy to Hugging Face Spaces

See [README.space.md](./README.space.md) for complete Hugging Face Spaces deployment instructions.

**Quick summary:**

1. Create a new Docker Space on Hugging Face
2. Push this repository to your Space
3. Configure secrets in Space settings
4. Your API is live!

## 📡 API Endpoints

### POST `/chat`

Conversational AI with intelligent caching.
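From a plain TypeScript client (outside the generated Encore client shown later), a call to this endpoint can be sketched as follows. The `buildChatRequest`/`sendChat` helpers and the `http://localhost:4000` base URL are illustrative assumptions, not part of the project; the field names follow the request/response shapes documented here.

```typescript
// Sketch of a minimal client for POST /chat (hypothetical helpers).
interface ChatRequest {
  message: string;
  model?: string;
  temperature?: number;
  maxTokens?: number;
  systemPrompt?: string;
}

interface ChatResponse {
  response: string;
  model: string;
  tokensUsed: number;
}

// Pure helper: apply an illustrative default temperature, then overrides.
function buildChatRequest(
  message: string,
  overrides: Partial<ChatRequest> = {}
): ChatRequest {
  return { message, temperature: 0.7, ...overrides };
}

// Thin wrapper around global fetch (Node 18+); assumes a local `encore run`.
async function sendChat(req: ChatRequest): Promise<ChatResponse> {
  const res = await fetch("http://localhost:4000/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  if (!res.ok) throw new Error(`chat request failed: ${res.status}`);
  return (await res.json()) as ChatResponse;
}
```

Because identical requests are cached server-side, repeating the same `sendChat` call within the TTL should return quickly without a second LLM call.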
**Request:**

```json
{
  "message": "Explain quantum computing",
  "model": "llama3",
  "temperature": 0.7,
  "maxTokens": 500,
  "systemPrompt": "You are a helpful assistant"
}
```

**Response:**

```json
{
  "response": "Quantum computing is...",
  "model": "llama3",
  "tokensUsed": 150
}
```

### POST `/rag`

Retrieval-Augmented Generation with source tracking.

**Request:**

```json
{
  "query": "What is the main topic?",
  "context": [
    "Quantum computing uses qubits...",
    "Classical computers use bits..."
  ],
  "model": "mistral",
  "temperature": 0.5
}
```

**Response:**

```json
{
  "response": "Based on [0] and [1], the main topic is...",
  "model": "mistral",
  "tokensUsed": 120,
  "sources": [0, 1]
}
```

### POST `/analyze`

Text analysis for educational and research use cases.

**Request:**

```json
{
  "text": "Your long text here...",
  "task": "summarize",
  "model": "llama3",
  "temperature": 0.3
}
```

**Tasks:** `summarize`, `evaluate`, `explain`, `extract`

**Response:**

```json
{
  "result": "Summary of the text...",
  "task": "summarize",
  "model": "llama3",
  "tokensUsed": 80
}
```

### GET `/models`

List all available LLM models.

**Response:**

```json
{
  "provider": "ollama",
  "models": [
    {
      "name": "llama3",
      "size": "4.7 GB",
      "description": "llama3 - Modified 1/2/2025",
      "provider": "ollama"
    }
  ]
}
```

### GET `/health`

System health and uptime monitoring.
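In TypeScript, the health payload shown below can be modeled roughly as follows. This is a sketch inferred from the example response (the backend's actual definitions presumably live in `lib/types.ts`); the `isReady` helper is an illustrative client-side check, not part of the API.

```typescript
// Shape of the /health response, inferred from the documented example.
interface CacheStats {
  size: number;       // current number of cached entries
  maxEntries: number; // eviction threshold
  ttl: number;        // time-to-live in seconds
}

interface HealthResponse {
  status: string; // "healthy" in the documented example
  uptime: number; // seconds since startup
  provider: "ollama" | "huggingface";
  modelsAvailable: boolean;
  cache: Record<string, CacheStats>; // keyed by endpoint: chat, rag, analysis
}

// Hypothetical readiness check a client or load balancer might apply.
function isReady(h: HealthResponse): boolean {
  return h.status === "healthy" && h.modelsAvailable;
}
```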
**Response:**

```json
{
  "status": "healthy",
  "uptime": 3600,
  "provider": "huggingface",
  "modelsAvailable": true,
  "cache": {
    "chat": { "size": 10, "maxEntries": 100, "ttl": 300 },
    "rag": { "size": 5, "maxEntries": 50, "ttl": 600 },
    "analysis": { "size": 2, "maxEntries": 30, "ttl": 900 }
  }
}
```

## 🔧 Configuration

### Required Secrets

| Secret | Description | Example |
|--------|-------------|---------|
| `LLMProvider` | Provider to use | `ollama` or `huggingface` |
| `OllamaBaseURL` | Ollama API URL (if using Ollama) | `http://localhost:11434` |
| `HuggingFaceAPIKey` | HF token (if using Hugging Face) | `hf_xxxxxxxxxxxxx` |
| `DefaultModel` | Default model (optional) | `llama3` or `mistralai/Mistral-7B-Instruct-v0.2` |

### Setting Secrets

**Encore Cloud:**

```bash
encore secret set LLMProvider huggingface
encore secret set HuggingFaceAPIKey hf_your_token
```

**Hugging Face Spaces:** Add secrets in Space Settings → Repository secrets

## 🏗️ Architecture

```
backend/
├── chat/        # Conversational AI endpoint
│   ├── encore.service.ts
│   └── chat.ts
├── rag/         # RAG endpoint
│   ├── encore.service.ts
│   └── rag.ts
├── analyze/     # Text analysis endpoint
│   ├── encore.service.ts
│   └── analyze.ts
├── models/      # Model listing endpoint
│   ├── encore.service.ts
│   └── models.ts
├── health/      # Health check endpoint
│   ├── encore.service.ts
│   └── health.ts
└── lib/         # Shared utilities
    ├── types.ts               # TypeScript types
    ├── cache.ts               # In-memory caching
    ├── llm-provider.ts        # Provider abstraction
    ├── ollama-client.ts       # Ollama integration
    └── huggingface-client.ts  # Hugging Face integration
```

## 🎯 Use Cases

- 💬 **Chatbots** - Build conversational AI applications
- 📚 **RAG Systems** - Create context-aware Q&A systems
- 🎓 **Education** - Analyze and explain complex texts
- 🔬 **Research** - Summarize and extract key information
- 🤖 **AI Agents** - Backend for autonomous AI systems
- 📊 **Content Analysis** - Evaluate and process documents

## 🚀 Deployment Options

### 1. Encore Cloud (Recommended for Production)

```bash
encore deploy
```

- Automatic scaling
- Built-in monitoring
- Type-safe service-to-service calls
- Zero infrastructure management

### 2. Hugging Face Spaces (Great for Demos)

- See [README.space.md](./README.space.md)
- Free hosting for public projects
- Easy model integration
- Community visibility

### 3. Docker

```bash
docker build -t llm-api .
docker run -p 7860:7860 \
  -e LLMProvider=huggingface \
  -e HuggingFaceAPIKey=your_key \
  llm-api
```

### 4. Self-Hosted

```bash
# Install the Encore CLI
curl -L https://encore.dev/install.sh | bash

encore run --port 8080
```

## 📊 Performance

- **Caching** - Reduces redundant LLM calls by up to 80%
- **Async/Await** - Non-blocking concurrent requests
- **Lightweight** - Minimal dependencies for fast startup
- **Efficient** - Optimized for serverless environments

**Cache Configuration:**

- Chat: 300s TTL, 100 max entries
- RAG: 600s TTL, 50 max entries
- Analysis: 900s TTL, 30 max entries

## 🔐 Security Best Practices

- ✅ API keys stored as secrets, never in code
- ✅ No sensitive data in logs
- ✅ Type-safe request validation
- ✅ Error messages don't leak internals
- ✅ CORS configured for frontend integration

## 🛠️ Development

```bash
# Install the Encore CLI
curl -L https://encore.dev/install.sh | bash

# Run with hot reload
encore run

# Run tests
encore test

# Type check
encore build
```

## 📝 Example: Frontend Integration

```typescript
// Auto-generated type-safe client
import backend from '~backend/client';

// Chat
const response = await backend.chat.chat({
  message: "Hello!",
  temperature: 0.7
});

// RAG
const ragResponse = await backend.rag.rag({
  query: "What is this about?",
  context: ["Document 1...", "Document 2..."]
});

// Analysis
const analysis = await backend.analyze.analyze({
  text: "Long text...",
  task: "summarize"
});
```

## 🤝 Contributing

Contributions welcome!
This is a production-ready foundation that can be extended with:

- Additional analysis tasks
- Vector database integration for RAG
- Streaming responses
- Rate limiting middleware
- Authentication
- Model fine-tuning endpoints

## 📄 License

MIT License - feel free to use in your projects!

## 🆘 Support

- [Encore Documentation](https://encore.dev/docs)
- [Hugging Face Spaces Docs](https://huggingface.co/docs/hub/spaces)
- [GitHub Issues](./issues)

---

**Built with** ❤️ using [Encore.ts](https://encore.dev)