---
title: LLM Structured Output Docker
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
- llama-cpp
- gguf
- json-schema
- structured-output
- llm
- docker
- gradio
- grammar
- gbnf
---
# πŸ€– LLM Structured Output (Docker Version)
A Dockerized application for getting structured responses from local GGUF language models in a user-specified JSON format.
## ✨ Key Features
- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON schema support** for structured output
- **πŸ”— Grammar-based structured output** (GBNF) for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models
## πŸš€ Deployment on HuggingFace Spaces
This version is specifically designed for HuggingFace Spaces with the Docker SDK:
1. Clone this repository
2. Push it to a HuggingFace Space configured with `sdk: docker` in README.md (see the example push below)
3. The application will build and deploy automatically
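For step 2, a typical push looks like this (the Space URL is a placeholder for your own username and Space name):
```bash
# Create a Space with the Docker SDK on huggingface.co, then push this repo to it
git remote add space https://huggingface.co/spaces/<user>/<space-name>
git push space main
```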
## 🐳 Local Docker Usage
### Build the image:
```bash
docker build -t llm-structured-output .
```
### Run the container:
```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```
### With custom configuration:
```bash
docker run -p 7860:7860 \
-e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
-e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
-e N_CTX="4096" \
-e MAX_NEW_TOKENS="512" \
llm-structured-output
```
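If you prefer Compose, the same run can be captured in a `docker-compose.yml`. This is a minimal sketch mirroring the command above; the service name is illustrative:
```yaml
services:
  llm:
    build: .
    ports:
      - "7860:7860"
    environment:
      MODEL_REPO: "lmstudio-community/gemma-3n-E4B-it-text-GGUF"
      MODEL_FILENAME: "gemma-3n-E4B-it-Q8_0.gguf"
      N_CTX: "4096"
      MAX_NEW_TOKENS: "512"
```
Start it with `docker compose up --build`.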
## 🌐 Application Access
- **Web interface**: http://localhost:7860
- **API**: Available through the same port
- **Health check**: http://localhost:7860/health (when API mode is enabled; see the example below)
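A quick smoke test for a running container (the exact response body depends on the app):
```bash
curl -s http://localhost:7860/health
```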
## πŸ“ Environment Variables
Configure the application using environment variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_REPO` | `lmstudio-community/gemma-3n-E4B-it-text-GGUF` | HuggingFace model repository |
| `MODEL_FILENAME` | `gemma-3n-E4B-it-Q8_0.gguf` | Model file name |
| `N_CTX` | `4096` | Context window size |
| `N_GPU_LAYERS` | `0` | GPU layers (0 for CPU-only) |
| `N_THREADS` | `4` | CPU threads |
| `MAX_NEW_TOKENS` | `256` | Maximum response length |
| `TEMPERATURE` | `0.1` | Generation temperature |
| `HUGGINGFACE_TOKEN` | `` | HF token for private models |
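When setting many variables, an env file is often cleaner than repeated `-e` flags:
```bash
# .env holds KEY=VALUE pairs, one per line (e.g. N_CTX=2048)
docker run -p 7860:7860 --env-file .env llm-structured-output
```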
## πŸ“‹ Usage Examples
### Example JSON Schema:
```json
{
"type": "object",
"properties": {
"summary": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1}
},
"required": ["summary", "sentiment"]
}
```
### Example Prompt:
```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```
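Putting the two together, a REST call might look like the sketch below. The `/generate` path is an assumption; check the FastAPI docs at http://localhost:7860/docs for the actual route:
```bash
# /generate is a hypothetical route name; the request fields match the API section below
curl -s -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Analyze this review: \"The product exceeded my expectations! Great quality and fast delivery.\"",
    "json_schema": {
      "type": "object",
      "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
      },
      "required": ["summary", "sentiment"]
    },
    "use_grammar": true
  }'
```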
## πŸ”§ Docker Optimizations
This Docker version includes several optimizations (a Dockerfile sketch follows the list):
- **Reduced memory usage** with smaller context window and batch sizes
- **CPU-optimized** configuration by default
- **Efficient layer caching** for faster builds
- **Security**: Runs as non-root user
- **Multi-stage build** capabilities for production
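The layer-caching and non-root patterns look roughly like this. This is an illustrative sketch, not the repository's actual Dockerfile:
```dockerfile
FROM python:3.10-slim
WORKDIR /app

# Copy requirements first so the dependency layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create and switch to a non-root user for security
RUN useradd -m appuser
COPY --chown=appuser:appuser . .
USER appuser

EXPOSE 7860
# The entrypoint name is an assumption
CMD ["python", "app.py"]
```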
## πŸ—οΈ Architecture
- **Base Image**: Python 3.10 slim
- **ML Backend**: llama-cpp-python with OpenBLAS
- **Web Interface**: Gradio 4.x
- **API**: FastAPI with automatic documentation
- **Model Storage**: Downloaded on first run to `/app/models/`
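A minimal sketch of how these pieces fit together (not the app's actual code; it shows the standard `huggingface_hub` + `llama-cpp-python` calls driven by the environment variables above):
```python
import os

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file into the model directory (cached on subsequent runs)
model_path = hf_hub_download(
    repo_id=os.environ.get("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF"),
    filename=os.environ.get("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf"),
    local_dir="/app/models",
    token=os.environ.get("HUGGINGFACE_TOKEN") or None,
)

# Load the model with the container's CPU-oriented defaults
llm = Llama(
    model_path=model_path,
    n_ctx=int(os.environ.get("N_CTX", "4096")),
    n_gpu_layers=int(os.environ.get("N_GPU_LAYERS", "0")),
    n_threads=int(os.environ.get("N_THREADS", "4")),
)
```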
## πŸ’‘ Performance Tips
1. **Memory**: Start with smaller models (7B parameters or fewer)
2. **CPU**: Adjust `N_THREADS` to match the available CPU cores
3. **Context**: Reduce `N_CTX` if you run into memory issues
4. **Batch size**: Lower `N_BATCH` in memory-constrained environments (see the example run below)
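For example, a memory-constrained run might look like this (the values are illustrative starting points):
```bash
docker run -p 7860:7860 \
  -e N_CTX="2048" \
  -e MAX_NEW_TOKENS="128" \
  -e N_THREADS="8" \
  llm-structured-output
```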
## πŸ”— Grammar Mode (GBNF)
This project supports **grammar-based structured output** using GBNF (GGML BNF, the grammar format used by llama.cpp) for more precise JSON generation:
### ✨ What is Grammar Mode?
Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure (see the sketch after this list). This provides:
- **100% valid JSON** - No parsing errors
- **Schema compliance** - Guaranteed structure adherence
- **Consistent output** - Reliable format every time
- **Better performance** - Fewer retry attempts needed
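Here is a minimal sketch of grammar-constrained generation with llama-cpp-python; how this project wires it up internally may differ:
```python
import json

from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    },
    "required": ["sentiment"],
}

# Compile the JSON Schema into a GBNF grammar that constrains token sampling
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

llm = Llama(model_path="/app/models/gemma-3n-E4B-it-Q8_0.gguf", n_ctx=4096)
out = llm(
    "Classify the sentiment of: 'Great quality and fast delivery.'",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])  # always parses as JSON matching the schema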
### πŸŽ›οΈ Usage
**In Gradio Interface:**
- Toggle the "πŸ”— Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results
**In API:**
```json
{
"prompt": "Your prompt here",
"json_schema": { your_schema },
"use_grammar": true
}
```
**In Python:**
```python
result = llm_client.generate_structured_response(
prompt="Your prompt",
json_schema=schema,
use_grammar=True # Enable grammar mode
)
```
### πŸ”„ Mode Comparison
| Feature | Grammar Mode | Schema Guidance Mode |
|---------|-------------|---------------------|
| JSON Validity | 100% guaranteed | High, but output may fail to parse |
| Schema Compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |
### πŸ› οΈ Supported Schema Features
- βœ… Objects with required/optional properties
- βœ… Arrays with typed items
- βœ… String enums
- βœ… Numbers and integers
- βœ… Booleans
- βœ… Nested objects and arrays
- ⚠️ Complex conditionals (simplified)
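For reference, the grammar generated for a single-enum schema like the one above has roughly this shape (illustrative only; the converter's exact rule names and whitespace handling differ):
```
root ::= "{" space "\"sentiment\"" space ":" space sentiment "}" space
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
space ::= " "?
```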
## πŸ” Troubleshooting
### Container fails to start:
- Check available memory (minimum 4GB recommended)
- Verify model repository accessibility
- Ensure proper environment variable formatting
### Model download issues:
- Check internet connectivity in container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space
### Performance issues:
- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models
## πŸ“„ License
MIT License - see LICENSE file for details.
---
For more information about HuggingFace Spaces Docker configuration, see: https://huggingface.co/docs/hub/spaces-config-reference