---
title: LLM Structured Output Docker
emoji: 🤖
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
- llama-cpp
- gguf
- json-schema
- structured-output
- llm
- docker
- gradio
- grammar
- gbnf
---

# 🤖 LLM Structured Output (Docker Version)

A Dockerized application that returns structured responses from local GGUF language models in a JSON format you specify.

## ✨ Key Features

- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON schema support** for structured output
- **Grammar-based structured output** (GBNF) for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models


## Deployment on HuggingFace Spaces

This version is designed for HuggingFace Spaces with the Docker SDK:

1. Clone this repository
2. Push to HuggingFace Spaces with `sdk: docker` set in the README.md front matter
3. The application builds and deploys automatically


## 🐳 Local Docker Usage

### Build the image:
```bash
docker build -t llm-structured-output .
```

### Run the container:
```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```

### With custom configuration:
```bash
docker run -p 7860:7860 \
  -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
  -e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
  -e N_CTX="4096" \
  -e MAX_NEW_TOKENS="512" \
  llm-structured-output
```


## Application Access

- **Web interface**: http://localhost:7860
- **API**: Available through the same port
- **Health check**: http://localhost:7860/health (when API mode is enabled)


## Environment Variables

Configure the application using environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_REPO` | `lmstudio-community/gemma-3n-E4B-it-text-GGUF` | HuggingFace model repository |
| `MODEL_FILENAME` | `gemma-3n-E4B-it-Q8_0.gguf` | Model file name |
| `N_CTX` | `4096` | Context window size |
| `N_GPU_LAYERS` | `0` | GPU layers (0 for CPU-only) |
| `N_THREADS` | `4` | CPU threads |
| `MAX_NEW_TOKENS` | `256` | Maximum response length |
| `TEMPERATURE` | `0.1` | Generation temperature |
| `HUGGINGFACE_TOKEN` | (empty) | HF token for private models |

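
As an illustration of how the variables in the table could be consumed, here is a minimal sketch that reads them with the documented defaults and converts them to typed values. The `Settings` class and `load_settings` function are hypothetical names used for this example; the application's own configuration code may be organized differently.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Typed view of the environment variables listed in the table (hypothetical helper)."""
    model_repo: str
    model_filename: str
    n_ctx: int
    n_gpu_layers: int
    n_threads: int
    max_new_tokens: int
    temperature: float
    huggingface_token: str


def load_settings(env=None) -> Settings:
    """Read settings from a mapping (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_repo=env.get("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF"),
        model_filename=env.get("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf"),
        n_ctx=int(env.get("N_CTX", "4096")),
        n_gpu_layers=int(env.get("N_GPU_LAYERS", "0")),
        n_threads=int(env.get("N_THREADS", "4")),
        max_new_tokens=int(env.get("MAX_NEW_TOKENS", "256")),
        temperature=float(env.get("TEMPERATURE", "0.1")),
        huggingface_token=env.get("HUGGINGFACE_TOKEN", ""),
    )
```

Environment variables always arrive as strings, which is why values such as `N_CTX="4096"` are quoted in the `docker run` examples and converted with `int()` or `float()` here.
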

## Usage Examples

### Example JSON Schema:
```json
{
  "type": "object",
  "properties": {
    "summary": {"type": "string"},
    "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1}
  },
  "required": ["summary", "sentiment"]
}
```

### Example Prompt:
```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```

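
To make the example schema concrete, the sketch below checks a response against it using only the standard library. It covers just the keywords the example uses (`required`, `type` for strings and numbers, `enum`, `minimum`, `maximum`) and is not a full JSON Schema validator; a library such as `jsonschema` would be the usual choice in practice. The `check_response` name and the sample response are hypothetical, not output captured from the application.

```python
import json

SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1},
    },
    "required": ["summary", "sentiment"],
}


def check_response(instance, schema):
    """Return a list of problems; an empty list means the simplified check passed."""
    if not isinstance(instance, dict):
        return ["top-level value is not an object"]
    problems = []
    for name in schema.get("required", []):
        if name not in instance:
            problems.append(f"missing required property '{name}'")
    for name, rules in schema.get("properties", {}).items():
        if name not in instance:
            continue  # optional property that was omitted
        value = instance[name]
        expected = rules.get("type")
        if expected == "string" and not isinstance(value, str):
            problems.append(f"'{name}' should be a string")
        elif expected == "number":
            # bool is a subclass of int in Python, so exclude it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"'{name}' should be a number")
                continue
            if "minimum" in rules and value < rules["minimum"]:
                problems.append(f"'{name}' is below the minimum {rules['minimum']}")
            if "maximum" in rules and value > rules["maximum"]:
                problems.append(f"'{name}' is above the maximum {rules['maximum']}")
        if "enum" in rules and value not in rules["enum"]:
            problems.append(f"'{name}' is not one of {rules['enum']}")
    return problems


# Hypothetical response text, shaped the way the schema asks for.
sample = '{"summary": "Very satisfied customer", "sentiment": "positive", "confidence": 0.95}'
print(check_response(json.loads(sample), SCHEMA))
```
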

## 🔧 Docker Optimizations

This Docker version includes several optimizations:

- **Reduced memory usage** with a smaller context window and batch size
- **CPU-optimized** configuration by default
- **Efficient layer caching** for faster builds
- **Security**: Runs as a non-root user
- **Multi-stage build** capabilities for production


## Architecture

- **Base Image**: Python 3.10 slim
- **ML Backend**: llama-cpp-python with OpenBLAS
- **Web Interface**: Gradio 4.x
- **API**: FastAPI with automatic documentation
- **Model Storage**: Downloaded on first run to `/app/models/`


## 💡 Performance Tips

1. **Memory**: Start with smaller models (7B or less)
2. **CPU**: Adjust `N_THREADS` based on available cores
3. **Context**: Reduce `N_CTX` if experiencing memory issues
4. **Batch size**: Lower `N_BATCH` for memory-constrained environments

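
For the `N_THREADS` tip, one possible starting point is to derive the value from the number of cores Python can see. The sketch below is a heuristic under stated assumptions: the `suggest_n_threads` name and the choice to leave one core free are illustrative, and `os.cpu_count()` reports the host's logical cores, which can be higher than a container's CPU quota.

```python
import os


def suggest_n_threads(reserve: int = 1, cpu_count=None) -> int:
    """Suggest a thread count: visible cores minus `reserve`, but never below 1."""
    total = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, total - reserve)


# Print a value that could be passed as: docker run -e N_THREADS=<value> ...
print(suggest_n_threads())
```
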

## Grammar Mode (GBNF)

This project supports **grammar-based structured output** using GBNF (GGML BNF, the grammar format used by llama.cpp) for more precise JSON generation.

### ✨ What is Grammar Mode?

Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure. This provides:

- **Valid JSON** - No parsing errors, provided generation is not cut off by `MAX_NEW_TOKENS`
- **Schema compliance** - Structure is enforced during decoding rather than checked afterwards
- **Consistent output** - Reliable format every time
- **Better performance** - Fewer retry attempts needed

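
To give a feel for the schema-to-grammar conversion described above, here is a deliberately small sketch that turns a flat JSON Schema into GBNF text. It is not this project's converter: `schema_to_gbnf` and `gbnf_literal` are hypothetical names, every property is emitted in declaration order and treated as required, and only objects, string enums, strings, numbers, integers, and booleans are handled.

```python
import json

# GBNF bodies for JSON primitives (simplified: no \uXXXX escapes in strings).
PRIMITIVES = {
    "string": r'''"\"" ([^"\\] | "\\" ["\\/bfnrt])* "\""''',
    "number": r'''"-"? ([0-9] | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [-+]? [0-9]+)?''',
    "integer": r'''"-"? ([0-9] | [1-9] [0-9]*)''',
    "boolean": r'''"true" | "false"''',
}


def gbnf_literal(text: str) -> str:
    """Quote text as a GBNF string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def schema_to_gbnf(schema: dict) -> str:
    """Convert a small JSON Schema subset to a GBNF grammar string."""
    used = set()  # primitive rules referenced by the root rule

    def expr(node: dict) -> str:
        if "enum" in node:
            # Each enum value becomes a literal containing its JSON encoding.
            options = " | ".join(gbnf_literal(json.dumps(v)) for v in node["enum"])
            return f"({options})"
        kind = node.get("type")
        if kind == "object":
            parts = [
                f'{gbnf_literal(json.dumps(name))} ws ":" ws {expr(child)}'
                for name, child in node.get("properties", {}).items()
            ]
            if not parts:
                return '"{" ws "}"'
            body = ' ws "," ws '.join(parts)
            return f'"{{" ws {body} ws "}}"'
        if kind in PRIMITIVES:
            used.add(kind)
            return kind
        raise ValueError(f"unsupported schema node: {node!r}")

    lines = [f"root ::= {expr(schema)}"]
    lines += [f"{name} ::= {PRIMITIVES[name]}" for name in sorted(used)]
    lines.append(r"ws ::= [ \t\n]*")
    return "\n".join(lines)


EXAMPLE = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
}
print(schema_to_gbnf(EXAMPLE))
```

The resulting text has the shape that llama-cpp-python's `LlamaGrammar.from_string` accepts. That library also provides `LlamaGrammar.from_json_schema`, which handles far more of JSON Schema than this sketch does.
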

### Usage

**In Gradio Interface:**
- Toggle the "Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results

**In API:**
```json
{
  "prompt": "Your prompt here",
  "json_schema": { your_schema },
  "use_grammar": true
}
```

**In Python:**
```python
result = llm_client.generate_structured_response(
    prompt="Your prompt",
    json_schema=schema,
    use_grammar=True  # Enable grammar mode
)
```

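
In the API example, `{ your_schema }` is a placeholder for a complete JSON Schema object, so the snippet is not valid JSON as written. A minimal sketch of assembling a concrete request body from the three documented fields follows; `build_request_body` is a hypothetical helper, and the endpoint to send the body to is left out because it is not specified here.

```python
import json


def build_request_body(prompt: str, json_schema: dict, use_grammar: bool = True) -> str:
    """Serialize the three documented request fields into a JSON string."""
    payload = {
        "prompt": prompt,
        "json_schema": json_schema,
        "use_grammar": use_grammar,
    }
    return json.dumps(payload, indent=2)


schema = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
    },
    "required": ["summary", "sentiment"],
}
print(build_request_body("Analyze this review: ...", schema))
```
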

### Mode Comparison

| Feature | Grammar Mode | Schema Guidance Mode |
|---------|--------------|----------------------|
| JSON Validity | Guaranteed when generation completes | High, but output may need parsing |
| Schema Compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |

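
The "may need parsing" and "may need retries" entries for Schema Guidance Mode can be pictured with a sketch like the one below: pull the first JSON object out of free-form model text, and call the model again if none is found. The function names and the `generate` callable are assumptions made for illustration, not the project's actual retry logic.

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            value, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pass  # not a complete object at this position; try the next brace
        else:
            if isinstance(value, dict):
                return value
        start = text.find("{", start + 1)
    return None


def generate_with_retries(generate: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call generate(prompt) until its output contains a JSON object, up to max_attempts."""
    for _ in range(max_attempts):
        parsed = extract_json(generate(prompt))
        if parsed is not None:
            return parsed
    raise ValueError(f"no valid JSON object after {max_attempts} attempts")
```

In Grammar Mode this loop is normally unnecessary, because the output is constrained while it is being generated.
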

### 🛠️ Supported Schema Features

- ✅ Objects with required/optional properties
- ✅ Arrays with typed items
- ✅ String enums
- ✅ Numbers and integers
- ✅ Booleans
- ✅ Nested objects and arrays
- ⚠️ Complex conditionals (simplified)


## Troubleshooting

### Container fails to start:
- Check available memory (at least 4 GB recommended)
- Verify that the model repository is accessible
- Ensure environment variables are formatted correctly

### Model download issues:
- Check internet connectivity inside the container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space

### Performance issues:
- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models


## License

MIT License - see LICENSE file for details.

---

For more information about HuggingFace Spaces Docker configuration, see: https://huggingface.co/docs/hub/spaces-config-reference