---
title: LLM Structured Output Docker
emoji: πŸ€–
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
pinned: false
license: mit
short_description: Get structured JSON responses from LLM using Docker
tags:
- llama-cpp
- gguf
- json-schema
- structured-output
- llm
- docker
- gradio
- grammar
- gbnf
---
# πŸ€– LLM Structured Output (Docker Version)
A Dockerized application for getting structured responses from local GGUF language models in a user-specified JSON format.
## ✨ Key Features
- **Docker containerized** for easy deployment on HuggingFace Spaces
- **Local GGUF model support** via llama-cpp-python
- **Optimized for containers** with configurable resources
- **JSON schema support** for structured output
- **πŸ”— Grammar-based structured output** (GBNF) for precise JSON generation
- **Dual generation modes**: Grammar mode and Schema guidance mode
- **Gradio web interface** for convenient interaction
- **REST API** for integration with other applications
- **Memory efficient** with GGUF quantized models
## πŸš€ Deployment on HuggingFace Spaces
This version is specifically designed for HuggingFace Spaces with the Docker SDK:
1. Clone this repository
2. Push it to a HuggingFace Space configured with `sdk: docker` in README.md (see the example push below)
3. The application will build and deploy automatically
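For step 2, a typical push looks like this (the Space URL is a placeholder for your own username and Space name):
```bash
# Create a Space with the Docker SDK on huggingface.co, then push this repo to it
git remote add space https://huggingface.co/spaces/<user>/<space-name>
git push space main
```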
## 🐳 Local Docker Usage
### Build the image:
```bash
docker build -t llm-structured-output .
```
### Run the container:
```bash
docker run -p 7860:7860 -e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" llm-structured-output
```
### With custom configuration:
```bash
docker run -p 7860:7860 \
-e MODEL_REPO="lmstudio-community/gemma-3n-E4B-it-text-GGUF" \
-e MODEL_FILENAME="gemma-3n-E4B-it-Q8_0.gguf" \
-e N_CTX="4096" \
-e MAX_NEW_TOKENS="512" \
llm-structured-output
```
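If you prefer Compose, the same run can be captured in a `docker-compose.yml`. This is a minimal sketch mirroring the command above; the service name is illustrative:
```yaml
services:
  llm:
    build: .
    ports:
      - "7860:7860"
    environment:
      MODEL_REPO: "lmstudio-community/gemma-3n-E4B-it-text-GGUF"
      MODEL_FILENAME: "gemma-3n-E4B-it-Q8_0.gguf"
      N_CTX: "4096"
      MAX_NEW_TOKENS: "512"
```
Start it with `docker compose up --build`.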
## 🌐 Application Access
- **Web interface**: http://localhost:7860
- **API**: Available through the same port
- **Health check**: http://localhost:7860/health (when API mode is enabled; see the example below)
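A quick smoke test for a running container (the exact response body depends on the app):
```bash
curl -s http://localhost:7860/health
```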
## πŸ“ Environment Variables
Configure the application using environment variables:
| Variable | Default | Description |
|----------|---------|-------------|
| `MODEL_REPO` | `lmstudio-community/gemma-3n-E4B-it-text-GGUF` | HuggingFace model repository |
| `MODEL_FILENAME` | `gemma-3n-E4B-it-Q8_0.gguf` | Model file name |
| `N_CTX` | `4096` | Context window size |
| `N_GPU_LAYERS` | `0` | GPU layers (0 for CPU-only) |
| `N_THREADS` | `4` | CPU threads |
| `MAX_NEW_TOKENS` | `256` | Maximum response length |
| `TEMPERATURE` | `0.1` | Generation temperature |
| `HUGGINGFACE_TOKEN` | `` | HF token for private models |
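When setting many variables, an env file is often cleaner than repeated `-e` flags:
```bash
# .env holds KEY=VALUE pairs, one per line (e.g. N_CTX=2048)
docker run -p 7860:7860 --env-file .env llm-structured-output
```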
## πŸ“‹ Usage Examples
### Example JSON Schema:
```json
{
"type": "object",
"properties": {
"summary": {"type": "string"},
"sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
"confidence": {"type": "number", "minimum": 0, "maximum": 1}
},
"required": ["summary", "sentiment"]
}
```
### Example Prompt:
```
Analyze this review: "The product exceeded my expectations! Great quality and fast delivery."
```
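Putting the two together, a REST call might look like the sketch below. The `/generate` path is an assumption; check the FastAPI docs at http://localhost:7860/docs for the actual route:
```bash
# /generate is a hypothetical route name; the request fields match the API section below
curl -s -X POST http://localhost:7860/generate \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Analyze this review: \"The product exceeded my expectations! Great quality and fast delivery.\"",
    "json_schema": {
      "type": "object",
      "properties": {
        "summary": {"type": "string"},
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
      },
      "required": ["summary", "sentiment"]
    },
    "use_grammar": true
  }'
```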
## πŸ”§ Docker Optimizations
This Docker version includes several optimizations (a Dockerfile sketch follows the list):
- **Reduced memory usage** with smaller context window and batch sizes
- **CPU-optimized** configuration by default
- **Efficient layer caching** for faster builds
- **Security**: Runs as non-root user
- **Multi-stage build** capabilities for production
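The layer-caching and non-root patterns look roughly like this. This is an illustrative sketch, not the repository's actual Dockerfile:
```dockerfile
FROM python:3.10-slim
WORKDIR /app

# Copy requirements first so the dependency layer is cached across code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Create and switch to a non-root user for security
RUN useradd -m appuser
COPY --chown=appuser:appuser . .
USER appuser

EXPOSE 7860
# The entrypoint name is an assumption
CMD ["python", "app.py"]
```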
## πŸ—οΈ Architecture
- **Base Image**: Python 3.10 slim
- **ML Backend**: llama-cpp-python with OpenBLAS
- **Web Interface**: Gradio 4.x
- **API**: FastAPI with automatic documentation
- **Model Storage**: Downloaded on first run to `/app/models/`
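A minimal sketch of how these pieces fit together (not the app's actual code; it shows the standard `huggingface_hub` + `llama-cpp-python` calls driven by the environment variables above):
```python
import os

from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the GGUF file into the model directory (cached on subsequent runs)
model_path = hf_hub_download(
    repo_id=os.environ.get("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF"),
    filename=os.environ.get("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf"),
    local_dir="/app/models",
    token=os.environ.get("HUGGINGFACE_TOKEN") or None,
)

# Load the model with the container's CPU-oriented defaults
llm = Llama(
    model_path=model_path,
    n_ctx=int(os.environ.get("N_CTX", "4096")),
    n_gpu_layers=int(os.environ.get("N_GPU_LAYERS", "0")),
    n_threads=int(os.environ.get("N_THREADS", "4")),
)
```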
## πŸ’‘ Performance Tips
1. **Memory**: Start with smaller models (7B parameters or fewer)
2. **CPU**: Adjust `N_THREADS` to match the available CPU cores
3. **Context**: Reduce `N_CTX` if you run into memory issues
4. **Batch size**: Lower `N_BATCH` in memory-constrained environments (see the example run below)
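For example, a memory-constrained run might look like this (the values are illustrative starting points):
```bash
docker run -p 7860:7860 \
  -e N_CTX="2048" \
  -e MAX_NEW_TOKENS="128" \
  -e N_THREADS="8" \
  llm-structured-output
```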
## πŸ”— Grammar Mode (GBNF)
This project supports **grammar-based structured output** using GBNF (GGML BNF, the grammar format used by llama.cpp) for more precise JSON generation:
### ✨ What is Grammar Mode?
Grammar Mode automatically converts your JSON Schema into a GBNF grammar that constrains the model to generate only valid JSON matching your schema structure (see the sketch after this list). This provides:
- **100% valid JSON** - No parsing errors
- **Schema compliance** - Guaranteed structure adherence
- **Consistent output** - Reliable format every time
- **Better performance** - Fewer retry attempts needed
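Here is a minimal sketch of grammar-constrained generation with llama-cpp-python; how this project wires it up internally may differ:
```python
import json

from llama_cpp import Llama, LlamaGrammar

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]}
    },
    "required": ["sentiment"],
}

# Compile the JSON Schema into a GBNF grammar that constrains token sampling
grammar = LlamaGrammar.from_json_schema(json.dumps(schema))

llm = Llama(model_path="/app/models/gemma-3n-E4B-it-Q8_0.gguf", n_ctx=4096)
out = llm(
    "Classify the sentiment of: 'Great quality and fast delivery.'",
    grammar=grammar,
    max_tokens=64,
)
print(out["choices"][0]["text"])  # always parses as JSON matching the schema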
### πŸŽ›οΈ Usage
**In Gradio Interface:**
- Toggle the "πŸ”— Use Grammar (GBNF) Mode" checkbox
- Enabled by default for best results
**In API:**
```json
{
"prompt": "Your prompt here",
"json_schema": { your_schema },
"use_grammar": true
}
```
**In Python:**
```python
result = llm_client.generate_structured_response(
prompt="Your prompt",
json_schema=schema,
use_grammar=True # Enable grammar mode
)
```
### πŸ”„ Mode Comparison
| Feature | Grammar Mode | Schema Guidance Mode |
|---------|-------------|---------------------|
| JSON Validity | 100% guaranteed | High, but output may fail to parse |
| Schema Compliance | Strict enforcement | Guidance-based |
| Speed | Faster (single pass) | May need retries |
| Flexibility | Structured | More creative freedom |
| Best for | APIs, data extraction | Creative content with structure |
### πŸ› οΈ Supported Schema Features
- βœ… Objects with required/optional properties
- βœ… Arrays with typed items
- βœ… String enums
- βœ… Numbers and integers
- βœ… Booleans
- βœ… Nested objects and arrays
- ⚠️ Complex conditionals (simplified)
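For reference, the grammar generated for a single-enum schema like the one above has roughly this shape (illustrative only; the converter's exact rule names and whitespace handling differ):
```
root ::= "{" space "\"sentiment\"" space ":" space sentiment "}" space
sentiment ::= "\"positive\"" | "\"negative\"" | "\"neutral\""
space ::= " "?
```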
## πŸ” Troubleshooting
### Container fails to start:
- Check available memory (minimum 4GB recommended)
- Verify model repository accessibility
- Ensure proper environment variable formatting
### Model download issues:
- Check internet connectivity in container
- Verify `HUGGINGFACE_TOKEN` for private models
- Ensure sufficient disk space
### Performance issues:
- Reduce `N_CTX` and `MAX_NEW_TOKENS`
- Adjust `N_THREADS` to match CPU cores
- Consider using smaller/quantized models
## πŸ“„ License
MIT License - see LICENSE file for details.
---
For more information about HuggingFace Spaces Docker configuration, see: https://huggingface.co/docs/hub/spaces-config-reference