lyangas/free_llm_structure_output_docker

lyangas committed on Aug 23

Commit b9cb4a6 (1 parent: f2adbf5)

wheel llama cpp was added

Files changed (10)

.gitattributes +1 -0
.gitignore +1 -1
BUILD_INSTRUCTIONS.md +0 -89
Dockerfile +9 -8
GRAMMAR_CHANGES.md +0 -100
app.py +373 -224
config.py +10 -7
requirements.txt +0 -2
test.ipynb +0 -24
wheels/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl +3 -0

.gitattributes CHANGED

@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.whl filter=lfs diff=lfs merge=lfs -text

.gitignore CHANGED

@@ -15,7 +15,6 @@ lib64/
 parts/
 sdist/
 var/
-wheels/
 *.egg-info/
 .installed.cfg
 *.egg
@@ -69,3 +68,4 @@ temp/
 # Test files
 test*
 test.ipynb
+logs.txt

BUILD_INSTRUCTIONS.md DELETED

@@ -1,89 +0,0 @@
-# Instructions for building a Docker image with a preloaded model
-## Overview of changes
-The Dockerfile was modified to preload the Hugging Face model during the image build. This provides:
-- ✅ Fast deployment (the model is already in the container)
-- ✅ Reliability (no network dependency at startup)
-- ✅ Consistency (a pinned model version)
-## Building the image
-### Basic build (for public models):
-```bash
-docker build -t llm-structured-output .
-```
-### Build with a Hugging Face token (for private models):
-```bash
-docker build --build-arg HUGGINGFACE_TOKEN=your_token_here -t llm-structured-output .
-```
-Or via an environment variable:
-```bash
-export HUGGINGFACE_TOKEN=your_token_here
-docker build -t llm-structured-output .
-```
-## Running the container
-```bash
-docker run -p 7860:7860 llm-structured-output
-```
-The application will be available at: http://localhost:7860
-## Running via docker-compose
-```bash
-docker-compose up --build
-```
-## Important changes
-### 1. Dockerfile
-- Added `git-lfs` for working with large files
-- Added the `DOCKER_CONTAINER=true` variable
-- Added a model preloading step
-- The model is downloaded during the image build
-### 2. app.py
-- Added a check for the Docker environment
-- If the model is not found in the Docker container, an error is raised
-- Model loading logic is optimized for preloaded models
-## Image size
-The image will be larger because it includes the model, but this is offset by:
-- Fast container startup
-- No network dependencies
-- Docker layer caching
-## Configuring the model
-To change the model, edit `config.py`:
-```python
-MODEL_REPO: str = "your-repo/your-model"
-MODEL_FILENAME: str = "your-model.gguf"
-```
-Then rebuild the image.
-## Debugging
-To check that the model is present in the container:
-```bash
-docker run -it llm-structured-output ls -la /app/models/
-```
-To inspect the build logs:
-```bash
-docker build --no-cache -t llm-structured-output .
-```

Dockerfile CHANGED

@@ -4,14 +4,17 @@ FROM python:3.10-slim
 # Set working directory
 WORKDIR /app
-# Install system dependencies required for runtime and git-lfs
+# Install system dependencies required for runtime and compilation
 RUN apt-get update && apt-get install -y \
     wget \
     curl \
     git \
     git-lfs \
+    build-essential \
+    cmake \
     libopenblas-dev \
     libssl-dev \
+    libgomp1 \
     && rm -rf /var/lib/apt/lists/*
 # Initialize git-lfs
@@ -26,7 +29,9 @@ ENV DOCKER_CONTAINER=true
 # Create models directory
 RUN mkdir -p /app/models
+# Copy and install llama-cpp-python from local wheel
+COPY wheels/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl /tmp/
+RUN pip install /tmp/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
 # Copy requirements first for better Docker layer caching
 COPY requirements.txt .
@@ -42,11 +47,7 @@ RUN python -c "import os; from huggingface_hub import hf_hub_download; from conf
 # Verify model file exists after build
 RUN ls -la /app/models/ && \
-    [ -f "/app/models/gemma-3n-E4B-it-Q8_0.gguf" ] || (echo "Model file not found!" && exit 1)
+    [ -n "$(ls /app/models/*.gguf 2>/dev/null)" ] || (echo "No .gguf model file found!" && exit 1)
-# Copy and install llama-cpp-python from local wheel
-COPY wheels/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl /tmp/
-RUN pip install /tmp/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl
 # Copy application files
 COPY . .
@@ -62,5 +63,5 @@ USER user
 EXPOSE 7860
 # Set entrypoint and default command
-ENTRYPOINT ["./entrypoint.sh"]
+# ENTRYPOINT ["./entrypoint.sh"]
 CMD ["python", "main.py", "--mode", "gradio"]

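Note on the Dockerfile change: the verification step now accepts any `.gguf` file in `/app/models` instead of one hard-coded filename. A minimal sketch of the same check in Python follows; the helper names `find_gguf_models` and `require_gguf_model` are hypothetical and not part of this commit.

```python
from pathlib import Path
from typing import List


def find_gguf_models(models_dir: str) -> List[Path]:
    """Return all .gguf files directly inside models_dir, sorted by name."""
    return sorted(Path(models_dir).glob("*.gguf"))


def require_gguf_model(models_dir: str = "/app/models") -> Path:
    """Mirror the Dockerfile check: fail if no .gguf file is present."""
    models = find_gguf_models(models_dir)
    if not models:
        # Same message as the shell check in the Dockerfile
        raise FileNotFoundError(f"No .gguf model file found in {models_dir}")
    # Assumption: when several models exist, the first by name is used
    return models[0]
```

Like the shell glob, this only confirms that some `.gguf` file exists; it does not confirm that the file matches `Config.MODEL_FILENAME`.
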
GRAMMAR_CHANGES.md DELETED

@@ -1,100 +0,0 @@
-# 🔗 Grammar Support Implementation
-## 📋 Summary
-Successfully integrated **Grammar-based Structured Output (GBNF)** support from the source project `/Users/ivan/Documents/Proging/free_llm_huggingface/free_llm_structure_output` into the current Docker project.
-## 🔧 Changes Made
-### 1. Core Grammar Implementation (`app.py`)
-- ✅ Added `LlamaGrammar` import from `llama_cpp`
-- ✅ Implemented `_json_schema_to_gbnf()` function for JSON Schema → GBNF conversion
-- ✅ Added `use_grammar` parameter to `generate_structured_response()` method
-- ✅ Enhanced generation logic with dual modes:
-  - **Grammar Mode**: Uses GBNF constraints for strict JSON enforcement
-  - **Schema Guidance Mode**: Uses prompt-based schema guidance
-- ✅ Added `test_grammar_generation()` function for testing
-- ✅ Updated `process_request()` to handle grammar parameter
-### 2. Gradio Interface Enhancement
-- ✅ Added "🔗 Use Grammar (GBNF) Mode" checkbox
-- ✅ Updated submit button handler to pass grammar parameter
-- ✅ Enhanced model information section with grammar features description
-### 3. REST API Updates (`api.py`)
-- ✅ Added `use_grammar: bool = True` to `StructuredOutputRequest` model
-- ✅ Updated `/generate` endpoint to support grammar parameter
-- ✅ Updated `/generate_with_file` endpoint with `use_grammar` form field
-- ✅ Enhanced API documentation
-### 4. Documentation Updates
-- ✅ Updated `README.md` with comprehensive Grammar Mode section
-- ✅ Added feature tags: `grammar`, `gbnf`
-- ✅ Included usage examples for all interfaces
-- ✅ Added mode comparison table
-- ✅ Listed supported schema features
-### 5. Testing
-- ✅ Created `test_grammar_standalone.py` for validation
-- ✅ Successfully tested grammar generation with multiple schema types:
-  - Simple objects with required/optional properties
-  - Nested objects with arrays
-  - String enums support
-## 🎯 Key Features Added
-### Grammar Mode Benefits:
-- **100% valid JSON** - No parsing errors
-- **Schema compliance** - Guaranteed structure adherence
-- **Consistent output** - Reliable format every time
-- **Better performance** - Fewer retry attempts needed
-### Supported Schema Features:
-- ✅ Objects with required/optional properties
-- ✅ Arrays with typed items
-- ✅ String enums
-- ✅ Numbers and integers
-- ✅ Booleans
-- ✅ Nested objects and arrays
-- ⚠️ Complex conditionals (simplified)
-## 🎛️ Usage Examples
-### Gradio Interface:
-- Toggle the "🔗 Use Grammar (GBNF) Mode" checkbox (enabled by default)
-### REST API:
-```json
-{
-  "prompt": "Analyze this data...",
-  "json_schema": {
-    "type": "object",
-    "properties": {
-      "result": {"type": "string"},
-      "confidence": {"type": "number"}
-    }
-  },
-  "use_grammar": true
-}
-```
-### Python API:
-```python
-result = llm_client.generate_structured_response(
-    prompt="Your prompt",
-    json_schema=schema,
-    use_grammar=True  # Enable grammar mode
-)
-```
-## 🔍 Validation
-All grammar generation functionality has been tested and validated:
-- ✅ Grammar generation from JSON schemas works correctly
-- ✅ GBNF output format is valid
-- ✅ Enum support is functional
-- ✅ Nested structures are handled properly
-## 🚀 Ready for Production
-The implementation is complete and ready for use in Docker environments. Grammar mode provides more reliable structured output generation while maintaining backward compatibility with the existing schema guidance approach.

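Note on the grammar mode described in the deleted GRAMMAR_CHANGES.md: the idea of turning a JSON schema into a GBNF grammar can be illustrated with a much smaller converter than the one in `app.py`. The sketch below is an assumption-laden simplification, not the commit's `_json_schema_to_gbnf`: the name `flat_object_to_gbnf` is hypothetical, it handles only flat objects with string, number, integer, and boolean properties, it assumes plain ASCII property names, and it treats every property as required.

```python
import json
from typing import Any, Dict

# Simplified primitive rules (assumption: no escape sequences inside strings,
# no exponent notation for numbers).
PRIMITIVES = {
    "ws": r'[ \t\n]*',
    "string": r'"\"" [^"\\]* "\""',
    "number": r'"-"? [0-9]+ ("." [0-9]+)?',
    "boolean": r'"true" | "false"',
}

# JSON schema type -> name of the grammar rule that matches it
TYPE_TO_RULE = {
    "string": "string",
    "number": "number",
    "integer": "number",
    "boolean": "boolean",
}


def flat_object_to_gbnf(schema: Dict[str, Any]) -> str:
    """Build a GBNF grammar for a flat object; properties appear in schema order."""
    members = []
    for name, prop in schema.get("properties", {}).items():
        rule = TYPE_TO_RULE.get(prop.get("type"), "string")  # fallback: string
        # Dumping twice yields a GBNF literal that matches the quoted key,
        # e.g. result -> "\"result\""
        key_literal = json.dumps(json.dumps(name))
        members.append(f'{key_literal} ws ":" ws {rule}')
    if members:
        body = ' ws "," ws '.join(members)
        root = f'"{{" ws {body} ws "}}"'
    else:
        root = '"{" ws "}"'
    lines = [f"root ::= {root}"]
    lines += [f"{rule_name} ::= {definition}" for rule_name, definition in PRIMITIVES.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "type": "object",
        "properties": {
            "result": {"type": "string"},
            "confidence": {"type": "number"},
        },
    }
    print(flat_object_to_gbnf(example))
```

In llama-cpp-python, a grammar string like this would typically be compiled with `LlamaGrammar.from_string(...)` and passed as the `grammar` argument of the generation call, which is how `generation_params["grammar"]` is used in `app.py`.
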
app.py CHANGED

@@ -1,3 +1,9 @@
 import json
 import os
 import gradio as gr
@@ -9,7 +15,7 @@ from config import Config
 # Try to import llama_cpp with fallback
 try:
-    from llama_cpp import Llama, LlamaGrammar
     LLAMA_CPP_AVAILABLE = True
 except ImportError as e:
     print(f"Warning: llama-cpp-python not available: {e}")
@@ -27,9 +33,14 @@ except ImportError as e:
     hf_hub_download = None
 # Setup logging
-logging.basicConfig(level=logging.INFO)
 logger = logging.getLogger(__name__)
 class StructuredOutputRequest(BaseModel):
     prompt: str
     image: Optional[str] = None  # base64 encoded image
@@ -144,14 +155,19 @@ class LLMClient:
                 lora_base=None,
                 lora_path=None,
                 seed=Config.SEED,
-                verbose=True  # Enable verbose for debugging
             )
             logger.info("Model successfully loaded and initialized")
             # Test model with a simple prompt to verify it's working
             logger.info("Testing model with simple prompt...")
-            test_response = self.llm("Hello", max_tokens=1, temperature=0.1)
             logger.info("Model test successful")
         except Exception as e:
@@ -175,11 +191,13 @@ class LLMClient:
     def _format_prompt_with_schema(self, prompt: str, json_schema: Dict[str, Any]) -> str:
         """
-        Format prompt for structured output generation
         """
         schema_str = json.dumps(json_schema, ensure_ascii=False, indent=2)
-        formatted_prompt = f"""User: {prompt}
 Please respond in strict accordance with the following JSON schema:
@@ -187,139 +205,72 @@ Please respond in strict accordance with the following JSON schema:
 {schema_str}
 ```
-Return ONLY valid JSON without additional comments or explanations."""
         return formatted_prompt
-def _json_schema_to_gbnf(schema: Dict[str, Any], root_name: str = "root") -> str:
-    """Convert JSON schema to GBNF (Backus-Naur Form) grammar for structured output"""
-    rules = []
-    rule_names = set()  # Track rule names to avoid duplicates
-    def add_rule(name: str, definition: str):
-        if name not in rule_names:
-            rules.append(f"{name} ::= {definition}")
-            rule_names.add(name)
-    def process_type(schema_part: Dict[str, Any], type_name: str = "value") -> str:
-        if "type" not in schema_part:
-            # Handle anyOf, oneOf, allOf cases - simplified to string for now
-            return "string"
-        schema_type = schema_part["type"]
-        if schema_type == "object":
-            # Handle object type
-            properties = schema_part.get("properties", {})
-            required = schema_part.get("required", [])
-            if not properties:
-                add_rule(type_name, '"{" ws "}"')
-                return type_name
-            # Separate required and optional parts
-            required_parts = []
-            optional_parts = []
-            for prop_name, prop_schema in properties.items():
-                prop_type_name = f"{type_name}_{prop_name}"
-                prop_type = process_type(prop_schema, prop_type_name)
-                prop_def = f'"\\"" "{prop_name}" "\\"" ws ":" ws {prop_type}'
-                if prop_name in required:
-                    required_parts.append(prop_def)
-                else:
-                    optional_parts.append(prop_def)
-            # Build object structure - simplified approach
-            if not required_parts and not optional_parts:
-                object_def = '"{" ws "}"'
-            else:
-                # For simplicity, create a fixed structure based on required fields only
-                # and treat optional fields as always present but with optional values
-                if not required_parts:
-                    # Only optional fields - make the whole object optional content
-                    if len(optional_parts) == 1:
-                        object_def = f'"{" ws ({optional_parts[0]})? ws "}"'
-                    else:
-                        comma_separated = ' ws "," ws '.join(optional_parts)
-                        object_def = f'"{" ws ({comma_separated})? ws "}"'
-                else:
-                    # Has required fields
-                    all_parts = required_parts.copy()
-                    # Add optional parts as truly optional (with optional commas)
-                    for opt_part in optional_parts:
-                        all_parts.append(f'(ws "," ws {opt_part})?')
-                    if len(all_parts) == 1:
-                        object_def = f'"{" ws {all_parts[0]} ws "}"'
-                    else:
-                        # Join required parts with commas, optional parts are already with optional commas
-                        required_with_commas = ' ws "," ws '.join(required_parts)
-                        optional_with_commas = ' '.join([f'(ws "," ws {opt})?' for opt in optional_parts])
-                        if optional_with_commas:
-                            object_def = f'"{{" ws {required_with_commas} {optional_with_commas} ws "}}"'
-                        else:
-                            object_def = f'"{{" ws {required_with_commas} ws "}}"'
-            add_rule(type_name, object_def)
-            return type_name
-        elif schema_type == "array":
-            # Handle array type
-            items_schema = schema_part.get("items", {})
-            items_type_name = f"{type_name}_items"
-            item_type = process_type(items_schema, f"{type_name}_item")
-            # Create array items rule
-            add_rule(items_type_name, f"{item_type} (ws \",\" ws {item_type})*")
-            add_rule(type_name, f'"[" ws ({items_type_name})? ws "]"')
-            return type_name
-        elif schema_type == "string":
-            # Handle string type with enum support
-            if "enum" in schema_part:
-                enum_values = schema_part["enum"]
-                enum_options = ' | '.join([f'"\\"" "{val}" "\\""' for val in enum_values])
-                add_rule(type_name, enum_options)
-                return type_name
-            else:
-                return "string"
-        elif schema_type == "number" or schema_type == "integer":
-            return "number"
-        elif schema_type == "boolean":
-            return "boolean"
-        else:
-            return "string"  # fallback
-    # Process root schema
-    process_type(schema, root_name)
-    # Basic GBNF rules for primitives
-    basic_rules = [
-        'ws ::= [ \\t\\n]*',
-        'string ::= "\\"" char* "\\""',
-        'char ::= [^"\\\\] | "\\\\" (["\\\\bfnrt] | "u" hex hex hex hex)',
-        'hex ::= [0-9a-fA-F]',
-        'number ::= "-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [+-]? [0-9]+)?',
-        'boolean ::= "true" | "false"',
-        'null ::= "null"'
-    ]
-    # Add basic rules only if they haven't been added yet
-    for rule in basic_rules:
-        rule_name = rule.split(' ::= ')[0]
-        if rule_name not in rule_names:
-            rules.append(rule)
-            rule_names.add(rule_name)
-    return "\\n".join(rules)
     def generate_structured_response(self,
                                    prompt: str,
                                    json_schema: Union[str, Dict[str, Any]],
@@ -360,17 +311,21 @@ def _json_schema_to_gbnf(schema: Dict[str, Any], root_name: str = "root") -> str
             generation_params = {
                 "max_tokens": Config.MAX_NEW_TOKENS,
                 "temperature": Config.TEMPERATURE,
                 "echo": False
             }
             # Add grammar or stop tokens based on mode
             if use_grammar and grammar is not None:
                 generation_params["grammar"] = grammar
-                # For grammar mode, use a simpler prompt without schema explanation
-                simple_prompt = f"User: {prompt}\n\nAssistant:"
                 response = self.llm(simple_prompt, **generation_params)
             else:
-                generation_params["stop"] = ["User:", "\n\n", "Assistant:", "Human:"]
                 response = self.llm(formatted_prompt, **generation_params)
             # Extract generated text
@@ -385,11 +340,7 @@ def _json_schema_to_gbnf(schema: Dict[str, Any], root_name: str = "root") -> str
                 if json_start != -1 and json_end > json_start:
                     json_str = generated_text[json_start:json_end]
                     parsed_response = json.loads(json_str)
-                    return {
-                        "success": True,
-                        "data": parsed_response,
-                        "raw_response": generated_text
-                    }
                 else:
                     return {
                         "error": "Could not find JSON in model response",
@@ -408,6 +359,99 @@ def _json_schema_to_gbnf(schema: Dict[str, Any], root_name: str = "root") -> str
                 "error": f"Generation error: {str(e)}"
             }
 def test_grammar_generation(json_schema_str: str) -> Dict[str, Any]:
     """
     Test grammar generation without running the full model
@@ -457,6 +501,43 @@ def process_request(prompt: str,
     result = llm_client.generate_structured_response(prompt, json_schema, image, use_grammar)
     return json.dumps(result, ensure_ascii=False, indent=2)
 # Examples for demonstration
 example_schema = """{
   "type": "object",
@@ -502,89 +583,12 @@ def create_gradio_interface():
         else:
             gr.Markdown("✅ **Status**: Model successfully loaded and ready to work")
-        with gr.Row():
-            with gr.Column():
-                prompt_input = gr.Textbox(
-                    label="Prompt for model",
-                    placeholder="Enter your request...",
-                    lines=5,
-                    value=example_prompt
-                )
-                image_input = gr.Image(
-                    label="Image (optional, for multimodal models)",
-                    type="pil"
-                )
-                schema_input = gr.Textbox(
-                    label="JSON schema for response structure",
-                    placeholder="Enter JSON schema...",
-                    lines=15,
-                    value=example_schema
-                )
-                grammar_checkbox = gr.Checkbox(
-                    label="🔗 Use Grammar (GBNF) Mode",
-                    value=True,
-                    info="Enable grammar-based structured output for more precise JSON generation"
-                )
-                submit_btn = gr.Button("Generate Response", variant="primary")
-            with gr.Column():
-                output = gr.Textbox(
-                    label="Structured Response",
-                    lines=20,
-                    interactive=False
-                )
-        submit_btn.click(
-            fn=process_request,
-            inputs=[prompt_input, schema_input, image_input, grammar_checkbox],
-            outputs=output
-        )
-        # Examples
-        gr.Markdown("## 📋 Usage Examples")
-        examples = gr.Examples(
-            examples=[
-                [
-                    "Describe today's weather in New York",
-                    """{
-  "type": "object",
-  "properties": {
-    "temperature": {"type": "number"},
-    "description": {"type": "string"},
-    "humidity": {"type": "number"}
-  }
-}""",
-                    None
-                ],
-                [
-                    "Create a Python learning plan for one month",
-                    """{
-  "type": "object",
-  "properties": {
-    "weeks": {
-      "type": "array",
-      "items": {
-        "type": "object",
-        "properties": {
-          "week_number": {"type": "integer"},
-          "topics": {"type": "array", "items": {"type": "string"}},
-          "practice_hours": {"type": "number"}
-        }
-      }
-    },
-    "total_hours": {"type": "number"}
-  }
-}""",
-                    None
-                ]
-            ],
-            inputs=[prompt_input, schema_input, image_input]
-        )
         # Model information
         gr.Markdown(f"""
@@ -612,10 +616,155 @@ def create_gradio_interface():
 - Strict enforcement of JSON structure during generation
 - Support for objects, arrays, strings, numbers, booleans, and enums
 - Improved consistency and reliability of structured outputs
         """)
     return demo
 if __name__ == "__main__":
     # Create and launch Gradio interface
     demo = create_gradio_interface()
@@ -623,5 +772,5 @@ if __name__ == "__main__":
         server_name=Config.HOST,
         server_port=Config.GRADIO_PORT,
         share=False,
-        debug=True
     )

+import os
+os.environ.setdefault("OMP_NUM_THREADS", "1")
+os.environ.setdefault("OPENBLAS_NUM_THREADS", "1")
+os.environ.setdefault("MKL_NUM_THREADS", "1")
+os.environ.setdefault("NUMEXPR_NUM_THREADS", "1")
 import json
 import os
 import gradio as gr
 # Try to import llama_cpp with fallback
 try:
+    from llama_cpp import Llama, LlamaGrammar, LlamaRAMCache
     LLAMA_CPP_AVAILABLE = True
 except ImportError as e:
     print(f"Warning: llama-cpp-python not available: {e}")
     hf_hub_download = None
 # Setup logging
+log_level = getattr(logging, Config.LOG_LEVEL.upper())
+logging.basicConfig(level=log_level)
 logger = logging.getLogger(__name__)
+# Reduce llama-cpp-python verbosity
+llama_logger = logging.getLogger('llama_cpp')
+llama_logger.setLevel(logging.WARNING)
 class StructuredOutputRequest(BaseModel):
     prompt: str
     image: Optional[str] = None  # base64 encoded image
                 lora_base=None,
                 lora_path=None,
                 seed=Config.SEED,
+                verbose=False  # Disable verbose to reduce log noise
             )
+            # cache = LlamaRAMCache()
+            # self.llm.set_cache(cache)
             logger.info("Model successfully loaded and initialized")
             # Test model with a simple prompt to verify it's working
+            from time import time
             logger.info("Testing model with simple prompt...")
+            start_time = time()
+            test_response = self.llm("Hello", max_tokens=1, temperature=1.0, top_k=64, top_p=0.95, min_p=0.0)
+            logger.info(f"Model test time: {time() - start_time:.2f} seconds, response: {test_response}")
             logger.info("Model test successful")
         except Exception as e:
     def _format_prompt_with_schema(self, prompt: str, json_schema: Dict[str, Any]) -> str:
         """
+        Format prompt for structured output generation using Gemma chat format
         """
         schema_str = json.dumps(json_schema, ensure_ascii=False, indent=2)
+        # Use Gemma chat format with proper tokens
+        formatted_prompt = f"""<bos><start_of_turn>user
+{prompt}
 Please respond in strict accordance with the following JSON schema:
 {schema_str}
 ```
+Return ONLY valid JSON without additional comments or explanations.<end_of_turn>
+<start_of_turn>model
+"""
         return formatted_prompt
+    def _format_gemma_chat(self, messages: list) -> str:
+        """
+        Format messages in Gemma chat format
+        Args:
+            messages: List of dicts with 'role' and 'content' keys
+                     role can be 'user' or 'model'
+        """
+        formatted_parts = ["<bos>"]
+        for message in messages:
+            role = message.get('role', 'user')
+            content = message.get('content', '')
+            if role not in ['user', 'model']:
+                role = 'user'  # fallback to user role
+            formatted_parts.append(f"<start_of_turn>{role}")
+            formatted_parts.append(content)
+            formatted_parts.append("<end_of_turn>")
+        # Add start of model response
+        formatted_parts.append("<start_of_turn>model")
+        return "\n".join(formatted_parts)
+    def generate_chat_response(self, messages: list, max_tokens: int = None) -> str:
+        """
+        Generate response using Gemma chat format
+        Args:
+            messages: List of message dicts with 'role' and 'content' keys
+            max_tokens: Maximum tokens for generation
+        Returns:
+            Generated response text
+        """
+        if not messages:
+            raise ValueError("Messages list cannot be empty")
+        # Format messages using Gemma chat format
+        formatted_prompt = self._format_gemma_chat(messages)
+        # Set generation parameters
+        generation_params = {
+            "max_tokens": max_tokens or Config.MAX_NEW_TOKENS,
+            "temperature": Config.TEMPERATURE,
+            "top_k": 64,
+            "top_p": 0.95,
+            "min_p": 0.0,
+            "echo": False,
+            "stop": ["<end_of_turn>", "<start_of_turn>", "<bos>"]
+        }
+        # Generate response
+        response = self.llm(formatted_prompt, **generation_params)
+        generated_text = response['choices'][0]['text'].strip()
+        return generated_text
     def generate_structured_response(self,
                                    prompt: str,
                                    json_schema: Union[str, Dict[str, Any]],
             generation_params = {
                 "max_tokens": Config.MAX_NEW_TOKENS,
                 "temperature": Config.TEMPERATURE,
+                "top_k": 64,
+                "top_p": 0.95,
+                "min_p": 0.0,
                 "echo": False
             }
             # Add grammar or stop tokens based on mode
             if use_grammar and grammar is not None:
                 generation_params["grammar"] = grammar
+                # For grammar mode, use a simpler prompt in Gemma format
+                simple_prompt = f"<bos><start_of_turn>user\n{prompt}<end_of_turn>\n<start_of_turn>model\n"
                 response = self.llm(simple_prompt, **generation_params)
             else:
+                # Update stop tokens for Gemma format
+                generation_params["stop"] = ["<end_of_turn>", "<start_of_turn>", "<bos>"]
                 response = self.llm(formatted_prompt, **generation_params)
             # Extract generated text
                 if json_start != -1 and json_end > json_start:
                     json_str = generated_text[json_start:json_end]
                     parsed_response = json.loads(json_str)
+                    return parsed_response
                 else:
                     return {
                         "error": "Could not find JSON in model response",
                 "error": f"Generation error: {str(e)}"
             }
+def _json_schema_to_gbnf(schema: Dict[str, Any], root_name: str = "root") -> str:
+    """Convert JSON schema to GBNF (Backus-Naur Form) grammar for structured output"""
+    rules = {}  # Use dict to maintain order and avoid duplicates
+    def add_rule(name: str, definition: str):
+        if name not in rules:
+            rules[name] = f"{name} ::= {definition}"
+    def process_type(schema_part: Dict[str, Any], type_name: str = "value") -> str:
+        if "type" not in schema_part:
+            # Handle anyOf, oneOf, allOf cases - simplified to string for now
+            return "string"
+        schema_type = schema_part["type"]
+        if schema_type == "object":
+            # Handle object type
+            properties = schema_part.get("properties", {})
+            required = schema_part.get("required", [])
+            if not properties:
+                add_rule(type_name, '"{" ws "}"')
+                return type_name
+            # Build object properties
+            property_rules = []
+            for prop_name, prop_schema in properties.items():
+                prop_type_name = f"{type_name}_{prop_name}"
+                prop_type = process_type(prop_schema, prop_type_name)
+                property_rules.append(f'"\\"" "{prop_name}" "\\"" ws ":" ws {prop_type}')
+            # Create a simplified object structure with all properties as required
+            # This avoids complex optional field handling that can cause parsing issues
+            if len(property_rules) == 1:
+                object_def = f'"{{" ws {property_rules[0]} ws "}}"'
+            else:
+                properties_joined = ' ws "," ws '.join(property_rules)
+                object_def = f'"{{" ws {properties_joined} ws "}}"'
+            add_rule(type_name, object_def)
+            return type_name
+        elif schema_type == "array":
+            # Handle array type
+            items_schema = schema_part.get("items", {})
+            items_type_name = f"{type_name}_items"
+            item_type = process_type(items_schema, f"{type_name}_item")
+            # Create array items rule
+            add_rule(items_type_name, f"{item_type} (ws \",\" ws {item_type})*")
+            add_rule(type_name, f'"[" ws ({items_type_name})? ws "]"')
+            return type_name
+        elif schema_type == "string":
+            # Handle string type with enum support
+            if "enum" in schema_part:
+                enum_values = schema_part["enum"]
+                enum_options = ' | '.join([f'"\\"" "{val}" "\\""' for val in enum_values])
+                add_rule(type_name, enum_options)
+                return type_name
+            else:
+                return "string"
+        elif schema_type == "number" or schema_type == "integer":
+            return "number"
+        elif schema_type == "boolean":
+            return "boolean"
+        else:
+            return "string"  # fallback
+    # First add basic GBNF rules for primitives to ensure they come first
+    basic_rules_data = [
+        ('ws', '[ \\t\\n]*'),
+        ('string', '"\\"" char* "\\""'),
+        ('char', '[^"\\\\] | "\\\\" (["\\\\bfnrt] | "u" hex hex hex hex)'),
+        ('hex', '[0-9a-fA-F]'),
+        ('number', '"-"? ("0" | [1-9] [0-9]*) ("." [0-9]+)? ([eE] [+-]? [0-9]+)?'),
+        ('boolean', '"true" | "false"'),
+        ('null', '"null"')
+    ]
+    for rule_name, rule_def in basic_rules_data:
+        add_rule(rule_name, rule_def)
+    # Process root schema to build all custom rules
+    process_type(schema, root_name)
+    # Return rules in the order they were added
+    return "\n".join(rules.values())
 def test_grammar_generation(json_schema_str: str) -> Dict[str, Any]:
     """
     Test grammar generation without running the full model
     result = llm_client.generate_structured_response(prompt, json_schema, image, use_grammar)
     return json.dumps(result, ensure_ascii=False, indent=2)
+def test_gemma_chat(messages_text: str) -> str:
+    """
+    Test Gemma chat format with example conversation
+    """
+    if llm_client is None:
+        return "Error: LLM client not initialized"
+    try:
+        # Parse messages from text (simple format: role:message per line)
+        messages = []
+        for line in messages_text.strip().split('\n'):
+            if ':' in line:
+                role, content = line.split(':', 1)
+                role = role.strip().lower()
+                content = content.strip()
+                if role in ['user', 'model']:
+                    messages.append({"role": role, "content": content})
+        if not messages:
+            # Use default example if no valid messages provided
+            messages = [
+                {"role": "user", "content": "Hello!"},
+                {"role": "model", "content": "Hey there!"},
+                {"role": "user", "content": "What is 1+1?"}
+            ]
+        # Generate formatted prompt to show the structure
+        formatted_prompt = llm_client._format_gemma_chat(messages)
+        # Generate response
+        response = llm_client.generate_chat_response(messages, max_tokens=100)
+        return f"Formatted prompt:\n{formatted_prompt}\n\nGenerated response:\n{response}"
+    except Exception as e:
+        return f"Error: {str(e)}"
 # Examples for demonstration
 example_schema = """{
   "type": "object",
         else:
             gr.Markdown("✅ **Status**: Model successfully loaded and ready to work")
+        with gr.Tabs():
+            with gr.TabItem("🔧 Structured Output"):
+                create_structured_output_tab()
+            with gr.TabItem("💬 Gemma Chat Format"):
+                create_gemma_chat_tab()
         # Model information
         gr.Markdown(f"""
 - Strict enforcement of JSON structure during generation
 - Support for objects, arrays, strings, numbers, booleans, and enums
 - Improved consistency and reliability of structured outputs
+📝 **Gemma Format Features**:
+- Uses proper Gemma chat tokens: `<bos>`, `<start_of_turn>`, `<end_of_turn>`
+- Supports multi-turn conversations with user/model roles
+- Compatible with Gemma model's expected input format
+- Improved response quality with proper token structure
         """)
     return demo
+def create_structured_output_tab():
+    """Create structured output tab"""
+    with gr.Row():
+        with gr.Column():
+            prompt_input = gr.Textbox(
+                label="Prompt for model",
+                placeholder="Enter your request...",
+                lines=5,
+                value=example_prompt
+            )
+            image_input = gr.Image(
+                label="Image (optional, for multimodal models)",
+                type="pil"
+            )
+            schema_input = gr.Textbox(
+                label="JSON schema for response structure",
+                placeholder="Enter JSON schema...",
+                lines=15,
+                value=example_schema
+            )
+            grammar_checkbox = gr.Checkbox(
+                label="🔗 Use Grammar (GBNF) Mode",
+                value=True,
+                info="Enable grammar-based structured output for more precise JSON generation"
+            )
+            submit_btn = gr.Button("Generate Response", variant="primary")
+        with gr.Column():
+            output = gr.Textbox(
+                label="Structured Response",
+                lines=20,
+                interactive=False
+            )
+    submit_btn.click(
+        fn=process_request,
+        inputs=[prompt_input, schema_input, image_input, grammar_checkbox],
+        outputs=output
+    )
+    # Examples
+    gr.Markdown("## 📋 Usage Examples")
+    examples = gr.Examples(
+        examples=[
+            [
+                "Describe today's weather in New York",
+                """{
+  "type": "object",
+  "properties": {
+    "temperature": {"type": "number"},
+    "description": {"type": "string"},
+    "humidity": {"type": "number"}
+  }
+}""",
+                None
+            ],
+            [
+                "Create a Python learning plan for one month",
+                """{
+  "type": "object",
+  "properties": {
+    "weeks": {
+      "type": "array",
+      "items": {
+        "type": "object",
+        "properties": {
+          "week_number": {"type": "integer"},
+          "topics": {"type": "array", "items": {"type": "string"}},
+          "practice_hours": {"type": "number"}
+        }
+      }
+    },
+    "total_hours": {"type": "number"}
+  }
+}""",
+                None
+            ]
+        ],
+        inputs=[prompt_input, schema_input, image_input]
+    )
+def create_gemma_chat_tab():
+    """Create Gemma chat format demonstration tab"""
+    gr.Markdown("## 💬 Gemma Chat Format Demo")
+    gr.Markdown("This tab demonstrates the Gemma chat format with `<bos>`, `<start_of_turn>`, and `<end_of_turn>` tokens.")
+    with gr.Row():
+        with gr.Column():
+            messages_input = gr.Textbox(
+                label="Conversation Messages (format: role: message per line)",
+                placeholder="user: Hello!\nmodel: Hey there!\nuser: What is 1+1?",
+                lines=8,
+                value="user: Hello!\nmodel: Hey there!\nuser: What is 1+1?"
+            )
+            test_btn = gr.Button("Test Gemma Format", variant="primary")
+        with gr.Column():
+            chat_output = gr.Textbox(
+                label="Formatted Prompt and Response",
+                lines=15,
+                interactive=False
+            )
+    test_btn.click(
+        fn=test_gemma_chat,
+        inputs=messages_input,
+        outputs=chat_output
+    )
+    # Example explanation
+    gr.Markdown("""
+    ### 📝 Format Explanation
+    The Gemma chat format uses special tokens to structure conversations:
+    - `<bos>` - Beginning of the sequence
+    - `<start_of_turn>user` - Start of a user message
+    - `<end_of_turn>` - End of the current message
+    - `<start_of_turn>model` - Start of a model response
+    **Example structure:**
+    ```
+    <bos><start_of_turn>user
+    Hello!<end_of_turn>
+    <start_of_turn>model
+    Hey there!<end_of_turn>
+    <start_of_turn>user
+    What is 1+1?<end_of_turn>
+    <start_of_turn>model
+    ```
+    This format is now used for both structured output and regular chat generation.
+    """)
 if __name__ == "__main__":
     # Create and launch Gradio interface
     demo = create_gradio_interface()
         server_name=Config.HOST,
         server_port=Config.GRADIO_PORT,
         share=False,
+        debug=False
     )
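
The Gemma chat layout described in the app.py diff above can be sketched as a small formatter. The body of `llm_client._format_gemma_chat` is not shown in this diff, so `format_gemma_chat` below is a hypothetical stand-in that only reproduces the documented example structure. It assumes the prompt ends with an open model turn followed by a newline, and that `<bos>` is written into the text rather than added by the tokenizer.

```python
from typing import Dict, List


def format_gemma_chat(messages: List[Dict[str, str]]) -> str:
    """Render user/model turns in the documented Gemma layout."""
    parts = ["<bos>"]
    for message in messages:
        role = message["role"]
        if role not in ("user", "model"):
            raise ValueError(f"unsupported role: {role!r}")
        parts.append(f"<start_of_turn>{role}\n{message['content']}<end_of_turn>\n")
    # Leave an open model turn so generation continues as the model.
    parts.append("<start_of_turn>model\n")
    return "".join(parts)


# Same conversation as the default example in test_gemma_chat.
conversation = [
    {"role": "user", "content": "Hello!"},
    {"role": "model", "content": "Hey there!"},
    {"role": "user", "content": "What is 1+1?"},
]
print(format_gemma_chat(conversation))
```

If the tokenizer already prepends a beginning-of-sequence token, the literal `<bos>` here may need to be dropped to avoid a duplicate.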

config.py CHANGED

@@ -5,19 +5,19 @@ class Config:
     """Application configuration for working with local GGUF models"""
     # Model settings - using Hugging Face downloaded model
-    MODEL_REPO: str = os.getenv("MODEL_REPO", "lmstudio-community/gemma-3n-E4B-it-text-GGUF")
-    MODEL_FILENAME: str = os.getenv("MODEL_FILENAME", "gemma-3n-E4B-it-Q8_0.gguf")
-    MODEL_PATH: str = os.getenv("MODEL_PATH", "/app/models/gemma-3n-E4B-it-Q8_0.gguf")
     HUGGINGFACE_TOKEN: str = os.getenv("HUGGINGFACE_TOKEN", "")
     # Model loading settings - optimized for Docker container
-    N_CTX: int = int(os.getenv("N_CTX", "4096"))  # Reduced context window for Docker
     N_GPU_LAYERS: int = int(os.getenv("N_GPU_LAYERS", "0"))  # CPU-only for Docker by default
-    N_THREADS: int = int(os.getenv("N_THREADS", "4"))  # Conservative thread count
     N_BATCH: int = int(os.getenv("N_BATCH", "512"))  # Smaller batch size for Docker
     USE_MLOCK: bool = os.getenv("USE_MLOCK", "false").lower() == "true"  # Disabled for Docker
     USE_MMAP: bool = os.getenv("USE_MMAP", "true").lower() == "true"  # Keep memory mapping
-    F16_KV: bool = os.getenv("F16_KV", "true").lower() == "true"  # Use 16-bit keys and values
     SEED: int = int(os.getenv("SEED", "42"))  # Random seed for reproducibility
     # Server settings - Docker compatible
@@ -25,9 +25,12 @@ class Config:
     GRADIO_PORT: int = int(os.getenv("GRADIO_PORT", "7860"))  # Standard HuggingFace Spaces port
     API_PORT: int = int(os.getenv("API_PORT", "8000"))
     # Generation settings - optimized for Docker
     MAX_NEW_TOKENS: int = int(os.getenv("MAX_NEW_TOKENS", "256"))  # Reduced for faster response
-    TEMPERATURE: float = float(os.getenv("TEMPERATURE", "0.1"))
     # File upload settings
     MAX_FILE_SIZE: int = int(os.getenv("MAX_FILE_SIZE", "10485760"))  # 10MB

     """Application configuration for working with local GGUF models"""
     # Model settings - using Hugging Face downloaded model
+    MODEL_REPO = "unsloth/gemma-3-270m-it-GGUF"
+    MODEL_FILENAME = "gemma-3-270m-it-Q8_0.gguf"
+    MODEL_PATH = f"/app/models/{MODEL_FILENAME}"
     HUGGINGFACE_TOKEN: str = os.getenv("HUGGINGFACE_TOKEN", "")
     # Model loading settings - optimized for Docker container
+    N_CTX: int = int(os.getenv("N_CTX", "1024"))  # Reduced context window for Docker
     N_GPU_LAYERS: int = int(os.getenv("N_GPU_LAYERS", "0"))  # CPU-only for Docker by default
+    N_THREADS: int = int(os.getenv("N_THREADS", "2"))  # Conservative thread count
     N_BATCH: int = int(os.getenv("N_BATCH", "512"))  # Smaller batch size for Docker
     USE_MLOCK: bool = os.getenv("USE_MLOCK", "false").lower() == "true"  # Disabled for Docker
     USE_MMAP: bool = os.getenv("USE_MMAP", "true").lower() == "true"  # Keep memory mapping
+    F16_KV: bool = os.getenv("F16_KV", "false").lower() == "true"  # Use 16-bit keys and values
     SEED: int = int(os.getenv("SEED", "42"))  # Random seed for reproducibility
     # Server settings - Docker compatible
     GRADIO_PORT: int = int(os.getenv("GRADIO_PORT", "7860"))  # Standard HuggingFace Spaces port
     API_PORT: int = int(os.getenv("API_PORT", "8000"))
+    # Logging settings
+    LOG_LEVEL: str = os.getenv("LOG_LEVEL", "INFO")  # INFO, WARNING, ERROR, DEBUG
     # Generation settings - optimized for Docker
     MAX_NEW_TOKENS: int = int(os.getenv("MAX_NEW_TOKENS", "256"))  # Reduced for faster response
+    TEMPERATURE: float = 1.0
     # File upload settings
     MAX_FILE_SIZE: int = int(os.getenv("MAX_FILE_SIZE", "10485760"))  # 10MB
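
The settings above repeat the same `os.getenv(...)` parsing on every line, and a malformed value such as `N_CTX=abc` would raise a bare `ValueError` at import time. One possible variation, sketched here under the assumption that clearer error messages are wanted, is to route the parsing through small helpers. `env_bool` and `env_int` are hypothetical names and are not part of config.py. Note that `env_bool` accepts more spellings than the `== "true"` comparison used in the diff.

```python
import os


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean flag from the environment, falling back to default."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int, minimum: int = 0) -> int:
    """Read an integer from the environment and reject malformed values."""
    raw = os.getenv(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError as exc:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc
    if value < minimum:
        raise ValueError(f"{name} must be >= {minimum}, got {value}")
    return value


# Usage mirroring two of the settings in this diff.
N_CTX = env_int("N_CTX", 1024, minimum=1)
USE_MMAP = env_bool("USE_MMAP", True)
```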

requirements.txt CHANGED

@@ -1,6 +1,4 @@
 huggingface_hub==0.25.2
-# Core ML dependencies - updated for compatibility with gemma-3n-E4B model
-# https://github.com/abetlen/llama-cpp-python/releases/download/v0.3.2/llama_cpp_python-0.3.2-cp310-cp310-linux_x86_64.whl
 # Web interface
 gradio==4.44.1

 huggingface_hub==0.25.2
 # Web interface
 gradio==4.44.1

test.ipynb DELETED

@@ -1,24 +0,0 @@
-{
- "cells": [],
- "metadata": {
-  "kernelspec": {
-   "display_name": "py310",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.18"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}

wheels/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl ADDED

@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:73ff502f10b7d2c985879796fc80ea212a71a9114bf26b90b7bd70c2842ba967
+size 4259580
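
The three lines above are a Git LFS pointer rather than the wheel itself: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a rough sketch, a fetched wheel could be checked against the pointer before installation as shown below. `parse_lfs_pointer` and `matches_pointer` are hypothetical helpers, and the commented usage assumes the wheel has already been pulled to the path given in this diff.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_lfs_pointer(text: str) -> Dict[str, str]:
    """Split each 'key value' line of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


def matches_pointer(path: Path, pointer: Dict[str, str]) -> bool:
    """Check a file's size and SHA-256 digest against the pointer fields."""
    algorithm, _, expected = pointer["oid"].partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algorithm!r}")
    if path.stat().st_size != int(pointer["size"]):
        return False
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB chunks so large files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected


POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:73ff502f10b7d2c985879796fc80ea212a71a9114bf26b90b7bd70c2842ba967
size 4259580
"""
pointer = parse_lfs_pointer(POINTER_TEXT)
# Example call, assuming the wheel has been fetched locally:
# matches_pointer(Path("wheels/llama_cpp_python-0.3.16-cp310-cp310-linux_x86_64.whl"), pointer)
```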