Samarth Naik committed
Commit 01fa9b6 · 1 Parent(s): 414d456

added init files
Files changed (5)
  1. Dockerfile +24 -0
  2. README.md +123 -4
  3. app.py +209 -0
  4. requirements.txt +7 -0
  5. test_api.py +60 -0
Dockerfile ADDED
@@ -0,0 +1,24 @@
+ FROM python:3.9-slim
+
+ WORKDIR /code
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for better caching
+ COPY requirements.txt .
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY . .
+
+ # Expose port
+ EXPOSE 5001
+
+ # Run the application
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,12 +1,131 @@
+ # Llama-3.1-8B-Instruct Flask API
+
  ---
  title: Llamamodel
  emoji: ⚡
  colorFrom: yellow
  colorTo: pink
- sdk: gradio
- sdk_version: 6.2.0
- app_file: app.py
+ sdk: docker
+ app_port: 5001
  pinned: false
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ A Flask web application that serves the Meta Llama-3.1-8B-Instruct model via a REST API.
+
+ ## Features
+
+ - RESTful API with `/compute` endpoint
+ - JSON input/output
+ - Configurable generation parameters
+ - Memory-optimized model loading with 8-bit quantization
+ - CORS support
+ - Error handling and logging
+
+ ## Deployment to Hugging Face Spaces
+
+ This application is configured to run on Hugging Face Spaces using Docker. Once pushed:
+
+ 1. The model will automatically load on startup
+ 2. The `/compute` endpoint will be available at your Space URL
+ 3. Use POST requests with JSON payloads to generate responses
+
+ ## Local Development
+
+ 1. Install the required dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ 2. Run the Flask application:
+ ```bash
+ python app.py
+ ```
+
+ The application will start on `http://localhost:5001` by default.
+
+ ## Usage
+
+ ### Health Check
+
+ ```bash
+ GET http://localhost:5001/
+ ```
+
+ Response:
+ ```json
+ {
+     "status": "success",
+     "message": "Llama-3.1-8B-Instruct Flask API is running",
+     "model_loaded": true
+ }
+ ```
+
+ ### Generate Response
+
+ ```bash
+ POST https://your-space-name-username.hf.space/compute
+ ```
+
+ Request body:
+ ```json
+ {
+     "prompt": "What is the capital of France?",
+     "max_length": 256,
+     "temperature": 0.7,
+     "top_p": 0.9
+ }
+ ```
+
+ Response:
+ ```json
+ {
+     "status": "success",
+     "prompt": "What is the capital of France?",
+     "response": "The capital of France is Paris...",
+     "parameters": {
+         "max_length": 256,
+         "temperature": 0.7,
+         "top_p": 0.9
+     }
+ }
+ ```
+
+ ### Parameters
+
+ - `prompt` (required): The input text prompt
+ - `max_length` (optional): Maximum number of tokens to generate (default: 512)
+ - `temperature` (optional): Sampling temperature (default: 0.7)
+ - `top_p` (optional): Top-p sampling parameter (default: 0.9)
+
+ ## Testing
+
+ Run the test script to verify the API is working:
+
+ ```bash
+ python test_api.py
+ ```
+
+ ## Example with curl
+
+ ```bash
+ # Health check
+ curl http://localhost:5001/
+
+ # Generate response
+ curl -X POST http://localhost:5001/compute \
+     -H "Content-Type: application/json" \
+     -d '{"prompt": "Explain machine learning in simple terms"}'
+ ```
+
+ ## System Requirements
+
+ - Python 3.8+
+ - CUDA-capable GPU (recommended)
+ - At least 16GB RAM
+ - 20GB+ free disk space for model weights
+
+ ## Notes
+
+ - The model uses 8-bit quantization to reduce memory usage
+ - The first request may take longer while the model initializes
+ - The application logs model loading progress and errors
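The request contract documented above can also be exercised from Python rather than curl. The sketch below builds a `/compute` payload with the documented defaults; `build_payload` is a hypothetical helper for illustration, not part of this commit, and the resulting dict would be sent with `requests.post(url, json=payload)` against a running server:

```python
import json

def build_payload(prompt, max_length=512, temperature=0.7, top_p=0.9):
    """Assemble a /compute request body using the defaults from the README."""
    return {
        "prompt": prompt,
        "max_length": max_length,
        "temperature": temperature,
        "top_p": top_p,
    }

# The dict can be POSTed with requests.post(url, json=payload)
payload = build_payload("What is the capital of France?", max_length=256)
print(json.dumps(payload, indent=2))
```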
app.py ADDED
@@ -0,0 +1,209 @@
+ from flask import Flask, request, jsonify
+ from flask_cors import CORS
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+ import torch
+ import logging
+ import os
+
+ # Set up logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ app = Flask(__name__)
+ CORS(app)  # Enable CORS for all routes
+
+ # Global variables for model and tokenizer
+ model = None
+ tokenizer = None
+
+ def load_model():
+     """Load the Llama model and tokenizer"""
+     global model, tokenizer
+
+     try:
+         logger.info("Loading Llama-3.1-8B-Instruct model...")
+         model_name = "meta-llama/Llama-3.1-8B-Instruct"
+
+         # Load tokenizer
+         tokenizer = AutoTokenizer.from_pretrained(model_name)
+
+         # Set pad token if it does not exist
+         if tokenizer.pad_token is None:
+             tokenizer.pad_token = tokenizer.eos_token
+
+         # Load model with optimizations
+         model = AutoModelForCausalLM.from_pretrained(
+             model_name,
+             torch_dtype=torch.float16,
+             device_map="auto",
+             load_in_8bit=True,  # Use 8-bit quantization to reduce memory usage
+             trust_remote_code=True
+         )
+
+         logger.info("Model loaded successfully!")
+
+     except Exception as e:
+         logger.error(f"Error loading model: {str(e)}")
+         raise
+
+ def generate_response(prompt, max_length=512, temperature=0.7, top_p=0.9):
+     """Generate a response using the loaded Llama model"""
+     global model, tokenizer
+
+     if model is None or tokenizer is None:
+         raise ValueError("Model not loaded. Please ensure the model is properly initialized.")
+
+     try:
+         # Format the prompt for Llama-3.1-Instruct
+         formatted_prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
+
+         # Tokenize the input
+         inputs = tokenizer.encode(formatted_prompt, return_tensors="pt")
+
+         # Move to the same device as the model
+         inputs = inputs.to(model.device)
+
+         # Generate response
+         with torch.no_grad():
+             outputs = model.generate(
+                 inputs,
+                 max_length=inputs.shape[1] + max_length,
+                 temperature=temperature,
+                 top_p=top_p,
+                 do_sample=True,
+                 pad_token_id=tokenizer.eos_token_id,
+                 eos_token_id=tokenizer.eos_token_id,
+                 repetition_penalty=1.1
+             )
+
+         # Decode only the newly generated tokens. Slicing off the prompt is
+         # necessary because skip_special_tokens=True removes the header
+         # tokens, so splitting the decoded string on them would never match.
+         response = tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True).strip()
+
+         return response
+
+     except Exception as e:
+         logger.error(f"Error generating response: {str(e)}")
+         raise
+
+ @app.route('/', methods=['GET'])
+ def home():
+     """Health check endpoint"""
+     return jsonify({
+         "status": "success",
+         "message": "Llama-3.1-8B-Instruct Flask API is running",
+         "model_loaded": model is not None and tokenizer is not None
+     })
+
+ @app.route('/compute', methods=['POST'])
+ def compute():
+     """Main endpoint to process prompts and return model responses"""
+     try:
+         # Check if the model is loaded
+         if model is None or tokenizer is None:
+             return jsonify({
+                 "status": "error",
+                 "message": "Model not loaded. Please wait for initialization."
+             }), 503
+
+         # Get JSON data from the request
+         data = request.get_json()
+
+         if not data:
+             return jsonify({
+                 "status": "error",
+                 "message": "No JSON data provided"
+             }), 400
+
+         # Extract the prompt from the JSON payload
+         prompt = data.get('prompt')
+
+         if not prompt:
+             return jsonify({
+                 "status": "error",
+                 "message": "No 'prompt' field found in JSON data"
+             }), 400
+
+         if not isinstance(prompt, str) or len(prompt.strip()) == 0:
+             return jsonify({
+                 "status": "error",
+                 "message": "Prompt must be a non-empty string"
+             }), 400
+
+         # Get optional parameters
+         max_length = data.get('max_length', 512)
+         temperature = data.get('temperature', 0.7)
+         top_p = data.get('top_p', 0.9)
+
+         # Validate parameters, falling back to defaults for out-of-range values
+         if not isinstance(max_length, int) or max_length <= 0 or max_length > 2048:
+             max_length = 512
+
+         if not isinstance(temperature, (int, float)) or temperature <= 0 or temperature > 2:
+             temperature = 0.7
+
+         if not isinstance(top_p, (int, float)) or top_p <= 0 or top_p > 1:
+             top_p = 0.9
+
+         # Generate the response
+         logger.info(f"Processing prompt: {prompt[:100]}...")
+         response = generate_response(prompt, max_length, temperature, top_p)
+
+         return jsonify({
+             "status": "success",
+             "prompt": prompt,
+             "response": response,
+             "parameters": {
+                 "max_length": max_length,
+                 "temperature": temperature,
+                 "top_p": top_p
+             }
+         })
+
+     except Exception as e:
+         logger.error(f"Error in compute endpoint: {str(e)}")
+         return jsonify({
+             "status": "error",
+             "message": f"Internal server error: {str(e)}"
+         }), 500
+
+ @app.errorhandler(404)
+ def not_found(error):
+     return jsonify({
+         "status": "error",
+         "message": "Endpoint not found"
+     }), 404
+
+ @app.errorhandler(500)
+ def internal_error(error):
+     return jsonify({
+         "status": "error",
+         "message": "Internal server error"
+     }), 500
+
+ if __name__ == '__main__':
+     # Load the model when starting the app
+     logger.info("Starting Flask application...")
+
+     try:
+         load_model()
+         logger.info("Application ready!")
+         logger.info("API endpoints:")
+         logger.info("  GET  /        - Health check")
+         logger.info("  POST /compute - Generate responses")
+
+         # Run the Flask app
+         port = int(os.environ.get('PORT', 5001))
+         app.run(
+             host='0.0.0.0',
+             port=port,
+             debug=False,
+             threaded=True
+         )
+
+     except Exception as e:
+         logger.error(f"Failed to start application: {str(e)}")
+         exit(1)
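The header-token prompt layout built inline in `generate_response` above can be factored into a small pure function. The sketch below (`format_llama31_prompt` is a hypothetical helper, not part of this commit) mirrors that string layout; because the prompt ends with the assistant header, generation continues directly with the model's reply:

```python
def format_llama31_prompt(user_prompt: str) -> str:
    """Wrap a user message in Llama-3.1-Instruct header tokens,
    mirroring the formatted_prompt string in generate_response()."""
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_prompt}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

formatted = format_llama31_prompt("What is the capital of France?")
print(formatted)
```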
requirements.txt ADDED
@@ -0,0 +1,7 @@
+ flask==3.0.0
+ transformers==4.36.0
+ torch==2.1.0
+ accelerate==0.25.0
+ bitsandbytes==0.41.3
+ flask-cors==4.0.0
+ huggingface_hub
test_api.py ADDED
@@ -0,0 +1,60 @@
+ import requests
+ import json
+
+ # Test the Flask API
+ def test_api():
+     url = "http://localhost:5001/compute"
+
+     # Test data
+     test_data = {
+         "prompt": "What is the capital of France?",
+         "max_length": 256,
+         "temperature": 0.7,
+         "top_p": 0.9
+     }
+
+     try:
+         print("Testing the /compute endpoint...")
+         print(f"Sending prompt: {test_data['prompt']}")
+
+         # Generation can be slow, so allow a generous timeout
+         response = requests.post(url, json=test_data, timeout=300)
+
+         if response.status_code == 200:
+             result = response.json()
+             print("\nResponse received successfully!")
+             print(f"Status: {result['status']}")
+             print(f"Response: {result['response']}")
+         else:
+             print(f"Error: {response.status_code}")
+             print(response.text)
+
+     except requests.exceptions.ConnectionError:
+         print("Error: Could not connect to the server. Make sure the Flask app is running on port 5001.")
+     except Exception as e:
+         print(f"Error: {str(e)}")
+
+ def test_health_check():
+     url = "http://localhost:5001/"
+
+     try:
+         print("Testing health check endpoint...")
+         response = requests.get(url, timeout=10)
+
+         if response.status_code == 200:
+             result = response.json()
+             print("Health check successful!")
+             print(json.dumps(result, indent=2))
+         else:
+             print(f"Error: {response.status_code}")
+             print(response.text)
+
+     except requests.exceptions.ConnectionError:
+         print("Error: Could not connect to the server. Make sure the Flask app is running on port 5001.")
+     except Exception as e:
+         print(f"Error: {str(e)}")
+
+ if __name__ == "__main__":
+     print("=== Flask API Test ===")
+     test_health_check()
+     print("\n" + "="*50 + "\n")
+     test_api()
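The `/compute` endpoint in `app.py` silently clamps out-of-range generation parameters back to their defaults rather than rejecting the request, which the test script above does not exercise. A pure-function sketch of that validation (a hypothetical refactor, not part of this commit) makes the fallback rules easy to check in isolation:

```python
def clamp_parameters(max_length=512, temperature=0.7, top_p=0.9):
    """Apply the same fallback rules as the /compute endpoint:
    wrongly typed or out-of-range values revert to the defaults."""
    if not isinstance(max_length, int) or max_length <= 0 or max_length > 2048:
        max_length = 512
    if not isinstance(temperature, (int, float)) or temperature <= 0 or temperature > 2:
        temperature = 0.7
    if not isinstance(top_p, (int, float)) or top_p <= 0 or top_p > 1:
        top_p = 0.9
    return max_length, temperature, top_p

# In-range values pass through; out-of-range values fall back to defaults
print(clamp_parameters(max_length=256, temperature=1.5, top_p=0.5))    # (256, 1.5, 0.5)
print(clamp_parameters(max_length=999999, temperature=-1, top_p=2.0))  # (512, 0.7, 0.9)
```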