david167 committed on
Commit 0bf99b7 · 1 Parent(s): 50ec035

Initial setup: Question Generation API with DeepHermes reasoning model

Files changed (6)
  1. .gitignore +60 -0
  2. Dockerfile +61 -0
  3. README.md +180 -5
  4. app.py +310 -0
  5. requirements.txt +13 -0
  6. test_api.py +215 -0
.gitignore ADDED
@@ -0,0 +1,60 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Virtual environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # OS
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
+
+ # Model cache
+ .cache/
+ *.bin
+ *.safetensors
+ *.gguf
+
+ # Logs
+ *.log
+ logs/
+
+ # Temporary files
+ *.tmp
+ *.temp
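As an illustrative aside (not part of the commit): entries such as `*.py[cod]` above use shell-style character classes, and Python's `fnmatch` applies the same glob rules, which makes the pattern easy to sanity-check:

```python
from fnmatch import fnmatch

# *.py[cod] matches exactly one trailing character from the class {c, o, d},
# covering .pyc, .pyo, and .pyd build artifacts but not plain .py sources.
names = ["module.pyc", "module.pyo", "module.pyd", "module.py"]
ignored = [n for n in names if fnmatch(n, "*.py[cod]")]
print(ignored)  # ['module.pyc', 'module.pyo', 'module.pyd']
```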
Dockerfile ADDED
@@ -0,0 +1,61 @@
+ # Use NVIDIA CUDA base image, suitable for the A10G
+ FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
+
+ # Set environment variables
+ ENV DEBIAN_FRONTEND=noninteractive
+ ENV PYTHONUNBUFFERED=1
+ ENV CUDA_VISIBLE_DEVICES=0
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3.9 \
+     python3.9-dev \
+     python3-pip \
+     git \
+     wget \
+     curl \
+     build-essential \
+     cmake \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set Python 3.9 as the default interpreter
+ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1
+ RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
+
+ # Upgrade pip
+ RUN python -m pip install --upgrade pip
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first for better layer caching
+ COPY requirements.txt .
+
+ # Install PyTorch with CUDA 11.8 support
+ RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+
+ # Install llama-cpp-python built with CUDA (cuBLAS) support
+ ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on"
+ ENV FORCE_CMAKE=1
+ RUN pip install llama-cpp-python --force-reinstall --no-cache-dir
+
+ # Install the remaining requirements
+ RUN pip install -r requirements.txt
+
+ # Copy application code
+ COPY app.py .
+ COPY README.md .
+
+ # Create cache directory for Hugging Face downloads
+ RUN mkdir -p /app/.cache
+ ENV HF_HOME=/app/.cache
+
+ # Expose the API port
+ EXPOSE 7860
+
+ # Health check against the /health endpoint
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Run the application
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,10 +1,185 @@
  ---
- title: Question Generation Api
- emoji: ⚡
- colorFrom: yellow
- colorTo: green
+ title: Question Generation API
+ emoji: 🤔
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
  pinned: false
+ license: apache-2.0
+ app_port: 7860
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Question Generation API
+
+ This Hugging Face Space provides an API for generating thoughtful questions from input statements using the **DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF** model.
+
+ ## Features
+
+ - 🧠 **Deep Reasoning**: Uses enhanced reasoning capabilities for better question quality
+ - 📚 **1M Context**: The model supports inputs of up to 1 million tokens (this deployment runs with a 32K window)
+ - 🎯 **Customizable**: Adjust the number of questions, difficulty level, and generation parameters
+ - 🚀 **FastAPI**: RESTful API with automatic interactive documentation
+ - 🔧 **GPU Optimized**: Tuned for NVIDIA A10G hardware
+
+ ## API Endpoints
+
+ ### Generate Questions
+ **POST** `/generate-questions`
+
+ Generate questions from a given statement.
+
+ **Request Body:**
+ ```json
+ {
+   "statement": "Your input statement here",
+   "num_questions": 5,
+   "temperature": 0.8,
+   "max_length": 2048,
+   "difficulty_level": "mixed"
+ }
+ ```
+
+ **Parameters:**
+ - `statement` (required): The input text to generate questions from
+ - `num_questions` (1-10): Number of questions to generate (default: 5)
+ - `temperature` (0.1-2.0): Generation creativity (default: 0.8)
+ - `max_length` (100-4096): Maximum response length in tokens (default: 2048)
+ - `difficulty_level`: "easy", "medium", "hard", or "mixed" (default: "mixed")
+
+ **Response:**
+ ```json
+ {
+   "questions": [
+     "What is the main concept discussed?",
+     "How does this relate to...?",
+     "Why is this important?"
+   ],
+   "statement": "Your original statement",
+   "metadata": {
+     "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+     "temperature": 0.8,
+     "difficulty_level": "mixed"
+   }
+ }
+ ```
+
+ ### Health Check
+ **GET** `/health`
+
+ Check the API and model status.
+
+ **Response:**
+ ```json
+ {
+   "status": "healthy",
+   "model_loaded": true,
+   "device": "cuda",
+   "memory_usage": {
+     "allocated_gb": 12.5,
+     "reserved_gb": 14.2,
+     "total_gb": 24.0
+   }
+ }
+ ```
+
+ ## Usage Examples
+
+ ### Python
+ ```python
+ import requests
+
+ # API endpoint
+ url = "https://your-space-name.hf.space/generate-questions"
+
+ # Request payload
+ data = {
+     "statement": "Artificial intelligence is transforming healthcare by enabling more accurate diagnoses, personalized treatments, and efficient drug discovery processes.",
+     "num_questions": 3,
+     "difficulty_level": "medium"
+ }
+
+ # Make request
+ response = requests.post(url, json=data)
+ questions = response.json()["questions"]
+
+ for i, question in enumerate(questions, 1):
+     print(f"{i}. {question}")
+ ```
+
+ ### JavaScript
+ ```javascript
+ const generateQuestions = async (statement) => {
+   const response = await fetch('https://your-space-name.hf.space/generate-questions', {
+     method: 'POST',
+     headers: {
+       'Content-Type': 'application/json',
+     },
+     body: JSON.stringify({
+       statement: statement,
+       num_questions: 5,
+       difficulty_level: 'mixed'
+     })
+   });
+
+   const data = await response.json();
+   return data.questions;
+ };
+ ```
+
+ ### cURL
+ ```bash
+ curl -X POST "https://your-space-name.hf.space/generate-questions" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "statement": "Climate change is one of the most pressing challenges of our time.",
+     "num_questions": 4,
+     "difficulty_level": "hard"
+   }'
+ ```
+
+ ## Model Information
+
+ This API uses the **DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF** model, which features:
+
+ - **Enhanced Reasoning**: Built on DeepHermes reasoning capabilities
+ - **Large Context**: Supports up to 1 million tokens of context
+ - **Optimized Format**: GGUF quantization for efficient inference
+ - **Thinking Process**: Uses `<think>` tags for internal reasoning
+
+ ## Hardware Requirements
+
+ - **GPU**: NVIDIA A10G (24GB VRAM)
+ - **Memory**: ~14-16GB VRAM usage
+ - **Context**: Up to 32K tokens (adjustable based on available memory)
+
+ ## API Documentation
+
+ Visit `/docs` for interactive API documentation with Swagger UI.
+
+ ## Error Handling
+
+ The API returns appropriate HTTP status codes:
+ - `200`: Success
+ - `400`: Bad Request (invalid parameters)
+ - `503`: Service Unavailable (model not loaded)
+ - `500`: Internal Server Error
+
+ ## Rate Limits
+
+ This is a demo space. For production use, consider:
+ - Implementing rate limiting
+ - Adding authentication
+ - Scaling to multiple instances
+ - Using dedicated inference endpoints
+
+ ## Support
+
+ For issues or questions:
+ 1. Check the `/health` endpoint
+ 2. Review the error messages
+ 3. Ensure your requests match the API schema
+ 4. Consider adjusting parameters for your hardware
+
+ ---
+
+ **Note**: This Space requires a GPU runtime to function properly. Make sure your Space is configured with GPU support.
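The status codes documented in the README's Error Handling section imply a simple client-side retry policy: 503 (model still loading) and 500 (transient fault) are worth retrying, while 400/422 indicate a malformed request that will fail again identically. The sketch below is illustrative only and not part of the commit; the code assumes these semantics and invents the helper names:

```python
RETRYABLE = {500, 503}  # 503: model still loading; 500: transient server fault

def should_retry(status_code: int) -> bool:
    # 400/422 mean the request itself is invalid: resending the same
    # payload cannot succeed, so only server-side failures are retried.
    return status_code in RETRYABLE

def backoff_schedule(attempts: int, base: float = 2.0) -> list:
    # Exponential backoff delays in seconds: base, 2*base, 4*base, ...
    return [base * (2 ** i) for i in range(attempts)]

print(should_retry(503), should_retry(400))  # True False
print(backoff_schedule(3))  # [2.0, 4.0, 8.0]
```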
app.py ADDED
@@ -0,0 +1,309 @@
+ import logging
+ import re
+ from typing import List, Dict, Any
+ from contextlib import asynccontextmanager
+
+ import torch
+ import uvicorn
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel, Field
+ import gc
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Global variables for model and tokenizer
+ model = None
+ tokenizer = None
+ device = None
+
+ class QuestionGenerationRequest(BaseModel):
+     statement: str = Field(..., description="The input statement to generate questions from")
+     num_questions: int = Field(default=5, ge=1, le=10, description="Number of questions to generate (1-10)")
+     temperature: float = Field(default=0.8, ge=0.1, le=2.0, description="Temperature for generation (0.1-2.0)")
+     max_length: int = Field(default=2048, ge=100, le=4096, description="Maximum length of generated text")
+     difficulty_level: str = Field(default="mixed", description="Difficulty level: easy, medium, hard, or mixed")
+
+ class QuestionGenerationResponse(BaseModel):
+     questions: List[str]
+     statement: str
+     metadata: Dict[str, Any]
+
+ class HealthResponse(BaseModel):
+     status: str
+     model_loaded: bool
+     device: str
+     memory_usage: Dict[str, float]
+
+ async def load_model():
+     """Load the model"""
+     global model, tokenizer, device
+
+     try:
+         logger.info("Starting model loading...")
+
+         # Check if CUDA is available
+         device = "cuda" if torch.cuda.is_available() else "cpu"
+         logger.info(f"Using device: {device}")
+
+         if device == "cuda":
+             logger.info(f"GPU: {torch.cuda.get_device_name()}")
+             logger.info(f"VRAM available: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
+
+         model_name = "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF"
+         model_file = "Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-Q4_K_M.gguf"
+
+         # Use llama-cpp-python for GGUF files
+         try:
+             from llama_cpp import Llama
+             from huggingface_hub import hf_hub_download
+
+             # Llama() expects a local file path, so fetch the GGUF file from the Hub first
+             logger.info("Downloading model file from the Hugging Face Hub...")
+             model_path = hf_hub_download(repo_id=model_name, filename=model_file)
+
+             logger.info("Loading model with llama-cpp-python...")
+             model = Llama(
+                 model_path=model_path,
+                 n_ctx=32768,  # Context length - balance your needs against VRAM
+                 n_gpu_layers=-1 if device == "cuda" else 0,  # Offload all layers when CUDA is available
+                 verbose=False,
+                 n_threads=4,
+                 n_batch=512,
+                 use_mlock=True,
+                 use_mmap=True,
+             )
+
+             # llama-cpp-python bundles its own tokenizer, so none is loaded separately
+             tokenizer = None
+             logger.info("Model loaded successfully with llama-cpp-python!")
+
+         except ImportError:
+             logger.error("llama-cpp-python not installed. Please install it for GGUF support.")
+             raise
+
+     except Exception as e:
+         logger.error(f"Error loading model: {str(e)}")
+         raise
+
+ async def unload_model():
+     """Clean up model from memory"""
+     global model, tokenizer
+
+     try:
+         if model is not None:
+             del model
+         if tokenizer is not None:
+             del tokenizer
+
+         # Clear CUDA cache if available
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+         # Force garbage collection
+         gc.collect()
+
+         logger.info("Model unloaded successfully")
+
+     except Exception as e:
+         logger.error(f"Error unloading model: {str(e)}")
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Manage application lifespan"""
+     # Startup
+     logger.info("Starting up...")
+     await load_model()
+     yield
+     # Shutdown
+     logger.info("Shutting down...")
+     await unload_model()
+
+ # Create FastAPI app
+ app = FastAPI(
+     title="Question Generation API",
+     description="API for generating questions from statements using the DeepHermes reasoning model",
+     version="1.0.0",
+     lifespan=lifespan
+ )
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ def create_question_prompt(statement: str, num_questions: int, difficulty_level: str) -> str:
+     """Create a prompt for question generation with reasoning"""
+
+     difficulty_instruction = {
+         "easy": "Generate simple, straightforward questions that test basic understanding.",
+         "medium": "Generate questions that require some analysis and comprehension.",
+         "hard": "Generate complex questions that require deep thinking and reasoning.",
+         "mixed": "Generate a mix of easy, medium, and hard questions."
+     }
+
+     system_prompt = """You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
+
+ You are an expert educator and question generator. Your task is to create thoughtful, well-crafted questions from given statements."""
+
+     user_prompt = f"""<think>
+ I need to analyze this statement and generate {num_questions} high-quality questions. Let me think about:
+ 1. The key concepts and information in the statement
+ 2. Different types of questions I can ask (factual, analytical, inferential, evaluative)
+ 3. The difficulty level requested: {difficulty_level}
+ 4. How to make questions that promote understanding and critical thinking
+ </think>
+
+ Based on the following statement, generate exactly {num_questions} questions.
+
+ Statement: "{statement}"
+
+ Requirements:
+ - {difficulty_instruction[difficulty_level]}
+ - Questions should be clear, well-formed, and grammatically correct
+ - Vary the question types (what, how, why, when, where, etc.)
+ - Each question should test a different aspect of the statement
+ - Make questions engaging and thought-provoking
+ - Number each question (1., 2., 3., etc.)
+
+ Generate the questions now:"""
+
+     return f"{system_prompt}\n\n{user_prompt}"
+
+ def extract_questions(generated_text: str) -> List[str]:
+     """Extract questions from the generated text"""
+     questions = []
+     lines = generated_text.split('\n')
+
+     for line in lines:
+         line = line.strip()
+         # Look for numbered questions
+         if line and (line[0].isdigit() or line.startswith('Q')):
+             # Strip "1.", "Q1:", or "Question 1:" style prefixes
+             question = re.sub(r'^(?:\d+\.|Q\d+:|Question \d+:)\s*', '', line)
+
+             if question and question.endswith('?'):
+                 questions.append(question)
+
+     # If no numbered questions were found, fall back to any question-like line
+     if not questions:
+         for line in lines:
+             line = line.strip()
+             if line.endswith('?') and len(line) > 10:
+                 questions.append(line)
+
+     return questions
+
+ @app.get("/health", response_model=HealthResponse)
+ async def health_check():
+     """Health check endpoint"""
+     global model
+
+     memory_usage = {}
+     if torch.cuda.is_available():
+         memory_usage = {
+             "allocated_gb": torch.cuda.memory_allocated() / 1024**3,
+             "reserved_gb": torch.cuda.memory_reserved() / 1024**3,
+             "total_gb": torch.cuda.get_device_properties(0).total_memory / 1024**3
+         }
+
+     return HealthResponse(
+         status="healthy" if model is not None else "unhealthy",
+         model_loaded=model is not None,
+         device=device if device else "unknown",
+         memory_usage=memory_usage
+     )
+
+ @app.post("/generate-questions", response_model=QuestionGenerationResponse)
+ async def generate_questions(request: QuestionGenerationRequest):
+     """Generate questions from a statement"""
+     global model
+
+     if model is None:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+
+     try:
+         logger.info(f"Generating {request.num_questions} questions for statement: {request.statement[:100]}...")
+
+         # Create prompt
+         prompt = create_question_prompt(
+             request.statement,
+             request.num_questions,
+             request.difficulty_level
+         )
+
+         # Generate response using llama-cpp-python
+         response = model(
+             prompt,
+             max_tokens=request.max_length,
+             temperature=request.temperature,
+             top_p=0.95,
+             top_k=40,
+             repeat_penalty=1.1,
+             stop=["<|im_end|>", "</think>"],
+             echo=False
+         )
+
+         generated_text = response['choices'][0]['text']
+         logger.info(f"Generated text length: {len(generated_text)}")
+
+         # Extract questions from the generated text
+         questions = extract_questions(generated_text)
+
+         # Warn when fewer questions than requested were extracted
+         if len(questions) < request.num_questions:
+             logger.warning(f"Only extracted {len(questions)} questions, requested {request.num_questions}")
+
+         # Limit to the requested number
+         questions = questions[:request.num_questions]
+
+         # If we still don't have enough questions, add a fallback
+         while len(questions) < request.num_questions:
+             questions.append(f"What is the main point of this statement: '{request.statement[:100]}...'?")
+
+         metadata = {
+             "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+             "temperature": request.temperature,
+             "difficulty_level": request.difficulty_level,
+             "generated_text_length": len(generated_text),
+             "questions_extracted": len(questions)
+         }
+
+         logger.info(f"Successfully generated {len(questions)} questions")
+
+         return QuestionGenerationResponse(
+             questions=questions,
+             statement=request.statement,
+             metadata=metadata
+         )
+
+     except Exception as e:
+         logger.error(f"Error generating questions: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error generating questions: {str(e)}")
+
+ @app.get("/")
+ async def root():
+     """Root endpoint with basic info"""
+     return {
+         "message": "Question Generation API",
+         "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+         "endpoints": {
+             "health": "/health",
+             "generate": "/generate-questions",
+             "docs": "/docs"
+         }
+     }
+
+ if __name__ == "__main__":
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=7860,
+         reload=False
+     )
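The question-extraction heuristic in `app.py` (numbered-prefix stripping with a fallback to any question-like line) can be exercised standalone. The sketch below mirrors that logic outside the app; the sample model output is invented for illustration:

```python
import re
from typing import List

def extract_questions(generated_text: str) -> List[str]:
    """Pull numbered questions out of raw model output (mirrors app.py's heuristic)."""
    questions = []
    lines = generated_text.split("\n")

    for line in lines:
        line = line.strip()
        # Candidate lines start with a number or a "Q"/"Question" prefix
        if line and (line[0].isdigit() or line.startswith("Q")):
            # Strip "1.", "Q1:", or "Question 1:" style prefixes
            question = re.sub(r"^(?:\d+\.|Q\d+:|Question \d+:)\s*", "", line)
            if question and question.endswith("?"):
                questions.append(question)

    # Fallback: accept any sufficiently long line ending in "?"
    if not questions:
        questions = [l.strip() for l in lines
                     if l.strip().endswith("?") and len(l.strip()) > 10]

    return questions

sample = "1. What drives the greenhouse effect?\nFiller commentary, not a question.\n2. How do oceans absorb heat?"
print(extract_questions(sample))
# ['What drives the greenhouse effect?', 'How do oceans absorb heat?']
```

Note that non-question lines between numbered items are dropped, and text without any numbering still yields results via the fallback branch.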
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ fastapi==0.104.1
+ uvicorn[standard]==0.24.0
+ pydantic==2.5.0
+ torch>=2.0.0
+ transformers>=4.35.0
+ accelerate>=0.24.0
+ bitsandbytes>=0.41.0
+ llama-cpp-python>=0.2.20
+ huggingface-hub>=0.19.0
+ python-multipart==0.0.6
+ numpy>=1.24.0
+ sentencepiece>=0.1.99
+ protobuf>=3.20.0
test_api.py ADDED
@@ -0,0 +1,215 @@
+ #!/usr/bin/env python3
+ """
+ Test script for the Question Generation API
+ Run this after your Space is deployed to test the API endpoints
+ """
+
+ import requests
+ import json
+ import time
+
+ # Replace with your actual Space URL
+ BASE_URL = "https://your-space-name.hf.space"
+
+ def test_health_endpoint():
+     """Test the health check endpoint"""
+     print("🔍 Testing health endpoint...")
+
+     try:
+         response = requests.get(f"{BASE_URL}/health", timeout=30)
+         print(f"Status Code: {response.status_code}")
+
+         if response.status_code == 200:
+             data = response.json()
+             print("✅ Health check passed")
+             print(f"Model Loaded: {data['model_loaded']}")
+             print(f"Device: {data['device']}")
+             if data.get('memory_usage'):
+                 memory = data['memory_usage']
+                 print(f"VRAM Usage: {memory.get('allocated_gb', 0):.2f}GB / {memory.get('total_gb', 0):.2f}GB")
+             return True
+         else:
+             print(f"❌ Health check failed: {response.text}")
+             return False
+
+     except requests.exceptions.RequestException as e:
+         print(f"❌ Health check error: {e}")
+         return False
+
+ def test_question_generation():
+     """Test the question generation endpoint"""
+     print("\n🤔 Testing question generation...")
+
+     test_cases = [
+         {
+             "name": "Simple Statement",
+             "data": {
+                 "statement": "Artificial intelligence is transforming healthcare by enabling more accurate diagnoses, personalized treatments, and efficient drug discovery processes.",
+                 "num_questions": 3,
+                 "difficulty_level": "medium"
+             }
+         },
+         {
+             "name": "Complex Statement",
+             "data": {
+                 "statement": "Climate change represents one of the most significant challenges of the 21st century, involving complex interactions between atmospheric chemistry, ocean currents, biodiversity loss, and human economic systems. The greenhouse effect, primarily driven by carbon dioxide emissions from fossil fuel combustion, is causing global temperatures to rise at an unprecedented rate.",
+                 "num_questions": 5,
+                 "difficulty_level": "hard",
+                 "temperature": 0.9
+             }
+         },
+         {
+             "name": "Short Statement",
+             "data": {
+                 "statement": "Water boils at 100 degrees Celsius at sea level.",
+                 "num_questions": 2,
+                 "difficulty_level": "easy"
+             }
+         }
+     ]
+
+     for i, test_case in enumerate(test_cases, 1):
+         print(f"\n📝 Test Case {i}: {test_case['name']}")
+         print(f"Statement: {test_case['data']['statement'][:100]}...")
+
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json=test_case['data'],
+                 timeout=60  # Increased timeout for model inference
+             )
+
+             print(f"Status Code: {response.status_code}")
+
+             if response.status_code == 200:
+                 data = response.json()
+                 questions = data['questions']
+
+                 print(f"✅ Generated {len(questions)} questions:")
+                 for j, question in enumerate(questions, 1):
+                     print(f"  {j}. {question}")
+
+                 print(f"Metadata: {data['metadata']}")
+
+             else:
+                 print(f"❌ Generation failed: {response.text}")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Request error: {e}")
+
+ def test_error_handling():
+     """Test error handling"""
+     print("\n🚨 Testing error handling...")
+
+     # Test invalid parameters
+     invalid_tests = [
+         {
+             "name": "Missing statement",
+             "data": {"num_questions": 3}
+         },
+         {
+             "name": "Invalid num_questions",
+             "data": {
+                 "statement": "Test statement",
+                 "num_questions": 15  # Above the allowed maximum of 10
+             }
+         },
+         {
+             "name": "Invalid temperature",
+             "data": {
+                 "statement": "Test statement",
+                 "temperature": 5.0  # Above the allowed maximum of 2.0
+             }
+         }
+     ]
+
+     for test in invalid_tests:
+         print(f"\n🔍 Testing: {test['name']}")
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json=test['data'],
+                 timeout=30
+             )
+
+             if response.status_code == 422:
+                 print("✅ Correctly rejected invalid input")
+             else:
+                 print(f"⚠️ Unexpected status code: {response.status_code}")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Request error: {e}")
+
+ def benchmark_performance():
+     """Simple performance benchmark"""
+     print("\n⚡ Performance benchmark...")
+
+     statement = "Machine learning algorithms are becoming increasingly sophisticated, enabling computers to learn patterns from data without being explicitly programmed for every scenario."
+
+     times = []
+     for i in range(3):
+         print(f"Run {i+1}/3...", end=" ")
+
+         start_time = time.time()
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json={
+                     "statement": statement,
+                     "num_questions": 3,
+                     "difficulty_level": "medium"
+                 },
+                 timeout=60
+             )
+
+             end_time = time.time()
+             duration = end_time - start_time
+             times.append(duration)
+
+             if response.status_code == 200:
+                 print(f"✅ {duration:.2f}s")
+             else:
+                 print(f"❌ Failed ({response.status_code})")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Error: {e}")
+
+     if times:
+         avg_time = sum(times) / len(times)
+         print(f"\n📊 Average response time: {avg_time:.2f}s")
+         print(f"📊 Min: {min(times):.2f}s, Max: {max(times):.2f}s")
+
+ def main():
+     """Run all tests"""
+     print("🚀 Starting API tests")
+     print(f"Base URL: {BASE_URL}")
+     print("=" * 50)
+
+     # Test health first
+     if not test_health_endpoint():
+         print("\n❌ Health check failed. Make sure your Space is running and accessible.")
+         return
+
+     # Wait a moment for the model to be ready
+     print("\n⏳ Waiting for model to be ready...")
+     time.sleep(5)
+
+     # Run tests
+     test_question_generation()
+     test_error_handling()
+     benchmark_performance()
+
+     print("\n" + "=" * 50)
+     print("✅ All tests completed!")
+     print("\n💡 Usage example:")
+     print(f"curl -X POST '{BASE_URL}/generate-questions' \\")
+     print("  -H 'Content-Type: application/json' \\")
+     print("  -d '{\"statement\": \"Your statement here\", \"num_questions\": 3}'")
+
+ if __name__ == "__main__":
+     # Update this with your actual Space URL before running
+     if "your-space-name" in BASE_URL:
+         print("⚠️ Please update BASE_URL with your actual Space URL before running tests!")
+         print("Example: BASE_URL = 'https://username-question-generation-api.hf.space'")
+     else:
+         main()