david167 committed on
Commit 0bf99b7 · 1 Parent(s): 50ec035

Initial setup: Question Generation API with DeepHermes reasoning model

Files changed (6)
  1. .gitignore +60 -0
  2. Dockerfile +61 -0
  3. README.md +180 -5
  4. app.py +310 -0
  5. requirements.txt +13 -0
  6. test_api.py +215 -0
.gitignore ADDED
@@ -0,0 +1,60 @@
+ # Python
+ __pycache__/
+ *.py[cod]
+ *$py.class
+ *.so
+ .Python
+ build/
+ develop-eggs/
+ dist/
+ downloads/
+ eggs/
+ .eggs/
+ lib/
+ lib64/
+ parts/
+ sdist/
+ var/
+ wheels/
+ *.egg-info/
+ .installed.cfg
+ *.egg
+ MANIFEST
+
+ # Virtual environments
+ .env
+ .venv
+ env/
+ venv/
+ ENV/
+ env.bak/
+ venv.bak/
+
+ # IDE
+ .vscode/
+ .idea/
+ *.swp
+ *.swo
+
+ # OS
+ .DS_Store
+ .DS_Store?
+ ._*
+ .Spotlight-V100
+ .Trashes
+ ehthumbs.db
+ Thumbs.db
+
+ # Model cache
+ .cache/
+ *.bin
+ *.safetensors
+ *.gguf
+
+ # Logs
+ *.log
+ logs/
+
+ # Temporary files
+ *.tmp
+ *.temp
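As an illustrative aside (not part of the commit): entries such as `*.py[cod]` above use shell-style character classes, and Python's `fnmatch` applies the same glob rules, which makes the pattern easy to sanity-check:

```python
from fnmatch import fnmatch

# *.py[cod] matches exactly one trailing character from the class {c, o, d},
# covering .pyc, .pyo, and .pyd build artifacts but not plain .py sources.
names = ["module.pyc", "module.pyo", "module.pyd", "module.py"]
ignored = [n for n in names if fnmatch(n, "*.py[cod]")]
print(ignored)  # ['module.pyc', 'module.pyo', 'module.pyd']
```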
Dockerfile ADDED
@@ -0,0 +1,61 @@
+ # Use NVIDIA CUDA base image, suitable for the A10G
+ FROM nvidia/cuda:11.8.0-devel-ubuntu20.04
+
+ # Set environment variables
+ ENV DEBIAN_FRONTEND=noninteractive
+ ENV PYTHONUNBUFFERED=1
+ ENV CUDA_VISIBLE_DEVICES=0
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y \
+     python3.9 \
+     python3.9-dev \
+     python3-pip \
+     git \
+     wget \
+     curl \
+     build-essential \
+     cmake \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Set Python 3.9 as the default interpreter
+ RUN update-alternatives --install /usr/bin/python python /usr/bin/python3.9 1
+ RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1
+
+ # Upgrade pip
+ RUN python -m pip install --upgrade pip
+
+ # Set working directory
+ WORKDIR /app
+
+ # Copy requirements first for better layer caching
+ COPY requirements.txt .
+
+ # Install PyTorch with CUDA 11.8 support
+ RUN pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
+
+ # Install llama-cpp-python built with CUDA (cuBLAS) support
+ ENV CMAKE_ARGS="-DLLAMA_CUBLAS=on"
+ ENV FORCE_CMAKE=1
+ RUN pip install llama-cpp-python --force-reinstall --no-cache-dir
+
+ # Install the remaining requirements
+ RUN pip install -r requirements.txt
+
+ # Copy application code
+ COPY app.py .
+ COPY README.md .
+
+ # Create cache directory for Hugging Face downloads
+ RUN mkdir -p /app/.cache
+ ENV HF_HOME=/app/.cache
+
+ # Expose the API port
+ EXPOSE 7860
+
+ # Health check against the /health endpoint
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Run the application
+ CMD ["python", "app.py"]
README.md CHANGED
@@ -1,10 +1,185 @@
  ---
- title: Question Generation Api
- emoji: ⚡
- colorFrom: yellow
- colorTo: green
+ title: Question Generation API
+ emoji: 🤔
+ colorFrom: blue
+ colorTo: purple
  sdk: docker
  pinned: false
+ license: apache-2.0
+ app_port: 7860
  ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Question Generation API
+
+ This Hugging Face Space provides an API for generating thoughtful questions from input statements using the **DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF** model.
+
+ ## Features
+
+ - 🧠 **Deep Reasoning**: Uses enhanced reasoning capabilities for better question quality
+ - 📚 **1M Context**: The model supports inputs of up to 1 million tokens (this deployment runs with a 32K window)
+ - 🎯 **Customizable**: Adjust the number of questions, difficulty level, and generation parameters
+ - 🚀 **FastAPI**: RESTful API with automatic interactive documentation
+ - 🔧 **GPU Optimized**: Tuned for NVIDIA A10G hardware
+
+ ## API Endpoints
+
+ ### Generate Questions
+ **POST** `/generate-questions`
+
+ Generate questions from a given statement.
+
+ **Request Body:**
+ ```json
+ {
+   "statement": "Your input statement here",
+   "num_questions": 5,
+   "temperature": 0.8,
+   "max_length": 2048,
+   "difficulty_level": "mixed"
+ }
+ ```
+
+ **Parameters:**
+ - `statement` (required): The input text to generate questions from
+ - `num_questions` (1-10): Number of questions to generate (default: 5)
+ - `temperature` (0.1-2.0): Generation creativity (default: 0.8)
+ - `max_length` (100-4096): Maximum response length in tokens (default: 2048)
+ - `difficulty_level`: "easy", "medium", "hard", or "mixed" (default: "mixed")
+
+ **Response:**
+ ```json
+ {
+   "questions": [
+     "What is the main concept discussed?",
+     "How does this relate to...?",
+     "Why is this important?"
+   ],
+   "statement": "Your original statement",
+   "metadata": {
+     "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+     "temperature": 0.8,
+     "difficulty_level": "mixed"
+   }
+ }
+ ```
+
+ ### Health Check
+ **GET** `/health`
+
+ Check the API and model status.
+
+ **Response:**
+ ```json
+ {
+   "status": "healthy",
+   "model_loaded": true,
+   "device": "cuda",
+   "memory_usage": {
+     "allocated_gb": 12.5,
+     "reserved_gb": 14.2,
+     "total_gb": 24.0
+   }
+ }
+ ```
+
+ ## Usage Examples
+
+ ### Python
+ ```python
+ import requests
+
+ # API endpoint
+ url = "https://your-space-name.hf.space/generate-questions"
+
+ # Request payload
+ data = {
+     "statement": "Artificial intelligence is transforming healthcare by enabling more accurate diagnoses, personalized treatments, and efficient drug discovery processes.",
+     "num_questions": 3,
+     "difficulty_level": "medium"
+ }
+
+ # Make request
+ response = requests.post(url, json=data)
+ questions = response.json()["questions"]
+
+ for i, question in enumerate(questions, 1):
+     print(f"{i}. {question}")
+ ```
+
+ ### JavaScript
+ ```javascript
+ const generateQuestions = async (statement) => {
+   const response = await fetch('https://your-space-name.hf.space/generate-questions', {
+     method: 'POST',
+     headers: {
+       'Content-Type': 'application/json',
+     },
+     body: JSON.stringify({
+       statement: statement,
+       num_questions: 5,
+       difficulty_level: 'mixed'
+     })
+   });
+
+   const data = await response.json();
+   return data.questions;
+ };
+ ```
+
+ ### cURL
+ ```bash
+ curl -X POST "https://your-space-name.hf.space/generate-questions" \
+   -H "Content-Type: application/json" \
+   -d '{
+     "statement": "Climate change is one of the most pressing challenges of our time.",
+     "num_questions": 4,
+     "difficulty_level": "hard"
+   }'
+ ```
+
+ ## Model Information
+
+ This API uses the **DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF** model, which features:
+
+ - **Enhanced Reasoning**: Built on DeepHermes reasoning capabilities
+ - **Large Context**: Supports up to 1 million tokens of context
+ - **Optimized Format**: GGUF quantization for efficient inference
+ - **Thinking Process**: Uses `<think>` tags for internal reasoning
+
+ ## Hardware Requirements
+
+ - **GPU**: NVIDIA A10G (24GB VRAM)
+ - **Memory**: ~14-16GB VRAM usage
+ - **Context**: Up to 32K tokens (adjustable based on available memory)
+
+ ## API Documentation
+
+ Visit `/docs` for interactive API documentation with Swagger UI.
+
+ ## Error Handling
+
+ The API returns appropriate HTTP status codes:
+ - `200`: Success
+ - `400`: Bad Request (invalid parameters)
+ - `503`: Service Unavailable (model not loaded)
+ - `500`: Internal Server Error
+
+ ## Rate Limits
+
+ This is a demo space. For production use, consider:
+ - Implementing rate limiting
+ - Adding authentication
+ - Scaling to multiple instances
+ - Using dedicated inference endpoints
+
+ ## Support
+
+ For issues or questions:
+ 1. Check the `/health` endpoint
+ 2. Review the error messages
+ 3. Ensure your requests match the API schema
+ 4. Consider adjusting parameters for your hardware
+
+ ---
+
+ **Note**: This Space requires a GPU runtime to function properly. Make sure your Space is configured with GPU support.
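The status codes documented in the README's Error Handling section imply a simple client-side retry policy: 503 (model still loading) and 500 (transient fault) are worth retrying, while 400/422 indicate a malformed request that will fail again identically. The sketch below is illustrative only and not part of the commit; the code assumes these semantics and invents the helper names:

```python
RETRYABLE = {500, 503}  # 503: model still loading; 500: transient server fault

def should_retry(status_code: int) -> bool:
    # 400/422 mean the request itself is invalid: resending the same
    # payload cannot succeed, so only server-side failures are retried.
    return status_code in RETRYABLE

def backoff_schedule(attempts: int, base: float = 2.0) -> list:
    # Exponential backoff delays in seconds: base, 2*base, 4*base, ...
    return [base * (2 ** i) for i in range(attempts)]

print(should_retry(503), should_retry(400))  # True False
print(backoff_schedule(3))  # [2.0, 4.0, 8.0]
```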
app.py ADDED
@@ -0,0 +1,309 @@
+ import logging
+ import re
+ from typing import List, Dict, Any
+ from contextlib import asynccontextmanager
+
+ import torch
+ import uvicorn
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel, Field
+ import gc
+
+ # Configure logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ # Global variables for model and tokenizer
+ model = None
+ tokenizer = None
+ device = None
+
+ class QuestionGenerationRequest(BaseModel):
+     statement: str = Field(..., description="The input statement to generate questions from")
+     num_questions: int = Field(default=5, ge=1, le=10, description="Number of questions to generate (1-10)")
+     temperature: float = Field(default=0.8, ge=0.1, le=2.0, description="Temperature for generation (0.1-2.0)")
+     max_length: int = Field(default=2048, ge=100, le=4096, description="Maximum length of generated text")
+     difficulty_level: str = Field(default="mixed", description="Difficulty level: easy, medium, hard, or mixed")
+
+ class QuestionGenerationResponse(BaseModel):
+     questions: List[str]
+     statement: str
+     metadata: Dict[str, Any]
+
+ class HealthResponse(BaseModel):
+     status: str
+     model_loaded: bool
+     device: str
+     memory_usage: Dict[str, float]
+
+ async def load_model():
+     """Load the model"""
+     global model, tokenizer, device
+
+     try:
+         logger.info("Starting model loading...")
+
+         # Check if CUDA is available
+         device = "cuda" if torch.cuda.is_available() else "cpu"
+         logger.info(f"Using device: {device}")
+
+         if device == "cuda":
+             logger.info(f"GPU: {torch.cuda.get_device_name()}")
+             logger.info(f"VRAM available: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
+
+         model_name = "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF"
+         model_file = "Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-Q4_K_M.gguf"
+
+         # Use llama-cpp-python for GGUF files
+         try:
+             from llama_cpp import Llama
+             from huggingface_hub import hf_hub_download
+
+             # Llama() expects a local file path, so fetch the GGUF file from the Hub first
+             logger.info("Downloading model file from the Hugging Face Hub...")
+             model_path = hf_hub_download(repo_id=model_name, filename=model_file)
+
+             logger.info("Loading model with llama-cpp-python...")
+             model = Llama(
+                 model_path=model_path,
+                 n_ctx=32768,  # Context length - balance your needs against VRAM
+                 n_gpu_layers=-1 if device == "cuda" else 0,  # Offload all layers when CUDA is available
+                 verbose=False,
+                 n_threads=4,
+                 n_batch=512,
+                 use_mlock=True,
+                 use_mmap=True,
+             )
+
+             # llama-cpp-python bundles its own tokenizer, so none is loaded separately
+             tokenizer = None
+             logger.info("Model loaded successfully with llama-cpp-python!")
+
+         except ImportError:
+             logger.error("llama-cpp-python not installed. Please install it for GGUF support.")
+             raise
+
+     except Exception as e:
+         logger.error(f"Error loading model: {str(e)}")
+         raise
+
+ async def unload_model():
+     """Clean up model from memory"""
+     global model, tokenizer
+
+     try:
+         if model is not None:
+             del model
+         if tokenizer is not None:
+             del tokenizer
+
+         # Clear CUDA cache if available
+         if torch.cuda.is_available():
+             torch.cuda.empty_cache()
+
+         # Force garbage collection
+         gc.collect()
+
+         logger.info("Model unloaded successfully")
+
+     except Exception as e:
+         logger.error(f"Error unloading model: {str(e)}")
+
+ @asynccontextmanager
+ async def lifespan(app: FastAPI):
+     """Manage application lifespan"""
+     # Startup
+     logger.info("Starting up...")
+     await load_model()
+     yield
+     # Shutdown
+     logger.info("Shutting down...")
+     await unload_model()
+
+ # Create FastAPI app
+ app = FastAPI(
+     title="Question Generation API",
+     description="API for generating questions from statements using the DeepHermes reasoning model",
+     version="1.0.0",
+     lifespan=lifespan
+ )
+
+ # Add CORS middleware
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_credentials=True,
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ def create_question_prompt(statement: str, num_questions: int, difficulty_level: str) -> str:
+     """Create a prompt for question generation with reasoning"""
+
+     difficulty_instruction = {
+         "easy": "Generate simple, straightforward questions that test basic understanding.",
+         "medium": "Generate questions that require some analysis and comprehension.",
+         "hard": "Generate complex questions that require deep thinking and reasoning.",
+         "mixed": "Generate a mix of easy, medium, and hard questions."
+     }
+
+     system_prompt = """You are a deep thinking AI, you may use extremely long chains of thought to deeply consider the problem and deliberate with yourself via systematic reasoning processes to help come to a correct solution prior to answering. You should enclose your thoughts and internal monologue inside <think> </think> tags, and then provide your solution or response to the problem.
+
+ You are an expert educator and question generator. Your task is to create thoughtful, well-crafted questions from given statements."""
+
+     user_prompt = f"""<think>
+ I need to analyze this statement and generate {num_questions} high-quality questions. Let me think about:
+ 1. The key concepts and information in the statement
+ 2. Different types of questions I can ask (factual, analytical, inferential, evaluative)
+ 3. The difficulty level requested: {difficulty_level}
+ 4. How to make questions that promote understanding and critical thinking
+ </think>
+
+ Based on the following statement, generate exactly {num_questions} questions.
+
+ Statement: "{statement}"
+
+ Requirements:
+ - {difficulty_instruction[difficulty_level]}
+ - Questions should be clear, well-formed, and grammatically correct
+ - Vary the question types (what, how, why, when, where, etc.)
+ - Each question should test a different aspect of the statement
+ - Make questions engaging and thought-provoking
+ - Number each question (1., 2., 3., etc.)
+
+ Generate the questions now:"""
+
+     return f"{system_prompt}\n\n{user_prompt}"
+
+ def extract_questions(generated_text: str) -> List[str]:
+     """Extract questions from the generated text"""
+     questions = []
+     lines = generated_text.split('\n')
+
+     for line in lines:
+         line = line.strip()
+         # Look for numbered questions
+         if line and (line[0].isdigit() or line.startswith('Q')):
+             # Strip "1.", "Q1:", or "Question 1:" style prefixes
+             question = re.sub(r'^(?:\d+\.|Q\d+:|Question \d+:)\s*', '', line)
+
+             if question and question.endswith('?'):
+                 questions.append(question)
+
+     # If no numbered questions were found, fall back to any question-like line
+     if not questions:
+         for line in lines:
+             line = line.strip()
+             if line.endswith('?') and len(line) > 10:
+                 questions.append(line)
+
+     return questions
+
+ @app.get("/health", response_model=HealthResponse)
+ async def health_check():
+     """Health check endpoint"""
+     global model
+
+     memory_usage = {}
+     if torch.cuda.is_available():
+         memory_usage = {
+             "allocated_gb": torch.cuda.memory_allocated() / 1024**3,
+             "reserved_gb": torch.cuda.memory_reserved() / 1024**3,
+             "total_gb": torch.cuda.get_device_properties(0).total_memory / 1024**3
+         }
+
+     return HealthResponse(
+         status="healthy" if model is not None else "unhealthy",
+         model_loaded=model is not None,
+         device=device if device else "unknown",
+         memory_usage=memory_usage
+     )
+
+ @app.post("/generate-questions", response_model=QuestionGenerationResponse)
+ async def generate_questions(request: QuestionGenerationRequest):
+     """Generate questions from a statement"""
+     global model
+
+     if model is None:
+         raise HTTPException(status_code=503, detail="Model not loaded")
+
+     try:
+         logger.info(f"Generating {request.num_questions} questions for statement: {request.statement[:100]}...")
+
+         # Create prompt
+         prompt = create_question_prompt(
+             request.statement,
+             request.num_questions,
+             request.difficulty_level
+         )
+
+         # Generate response using llama-cpp-python
+         response = model(
+             prompt,
+             max_tokens=request.max_length,
+             temperature=request.temperature,
+             top_p=0.95,
+             top_k=40,
+             repeat_penalty=1.1,
+             stop=["<|im_end|>", "</think>"],
+             echo=False
+         )
+
+         generated_text = response['choices'][0]['text']
+         logger.info(f"Generated text length: {len(generated_text)}")
+
+         # Extract questions from the generated text
+         questions = extract_questions(generated_text)
+
+         # Warn when fewer questions than requested were extracted
+         if len(questions) < request.num_questions:
+             logger.warning(f"Only extracted {len(questions)} questions, requested {request.num_questions}")
+
+         # Limit to the requested number
+         questions = questions[:request.num_questions]
+
+         # If we still don't have enough questions, add a fallback
+         while len(questions) < request.num_questions:
+             questions.append(f"What is the main point of this statement: '{request.statement[:100]}...'?")
+
+         metadata = {
+             "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+             "temperature": request.temperature,
+             "difficulty_level": request.difficulty_level,
+             "generated_text_length": len(generated_text),
+             "questions_extracted": len(questions)
+         }
+
+         logger.info(f"Successfully generated {len(questions)} questions")
+
+         return QuestionGenerationResponse(
+             questions=questions,
+             statement=request.statement,
+             metadata=metadata
+         )
+
+     except Exception as e:
+         logger.error(f"Error generating questions: {str(e)}")
+         raise HTTPException(status_code=500, detail=f"Error generating questions: {str(e)}")
+
+ @app.get("/")
+ async def root():
+     """Root endpoint with basic info"""
+     return {
+         "message": "Question Generation API",
+         "model": "DavidAU/Llama-3.1-1-million-ctx-DeepHermes-Deep-Reasoning-8B-GGUF",
+         "endpoints": {
+             "health": "/health",
+             "generate": "/generate-questions",
+             "docs": "/docs"
+         }
+     }
+
+ if __name__ == "__main__":
+     uvicorn.run(
+         "app:app",
+         host="0.0.0.0",
+         port=7860,
+         reload=False
+     )
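The question-extraction heuristic in `app.py` (numbered-prefix stripping with a fallback to any question-like line) can be exercised standalone. The sketch below mirrors that logic outside the app; the sample model output is invented for illustration:

```python
import re
from typing import List

def extract_questions(generated_text: str) -> List[str]:
    """Pull numbered questions out of raw model output (mirrors app.py's heuristic)."""
    questions = []
    lines = generated_text.split("\n")

    for line in lines:
        line = line.strip()
        # Candidate lines start with a number or a "Q"/"Question" prefix
        if line and (line[0].isdigit() or line.startswith("Q")):
            # Strip "1.", "Q1:", or "Question 1:" style prefixes
            question = re.sub(r"^(?:\d+\.|Q\d+:|Question \d+:)\s*", "", line)
            if question and question.endswith("?"):
                questions.append(question)

    # Fallback: accept any sufficiently long line ending in "?"
    if not questions:
        questions = [l.strip() for l in lines
                     if l.strip().endswith("?") and len(l.strip()) > 10]

    return questions

sample = "1. What drives the greenhouse effect?\nFiller commentary, not a question.\n2. How do oceans absorb heat?"
print(extract_questions(sample))
# ['What drives the greenhouse effect?', 'How do oceans absorb heat?']
```

Note that non-question lines between numbered items are dropped, and text without any numbering still yields results via the fallback branch.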
requirements.txt ADDED
@@ -0,0 +1,13 @@
+ fastapi==0.104.1
+ uvicorn[standard]==0.24.0
+ pydantic==2.5.0
+ torch>=2.0.0
+ transformers>=4.35.0
+ accelerate>=0.24.0
+ bitsandbytes>=0.41.0
+ llama-cpp-python>=0.2.20
+ huggingface-hub>=0.19.0
+ python-multipart==0.0.6
+ numpy>=1.24.0
+ sentencepiece>=0.1.99
+ protobuf>=3.20.0
test_api.py ADDED
@@ -0,0 +1,215 @@
+ #!/usr/bin/env python3
+ """
+ Test script for the Question Generation API
+ Run this after your Space is deployed to test the API endpoints
+ """
+
+ import requests
+ import json
+ import time
+
+ # Replace with your actual Space URL
+ BASE_URL = "https://your-space-name.hf.space"
+
+ def test_health_endpoint():
+     """Test the health check endpoint"""
+     print("🔍 Testing health endpoint...")
+
+     try:
+         response = requests.get(f"{BASE_URL}/health", timeout=30)
+         print(f"Status Code: {response.status_code}")
+
+         if response.status_code == 200:
+             data = response.json()
+             print("✅ Health check passed")
+             print(f"Model Loaded: {data['model_loaded']}")
+             print(f"Device: {data['device']}")
+             if data.get('memory_usage'):
+                 memory = data['memory_usage']
+                 print(f"VRAM Usage: {memory.get('allocated_gb', 0):.2f}GB / {memory.get('total_gb', 0):.2f}GB")
+             return True
+         else:
+             print(f"❌ Health check failed: {response.text}")
+             return False
+
+     except requests.exceptions.RequestException as e:
+         print(f"❌ Health check error: {e}")
+         return False
+
+ def test_question_generation():
+     """Test the question generation endpoint"""
+     print("\n🤔 Testing question generation...")
+
+     test_cases = [
+         {
+             "name": "Simple Statement",
+             "data": {
+                 "statement": "Artificial intelligence is transforming healthcare by enabling more accurate diagnoses, personalized treatments, and efficient drug discovery processes.",
+                 "num_questions": 3,
+                 "difficulty_level": "medium"
+             }
+         },
+         {
+             "name": "Complex Statement",
+             "data": {
+                 "statement": "Climate change represents one of the most significant challenges of the 21st century, involving complex interactions between atmospheric chemistry, ocean currents, biodiversity loss, and human economic systems. The greenhouse effect, primarily driven by carbon dioxide emissions from fossil fuel combustion, is causing global temperatures to rise at an unprecedented rate.",
+                 "num_questions": 5,
+                 "difficulty_level": "hard",
+                 "temperature": 0.9
+             }
+         },
+         {
+             "name": "Short Statement",
+             "data": {
+                 "statement": "Water boils at 100 degrees Celsius at sea level.",
+                 "num_questions": 2,
+                 "difficulty_level": "easy"
+             }
+         }
+     ]
+
+     for i, test_case in enumerate(test_cases, 1):
+         print(f"\n📝 Test Case {i}: {test_case['name']}")
+         print(f"Statement: {test_case['data']['statement'][:100]}...")
+
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json=test_case['data'],
+                 timeout=60  # Increased timeout for model inference
+             )
+
+             print(f"Status Code: {response.status_code}")
+
+             if response.status_code == 200:
+                 data = response.json()
+                 questions = data['questions']
+
+                 print(f"✅ Generated {len(questions)} questions:")
+                 for j, question in enumerate(questions, 1):
+                     print(f"  {j}. {question}")
+
+                 print(f"Metadata: {data['metadata']}")
+
+             else:
+                 print(f"❌ Generation failed: {response.text}")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Request error: {e}")
+
+ def test_error_handling():
+     """Test error handling"""
+     print("\n🚨 Testing error handling...")
+
+     # Test invalid parameters
+     invalid_tests = [
+         {
+             "name": "Missing statement",
+             "data": {"num_questions": 3}
+         },
+         {
+             "name": "Invalid num_questions",
+             "data": {
+                 "statement": "Test statement",
+                 "num_questions": 15  # Above the allowed maximum of 10
+             }
+         },
+         {
+             "name": "Invalid temperature",
+             "data": {
+                 "statement": "Test statement",
+                 "temperature": 5.0  # Above the allowed maximum of 2.0
+             }
+         }
+     ]
+
+     for test in invalid_tests:
+         print(f"\n🔍 Testing: {test['name']}")
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json=test['data'],
+                 timeout=30
+             )
+
+             if response.status_code == 422:
+                 print("✅ Correctly rejected invalid input")
+             else:
+                 print(f"⚠️ Unexpected status code: {response.status_code}")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Request error: {e}")
+
+ def benchmark_performance():
+     """Simple performance benchmark"""
+     print("\n⚡ Performance benchmark...")
+
+     statement = "Machine learning algorithms are becoming increasingly sophisticated, enabling computers to learn patterns from data without being explicitly programmed for every scenario."
+
+     times = []
+     for i in range(3):
+         print(f"Run {i+1}/3...", end=" ")
+
+         start_time = time.time()
+         try:
+             response = requests.post(
+                 f"{BASE_URL}/generate-questions",
+                 json={
+                     "statement": statement,
+                     "num_questions": 3,
+                     "difficulty_level": "medium"
+                 },
+                 timeout=60
+             )
+
+             end_time = time.time()
+             duration = end_time - start_time
+             times.append(duration)
+
+             if response.status_code == 200:
+                 print(f"✅ {duration:.2f}s")
+             else:
+                 print(f"❌ Failed ({response.status_code})")
+
+         except requests.exceptions.RequestException as e:
+             print(f"❌ Error: {e}")
+
+     if times:
+         avg_time = sum(times) / len(times)
+         print(f"\n📊 Average response time: {avg_time:.2f}s")
+         print(f"📊 Min: {min(times):.2f}s, Max: {max(times):.2f}s")
+
+ def main():
+     """Run all tests"""
+     print("🚀 Starting API tests")
+     print(f"Base URL: {BASE_URL}")
+     print("=" * 50)
+
+     # Test health first
+     if not test_health_endpoint():
+         print("\n❌ Health check failed. Make sure your Space is running and accessible.")
+         return
+
+     # Wait a moment for the model to be ready
+     print("\n⏳ Waiting for model to be ready...")
+     time.sleep(5)
+
+     # Run tests
+     test_question_generation()
+     test_error_handling()
+     benchmark_performance()
+
+     print("\n" + "=" * 50)
+     print("✅ All tests completed!")
+     print("\n💡 Usage example:")
+     print(f"curl -X POST '{BASE_URL}/generate-questions' \\")
+     print("  -H 'Content-Type: application/json' \\")
+     print("  -d '{\"statement\": \"Your statement here\", \"num_questions\": 3}'")
+
+ if __name__ == "__main__":
+     # Update this with your actual Space URL before running
+     if "your-space-name" in BASE_URL:
+         print("⚠️ Please update BASE_URL with your actual Space URL before running tests!")
+         print("Example: BASE_URL = 'https://username-question-generation-api.hf.space'")
+     else:
+         main()