CrazyMonkey0 committed
Commit 8f110eb · 1 Parent(s): 145a157

perf: implement lazy loading to fix startup timeouts


- Load model on first request instead of startup
- Increase token limits and Gunicorn timeout
- Add stop token for cleaner responses

Files changed (2)
  1. Dockerfile +16 -36
  2. app/routes/nlp.py +15 -77
Dockerfile CHANGED
@@ -1,47 +1,27 @@
+# Instead of FROM python:3.12, use your own base image
 FROM crazymonkey00/llama-base:latest
 
+# Set the working directory
 WORKDIR /app
 
-# Install essential system dependencies
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    build-essential \
-    gcc \
-    g++ \
-    cmake \
-    git \
-    git-lfs \
-    wget \
-    curl \
-    sox \
-    ffmpeg \
-    espeak-ng \
-    libffi-dev \
-    libopenblas-dev \
-    liblapack-dev \
-    libfreetype6-dev \
-    libpng-dev \
-    zlib1g-dev \
-    libbz2-dev \
-    libjpeg-dev \
-    gfortran \
-    pkg-config \
-    bash-completion \
-    && rm -rf /var/lib/apt/lists/*
-
-# Copy Python requirements
-COPY requirements.txt /app/requirements.txt
+# Copy requirements.txt (without llama-cpp-python - it is already in the image!)
+COPY ./requirements.txt /app/requirements.txt
 
-# Upgrade pip
-RUN pip install --upgrade pip setuptools wheel
-
-# Install dependencies from requirements
+# Install only the additional dependencies of this project
 RUN pip install --no-cache-dir -r requirements.txt
 
-# Copy the rest of the application
+# Copy the whole application code
 COPY . /app
 
-# Expose port
+# Expose the port for Hugging Face Spaces
 EXPOSE 7860
 
-# Run FastAPI with Gunicorn
-CMD ["gunicorn", "app.main:app", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--workers", "1", "--timeout", "120"]
+# Run FastAPI with Gunicorn - increased timeout for model loading
+CMD ["gunicorn", "app.main:app", \
+    "-k", "uvicorn.workers.UvicornWorker", \
+    "--bind", "0.0.0.0:7860", \
+    "--workers", "1", \
+    "--timeout", "600", \
+    "--graceful-timeout", "600", \
+    "--worker-class", "uvicorn.workers.UvicornWorker", \
+    "--log-level", "info"]
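The timeout jump from 120 to 600 seconds goes hand in hand with the lazy loading described in the commit message: the worker now boots instantly, but the first request can block for minutes while the GGUF model loads. A minimal sketch of that lazy-initialization pattern, with a placeholder loader standing in for the real `llama_cpp.Llama(...)` call (the loader body is an assumption, not code from this commit):

```python
import threading

_llm = None               # loaded model, populated on first use
_lock = threading.Lock()  # guards the one-time load

def _load_model():
    # Placeholder: the real app would call llama_cpp.Llama(model_path=...),
    # which can take minutes for a large GGUF file.
    return object()

def get_llm():
    """Return the model, loading it on the first call only."""
    global _llm
    if _llm is None:          # fast path once loaded
        with _lock:           # double-checked locking for concurrent requests
            if _llm is None:
                _llm = _load_model()
    return _llm
```

With `--workers 1` the lock is cheap insurance; with more workers, each process would still load its own copy of the model.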
app/routes/nlp.py CHANGED
@@ -5,6 +5,18 @@ from llama_cpp import Llama
 
 router = APIRouter()
 
+SYSTEM_PROMPT = """You are Emma, a friendly English teacher helping learners improve their English.
+
+Reply naturally to the user's message (2-4 sentences), then if you find errors, add:
+
+CORRECTION:
+Error: [type]
+Original: "..."
+Correction: "..."
+Explanation: [one simple sentence]
+
+Analyze only grammar, vocabulary, spelling, and common learner mistakes. Be encouraging!"""
+
 class ChatRequest(BaseModel):
     message: str
 
@@ -29,92 +41,18 @@ async def chat(request: Request, chat_request: ChatRequest):
 
     # preparation of messages
     messages = [
-        {"role": "system", "content": """
-You are Emma — a friendly, patient, encouraging native speaker of American English and an experienced English teacher. Assume every user is learning English.
-
-Top priorities (in order):
-
-First: Reply NATURALLY and CONVERSATIONALLY to the user’s most recent (last) message. The reply should sound like a warm, helpful human: concise (2–4 sentences), encouraging, and easy to understand.
-
-Second: Immediately after that natural reply, analyze only that same most recent message for language errors and apply the correction rules below. Do not analyze earlier messages.
-
-What to detect (error categories):
-
-Grammar (tenses, word order, auxiliary duplication like “what’s is”, subject-verb agreement)
-
-Vocabulary (word choice, false friends, awkward collocations)
-
-Spelling
-
-Punctuation
-
-Register (formal vs. informal mismatch)
-
-Typical learner errors (missing articles, capitalization mistakes, double auxiliaries, common typos)
-
-Correction rules:
-
-If any errors are found, append exactly one correction block at the end of your reply. If no errors are found, append nothing.
-
-Corrections must be concise, clear, encouraging, and not overwhelming.
-
-Explanations must be one sentence and simple.
-
-Provide an example only if helpful, and keep it short (one sentence).
-
-If multiple possible fixes exist, show the single most natural and simple correction for the learner (you may include a second only if it’s essential).
-
-Exact correction block format (use this format verbatim):
-
-CORRECTION:
-
-Error: [short label — e.g. “Grammar” / “Spelling” / “Vocabulary”]
-
-Original: “...original text fragment...”
-
-Correction: “...suggested correction...”
-
-Explanation: [one-sentence, simple explanation]
-(If helpful) Example: “...full correct sentence...”
-
-Behavior & style constraints:
-
-Always prioritize the conversational reply above the correction. The correction is an add-on, never the primary content.
-
-Tone: friendly, supportive, patient, non-judgmental.
-
-Keep everything short, organized, and easy to scan.
-
-Never invent facts. If you don’t know something, say “I don’t know” or ask a clarifying question.
-
-Assume the user is an English learner and tailor explanations accordingly.
-
-No long grammar essays; keep corrections short and actionable.
-
-Execution notes for the model (internal-use guidance you should follow):
-
-Analyze only the last user message text (no earlier context).
-
-If the last message contains more than one error, include up to two prioritized corrections inside the single correction block (choose the two most important).
-
-Use natural, learner-friendly wording in explanations.
-
-Keep the correction block compact and visually distinct from the conversational reply.
-
-Use your prompt-optimization and code-writing strengths to keep instructions minimal but robust — be decisive and pick the clearest fix.
-
-Final instruction: Reply to the user’s most recent message now, following these rules exactly.
-"""},
+        {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": text}
     ]
 
     # Generate response
     output = llm.create_chat_completion(
         messages=messages,
-        max_tokens=128,
+        max_tokens=512,
         temperature=0.7,
         top_p=0.9,
-        top_k=50
+        top_k=50,
+        stop=["<|im_end|>"]
     )
 
     # Extract response text
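The hunk cuts off at the `# Extract response text` comment. `create_chat_completion` in llama-cpp-python returns an OpenAI-style dict, so the extraction presumably looks something like the helper below (the function name is ours, and stripping the `<|im_end|>` marker is a defensive assumption in case the raw stop token ever leaks into the text):

```python
def extract_reply(output: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style chat completion dict."""
    content = output["choices"][0]["message"]["content"]
    # The stop token normally never appears in the text; strip it just in case.
    return content.replace("<|im_end|>", "").strip()

# Minimal fake response in the same shape, for illustration only:
fake = {"choices": [{"message": {"role": "assistant", "content": "Great job! <|im_end|>"}}]}
print(extract_reply(fake))  # -> Great job!
```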