CrazyMonkey0 committed
Commit 8f110eb · 1 Parent(s): 145a157

perf: implement lazy loading to fix startup timeouts


- Load model on first request instead of startup
- Increase token limits and Gunicorn timeout
- Add stop token for cleaner responses

Files changed (2)
  1. Dockerfile +16 -36
  2. app/routes/nlp.py +15 -77
Dockerfile CHANGED
@@ -1,47 +1,27 @@
+# Instead of FROM python:3.12, use your own base image
 FROM crazymonkey00/llama-base:latest
 
+# Set the working directory
 WORKDIR /app
 
-# Install essential system dependencies
-RUN apt-get update && apt-get install -y --no-install-recommends \
-    build-essential \
-    gcc \
-    g++ \
-    cmake \
-    git \
-    git-lfs \
-    wget \
-    curl \
-    sox \
-    ffmpeg \
-    espeak-ng \
-    libffi-dev \
-    libopenblas-dev \
-    liblapack-dev \
-    libfreetype6-dev \
-    libpng-dev \
-    zlib1g-dev \
-    libbz2-dev \
-    libjpeg-dev \
-    gfortran \
-    pkg-config \
-    bash-completion \
-    && rm -rf /var/lib/apt/lists/*
-
-# Copy Python requirements
-COPY requirements.txt /app/requirements.txt
+# Copy requirements.txt (without llama-cpp-python - it is already in the image!)
+COPY ./requirements.txt /app/requirements.txt
 
-# Upgrade pip
-RUN pip install --upgrade pip setuptools wheel
-
-# Install dependencies from requirements
+# Install only the additional dependencies of this project
 RUN pip install --no-cache-dir -r requirements.txt
 
-# Copy the rest of the application
+# Copy the whole application code
 COPY . /app
 
-# Expose port
+# Expose the port for Hugging Face Spaces
 EXPOSE 7860
 
-# Run FastAPI with Gunicorn
-CMD ["gunicorn", "app.main:app", "-k", "uvicorn.workers.UvicornWorker", "--bind", "0.0.0.0:7860", "--workers", "1", "--timeout", "120"]
+# Run FastAPI with Gunicorn - increased timeout for model loading
+CMD ["gunicorn", "app.main:app", \
+    "-k", "uvicorn.workers.UvicornWorker", \
+    "--bind", "0.0.0.0:7860", \
+    "--workers", "1", \
+    "--timeout", "600", \
+    "--graceful-timeout", "600", \
+    "--worker-class", "uvicorn.workers.UvicornWorker", \
+    "--log-level", "info"]
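The timeout jump from 120 to 600 seconds goes hand in hand with the lazy loading described in the commit message: the worker now boots instantly, but the first request can block for minutes while the GGUF model loads. A minimal sketch of that lazy-initialization pattern, with a placeholder loader standing in for the real `llama_cpp.Llama(...)` call (the loader body is an assumption, not code from this commit):

```python
import threading

_llm = None               # loaded model, populated on first use
_lock = threading.Lock()  # guards the one-time load

def _load_model():
    # Placeholder: the real app would call llama_cpp.Llama(model_path=...),
    # which can take minutes for a large GGUF file.
    return object()

def get_llm():
    """Return the model, loading it on the first call only."""
    global _llm
    if _llm is None:          # fast path once loaded
        with _lock:           # double-checked locking for concurrent requests
            if _llm is None:
                _llm = _load_model()
    return _llm
```

With `--workers 1` the lock is cheap insurance; with more workers, each process would still load its own copy of the model.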
app/routes/nlp.py CHANGED
@@ -5,6 +5,18 @@ from llama_cpp import Llama
 
 router = APIRouter()
 
+SYSTEM_PROMPT = """You are Emma, a friendly English teacher helping learners improve their English.
+
+Reply naturally to the user's message (2-4 sentences), then if you find errors, add:
+
+CORRECTION:
+Error: [type]
+Original: "..."
+Correction: "..."
+Explanation: [one simple sentence]
+
+Analyze only grammar, vocabulary, spelling, and common learner mistakes. Be encouraging!"""
+
 class ChatRequest(BaseModel):
     message: str
 
@@ -29,92 +41,18 @@ async def chat(request: Request, chat_request: ChatRequest):
 
     # preparation of messages
     messages = [
-        {"role": "system", "content": """
-You are Emma — a friendly, patient, encouraging native speaker of American English and an experienced English teacher. Assume every user is learning English.
-
-Top priorities (in order):
-
-First: Reply NATURALLY and CONVERSATIONALLY to the user’s most recent (last) message. The reply should sound like a warm, helpful human: concise (2–4 sentences), encouraging, and easy to understand.
-
-Second: Immediately after that natural reply, analyze only that same most recent message for language errors and apply the correction rules below. Do not analyze earlier messages.
-
-What to detect (error categories):
-
-Grammar (tenses, word order, auxiliary duplication like “what’s is”, subject-verb agreement)
-
-Vocabulary (word choice, false friends, awkward collocations)
-
-Spelling
-
-Punctuation
-
-Register (formal vs. informal mismatch)
-
-Typical learner errors (missing articles, capitalization mistakes, double auxiliaries, common typos)
-
-Correction rules:
-
-If any errors are found, append exactly one correction block at the end of your reply. If no errors are found, append nothing.
-
-Corrections must be concise, clear, encouraging, and not overwhelming.
-
-Explanations must be one sentence and simple.
-
-Provide an example only if helpful, and keep it short (one sentence).
-
-If multiple possible fixes exist, show the single most natural and simple correction for the learner (you may include a second only if it’s essential).
-
-Exact correction block format (use this format verbatim):
-
-CORRECTION:
-
-Error: [short label — e.g. “Grammar” / “Spelling” / “Vocabulary”]
-
-Original: “...original text fragment...”
-
-Correction: “...suggested correction...”
-
-Explanation: [one-sentence, simple explanation]
-(If helpful) Example: “...full correct sentence...”
-
-Behavior & style constraints:
-
-Always prioritize the conversational reply above the correction. The correction is an add-on, never the primary content.
-
-Tone: friendly, supportive, patient, non-judgmental.
-
-Keep everything short, organized, and easy to scan.
-
-Never invent facts. If you don’t know something, say “I don’t know” or ask a clarifying question.
-
-Assume the user is an English learner and tailor explanations accordingly.
-
-No long grammar essays; keep corrections short and actionable.
-
-Execution notes for the model (internal-use guidance you should follow):
-
-Analyze only the last user message text (no earlier context).
-
-If the last message contains more than one error, include up to two prioritized corrections inside the single correction block (choose the two most important).
-
-Use natural, learner-friendly wording in explanations.
-
-Keep the correction block compact and visually distinct from the conversational reply.
-
-Use your prompt-optimization and code-writing strengths to keep instructions minimal but robust — be decisive and pick the clearest fix.
-
-Final instruction: Reply to the user’s most recent message now, following these rules exactly.
-"""},
+        {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": text}
     ]
 
     # Generate response
     output = llm.create_chat_completion(
         messages=messages,
-        max_tokens=128,
+        max_tokens=512,
         temperature=0.7,
         top_p=0.9,
-        top_k=50
+        top_k=50,
+        stop=["<|im_end|>"]
     )
 
     # Extract response text
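The hunk cuts off at the `# Extract response text` comment. `create_chat_completion` in llama-cpp-python returns an OpenAI-style dict, so the extraction presumably looks something like the helper below (the function name is ours, and stripping the `<|im_end|>` marker is a defensive assumption in case the raw stop token ever leaks into the text):

```python
def extract_reply(output: dict) -> str:
    """Pull the assistant's text out of an OpenAI-style chat completion dict."""
    content = output["choices"][0]["message"]["content"]
    # The stop token normally never appears in the text; strip it just in case.
    return content.replace("<|im_end|>", "").strip()

# Minimal fake response in the same shape, for illustration only:
fake = {"choices": [{"message": {"role": "assistant", "content": "Great job! <|im_end|>"}}]}
print(extract_reply(fake))  # -> Great job!
```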