Peterase committed on
Commit
fa9ac33
·
1 Parent(s): 75375c8

feat(intent): complete rewrite of intent classifier v3

5-stage pipeline replacing fragile regex patches with systematic coverage:

Stage 1 - Exact match sets (0ms):
- _EXACT_OTHER: greetings, profanity, reactions, single chars
- _EXACT_NEWS_TEMPORAL: today, now, breaking, live, happening
- _EXACT_NEWS_GENERAL: ethiopia, amhara, tigray, news, conflict
- Handles all vague/single-word queries correctly
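
A minimal sketch of the Stage 1 lookup, assuming the whole query is stripped and lowercased before the set membership test. The set contents are abbreviated and the function name `exact_match` is hypothetical:

```python
from typing import Optional

# Abbreviated, illustrative versions of the three exact-match sets.
EXACT_OTHER = {"hi", "hello", "thanks", "lol", "?"}
EXACT_NEWS_TEMPORAL = {"today", "now", "breaking", "live", "happening"}
EXACT_NEWS_GENERAL = {"ethiopia", "amhara", "tigray", "news", "conflict"}


def exact_match(query: str) -> Optional[str]:
    """Return an intent only if the entire normalized query is a known entry."""
    q = query.strip().lower()
    if q in EXACT_OTHER:
        return "OTHER"
    if q in EXACT_NEWS_TEMPORAL:
        return "NEWS_TEMPORAL"
    if q in EXACT_NEWS_GENERAL:
        return "NEWS_GENERAL"
    return None  # not a whole-query match: fall through to Stage 2
```

Because the test is on the whole query, "tigray" resolves here while "tigray news" falls through to the later stages.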

Stage 2 - Prefix/suffix rules (0ms):
- _TEMPORAL_PREFIXES: 'latest news', 'whats happening', 'news today'
- _HISTORICAL_PREFIXES: 'history of', 'background on', 'how did'
- _OTHER_PREFIXES: identity, math, creative, help queries
- Covers 'are you X', 'what model', 'write me', 'calculate'
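
A rough sketch of how the prefix rules could be applied, assuming the check order temporal, historical, other. The tuples are abbreviated and `prefix_match` is a hypothetical name. The order matters because a broad prefix such as 'what is ' would otherwise swallow 'what is happening':

```python
from typing import Optional

# Abbreviated, illustrative prefix tuples.
TEMPORAL_PREFIXES = ("latest news", "whats happening", "what is happening", "news today")
HISTORICAL_PREFIXES = ("history of ", "background on ", "how did ")
OTHER_PREFIXES = ("are you ", "what model", "what is ", "write me ", "calculate ")


def prefix_match(query: str) -> Optional[str]:
    """Return an intent based on how the normalized query starts."""
    q = query.strip().lower()
    # str.startswith accepts a tuple and returns True if any prefix matches.
    if q.startswith(TEMPORAL_PREFIXES):
        return "NEWS_TEMPORAL"
    if q.startswith(HISTORICAL_PREFIXES):
        return "NEWS_HISTORICAL"
    if q.startswith(OTHER_PREFIXES):
        return "OTHER"
    return None  # no prefix rule applies: fall through to Stage 3
```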

Stage 3 - Regex pattern engine (0ms):
- _RE_TEMPORAL: 30+ temporal signals with word boundaries
- _RE_HISTORICAL: 20+ historical signals
- _RE_CONFLICT: 30+ conflict/security signals → NEWS_GENERAL/conflict
- _RE_HUMANITARIAN: 25+ humanitarian signals → NEWS_GENERAL/humanitarian
- _RE_OFF_TOPIC: recipes, movies, games, poems → OTHER
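
A small sketch of the regex stage with heavily abbreviated patterns. The off-topic check runs first, as in the diff. The position of the conflict and humanitarian checks after the historical one is an assumption, and `RULES` and `regex_match` are hypothetical names:

```python
import re
from typing import Optional, Tuple

# Abbreviated, illustrative patterns; \b keeps "live" from matching inside "delivery".
RE_OFF_TOPIC = re.compile(r"\b(recipe|movie|game|poem)\b", re.IGNORECASE)
RE_TEMPORAL = re.compile(r"\b(today|breaking|latest|live)\b", re.IGNORECASE)
RE_HISTORICAL = re.compile(r"\b(history|timeline|background)\b", re.IGNORECASE)
RE_CONFLICT = re.compile(r"\b(clash(es)?|attack(ed|s)?|militia)\b", re.IGNORECASE)
RE_HUMANITARIAN = re.compile(r"\b(displaced|refugees?|famine)\b", re.IGNORECASE)

# (pattern, intent, sub_type), evaluated top to bottom; the first hit wins.
RULES = (
    (RE_OFF_TOPIC, "OTHER", "off_topic"),
    (RE_TEMPORAL, "NEWS_TEMPORAL", "general"),
    (RE_HISTORICAL, "NEWS_HISTORICAL", "general"),
    (RE_CONFLICT, "NEWS_GENERAL", "conflict"),
    (RE_HUMANITARIAN, "NEWS_GENERAL", "humanitarian"),
)


def regex_match(query: str) -> Optional[Tuple[str, str]]:
    """Return (intent, sub_type) for the first matching pattern, else None."""
    for pattern, intent, sub_type in RULES:
        if pattern.search(query):
            return intent, sub_type
    return None
```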

Stage 4 - Weighted keyword scoring (1ms):
- High weight (0.25): Ethiopia-specific terms, news signals
- Medium weight (0.12): General news vocabulary
- Low weight (0.05): Generic terms
- Score >= 0.40 → NEWS_GENERAL
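
One possible reading of the scoring rule, sketched with abbreviated keyword sets. The weights and threshold come from the list above. Summing one weight per matched keyword, counting low-weight terms only when nothing stronger matched, and the helper names are all assumptions:

```python
import re
from typing import Optional

# Abbreviated, illustrative keyword sets.
KW_HIGH = {"ethiopia", "tigray", "news", "report"}
KW_MED = {"election", "government", "economy"}
KW_LOW = {"situation", "region", "people"}

W_HIGH, W_MED, W_LOW = 0.25, 0.12, 0.05
THRESHOLD = 0.40


def _hits(query_lower: str, keywords: set) -> int:
    """Count keywords that appear as whole words or phrases in the query."""
    return sum(
        1 for kw in keywords
        if re.search(rf"\b{re.escape(kw)}\b", query_lower)
    )


def keyword_score(query: str) -> float:
    q = query.lower()
    high, med = _hits(q, KW_HIGH), _hits(q, KW_MED)
    if high or med:
        score = high * W_HIGH + med * W_MED
    else:
        # Low-weight terms only count when nothing stronger matched.
        score = _hits(q, KW_LOW) * W_LOW
    return min(score, 1.0)


def classify_by_keywords(query: str) -> Optional[str]:
    return "NEWS_GENERAL" if keyword_score(query) >= THRESHOLD else None
```

Under these assumptions a single high-weight term (0.25) is not enough on its own; it takes two high-weight hits, or one high plus two medium, to cross 0.40.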

Stage 5 - DeBERTa NLI (500ms, ambiguous only):
- Only fires when stages 1-4 produce no result
- Improved candidate labels for better accuracy
- Threshold raised to 0.35 (was 0.30)
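
A hedged sketch of the Stage 5 gate. The model is passed in as a callable so the threshold logic can be shown without loading DeBERTa. The candidate label wording, `nli_fallback`, and `stub_zero_shot` are hypothetical; only the 0.35 threshold and the `labels`/`scores` result shape come from the commit:

```python
from typing import Callable, Dict, List, Optional

NLI_THRESHOLD = 0.35  # raised from 0.30 in this commit

# Hypothetical candidate labels mapped to intents (the real wording may differ).
LABEL_TO_INTENT = {
    "breaking or current news": "NEWS_TEMPORAL",
    "historical background": "NEWS_HISTORICAL",
    "general news": "NEWS_GENERAL",
    "small talk or off-topic request": "OTHER",
}

ZeroShot = Callable[[str, List[str]], Dict[str, list]]


def nli_fallback(query: str, zero_shot: ZeroShot) -> Optional[str]:
    """Accept the top NLI label only if its score clears the threshold."""
    result = zero_shot(query, list(LABEL_TO_INTENT))
    top_label = result["labels"][0]
    top_score = float(result["scores"][0])
    if top_score < NLI_THRESHOLD:
        return None  # too uncertain: let the safe default decide
    return LABEL_TO_INTENT[top_label]


def stub_zero_shot(query: str, labels: List[str]) -> Dict[str, list]:
    """Stand-in for the model: always ranks 'general news' first."""
    score = 0.80 if "minister" in query.lower() else 0.30
    ranked = ["general news"] + [lb for lb in labels if lb != "general news"]
    rest = (1.0 - score) / (len(ranked) - 1)
    return {"labels": ranked, "scores": [score] + [rest] * (len(ranked) - 1)}
```

With the real model, the callable could be a thin wrapper such as `lambda q, labels: pipe(q, candidate_labels=labels)`, where `pipe` is a transformers zero-shot-classification pipeline.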

New features:
- sub_type field: conflict|humanitarian|identity|math|creative|off_topic
- query_complexity: empty|vague|simple|medium|complex (was simple/medium/complex)
- Safe default: unknown queries of 2+ words → NEWS_GENERAL (searching and finding nothing is better than refusing)
- Single unknown word → OTHER
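
The safe default can be sketched as a word-count check. `default_intent` is a hypothetical name, and treating an empty query as OTHER is an assumption:

```python
def default_intent(query: str) -> str:
    """Fallback intent when no earlier stage produced a result."""
    word_count = len(query.split())
    # Two or more unknown words: search anyway rather than refuse.
    return "NEWS_GENERAL" if word_count >= 2 else "OTHER"
```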

src/infrastructure/adapters/intent_classifier_v2.py CHANGED
@@ -1,521 +1,560 @@
1
  """
2
- Production-Grade Intent Classifier v2
3
-
4
- Enhanced intent classification for hybrid RAG system with:
5
- - Multi-class classification (NEWS_TEMPORAL, NEWS_HISTORICAL, NEWS_GENERAL, OTHER)
6
- - Confidence scoring with thresholds
7
- - Query complexity analysis
8
- - Metrics tracking
9
- - Fallback strategies
10
- - Thread-safe lazy loading
11
-
12
- Classification Hierarchy:
13
- 1. Instant shortcuts (regex patterns) - 0ms
14
- 2. DeBERTa zero-shot NLI - ~20ms
15
- 3. Keyword fallback - 0ms
16
- 4. Default (NEWS_GENERAL) - safe fallback
 
 
 
 
 
17
  """
18
 
19
  import logging
20
  import re
21
  import threading
22
- from typing import Dict, Any, Optional, Tuple
23
- from dataclasses import dataclass
24
- from datetime import datetime
25
  import time
 
 
26
 
27
  logger = logging.getLogger(__name__)
28
 
29
 
30
- # ═══════════════════════════════════════════════════════════════════════════
31
- # PATTERN DEFINITIONS
32
- # ═══════════════════════════════════════════════════════════════════════════
33
 
34
- # Small talk patterns (instant OTHER classification)
35
- _SMALL_TALK_EXACT = {
36
- "hi", "hello", "hey", "thanks", "thank you", "bye", "goodbye",
37
- "good morning", "good afternoon", "good evening", "sup", "yo",
38
- "hello there", "hey there", "hi there", "greetings", "howdy",
39
- # Frustration / profanity
40
- "wtf", "lol", "lmao", "omg", "damn", "shit", "fuck",
41
- "for fuck sake", "for fucks sake", "oh my god", "are you kidding",
42
- "seriously", "come on", "ugh", "argh", "ffs",
43
  }
44
 
45
- _SMALL_TALK_PREFIX = (
46
- "how are you", "what are you", "who are you", "what can you do",
47
- "tell me a joke", "make me laugh", "what's up", "whats up",
48
- "for fuck", "for fucks", "what the fuck", "what the hell",
49
- "are you serious", "you must be", "hello ", "hi ", "hey ",
50
- "can you help", "i need help", "help me",
51
- # Identity questions
52
- "are you ", "what model", "which model", "what ai", "which ai",
53
- "are you chatgpt", "are you gpt", "are you claude", "are you gemini",
54
- "are you llama", "are you an ai", "are you a bot", "are you human",
55
- "what version", "who built you", "who made you", "who created you",
56
- "what are your capabilities", "what can you",
57
- # Math / general knowledge (not news)
58
- "what is ", "what's ", "calculate ", "solve ", "how much is ",
59
- "how many ", "define ", "what does ", "translate ",
60
  )
61
 
62
- # Temporal patterns (instant NEWS_TEMPORAL classification)
63
- _TEMPORAL_PATTERNS = re.compile(
64
  r"\b("
65
- r"today|yesterday|tomorrow|tonight|now|currently|"
66
- r"this (week|month|year|morning|evening|afternoon)|"
67
- r"last (week|month|year|night|hour|"
68
  r"monday|tuesday|wednesday|thursday|friday|saturday|sunday)|"
69
- r"next (week|month|year)|"
70
- r"past (\d+ )?(hour|hours|day|days|week|weeks|month|months)|"
71
- r"recent(ly)?|latest|breaking|just (now|happened|announced|reported)|"
72
- r"(monday|tuesday|wednesday|thursday|friday|saturday|sunday)|"
73
- r"january|february|march|april|may|june|july|august|september|october|november|december|"
74
- r"\d{4}|" # year like 2024, 2025
75
- r"\d+(st|nd|rd|th)|" # ordinal like 1st, 2nd
76
- r"current|ongoing|live|real[- ]?time"
 
 
 
77
  r")\b",
78
  re.IGNORECASE
79
  )
80
 
81
- # Historical patterns (instant NEWS_HISTORICAL classification)
82
- _HISTORICAL_PATTERNS = re.compile(
83
  r"\b("
84
- r"history|historical|background|context|origin|"
85
- r"how (did|was|were)|why (did|was|were)|"
86
- r"what (led to|caused|resulted in)|"
87
- r"timeline|chronology|evolution|development|"
88
- r"past|previous|former|old|ancient|"
89
  r"analysis|overview|summary|explanation|"
90
- r"tell me about|explain|describe"
 
 
91
  r")\b",
92
  re.IGNORECASE
93
  )
94
 
95
- # News signal keywords (fallback NEWS classification)
96
- _NEWS_KEYWORDS = {
97
  "news", "report", "update", "development", "announcement",
 
 
 
 
 
98
  "conflict", "war", "peace", "crisis", "deal", "agreement",
99
- "election", "vote", "campaign", "president", "minister", "government",
100
- "economy", "market", "price", "inflation", "trade",
 
101
  "protest", "demonstration", "strike", "rally",
102
- "attack", "violence", "security", "military",
103
- "ethiopia", "addis", "abiy", "fano", "tigray", "amhara", "oromia",
104
- "africa", "african", "horn of africa",
 
 
 
 
 
 
 
 
105
  }
106
 
107
 
108
- # ═══════════════════════════════════════════════════════════════════════════
109
- # DATA CLASSES
110
- # ═══════════════════════════════════════════════════════════════════════════
111
 
112
  @dataclass
113
  class IntentResult:
114
- """
115
- Intent classification result with confidence and metadata.
116
- """
117
- intent: str # NEWS_TEMPORAL, NEWS_HISTORICAL, NEWS_GENERAL, OTHER
118
- confidence: float # 0.0 to 1.0
119
- method: str # "regex", "deberta", "keyword", "default"
120
- inference_time_ms: float # Time taken for classification
121
- query_complexity: str # "simple", "medium", "complex"
122
- should_use_live: bool # Recommendation for live search
123
- should_use_db: bool # Recommendation for DB search
124
- metadata: Dict[str, Any] # Additional info
125
-
126
  def to_dict(self) -> Dict[str, Any]:
127
- """Convert to dictionary for logging/caching"""
128
  return {
129
  "intent": self.intent,
130
  "confidence": self.confidence,
131
  "method": self.method,
132
  "inference_time_ms": self.inference_time_ms,
133
  "query_complexity": self.query_complexity,
 
134
  "should_use_live": self.should_use_live,
135
  "should_use_db": self.should_use_db,
136
- "metadata": self.metadata
137
  }
138
 
139
 
140
- # ═══════════════════════════════════════════════════════════════════════════
141
- # PRODUCTION-GRADE INTENT CLASSIFIER
142
- # ═══════════════════════════════════════════════════════════════════════════
143
 
144
  class IntentClassifierV2:
145
  """
146
- Production-grade intent classifier with multi-class classification.
147
-
148
- Intent Classes:
149
- - NEWS_TEMPORAL: Time-sensitive news queries (use live search)
150
- - NEWS_HISTORICAL: Historical/background queries (use DB only)
151
- - NEWS_GENERAL: General news queries (use hybrid)
152
- - OTHER: Non-news queries (skip search)
153
-
154
- Features:
155
- - Multi-stage classification (regex → DeBERTa → keyword → default)
156
- - Confidence scoring with thresholds
157
- - Query complexity analysis
158
- - Metrics tracking
159
- - Thread-safe lazy loading
160
  """
161
-
162
  MODEL_NAME = "MoritzLaurer/deberta-v3-base-zeroshot-v2.0"
163
-
164
- # Confidence thresholds
165
- HIGH_CONFIDENCE = 0.75
166
- MEDIUM_CONFIDENCE = 0.50
167
- LOW_CONFIDENCE = 0.30
168
-
169
  def __init__(self):
170
  self._pipe = None
171
  self._lock = threading.Lock()
172
  self._load_failed = False
173
-
174
- # Metrics tracking
175
  self._metrics = {
176
- "total_classifications": 0,
177
- "by_intent": {"NEWS_TEMPORAL": 0, "NEWS_HISTORICAL": 0, "NEWS_GENERAL": 0, "OTHER": 0},
178
- "by_method": {"regex": 0, "deberta": 0, "keyword": 0, "default": 0},
179
- "avg_inference_time_ms": 0.0,
180
- "total_inference_time_ms": 0.0,
181
  }
182
-
183
- def _load(self):
184
- """Lazy load DeBERTa model (thread-safe)"""
185
- if self._pipe is not None or self._load_failed:
186
- return
187
-
188
- with self._lock:
189
- if self._pipe is not None or self._load_failed:
190
- return
191
-
192
- try:
193
- from transformers import pipeline
194
- logger.info(f"Loading intent classifier: {self.MODEL_NAME} ...")
195
-
196
- self._pipe = pipeline(
197
- "zero-shot-classification",
198
- model=self.MODEL_NAME,
199
- device=-1, # CPU (use device=0 for GPU)
200
- multi_label=False,
201
- )
202
-
203
- logger.info("✅ Intent classifier v2 loaded successfully")
204
-
205
- except Exception as e:
206
- logger.error(f"❌ Failed to load intent classifier: {e}")
207
- self._load_failed = True
208
-
209
- def classify(self, query: str, use_cache: bool = True) -> IntentResult:
210
- """
211
- Classify query intent with confidence scoring.
212
-
213
- Args:
214
- query: User query string
215
- use_cache: Whether to use cached results (if available)
216
-
217
- Returns:
218
- IntentResult with classification and metadata
219
- """
220
- start_time = time.time()
221
-
222
- # Normalize query
223
- query_normalized = query.strip()
224
- query_lower = query_normalized.lower()
225
-
226
- # Analyze query complexity
227
- complexity = self._analyze_complexity(query_normalized)
228
-
229
- # ── Stage 1: Instant Regex Shortcuts ──────────────────────────────────
230
-
231
- # Check small talk (OTHER)
232
- if query_lower in _SMALL_TALK_EXACT:
233
- return self._create_result(
234
- intent="OTHER",
235
- confidence=1.0,
236
- method="regex_exact",
237
- start_time=start_time,
238
- complexity=complexity,
239
- metadata={"pattern": "small_talk_exact"}
240
  )
241
-
242
- if any(query_lower.startswith(p) for p in _SMALL_TALK_PREFIX):
243
- return self._create_result(
244
- intent="OTHER",
245
- confidence=0.95,
246
- method="regex_prefix",
247
- start_time=start_time,
248
- complexity=complexity,
249
- metadata={"pattern": "small_talk_prefix"}
250
  )
251
-
252
- # Check temporal patterns (NEWS_TEMPORAL)
253
- temporal_match = _TEMPORAL_PATTERNS.search(query_normalized)
254
- if temporal_match:
255
- return self._create_result(
256
- intent="NEWS_TEMPORAL",
257
- confidence=0.90,
258
- method="regex_temporal",
259
- start_time=start_time,
260
- complexity=complexity,
261
- metadata={"pattern": "temporal", "matched": temporal_match.group(0)}
262
  )
263
-
264
- # Check historical patterns (NEWS_HISTORICAL)
265
- historical_match = _HISTORICAL_PATTERNS.search(query_normalized)
266
- if historical_match:
267
- return self._create_result(
268
- intent="NEWS_HISTORICAL",
269
- confidence=0.85,
270
- method="regex_historical",
271
- start_time=start_time,
272
- complexity=complexity,
273
- metadata={"pattern": "historical", "matched": historical_match.group(0)}
274
  )
275
-
276
- # ── Stage 2: DeBERTa Zero-Shot Classification ─────────────────────────
277
-
278
- self._load()
279
-
 
 
 
 
 
 
280
  if self._pipe is not None:
281
  try:
282
- result = self._classify_with_deberta(query_normalized)
283
-
284
  if result:
285
- return self._create_result(
286
- intent=result["intent"],
287
- confidence=result["confidence"],
288
- method="deberta",
289
- start_time=start_time,
290
- complexity=complexity,
291
- metadata=result["metadata"]
292
  )
293
-
294
  except Exception as e:
295
- logger.warning(f"DeBERTa classification failed: {e}")
296
-
297
- # ── Stage 3: Keyword Fallback ─────────────────────────────────────────
298
-
299
- keyword_result = self._classify_with_keywords(query_lower)
300
- if keyword_result:
301
- return self._create_result(
302
- intent=keyword_result["intent"],
303
- confidence=keyword_result["confidence"],
304
- method="keyword",
305
- start_time=start_time,
306
- complexity=complexity,
307
- metadata=keyword_result["metadata"]
308
- )
309
-
310
- # ── Stage 4: Default (Safe Fallback) ──────────────────────────────────
311
-
312
- return self._create_result(
313
- intent="NEWS_GENERAL",
314
- confidence=0.50,
315
- method="default",
316
- start_time=start_time,
317
- complexity=complexity,
318
- metadata={"reason": "no_pattern_match"}
319
- )
320
-
321
- def _classify_with_deberta(self, query: str) -> Optional[Dict[str, Any]]:
322
- """
323
- Classify using DeBERTa zero-shot model.
324
-
325
- Returns dict with intent, confidence, metadata or None if failed.
326
- """
327
- try:
328
- # Multi-class classification
329
- result = self._pipe(
330
- query,
331
- candidate_labels=[
332
- "breaking news, current events, today's news, latest updates, real-time news",
333
- "historical background, past events, context, analysis, explanation",
334
- "general news, politics, economy, world affairs, sports, technology",
335
- "small talk, greeting, joke, general question unrelated to news",
336
- ],
337
- hypothesis_template="This message is about {}.",
338
- )
339
-
340
- top_label = result["labels"][0]
341
- top_score = result["scores"][0]
342
-
343
- # Map label to intent
344
- if "breaking" in top_label or "current" in top_label or "latest" in top_label:
345
- intent = "NEWS_TEMPORAL"
346
- elif "historical" in top_label or "background" in top_label or "context" in top_label:
347
- intent = "NEWS_HISTORICAL"
348
- elif "general news" in top_label or "politics" in top_label:
349
- intent = "NEWS_GENERAL"
350
- elif "small talk" in top_label or "greeting" in top_label:
351
- intent = "OTHER"
352
- else:
353
- intent = "NEWS_GENERAL" # Default to general news
354
-
355
- # Only return if confidence is above threshold
356
- if top_score >= self.LOW_CONFIDENCE:
357
- return {
358
- "intent": intent,
359
- "confidence": float(top_score),
360
- "metadata": {
361
- "top_label": top_label,
362
- "all_scores": {
363
- label: float(score)
364
- for label, score in zip(result["labels"], result["scores"])
365
- }
366
- }
367
- }
368
-
369
- return None
370
-
371
- except Exception as e:
372
- logger.error(f"DeBERTa inference error: {e}")
373
- return None
374
-
375
- def _classify_with_keywords(self, query_lower: str) -> Optional[Dict[str, Any]]:
376
- """
377
- Classify using keyword matching (fallback).
378
-
379
- Returns dict with intent, confidence, metadata or None if no match.
380
- """
381
- # Count news keyword matches
382
- matches = [kw for kw in _NEWS_KEYWORDS if kw in query_lower]
383
-
384
- if matches:
385
- # More matches = higher confidence
386
- confidence = min(0.70, 0.50 + (len(matches) * 0.05))
387
-
388
- return {
389
- "intent": "NEWS_GENERAL",
390
- "confidence": confidence,
391
- "metadata": {
392
- "matched_keywords": matches[:5], # Top 5
393
- "match_count": len(matches)
394
- }
395
- }
396
-
397
- return None
398
-
399
- def _analyze_complexity(self, query: str) -> str:
400
- """
401
- Analyze query complexity based on length and structure.
402
-
403
- Returns: "simple", "medium", or "complex"
404
- """
405
- word_count = len(query.split())
406
- char_count = len(query)
407
-
408
- # Check for question words
409
- question_words = ["what", "when", "where", "who", "why", "how"]
410
- has_question = any(qw in query.lower() for qw in question_words)
411
-
412
- if word_count <= 3 and not has_question:
413
  return "simple"
414
- elif word_count <= 10:
415
  return "medium"
416
- else:
417
- return "complex"
418
-
419
- def _create_result(
420
  self,
421
  intent: str,
422
  confidence: float,
423
  method: str,
424
- start_time: float,
425
  complexity: str,
426
- metadata: Dict[str, Any]
 
427
  ) -> IntentResult:
428
- """
429
- Create IntentResult with recommendations and metrics.
430
- """
431
- inference_time_ms = (time.time() - start_time) * 1000
432
-
433
- # Determine search recommendations
434
- should_use_live = intent == "NEWS_TEMPORAL"
435
- should_use_db = intent in ["NEWS_TEMPORAL", "NEWS_HISTORICAL", "NEWS_GENERAL"]
436
-
437
- # Update metrics
438
- self._update_metrics(intent, method, inference_time_ms)
439
-
440
- result = IntentResult(
441
  intent=intent,
442
  confidence=confidence,
443
  method=method,
444
- inference_time_ms=inference_time_ms,
445
  query_complexity=complexity,
446
- should_use_live=should_use_live,
447
- should_use_db=should_use_db,
448
- metadata=metadata
 
449
  )
450
-
451
- # Log classification
452
- logger.debug(
453
- f"Intent: {intent} (conf={confidence:.2f}, method={method}, "
454
- f"time={inference_time_ms:.1f}ms, complexity={complexity})"
455
- )
456
-
457
- return result
458
-
459
- def _update_metrics(self, intent: str, method: str, inference_time_ms: float):
460
- """Update classification metrics"""
461
- self._metrics["total_classifications"] += 1
462
- self._metrics["by_intent"][intent] = self._metrics["by_intent"].get(intent, 0) + 1
463
- self._metrics["by_method"][method] = self._metrics["by_method"].get(method, 0) + 1
464
- self._metrics["total_inference_time_ms"] += inference_time_ms
465
- self._metrics["avg_inference_time_ms"] = (
466
- self._metrics["total_inference_time_ms"] / self._metrics["total_classifications"]
467
  )
468
-
469
  def get_metrics(self) -> Dict[str, Any]:
470
- """Get classification metrics for monitoring"""
471
- return dict(self._metrics)
472
-
473
- def reset_metrics(self):
474
- """Reset metrics (useful for testing)"""
475
- self._metrics = {
476
- "total_classifications": 0,
477
- "by_intent": {"NEWS_TEMPORAL": 0, "NEWS_HISTORICAL": 0, "NEWS_GENERAL": 0, "OTHER": 0},
478
- "by_method": {"regex": 0, "deberta": 0, "keyword": 0, "default": 0},
479
- "avg_inference_time_ms": 0.0,
480
- "total_inference_time_ms": 0.0,
481
  }
482
 
483
 
484
- # ═══════════════════════════════════════════════════════════════════════════
485
- # MODULE-LEVEL SINGLETON
486
- # ═══════════════════════════════════════════════════════════════════════════
487
 
488
- # Global singleton instance
489
  intent_classifier_v2 = IntentClassifierV2()
490
 
491
 
492
- # ═══════════════════════════════════════════════════════════════════════════
493
- # BACKWARD COMPATIBILITY WRAPPER
494
- # ═══════════════════════════════════════════════════════════════════════════
495
-
496
  class IntentClassifier:
497
- """
498
- Backward-compatible wrapper for existing code.
499
- Maps v2 multi-class intents to v1 binary (NEWS/OTHER).
500
- """
501
-
502
  def __init__(self):
503
- self._classifier_v2 = intent_classifier_v2
504
-
505
  def classify(self, query: str) -> str:
506
- """
507
- Classify query intent (backward compatible).
508
-
509
- Returns: "NEWS" or "OTHER"
510
- """
511
- result = self._classifier_v2.classify(query)
512
-
513
- # Map v2 intents to v1 binary
514
- if result.intent == "OTHER":
515
- return "OTHER"
516
- else:
517
- return "NEWS" # All NEWS_* intents map to NEWS
518
 
519
 
520
- # Backward-compatible singleton
521
  intent_classifier = IntentClassifier()
 
1
  """
2
+ Intent Classifier v3 - Sharp, Fast, Comprehensive
3
+
4
+ 5-stage classification pipeline:
5
+ Stage 1: Exact match set (0ms) - greetings, profanity, single chars
6
+ Stage 2: Prefix/suffix rules (0ms) - identity, math, commands
7
+ Stage 3: Regex pattern engine (0ms) - temporal, historical, conflict, humanitarian
8
+ Stage 4: Weighted keyword scoring (1ms) - domain-specific vocabulary
9
+ Stage 5: DeBERTa NLI fallback (500ms) - ambiguous edge cases only
10
+
11
+ Handles:
12
+ - Vague / single-word queries ("news", "ethiopia", "amhara")
13
+ - Short queries ("latest", "update", "today")
14
+ - Identity questions ("who are you", "are you gpt")
15
+ - Math / general knowledge ("2+2", "capital of france")
16
+ - Conflict queries ("clashes", "attack", "fano")
17
+ - Humanitarian queries ("displaced", "aid", "refugees")
18
+ - Historical queries ("history of", "background on")
19
+ - Temporal queries ("today", "breaking", "just now")
20
+ - General news ("ethiopia news", "abiy ahmed")
21
+ - Off-topic ("write a poem", "recipe for pasta")
22
  """
23
 
24
  import logging
25
  import re
26
  import threading
 
 
 
27
  import time
28
+ from dataclasses import dataclass
29
+ from typing import Any, Dict, Optional
30
 
31
  logger = logging.getLogger(__name__)
32
 
33
 
34
+ # ═══════════════════════════════════════════════════════════════════════════════
35
+ # STAGE 1: EXACT MATCH SET (0ms)
36
+ # ═══════════════════════════════════════════════════════════════════════════════
37
+
38
+ _EXACT_OTHER = {
39
+ # Greetings
40
+ "hi", "hello", "hey", "yo", "sup", "howdy", "greetings",
41
+ "good morning", "good afternoon", "good evening", "good night",
42
+ "hello there", "hey there", "hi there",
43
+ # Farewells
44
+ "bye", "goodbye", "see you", "later", "cya", "ttyl",
45
+ # Thanks
46
+ "thanks", "thank you", "thx", "ty", "cheers",
47
+ # Reactions
48
+ "ok", "okay", "sure", "cool", "nice", "great", "awesome",
49
+ "lol", "lmao", "haha", "hehe", "omg", "wtf", "wow",
50
+ "ugh", "argh", "hmm", "oh", "ah", "aha",
51
+ # Single characters / gibberish triggers
52
+ ".", "..", "...", "?", "??", "!", "!!", "test", "testing",
53
+ # Profanity (route to OTHER, not news)
54
+ "damn", "shit", "fuck", "crap", "hell",
55
+ }
56
+
57
+ # Vague single-word queries that ARE news-related → NEWS_GENERAL
58
+ _EXACT_NEWS_GENERAL = {
59
+ "news", "update", "updates", "latest", "headlines", "stories",
60
+ "ethiopia", "africa", "amhara", "tigray", "oromia", "somalia",
61
+ "addis", "abiy", "fano", "tplf", "olf", "ene",
62
+ "conflict", "war", "peace", "crisis", "politics",
63
+ "economy", "election", "government",
64
+ }
65
 
66
+ # Vague single-word queries that are temporal → NEWS_TEMPORAL
67
+ _EXACT_NEWS_TEMPORAL = {
68
+ "today", "now", "tonight", "breaking", "live", "current",
69
+ "happening", "recent", "fresh",
 
 
 
 
 
70
  }
71
 
72
+
73
+ # ═══════════════════════════════════════════════════════════════════════════════
74
+ # STAGE 2: PREFIX / SUFFIX RULES (0ms)
75
+ # ═══════════════════════════════════════════════════════════════════════════════
76
+
77
+ # These prefixes → OTHER (identity, math, off-topic commands)
78
+ _OTHER_PREFIXES = (
79
+ # Identity
80
+ "who are you", "what are you", "are you ", "what model",
81
+ "which model", "what ai", "which ai", "what version",
82
+ "who built you", "who made you", "who created you",
83
+ "tell me about yourself", "introduce yourself",
84
+ # Math / calculations
85
+ "what is ", "what's ", "whats ", "calculate ", "compute ",
86
+ "solve ", "how much is ", "convert ", "define ",
87
+ "what does ", "translate ", "spell ", "how do you spell",
88
+ # Commands / creative
89
+ "write ", "generate ", "create ", "make me ", "give me a ",
90
+ "tell me a joke", "tell me a story", "write a poem",
91
+ "write me ", "compose ", "draft ",
92
+ # Help / capability
93
+ "can you help", "help me with", "how do i", "how to ",
94
+ "what can you do", "what are your capabilities",
95
+ # Greetings with space (catches "hello world" etc.)
96
+ "hello ", "hi ", "hey ",
97
+ )
98
+
99
+ # These prefixes → NEWS_TEMPORAL
100
+ _TEMPORAL_PREFIXES = (
101
+ "what happened today", "what's happening", "whats happening",
102
+ "what is happening", "latest news", "breaking news",
103
+ "today's news", "todays news", "news today",
104
+ "what's new", "whats new", "any news",
105
+ "tell me the latest", "give me the latest",
106
+ "what's going on", "whats going on",
107
  )
108
 
109
+ # These prefixes → NEWS_HISTORICAL
110
+ _HISTORICAL_PREFIXES = (
111
+ "history of ", "historical ", "background on ", "background of ",
112
+ "origin of ", "origins of ", "context of ", "context on ",
113
+ "tell me about the history", "what is the history",
114
+ "how did ", "why did ", "what caused ", "what led to ",
115
+ "timeline of ", "chronology of ",
116
+ )
117
+
118
+
119
+ # ═══════════════════════════════════════════════════════════════════════════════
120
+ # STAGE 3: REGEX PATTERN ENGINE (0ms)
121
+ # ═══════════════════════════════════════════════════════════════════════════════
122
+
123
+ # Temporal signals
124
+ _RE_TEMPORAL = re.compile(
125
  r"\b("
126
+ r"today|tonight|yesterday|tomorrow|"
127
+ r"this\s+(morning|afternoon|evening|week|month|year)|"
128
+ r"last\s+(night|hour|week|month|year|"
129
  r"monday|tuesday|wednesday|thursday|friday|saturday|sunday)|"
130
+ r"past\s+\d+\s*(hour|hours|day|days|week|weeks|month|months)|"
131
+ r"just\s+(now|happened|announced|reported|released)|"
132
+ r"breaking|latest|recent(ly)?|current(ly)?|ongoing|live|"
133
+ r"right\s+now|as\s+of\s+(now|today)|"
134
+ r"this\s+just\s+in|developing\s+story|"
135
+ r"hours?\s+ago|minutes?\s+ago|days?\s+ago|"
136
+ r"monday|tuesday|wednesday|thursday|friday|saturday|sunday|"
137
+ r"january|february|march|april|june|july|august|"
138
+ r"september|october|november|december|"
139
+ r"2024|2025|2026|"
140
+ r"real[\s-]?time|up[\s-]?to[\s-]?date"
141
  r")\b",
142
  re.IGNORECASE
143
  )
144
 
145
+ # Historical signals
146
+ _RE_HISTORICAL = re.compile(
147
  r"\b("
148
+ r"history|historical|background|context|origin(s)?|"
149
+ r"how\s+did|why\s+did|what\s+caused|what\s+led\s+to|"
150
+ r"timeline|chronology|evolution|development\s+of|"
151
+ r"past|previous|former|ancient|traditional|"
 
152
  r"analysis|overview|summary|explanation|"
153
+ r"tell\s+me\s+about|explain|describe|"
154
+ r"since\s+(19|20)\d{2}|from\s+(19|20)\d{2}|"
155
+ r"decade|century|era|period"
156
  r")\b",
157
  re.IGNORECASE
158
  )
159
 
160
+ # Conflict / security signals → NEWS_GENERAL (with conflict sub-type)
161
+ _RE_CONFLICT = re.compile(
162
+ r"\b("
163
+ r"clash(es)?|attack(ed|s)?|battle|fighting|armed|militia|"
164
+ r"killed|fatalities|casualties|wounded|dead|deaths|"
165
+ r"protest(s|ers)?|demonstration|rally|riot(s)?|"
166
+ r"military|troops|soldiers|forces|army|"
167
+ r"bomb(ing)?|explosion|airstrike|drone|"
168
+ r"fano|tplf|olf|ene|al[\s-]?shabaab|"
169
+ r"ceasefire|peace\s+deal|negotiation|"
170
+ r"coup|overthrow|uprising|insurgency|rebel"
171
+ r")\b",
172
+ re.IGNORECASE
173
+ )
174
+
175
+ # Humanitarian signals → NEWS_GENERAL (with humanitarian sub-type)
176
+ _RE_HUMANITARIAN = re.compile(
177
+ r"\b("
178
+ r"displaced|displacement|idp|refugee(s)?|"
179
+ r"humanitarian|aid|relief|assistance|"
180
+ r"food\s+(security|insecurity|crisis)|famine|hunger|starvation|"
181
+ r"drought|flood(ing)?|disaster|emergency|"
182
+ r"unocha|unhcr|wfp|unicef|ngo|"
183
+ r"shelter|camp(s)?|evacuation|"
184
+ r"cholera|disease|outbreak|epidemic|"
185
+ r"poverty|malnutrition|sanitation"
186
+ r")\b",
187
+ re.IGNORECASE
188
+ )
189
+
190
+ # Off-topic signals → OTHER
191
+ _RE_OFF_TOPIC = re.compile(
192
+ r"\b("
193
+ r"recipe|cook(ing)?|food\s+recipe|how\s+to\s+cook|"
194
+ r"movie|film|song|music|lyrics|"
195
+ r"game|gaming|play\s+game|"
196
+ r"joke|funny|humor|meme|"
197
+ r"poem|poetry|story|fiction|novel|"
198
+ r"math|algebra|calculus|equation|formula|"
199
+ r"weather\s+forecast|temperature\s+in|"
200
+ r"stock\s+price|crypto|bitcoin|"
201
+ r"sports\s+score|match\s+result|"
202
+ r"translate\s+to|how\s+do\s+you\s+say"
203
+ r")\b",
204
+ re.IGNORECASE
205
+ )
206
+
207
+
208
+ # ═══════════════════════════════════════════════════════════════════════════════
209
+ # STAGE 4: WEIGHTED KEYWORD SCORING (1ms)
210
+ # ═══════════════════════════════════════════════════════════════════════════════
211
+
212
+ # High-weight Ethiopia/Africa news keywords
213
+ _KW_NEWS_HIGH = {
214
+ # Ethiopia-specific
215
+ "ethiopia", "ethiopian", "addis ababa", "addis", "abiy", "abiy ahmed",
216
+ "tigray", "amhara", "oromia", "oromo", "afar", "somali region",
217
+ "fano", "tplf", "olf", "ene", "gerd", "nile", "blue nile",
218
+ "mekelle", "gondar", "bahir dar", "dire dawa", "hawassa",
219
+ # Horn of Africa
220
+ "somalia", "somali", "kenya", "sudan", "south sudan", "eritrea",
221
+ "djibouti", "horn of africa",
222
+ # News signals
223
  "news", "report", "update", "development", "announcement",
224
+ "statement", "press release", "official",
225
+ }
226
+
227
+ # Medium-weight general news keywords
228
+ _KW_NEWS_MED = {
229
  "conflict", "war", "peace", "crisis", "deal", "agreement",
230
+ "election", "vote", "campaign", "president", "prime minister",
231
+ "minister", "government", "parliament", "policy",
232
+ "economy", "market", "inflation", "trade", "investment",
233
  "protest", "demonstration", "strike", "rally",
234
+ "attack", "violence", "security", "military", "forces",
235
+ "humanitarian", "aid", "displaced", "refugee",
236
+ "africa", "african", "un", "united nations", "au", "african union",
237
+ }
238
+
239
+ # Low-weight general keywords (only count if no high/med match)
240
+ _KW_NEWS_LOW = {
241
+ "situation", "issue", "problem", "challenge", "concern",
242
+ "region", "area", "zone", "district", "province",
243
+ "people", "community", "population", "civilian",
244
+ "international", "global", "world",
245
  }
246
 
247
 
248
+ # ═══════════════════════════════════════════════════════════════════════════════
249
+ # DATA CLASS
250
+ # ═══════════════════════════════════════════════════════════════════════════════
251
 
252
  @dataclass
253
  class IntentResult:
254
+ intent: str # NEWS_TEMPORAL | NEWS_HISTORICAL | NEWS_GENERAL | OTHER
255
+ confidence: float # 0.0 – 1.0
256
+ method: str # stage that produced the result
257
+ inference_time_ms: float
258
+ query_complexity: str # vague | simple | medium | complex
259
+ sub_type: str # conflict | humanitarian | general | identity | math | off_topic | ""
260
+ should_use_live: bool
261
+ should_use_db: bool
262
+ metadata: Dict[str, Any]
263
+
 
 
264
  def to_dict(self) -> Dict[str, Any]:
 
265
  return {
266
  "intent": self.intent,
267
  "confidence": self.confidence,
268
  "method": self.method,
269
  "inference_time_ms": self.inference_time_ms,
270
  "query_complexity": self.query_complexity,
271
+ "sub_type": self.sub_type,
272
  "should_use_live": self.should_use_live,
273
  "should_use_db": self.should_use_db,
274
+ "metadata": self.metadata,
275
  }
276
 
277
 
278
+ # ═══════════════════════════════════════════════════════════════════════════════
279
+ # CLASSIFIER
280
+ # ═══════════════════════════════════════════════════════════════════════════════
281
 
282
  class IntentClassifierV2:
283
  """
284
+ Sharp, fast, comprehensive intent classifier.
285
+
286
+ 5-stage pipeline - most queries resolved in Stage 1-4 (<2ms).
287
+ DeBERTa (Stage 5) only fires for genuinely ambiguous queries.
288
  """
289
+
290
  MODEL_NAME = "MoritzLaurer/deberta-v3-base-zeroshot-v2.0"
291
+
 
 
 
 
 
292
  def __init__(self):
293
  self._pipe = None
294
  self._lock = threading.Lock()
295
  self._load_failed = False
 
 
296
  self._metrics = {
297
+ "total": 0,
298
+ "by_intent": {},
299
+ "by_method": {},
300
+ "total_ms": 0.0,
 
301
  }
302
+
303
+ # ── Public API ────────────────────────────────────────────────────────────
304
+
305
+ def classify(self, query: str) -> IntentResult:
306
+ t0 = time.time()
307
+ q = query.strip()
308
+ ql = q.lower()
309
+ complexity = self._complexity(q)
310
+
311
+ # ── Stage 1: Exact match ──────────────────────────────────────────────
312
+ if ql in _EXACT_OTHER:
313
+ return self._result("OTHER", 1.0, "exact", t0, complexity, "identity")
314
+
315
+ if ql in _EXACT_NEWS_TEMPORAL:
316
+ return self._result("NEWS_TEMPORAL", 1.0, "exact", t0, complexity, "general")
317
+
318
+ if ql in _EXACT_NEWS_GENERAL:
319
+ return self._result("NEWS_GENERAL", 1.0, "exact", t0, complexity, "general")
320
+
321
+ # ── Stage 2: Prefix / suffix rules ───────────────────────────────────
322
+ for p in _TEMPORAL_PREFIXES:
323
+ if ql.startswith(p) or ql == p.strip():
324
+ return self._result("NEWS_TEMPORAL", 0.97, "prefix", t0, complexity, "general")
325
+
326
+ for p in _HISTORICAL_PREFIXES:
327
+ if ql.startswith(p):
328
+ return self._result("NEWS_HISTORICAL", 0.95, "prefix", t0, complexity, "general")
329
+
330
+ for p in _OTHER_PREFIXES:
331
+ if ql.startswith(p):
332
+ sub = self._other_subtype(ql)
333
+ return self._result("OTHER", 0.95, "prefix", t0, complexity, sub)
334
+
335
+ # ── Stage 3: Regex pattern engine ────────────────────────────────────
336
+
337
+ # Off-topic check first (before temporal/historical to avoid false positives)
338
+ if _RE_OFF_TOPIC.search(q):
339
+ return self._result("OTHER", 0.90, "regex_offtopic", t0, complexity, "off_topic")
340
+
341
+ # Temporal
342
+ tm = _RE_TEMPORAL.search(q)
343
+ if tm:
344
+ return self._result(
345
+ "NEWS_TEMPORAL", 0.90, "regex_temporal", t0, complexity, "general",
346
+ {"matched": tm.group(0)}
347
  )
348
+
349
+ # Historical
350
+ hm = _RE_HISTORICAL.search(q)
351
+ if hm:
352
+ return self._result(
353
+ "NEWS_HISTORICAL", 0.88, "regex_historical", t0, complexity, "general",
354
+ {"matched": hm.group(0)}
 
 
355
  )
356
+
357
+ # Conflict → NEWS_GENERAL with conflict sub-type
358
+ cm = _RE_CONFLICT.search(q)
359
+ if cm:
360
+ return self._result(
361
+ "NEWS_GENERAL", 0.88, "regex_conflict", t0, complexity, "conflict",
362
+ {"matched": cm.group(0)}
 
 
 
 
363
  )
364
+
365
+ # Humanitarian → NEWS_GENERAL with humanitarian sub-type
366
+ hum = _RE_HUMANITARIAN.search(q)
367
+ if hum:
368
+ return self._result(
369
+ "NEWS_GENERAL", 0.85, "regex_humanitarian", t0, complexity, "humanitarian",
370
+ {"matched": hum.group(0)}
 
 
 
 
371
  )
372
+
373
+ # ── Stage 4: Weighted keyword scoring ────────────────────────────────
374
+ score = self._keyword_score(ql)
375
+ if score >= 0.60:
376
+ return self._result("NEWS_GENERAL", score, "keyword", t0, complexity, "general")
377
+ if score >= 0.40:
378
+ # Weak news signal: still route to news; the lower score is passed through as the confidence
379
+ return self._result("NEWS_GENERAL", score, "keyword", t0, complexity, "general")
380
+
381
+ # ── Stage 5: DeBERTa NLI (ambiguous queries only) ────────────────────
382
+ self._load_deberta()
383
  if self._pipe is not None:
384
  try:
385
+ result = self._deberta_classify(q)
 
386
  if result:
387
+ return self._result(
388
+ result["intent"], result["confidence"],
389
+ "deberta", t0, complexity, "general",
390
+ result["metadata"]
 
 
 
391
  )
 
392
  except Exception as e:
393
+ logger.warning(f"DeBERTa failed: {e}")
394
+
395
+ # ── Fallback: safe default (no stage matched) ─────────────────────────
396
+ # If the query has two or more words and nothing matched, treat it as general news
397
+ # (better to search and find nothing than to refuse)
398
+ if len(ql.split()) >= 2:
399
+ return self._result("NEWS_GENERAL", 0.50, "default", t0, complexity, "general")
400
+
401
+ # Single unknown word → OTHER
402
+ return self._result("OTHER", 0.60, "default", t0, complexity, "unknown")
403
+
404
+ # ── Internal helpers ──────────────────────────────────────────────────────
405
+
406
+ def _keyword_score(self, ql: str) -> float:
407
+ """Weighted keyword scoring. Returns 0.0–1.0."""
408
+ score = 0.0
409
+ for kw in _KW_NEWS_HIGH:
410
+ if kw in ql:
411
+ score += 0.25
412
+ for kw in _KW_NEWS_MED:
413
+ if kw in ql:
414
+ score += 0.12
415
+ for kw in _KW_NEWS_LOW:
416
+ if kw in ql:
417
+ score += 0.05
418
+ return min(score, 1.0)
419
+
420
+ def _other_subtype(self, ql: str) -> str:
421
+ """Determine sub-type for OTHER queries."""
422
+ if any(p in ql for p in ("who are you", "what are you", "are you ", "what model", "what ai")):
423
+ return "identity"
424
+ if any(p in ql for p in ("calculate", "solve", "what is ", "how much", "convert")):
425
+ return "math"
426
+ if any(p in ql for p in ("write ", "generate ", "create ", "make me", "compose")):
427
+ return "creative"
428
+ return "off_topic"
429
+
430
+ def _complexity(self, query: str) -> str:
431
+ """Classify query complexity."""
432
+ words = query.split()
433
+ n = len(words)
434
+ if n == 0:
435
+ return "empty"
436
+ if n == 1:
437
+ return "vague"
438
+ if n <= 4:
439
  return "simple"
440
+ if n <= 12:
441
  return "medium"
442
+ return "complex"
443
+
444
+ def _result(
 
445
  self,
446
  intent: str,
447
  confidence: float,
448
  method: str,
449
+ t0: float,
450
  complexity: str,
451
+ sub_type: str,
452
+ metadata: Optional[Dict] = None,
453
  ) -> IntentResult:
454
+ ms = (time.time() - t0) * 1000
455
+ self._metrics["total"] += 1
456
+ self._metrics["by_intent"][intent] = self._metrics["by_intent"].get(intent, 0) + 1
457
+ self._metrics["by_method"][method] = self._metrics["by_method"].get(method, 0) + 1
458
+ self._metrics["total_ms"] += ms
459
+
460
+ logger.debug(
461
+ f"Intent={intent} conf={confidence:.2f} method={method} "
462
+ f"sub={sub_type} complexity={complexity} time={ms:.1f}ms"
463
+ )
464
+
465
+ return IntentResult(
 
466
  intent=intent,
467
  confidence=confidence,
468
  method=method,
469
+ inference_time_ms=ms,
470
  query_complexity=complexity,
471
+ sub_type=sub_type,
472
+ should_use_live=(intent == "NEWS_TEMPORAL"),
473
+ should_use_db=(intent in ("NEWS_TEMPORAL", "NEWS_HISTORICAL", "NEWS_GENERAL")),
474
+ metadata=metadata or {},
475
  )
476
+
477
+ def _load_deberta(self):
478
+ """Lazy-load DeBERTa (thread-safe)."""
479
+ if self._pipe is not None or self._load_failed:
480
+ return
481
+ with self._lock:
482
+ if self._pipe is not None or self._load_failed:
483
+ return
484
+ try:
485
+ from transformers import pipeline
486
+ logger.info(f"Loading DeBERTa: {self.MODEL_NAME}")
487
+ self._pipe = pipeline(
488
+ "zero-shot-classification",
489
+ model=self.MODEL_NAME,
490
+ device=-1,
491
+ multi_label=False,
492
+ )
493
+ logger.info("✅ DeBERTa loaded")
494
+ except Exception as e:
495
+ logger.error(f"DeBERTa load failed: {e}")
496
+ self._load_failed = True
497
+
498
+ def _deberta_classify(self, query: str) -> Optional[Dict[str, Any]]:
499
+ """DeBERTa zero-shot classification for ambiguous queries."""
500
+ result = self._pipe(
501
+ query,
502
+ candidate_labels=[
503
+ "current news, breaking news, today's events, latest updates",
504
+ "historical events, background, context, past analysis",
505
+ "general news, politics, economy, society, Africa",
506
+ "personal question, identity, math, creative writing, off-topic",
507
+ ],
508
+ hypothesis_template="This text is about {}.",
509
  )
510
+ top_label = result["labels"][0]
511
+ top_score = float(result["scores"][0])
512
+
513
+ if top_score < 0.35:
514
+ return None # Too uncertain, let default handle it
515
+
516
+ if "current" in top_label or "breaking" in top_label or "latest" in top_label:
517
+ intent = "NEWS_TEMPORAL"
518
+ elif "historical" in top_label or "background" in top_label:
519
+ intent = "NEWS_HISTORICAL"
520
+ elif "general news" in top_label or "politics" in top_label:
521
+ intent = "NEWS_GENERAL"
522
+ else:
523
+ intent = "OTHER"
524
+
525
+ return {
526
+ "intent": intent,
527
+ "confidence": top_score,
528
+ "metadata": {
529
+ "top_label": top_label,
530
+ "scores": dict(zip(result["labels"], result["scores"])),
531
+ },
532
+ }
533
+
534
  def get_metrics(self) -> Dict[str, Any]:
535
+ total = self._metrics["total"] or 1
536
+ return {
537
+ **self._metrics,
538
+ "avg_ms": self._metrics["total_ms"] / total,
539
  }
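The Stage 4 scoring in `_keyword_score` can be tried in isolation. The following is a minimal sketch, assuming small hypothetical keyword sets (`KW_HIGH`, `KW_MED`, `KW_LOW` stand in for the module's `_KW_NEWS_HIGH`, `_KW_NEWS_MED`, `_KW_NEWS_LOW`, whose full contents are not reproduced here). It uses the same weights and cap as the commit. The `whole_words` switch is an addition for illustration only: it shows how the substring test `kw in ql` can fire inside a longer word, and one possible way to avoid that.

```python
import re

# Hypothetical stand-ins for _KW_NEWS_HIGH / _KW_NEWS_MED / _KW_NEWS_LOW.
KW_HIGH = {"ethiopia", "tigray"}
KW_MED = {"election", "government"}
KW_LOW = {"region", "zone"}

# (keyword set, weight) tiers, using the weights from the commit.
WEIGHTS = ((KW_HIGH, 0.25), (KW_MED, 0.12), (KW_LOW, 0.05))


def keyword_score(text: str, whole_words: bool = False) -> float:
    """Sum the tier weight of every keyword found in `text`, capped at 1.0.

    whole_words=False mirrors the substring test used in the commit;
    whole_words=True (an assumption of this sketch) matches tokens only.
    """
    ql = text.lower()
    tokens = set(re.findall(r"[a-z']+", ql))
    score = 0.0
    for keywords, weight in WEIGHTS:
        for kw in keywords:
            hit = (kw in tokens) if whole_words else (kw in ql)
            if hit:
                score += weight
    return min(score, 1.0)


if __name__ == "__main__":
    # Two high-weight hits plus one medium hit clear the 0.40 threshold.
    print(keyword_score("Ethiopia Tigray election"))
    # "zone" is a substring of "ozone", so the substring test scores it.
    print(keyword_score("ozone layer"), keyword_score("ozone layer", whole_words=True))
```

Token matching as written only works for single-word keywords; multi-word phrases would need a regex with word boundaries instead.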
540
 
541
 
542
+ # ═══════════════════════════════════════════════════════════════════════════════
543
+ # SINGLETONS
544
+ # ═══════════════════════════════════════════════════════════════════════════════
545
 
 
546
  intent_classifier_v2 = IntentClassifierV2()
547
 
548
 
 
 
 
 
549
  class IntentClassifier:
550
+ """Backward-compatible binary wrapper (NEWS / OTHER)."""
551
+
 
 
 
552
  def __init__(self):
553
+ self._v2 = intent_classifier_v2
554
+
555
  def classify(self, query: str) -> str:
556
+ result = self._v2.classify(query)
557
+ return "OTHER" if result.intent == "OTHER" else "NEWS"
558
 
559
 
 
560
  intent_classifier = IntentClassifier()
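To make the control flow easier to follow, here is a minimal, self-contained sketch of the early-return cascade and the safe default, followed by the binary collapse done by the backward-compatible wrapper. It is not the classifier from the commit: `EXACT_OTHER` and `TEMPORAL_PREFIXES` are tiny hypothetical tables, the regex, keyword, and DeBERTa stages are omitted, and `classify_sketch` and `to_binary` are names invented for this example.

```python
from typing import Tuple

# Hypothetical, tiny stand-ins for the module-level lookup tables.
EXACT_OTHER = {"hi", "hello", "thanks"}
TEMPORAL_PREFIXES = ("latest news", "news today")


def classify_sketch(query: str) -> Tuple[str, float, str]:
    """Return (intent, confidence, method); the first matching stage wins."""
    ql = query.strip().lower()

    # Stage 1: exact match on the whole normalized query.
    if ql in EXACT_OTHER:
        return ("OTHER", 1.0, "exact")

    # Stage 2: prefix rules (str.startswith accepts a tuple of prefixes).
    if ql.startswith(TEMPORAL_PREFIXES):
        return ("NEWS_TEMPORAL", 0.97, "prefix")

    # Safe default: two or more unknown words are searched as general news.
    if len(ql.split()) >= 2:
        return ("NEWS_GENERAL", 0.50, "default")

    # A single unknown word (or an empty query) is treated as OTHER.
    return ("OTHER", 0.60, "default")


def to_binary(intent: str) -> str:
    """Collapse the four intents to NEWS / OTHER, as the wrapper does."""
    return "OTHER" if intent == "OTHER" else "NEWS"


if __name__ == "__main__":
    for q in ("Hello", "latest news on Tigray", "coffee prices rising", "xyzzy"):
        intent, conf, method = classify_sketch(q)
        print(f"{q!r}: {intent} ({conf:.2f}, {method}) -> {to_binary(intent)}")
```

Ordering the cheapest checks first means the expensive model call is only reached when every rule-based stage has declined to answer, which is the design the commit message describes.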