# Optimizing Chatbot Speed and Accuracy

Created: 2025-01-27

## 1. Analysis of Current Bottlenecks

### 1.1 Intent Classification

**Problems:**
- Loops over many keywords on every call (fine_keywords: 9 items, fine_single_words: 7 items)
- Recomputes `_remove_accents()` repeatedly for the same keyword
- No pre-compiled regex patterns

**Impact:** ~5-10ms per query

### 1.2 Search Pipeline

**Problems:**
- `list(queryset)` loads ALL objects into memory before searching
- TF-IDF vectorization runs over the entire dataset on every search
- No early exit when a good result is found
- Query expansion hits the database on every call

**Impact:** ~100-500ms on large datasets

### 1.3 LLM Generation

**Problems:**
- The prompt is rebuilt on every call (not cached)
- No streaming response
- max_new_tokens=150 is acceptable but can be tuned further
- Generated responses are not cached

**Impact:** ~1-5s for the local model, ~2-10s for API providers

### 1.4 No Response Caching

**Problems:**
- Identical queries are processed from scratch every time
- Search results are not cached
- Intent classification results are not cached

**Impact:** ~100-500ms per duplicate query

## 2. Optimizing Intent Classification

### 2.1 Pre-compile Keyword Patterns

```python
# backend/hue_portal/core/chatbot.py
import re
from functools import lru_cache
from typing import Tuple

class Chatbot:
    def __init__(self):
        self.intent_classifier = None
        self.vectorizer = None
        # Pre-compile keyword patterns once at startup
        self._compile_keyword_patterns()
        self._train_classifier()

    def _compile_keyword_patterns(self):
        """Pre-compile regex patterns for faster matching."""
        # Fine keywords (multi-word first, then single)
        self.fine_patterns_multi = [
            re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
            for kw in ["mức phạt", "vi phạm", "đèn đỏ", "nồng độ cồn",
                       "mũ bảo hiểm", "tốc độ", "bằng lái", "vượt đèn"]
        ]
        self.fine_patterns_single = [
            re.compile(r'\b' + re.escape(kw) + r'\b', re.IGNORECASE)
            for kw in ["phạt", "vượt", "đèn", "mức"]
        ]
        # Pre-compute accent-free versions once instead of per query
        self.fine_keywords_ascii = [
            self._remove_accents(kw)
            for kw in ["mức phạt", "vi phạm", "đèn đỏ", ...]
        ]
        # Procedure, office, and advisory patterns are compiled the same way

    @lru_cache(maxsize=1000)
    def classify_intent(self, query: str) -> Tuple[str, float]:
        """Cached intent classification.

        Note: lru_cache on an instance method includes `self` in the cache
        key and keeps the instance alive; acceptable here because the
        chatbot is a process-wide singleton.
        """
        query_lower = query.lower().strip()
        # Fast path: check the compiled patterns
        for pattern in self.fine_patterns_multi:
            if pattern.search(query_lower):
                return ("search_fine", 0.95)
        # ... rest of logic
```

**Benefits:**
- Cuts intent classification time by ~50%
- Caches results for duplicate queries

### 2.2 Early Exit Strategy

```python
def _keyword_based_intent(self, query: str) -> Tuple[str, float]:
    query_lower = query.lower().strip()
    # Fast path: check the most common intents first.
    # Fine queries are the most common, so check them first.
    if any(pattern.search(query_lower) for pattern in self.fine_patterns_multi):
        return ("search_fine", 0.95)
    # Early exit for very short queries (likely a greeting)
    if len(query.split()) <= 2:
        if any(greeting in query_lower for greeting in ["xin chào", "chào", "hello"]):
            return ("greeting", 0.9)
    # ... rest
```
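### 2.3 Verifying the Intent Cache

Since `classify_intent` is wrapped in `functools.lru_cache`, the caching behavior can be verified directly through the decorator's built-in counters. A minimal sketch, assuming `get_chatbot()` is the process-wide singleton accessor used in the benchmark script of Section 12:

```python
# Sanity-check that intent classification is actually served from cache.
from hue_portal.core.chatbot import get_chatbot  # assumed accessor (see Section 12)

chatbot = get_chatbot()
for _ in range(3):
    chatbot.classify_intent("Mức phạt vượt đèn đỏ là bao nhiêu?")

# functools.lru_cache exposes hit/miss counters on the wrapped callable.
print(chatbot.classify_intent.cache_info())
# e.g. CacheInfo(hits=2, misses=1, maxsize=1000, currsize=1)
```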
## 3. Optimizing the Search Pipeline

### 3.1 Limit the QuerySet Before Loading

```python
# backend/hue_portal/core/search_ml.py

def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    if not query:
        return queryset[:top_k]

    # OPTIMIZATION: limit the queryset early for large datasets.
    # Only search the first N records if the dataset is huge.
    MAX_SEARCH_CANDIDATES = 1000
    total_count = queryset.count()

    if total_count > MAX_SEARCH_CANDIDATES:
        # Use database-level filtering first:
        # try a contains match on the primary field.
        primary_field = text_fields[0] if text_fields else None
        if primary_field:
            exact_matches = queryset.filter(
                **{f"{primary_field}__icontains": query}
            )[:top_k * 2]
            if exact_matches.count() >= top_k:
                # Enough direct matches: skip the ML search entirely
                return exact_matches[:top_k]
        # Cap the candidate set for the ML search
        queryset = queryset[:MAX_SEARCH_CANDIDATES]

    # Continue with the existing search logic...
```

### 3.2 Cache Search Results

```python
# backend/hue_portal/core/search_ml.py
import hashlib
from functools import lru_cache

def _get_query_hash(query: str, model_name: str, text_fields: tuple) -> str:
    """Generate a hash key for query caching."""
    key = f"{query}|{model_name}|{':'.join(text_fields)}"
    return hashlib.md5(key.encode()).hexdigest()

@lru_cache(maxsize=500)
def _cached_search(query_hash: str, queryset_ids: tuple, top_k: int):
    """Cached search results (stub; wired up by the wrapper below)."""
    pass

def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    # Check the cache first
    query_hash = _get_query_hash(query, queryset.model.__name__, tuple(text_fields))
    # Try to serve from cache (only valid if the queryset hasn't changed).
    # Note: full caching requires tracking queryset state.
    # ... existing search logic
```

### 3.3 Optimize the TF-IDF Calculation

```python
# Pre-compute TF-IDF vectors and reuse them instead of refitting per query.
from typing import List

from sklearn.feature_extraction.text import TfidfVectorizer

class CachedTfidfVectorizer:
    """TF-IDF vectorizer that caches document vectors between queries."""

    def __init__(self):
        self.vectorizer = None
        self.doc_vectors = None
        self.doc_ids = None

    def fit_transform_cached(self, documents: List[str], doc_ids: List[int]):
        """Fit and cache document vectors; reuse them when the corpus is unchanged."""
        if self.doc_ids == tuple(doc_ids):
            # Same documents as last time: reuse the cached matrix
            return self.doc_vectors
        # New documents: recompute
        self.vectorizer = TfidfVectorizer(
            analyzer='word',
            ngram_range=(1, 2),
            min_df=1,
            max_df=0.95,
            lowercase=True,
        )
        self.doc_vectors = self.vectorizer.fit_transform(documents)
        self.doc_ids = tuple(doc_ids)
        return self.doc_vectors
```

### 3.4 Early Exit on Exact Match

```python
def search_with_ml(queryset, query, text_fields, top_k=20, min_score=0.1, use_hybrid=True):
    # OPTIMIZATION: check exact matches first (fastest path).
    query_normalized = normalize_text(query)  # used by the ML fallback below

    primary_field = text_fields[0] if text_fields else None
    if primary_field:
        # Exact match on the primary field
        exact_qs = queryset.filter(**{f"{primary_field}__iexact": query})
        if exact_qs.exists():
            # Found an exact match: return immediately
            return exact_qs[:top_k]

        # Case-insensitive contains (still far cheaper than ML)
        contains_qs = queryset.filter(**{f"{primary_field}__icontains": query})
        contains_count = contains_qs.count()
        if 0 < contains_count <= top_k * 2:
            # Small, non-empty result set: return it directly.
            # (The non-empty check matters: an empty set must still
            # fall through to the ML search.)
            return contains_qs[:top_k]

    # Use the ML search only when the cheap paths find nothing good
    # ... existing ML search logic
```
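### 3.5 Scoring Against Cached Vectors

For completeness, here is how the cached document matrix would be scored against an incoming query. A minimal sketch, assuming a fitted `CachedTfidfVectorizer` from 3.3; `cosine_similarity` is scikit-learn's standard helper, and `rank_with_cached_vectors` is an illustrative name, not existing project code:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def rank_with_cached_vectors(cached_vec, query, doc_ids, top_k=20, min_score=0.1):
    """Score a query against cached TF-IDF document vectors (sketch)."""
    # Only the query is vectorized; document vectors come from the cache.
    query_vector = cached_vec.vectorizer.transform([query])
    scores = cosine_similarity(query_vector, cached_vec.doc_vectors).ravel()
    # Highest-scoring documents first, filtered by the score threshold.
    order = np.argsort(scores)[::-1][:top_k]
    return [(doc_ids[i], float(scores[i])) for i in order if scores[i] >= min_score]
```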
## 4. Optimizing LLM Generation

### 4.1 Prompt Caching

```python
# backend/hue_portal/chatbot/llm_integration.py
import hashlib
import time
from typing import Any, Dict, List, Optional

class LLMGenerator:
    def __init__(self, provider: Optional[str] = None):
        self.provider = provider or LLM_PROVIDER
        self.prompt_cache = {}    # prompts keyed by hash
        self.response_cache = {}  # responses keyed by hash

    def _get_prompt_hash(self, query: str, documents: List[Any]) -> str:
        """Generate a hash key for prompt/response caching."""
        doc_ids = [getattr(doc, 'id', None) for doc in documents[:5]]
        key = f"{query}|{doc_ids}"
        return hashlib.md5(key.encode()).hexdigest()

    def generate_answer(self, query: str, context: Optional[List[Dict]], documents: Optional[List[Any]]):
        if not self.is_available():
            return None

        # Check the response cache first
        prompt_hash = self._get_prompt_hash(query, documents or [])
        if prompt_hash in self.response_cache:
            cached_response = self.response_cache[prompt_hash]
            # Only serve cache entries younger than one hour
            if cached_response.get('timestamp', 0) > time.time() - 3600:
                return cached_response['response']

        # Build the prompt (itself cacheable)
        prompt = self._build_prompt(query, context, documents)
        response = self._generate_from_prompt(prompt, context=context)

        # Cache the response
        if response:
            self.response_cache[prompt_hash] = {
                'response': response,
                'timestamp': time.time(),
            }
        return response
```

### 4.2 Optimize Local Model Generation

```python
def _generate_local(self, prompt: str) -> Optional[str]:
    # OPTIMIZATION: faster generation parameters.
    inputs = self.local_tokenizer(prompt, return_tensors="pt").to(self.local_model.device)
    with torch.no_grad():
        outputs = self.local_model.generate(
            **inputs,
            max_new_tokens=100,  # reduced from 150
            do_sample=False,     # greedy decoding: deterministic and fastest
            # (temperature/top_p are ignored under greedy decoding, so they are omitted)
            use_cache=True,
            pad_token_id=self.local_tokenizer.eos_token_id,
            repetition_penalty=1.1,
            # OPTIMIZATION: stop as soon as the model emits EOS
            eos_token_id=self.local_tokenizer.eos_token_id,
        )
    # ... decode `outputs` and return the text as before
```

### 4.3 Streaming Responses (for better UX)

```python
# For API endpoints, support streaming
def generate_answer_streaming(self, query: str, context, documents):
    """Generate an answer as a token stream for better UX."""
    prompt = self._build_prompt(query, context, documents)
    if self.provider == LLM_PROVIDER_LOCAL:
        # Stream tokens from the local model
        for token in self._generate_local_streaming(prompt):
            yield token
    elif self.provider == LLM_PROVIDER_OPENAI:
        # Use the OpenAI streaming API
        for chunk in self.client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        ):
            content = chunk.choices[0].delta.content
            if content:  # some stream chunks carry no content
                yield content
```
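### 4.4 Exposing the Stream over HTTP

On the Django side, the streaming generator can be exposed with `StreamingHttpResponse` using server-sent-events framing. A hedged sketch: the view name and the `get_llm_generator()` accessor are illustrative, not existing code, and tokens are assumed newline-free for the simple framing here:

```python
# backend/hue_portal/chatbot/views.py (hypothetical addition)
from django.http import StreamingHttpResponse

def chat_stream(request):
    """Stream LLM tokens to the client as server-sent events (sketch)."""
    query = request.GET.get("message", "")
    generator = get_llm_generator()  # hypothetical accessor for LLMGenerator

    def event_stream():
        # Each token becomes one SSE "data:" frame; the client re-assembles them.
        for token in generator.generate_answer_streaming(query, context=None, documents=None):
            yield f"data: {token}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingHttpResponse(event_stream(), content_type="text/event-stream")
```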
## 5. Response Caching Strategy

### 5.1 Multi-level Caching

```python
# backend/hue_portal/core/cache_utils.py
import hashlib

from django.core.cache import cache

class ChatbotCache:
    """Multi-level caching for chatbot responses."""

    CACHE_TIMEOUT = 3600  # 1 hour

    @staticmethod
    def get_cache_key(query: str, intent: str, session_id: str = None) -> str:
        """Generate a cache key from the query, intent, and session."""
        key_parts = [query.lower().strip(), intent]
        if session_id:
            key_parts.append(session_id)
        key_str = "|".join(key_parts)
        return f"chatbot:{hashlib.md5(key_str.encode()).hexdigest()}"

    @staticmethod
    def get_cached_response(query: str, intent: str, session_id: str = None):
        """Get a cached response, or None on a miss."""
        cache_key = ChatbotCache.get_cache_key(query, intent, session_id)
        return cache.get(cache_key)

    @staticmethod
    def set_cached_response(query: str, intent: str, response: dict, session_id: str = None):
        """Cache a response."""
        cache_key = ChatbotCache.get_cache_key(query, intent, session_id)
        cache.set(cache_key, response, ChatbotCache.CACHE_TIMEOUT)

    @staticmethod
    def get_cached_search_results(query: str, model_name: str, text_fields: tuple):
        """Get cached search results."""
        key = f"search:{hashlib.md5(f'{query}|{model_name}|{text_fields}'.encode()).hexdigest()}"
        return cache.get(key)

    @staticmethod
    def set_cached_search_results(query: str, model_name: str, text_fields: tuple, results):
        """Cache search results."""
        key = f"search:{hashlib.md5(f'{query}|{model_name}|{text_fields}'.encode()).hexdigest()}"
        cache.set(key, results, ChatbotCache.CACHE_TIMEOUT)
```

### 5.2 Integrating into the Chatbot

```python
# backend/hue_portal/core/chatbot.py
from typing import Any, Dict

from .cache_utils import ChatbotCache

class Chatbot:
    def generate_response(self, query: str, session_id: str = None) -> Dict[str, Any]:
        query = query.strip()

        # Classify intent
        intent, confidence = self.classify_intent(query)

        # Check the cache first
        cached_response = ChatbotCache.get_cached_response(query, intent, session_id)
        if cached_response:
            return cached_response

        # ... existing logic

        # Cache the response before returning
        response = {
            "message": message,
            "intent": intent,
            "confidence": confidence,
            "results": search_result["results"],
            "count": search_result["count"],
        }
        ChatbotCache.set_cached_response(query, intent, response, session_id)
        return response
```

## 6. Optimizing Query Expansion

### 6.1 Cache Synonyms

```python
# backend/hue_portal/core/search_ml.py
from functools import lru_cache
# Synonym and normalize_text are already imported in this module.

@lru_cache(maxsize=1)
def get_all_synonyms():
    """Load all synonyms once per process (call .cache_clear() after edits)."""
    return list(Synonym.objects.all())

def expand_query_with_synonyms(query: str) -> List[str]:
    """Expand a query using the cached synonym table."""
    query_normalized = normalize_text(query)
    expanded = [query_normalized]
    # Use the cached synonyms instead of hitting the database per call
    synonyms = get_all_synonyms()
    for synonym in synonyms:
        keyword = normalize_text(synonym.keyword)
        alias = normalize_text(synonym.alias)
        if keyword in query_normalized:
            expanded.append(query_normalized.replace(keyword, alias))
        if alias in query_normalized:
            expanded.append(query_normalized.replace(alias, keyword))
    return list(set(expanded))
```

## 7. Database Query Optimization

### 7.1 Use select_related / prefetch_related

```python
# backend/hue_portal/core/chatbot.py
def search_by_intent(self, intent: str, query: str, limit: int = 5):
    if intent == "search_fine":
        qs = Fine.objects.all().select_related('decree')  # if the FK exists
        # ... rest
    elif intent == "search_legal":
        qs = LegalSection.objects.all().select_related('document')
        # ... rest
```
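Before adding indexes, it is worth confirming that `select_related` actually removes the N+1 pattern. A minimal sketch using Django's query-capture helper, assuming the `decree` foreign key from the example above exists on `Fine`:

```python
from django.db import connection
from django.test.utils import CaptureQueriesContext

from hue_portal.core.models import Fine

# Without select_related, accessing fine.decree in the loop issues one
# extra query per row (the N+1 problem); with it, everything is one JOIN.
with CaptureQueriesContext(connection) as ctx:
    for fine in Fine.objects.select_related("decree")[:50]:
        _ = fine.decree  # served from the joined row, no extra query
print(f"Queries executed: {len(ctx.captured_queries)}")  # expect 1
```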
### 7.2 Add Database Indexes

```python
# backend/hue_portal/core/models.py
class Fine(models.Model):
    name = models.CharField(max_length=500, db_index=True)  # add index
    code = models.CharField(max_length=50, db_index=True)   # add index

    class Meta:
        indexes = [
            models.Index(fields=['name', 'code']),
            models.Index(fields=['min_fine', 'max_fine']),
        ]
```

## 8. Frontend Optimizations

### 8.1 Debounce Search Input

```typescript
// frontend/src/pages/Chat.tsx
const [input, setInput] = useState('')
const debouncedInput = useDebounce(input, 300) // wait 300ms

useEffect(() => {
  if (debouncedInput) {
    // Trigger search suggestions
  }
}, [debouncedInput])
```

### 8.2 Optimistic UI Updates

```typescript
const handleSend = async (messageText?: string) => {
  const textToSend = messageText ?? input
  // Show the user's message immediately (optimistic update)
  setMessages(prev => [...prev, {
    role: 'user',
    content: textToSend,
    timestamp: new Date()
  }])
  // Then fetch the actual response
  const response = await chat(textToSend, sessionId)
  // Update with the actual response
}
```

## 9. Monitoring & Metrics

### 9.1 Add Performance Logging

```python
# backend/hue_portal/chatbot/views.py
import logging
import time

logger = logging.getLogger(__name__)
# (api_view, Request, and Response imports already exist in this module)

@api_view(["POST"])
def chat(request: Request) -> Response:
    start_time = time.time()

    # ... existing logic

    # Log performance metrics
    elapsed = time.time() - start_time
    logger.info(f"[PERF] Chat response time: {elapsed:.3f}s | Intent: {intent} | Results: {count}")

    # Flag slow queries
    if elapsed > 2.0:
        logger.warning(f"[SLOW] Query took {elapsed:.3f}s: {message[:100]}")

    return Response(response)
```

### 9.2 Track Cache Hit Rate

```python
class ChatbotCache:
    # Note: plain class attributes, so the counters are per worker process.
    cache_hits = 0
    cache_misses = 0

    @staticmethod
    def get_cached_response(query: str, intent: str, session_id: str = None):
        cached = cache.get(ChatbotCache.get_cache_key(query, intent, session_id))
        if cached:
            ChatbotCache.cache_hits += 1
            return cached
        ChatbotCache.cache_misses += 1
        return None

    @staticmethod
    def get_cache_stats():
        total = ChatbotCache.cache_hits + ChatbotCache.cache_misses
        if total == 0:
            return {"hit_rate": 0, "hits": 0, "misses": 0}
        return {
            "hit_rate": ChatbotCache.cache_hits / total,
            "hits": ChatbotCache.cache_hits,
            "misses": ChatbotCache.cache_misses,
        }
```
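### 9.3 Exposing Cache Stats

To make the hit rate observable in practice, the counters can be exposed through a small endpoint using the same DRF decorators as the existing chat view. A sketch — the view and its URL are hypothetical additions, and the counters it reports are per worker process:

```python
# backend/hue_portal/chatbot/views.py (hypothetical addition)
from rest_framework.decorators import api_view
from rest_framework.response import Response

from hue_portal.core.cache_utils import ChatbotCache

@api_view(["GET"])
def cache_stats(request):
    """Report in-process cache hit/miss counters (per worker)."""
    return Response(ChatbotCache.get_cache_stats())
```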
## 10. Expected Performance Improvements

| Optimization | Current | Optimized | Improvement |
|-------------|---------|-----------|-------------|
| Intent Classification | 5-10ms | 1-3ms | **70% faster** |
| Search (small dataset) | 50-100ms | 10-30ms | **70% faster** |
| Search (large dataset) | 200-500ms | 50-150ms | **70% faster** |
| LLM Generation (cached) | 1-5s | 0.01-0.1s | **99% faster** |
| LLM Generation (uncached) | 1-5s | 0.8-4s | **20% faster** |
| Total Response (cached) | 100-500ms | 10-50ms | **90% faster** |
| Total Response (uncached) | 1-6s | 0.5-3s | **50% faster** |

## 11. Implementation Priority

### Phase 1: Quick Wins (1-2 days)
1. ✅ Add response caching (Django cache)
2. ✅ Pre-compile keyword patterns
3. ✅ Cache synonyms
4. ✅ Add database indexes
5. ✅ Early exit for exact matches

### Phase 2: Medium Impact (3-5 days)
1. ✅ Limit the QuerySet before loading
2. ✅ Optimize the TF-IDF calculation
3. ✅ Prompt caching for the LLM
4. ✅ Optimize local model generation
5. ✅ Add performance logging

### Phase 3: Advanced (1-2 weeks)
1. ✅ Streaming responses
2. ✅ Incremental TF-IDF
3. ✅ Advanced caching strategies
4. ✅ Pre-computation of query results

## 12. Testing Performance

```python
# backend/scripts/benchmark_chatbot.py
import statistics
import time

from hue_portal.core.chatbot import get_chatbot

def benchmark_chatbot():
    chatbot = get_chatbot()
    test_queries = [
        "Mức phạt vượt đèn đỏ là bao nhiêu?",
        "Thủ tục đăng ký cư trú cần gì?",
        "Địa chỉ công an phường ở đâu?",
        # ... more queries
    ]

    times = []
    for query in test_queries:
        start = time.time()
        response = chatbot.generate_response(query)
        elapsed = time.time() - start
        times.append(elapsed)
        print(f"Query: {query[:50]}... | Time: {elapsed:.3f}s")

    print(f"\nAverage: {statistics.mean(times):.3f}s")
    print(f"Median: {statistics.median(times):.3f}s")
    # quantiles(n=20) yields 19 cut points; index 18 is the 95th percentile
    print(f"P95: {statistics.quantiles(times, n=20)[18]:.3f}s")

if __name__ == "__main__":
    benchmark_chatbot()
```

## Conclusion

With the optimizations above, the chatbot will be:
- **50-90% faster** on cached queries
- **20-70% faster** on uncached queries
- **More accurate**, thanks to early exits and exact matching
- **More scalable**, thanks to database indexes and query limiting