| # MnemoCore Pattern Learner: Specification Draft | |
| **Version:** 0.1-draft | |
| **Date:** 2026-02-20 | |
| **Status:** Draft for Review | |
| **Author:** Omega (GLM-5) for Robin Granberg | |
| --- | |
| ## Executive Summary | |
| Pattern Learner is a MnemoCore module that learns from user interactions **without storing personal data**. It extracts statistical patterns, topic clusters, and quality metrics that can be used to improve chatbot performance over time. | |
| **Key principle:** Learn patterns, forget people. | |
| --- | |
| ## Problem Statement | |
| ### Healthcare Chatbot Challenges | |
| | Challenge | Consequence | | |
| |-----------|-------------| | |
| | GDPR/HIPAA compliance | Conversations cannot be stored | | |
| | Multitenancy | Data must not leak between clinics | | |
| | Quality improvement | Need to know what works | | |
| | Knowledge gaps | Need to identify what is missing from the docs | | |
| ### Current Solutions (Limitations) | |
| - **Stateless RAG:** No learning at all | |
| - **Full memory:** GDPR risk, confidentiality problems | |
| - **Manual analytics:** Time-consuming, not real-time | |
| --- | |
| ## Solution: Pattern Learner | |
| ### Core Concept | |
| ``` | |
| User Query ──► Anonymize ──► Extract Pattern ──► Aggregate | |
|                    │ | |
|                    └── PII removed before storage | |
| ``` | |
| **What IS stored:** | |
| - Topic clusters (anonymized) | |
| - Query frequency distributions | |
| - Response quality aggregates | |
| - Knowledge gap indicators | |
| **What is NOT stored:** | |
| - User identities | |
| - Clinic associations | |
| - Patient data | |
| - Raw conversations | |
| --- | |
| ## Architecture | |
| ### High-Level Design | |
| ``` | |
| ┌─────────────────────────────────────────────────────────────┐ | |
| │                   Pattern Learner Module                    │ | |
| ├─────────────────────────────────────────────────────────────┤ | |
| │                                                             │ | |
| │ ┌───────────────┐    ┌───────────────┐    ┌───────────────┐ │ | |
| │ │  Anonymizer   │───►│Topic Extractor│───►│  Aggregator   │ │ | |
| │ └───────────────┘    └───────────────┘    └───────────────┘ │ | |
| │         │                    │                    │         │ | |
| │         │                    ▼                    ▼         │ | |
| │         │            ┌───────────────┐    ┌───────────────┐ │ | |
| │         │            │Topic Embedder │    │  Stats Store  │ │ | |
| │         │            │  (MnemoCore)  │    │  (Encrypted)  │ │ | |
| │         │            └───────────────┘    └───────────────┘ │ | |
| │         │                    │                    │         │ | |
| │         └────────────────────┴────────────────────┘         │ | |
| │                              │                              │ | |
| │                              ▼                              │ | |
| │                      ┌───────────────┐                      │ | |
| │                      │ Insights API  │                      │ | |
| │                      └───────────────┘                      │ | |
| │                                                             │ | |
| └─────────────────────────────────────────────────────────────┘ | |
| ``` | |
| ### Components | |
| #### 1. Anonymizer | |
| **Purpose:** Remove all PII before processing | |
| **Methods:** | |
| - Named Entity Recognition (NER) for person names | |
| - Pattern matching for phone numbers, addresses | |
| - Clinic/organization detection | |
| - Session ID hashing | |
| ```python | |
| import re | |
| from typing import Iterable | |
| class Anonymizer: | |
|     """Remove PII from queries before pattern extraction""" | |
|     def __init__(self, clinic_blacklist: Iterable[str] = ()): | |
|         self.ner_model = load_ner_model("sv")  # Swedish; loader is defined elsewhere | |
|         # Configurable list of clinic names to mask | |
|         self.clinic_blacklist = list(clinic_blacklist) | |
|         # personal_number is listed first because the phone pattern | |
|         # would otherwise also consume personal numbers. | |
|         self.patterns = { | |
|             "personal_number": r"\d{6,8}[-\s]?\d{4}", | |
|             "email": r"[\w\.-]+@[\w\.-]+\.\w+", | |
|             "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}", | |
|         } | |
|     def anonymize(self, text: str) -> str: | |
|         """Remove all PII from text""" | |
|         # 1. NER for person and organization names | |
|         entities = self.ner_model.extract(text) | |
|         for entity in entities: | |
|             if entity.type in ["PER", "ORG"]: | |
|                 text = text.replace(entity.text, "[ANON]") | |
|         # 2. Pattern matching (dict order is preserved) | |
|         for pattern_type, pattern in self.patterns.items(): | |
|             text = re.sub(pattern, f"[{pattern_type.upper()}]", text) | |
|         # 3. Remove clinic names (configurable blacklist) | |
|         for clinic_name in self.clinic_blacklist: | |
|             text = text.replace(clinic_name, "[KLINIK]") | |
|         return text | |
| ``` | |
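The method list above mentions session ID hashing, which the `Anonymizer` class does not show. A minimal sketch, assuming a keyed hash (HMAC-SHA-256) with a per-deployment secret; the function name `hash_session_id` and the `SALT` variable are hypothetical and not part of MnemoCore:

```python
import hashlib
import hmac
import secrets


def hash_session_id(session_id: str, salt: bytes) -> str:
    """Return a keyed SHA-256 digest (hex) of the session ID."""
    return hmac.new(salt, session_id.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical per-deployment secret; in practice it would come from a key vault.
SALT = secrets.token_bytes(32)

token_a = hash_session_id("session-123", SALT)
token_b = hash_session_id("session-123", SALT)
token_c = hash_session_id("session-456", SALT)
```

The same session always maps to the same token while the secret is unchanged, so queries within a session can be grouped without keeping the raw ID. A keyed hash is still a pseudonym rather than anonymous data, so whether the token may be persisted at all is a separate decision; discarding or rotating the secret makes old tokens unlinkable.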
| --- | |
| #### 2. Topic Extractor | |
| **Purpose:** Extract semantic topics from anonymized queries | |
| **Methods:** | |
| - Keyword extraction (TF-IDF) | |
| - Topic modeling (LDA, BERTopic) | |
| - Embedding-based clustering | |
| ```python | |
| from typing import List | |
| class TopicExtractor: | |
|     """Extract topics from anonymized queries""" | |
|     def __init__(self, mnemocore_engine): | |
|         self.engine = mnemocore_engine | |
|         self.topic_threshold = 0.5  # minimum similarity to reuse stored topics | |
|     async def extract_topics(self, query: str) -> List[str]: | |
|         """Extract topics from anonymized query""" | |
|         # 1. Get keywords | |
|         keywords = self._extract_keywords(query) | |
|         # 2. Find similar topics in MnemoCore | |
|         similar = await self.engine.query(query, top_k=5) | |
|         # 3. Cluster into topics | |
|         topics = [] | |
|         for memory_id, similarity in similar: | |
|             if similarity > self.topic_threshold: | |
|                 memory = await self.engine.get_memory(memory_id) | |
|                 topics.extend(memory.metadata.get("topics", [])) | |
|         # 4. Deduplicate while keeping order, so topics[0] is stable for callers | |
|         return list(dict.fromkeys(topics + keywords)) | |
|     def _extract_keywords(self, text: str) -> List[str]: | |
|         """Extract keywords with a length/stopword filter (placeholder for TF-IDF)""" | |
|         # Simple implementation; STOPWORDS_SV is a Swedish stopword set defined elsewhere | |
|         words = text.lower().split() | |
|         return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV] | |
| ``` | |
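The method list names TF-IDF, while `_extract_keywords` only filters by length and stopwords. A minimal pure-Python sketch of TF-IDF ranking follows. The function `tfidf_keywords`, the smoothed IDF formula, and the idea of scoring against a small reference corpus of anonymized queries are all assumptions for illustration, not the module's actual method:

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(query: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Rank the query's terms by TF-IDF against a reference corpus."""
    docs = [set(doc.lower().split()) for doc in corpus]
    terms = query.lower().split()
    tf = Counter(terms)
    scores: Dict[str, float] = {}
    for term, freq in tf.items():
        # Document frequency: in how many corpus entries the term occurs.
        df = sum(1 for doc in docs if term in doc)
        # Smoothed IDF: terms seen in many documents score lower.
        idf = math.log((1 + len(docs)) / (1 + df)) + 1.0
        scores[term] = (freq / len(terms)) * idf
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:top_n]


corpus = [
    "vad kostar ett implantat",
    "vad kostar en rotfyllning",
    "hur fungerar en rotfyllning",
]
keywords = tfidf_keywords("vad kostar tandreglering", corpus, top_n=1)
```

Here "tandreglering" ranks first because it appears in none of the reference queries, while "vad" and "kostar" appear in two of three.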
| --- | |
| #### 3. Aggregator | |
| **Purpose:** Store statistical patterns without PII | |
| **Data structures:** | |
| ```python | |
| from dataclasses import dataclass | |
| from datetime import datetime | |
| from typing import Optional | |
| @dataclass | |
| class TopicStats: | |
|     """Statistics for a topic""" | |
|     topic: str | |
|     count: int = 0 | |
|     first_seen: Optional[datetime] = None | |
|     last_seen: Optional[datetime] = None | |
|     trend: float = 0.0  # Recent increase/decrease | |
| @dataclass | |
| class ResponseQuality: | |
|     """Aggregated response quality (no individual ratings)""" | |
|     response_signature: str  # Hash of response template | |
|     avg_rating: float = 0.5  # Normalized to 0-1; 0.5 is a neutral starting value | |
|     sample_count: int = 0 | |
|     last_updated: Optional[datetime] = None | |
| @dataclass | |
| class KnowledgeGap: | |
|     """Topics with no good answers""" | |
|     topic: str | |
|     query_count: int = 0 | |
|     failure_rate: float = 1.0  # Fraction (0-1) of queries that got "I don't know" | |
|     suggested_action: str = ""  # "add documentation", "improve answer" | |
| ``` | |
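`TopicStats.trend` is declared above, but the draft does not say how it is computed (the time window is also listed as an open question). One possible definition, sketched under the assumption that per-day counts are kept for each topic: the relative change between the most recent window and the window before it. The function `compute_trend` and the `daily_counts` input are hypothetical.

```python
from typing import Sequence


def compute_trend(daily_counts: Sequence[int], window: int = 7) -> float:
    """Relative change in volume: last `window` days vs. the `window` days before.

    `daily_counts` is ordered oldest to newest. Returns 0.0 when there is
    no baseline to compare against.
    """
    recent = sum(daily_counts[-window:])
    previous = sum(daily_counts[-2 * window:-window])
    if previous == 0:
        return 0.0
    return (recent - previous) / previous


# 20 queries/day in the earlier week, 23 queries/day in the latest week.
trend = compute_trend([20] * 7 + [23] * 7, window=7)
```

With these counts the result is 0.15, meaning a 15% increase week over week. If fewer than two full windows of history exist, the baseline is partial or empty, so early values should be read with care.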
| **Storage:** | |
| ```python | |
| from datetime import datetime | |
| from typing import Dict | |
| class PatternStore: | |
|     """Store patterns (encrypted, no PII)""" | |
|     def __init__(self, encryption_key: bytes): | |
|         self.key = encryption_key  # reserved for encryption at rest (not shown here) | |
|         self.topics: Dict[str, TopicStats] = {} | |
|         self.qualities: Dict[str, ResponseQuality] = {} | |
|         self.gaps: Dict[str, KnowledgeGap] = {} | |
|     def record_topic(self, topic: str): | |
|         """Record that a topic was queried""" | |
|         if topic not in self.topics: | |
|             self.topics[topic] = TopicStats( | |
|                 topic=topic, | |
|                 first_seen=datetime.utcnow() | |
|             ) | |
|         stats = self.topics[topic] | |
|         stats.count += 1 | |
|         stats.last_seen = datetime.utcnow() | |
|     def record_quality(self, response_sig: str, rating: int): | |
|         """Record response quality (aggregated); rating is on a 1-5 scale""" | |
|         if response_sig not in self.qualities: | |
|             self.qualities[response_sig] = ResponseQuality( | |
|                 response_signature=response_sig | |
|             ) | |
|         q = self.qualities[response_sig] | |
|         # Exponential moving average over the rating normalized to 0-1 | |
|         q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0) | |
|         q.sample_count += 1 | |
|         q.last_updated = datetime.utcnow() | |
|     def record_gap(self, topic: str, had_answer: bool): | |
|         """Record knowledge gap""" | |
|         if topic not in self.gaps: | |
|             self.gaps[topic] = KnowledgeGap(topic=topic) | |
|         gap = self.gaps[topic] | |
|         gap.query_count += 1 | |
|         # Running mean of failures (1 = no answer, 0 = answered) | |
|         if not had_answer: | |
|             gap.failure_rate = (gap.failure_rate * (gap.query_count - 1) + 1) / gap.query_count | |
|         else: | |
|             gap.failure_rate = (gap.failure_rate * (gap.query_count - 1)) / gap.query_count | |
| ``` | |
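The moving average in `record_quality` starts from a neutral value of 0.5 and moves 10% of the way toward each new rating. A small sketch of how quickly it converges, reusing the same constants; the helper name `ema_update` is hypothetical:

```python
def ema_update(current: float, rating: int, alpha: float = 0.1) -> float:
    """One update step, mirroring record_quality (rating is on a 1-5 scale)."""
    return (1 - alpha) * current + alpha * (rating / 5.0)


avg = 0.5  # neutral starting value, as in ResponseQuality
history = []
for _ in range(20):
    avg = ema_update(avg, 5)  # twenty consecutive top ratings
    history.append(avg)
```

After one top rating the average is 0.55, and after twenty it is still only about 0.94. With few samples the starting value dominates, which is a reason to read `avg_rating` together with `sample_count`.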
| --- | |
| #### 4. Insights API | |
| **Purpose:** Provide actionable insights to admins/developers | |
| **Endpoints:** | |
| ```python | |
| # GET /insights/topics?top_k=10 | |
| { | |
| "topics": [ | |
| {"topic": "implantat", "count": 1250, "trend": 0.15}, | |
| {"topic": "rotfyllning", "count": 980, "trend": -0.02}, | |
| {"topic": "priser", "count": 850, "trend": 0.30} | |
| ], | |
| "period": "30d" | |
| } | |
| # GET /insights/gaps | |
| { | |
| "knowledge_gaps": [ | |
| { | |
| "topic": "tandreglering vuxna", | |
| "query_count": 145, | |
| "failure_rate": 0.85, | |
| "suggested_action": "add documentation" | |
| }, | |
| { | |
| "topic": "akut tandvård", | |
| "query_count": 89, | |
| "failure_rate": 0.72, | |
| "suggested_action": "improve answer" | |
| } | |
| ] | |
| } | |
| # GET /insights/quality (avg_rating shown on the 1-5 scale; PatternStore keeps it normalized to 0-1) | |
| { | |
| "top_responses": [ | |
| {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520}, | |
| {"signature": "def456", "avg_rating": 4.5, "sample_count": 340} | |
| ], | |
| "worst_responses": [ | |
| {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45} | |
| ] | |
| } | |
| ``` | |
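The topic and gap payloads above could be assembled directly from the `PatternStore` dictionaries. A minimal sketch, using trimmed copies of the dataclasses so it runs on its own. The function names `top_topics` and `knowledge_gaps` and the thresholds `min_queries` and `min_failure_rate` are assumptions, not part of the spec:

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class TopicStats:  # trimmed copy of the dataclass defined earlier
    topic: str
    count: int = 0
    trend: float = 0.0


@dataclass
class KnowledgeGap:  # trimmed copy of the dataclass defined earlier
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0
    suggested_action: str = ""


def top_topics(topics: Dict[str, TopicStats], top_k: int = 10) -> List[dict]:
    """Most frequently queried topics, highest count first."""
    ranked = sorted(topics.values(), key=lambda s: s.count, reverse=True)
    return [{"topic": s.topic, "count": s.count, "trend": s.trend}
            for s in ranked[:top_k]]


def knowledge_gaps(gaps: Dict[str, KnowledgeGap], min_queries: int = 20,
                   min_failure_rate: float = 0.5) -> List[dict]:
    """Topics asked often enough, and failing often enough, to report."""
    flagged = [g for g in gaps.values()
               if g.query_count >= min_queries
               and g.failure_rate >= min_failure_rate]
    flagged.sort(key=lambda g: g.failure_rate, reverse=True)
    return [{"topic": g.topic, "query_count": g.query_count,
             "failure_rate": g.failure_rate,
             "suggested_action": g.suggested_action} for g in flagged]


topics = {
    "priser": TopicStats("priser", count=850, trend=0.30),
    "implantat": TopicStats("implantat", count=1250, trend=0.15),
    "rotfyllning": TopicStats("rotfyllning", count=980, trend=-0.02),
}
gaps = {
    "tandreglering vuxna": KnowledgeGap("tandreglering vuxna", 145, 0.85,
                                        "add documentation"),
    "blekning": KnowledgeGap("blekning", 5, 1.0),  # too few queries to report
}
topics_payload = {"topics": top_topics(topics, top_k=2), "period": "30d"}
gaps_payload = {"knowledge_gaps": knowledge_gaps(gaps)}
```

The minimum query count keeps one-off questions out of the gap report; it also avoids exposing topics that only a handful of users asked about.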
| --- | |
| ## MnemoCore Integration | |
| ### Usage Pattern | |
| ```python | |
| from mnemocore import HAIMEngine | |
| from mnemocore.pattern_learner import PatternLearner | |
| # Initialize MnemoCore (stores topic embeddings) | |
| # Note: the awaits below must run inside an async context (e.g. an app startup hook) | |
| engine = HAIMEngine(dimension=16384) | |
| await engine.initialize() | |
| # Initialize Pattern Learner | |
| learner = PatternLearner( | |
|     engine=engine, | |
|     encryption_key=get_encryption_key(), | |
|     anonymizer=Anonymizer() | |
| ) | |
| # Process a query (automatic learning) | |
| async def handle_query(user_query: str, tenant_id: str): | |
|     # tenant_id is intentionally not passed to the learner (no clinic association is stored) | |
|     # 1. Anonymize | |
|     anon_query = learner.anonymize(user_query) | |
|     # 2. Extract patterns (no PII) | |
|     topics = await learner.extract_topics(anon_query) | |
|     # 3. Record topic usage | |
|     for topic in topics: | |
|         learner.record_topic(topic) | |
|     # 4. Get answer from RAG | |
|     answer = await rag_lookup(anon_query) | |
|     # 5. Record if we had an answer (attributed to the first topic only) | |
|     learner.record_gap( | |
|         topic=topics[0] if topics else "unknown", | |
|         had_answer=(answer is not None) | |
|     ) | |
|     return answer | |
| # Get insights (admin only) | |
| async def get_dashboard(): | |
|     top_topics = learner.get_top_topics(10) | |
|     gaps = learner.get_knowledge_gaps() | |
|     quality = learner.get_response_quality() | |
|     return { | |
|         "popular_topics": top_topics, | |
|         "needs_documentation": gaps, | |
|         "response_performance": quality | |
|     } | |
| ``` | |
| --- | |
| ## GDPR Compliance | |
| ### Data Minimization | |
| | Data Type | Stored? | Justification | | |
| |-----------|---------|---------------| | |
| | Raw queries | ❌ | PII risk | | |
| | User IDs | ❌ | Not needed | | |
| | Session IDs | ❌ | Not needed | | |
| | Clinic IDs | ❌ | Not needed | | |
| | **Topic labels** | ✅ | Anonymized | | |
| | **Topic counts** | ✅ | Statistical | | |
| | **Quality scores** | ✅ | Aggregated | | |
| | **Gap indicators** | ✅ | Anonymized | | |
| ### Right to Erasure (GDPR Art 17) | |
| Since no PII is stored, right to erasure is **automatically satisfied**. | |
| ### Data Retention | |
| ```python | |
| from datetime import datetime, timedelta | |
| # Configurable retention | |
| retention_policy = { | |
|     "topic_stats": "365d",  # Keep for 1 year | |
|     "quality_scores": "90d",  # Keep for 3 months | |
|     "gap_indicators": "30d",  # Refresh monthly | |
| } | |
| # Automatic cleanup | |
| async def cleanup_old_patterns(): | |
|     # Policy values are strings such as "365d"; convert to a number of days | |
|     days = int(retention_policy["topic_stats"].rstrip("d")) | |
|     cutoff = datetime.utcnow() - timedelta(days=days) | |
|     # Iterate over a copy so entries can be deleted during the loop | |
|     for topic, stats in list(learner.topics.items()): | |
|         if stats.last_seen < cutoff: | |
|             del learner.topics[topic] | |
| ``` | |
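The cleanup above only covers `topic_stats`, while the policy lists three retention periods. A generic sketch that could apply the same rule to any of the stores; the helpers `parse_retention` and `prune` are hypothetical, and `SimpleNamespace` objects stand in for the real dataclasses:

```python
from datetime import datetime, timedelta
from types import SimpleNamespace
from typing import Dict


def parse_retention(value: str) -> timedelta:
    """Convert a '<n>d' retention string into a timedelta."""
    if not value.endswith("d") or not value[:-1].isdigit():
        raise ValueError(f"unsupported retention value: {value!r}")
    return timedelta(days=int(value[:-1]))


def prune(store: Dict[str, object], timestamp_attr: str,
          max_age: timedelta, now: datetime) -> int:
    """Delete entries older than max_age; return how many were removed."""
    cutoff = now - max_age
    stale = [
        key for key, item in store.items()
        if getattr(item, timestamp_attr) is not None
        and getattr(item, timestamp_attr) < cutoff
    ]
    for key in stale:
        del store[key]
    return len(stale)


now = datetime(2026, 2, 20)
qualities = {
    "abc123": SimpleNamespace(last_updated=now - timedelta(days=10)),
    "xyz789": SimpleNamespace(last_updated=now - timedelta(days=120)),
}
removed = prune(qualities, "last_updated", parse_retention("90d"), now)
```

Note that `KnowledgeGap` as drafted has no timestamp field, so applying the 30-day `gap_indicators` policy this way would require adding one.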
| --- | |
| ## Security Considerations | |
| ### Encryption | |
| - All pattern data encrypted at rest (AES-256) | |
| - Encryption keys managed via HSM or Azure Key Vault | |
| - Per-tenant encryption optional (for multi-tenant isolation) | |
| ### Access Control | |
| ```python | |
| # Insights API requires admin role | |
| @app.get("/insights/topics") | |
| @require_role("admin") | |
| async def get_topics(): | |
|     return learner.get_top_topics(10) | |
| ``` | |
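The `require_role` decorator is used above without a definition. A framework-agnostic sketch of what it might look like, assuming the caller's identity arrives as a `user` keyword argument carrying a `roles` list; in a real web framework this would come from the request context instead. The `Forbidden` exception is also hypothetical.

```python
import asyncio
import functools
from typing import Any, Dict


class Forbidden(Exception):
    """Raised when the caller lacks the required role."""


def require_role(role: str):
    """Decorator factory: reject calls unless `user` holds `role`."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, user: Dict[str, Any], **kwargs):
            if role not in user.get("roles", []):
                raise Forbidden(f"role '{role}' required")
            # The user object is consumed here and not forwarded to the handler.
            return await handler(*args, **kwargs)
        return wrapper
    return decorator


@require_role("admin")
async def get_topics():
    return [{"topic": "implantat", "count": 1250}]


allowed = asyncio.run(get_topics(user={"roles": ["admin"]}))
try:
    asyncio.run(get_topics(user={"roles": ["viewer"]}))
    denied = False
except Forbidden:
    denied = True
```

The check runs before the handler body, so a caller without the role never triggers any pattern lookup.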
| ### Audit Logging | |
| ```python | |
| # Log all pattern access (not the patterns themselves) | |
| async def log_access(user_id: str, endpoint: str, timestamp: datetime): | |
|     await audit_log.store({ | |
|         "user_id": user_id, | |
|         "endpoint": endpoint, | |
|         "timestamp": timestamp.isoformat(), | |
|         # No pattern data logged | |
|     }) | |
| ``` | |
| --- | |
| ## Implementation Roadmap | |
| ### Phase 1: MVP (2 weeks) | |
| - [ ] Anonymizer with Swedish NER | |
| - [ ] Basic topic extraction (keywords) | |
| - [ ] Topic counter (no MnemoCore yet) | |
| - [ ] Simple insights API | |
| ### Phase 2: MnemoCore Integration (2 weeks) | |
| - [ ] Topic embedding storage in MnemoCore | |
| - [ ] Semantic topic clustering | |
| - [ ] Gap detection using similarity search | |
| ### Phase 3: Quality Metrics (2 weeks) | |
| - [ ] Response quality tracking | |
| - [ ] Feedback integration | |
| - [ ] Quality dashboard | |
| ### Phase 4: Production Hardening (2 weeks) | |
| - [ ] Encryption at rest | |
| - [ ] Access control | |
| - [ ] Audit logging | |
| - [ ] Performance optimization | |
| --- | |
| ## Business Value | |
| ### For Healthcare Organizations | |
| | Value | Metric | | |
| |-------|--------| | |
| | **Documentation gaps** | Know what to add to knowledge base | | |
| | **Popular topics** | Prioritize documentation efforts | | |
| | **Response quality** | Improve user satisfaction | | |
| | **Trend analysis** | Identify emerging needs | | |
| ### For Opus Dental (Competitive Advantage) | |
| | Advantage | Value | | |
| |-----------|-------| | |
| | **Continuous improvement** | Chatbot gets smarter without storing PII | | |
| | **Customer insights** | Know what clinics need | | |
| | **Compliance by design** | GDPR-safe from day 1 | | |
| | **Unique selling point** | "Learning chatbot" vs competitors | | |
| --- | |
| ## Technical Requirements | |
| ### Dependencies | |
| ``` | |
| mnemocore>=4.5.0 | |
| spacy>=3.7.0 # Swedish NER; the language model (e.g. sv_core_news_sm) is downloaded separately | |
| numpy>=1.24.0 | |
| cryptography>=41.0.0 # Encryption | |
| ``` | |
| ### Infrastructure | |
| - MnemoCore instance (can be shared or per-tenant) | |
| - Encrypted storage (Azure SQL, PostgreSQL with TDE) | |
| - Optional: Azure Key Vault for key management | |
| ### Performance | |
| - Topic extraction: <50ms per query | |
| - Insights API: <200ms | |
| - Storage: ~1KB per unique topic (highly efficient) | |
| --- | |
| ## Open Questions | |
| 1. **Topic granularity:** How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer" | |
| 2. **Trend detection:** What time window for trend analysis? 7d? 30d? | |
| 3. **Multi-language:** Support for Finnish/Norwegian in addition to Swedish? | |
| 4. **Tenant isolation:** Should patterns be shared across tenants (anonymized) or kept separate? | |
| 5. **Feedback mechanism:** How to collect ratings? Thumbs up/down? 1-5 stars? | |
| --- | |
| ## Conclusion | |
| Pattern Learner enables **continuous improvement** of healthcare chatbots **without GDPR risk**. It learns what users ask about, which answers work, and where documentation is missing, all without storing any personal data. | |
| **Key innovation:** Transform "memory" into "patterns" for compliance-safe learning. | |
| --- | |
| ## Next Steps | |
| 1. Review this spec | |
| 2. Decide on open questions | |
| 3. Prioritize MVP features | |
| 4. Start implementation | |