# MnemoCore Pattern Learner: Specification Draft
**Version:** 0.1-draft
**Date:** 2026-02-20
**Status:** Draft for Review
**Author:** Omega (GLM-5) for Robin Granberg
---
## Executive Summary
Pattern Learner is a MnemoCore module that learns from user interactions **without storing personal data**. It extracts statistical patterns, topic clusters, and quality metrics that can be used to improve chatbot performance over time.
**Key principle:** Learn patterns, forget people.
---
## Problem Statement
### Healthcare Chatbot Challenges
| Challenge | Consequence |
|-----------|-------------|
| GDPR/HIPAA compliance | Cannot store conversations |
| Multitenancy | Data must not leak between clinics |
| Quality improvement | Need to know what works |
| Knowledge gaps | Need to identify what is missing from the docs |
### Current Solutions (Limitations)
- **Stateless RAG:** No learning at all
- **Full memory:** GDPR risk, confidentiality problems
- **Manual analytics:** Time-consuming, not real-time
---
## Solution: Pattern Learner
### Core Concept
```
User Query ──► Anonymize ──► Extract Pattern ──► Aggregate
                   │
                   └── PII removed before storage
```
**What IS stored:**
- Topic clusters (anonymized)
- Query frequency distributions
- Response quality aggregates
- Knowledge gap indicators
**What is NOT stored:**
- User identities
- Clinic associations
- Patient data
- Raw conversations
---
## Architecture
### High-Level Design
```
┌─────────────────────────────────────────────────────────────┐
│                   Pattern Learner Module                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌──────────────┐    ┌───────────────┐    ┌──────────────┐  │
│  │  Anonymizer  │───►│Topic Extractor│───►│  Aggregator  │  │
│  └──────────────┘    └───────────────┘    └──────────────┘  │
│         │                    │                   │          │
│         │                    ▼                   ▼          │
│         │            ┌──────────────┐     ┌──────────────┐  │
│         │            │Topic Embedder│     │ Stats Store  │  │
│         │            │ (MnemoCore)  │     │ (Encrypted)  │  │
│         │            └──────────────┘     └──────────────┘  │
│         │                    │                   │          │
│         └────────────────────┴───────────────────┘          │
│                              │                              │
│                              ▼                              │
│                      ┌──────────────┐                       │
│                      │ Insights API │                       │
│                      └──────────────┘                       │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```
### Components
#### 1. Anonymizer
**Purpose:** Remove all PII before processing
**Methods:**
- Named Entity Recognition (NER) for person names
- Pattern matching for phone numbers, addresses
- Clinic/organization detection
- Session ID hashing
```python
import re
from typing import Iterable


class Anonymizer:
    """Remove PII from queries before pattern extraction."""

    def __init__(self, clinic_blacklist: Iterable[str] = ()):
        # load_ner_model is assumed to be provided elsewhere in the project.
        self.ner_model = load_ner_model("sv")  # Swedish
        # Configurable list of clinic names to mask.
        self.clinic_blacklist = list(clinic_blacklist)
        # Order matters: the broad phone pattern also matches personal
        # numbers, so the more specific pattern is applied first.
        self.patterns = {
            "email": r"[\w\.-]+@[\w\.-]+\.\w+",
            "personal_number": r"\d{6,8}[-\s]?\d{4}",
            "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
        }

    def anonymize(self, text: str) -> str:
        """Remove all PII from text."""
        # 1. NER for person and organization names
        entities = self.ner_model.extract(text)
        for entity in entities:
            if entity.type in ["PER", "ORG"]:
                text = text.replace(entity.text, "[ANON]")

        # 2. Pattern matching
        for pattern_type, pattern in self.patterns.items():
            text = re.sub(pattern, f"[{pattern_type.upper()}]", text)

        # 3. Remove clinic names (configurable blacklist)
        for clinic_name in self.clinic_blacklist:
            text = text.replace(clinic_name, "[KLINIK]")

        return text
```
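The pattern-matching step can be tried on its own, without an NER model. The sketch below reuses the three regular expressions from the `Anonymizer`; the function name `scrub_patterns` and the `PII_PATTERNS` list are illustrative and not part of the spec. It applies the personal-number pattern before the phone pattern, because the phone pattern is broad enough to match a personal number and would otherwise label it `[PHONE]`.

```python
import re

# Patterns copied from the Anonymizer. The list order is deliberate:
# the specific personal-number pattern runs before the broad phone pattern.
PII_PATTERNS = [
    ("EMAIL", re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")),
    ("PERSONAL_NUMBER", re.compile(r"\d{6,8}[-\s]?\d{4}")),
    ("PHONE", re.compile(r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}")),
]


def scrub_patterns(text: str) -> str:
    """Replace regex-detectable PII with placeholder tokens."""
    for label, pattern in PII_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


print(scrub_patterns("Call 070-123 45 67 or mail anna@example.com"))
print(scrub_patterns("ID 19850101-1234"))
```

Regexes of this kind only catch well-formed identifiers; names and free-text references still depend on the NER step and the clinic blacklist.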
---
#### 2. Topic Extractor
**Purpose:** Extract semantic topics from anonymized queries
**Methods:**
- Keyword extraction (TF-IDF)
- Topic modeling (LDA, BERTopic)
- Embedding-based clustering
```python
from typing import List

# Assumed: STOPWORDS_SV is a set of Swedish stopwords defined elsewhere.


class TopicExtractor:
    """Extract topics from anonymized queries."""

    def __init__(self, mnemocore_engine):
        self.engine = mnemocore_engine
        self.topic_threshold = 0.5  # Minimum similarity for reusing a stored topic

    async def extract_topics(self, query: str) -> List[str]:
        """Extract topics from an anonymized query."""
        # 1. Get keywords
        keywords = self._extract_keywords(query)

        # 2. Find similar topics in MnemoCore
        similar = await self.engine.query(query, top_k=5)

        # 3. Collect topics from sufficiently similar memories
        topics = []
        for memory_id, similarity in similar:
            if similarity > self.topic_threshold:
                memory = await self.engine.get_memory(memory_id)
                topics.extend(memory.metadata.get("topics", []))

        # 4. Deduplicate while keeping first-seen order, so that
        #    topics[0] is deterministic for callers
        return list(dict.fromkeys(topics + keywords))

    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords with a length and stopword filter (placeholder for TF-IDF)."""
        words = text.lower().split()
        return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV]
```
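The method list names TF-IDF, while `_extract_keywords` currently applies only a length and stopword filter. A minimal TF-IDF ranking might look like the sketch below. The function `tfidf_keywords`, its `corpus` argument, and the smoothing formula are assumptions made for illustration. Because raw queries are not stored, a real implementation would need to take document frequencies from aggregate counters rather than from a list of past queries.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_keywords(query: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Rank the words of `query` by TF-IDF against `corpus` (hypothetical helper)."""
    docs = [set(doc.lower().split()) for doc in corpus]
    words = query.lower().split()
    term_counts = Counter(words)
    scores: Dict[str, float] = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for doc in docs if word in doc)
        # Smoothed IDF so words unseen in the corpus do not divide by zero.
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1.0
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


corpus = [
    "vad kostar implantat",
    "vad kostar rotfyllning",
    "vad kostar blekning",
]
print(tfidf_keywords("vad kostar implantat", corpus, top_n=1))
```

Words that appear in every query ("vad", "kostar") receive the lowest weight, so the topic-bearing word ranks first without a hand-maintained stopword list.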
---
#### 3. Aggregator
**Purpose:** Store statistical patterns without PII
**Data structures:**
```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TopicStats:
    """Statistics for a topic."""
    topic: str
    count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    trend: float = 0.0  # Recent increase/decrease


@dataclass
class ResponseQuality:
    """Aggregated response quality (no individual ratings)."""
    response_signature: str  # Hash of response template
    avg_rating: float = 0.5  # Normalized to 0-1; starts at a neutral 0.5
    sample_count: int = 0
    last_updated: Optional[datetime] = None


@dataclass
class KnowledgeGap:
    """Topics with no good answers."""
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0  # Fraction of queries that got "I don't know"
    suggested_action: str = ""  # "add documentation", "improve answer"
```
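`ResponseQuality.response_signature` is described only as a hash of the response template. One way to produce such a value is sketched below. The function name, the choice of SHA-256, the whitespace and case normalization, and the 12-character truncation are all assumptions, not decisions made in this spec.

```python
import hashlib
import re


def response_signature(template: str, length: int = 12) -> str:
    """Return a short, stable hash for a response template (hypothetical helper)."""
    # Normalize case and whitespace so cosmetic edits keep the same signature.
    normalized = re.sub(r"\s+", " ", template.strip().lower())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return digest[:length]


a = response_signature("An implant usually costs between {low} and {high} SEK.")
b = response_signature("  an implant usually costs between {low} and {high} SEK.\n")
print(a == b)  # the two templates differ only in case and whitespace
```

Hashing the template rather than the rendered answer keeps patient-specific values out of the signature, which matters if rendered answers can contain details from the query.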
**Storage:**
```python
from datetime import datetime
from typing import Dict


class PatternStore:
    """Store patterns (encrypted, no PII)."""

    def __init__(self, encryption_key: bytes):
        self.key = encryption_key  # For encryption at rest (persistence not shown)
        self.topics: Dict[str, TopicStats] = {}
        self.qualities: Dict[str, ResponseQuality] = {}
        self.gaps: Dict[str, KnowledgeGap] = {}

    def record_topic(self, topic: str):
        """Record that a topic was queried."""
        now = datetime.utcnow()
        if topic not in self.topics:
            self.topics[topic] = TopicStats(topic=topic, first_seen=now)
        stats = self.topics[topic]
        stats.count += 1
        stats.last_seen = now

    def record_quality(self, response_sig: str, rating: int):
        """Record response quality (aggregated); rating is assumed to be on a 1-5 scale."""
        if response_sig not in self.qualities:
            self.qualities[response_sig] = ResponseQuality(
                response_signature=response_sig
            )
        q = self.qualities[response_sig]
        # Exponential moving average of the rating, normalized to 0-1
        q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0)
        q.sample_count += 1
        q.last_updated = datetime.utcnow()

    def record_gap(self, topic: str, had_answer: bool):
        """Record whether a query on this topic was answered."""
        if topic not in self.gaps:
            self.gaps[topic] = KnowledgeGap(topic=topic)
        gap = self.gaps[topic]
        gap.query_count += 1
        failed = 0 if had_answer else 1
        # Incremental mean: previous failure total plus this outcome, over the new count
        gap.failure_rate = (gap.failure_rate * (gap.query_count - 1) + failed) / gap.query_count
```
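`TopicStats.trend` is declared, but nothing in `PatternStore` updates it, and the time window is still listed as an open question. One possible definition, offered purely as an assumption, is the relative change in query volume between two consecutive windows of equal length. The helper name and the handling of an empty baseline below are illustrative; the store would also need per-window counts, which `TopicStats` does not hold today.

```python
def relative_trend(recent_count: int, previous_count: int) -> float:
    """Relative change in query volume between two equal-length windows."""
    if previous_count == 0:
        # No baseline: report full growth only if the topic is newly active.
        return 1.0 if recent_count > 0 else 0.0
    return (recent_count - previous_count) / previous_count


# 115 queries in the latest window against 100 in the one before it
print(relative_trend(115, 100))
```

Under this definition a value such as `0.15` reads as "15% more queries than in the previous window", and negative values indicate declining interest.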
---
#### 4. Insights API
**Purpose:** Provide actionable insights to admins/developers
**Endpoints:**
```python
# GET /insights/topics?top_k=10
{
  "topics": [
    {"topic": "implantat", "count": 1250, "trend": 0.15},
    {"topic": "rotfyllning", "count": 980, "trend": -0.02},
    {"topic": "priser", "count": 850, "trend": 0.30}
  ],
  "period": "30d"
}

# GET /insights/gaps
{
  "knowledge_gaps": [
    {
      "topic": "tandreglering vuxna",
      "query_count": 145,
      "failure_rate": 0.85,
      "suggested_action": "add documentation"
    },
    {
      "topic": "akut tandvård",
      "query_count": 89,
      "failure_rate": 0.72,
      "suggested_action": "improve answer"
    }
  ]
}

# GET /insights/quality
# (avg_rating is shown here on a 1-5 scale; PatternStore keeps it normalized to 0-1)
{
  "top_responses": [
    {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520},
    {"signature": "def456", "avg_rating": 4.5, "sample_count": 340}
  ],
  "worst_responses": [
    {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45}
  ]
}
```
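The spec does not say how `/insights/gaps` selects and orders its entries. A simple approach, sketched here under assumed thresholds, is to ignore topics with too few queries and rank the rest by the estimated number of failed queries. `GapRow` mirrors the fields of `KnowledgeGap`; `rank_gaps`, `min_queries`, and `min_failure_rate` are hypothetical names and defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GapRow:
    """Stand-in for KnowledgeGap with only the fields used for ranking."""
    topic: str
    query_count: int
    failure_rate: float


def rank_gaps(rows: List[GapRow], min_queries: int = 20,
              min_failure_rate: float = 0.5) -> List[GapRow]:
    """Return gaps worth acting on, highest estimated failure volume first."""
    eligible = [
        r for r in rows
        if r.query_count >= min_queries and r.failure_rate >= min_failure_rate
    ]
    # Estimated failed queries = volume x failure rate.
    return sorted(eligible, key=lambda r: r.query_count * r.failure_rate, reverse=True)


rows = [
    GapRow("tandreglering vuxna", 145, 0.85),
    GapRow("akut tandvard", 89, 0.72),
    GapRow("rare topic", 5, 1.0),      # too few queries to trust
    GapRow("priser", 500, 0.10),       # mostly answered
]
for row in rank_gaps(rows):
    print(row.topic, round(row.query_count * row.failure_rate, 1))
```

A minimum query count also has a privacy benefit: topics asked about by only a handful of users never surface in the insights output.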
---
## MnemoCore Integration
### Usage Pattern
```python
from mnemocore import HAIMEngine
from mnemocore.pattern_learner import PatternLearner

# Initialize MnemoCore (stores topic embeddings)
engine = HAIMEngine(dimension=16384)
await engine.initialize()  # Must run inside an async context

# Initialize Pattern Learner
learner = PatternLearner(
    engine=engine,
    encryption_key=get_encryption_key(),
    anonymizer=Anonymizer()
)


# Process a query (automatic learning)
async def handle_query(user_query: str, tenant_id: str):
    # Note: tenant_id is not passed to the learner, so no clinic
    # association is recorded.

    # 1. Anonymize
    anon_query = learner.anonymize(user_query)

    # 2. Extract patterns (no PII)
    topics = await learner.extract_topics(anon_query)

    # 3. Record topic usage
    for topic in topics:
        learner.record_topic(topic)

    # 4. Get answer from RAG
    answer = await rag_lookup(anon_query)

    # 5. Record if we had an answer
    learner.record_gap(
        topic=topics[0] if topics else "unknown",
        had_answer=(answer is not None)
    )

    return answer


# Get insights (admin only)
async def get_dashboard():
    top_topics = learner.get_top_topics(10)
    gaps = learner.get_knowledge_gaps()
    quality = learner.get_response_quality()

    return {
        "popular_topics": top_topics,
        "needs_documentation": gaps,
        "response_performance": quality
    }
```
---
## GDPR Compliance
### Data Minimization
| Data Type | Stored? | Justification |
|-----------|---------|---------------|
| Raw queries | ❌ | PII risk |
| User IDs | ❌ | Not needed |
| Session IDs | ❌ | Not needed |
| Clinic IDs | ❌ | Not needed |
| **Topic labels** | ✅ | Anonymized |
| **Topic counts** | ✅ | Statistical |
| **Quality scores** | ✅ | Aggregated |
| **Gap indicators** | ✅ | Anonymized |
### Right to Erasure (GDPR Art 17)
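Because the pattern store holds no PII and no user, session, or clinic identifiers, there is no per-person record to locate or delete, so right to erasure is **automatically satisfied** for stored patterns. This holds only as long as the Anonymizer removes all PII before aggregation.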
### Data Retention
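```python
from datetime import datetime, timedelta

# Configurable retention
retention_policy = {
    "topic_stats": "365d",     # Keep for 1 year
    "quality_scores": "90d",   # Keep for 3 months
    "gap_indicators": "30d",   # Refresh monthly
}


def retention_days(value: str) -> int:
    """Convert a retention string such as "365d" to a number of days."""
    return int(value.rstrip("d"))


# Automatic cleanup
async def cleanup_old_patterns():
    cutoff = datetime.utcnow() - timedelta(days=retention_days(retention_policy["topic_stats"]))
    # Iterate over a copy so entries can be deleted during the loop
    for topic, stats in list(learner.topics.items()):
        if stats.last_seen < cutoff:
            del learner.topics[topic]
```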
---
## Security Considerations
### Encryption
- All pattern data encrypted at rest (AES-256)
- Encryption keys managed via HSM or Azure Key Vault
- Per-tenant encryption optional (for multi-tenant isolation)
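If per-tenant encryption is chosen, each tenant needs its own key without multiplying the number of secrets held in the key vault. One common pattern is to derive tenant keys from a single master key. The sketch below uses HMAC-SHA256 from the standard library as the derivation function; the function name, the context prefix, and the minimum key length are assumptions, and a production system would more likely use a vetted KDF such as HKDF from a cryptography library.

```python
import hashlib
import hmac


def derive_tenant_key(master_key: bytes, tenant_id: str) -> bytes:
    """Derive a 32-byte (256-bit) key for one tenant from a master key."""
    if len(master_key) < 32:
        raise ValueError("master key should be at least 32 bytes")
    # The fixed prefix separates this use of the master key from any other.
    info = b"pattern-learner/tenant-key/" + tenant_id.encode("utf-8")
    return hmac.new(master_key, info, hashlib.sha256).digest()


master = bytes(range(32))  # Demo value only; a real key comes from the HSM or vault
key_a = derive_tenant_key(master, "clinic-a")
key_b = derive_tenant_key(master, "clinic-b")
print(len(key_a), key_a != key_b)
```

The tenant identifier is used only at derivation time and is not written next to the pattern data, so this does not conflict with the rule that clinic IDs are not stored.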
### Access Control
```python
# Insights API requires admin role
@app.get("/insights/topics")
@require_role("admin")
async def get_topics():
    return learner.get_top_topics(10)
```
### Audit Logging
```python
# Log all pattern access (not the patterns themselves)
async def log_access(user_id: str, endpoint: str, timestamp: datetime):
    await audit_log.store({
        "user_id": user_id,
        "endpoint": endpoint,
        "timestamp": timestamp.isoformat(),
        # No pattern data logged
    })
```
---
## Implementation Roadmap
### Phase 1: MVP (2 weeks)
- [ ] Anonymizer with Swedish NER
- [ ] Basic topic extraction (keywords)
- [ ] Topic counter (no MnemoCore yet)
- [ ] Simple insights API
### Phase 2: MnemoCore Integration (2 weeks)
- [ ] Topic embedding storage in MnemoCore
- [ ] Semantic topic clustering
- [ ] Gap detection using similarity search
### Phase 3: Quality Metrics (2 weeks)
- [ ] Response quality tracking
- [ ] Feedback integration
- [ ] Quality dashboard
### Phase 4: Production Hardening (2 weeks)
- [ ] Encryption at rest
- [ ] Access control
- [ ] Audit logging
- [ ] Performance optimization
---
## Business Value
### For Healthcare Organizations
| Value | Metric |
|-------|--------|
| **Documentation gaps** | Know what to add to knowledge base |
| **Popular topics** | Prioritize documentation efforts |
| **Response quality** | Improve user satisfaction |
| **Trend analysis** | Identify emerging needs |
### For Opus Dental (Competitive Advantage)
| Advantage | Value |
|-----------|-------|
| **Continuous improvement** | Chatbot gets smarter without storing PII |
| **Customer insights** | Know what clinics need |
| **Compliance by design** | GDPR-safe from day 1 |
| **Unique selling point** | "Learning chatbot" vs competitors |
---
## Technical Requirements
### Dependencies
```
mnemocore>=4.5.0
spacy>=3.7.0 # Swedish NER (model installed separately, e.g. sv_core_news_sm)
numpy>=1.24.0
cryptography>=41.0.0 # Encryption
```
### Infrastructure
- MnemoCore instance (can be shared or per-tenant)
- Encrypted storage (Azure SQL, PostgreSQL with TDE)
- Optional: Azure Key Vault for key management
### Performance
- Topic extraction: <50ms per query
- Insights API: <200ms
- Storage: ~1KB per unique topic (highly efficient)
---
## Open Questions
1. **Topic granularity:** How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer"
2. **Trend detection:** What time window for trend analysis? 7d? 30d?
3. **Multi-language:** Support for Finnish/Norwegian in addition to Swedish?
4. **Tenant isolation:** Should patterns be shared across tenants (anonymized) or kept separate?
5. **Feedback mechanism:** How to collect ratings? Thumbs up/down? 1-5 stars?
---
## Conclusion
Pattern Learner enables **continuous improvement** of healthcare chatbots **without GDPR risk**. It learns what users ask about, which answers work, and where documentation is missing, all without storing any personal data.
**Key innovation:** Transform "memory" into "patterns" for compliance-safe learning.
---
## Next Steps
1. Review this spec
2. Decide on open questions
3. Prioritize MVP features
4. Start implementation
---
*Draft by Omega (GLM-5) for Robin Granberg*
*2026-02-20*