	# MnemoCore Pattern Learner — Specification Draft

	Version: 0.1-draft
	Date: 2026-02-20
	Status: Draft for Review
	Author: Omega (GLM-5) for Robin Granberg

	---

	## Executive Summary

Pattern Learner is a MnemoCore module that learns from user interactions without storing personal data. It extracts statistical patterns, topic clusters, and quality metrics that can be used to improve chatbot performance over time.

	Key principle: Learn patterns, forget people.

	---

	## Problem Statement

	### Healthcare Chatbot Challenges

| Challenge | Consequence |
|-----------|-------------|
| GDPR/HIPAA compliance | Conversations cannot be stored |
| Multitenancy | Data must not leak between clinics |
| Quality improvement | Need to know what works |
| Knowledge gaps | Need to identify what is missing from the docs |

	### Current Solutions (Limitations)

- Stateless RAG: No learning at all
- Full memory: GDPR risk, confidentiality problems
- Manual analytics: Time-consuming, not real-time

	---

	## Solution: Pattern Learner

	### Core Concept

```
User Query ──► Anonymize ──► Extract Pattern ──► Aggregate
                   │
                   └── PII removed before storage
```

	What IS stored:
	- Topic clusters (anonymized)
	- Query frequency distributions
	- Response quality aggregates
	- Knowledge gap indicators

	What is NOT stored:
	- User identities
	- Clinic associations
	- Patient data
	- Raw conversations

	---

	## Architecture

	### High-Level Design

```
┌──────────────────────────────────────────────────────────────────┐
│                      Pattern Learner Module                      │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────┐    ┌────────────────┐    ┌────────────────┐  │
│  │   Anonymizer   │───►│ Topic Extractor│───►│   Aggregator   │  │
│  └────────────────┘    └────────────────┘    └────────────────┘  │
│          │                     │                     │           │
│          │                     ▼                     ▼           │
│          │             ┌────────────────┐    ┌────────────────┐  │
│          │             │ Topic Embedder │    │  Stats Store   │  │
│          │             │  (MnemoCore)   │    │  (Encrypted)   │  │
│          │             └────────────────┘    └────────────────┘  │
│          │                     │                     │           │
│          └─────────────────────┴─────────────────────┘           │
│                                │                                 │
│                                ▼                                 │
│                        ┌────────────────┐                        │
│                        │  Insights API  │                        │
│                        └────────────────┘                        │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

	### Components

	#### 1. Anonymizer

	Purpose: Remove all PII before processing

	Methods:
	- Named Entity Recognition (NER) for person names
	- Pattern matching for phone numbers, addresses
	- Clinic/organization detection
	- Session ID hashing

```python
import re
from typing import Iterable


class Anonymizer:
    """Remove PII from queries before pattern extraction."""

    def __init__(self, clinic_blacklist: Iterable[str] = ()):
        # load_ner_model is a placeholder for whichever Swedish NER backend is used
        self.ner_model = load_ner_model("sv")
        # Configurable list of clinic names to mask
        self.clinic_blacklist = list(clinic_blacklist)
        # Order matters: personal numbers are matched before the broader phone pattern,
        # which would otherwise consume them
        self.patterns = {
            "personal_number": r"\d{6,8}[-\s]?\d{4}",
            "email": r"[\w\.-]+@[\w\.-]+\.\w+",
            "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
        }

    def anonymize(self, text: str) -> str:
        """Remove all PII from text."""

        # 1. NER for person and organization names
        entities = self.ner_model.extract(text)
        for entity in entities:
            if entity.type in ["PER", "ORG"]:
                text = text.replace(entity.text, "[ANON]")

        # 2. Pattern matching
        for pattern_type, pattern in self.patterns.items():
            text = re.sub(pattern, f"[{pattern_type.upper()}]", text)

        # 3. Remove clinic names (configurable blacklist)
        for clinic_name in self.clinic_blacklist:
            text = text.replace(clinic_name, "[KLINIK]")

        return text
```
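
The Methods list mentions session ID hashing, which the class above does not show. A minimal sketch, assuming a keyed hash (HMAC-SHA256) with a server-side secret so that hashes cannot be reversed by enumerating possible session IDs. `hash_session_id` and `SESSION_HASH_KEY` are hypothetical names, not part of MnemoCore.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would come from a key vault and be rotated
SESSION_HASH_KEY = b"replace-with-secret-from-key-vault"


def hash_session_id(session_id: str, key: bytes = SESSION_HASH_KEY) -> str:
    """Return a keyed, non-reversible pseudonym for a session ID."""
    digest = hmac.new(key, session_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()
```

Rotating the key breaks linkability between old and new hashes. Since the data minimization table lists session IDs as not stored, such a hash would only be used in-flight, for example to group queries within one request pipeline.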

	---

	#### 2. Topic Extractor

	Purpose: Extract semantic topics from anonymized queries

	Methods:
	- Keyword extraction (TF-IDF)
	- Topic modeling (LDA, BERTopic)
	- Embedding-based clustering

```python
from typing import List

# Illustrative subset; a real deployment would load a full Swedish stopword list
STOPWORDS_SV = {"eller", "inte", "detta", "denna", "efter", "under"}


class TopicExtractor:
    """Extract topics from anonymized queries."""

    def __init__(self, mnemocore_engine):
        self.engine = mnemocore_engine
        self.topic_threshold = 0.5

    async def extract_topics(self, query: str) -> List[str]:
        """Extract topics from an anonymized query."""

        # 1. Get keywords
        keywords = self._extract_keywords(query)

        # 2. Find similar topics in MnemoCore
        similar = await self.engine.query(query, top_k=5)

        # 3. Collect topics from sufficiently similar memories
        topics = []
        for memory_id, similarity in similar:
            if similarity > self.topic_threshold:
                memory = await self.engine.get_memory(memory_id)
                topics.extend(memory.metadata.get("topics", []))

        # 4. Deduplicate while preserving order, so topics[0] is stable
        return list(dict.fromkeys(topics + keywords))

    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords with a stopword/length filter (placeholder for TF-IDF)."""
        words = text.lower().split()
        return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV]
```
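
The keyword step above is a stopword filter rather than TF-IDF proper. A minimal sketch of TF-IDF scoring against a small in-memory corpus of anonymized queries, using only the standard library. The function name `tfidf_keywords` and the smoothed IDF formula are assumptions for illustration, not part of the spec.

```python
import math
from collections import Counter
from typing import List


def tfidf_keywords(query: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Rank the words of `query` by TF-IDF against `corpus` (anonymized queries)."""
    docs = [set(doc.lower().split()) for doc in corpus]
    words = query.lower().split()
    if not words:
        return []
    tf = Counter(words)

    scores = {}
    for word, freq in tf.items():
        doc_freq = sum(1 for doc in docs if word in doc)
        # Smoothed IDF: words that appear in few documents score higher
        idf = math.log((1 + len(docs)) / (1 + doc_freq)) + 1.0
        scores[word] = (freq / len(words)) * idf

    # Sort by score (descending), then alphabetically for a stable order
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]
```

With a corpus where every query contains "vad kostar" but only one mentions "implantat", the rarer word ranks first, which is the behaviour a topic extractor wants from its keyword step.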

	---

	#### 3. Aggregator

	Purpose: Store statistical patterns without PII

	Data structures:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TopicStats:
    """Statistics for a topic"""
    topic: str
    count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    trend: float = 0.0  # Recent increase/decrease

@dataclass
class ResponseQuality:
    """Aggregated response quality (no individual ratings)"""
    response_signature: str  # Hash of response template
    avg_rating: float = 0.5  # Normalized to 0-1; 0.5 acts as a neutral prior
    sample_count: int = 0
    last_updated: Optional[datetime] = None

@dataclass
class KnowledgeGap:
    """Topics with no good answers"""
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0  # Fraction (0-1) of queries that got "I don't know"
    suggested_action: str = ""  # "add documentation", "improve answer"
```
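
`TopicStats.trend` is declared above, but the spec does not define how it is computed. One possible definition, assuming trend is the relative change in query count between the most recent window and the window before it. `compute_trend` and its per-day count input are hypothetical.

```python
from typing import Sequence


def compute_trend(daily_counts: Sequence[int], window: int = 7) -> float:
    """Relative change between the last `window` days and the `window` days before.

    `daily_counts` is ordered oldest to newest. Returns 0.0 when there is not
    enough history or the earlier window is empty.
    """
    if window <= 0 or len(daily_counts) < 2 * window:
        return 0.0
    recent = sum(daily_counts[-window:])
    previous = sum(daily_counts[-2 * window:-window])
    if previous == 0:
        return 0.0
    return (recent - previous) / previous
```

Under this definition a topic that goes from 70 to 91 queries week over week has a trend of 0.30. The window length is the subject of an open question in this spec, so it is left as a parameter.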

	Storage:

```python
from datetime import datetime
from typing import Dict


class PatternStore:
    """Store patterns (encrypted, no PII)"""

    def __init__(self, encryption_key: bytes):
        # Key is held for encrypting data at rest; persistence is not shown here
        self.key = encryption_key
        self.topics: Dict[str, TopicStats] = {}
        self.qualities: Dict[str, ResponseQuality] = {}
        self.gaps: Dict[str, KnowledgeGap] = {}

    def record_topic(self, topic: str):
        """Record that a topic was queried"""
        if topic not in self.topics:
            self.topics[topic] = TopicStats(
                topic=topic,
                first_seen=datetime.utcnow()
            )

        stats = self.topics[topic]
        stats.count += 1
        stats.last_seen = datetime.utcnow()

    def record_quality(self, response_sig: str, rating: int):
        """Record response quality (aggregated); rating is on a 1-5 scale"""
        if response_sig not in self.qualities:
            self.qualities[response_sig] = ResponseQuality(
                response_signature=response_sig
            )

        q = self.qualities[response_sig]
        # Exponential moving average of the rating normalized to 0-1
        q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0)
        q.sample_count += 1
        q.last_updated = datetime.utcnow()

    def record_gap(self, topic: str, had_answer: bool):
        """Record knowledge gap"""
        if topic not in self.gaps:
            self.gaps[topic] = KnowledgeGap(topic=topic)

        gap = self.gaps[topic]
        gap.query_count += 1
        # Running mean of the failure indicator (1 = no answer, 0 = answered)
        failed = 0.0 if had_answer else 1.0
        gap.failure_rate += (failed - gap.failure_rate) / gap.query_count
```
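
`KnowledgeGap.suggested_action` is never set by `record_gap`. A minimal sketch of one way to derive it from the aggregated counters. The thresholds (`min_queries`, `high`, `moderate`) are illustrative assumptions, not values from the spec.

```python
def suggest_action(query_count: int, failure_rate: float,
                   min_queries: int = 20,
                   high: float = 0.8, moderate: float = 0.5) -> str:
    """Map gap statistics to a suggested action; thresholds are illustrative."""
    if query_count < min_queries:
        return ""  # Too few samples to draw a conclusion
    if failure_rate >= high:
        return "add documentation"  # Almost nothing answers this topic
    if failure_rate >= moderate:
        return "improve answer"  # Some answers exist but often fail
    return ""
```

With these thresholds, a topic with 145 queries and a failure rate of 0.85 maps to "add documentation", and one with 89 queries at 0.72 maps to "improve answer".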

	---

	#### 4. Insights API

	Purpose: Provide actionable insights to admins/developers

	Endpoints:

```python
# GET /insights/topics?top_k=10
{
    "topics": [
        {"topic": "implantat", "count": 1250, "trend": 0.15},
        {"topic": "rotfyllning", "count": 980, "trend": -0.02},
        {"topic": "priser", "count": 850, "trend": 0.30}
    ],
    "period": "30d"
}

# GET /insights/gaps
{
    "knowledge_gaps": [
        {
            "topic": "tandreglering vuxna",
            "query_count": 145,
            "failure_rate": 0.85,
            "suggested_action": "add documentation"
        },
        {
            "topic": "akut tandvård",
            "query_count": 89,
            "failure_rate": 0.72,
            "suggested_action": "improve answer"
        }
    ]
}

# GET /insights/quality
# Note: avg_rating is shown here on the 1-5 scale, while PatternStore keeps a
# normalized 0-1 value, so the API would need to rescale before responding.
{
    "top_responses": [
        {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520},
        {"signature": "def456", "avg_rating": 4.5, "sample_count": 340}
    ],
    "worst_responses": [
        {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45}
    ]
}
```
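
The responses above can be produced from the aggregated counters alone. A minimal sketch, assuming the counters are available as plain dictionaries. `top_topics`, `knowledge_gaps`, and their threshold parameters are hypothetical names for illustration.

```python
from typing import Dict, List


def top_topics(counts: Dict[str, int], top_k: int = 10) -> List[dict]:
    """Return the `top_k` most frequent topics, highest count first."""
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"topic": topic, "count": count} for topic, count in ranked[:top_k]]


def knowledge_gaps(gaps: Dict[str, dict], min_queries: int = 20,
                   min_failure_rate: float = 0.5) -> List[dict]:
    """Return topics that are asked often enough and fail often enough to matter."""
    selected = [
        {"topic": topic, **stats}
        for topic, stats in gaps.items()
        if stats["query_count"] >= min_queries
        and stats["failure_rate"] >= min_failure_rate
    ]
    # Worst failure rate first
    return sorted(selected, key=lambda g: -g["failure_rate"])
```

A minimum query count also keeps rare topics, which are more likely to be identifying, out of the API response.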

	---

	## MnemoCore Integration

	### Usage Pattern

```python
from mnemocore import HAIMEngine
from mnemocore.pattern_learner import PatternLearner

# Note: the top-level `await` below must run inside an async context
# (for example an application startup hook). Anonymizer, get_encryption_key
# and rag_lookup are assumed to be provided by the host application.

# Initialize MnemoCore (stores topic embeddings)
engine = HAIMEngine(dimension=16384)
await engine.initialize()

# Initialize Pattern Learner
learner = PatternLearner(
    engine=engine,
    encryption_key=get_encryption_key(),
    anonymizer=Anonymizer()
)

# Process a query (automatic learning)
async def handle_query(user_query: str, tenant_id: str):
    # tenant_id is deliberately never passed to the learner

    # 1. Anonymize
    anon_query = learner.anonymize(user_query)

    # 2. Extract patterns (no PII)
    topics = await learner.extract_topics(anon_query)

    # 3. Record topic usage
    for topic in topics:
        learner.record_topic(topic)

    # 4. Get answer from RAG
    answer = await rag_lookup(anon_query)

    # 5. Record if we had an answer (first topic is treated as the primary one)
    learner.record_gap(
        topic=topics[0] if topics else "unknown",
        had_answer=(answer is not None)
    )

    return answer

# Get insights (admin only)
async def get_dashboard():
    top_topics = learner.get_top_topics(10)
    gaps = learner.get_knowledge_gaps()
    quality = learner.get_response_quality()

    return {
        "popular_topics": top_topics,
        "needs_documentation": gaps,
        "response_performance": quality
    }
```

	---

	## GDPR Compliance

	### Data Minimization

| Data Type | Stored? | Justification |
|-----------|---------|---------------|
| Raw queries | ❌ | PII risk |
| User IDs | ❌ | Not needed |
| Session IDs | ❌ | Not needed |
| Clinic IDs | ❌ | Not needed |
| Topic labels | ✅ | Anonymized |
| Topic counts | ✅ | Statistical |
| Quality scores | ✅ | Aggregated |
| Gap indicators | ✅ | Anonymized |

	### Right to Erasure (GDPR Art 17)

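Since no PII is stored, there is no per-person data to erase, so an erasure request requires no action on the pattern store. This holds only as long as the Anonymizer removes all identifying data before aggregation.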

	### Data Retention

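```python
from datetime import datetime, timedelta

# Configurable retention
retention_policy = {
    "topic_stats": "365d",  # Keep for 1 year
    "quality_scores": "90d",  # Keep for 3 months
    "gap_indicators": "30d",  # Refresh monthly
}

def parse_retention(spec: str) -> timedelta:
    """Convert a day-based spec such as "365d" into a timedelta."""
    return timedelta(days=int(spec.rstrip("d")))

# Automatic cleanup
async def cleanup_old_patterns():
    cutoff = datetime.utcnow() - parse_retention(retention_policy["topic_stats"])
    # Iterate over a snapshot: deleting from a dict while iterating raises RuntimeError
    for topic, stats in list(learner.topics.items()):
        if stats.last_seen is not None and stats.last_seen < cutoff:
            del learner.topics[topic]
```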

	---

	## Security Considerations

	### Encryption

	- All pattern data encrypted at rest (AES-256)
	- Encryption keys managed via HSM or Azure Key Vault
	- Per-tenant encryption optional (for multi-tenant isolation)
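
Per-tenant encryption needs one key per tenant. A minimal sketch, assuming per-tenant keys are derived from a single master key with HKDF-SHA256 (RFC 5869), written here with the standard library only. `derive_tenant_key` and the `info` label are hypothetical; a production system might instead use the key-derivation facilities of its HSM or key vault.

```python
import hashlib
import hmac


def derive_tenant_key(master_key: bytes, tenant_id: str,
                      salt: bytes = b"") -> bytes:
    """Derive a 32-byte (AES-256 sized) key for one tenant via HKDF-SHA256."""
    # HKDF-Extract: an empty salt defaults to a string of zero bytes (hash length)
    prk = hmac.new(salt or b"\x00" * 32, master_key, hashlib.sha256).digest()
    # HKDF-Expand: a single block suffices because 32 bytes == SHA-256 output size
    info = b"pattern-learner/tenant/" + tenant_id.encode("utf-8")
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()
```

The tenant ID is used only at derivation time and does not need to be written next to the encrypted patterns. Compromise of one derived key does not reveal the master key or the keys of other tenants.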

	### Access Control

```python
# Insights API requires admin role
@app.get("/insights/topics")
@require_role("admin")
async def get_topics():
    return learner.get_top_topics(10)
```

	### Audit Logging

```python
# Log all pattern access (not the patterns themselves)
async def log_access(user_id: str, endpoint: str, timestamp: datetime):
    await audit_log.store({
        "user_id": user_id,
        "endpoint": endpoint,
        "timestamp": timestamp.isoformat(),
        # No pattern data logged
    })
```

	---

	## Implementation Roadmap

	### Phase 1: MVP (2 weeks)

	- [ ] Anonymizer with Swedish NER
	- [ ] Basic topic extraction (keywords)
	- [ ] Topic counter (no MnemoCore yet)
	- [ ] Simple insights API

	### Phase 2: MnemoCore Integration (2 weeks)

	- [ ] Topic embedding storage in MnemoCore
	- [ ] Semantic topic clustering
	- [ ] Gap detection using similarity search

	### Phase 3: Quality Metrics (2 weeks)

	- [ ] Response quality tracking
	- [ ] Feedback integration
	- [ ] Quality dashboard

	### Phase 4: Production Hardening (2 weeks)

	- [ ] Encryption at rest
	- [ ] Access control
	- [ ] Audit logging
	- [ ] Performance optimization

	---

	## Business Value

	### For Healthcare Organizations

| Insight | Benefit |
|---------|---------|
| Documentation gaps | Know what to add to knowledge base |
| Popular topics | Prioritize documentation efforts |
| Response quality | Improve user satisfaction |
| Trend analysis | Identify emerging needs |

	### For Opus Dental (Competitive Advantage)

| Advantage | Value |
|-----------|-------|
| Continuous improvement | Chatbot gets smarter without storing PII |
| Customer insights | Know what clinics need |
| Compliance by design | GDPR-safe from day 1 |
| Unique selling point | "Learning chatbot" vs competitors |

	---

	## Technical Requirements

	### Dependencies

```
mnemocore>=4.5.0
spacy>=3.7.0 # Swedish NER; the model (e.g. sv_core_news_sm) is installed separately
numpy>=1.24.0
cryptography>=41.0.0 # Encryption
```

	### Infrastructure

	- MnemoCore instance (can be shared or per-tenant)
	- Encrypted storage (Azure SQL, PostgreSQL with TDE)
	- Optional: Azure Key Vault for key management

	### Performance

- Topic extraction: target <50 ms per query
- Insights API: target <200 ms per request
- Storage: ~1 KB per unique topic (estimate)

	---

	## Open Questions

	1. Topic granularity: How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer"

	2. Trend detection: What time window for trend analysis? 7d? 30d?

	3. Multi-language: Support for Finnish/Norwegian in addition to Swedish?

	4. Tenant isolation: Should patterns be shared across tenants (anonymized) or kept separate?

	5. Feedback mechanism: How to collect ratings? Thumbs up/down? 1-5 stars?

	---

	## Conclusion

	Pattern Learner enables continuous improvement of healthcare chatbots without GDPR risk. It learns what users ask about, which answers work, and where documentation is missing — all without storing any personal data.

	Key innovation: Transform "memory" into "patterns" — compliance-safe learning.

	---

	## Next Steps

	1. Review this spec
	2. Decide on open questions
	3. Prioritize MVP features
	4. Start implementation
