# MnemoCore Pattern Learner — Specification Draft

**Version:** 0.1-draft  
**Date:** 2026-02-20  
**Status:** Draft for Review  
**Author:** Omega (GLM-5) for Robin Granberg

---

## Executive Summary

Pattern Learner är en MnemoCore-modul som lär sig från användarinteraktioner **utan att lagra persondata**. Den extraherar statistiska mönster, topic clustering och kvalitetsmetrics som kan användas för att förbättra chatbot-performance över tid.

**Key principle:** Learn patterns, forget people.

---

## Problem Statement

### Healthcare Chatbot Challenges

| Utmaning | Konsekvens |
|----------|------------|
| GDPR/HIPAA compliance | Kan inte lagra konversationer |
| Multitenancy | Data får inte läcka mellan kliniker |
| Quality improvement | Behöver veta vad som fungerar |
| Knowledge gaps | Behöver identifiera vad som saknas i docs |

### Current Solutions (Limitations)

- **Stateless RAG:** Ingen inlärning alls
- **Full memory:** GDPR-risk, sekretessproblem
- **Manual analytics:** Tidskrävande, inte real-time

---

## Solution: Pattern Learner

### Core Concept

```
User Query ──► Anonymize ──► Extract Pattern ──► Aggregate
                  │
                  └── PII removed before storage
```

**What IS stored:**
- Topic clusters (anonymized)
- Query frequency distributions
- Response quality aggregates
- Knowledge gap indicators

**What is NOT stored:**
- User identities
- Clinic associations
- Patient data
- Raw conversations

---

## Architecture

### High-Level Design

```
┌─────────────────────────────────────────────────────────────┐
│                    Pattern Learner Module                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │   Anonymizer │───►│Topic Extractor│───►│  Aggregator  │  │
│  └──────────────┘    └──────────────┘    └──────────────┘  │
│         │                   │                    │          │
│         │                   ▼                    ▼          │
│         │           ┌──────────────┐    ┌──────────────┐   │
│         │           │Topic Embedder│    │ Stats Store  │   │
│         │           │  (MnemoCore) │    │  (Encrypted) │   │
│         │           └──────────────┘    └──────────────┘   │
│         │                   │                    │          │
│         └───────────────────┴────────────────────┘          │
│                             │                               │
│                             ▼                               │
│                    ┌──────────────┐                        │
│                    │  Insights API│                        │
│                    └──────────────┘                        │
│                                                              │
└─────────────────────────────────────────────────────────────┘
```

### Components

#### 1. Anonymizer

**Purpose:** Remove all PII before processing

**Methods:**
- Named Entity Recognition (NER) for person names
- Pattern matching for phone numbers, addresses
- Clinic/organization detection
- Session ID hashing

```python
class Anonymizer:
    """Remove PII from queries before pattern extraction"""
    
    def __init__(self):
        self.ner_model = load_ner_model("sv")  # Swedish
        self.patterns = {
            "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
            "email": r"[\w\.-]+@[\w\.-]+\.\w+",
            "personal_number": r"\d{6,8}[-\s]?\d{4}",
        }
    
    def anonymize(self, text: str) -> str:
        """Remove all PII from text"""
        
        # 1. NER for names
        entities = self.ner_model.extract(text)
        for entity in entities:
            if entity.type in ["PER", "ORG"]:
                text = text.replace(entity.text, "[ANON]")
        
        # 2. Pattern matching
        for pattern_type, pattern in self.patterns.items():
            text = re.sub(pattern, f"[{pattern_type.upper()}]", text)
        
        # 3. Remove clinic names (configurable blacklist)
        for clinic_name in self.clinic_blacklist:
            text = text.replace(clinic_name, "[KLINIK]")
        
        return text
```

---

#### 2. Topic Extractor

**Purpose:** Extract semantic topics from anonymized queries

**Methods:**
- Keyword extraction (TF-IDF)
- Topic modeling (LDA, BERTopic)
- Embedding-based clustering

```python
class TopicExtractor:
    """Extract topics from anonymized queries"""
    
    def __init__(self, mnemocore_engine):
        self.engine = mnemocore_engine
        self.topic_threshold = 0.5
    
    async def extract_topics(self, query: str) -> List[str]:
        """Extract topics from anonymized query"""
        
        # 1. Get keywords
        keywords = self._extract_keywords(query)
        
        # 2. Find similar topics in MnemoCore
        similar = await self.engine.query(query, top_k=5)
        
        # 3. Cluster into topics
        topics = []
        for memory_id, similarity in similar:
            if similarity > self.topic_threshold:
                memory = await self.engine.get_memory(memory_id)
                topics.extend(memory.metadata.get("topics", []))
        
        # 4. Deduplicate
        return list(set(topics + keywords))
    
    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords using TF-IDF"""
        # Simple implementation
        words = text.lower().split()
        return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV]
```

---

#### 3. Aggregator

**Purpose:** Store statistical patterns without PII

**Data structures:**

```python
@dataclass
class TopicStats:
    """Statistics for a topic"""
    topic: str
    count: int = 0
    first_seen: datetime = None
    last_seen: datetime = None
    trend: float = 0.0  # Recent increase/decrease

@dataclass
class ResponseQuality:
    """Aggregated response quality (no individual ratings)"""
    response_signature: str  # Hash of response template
    avg_rating: float = 0.5
    sample_count: int = 0
    last_updated: datetime = None

@dataclass
class KnowledgeGap:
    """Topics with no good answers"""
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0  # % of queries that got "I don't know"
    suggested_action: str = ""  # "add documentation", "improve answer"
```

**Storage:**

```python
class PatternStore:
    """Store patterns (encrypted, no PII)"""
    
    def __init__(self, encryption_key: bytes):
        self.key = encryption_key
        self.topics: Dict[str, TopicStats] = {}
        self.qualities: Dict[str, ResponseQuality] = {}
        self.gaps: Dict[str, KnowledgeGap] = {}
    
    def record_topic(self, topic: str):
        """Record that a topic was queried"""
        if topic not in self.topics:
            self.topics[topic] = TopicStats(
                topic=topic,
                first_seen=datetime.utcnow()
            )
        
        stats = self.topics[topic]
        stats.count += 1
        stats.last_seen = datetime.utcnow()
    
    def record_quality(self, response_sig: str, rating: int):
        """Record response quality (aggregated)"""
        if response_sig not in self.qualities:
            self.qualities[response_sig] = ResponseQuality(
                response_signature=response_sig
            )
        
        q = self.qualities[response_sig]
        # Exponential moving average
        q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0)
        q.sample_count += 1
        q.last_updated = datetime.utcnow()
    
    def record_gap(self, topic: str, had_answer: bool):
        """Record knowledge gap"""
        if topic not in self.gaps:
            self.gaps[topic] = KnowledgeGap(topic=topic)
        
        gap = self.gaps[topic]
        gap.query_count += 1
        if not had_answer:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1) + 1) / gap.query_count
        else:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1)) / gap.query_count
```

---

#### 4. Insights API

**Purpose:** Provide actionable insights to admins/developers

**Endpoints:**

```python
# GET /insights/topics?top_k=10
{
    "topics": [
        {"topic": "implantat", "count": 1250, "trend": 0.15},
        {"topic": "rotfyllning", "count": 980, "trend": -0.02},
        {"topic": "priser", "count": 850, "trend": 0.30}
    ],
    "period": "30d"
}

# GET /insights/gaps
{
    "knowledge_gaps": [
        {
            "topic": "tandreglering vuxna",
            "query_count": 145,
            "failure_rate": 0.85,
            "suggested_action": "add documentation"
        },
        {
            "topic": "akut tandvård",
            "query_count": 89,
            "failure_rate": 0.72,
            "suggested_action": "improve answer"
        }
    ]
}

# GET /insights/quality
{
    "top_responses": [
        {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520},
        {"signature": "def456", "avg_rating": 4.5, "sample_count": 340}
    ],
    "worst_responses": [
        {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45}
    ]
}
```

---

## MnemoCore Integration

### Usage Pattern

```python
from mnemocore import HAIMEngine
from mnemocore.pattern_learner import PatternLearner

# Initialize MnemoCore (stores topic embeddings)
engine = HAIMEngine(dimension=16384)
await engine.initialize()

# Initialize Pattern Learner
learner = PatternLearner(
    engine=engine,
    encryption_key=get_encryption_key(),
    anonymizer=Anonymizer()
)

# Process a query (automatic learning)
async def handle_query(user_query: str, tenant_id: str):
    # 1. Anonymize
    anon_query = learner.anonymize(user_query)
    
    # 2. Extract patterns (no PII)
    topics = await learner.extract_topics(anon_query)
    
    # 3. Record topic usage
    for topic in topics:
        learner.record_topic(topic)
    
    # 4. Get answer from RAG
    answer = await rag_lookup(anon_query)
    
    # 5. Record if we had an answer
    learner.record_gap(
        topic=topics[0] if topics else "unknown",
        had_answer=(answer is not None)
    )
    
    return answer

# Get insights (admin only)
async def get_dashboard():
    top_topics = learner.get_top_topics(10)
    gaps = learner.get_knowledge_gaps()
    quality = learner.get_response_quality()
    
    return {
        "popular_topics": top_topics,
        "needs_documentation": gaps,
        "response_performance": quality
    }
```

---

## GDPR Compliance

### Data Minimization

| Data Type | Stored? | Justification |
|-----------|---------|---------------|
| Raw queries | ❌ | PII risk |
| User IDs | ❌ | Not needed |
| Session IDs | ❌ | Not needed |
| Clinic IDs | ❌ | Not needed |
| **Topic labels** | ✅ | Anonymized |
| **Topic counts** | ✅ | Statistical |
| **Quality scores** | ✅ | Aggregated |
| **Gap indicators** | ✅ | Anonymized |

### Right to Erasure (GDPR Art 17)

Since no PII is stored, right to erasure is **automatically satisfied**.

### Data Retention

```python
# Configurable retention
retention_policy = {
    "topic_stats": "365d",  # Keep for 1 year
    "quality_scores": "90d",  # Keep for 3 months
    "gap_indicators": "30d",  # Refresh monthly
}

# Automatic cleanup
async def cleanup_old_patterns():
    cutoff = datetime.utcnow() - timedelta(days=retention_policy["topic_stats"])
    for topic, stats in learner.topics.items():
        if stats.last_seen < cutoff:
            del learner.topics[topic]
```

---

## Security Considerations

### Encryption

- All pattern data encrypted at rest (AES-256)
- Encryption keys managed via HSM or Azure Key Vault
- Per-tenant encryption optional (for multi-tenant isolation)

### Access Control

```python
# Insights API requires admin role
@app.get("/insights/topics")
@require_role("admin")
async def get_topics():
    return learner.get_top_topics(10)
```

### Audit Logging

```python
# Log all pattern access (not the patterns themselves)
async def log_access(user_id: str, endpoint: str, timestamp: datetime):
    await audit_log.store({
        "user_id": user_id,
        "endpoint": endpoint,
        "timestamp": timestamp.isoformat(),
        # No pattern data logged
    })
```

---

## Implementation Roadmap

### Phase 1: MVP (2 weeks)

- [ ] Anonymizer with Swedish NER
- [ ] Basic topic extraction (keywords)
- [ ] Topic counter (no MnemoCore yet)
- [ ] Simple insights API

### Phase 2: MnemoCore Integration (2 weeks)

- [ ] Topic embedding storage in MnemoCore
- [ ] Semantic topic clustering
- [ ] Gap detection using similarity search

### Phase 3: Quality Metrics (2 weeks)

- [ ] Response quality tracking
- [ ] Feedback integration
- [ ] Quality dashboard

### Phase 4: Production Hardening (2 weeks)

- [ ] Encryption at rest
- [ ] Access control
- [ ] Audit logging
- [ ] Performance optimization

---

## Business Value

### For Healthcare Organizations

| Value | Metric |
|-------|--------|
| **Documentation gaps** | Know what to add to knowledge base |
| **Popular topics** | Prioritize documentation efforts |
| **Response quality** | Improve user satisfaction |
| **Trend analysis** | Identify emerging needs |

### For Opus Dental (Competitive Advantage)

| Advantage | Value |
|-----------|-------|
| **Continuous improvement** | Chatbot gets smarter without storing PII |
| **Customer insights** | Know what clinics need |
| **Compliance by design** | GDPR-safe from day 1 |
| **Unique selling point** | "Learning chatbot" vs competitors |

---

## Technical Requirements

### Dependencies

```
mnemocore>=4.5.0
spacy[sv]>=3.7.0  # Swedish NER
numpy>=1.24.0
cryptography>=41.0.0  # Encryption
```

### Infrastructure

- MnemoCore instance (can be shared or per-tenant)
- Encrypted storage (Azure SQL, PostgreSQL with TDE)
- Optional: Azure Key Vault for key management

### Performance

- Topic extraction: <50ms per query
- Insights API: <200ms
- Storage: ~1KB per unique topic (highly efficient)

---

## Open Questions

1. **Topic granularity:** How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer"

2. **Trend detection:** What time window for trend analysis? 7d? 30d?

3. **Multi-language:** Support for Finnish/Norwegian in addition to Swedish?

4. **Tenant isolation:** Should patterns be shared across tenants (anonymized) or kept separate?

5. **Feedback mechanism:** How to collect ratings? Thumbs up/down? 1-5 stars?

---

## Conclusion

Pattern Learner enables **continuous improvement** of healthcare chatbots **without GDPR risk**. It learns what users ask about, which answers work, and where documentation is missing — all without storing any personal data.

**Key innovation:** Transform "memory" into "patterns" — compliance-safe learning.

---

## Next Steps

1. Review this spec
2. Decide on open questions
3. Prioritize MVP features
4. Start implementation

---

*Draft by Omega (GLM-5) for Robin Granberg*  
*2026-02-20*