# MnemoCore Pattern Learner: Specification Draft

**Version:** 0.1-draft  
**Date:** 2026-02-20  
**Status:** Draft for Review  
**Author:** Omega (GLM-5) for Robin Granberg

---

## Executive Summary

Pattern Learner is a MnemoCore module that learns from user interactions **without storing personal data**. It extracts statistical patterns, topic clusters, and quality metrics that can be used to improve chatbot performance over time.

**Key principle:** Learn patterns, forget people.

---

## Problem Statement

### Healthcare Chatbot Challenges

| Challenge | Consequence |
|-----------|-------------|
| GDPR/HIPAA compliance | Cannot store conversations |
| Multitenancy | Data must not leak between clinics |
| Quality improvement | Need to know what works |
| Knowledge gaps | Need to identify what is missing from the docs |

### Current Solutions (Limitations)

- **Stateless RAG:** No learning at all
- **Full memory:** GDPR risk, confidentiality problems
- **Manual analytics:** Time-consuming, not real-time

---

## Solution: Pattern Learner

### Core Concept

```
User Query ──► Anonymize ──► Extract Pattern ──► Aggregate
                  │
                  └── PII removed before storage
```

**What IS stored:**
- Topic clusters (anonymized)
- Query frequency distributions
- Response quality aggregates
- Knowledge gap indicators

**What is NOT stored:**
- User identities
- Clinic associations
- Patient data
- Raw conversations
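
As a rough illustration of the flow above, the following sketch chains the three stages using only the standard library. The regexes, the `STOPWORDS` set, and the function names are assumptions made for this example and are not part of MnemoCore; a real anonymizer would also need NER, as described under Components.

```python
import re
from collections import Counter

# Hypothetical, deliberately small patterns; real PII detection needs more than this.
EMAIL = re.compile(r"[\w.-]+@[\w.-]+\.\w+")
LONG_NUMBER = re.compile(r"\+?\d[\d\s-]{5,}\d")
PLACEHOLDER = re.compile(r"\[[A-Z_]+\]")
STOPWORDS = {"what", "does", "call", "mail"}

def anonymize(text: str) -> str:
    """Mask e-mail addresses and long digit runs (phone or ID numbers)."""
    text = EMAIL.sub("[EMAIL]", text)
    return LONG_NUMBER.sub("[NUMBER]", text)

def extract_pattern(text: str) -> list:
    """Drop placeholders, then keep words longer than 3 chars that are not stopwords."""
    text = PLACEHOLDER.sub(" ", text)
    words = re.findall(r"[a-zåäö]+", text.lower())
    return [w for w in words if len(w) > 3 and w not in STOPWORDS]

topic_counts = Counter()

def learn(raw_query: str) -> None:
    """Only the aggregate counter survives; the query itself is discarded."""
    topic_counts.update(extract_pattern(anonymize(raw_query)))

learn("What does an implant cost? Call me at 070-123 45 67")
learn("Implant cost, mail anna@example.com")
```

After both calls, `topic_counts` holds only word frequencies; neither the phone number nor the e-mail address is retained anywhere.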

---

## Architecture

### High-Level Design

```
┌──────────────────────────────────────────────────────────────┐
│                    Pattern Learner Module                    │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌───────────────┐    ┌───────────────┐    ┌───────────────┐ │
│  │  Anonymizer   │───►│Topic Extractor│───►│  Aggregator   │ │
│  └───────────────┘    └───────────────┘    └───────────────┘ │
│          │                    │                    │         │
│          │                    ▼                    ▼         │
│          │            ┌───────────────┐    ┌───────────────┐ │
│          │            │ Topic Embedder│    │  Stats Store  │ │
│          │            │  (MnemoCore)  │    │  (Encrypted)  │ │
│          │            └───────────────┘    └───────────────┘ │
│          │                    │                    │         │
│          └────────────────────┴────────────────────┘         │
│                               │                              │
│                               ▼                              │
│                       ┌───────────────┐                      │
│                       │ Insights API  │                      │
│                       └───────────────┘                      │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```

### Components

#### 1. Anonymizer

**Purpose:** Remove all PII before processing

**Methods:**
- Named Entity Recognition (NER) for person names
- Pattern matching for phone numbers, addresses
- Clinic/organization detection
- Session ID hashing

```python
import re
from typing import Iterable


class Anonymizer:
    """Remove PII from queries before pattern extraction"""

    def __init__(self, clinic_blacklist: Iterable[str] = ()):
        # load_ner_model is a placeholder for the project's NER loader
        self.ner_model = load_ner_model("sv")  # Swedish
        # Clinic names to mask (configurable blacklist)
        self.clinic_blacklist = list(clinic_blacklist)
        self.patterns = {
            "phone": r"\+?\d{1,3}[\s-]?\d{2,4}[\s-]?\d{2,4}[\s-]?\d{2,4}",
            "email": r"[\w\.-]+@[\w\.-]+\.\w+",
            "personal_number": r"\d{6,8}[-\s]?\d{4}",
        }

    def anonymize(self, text: str) -> str:
        """Remove all PII from text"""
        # 1. NER for person and organization names
        entities = self.ner_model.extract(text)
        for entity in entities:
            if entity.type in ["PER", "ORG"]:
                text = text.replace(entity.text, "[ANON]")

        # 2. Pattern matching
        for pattern_type, pattern in self.patterns.items():
            text = re.sub(pattern, f"[{pattern_type.upper()}]", text)

        # 3. Remove clinic names (configurable blacklist)
        for clinic_name in self.clinic_blacklist:
            text = text.replace(clinic_name, "[KLINIK]")

        return text
```
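
The methods list above mentions session ID hashing, which the `Anonymizer` class does not show. One common approach is a keyed hash (HMAC-SHA256): the same session always maps to the same opaque token, but the original ID cannot be recovered without the key. The function name, the `secret` parameter, and the 16-character truncation are assumptions for this sketch.

```python
import hashlib
import hmac

def hash_session_id(session_id: str, secret: bytes) -> str:
    """Return a stable, non-reversible token for a session ID (HMAC-SHA256)."""
    digest = hmac.new(secret, session_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability; keep the full digest if collisions matter

secret = b"rotate-me-regularly"  # placeholder; load from a key vault in practice
token_a = hash_session_id("session-42", secret)
token_b = hash_session_id("session-42", secret)
token_c = hash_session_id("session-43", secret)
```

Rotating the secret breaks linkability between periods. Since the spec lists session IDs as not stored, such tokens would presumably be held only transiently, for example to deduplicate queries within one conversation.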

---

#### 2. Topic Extractor

**Purpose:** Extract semantic topics from anonymized queries

**Methods:**
- Keyword extraction (TF-IDF)
- Topic modeling (LDA, BERTopic)
- Embedding-based clustering

```python
from typing import List

# Placeholder: the Swedish stopword set is expected to be supplied by the project
STOPWORDS_SV = set()


class TopicExtractor:
    """Extract topics from anonymized queries"""

    def __init__(self, mnemocore_engine):
        self.engine = mnemocore_engine
        self.topic_threshold = 0.5

    async def extract_topics(self, query: str) -> List[str]:
        """Extract topics from anonymized query"""
        # 1. Get keywords
        keywords = self._extract_keywords(query)

        # 2. Find similar topics in MnemoCore
        similar = await self.engine.query(query, top_k=5)

        # 3. Cluster into topics
        topics = []
        for memory_id, similarity in similar:
            if similarity > self.topic_threshold:
                memory = await self.engine.get_memory(memory_id)
                topics.extend(memory.metadata.get("topics", []))

        # 4. Deduplicate
        return list(set(topics + keywords))

    def _extract_keywords(self, text: str) -> List[str]:
        """Extract keywords (simple length and stopword filter; TF-IDF not implemented yet)"""
        words = text.lower().split()
        return [w for w in words if len(w) > 3 and w not in STOPWORDS_SV]
```
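
`_extract_keywords` above is a plain length and stopword filter. If actual TF-IDF weighting is wanted, a dependency-free version could look like the sketch below. The function name `tfidf_keywords`, the smoothed IDF formula, and the tiny corpus are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List

def tfidf_keywords(query: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Rank the words of `query` by TF-IDF against a corpus of anonymized queries."""
    docs = [set(doc.lower().split()) for doc in corpus]
    words = query.lower().split()
    tf = Counter(words)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in docs if word in doc)        # document frequency
        idf = math.log((1 + len(docs)) / (1 + df)) + 1.0  # smoothed IDF, never zero
        scores[word] = (count / len(words)) * idf
    # Highest score first; ties broken alphabetically for stable output
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]

corpus = [
    "what does an implant cost",
    "what does a root canal cost",
    "what does whitening cost",
    "is an x-ray needed",
]
keywords = tfidf_keywords("what does an implant cost", corpus, top_n=2)
```

Words that appear in most queries ("what", "does", "cost") score low, so the rarer word "implant" ranks first. A stopword filter would still be useful on top of this to drop function words such as "an".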

---

#### 3. Aggregator

**Purpose:** Store statistical patterns without PII

**Data structures:**

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class TopicStats:
    """Statistics for a topic"""
    topic: str
    count: int = 0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None
    trend: float = 0.0  # Recent increase/decrease


@dataclass
class ResponseQuality:
    """Aggregated response quality (no individual ratings)"""
    response_signature: str  # Hash of response template
    avg_rating: float = 0.5  # Normalized to 0-1 (rating / 5)
    sample_count: int = 0
    last_updated: Optional[datetime] = None


@dataclass
class KnowledgeGap:
    """Topics with no good answers"""
    topic: str
    query_count: int = 0
    failure_rate: float = 1.0  # Fraction (0-1) of queries that got "I don't know"
    suggested_action: str = ""  # "add documentation", "improve answer"
```
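
`response_signature` is described as a hash of the response template. One possible way to derive it is sketched below, assuming templates use `{placeholder}` fields and that whitespace and case differences should not change the signature. Both are assumptions, as are the function name and the 12-character length.

```python
import hashlib
import re

def response_signature(template: str, length: int = 12) -> str:
    """Hash a response template after normalizing case and whitespace."""
    normalized = re.sub(r"\s+", " ", template.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:length]

sig_a = response_signature("An implant costs about {price} SEK.")
sig_b = response_signature("  an implant   costs about {price} SEK. ")
sig_c = response_signature("A root canal costs about {price} SEK.")
```

Hashing the template rather than the filled-in response keeps user-specific values out of the signature, so ratings aggregate per template.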

**Storage:**

```python
from datetime import datetime
from typing import Dict


class PatternStore:
    """Store patterns (encrypted, no PII)"""

    def __init__(self, encryption_key: bytes):
        self.key = encryption_key
        self.topics: Dict[str, TopicStats] = {}
        self.qualities: Dict[str, ResponseQuality] = {}
        self.gaps: Dict[str, KnowledgeGap] = {}

    def record_topic(self, topic: str):
        """Record that a topic was queried"""
        if topic not in self.topics:
            self.topics[topic] = TopicStats(
                topic=topic,
                first_seen=datetime.utcnow()
            )

        stats = self.topics[topic]
        stats.count += 1
        stats.last_seen = datetime.utcnow()

    def record_quality(self, response_sig: str, rating: int):
        """Record response quality (aggregated)"""
        if response_sig not in self.qualities:
            self.qualities[response_sig] = ResponseQuality(
                response_signature=response_sig
            )

        q = self.qualities[response_sig]
        # Exponential moving average over ratings normalized from 1-5 to 0-1
        q.avg_rating = 0.9 * q.avg_rating + 0.1 * (rating / 5.0)
        q.sample_count += 1
        q.last_updated = datetime.utcnow()

    def record_gap(self, topic: str, had_answer: bool):
        """Record knowledge gap"""
        if topic not in self.gaps:
            self.gaps[topic] = KnowledgeGap(topic=topic)

        gap = self.gaps[topic]
        gap.query_count += 1
        # Running average of failures over all queries for this topic
        if not had_answer:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1) + 1) / gap.query_count
        else:
            gap.failure_rate = (gap.failure_rate * (gap.query_count - 1)) / gap.query_count
```
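
`TopicStats.trend` is declared but never updated by `PatternStore`. One way to fill it is to compare the query count in the most recent window with the window before it. The sketch assumes per-day counts are kept for each topic (which `PatternStore` does not do as written) and that a 7-day window is acceptable. Both are assumptions, as is the function name.

```python
from datetime import date, timedelta
from typing import Dict

def compute_trend(daily_counts: Dict[date, int], today: date, window_days: int = 7) -> float:
    """Relative change between the last `window_days` and the window before it."""
    def window_sum(end: date) -> int:
        return sum(daily_counts.get(end - timedelta(days=i), 0) for i in range(window_days))

    recent = window_sum(today)
    previous = window_sum(today - timedelta(days=window_days))
    if previous == 0:
        return 0.0 if recent == 0 else 1.0  # new topic: cap at +100%
    return (recent - previous) / previous

today = date(2026, 2, 20)
counts = {today - timedelta(days=i): 10 for i in range(7, 14)}       # previous week: 70
counts.update({today - timedelta(days=i): 13 for i in range(0, 7)})  # recent week: 91
trend = compute_trend(counts, today)
```

With 70 queries in the previous week and 91 in the recent one, the trend is (91 - 70) / 70 = 0.30, the same scale as the `trend` values in the Insights API examples.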

---

#### 4. Insights API

**Purpose:** Provide actionable insights to admins/developers

**Endpoints:**

```python
# GET /insights/topics?top_k=10
{
    "topics": [
        {"topic": "implantat", "count": 1250, "trend": 0.15},
        {"topic": "rotfyllning", "count": 980, "trend": -0.02},
        {"topic": "priser", "count": 850, "trend": 0.30}
    ],
    "period": "30d"
}

# GET /insights/gaps
{
    "knowledge_gaps": [
        {
            "topic": "tandreglering vuxna",
            "query_count": 145,
            "failure_rate": 0.85,
            "suggested_action": "add documentation"
        },
        {
            "topic": "akut tandvård",
            "query_count": 89,
            "failure_rate": 0.72,
            "suggested_action": "improve answer"
        }
    ]
}

# GET /insights/quality
{
    "top_responses": [
        {"signature": "abc123", "avg_rating": 4.8, "sample_count": 520},
        {"signature": "def456", "avg_rating": 4.5, "sample_count": 340}
    ],
    "worst_responses": [
        {"signature": "xyz789", "avg_rating": 2.1, "sample_count": 45}
    ]
}
```
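
The `suggested_action` values in the `/insights/gaps` example could be derived from the aggregates with a simple rule. The thresholds (`min_queries=20`, `0.8`, `0.5`) and the function name are assumptions, chosen only so that the sample payload above is reproduced.

```python
from typing import Dict, List, Tuple

def build_gap_report(
    gaps: Dict[str, Tuple[int, float]],  # topic -> (query_count, failure_rate)
    min_queries: int = 20,
) -> Dict[str, List[dict]]:
    """Turn aggregated gap statistics into the /insights/gaps response shape."""
    report = []
    for topic, (query_count, failure_rate) in gaps.items():
        if query_count < min_queries or failure_rate < 0.5:
            continue  # too little data, or the bot mostly copes
        action = "add documentation" if failure_rate >= 0.8 else "improve answer"
        report.append({
            "topic": topic,
            "query_count": query_count,
            "failure_rate": round(failure_rate, 2),
            "suggested_action": action,
        })
    report.sort(key=lambda g: g["failure_rate"], reverse=True)
    return {"knowledge_gaps": report}

payload = build_gap_report({
    "tandreglering vuxna": (145, 0.85),
    "akut tandvård": (89, 0.72),
    "priser": (850, 0.10),
})
```

The `min_queries` floor keeps a topic asked about once or twice from being reported as a gap on thin evidence.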

---

## MnemoCore Integration

### Usage Pattern

```python
from mnemocore import HAIMEngine
from mnemocore.pattern_learner import PatternLearner

# Initialize MnemoCore (stores topic embeddings)
engine = HAIMEngine(dimension=16384)
await engine.initialize()

# Initialize Pattern Learner
learner = PatternLearner(
    engine=engine,
    encryption_key=get_encryption_key(),
    anonymizer=Anonymizer()
)

# Process a query (automatic learning)
async def handle_query(user_query: str, tenant_id: str):
    # 1. Anonymize
    anon_query = learner.anonymize(user_query)

    # 2. Extract patterns (no PII)
    topics = await learner.extract_topics(anon_query)

    # 3. Record topic usage
    for topic in topics:
        learner.record_topic(topic)

    # 4. Get answer from RAG
    answer = await rag_lookup(anon_query)

    # 5. Record if we had an answer
    learner.record_gap(
        topic=topics[0] if topics else "unknown",
        had_answer=(answer is not None)
    )

    return answer

# Get insights (admin only)
async def get_dashboard():
    top_topics = learner.get_top_topics(10)
    gaps = learner.get_knowledge_gaps()
    quality = learner.get_response_quality()

    return {
        "popular_topics": top_topics,
        "needs_documentation": gaps,
        "response_performance": quality
    }
```

---

## GDPR Compliance

### Data Minimization

| Data Type | Stored? | Justification |
|-----------|---------|---------------|
| Raw queries | ❌ | PII risk |
| User IDs | ❌ | Not needed |
| Session IDs | ❌ | Not needed |
| Clinic IDs | ❌ | Not needed |
| **Topic labels** | ✅ | Anonymized |
| **Topic counts** | ✅ | Statistical |
| **Quality scores** | ✅ | Aggregated |
| **Gap indicators** | ✅ | Anonymized |

### Right to Erasure (GDPR Art 17)

Since no PII is stored, there is no personal data to locate or delete on request, so the right to erasure is **automatically satisfied**.

### Data Retention

```python
from datetime import datetime, timedelta

# Configurable retention (in days)
retention_policy = {
    "topic_stats": 365,     # Keep for 1 year
    "quality_scores": 90,   # Keep for 3 months
    "gap_indicators": 30,   # Refresh monthly
}

# Automatic cleanup
async def cleanup_old_patterns():
    cutoff = datetime.utcnow() - timedelta(days=retention_policy["topic_stats"])
    # Iterate over a snapshot so entries can be deleted during the loop
    for topic, stats in list(learner.topics.items()):
        if stats.last_seen < cutoff:
            del learner.topics[topic]
```
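
The cleanup above covers topic statistics only. A generic helper that applies one retention period to any of the stores could look like this. `purge_expired` and its `timestamp_of` parameter are assumptions, not part of the spec.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, Optional, TypeVar

T = TypeVar("T")

def purge_expired(
    store: Dict[str, T],
    timestamp_of: Callable[[T], Optional[datetime]],
    max_age_days: int,
    now: datetime,
) -> int:
    """Delete entries older than `max_age_days`; return how many were removed."""
    cutoff = now - timedelta(days=max_age_days)
    expired = []
    for key, item in store.items():
        ts = timestamp_of(item)
        if ts is not None and ts < cutoff:
            expired.append(key)
    for key in expired:  # delete after iterating to avoid mutating the dict mid-loop
        del store[key]
    return len(expired)

now = datetime(2026, 2, 20)
store = {"implantat": datetime(2026, 2, 1), "priser": datetime(2024, 1, 1)}
removed = purge_expired(store, timestamp_of=lambda ts: ts, max_age_days=365, now=now)
```

For the topic store, `timestamp_of` would be something like `lambda stats: stats.last_seen`; for quality scores, `lambda q: q.last_updated`. `KnowledgeGap` carries no timestamp as specified, so gap indicators would need one added before they can expire.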

---

## Security Considerations

### Encryption

- All pattern data encrypted at rest (AES-256)
- Encryption keys managed via HSM or Azure Key Vault
- Per-tenant encryption optional (for multi-tenant isolation)
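
For the optional per-tenant encryption, one option is to derive each tenant's 256-bit key from a single master key with HKDF (RFC 5869) rather than storing one key per tenant. The sketch below implements HKDF-SHA256 for a single 32-byte output block using only the standard library. The function name, the salt, and the `info` label format are assumptions, and the master key would come from the HSM or Key Vault in practice.

```python
import hashlib
import hmac

def derive_tenant_key(master_key: bytes, tenant_id: str, salt: bytes = b"pattern-learner") -> bytes:
    """HKDF-SHA256 (extract-then-expand), one 32-byte block, bound to a tenant ID."""
    prk = hmac.new(salt, master_key, hashlib.sha256).digest()      # extract step
    info = b"tenant:" + tenant_id.encode("utf-8")
    return hmac.new(prk, info + b"\x01", hashlib.sha256).digest()  # expand step, block T(1)

master = bytes(32)  # placeholder all-zero key, for the example only
key_a = derive_tenant_key(master, "clinic-a")
key_b = derive_tenant_key(master, "clinic-b")
```

The derived 32-byte keys are the right size for AES-256. Because derivation is deterministic, only the master key needs protecting and rotating.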

### Access Control

```python
# Insights API requires admin role
@app.get("/insights/topics")
@require_role("admin")
async def get_topics():
    return learner.get_top_topics(10)
```
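
`require_role` is not defined in this spec. A framework-agnostic version could be a decorator that checks the caller's roles before running the handler. How the current user is obtained depends on the web framework; here it is passed as a `user` keyword argument, which is an assumption, as are the `User` class and the use of `PermissionError`.

```python
import asyncio
import functools
from dataclasses import dataclass, field
from typing import Set

@dataclass
class User:
    name: str
    roles: Set[str] = field(default_factory=set)

def require_role(role: str):
    """Reject the call unless the `user` keyword argument carries `role`."""
    def decorator(handler):
        @functools.wraps(handler)
        async def wrapper(*args, user: User, **kwargs):
            if role not in user.roles:
                raise PermissionError(f"role '{role}' required")
            return await handler(*args, **kwargs)
        return wrapper
    return decorator

@require_role("admin")
async def get_topics():
    return [{"topic": "implantat", "count": 1250}]

admin = User("alice", {"admin"})
guest = User("bob")
result = asyncio.run(get_topics(user=admin))
try:
    asyncio.run(get_topics(user=guest))
    denied = False
except PermissionError:
    denied = True
```

In a real deployment the wrapper would typically translate the failure into an HTTP 403 response rather than let the exception propagate.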

### Audit Logging

```python
# Log all pattern access (not the patterns themselves)
async def log_access(user_id: str, endpoint: str, timestamp: datetime):
    await audit_log.store({
        "user_id": user_id,
        "endpoint": endpoint,
        "timestamp": timestamp.isoformat(),
        # No pattern data logged
    })
```

---

## Implementation Roadmap

### Phase 1: MVP (2 weeks)

- [ ] Anonymizer with Swedish NER
- [ ] Basic topic extraction (keywords)
- [ ] Topic counter (no MnemoCore yet)
- [ ] Simple insights API

### Phase 2: MnemoCore Integration (2 weeks)

- [ ] Topic embedding storage in MnemoCore
- [ ] Semantic topic clustering
- [ ] Gap detection using similarity search

### Phase 3: Quality Metrics (2 weeks)

- [ ] Response quality tracking
- [ ] Feedback integration
- [ ] Quality dashboard

### Phase 4: Production Hardening (2 weeks)

- [ ] Encryption at rest
- [ ] Access control
- [ ] Audit logging
- [ ] Performance optimization

---

## Business Value

### For Healthcare Organizations

| Value | Metric |
|-------|--------|
| **Documentation gaps** | Know what to add to knowledge base |
| **Popular topics** | Prioritize documentation efforts |
| **Response quality** | Improve user satisfaction |
| **Trend analysis** | Identify emerging needs |

### For Opus Dental (Competitive Advantage)

| Advantage | Value |
|-----------|-------|
| **Continuous improvement** | Chatbot gets smarter without storing PII |
| **Customer insights** | Know what clinics need |
| **Compliance by design** | GDPR-safe from day 1 |
| **Unique selling point** | "Learning chatbot" vs competitors |

---

## Technical Requirements

### Dependencies

```
mnemocore>=4.5.0
spacy>=3.7.0  # Swedish NER (Swedish model, e.g. sv_core_news_sm, installed separately)
numpy>=1.24.0
cryptography>=41.0.0  # Encryption
```

### Infrastructure

- MnemoCore instance (can be shared or per-tenant)
- Encrypted storage (Azure SQL, PostgreSQL with TDE)
- Optional: Azure Key Vault for key management

### Performance

- Topic extraction: <50ms per query
- Insights API: <200ms
- Storage: ~1KB per unique topic (highly efficient)

---

## Open Questions

1. **Topic granularity:** How specific should topics be? "Implantat" vs "Implantat pris" vs "Implantat komplikationer"

2. **Trend detection:** What time window for trend analysis? 7d? 30d?

3. **Multi-language:** Support for Finnish/Norwegian in addition to Swedish?

4. **Tenant isolation:** Should patterns be shared across tenants (anonymized) or kept separate?

5. **Feedback mechanism:** How to collect ratings? Thumbs up/down? 1-5 stars?

---

## Conclusion

Pattern Learner enables **continuous improvement** of healthcare chatbots **without GDPR risk**. It learns what users ask about, which answers work, and where documentation is missing, all without storing any personal data.

**Key innovation:** Transform "memory" into "patterns" for compliance-safe learning.

---

## Next Steps

1. Review this spec
2. Decide on open questions
3. Prioritize MVP features
4. Start implementation

---

*Draft by Omega (GLM-5) for Robin Granberg*  
*2026-02-20*