
# Concept Atlas: Methods & Application Guide
### German Curriculum Semantic Analysis — Technical & Pedagogical Documentation

> **Space:** [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)  
> **Focus concepts:** *mensch · verhalten · evolution*  
> **Model:** `paraphrase-multilingual-mpnet-base-v2`

---

## Table of Contents

1. [Overview & Motivation](#1-overview--motivation)
2. [Data: The Curriculum Corpus](#2-data-the-curriculum-corpus)
3. [Pipeline Architecture](#3-pipeline-architecture)
4. [Multilingual Sentence Embeddings](#4-multilingual-sentence-embeddings)
5. [Dimensionality Reduction: UMAP](#5-dimensionality-reduction-umap)
6. [Topic Modeling: BERTopic](#6-topic-modeling-bertopic)
7. [Information-Theoretic Measures](#7-information-theoretic-measures)
8. [Graph-Theoretic Analysis](#8-graph-theoretic-analysis)
9. [Cross-Concept Comparison](#9-cross-concept-comparison)
10. [State-Level Variation](#10-state-level-variation)
11. [Caching & Reproducibility](#11-caching--reproducibility)
12. [Educational Applications](#12-educational-applications)
13. [Decentralized Research Model](#13-decentralized-research-model)
14. [Limitations & Ethical Considerations](#14-limitations--ethical-considerations)
15. [Glossary](#15-glossary)

---

## 1. Overview & Motivation

### What this project does

The **Concept Atlas** is a computational tool for exploring how key biological
and humanistic concepts are framed across German school curricula. Rather than
reading curriculum documents manually — a task that scales poorly across 16
federal states (*Bundesländer*), dozens of subjects, and multiple grade levels —
this tool uses modern Natural Language Processing (NLP) to:

- **Map** the semantic landscape of curriculum language around three focus
  concepts: *Mensch* (human), *Verhalten* (behaviour), and *Evolution*
- **Cluster** excerpts into coherent topics without any pre-defined categories
- **Compare** how these concepts relate to each other mathematically
- **Detect** variation in framing between federal states

### Why these three concepts?

*Mensch*, *Verhalten*, and *Evolution* occupy a uniquely contested intersection
in German science education. They appear across Biology, Ethics, Social Studies,
Religion, and Psychology curricula — often with very different emphases
depending on subject context and state. This makes them ideal test cases for
computational curriculum analysis:

| Concept | Why it matters |
|---|---|
| **Mensch** | Bridges biological and humanistic framings; appears in nearly every subject |
| **Verhalten** | Links ethological science to social norms and moral education |
| **Evolution** | Scientifically precise in Biology; contested or reframed in other subjects |

### Who is this for?

- **Curriculum researchers** seeking scalable, reproducible analysis tools
- **Science educators** interested in how their subject's language compares
  across states or disciplines
- **Policy analysts** investigating curricular coherence and equity
- **Graduate students** learning applied NLP for educational research
- **Open science advocates** interested in decentralized, community-driven
  research infrastructure

---

## 2. Data: The Curriculum Corpus

### Source

The corpus consists of text excerpts drawn from publicly available German school
curriculum documents (*Lehrpläne* and *Bildungspläne*) across multiple federal
states. Each excerpt was retrieved by keyword search and stored as a structured
CSV file.

### Structure

Each row in `curriculum_excerpts.csv` represents one curriculum excerpt:

| Column | Description |
|---|---|
| `search_term` | The keyword used to retrieve the excerpt (e.g. `mensch`) |
| `text_excerpt` | The raw curriculum text (sentence to paragraph length) |
| `state` | German federal state (*Bundesland*) |
| `subject` | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| `grade` | Target grade level or band |
| `school_type` | School type (e.g. Gymnasium, Realschule) |

### Preprocessing

Before analysis, the corpus undergoes the following cleaning steps:


```
Raw CSV
  → Normalise column names (lowercase, underscores)
  → Fill missing values with empty strings
  → Add missing optional columns (state, subject, grade, school_type)
  → Strip whitespace from text_excerpt and search_term
  → Remove excerpts shorter than 20 characters
  → Derive search_term_lower for case-insensitive concept matching
```


Concept subsets are built by **exact match** on `search_term_lower`, with
automatic fallback to **partial string match** if fewer than 10 exact matches
are found. This handles spelling variants and compound words.
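These cleaning and matching steps can be sketched in pandas (a minimal illustration; the function names `clean_corpus` and `concept_subset` are hypothetical, not the Space's actual API):

```python
import pandas as pd

OPTIONAL_COLS = ["state", "subject", "grade", "school_type"]

def clean_corpus(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the preprocessing steps listed above to a raw excerpt table."""
    # Normalise column names: lowercase with underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Add missing optional columns, then fill missing values with empty strings
    for col in OPTIONAL_COLS:
        if col not in df.columns:
            df[col] = ""
    df = df.fillna("")
    # Strip whitespace from the text fields
    df["text_excerpt"] = df["text_excerpt"].str.strip()
    df["search_term"] = df["search_term"].str.strip()
    # Remove excerpts shorter than 20 characters
    df = df[df["text_excerpt"].str.len() >= 20].copy()
    # Case-insensitive key for concept matching
    df["search_term_lower"] = df["search_term"].str.lower()
    return df.reset_index(drop=True)

def concept_subset(df: pd.DataFrame, concept: str, min_exact: int = 10) -> pd.DataFrame:
    """Exact match on search_term_lower, with fallback to partial string match."""
    exact = df[df["search_term_lower"] == concept]
    if len(exact) >= min_exact:
        return exact
    return df[df["search_term_lower"].str.contains(concept, regex=False)]
```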

### Scale considerations

The analysis is designed to work with corpora ranging from a few hundred to
tens of thousands of excerpts. Embedding and topic modeling parameters scale
automatically:

- `n_neighbors` in UMAP is capped at `min(15, n_samples - 1)`
- `min_cluster_size` in HDBSCAN is set to `min(5, max(2, n // 10))`
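The adaptive scaling above, as a small pure-Python helper (a sketch of the quoted formulas; the helper name is illustrative):

```python
def adaptive_params(n_samples: int) -> dict:
    """Scale UMAP / HDBSCAN parameters to the corpus size."""
    return {
        # UMAP neighbourhood size, capped for very small corpora
        "n_neighbors": min(15, n_samples - 1),
        # HDBSCAN minimum cluster size: roughly n/10, clamped to [2, 5]
        "min_cluster_size": min(5, max(2, n_samples // 10)),
    }
```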

---

## 3. Pipeline Architecture

The full analysis pipeline runs sequentially in a single click. All
computationally expensive steps are cached to disk so that subsequent
exploration is instantaneous.


```
CSV ingestion
      │
      ▼
Sentence embeddings (per concept)
      │
      ├──► UMAP 2-D  (visualization)
      ├──► UMAP 3-D  (atlas & joint space)
      └──► BERTopic
                ├──► Topic labels per excerpt
                ├──► Top words per topic
                └──► Topic probability distributions
                            ├──► Shannon entropy
                            ├──► Jensen-Shannon divergence (cross-concept)
                            ├──► Jensen-Shannon divergence (cross-state)
                            └──► Cosine similarity (centroid comparison)

Semantic kNN graph (per concept)
      ├──► Betweenness centrality
      ├──► PageRank
      ├──► Closeness centrality
      ├──► Louvain communities
      ├──► Network density
      └──► Average clustering coefficient

Enriched parquet export
      └──► data/enriched_corpus.csv
           cache/enriched_corpus.parquet
```


### Caching strategy

Every expensive computation writes a `.npy` (arrays) or `.json` (metadata)
file to `./cache/`, keyed by a combination of:

- The logical name of the artefact (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string (prevents filename collisions)

On re-launch, the pipeline checks for cached files first and skips recomputation
entirely if they exist. This makes the Space fast for end users while keeping
the first-run cost affordable.
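A minimal sketch of this keying scheme (function and directory names are illustrative, not the Space's exact code):

```python
import hashlib
from pathlib import Path

import numpy as np

CACHE_DIR = Path("./cache")

def cache_path(name: str, n_texts: int, ext: str = "npy") -> Path:
    """Deterministic cache filename: artefact name + corpus size + MD5 digest."""
    key = f"{name}_{n_texts}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()[:8]
    return CACHE_DIR / f"{key}_{digest}.{ext}"

def load_or_compute(name: str, n_texts: int, compute_fn):
    """Skip recomputation entirely when a cached artefact already exists."""
    path = cache_path(name, n_texts)
    if path.exists():
        return np.load(path)
    result = compute_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    np.save(path, result)
    return result
```

Because the corpus size is part of the key, a changed corpus produces a different filename, so stale artefacts are simply never loaded.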

---

## 4. Multilingual Sentence Embeddings

### What is an embedding?

An **embedding** is a list of numbers (a vector) that represents the meaning
of a piece of text. Texts with similar meanings produce vectors that are close
together in mathematical space; texts with different meanings produce vectors
that are far apart.

The Concept Atlas uses vectors with **768 dimensions** — each excerpt becomes
a point in a 768-dimensional semantic space.

### Model: `paraphrase-multilingual-mpnet-base-v2`

This model is a **Sentence Transformer** — a neural network fine-tuned
specifically to produce high-quality sentence-level representations. Key
properties relevant to this project:

| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native — no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere) → cosine similarity = dot product |
| License | Apache 2.0 |

### Why multilingual?

German curriculum language is domain-specific and contains compound words,
technical terms, and pedagogical jargon that generic German models may handle
poorly. A multilingual model trained on diverse paraphrase data learns to
recognize *semantic equivalence* — whether two differently worded passages mean
the same thing — better than a plain language-model baseline, which is exactly
what is needed to group thematically similar curriculum excerpts.

### Practical interpretation

> Two excerpts with a **high cosine similarity** (close to 1.0) make similar
> semantic claims or describe similar content — even if they use different words.
> Two excerpts with **low cosine similarity** (close to 0) occupy different
> regions of conceptual space.
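Because the vectors are L2-normalized, cosine similarity reduces to a plain dot product. A dependency-free numpy illustration with toy 4-D vectors standing in for the 768-D embeddings (real embeddings would come from the sentence-transformers `encode` call, not shown here):

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    """Project a vector onto the unit sphere."""
    return v / np.linalg.norm(v)

# Toy 4-D stand-ins for 768-D sentence embeddings
a = l2_normalize(np.array([1.0, 2.0, 0.0, 1.0]))
b = l2_normalize(np.array([1.0, 2.0, 0.1, 1.0]))   # nearly identical to a
c = l2_normalize(np.array([-2.0, 0.5, 3.0, -1.0])) # very different from a

# For unit vectors, the dot product *is* the cosine similarity
sim_ab = float(a @ b)   # close to 1: "similar meaning"
sim_ac = float(a @ c)   # near or below 0: "different meaning"
```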

---

## 5. Dimensionality Reduction: UMAP

### The dimensionality problem

768-dimensional vectors cannot be visualized directly. **UMAP** (Uniform
Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while
preserving as much of the neighbourhood structure as possible — meaning that
points that were close in 768-D tend to remain close after reduction.

### Two projections are computed

| Projection | Dimensions | Purpose |
|---|---|---|
| **UMAP 2-D** | 2 | Interactive scatter plots, BERTopic visualization |
| **UMAP 3-D** | 3 | Atlas visualization, joint concept space |

A **joint 3-D UMAP** is also computed across all three concept corpora combined,
placing *mensch*, *verhalten*, and *evolution* excerpts in a single shared
semantic space for direct comparison.

### Key parameters

| Parameter | Value | Effect |
|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global structure balance |
| `min_dist` | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| `metric` | cosine | Appropriate for normalized embeddings |
| `random_state` | 42 | Ensures reproducible layouts |

### Accessible interpretation

> Think of UMAP as making a **map of meaning**. Just as a geographic map
> compresses the curved surface of the Earth onto a flat page while preserving
> relative distances between cities, UMAP compresses the high-dimensional
> semantic space onto a 2-D or 3-D canvas while preserving which excerpts are
> semantically "nearby."
>
> **Clusters** of points in a UMAP plot indicate groups of excerpts that
> discuss similar ideas. **Gaps** between clusters indicate distinct conceptual
> sub-areas. The absolute position of a cluster has no meaning — only
> relative distances matter.

---

## 6. Topic Modeling: BERTopic

### What is topic modeling?

Topic modeling is an **unsupervised** method for discovering thematic groups
within a text collection — without being told in advance what the themes are.
Traditional methods (e.g. LDA) work on word co-occurrence statistics.
**BERTopic** uses pre-computed sentence embeddings, which means it understands
*meaning* rather than just *word frequency*.

### Pipeline within BERTopic


```
Sentence embeddings (768-D)
            ↓
UMAP reduction (→ 5-D internal space)
            ↓
HDBSCAN clustering
            ↓
c-TF-IDF topic representation
(class-based TF-IDF: finds words that
 distinguish this topic from all others)
            ↓
Topic labels + per-document probabilities
```


### HDBSCAN: density-based clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with
Noise) finds clusters as **dense regions** in the embedding space, separated
by sparser regions. Key advantages for curriculum text:

- Does **not** require specifying the number of clusters in advance
- Naturally handles **outliers** (assigned topic `-1`)
- Finds clusters of **variable size and shape**

### Parameters used

| Parameter | Value | Rationale |
|---|---|---|
| `min_cluster_size` | 5 (adaptive) | Minimum excerpts to form a topic |
| `min_samples` | `max(1, min_cluster_size // 2)` | Controls noise sensitivity |
| `cluster_selection_method` | `eom` | Excess of Mass — finds stable clusters |
| `n_gram_range` | (1, 2) | Single words and two-word phrases as features |

### Reading the results

Each topic is characterized by its **top words** — terms with the highest
c-TF-IDF scores for that cluster. These are words that appear frequently in
the topic *and* rarely in other topics, making them highly discriminating.

**Topic -1** is always the outlier category: excerpts that did not fit
confidently into any discovered cluster. A high outlier rate may indicate
either genuine semantic diversity or insufficient data for that concept.

### Silhouette score

The **silhouette score** measures how well-separated the discovered clusters
are, ranging from -1 (poor) to +1 (excellent):

- **> 0.5**: well-separated, meaningful topics
- **0.2–0.5**: moderate separation — topics exist but overlap
- **< 0.2**: clusters are not well-defined; interpret with caution
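A toy illustration with scikit-learn: two tight, well-separated synthetic clusters should score near +1 (the pipeline computes the same statistic on the real reduced embeddings):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)
# Two tight, well-separated synthetic clusters in 2-D
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(20, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(20, 2))
X = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 20 + [1] * 20)

score = silhouette_score(X, labels)  # well-separated → close to +1
```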

---

## 7. Information-Theoretic Measures

Information theory provides a principled mathematical language for measuring
**diversity**, **surprise**, and **difference** in distributions. The Concept
Atlas applies three core measures.

### 7.1 Shannon Entropy

$$H(X) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$

**What it measures:** How evenly curriculum excerpts are spread across the
discovered topics. High entropy = many topics of roughly equal size (diverse
framing). Low entropy = one or two dominant topics (concentrated framing).

**Interpretation guide:**

| Entropy | Meaning |
|---|---|
| **Low** (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| **Medium** (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| **High** (> 2.5 bits) | Highly diverse; concept is used across many different framings |

> **Accessible analogy:** Imagine rolling a die. If it always lands on 6
> (entropy = 0), you learn nothing new from each roll. If all faces are equally
> likely (maximum entropy), each roll is maximally informative. Curriculum
> entropy works the same way — high entropy means the concept is used in
> many genuinely different ways.
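The formula above, computed directly over a topic distribution with numpy (using the usual convention that 0 · log₂ 0 = 0):

```python
import numpy as np

def shannon_entropy(p) -> float:
    """Shannon entropy in bits of a (possibly unnormalized) distribution."""
    p = np.asarray(p, dtype=float)
    p = p / p.sum()                 # ensure the distribution sums to 1
    nz = p[p > 0]                   # convention: 0 * log2(0) = 0
    return float(-np.sum(nz * np.log2(nz)))

# One dominant topic → low entropy; uniform topics → maximum entropy
concentrated = shannon_entropy([0.97, 0.01, 0.01, 0.01])
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
```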

### 7.2 Jensen-Shannon Divergence (JSD)

$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.

**What it measures:** The *difference* between two probability distributions
over topics. JSD is symmetric (order doesn't matter), bounded in [0, 1] when
computed with base-2 logarithms, and always finite (unlike raw KL divergence,
which is undefined wherever Q assigns zero probability and P does not).

**Used in two contexts:**

1. **Cross-concept JSD:** Do *mensch*, *verhalten*, and *evolution* have
   similar topic distributions? JSD near 0 means yes; near 1 means they
   occupy entirely different topical spaces.

2. **Cross-state JSD:** Do two federal states frame the same concept similarly?
   High JSD between states indicates curricular divergence; low JSD indicates
   convergence.

**Interpretation:**

| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
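A direct implementation of the formula above with base-2 logarithms, so the result lands in [0, 1]. (If you reach for `scipy.spatial.distance.jensenshannon` instead, note that it returns the square root of this divergence.)

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                    # 0 * log(0/q) = 0 by convention
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def jsd(p, q) -> float:
    """Jensen-Shannon divergence in bits, bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)               # the mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```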

### 7.3 Cosine Similarity (Embedding Centroids)

$$\text{sim}(A, B) = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\| \|\bar{b}\|}$$

where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of
two concept corpora.

**What it measures:** Whether the *average semantic content* of two concept
corpora occupies the same region of embedding space. This is complementary to
JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD
operates on the topic distribution (post-clustering).

**Interpretation:**

| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic context |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
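A sketch of the centroid comparison. Note that the mean of unit vectors is generally not itself a unit vector, so the full cosine formula is needed here:

```python
import numpy as np

def centroid_cosine(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between the mean embeddings of two concept corpora."""
    a_bar = emb_a.mean(axis=0)      # centroid of corpus A
    b_bar = emb_b.mean(axis=0)      # centroid of corpus B
    return float(a_bar @ b_bar / (np.linalg.norm(a_bar) * np.linalg.norm(b_bar)))
```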

---

## 8. Graph-Theoretic Analysis

### Why model curriculum text as a graph?

A graph (network) makes the **relational structure** of a corpus explicit.
Instead of treating each excerpt independently, a graph reveals which excerpts
are semantically central, which bridge different topic areas, and how the
corpus is organized into communities.

### Construction: k-Nearest Neighbour Graph

For each concept corpus, a graph $G = (V, E)$ is constructed where:

- **Nodes** $V$: each curriculum excerpt
- **Edges** $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity
  exceeds a threshold (≥ 0.35) *and* $j$ is among the $k=6$ nearest neighbours
  of $i$
- **Edge weights**: the cosine similarity value

This creates a **sparse similarity graph** that captures local semantic
neighbourhood structure.
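A sketch of this construction with networkx, assuming L2-normalized embeddings so the similarity matrix is a plain matrix product (parameter defaults follow the values stated above):

```python
import numpy as np
import networkx as nx

def knn_graph(embeddings: np.ndarray, k: int = 6, threshold: float = 0.35) -> nx.Graph:
    """Sparse similarity graph: edge i—j if j is in i's top-k and sim >= threshold."""
    n = len(embeddings)
    sims = embeddings @ embeddings.T          # cosine sims for unit vectors
    G = nx.Graph()
    G.add_nodes_from(range(n))
    for i in range(n):
        # nearest neighbours of i by similarity, excluding i itself
        order = np.argsort(-sims[i])
        neighbours = [j for j in order if j != i][:k]
        for j in neighbours:
            if sims[i, j] >= threshold:
                G.add_edge(i, j, weight=float(sims[i, j]))
    return G
```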

### Measures computed

#### Betweenness Centrality
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t|v)}{\sigma(s,t)}$$

Measures how often a node lies on the shortest path between other nodes.
**High betweenness** = the excerpt is a semantic "bridge" between different
topic areas. In curriculum terms, bridge excerpts often contain integrative
or interdisciplinary language.

#### PageRank
Iteratively assigns importance based on the importance of neighbours.
**High PageRank** = the excerpt is linked to by many other well-connected
excerpts. PageRank hubs represent semantically central curriculum statements
to which many other excerpts are conceptually close.

#### Closeness Centrality
Measures how quickly a node can reach all others via the graph.
**High closeness** = the excerpt is semantically accessible from most others —
a "general" or bridging statement.

#### Network Density
$$d = \frac{2|E|}{|V|(|V|-1)}$$

The fraction of all possible edges that actually exist. Higher density
indicates a more semantically cohesive corpus (most excerpts are near most
others). Lower density indicates a more fragmented semantic space.

#### Average Clustering Coefficient
Measures the tendency of nodes to form tightly connected local groups.
High clustering = the corpus has tight semantic sub-communities.

#### Louvain Community Detection
Partitions the graph into communities that maximize **modularity** — the
degree to which within-community connections exceed what would be expected
by chance. Communities in a curriculum semantic graph often correspond to
distinct disciplinary or contextual framings of a concept.
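The measures above on a small toy graph, computed with networkx (Louvain via `nx.community.louvain_communities`, available in recent networkx releases):

```python
import networkx as nx

# Toy graph: two triangles joined through a single bridge node (6)
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (0, 2),        # community A
                  (3, 4), (4, 5), (3, 5),        # community B
                  (2, 6), (6, 3)])               # node 6 bridges A and B

betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)
closeness = nx.closeness_centrality(G)
density = nx.density(G)
avg_clustering = nx.average_clustering(G)
communities = nx.community.louvain_communities(G, seed=42)
```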

### Accessible interpretation

> Think of the semantic graph as a **social network of curriculum excerpts**,
> where two excerpts are "friends" if they discuss similar ideas. 
>
> - **Hubs** (high PageRank) are the popular, central ideas that most other 
>   ideas are related to.
> - **Bridges** (high betweenness) are the connectors — excerpts that link 
>   otherwise separate clusters of ideas.
> - **Communities** are cliques of mutually similar excerpts — effectively 
>   the curriculum's implicit sub-topics.

---

## 9. Cross-Concept Comparison

The cross-concept analysis addresses the core research question:
**How do *mensch*, *verhalten*, and *evolution* relate to each other in
German curriculum language?**

Three complementary lenses are applied:

### Lens 1: Geometric (Cosine Similarity Matrix)

Computes the cosine similarity between the **mean embedding vectors** of each
concept corpus. This is a purely geometric measure — it asks whether the
average document in one concept space is semantically close to the average
document in another.

### Lens 2: Distributional (Jensen-Shannon Divergence Matrix)

Computes JSD between the **topic probability distributions** of each concept
pair. This asks whether the *thematic structure* (the pattern of which topics
are prominent) is similar across concepts, regardless of the raw embedding
geometry.

### Lens 3: Comparative Statistics

Side-by-side comparison of:
- Shannon entropy per concept (conceptual breadth)
- Corpus size (representation in curricula)
- Number of discovered topics (thematic complexity)
- Silhouette score (cluster quality)

### Why use multiple lenses?

Two concepts could be **geometrically close** (similar average embedding) but
**distributionally different** (very different topic structures). For example,
*mensch* and *verhalten* might both appear in social-science contexts (close
centroids) but *mensch* might span biology, ethics, and philosophy while
*verhalten* concentrates in behavioral science (different topic distributions).
Using both measures together gives a richer picture.

---

## 10. State-Level Variation

Germany's federal education system means that curriculum design is the
responsibility of each *Bundesland*. This creates natural variation that is
itself a research object.

### Bubble chart: Entropy × State

For each state-concept pair, Shannon entropy is computed over the topic
distribution of excerpts from that state. Plotting entropy against state
with bubble size proportional to corpus size reveals:

- Which states frame a concept more *uniformly* (low entropy)
- Which states frame it more *diversely* (high entropy)
- Whether small-corpus states should be interpreted cautiously

### State-pairwise JSD heatmaps

For each concept, a symmetric matrix of JSD values is computed between all
pairs of states. This reveals:

- **Clusters of similar states** (low inter-state JSD)
- **Outlier states** with distinctive curriculum framing
- **Concept-specific patterns**: states may converge on *evolution* 
  (where scientific consensus constrains framing) but diverge on *mensch*
  (where philosophical tradition varies more)

---

## 11. Caching & Reproducibility

### Deterministic cache keys

Every artefact is identified by a key encoding:
1. The artefact type (e.g. `emb_mensch`)
2. The number of input texts (detects corpus changes)
3. An MD5 hash of the full key string

This means that if the corpus changes size, all downstream caches are
automatically invalidated (different key → different filename → recomputed).

### Artefact manifest

| File pattern | Content | Format |
|---|---|---|
| `emb_{concept}_{n}_*.npy` | Sentence embeddings | NumPy float32 |
| `umap3d_{concept}_{n}_*.npy` | 3-D UMAP coordinates | NumPy float32 |
| `umap2d_{concept}_{n}_*.npy` | 2-D UMAP coordinates | NumPy float32 |
| `bertopic_{concept}_{n}_topics_*.json` | Topic assignment per excerpt | JSON list |
| `bertopic_{concept}_{n}_probs_*.npy` | Topic probabilities | NumPy float32 |
| `bertopic_{concept}_{n}_words_*.json` | Top words per topic | JSON dict |
| `bertopic_{concept}_{n}_info_*.json` | Full topic info table | JSON list |
| `joint_all_embs_*.npy` | Joint corpus embeddings | NumPy float32 |
| `joint_umap3d_*.npy` | Joint 3-D UMAP | NumPy float32 |
| `enriched_corpus.parquet` | Full enriched dataset | Parquet |
| `data/enriched_corpus.csv` | Full enriched dataset | CSV |

### Pushing to HuggingFace

```bash
# Authenticate
huggingface-cli login

# Upload computed artefacts to the Space
huggingface-cli upload deirdosh/curriculum_analysis_german \
  ./cache cache --repo-type=space

huggingface-cli upload deirdosh/curriculum_analysis_german \
  ./data data --repo-type=space
```


The `enriched_corpus.csv` adds BERTopic `topic_id` and UMAP coordinates
(`umap2_x`, `umap2_y`, `umap3_x`, `umap3_y`, `umap3_z`) to every excerpt,
making the enriched dataset independently useful for downstream analysis
without re-running the pipeline.

---

## 12. Educational Applications

### For curriculum researchers

The Concept Atlas operationalizes several research questions that have
historically required manual content analysis:

**Q: Is the concept of *evolution* framed consistently across German states?**  
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD
scores frame the concept in fundamentally different ways — follow up by reading
the excerpts in the high-JSD clusters.

**Q: How does *mensch* differ between Biology and Ethics curricula?**  
→ Filter the UMAP scatter by subject metadata and look for spatial separation
between subject-coloured clusters.

**Q: Which curriculum excerpts are semantically central to the concept
of *verhalten*?**  
→ Consult the PageRank and betweenness centrality rankings in the Graph
Theory tab to find the hub and bridge excerpts.

**Q: Are any states outliers in how they frame all three concepts?**  
→ Compare state-level entropy across all three concept bubble charts. A state
that consistently shows either very low or very high entropy across all three
concepts may have a distinctive curriculum philosophy.

### For teachers and educators

You do not need to understand the mathematics to use the Concept Atlas
productively. Here is an accessible guide:

**The UMAP scatter plot** is a *map of meaning*. Points close together mean
the curriculum uses similar language in those excerpts. Click on any point to
read the actual excerpt. Ask: Do the clusters make sense intuitively? Are
there surprising neighbors?

**The topic word clouds** show you what each cluster is "about" — the most
distinctive words for that group of excerpts. Use these to name the implicit
sub-topics in your subject area's curriculum.

**The entropy score** is a single number summarizing how *diverse* curriculum
language is for that concept. Compare it across states: does your state have
higher or lower entropy than average? What might that mean for teaching?

**The state JSD heatmap** is a curriculum comparison tool. Find your state on
both axes and read across: which states treat this concept most similarly to
yours? Most differently? This can be a starting point for cross-state
curriculum exchange or dialogue.

### Classroom use

The Concept Atlas can support several classroom activities:

- **Curriculum literacy seminars**: pre-service teachers can explore how their
  subject area frames key concepts, developing meta-awareness of curriculum
  language
- **Cross-disciplinary projects**: students can investigate how *mensch* is
  framed differently in Biology vs. Religion, using the UMAP and topic plots
  as primary evidence
- **Federalism and education policy**: social studies courses can use the
  state comparison features to discuss German educational federalism concretely
- **Philosophy of science**: the *evolution* concept analysis can ground
  discussions of how scientific concepts travel (or don't) across subject
  boundaries

---

## 13. Decentralized Research Model

### The case for community-driven curriculum analysis

Curriculum analysis has traditionally been the province of large research
institutes with dedicated funding and staff. This creates several problems:

- Coverage is selective and often lags policy changes by years
- Methods are rarely shared, making replication difficult
- Researchers in smaller institutions or non-German-speaking countries face
  high access barriers
- Teachers, whose expertise is most relevant, are rarely involved as researchers

The Concept Atlas is designed to support a different model: **lightweight,
reproducible, community-extensible analysis hosted on free public infrastructure.**

### HuggingFace Spaces as research infrastructure

HuggingFace Spaces provides:

| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |

### How to extend this work

**Adding more concepts:**  
Edit the `FOCUS_CONCEPTS` list in `app.py`. The pipeline will automatically
process new concepts if matching rows exist in the CSV.
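
As a minimal sketch (the entries shown are illustrative placeholders, not the actual list — consult `app.py` before editing):

```python
# In app.py: the concepts tracked by the pipeline.
# The entries below are illustrative -- check app.py for the real list.
FOCUS_CONCEPTS = [
    "mensch",
    "evolution",
    "nachhaltigkeit",  # newly added concept
]

# A new entry only produces output if rows mentioning it
# exist in curriculum_excerpts.csv.
```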

**Adding more states or subjects:**  
Extend the `curriculum_excerpts.csv` with new rows following the same column
structure and re-run the pipeline.
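
A sketch of appending new rows with the standard library — the column names here are hypothetical; match them to the actual header row of `curriculum_excerpts.csv`:

```python
import csv

# Hypothetical column names -- replace with the header row of the
# real curriculum_excerpts.csv before appending.
FIELDNAMES = ["state", "subject", "grade", "text"]

new_rows = [
    {"state": "Bayern", "subject": "Biologie", "grade": "10",
     "text": "Die Schuelerinnen und Schueler erklaeren ..."},
]

# Append only (mode "a"): the header is assumed to be present already.
with open("curriculum_excerpts.csv", "a", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writerows(new_rows)
```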

**Using a different embedding model:**  
Change the `MODEL_NAME` constant. Any model on the
[SentenceTransformers model hub](https://www.sbert.net/docs/pretrained_models.html)
with multilingual support can be substituted. Clear the embedding cache
(`cache/emb_*.npy`) to force recomputation.
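
A sketch of both steps — the model name below is one plausible multilingual option, not a recommendation:

```python
from pathlib import Path

# Any multilingual model from the SentenceTransformers hub can be
# substituted here; this name is illustrative.
MODEL_NAME = "distiluse-base-multilingual-cased-v2"

# Cached embeddings were computed with the previous model, so delete
# them to force recomputation on the next run.
for cached in Path("cache").glob("emb_*.npy"):
    cached.unlink()
```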

**Comparative cross-national analysis:**  
The pipeline is language-agnostic (the embedding model supports 50+ languages).
Providing curriculum excerpts from Austria, Switzerland, or other countries
in the same CSV format enables direct cross-national comparison.

**Contributing back:**  
- Open an issue or discussion on the HuggingFace Space
- Fork the Space and submit a PR with your extension
- Publish your enriched corpus as a separate HuggingFace dataset with a
  link back to this Space

### Minimal technical requirements for contributors

To run the pipeline locally or contribute new analysis:

```bash
# Clone the Space
git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
cd curriculum_analysis_german

# Install dependencies
pip install -r requirements.txt

# Run locally
python app.py
```

A standard laptop (8 GB RAM, no GPU) can run the full pipeline in
approximately 15–20 minutes on first run. GPU acceleration reduces this to
2–5 minutes. All subsequent runs load from cache in under 10 seconds.

---

## 14. Limitations & Ethical Considerations

### Technical limitations

**Embedding model biases**  
The `paraphrase-multilingual-mpnet-base-v2` model was trained primarily on
paraphrase pairs and may not capture domain-specific curriculum jargon as
accurately as a model fine-tuned on educational text. Terms with
curriculum-specific meanings (e.g. *Kompetenz* in pedagogical vs. general
usage) may be represented according to their general-language distribution.

**UMAP non-determinism across runs**  
While `random_state=42` ensures reproducibility within a session, UMAP
projections are not globally canonical — a different seed or different
`n_neighbors` value will produce a different (though structurally similar)
layout. Conclusions should not depend on the absolute position of clusters,
only on their relative proximity.

**BERTopic outlier sensitivity**  
HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1).
With small corpora or very diverse text, the outlier rate can be high (>50%).
This is a signal that the data may be too heterogeneous for reliable topic
discovery rather than a failure of the method — but it limits interpretability.
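
A quick sanity check is to compute the outlier rate directly. This sketch assumes `topics` is the per-document topic list that `BERTopic.fit_transform` returns (example values shown here):

```python
# One topic id per document, with -1 marking HDBSCAN outliers.
# These values are illustrative.
topics = [0, 1, -1, 2, -1, -1, 0, 1, -1, -1]

outlier_rate = topics.count(-1) / len(topics)
print(f"outlier rate: {outlier_rate:.0%}")  # 50% for this example

if outlier_rate > 0.5:
    print("warning: corpus may be too heterogeneous for topic discovery")
```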

**Corpus completeness**  
The current corpus may not include all German states, all school types, or
all grade levels. Gaps in coverage mean that low entropy or low JSD for a
state may reflect missing data rather than genuine curricular convergence.
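
For reference, the entropy and JSD measures mentioned above can be computed as follows. This is a self-contained sketch of the definitions, not the pipeline's implementation, and the state distributions are hypothetical:

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(x * log2(x) for x in p if x > 0)

def jsd(p, q):
    """Jensen-Shannon divergence in bits (bounded in [0, 1] for base 2)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return entropy(m) - (entropy(p) + entropy(q)) / 2

# Hypothetical topic distributions for two states over three topics.
state_a = [0.5, 0.3, 0.2]
state_b = [0.5, 0.3, 0.2]

print(jsd(state_a, state_b))  # identical distributions -> 0.0
```

A JSD of 0 between two states can therefore mean either genuine convergence or that both distributions were estimated from the same incomplete slice of the corpus.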

**No temporal dimension**  
The current analysis treats curricula as static documents. It does not capture
revision histories or how concept framing has changed over time. A longitudinal
extension would require time-stamped corpus versions.

### Ethical considerations

**Curriculum documents are public, but context matters**  
German curriculum documents are publicly available administrative texts.
However, analysis that identifies specific states as "outliers" or frames
curriculum differences in evaluative terms should be handled carefully.
The goal of this tool is descriptive analysis, not ranking or judgment.

**Automated analysis does not replace reading**  
Computational methods reveal patterns at scale but cannot replace close
reading of the actual texts. Any finding from the Concept Atlas should be
verified by examining the underlying excerpts before drawing policy conclusions.

**Representation of marginalized perspectives**  
If curriculum documents systematically underrepresent certain voices
(e.g. indigenous knowledge systems, minority cultural frameworks), those
absences will not appear in the semantic analysis — which only reflects
what is present in the text. The Concept Atlas can reveal *what is there*
but not *what is missing*.

**Open-source does not mean unbiased**  
The choice of focus concepts, the threshold parameters, and the framing of
results all reflect research decisions made by the developers. We encourage
users to interrogate these choices and to adapt the tool to their own
research questions rather than treating the default configuration as neutral.

---

## 15. Glossary

| Term | Definition |
|---|---|
| **BERTopic** | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text |
| **Betweenness centrality** | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points |
| **Clustering coefficient** | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness |
| **Cosine similarity** | A measure of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| **c-TF-IDF** | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics |
| **Embedding** | A numerical vector representation of text that encodes semantic meaning |
| **Entropy (Shannon)** | A measure of uncertainty or diversity in a probability distribution; measured in bits |
| **HDBSCAN** | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape |
| **Jensen-Shannon divergence** | A symmetric, bounded measure of the dissimilarity between two probability distributions |
| **kNN graph** | A graph where each node is connected to its k nearest neighbours by some distance measure |
| **Louvain algorithm** | A community detection algorithm that maximizes modularity in a network |
| **Modularity** | A measure of the quality of a graph partition into communities |
| **PageRank** | A graph centrality measure that assigns importance based on the importance of connected nodes |
| **Silhouette score** | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) |
| **Sentence Transformer** | A neural network architecture optimized for producing sentence-level embeddings |
| **UMAP** | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure |

---

*Document version: May 2025*  
*Space: [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)*  
*License: Apache 2.0*