# Concept Atlas: Methods & Application Guide
### German Curriculum Semantic Analysis — Technical & Pedagogical Documentation
> **Space:** [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)
> **Focus concepts:** *mensch · verhalten · evolution*
> **Model:** `paraphrase-multilingual-mpnet-base-v2`
---
## Table of Contents
1. [Overview & Motivation](#1-overview--motivation)
2. [Data: The Curriculum Corpus](#2-data-the-curriculum-corpus)
3. [Pipeline Architecture](#3-pipeline-architecture)
4. [Multilingual Sentence Embeddings](#4-multilingual-sentence-embeddings)
5. [Dimensionality Reduction: UMAP](#5-dimensionality-reduction-umap)
6. [Topic Modeling: BERTopic](#6-topic-modeling-bertopic)
7. [Information-Theoretic Measures](#7-information-theoretic-measures)
8. [Graph-Theoretic Analysis](#8-graph-theoretic-analysis)
9. [Cross-Concept Comparison](#9-cross-concept-comparison)
10. [State-Level Variation](#10-state-level-variation)
11. [Caching & Reproducibility](#11-caching--reproducibility)
12. [Educational Applications](#12-educational-applications)
13. [Decentralized Research Model](#13-decentralized-research-model)
14. [Limitations & Ethical Considerations](#14-limitations--ethical-considerations)
15. [Glossary](#15-glossary)
---
## 1. Overview & Motivation
### What this project does
The **Concept Atlas** is a computational tool for exploring how key biological
and humanistic concepts are framed across German school curricula. Rather than
reading curriculum documents manually — a task that scales poorly across 16
federal states (*Bundesländer*), dozens of subjects, and multiple grade levels —
this tool uses modern Natural Language Processing (NLP) to:
- **Map** the semantic landscape of curriculum language around three focus
concepts: *Mensch* (human), *Verhalten* (behaviour), and *Evolution*
- **Cluster** excerpts into coherent topics without any pre-defined categories
- **Compare** how these concepts relate to each other mathematically
- **Detect** variation in framing between federal states
### Why these three concepts?
*Mensch*, *Verhalten*, and *Evolution* occupy a uniquely contested intersection
in German science education. They appear across Biology, Ethics, Social Studies,
Religion, and Psychology curricula — often with very different emphases
depending on subject context and state. This makes them ideal test cases for
computational curriculum analysis:
| Concept | Why it matters |
|---|---|
| **Mensch** | Bridges biological and humanistic framings; appears in nearly every subject |
| **Verhalten** | Links ethological science to social norms and moral education |
| **Evolution** | Scientifically precise in Biology; contested or reframed in other subjects |
### Who is this for?
- **Curriculum researchers** seeking scalable, reproducible analysis tools
- **Science educators** interested in how their subject's language compares
across states or disciplines
- **Policy analysts** investigating curricular coherence and equity
- **Graduate students** learning applied NLP for educational research
- **Open science advocates** interested in decentralized, community-driven
research infrastructure
---
## 2. Data: The Curriculum Corpus
### Source
The corpus consists of text excerpts drawn from publicly available German school
curriculum documents (*Lehrpläne* and *Bildungspläne*) across multiple federal
states. Each excerpt was retrieved by keyword search and stored as a structured
CSV file.
### Structure
Each row in `curriculum_excerpts.csv` represents one curriculum excerpt:
| Column | Description |
|---|---|
| `search_term` | The keyword used to retrieve the excerpt (e.g. `mensch`) |
| `text_excerpt` | The raw curriculum text (sentence to paragraph length) |
| `state` | German federal state (*Bundesland*) |
| `subject` | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| `grade` | Target grade level or band |
| `school_type` | School type (e.g. Gymnasium, Realschule) |
### Preprocessing
Before analysis, the corpus undergoes the following cleaning steps:
```text
Raw CSV
 → Normalise column names (lowercase, underscores)
 → Fill missing values with empty strings
 → Add missing optional columns (state, subject, grade, school_type)
 → Strip whitespace from text_excerpt and search_term
 → Remove excerpts shorter than 20 characters
 → Derive search_term_lower for case-insensitive concept matching
```
Concept subsets are built by **exact match** on `search_term_lower`, with
automatic fallback to **partial string match** if fewer than 10 exact matches
are found. This handles spelling variants and compound words.
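The exact-then-partial matching logic can be sketched in pandas. The function name `concept_subset` and the `min_exact` parameter below are illustrative, not the Space's actual code:

```python
import pandas as pd

def concept_subset(df: pd.DataFrame, concept: str, min_exact: int = 10) -> pd.DataFrame:
    """Select excerpts for one concept: exact match on search_term_lower,
    falling back to substring match if too few exact hits (hypothetical helper)."""
    exact = df[df["search_term_lower"] == concept]
    if len(exact) >= min_exact:
        return exact
    # Partial match catches compounds such as "menschenbild"
    return df[df["search_term_lower"].str.contains(concept, na=False)]

corpus = pd.DataFrame({
    "search_term_lower": ["mensch", "menschenbild", "evolution"],
    "text_excerpt": ["...", "...", "..."],
})
# Only one exact "mensch" hit, so the partial fallback also pulls in "menschenbild"
subset = concept_subset(corpus, "mensch", min_exact=2)
```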
### Scale considerations
The analysis is designed to work with corpora ranging from a few hundred to
tens of thousands of excerpts. Embedding and topic modeling parameters scale
automatically:
- `n_neighbors` in UMAP is capped at `min(15, n_samples - 1)`
- `min_cluster_size` in HDBSCAN is set to `min(5, max(2, n // 10))`
---
## 3. Pipeline Architecture
The full analysis pipeline runs sequentially in a single click. All
computationally expensive steps are cached to disk so that subsequent
exploration is instantaneous.
```text
CSV ingestion
   │
   ▼
Sentence embeddings (per concept)
   ├──► UMAP 2-D (visualization)
   ├──► UMAP 3-D (atlas & joint space)
   └──► BERTopic
          ├──► Topic labels per excerpt
          ├──► Top words per topic
          └──► Topic probability distributions
                 ├──► Shannon entropy
                 ├──► Jensen-Shannon divergence (cross-concept)
                 ├──► Jensen-Shannon divergence (cross-state)
                 └──► Cosine similarity (centroid comparison)
   │
   ▼
Semantic kNN graph (per concept)
   ├──► Betweenness centrality
   ├──► PageRank
   ├──► Closeness centrality
   ├──► Louvain communities
   ├──► Network density
   └──► Average clustering coefficient
   │
   ▼
Enriched parquet export
   └──► data/enriched_corpus.csv
        cache/enriched_corpus.parquet
```
### Caching strategy
Every expensive computation writes a `.npy` (arrays) or `.json` (metadata)
file to `./cache/`, keyed by a combination of:
- The logical name of the artefact (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string (prevents filename collisions)
On re-launch, the pipeline checks for cached files first and skips recomputation
entirely if they exist. This makes the Space fast for end users while keeping
the first-run cost affordable.
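A minimal sketch of such a key scheme, using only the standard library (the exact key format used by the Space may differ):

```python
import hashlib

def cache_key(name: str, n_texts: int) -> str:
    """Deterministic cache filename stem: artefact name + corpus size + MD5 digest.
    Illustrative sketch, not the Space's actual implementation."""
    raw = f"{name}_{n_texts}"
    digest = hashlib.md5(raw.encode("utf-8")).hexdigest()[:8]
    return f"{raw}_{digest}"

# Same inputs always yield the same filename; a changed corpus size yields a new one,
# which is what invalidates all downstream caches automatically.
key = cache_key("emb_mensch", 512)
```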
---
## 4. Multilingual Sentence Embeddings
### What is an embedding?
An **embedding** is a list of numbers (a vector) that represents the meaning
of a piece of text. Texts with similar meanings produce vectors that are close
together in mathematical space; texts with different meanings produce vectors
that are far apart.
The Concept Atlas uses vectors with **768 dimensions** — each excerpt becomes
a point in a 768-dimensional semantic space.
### Model: `paraphrase-multilingual-mpnet-base-v2`
This model is a **Sentence Transformer** — a neural network fine-tuned
specifically to produce high-quality sentence-level representations. Key
properties relevant to this project:
| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native — no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere) → cosine similarity = dot product |
| License | Apache 2.0 |
### Why multilingual?
German curriculum language is domain-specific and contains compound words,
technical terms, and pedagogical jargon that generic German models may handle
poorly. A model fine-tuned on multilingual paraphrase pairs captures
*semantic equivalence* across different wordings better than a generic
masked-language-model baseline, which is exactly what is needed to group
thematically similar curriculum excerpts.
### Practical interpretation
> Two excerpts with a **high cosine similarity** (close to 1.0) make similar
> semantic claims or describe similar content — even if they use different words.
> Two excerpts with **low cosine similarity** (close to 0) occupy different
> regions of conceptual space.
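The normalization property noted in the table above can be checked directly with NumPy. The random vectors below are stand-ins for real excerpt embeddings, which would come from `SentenceTransformer.encode(..., normalize_embeddings=True)`:

```python
import numpy as np

# Two stand-in 768-D excerpt embeddings, L2-normalized to the unit sphere
rng = np.random.default_rng(42)
a = rng.normal(size=768); a /= np.linalg.norm(a)
b = rng.normal(size=768); b /= np.linalg.norm(b)

# For unit vectors the full cosine formula collapses to a plain dot product
cos_full = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dot = a @ b
```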
---
## 5. Dimensionality Reduction: UMAP
### The dimensionality problem
768-dimensional vectors cannot be visualized directly. **UMAP** (Uniform
Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while
preserving as much of the neighbourhood structure as possible — meaning that
points that were close in 768-D tend to remain close after reduction.
### Two projections are computed
| Projection | Dimensions | Purpose |
|---|---|---|
| **UMAP 2-D** | 2 | Interactive scatter plots, BERTopic visualization |
| **UMAP 3-D** | 3 | Atlas visualization, joint concept space |
A **joint 3-D UMAP** is also computed across all three concept corpora combined,
placing *mensch*, *verhalten*, and *evolution* excerpts in a single shared
semantic space for direct comparison.
### Key parameters
| Parameter | Value | Effect |
|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global structure balance |
| `min_dist` | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| `metric` | cosine | Appropriate for normalized embeddings |
| `random_state` | 42 | Ensures reproducible layouts |
### Accessible interpretation
> Think of UMAP as making a **map of meaning**. Just as a geographic map
> compresses the curved surface of the Earth onto a flat page while preserving
> relative distances between cities, UMAP compresses the high-dimensional
> semantic space onto a 2-D or 3-D canvas while preserving which excerpts are
> semantically "nearby."
>
> **Clusters** of points in a UMAP plot indicate groups of excerpts that
> discuss similar ideas. **Gaps** between clusters indicate distinct conceptual
> sub-areas. The absolute position of a cluster has no meaning — only
> relative distances matter.
---
## 6. Topic Modeling: BERTopic
### What is topic modeling?
Topic modeling is an **unsupervised** method for discovering thematic groups
within a text collection — without being told in advance what the themes are.
Traditional methods (e.g. LDA) work on word co-occurrence statistics.
**BERTopic** uses pre-computed sentence embeddings, which means it understands
*meaning* rather than just *word frequency*.
### Pipeline within BERTopic
```text
Sentence embeddings (768-D)
        │
        ▼
UMAP reduction (→ 5-D internal space)
        │
        ▼
HDBSCAN clustering
        │
        ▼
c-TF-IDF topic representation
  (class-based TF-IDF: finds words that
   distinguish this topic from all others)
        │
        ▼
Topic labels + per-document probabilities
```
### HDBSCAN: density-based clustering
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with
Noise) finds clusters as **dense regions** in the embedding space, separated
by sparser regions. Key advantages for curriculum text:
- Does **not** require specifying the number of clusters in advance
- Naturally handles **outliers** (assigned topic `-1`)
- Finds clusters of **variable size and shape**
### Parameters used
| Parameter | Value | Rationale |
|---|---|---|
| `min_cluster_size` | 5 (adaptive) | Minimum excerpts to form a topic |
| `min_samples` | `max(1, min_cluster_size // 2)` | Controls noise sensitivity |
| `cluster_selection_method` | `eom` | Excess of Mass — finds stable clusters |
| `n_gram_range` | (1, 2) | Single words and two-word phrases as features |
### Reading the results
Each topic is characterized by its **top words** — terms with the highest
c-TF-IDF scores for that cluster. These are words that appear frequently in
the topic *and* rarely in other topics, making them highly discriminating.
**Topic -1** is always the outlier category: excerpts that did not fit
confidently into any discovered cluster. A high outlier rate may indicate
either genuine semantic diversity or insufficient data for that concept.
### Silhouette score
The **silhouette score** measures how well-separated the discovered clusters
are, ranging from -1 (poor) to +1 (excellent):
- **> 0.5**: well-separated, meaningful topics
- **0.2–0.5**: moderate separation — topics exist but overlap
- **< 0.2**: clusters are not well-defined; interpret with caution
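The score can be explored with scikit-learn on a toy two-cluster dataset. This uses the euclidean metric for simplicity on made-up 2-D points; the pipeline itself scores clusters in its reduced embedding space:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two clearly separated toy clusters: stand-ins for well-defined topics
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 2)),   # cluster 0, tight around the origin
    rng.normal(5.0, 0.05, size=(20, 2)),   # cluster 1, tight around (5, 5)
])
labels = np.array([0] * 20 + [1] * 20)

# Well-separated clusters score close to +1
score = silhouette_score(pts, labels, metric="euclidean")
```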
---
## 7. Information-Theoretic Measures
Information theory provides a principled mathematical language for measuring
**diversity**, **surprise**, and **difference** in distributions. The Concept
Atlas applies three core measures.
### 7.1 Shannon Entropy
$$H(X) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$
**What it measures:** How evenly curriculum excerpts are spread across the
discovered topics. High entropy = many topics of roughly equal size (diverse
framing). Low entropy = one or two dominant topics (concentrated framing).
**Interpretation guide:**
| Entropy | Meaning |
|---|---|
| **Low** (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| **Medium** (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| **High** (> 2.5 bits) | Highly diverse; concept is used across many different framings |
> **Accessible analogy:** Imagine rolling a die. If it always lands on 6
> (entropy = 0), you learn nothing new from each roll. If all faces are equally
> likely (maximum entropy), each roll is maximally informative. Curriculum
> entropy works the same way — high entropy means the concept is used in
> many genuinely different ways.
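The formula translates directly into NumPy; the two toy distributions below reproduce the extremes of the interpretation table:

```python
import numpy as np

def shannon_entropy(p) -> float:
    """Shannon entropy in bits; zero-probability topics are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

low = shannon_entropy([1.0, 0.0, 0.0])   # one dominant topic: 0 bits
high = shannon_entropy([0.25] * 4)       # four equally likely topics: 2 bits
```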
### 7.2 Jensen-Shannon Divergence (JSD)
$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$
where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.
**What it measures:** The *divergence* between two probability distributions
over topics. JSD is symmetric (order doesn't matter), bounded in [0, 1] when
computed with base-2 logarithms, and always finite (unlike raw KL divergence,
which is asymmetric and can be infinite).
**Used in two contexts:**
1. **Cross-concept JSD:** Do *mensch*, *verhalten*, and *evolution* have
similar topic distributions? JSD near 0 means yes; near 1 means they
occupy entirely different topical spaces.
2. **Cross-state JSD:** Do two federal states frame the same concept similarly?
High JSD between states indicates curricular divergence; low JSD indicates
convergence.
**Interpretation:**
| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
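With SciPy this is a two-liner, with one caveat: `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance* (the square root of the divergence), so it must be squared. The toy distributions are illustrative, and the Space's own implementation may differ:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.6, 0.3, 0.1])   # topic distribution, concept A
q = np.array([0.1, 0.3, 0.6])   # topic distribution, concept B

# base=2 gives the [0, 1] range used in the interpretation table;
# squaring converts the JS distance back into the divergence.
jsd = jensenshannon(p, q, base=2) ** 2
identical = jensenshannon(p, p, base=2) ** 2   # identical distributions: 0
```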
### 7.3 Cosine Similarity (Embedding Centroids)
$$\text{sim}(A, B) = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\| \|\bar{b}\|}$$
where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of
two concept corpora.
**What it measures:** Whether the *average semantic content* of two concept
corpora occupies the same region of embedding space. This is complementary to
JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD
operates on the topic distribution (post-clustering).
**Interpretation:**
| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic context |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
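The centroid comparison itself is a few lines of NumPy. The random matrices below are stand-ins for two real concept embedding matrices:

```python
import numpy as np

def centroid_cosine(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between the mean embedding vectors of two corpora."""
    a, b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two synthetic corpora that share a common semantic direction plus noise
rng = np.random.default_rng(7)
base = rng.normal(size=(1, 768))
corpus_a = base + 0.1 * rng.normal(size=(50, 768))
corpus_b = base + 0.1 * rng.normal(size=(50, 768))

sim_self = centroid_cosine(corpus_a, corpus_a)   # a corpus with itself: 1.0
sim_pair = centroid_cosine(corpus_a, corpus_b)   # near 1, the shared direction dominates
```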
---
## 8. Graph-Theoretic Analysis
### Why model curriculum text as a graph?
A graph (network) makes the **relational structure** of a corpus explicit.
Instead of treating each excerpt independently, a graph reveals which excerpts
are semantically central, which bridge different topic areas, and how the
corpus is organized into communities.
### Construction: k-Nearest Neighbour Graph
For each concept corpus, a graph $G = (V, E)$ is constructed where:
- **Nodes** $V$: each curriculum excerpt
- **Edges** $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity
exceeds a threshold (≥ 0.35) *and* $j$ is among the $k=6$ nearest neighbours
of $i$
- **Edge weights**: the cosine similarity value
This creates a **sparse similarity graph** that captures local semantic
neighbourhood structure.
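A minimal sketch of this construction with NetworkX, assuming L2-normalized embeddings so that cosine similarity is a dot product (the toy data and function name are illustrative):

```python
import numpy as np
import networkx as nx

def knn_graph(emb: np.ndarray, k: int = 6, threshold: float = 0.35) -> nx.Graph:
    """Sparse cosine kNN graph: edge i-j iff j is among i's k nearest
    neighbours AND their similarity clears the threshold."""
    sims = emb @ emb.T                     # cosine similarity (unit vectors)
    np.fill_diagonal(sims, -np.inf)        # never link a node to itself
    g = nx.Graph()
    g.add_nodes_from(range(len(emb)))
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:     # indices of the k most similar nodes
            if row[j] >= threshold:
                g.add_edge(i, int(j), weight=float(row[j]))
    return g

# Toy corpus: two tight semantic clusters of 10 "excerpts" each
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 32))
pts = np.vstack([c + 0.05 * rng.normal(size=(10, 32)) for c in centers])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
g = knn_graph(pts, k=3)
```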
### Measures computed
#### Betweenness Centrality
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t|v)}{\sigma(s,t)}$$
Measures how often a node lies on the shortest path between other nodes.
**High betweenness** = the excerpt is a semantic "bridge" between different
topic areas. In curriculum terms, bridge excerpts often contain integrative
or interdisciplinary language.
#### PageRank
Iteratively assigns importance based on the importance of neighbours.
**High PageRank** = the excerpt is connected to many other well-connected
excerpts. PageRank hubs represent semantically central curriculum statements
that many other excerpts are conceptually near.
#### Closeness Centrality
Measures how quickly a node can reach all others via the graph.
**High closeness** = the excerpt is semantically accessible from most others —
a "general" or bridging statement.
#### Network Density
$$d = \frac{2|E|}{|V|(|V|-1)}$$
The fraction of all possible edges that actually exist. Higher density
indicates a more semantically cohesive corpus (most excerpts are near most
others). Lower density indicates a more fragmented semantic space.
#### Average Clustering Coefficient
Measures the tendency of nodes to form tightly connected local groups.
High clustering = the corpus has tight semantic sub-communities.
#### Louvain Community Detection
Partitions the graph into communities that maximize **modularity** — the
degree to which within-community connections exceed what would be expected
by chance. Communities in a curriculum semantic graph often correspond to
distinct disciplinary or contextual framings of a concept.
### Accessible interpretation
> Think of the semantic graph as a **social network of curriculum excerpts**,
> where two excerpts are "friends" if they discuss similar ideas.
>
> - **Hubs** (high PageRank) are the popular, central ideas that most other
> ideas are related to.
> - **Bridges** (high betweenness) are the connectors — excerpts that link
> otherwise separate clusters of ideas.
> - **Communities** are groups of mutually similar excerpts — effectively
>   the curriculum's implicit sub-topics.
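The hub, bridge, and community readings can be verified on a toy NetworkX graph: two triangles joined by a single connector node, a miniature of a semantic graph with two topical communities:

```python
import networkx as nx

# Two triangles (communities A and B) linked only through node 6
g = nx.Graph([(0, 1), (1, 2), (2, 0),     # community A
              (3, 4), (4, 5), (5, 3),     # community B
              (2, 6), (6, 3)])            # node 6 bridges A and B

betw = nx.betweenness_centrality(g)
pr = nx.pagerank(g)
comms = nx.community.louvain_communities(g, seed=42)

# Every shortest path between the two triangles runs through node 6,
# so it gets the highest betweenness: the "bridge" excerpt.
bridge = max(betw, key=betw.get)
```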
---
## 9. Cross-Concept Comparison
The cross-concept analysis addresses the core research question:
**How do *mensch*, *verhalten*, and *evolution* relate to each other in
German curriculum language?**
Three complementary lenses are applied:
### Lens 1: Geometric (Cosine Similarity Matrix)
Computes the cosine similarity between the **mean embedding vectors** of each
concept corpus. This is a purely geometric measure — it asks whether the
average document in one concept space is semantically close to the average
document in another.
### Lens 2: Distributional (Jensen-Shannon Divergence Matrix)
Computes JSD between the **topic probability distributions** of each concept
pair. This asks whether the *thematic structure* (the pattern of which topics
are prominent) is similar across concepts, regardless of the raw embedding
geometry.
### Lens 3: Comparative Statistics
Side-by-side comparison of:
- Shannon entropy per concept (conceptual breadth)
- Corpus size (representation in curricula)
- Number of discovered topics (thematic complexity)
- Silhouette score (cluster quality)
### Why use multiple lenses?
Two concepts could be **geometrically close** (similar average embedding) but
**distributionally different** (very different topic structures). For example,
*mensch* and *verhalten* might both appear in social-science contexts (close
centroids) but *mensch* might span biology, ethics, and philosophy while
*verhalten* concentrates in behavioral science (different topic distributions).
Using both measures together gives a richer picture.
---
## 10. State-Level Variation
Germany's federal education system means that curriculum design is the
responsibility of each *Bundesland*. This creates natural variation that is
itself a research object.
### Bubble chart: Entropy × State
For each state-concept pair, Shannon entropy is computed over the topic
distribution of excerpts from that state. Plotting entropy against state
with bubble size proportional to corpus size reveals:
- Which states frame a concept more *uniformly* (low entropy)
- Which states frame it more *diversely* (high entropy)
- Whether small-corpus states should be interpreted cautiously
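The per-state entropy computation reduces to a pandas group-by. The topic assignments below are dummy values; in the pipeline they come from BERTopic:

```python
import numpy as np
import pandas as pd

def entropy_bits(counts: pd.Series) -> float:
    """Shannon entropy in bits of a topic-count distribution."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Dummy excerpts: one state with a single dominant topic, one with four
df = pd.DataFrame({
    "state": ["Bayern"] * 4 + ["Berlin"] * 4,
    "topic_id": [0, 0, 0, 0, 0, 1, 2, 3],
})
by_state = df.groupby("state")["topic_id"].apply(
    lambda s: entropy_bits(s.value_counts())
)
# Bayern: 0 bits (uniform framing); Berlin: 2 bits (diverse framing)
```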
### State-pairwise JSD heatmaps
For each concept, a symmetric matrix of JSD values is computed between all
pairs of states. This reveals:
- **Clusters of similar states** (low inter-state JSD)
- **Outlier states** with distinctive curriculum framing
- **Concept-specific patterns**: states may converge on *evolution*
(where scientific consensus constrains framing) but diverge on *mensch*
(where philosophical tradition varies more)
---
## 11. Caching & Reproducibility
### Deterministic cache keys
Every artefact is identified by a key encoding:
1. The artefact type (e.g. `emb_mensch`)
2. The number of input texts (detects corpus changes)
3. An MD5 hash of the full key string
This means that if the corpus changes size, all downstream caches are
automatically invalidated (different key → different filename → recomputed).
### Artefact manifest
| File pattern | Content | Format |
|---|---|---|
| `emb_{concept}_{n}_*.npy` | Sentence embeddings | NumPy float32 |
| `umap3d_{concept}_{n}_*.npy` | 3-D UMAP coordinates | NumPy float32 |
| `umap2d_{concept}_{n}_*.npy` | 2-D UMAP coordinates | NumPy float32 |
| `bertopic_{concept}_{n}_topics_*.json` | Topic assignment per excerpt | JSON list |
| `bertopic_{concept}_{n}_probs_*.npy` | Topic probabilities | NumPy float32 |
| `bertopic_{concept}_{n}_words_*.json` | Top words per topic | JSON dict |
| `bertopic_{concept}_{n}_info_*.json` | Full topic info table | JSON list |
| `joint_all_embs_*.npy` | Joint corpus embeddings | NumPy float32 |
| `joint_umap3d_*.npy` | Joint 3-D UMAP | NumPy float32 |
| `enriched_corpus.parquet` | Full enriched dataset | Parquet |
| `data/enriched_corpus.csv` | Full enriched dataset | CSV |
### Pushing to HuggingFace
```bash
# Authenticate
huggingface-cli login

# Upload computed artefacts to the Space
huggingface-cli upload deirdosh/curriculum_analysis_german \
    ./cache cache --repo-type=space
huggingface-cli upload deirdosh/curriculum_analysis_german \
    ./data data --repo-type=space
```
The `enriched_corpus.csv` adds BERTopic `topic_id` and UMAP coordinates
(`umap2_x`, `umap2_y`, `umap3_x`, `umap3_y`, `umap3_z`) to every excerpt,
making the enriched dataset independently useful for downstream analysis
without re-running the pipeline.
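Because the enriched CSV is self-contained, downstream analysis needs only pandas. The dummy frame below mirrors the documented columns; real data would instead be loaded with `pd.read_csv("data/enriched_corpus.csv")`:

```python
import pandas as pd

# Dummy row with the columns the enriched export is documented to carry
enriched = pd.DataFrame({
    "text_excerpt": ["Der Mensch als soziales Wesen ..."],
    "state": ["Berlin"],
    "topic_id": [2],
    "umap2_x": [1.3], "umap2_y": [-0.7],
    "umap3_x": [0.4], "umap3_y": [2.1], "umap3_z": [-1.5],
})

# Example downstream query: excerpt counts per discovered topic
per_topic = enriched.groupby("topic_id").size()
```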
---
## 12. Educational Applications
### For curriculum researchers
The Concept Atlas operationalizes several research questions that have
historically required manual content analysis:
**Q: Is the concept of *evolution* framed consistently across German states?**
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD
scores frame the concept in fundamentally different ways — follow up by reading
the excerpts in the high-JSD clusters.
**Q: How does *mensch* differ between Biology and Ethics curricula?**
→ Filter the UMAP scatter by subject metadata and look for spatial separation
between subject-coloured clusters.
**Q: Which curriculum excerpts are semantically central to the concept
of *verhalten*?**
→ Consult the PageRank and betweenness centrality rankings in the Graph
Theory tab to find the hub and bridge excerpts.
**Q: Are any states outliers in how they frame all three concepts?**
→ Compare state-level entropy across all three concept bubble charts. A state
that consistently shows either very low or very high entropy across all three
concepts may have a distinctive curriculum philosophy.
### For teachers and educators
You do not need to understand the mathematics to use the Concept Atlas
productively. Here is an accessible guide:
**The UMAP scatter plot** is a *map of meaning*. Points close together mean
the curriculum uses similar language in those excerpts. Click on any point to
read the actual excerpt. Ask: Do the clusters make sense intuitively? Are
there surprising neighbors?
**The topic word clouds** show you what each cluster is "about" — the most
distinctive words for that group of excerpts. Use these to name the implicit
sub-topics in your subject area's curriculum.
**The entropy score** is a single number summarizing how *diverse* curriculum
language is for that concept. Compare it across states: does your state have
higher or lower entropy than average? What might that mean for teaching?
**The state JSD heatmap** is a curriculum comparison tool. Find your state on
both axes and read across: which states treat this concept most similarly to
yours? Most differently? This can be a starting point for cross-state
curriculum exchange or dialogue.
### Classroom use
The Concept Atlas can support several classroom activities:
- **Curriculum literacy seminars**: pre-service teachers can explore how their
subject area frames key concepts, developing meta-awareness of curriculum
language
- **Cross-disciplinary projects**: students can investigate how *mensch* is
framed differently in Biology vs. Religion, using the UMAP and topic plots
as primary evidence
- **Federalism and education policy**: social studies courses can use the
state comparison features to discuss German educational federalism concretely
- **Philosophy of science**: the *evolution* concept analysis can ground
discussions of how scientific concepts travel (or don't) across subject
boundaries
---
## 13. Decentralized Research Model
### The case for community-driven curriculum analysis
Curriculum analysis has traditionally been the province of large research
institutes with dedicated funding and staff. This creates several problems:
- Coverage is selective and often lags policy changes by years
- Methods are rarely shared, making replication difficult
- Researchers in smaller institutions or non-German-speaking countries face
high access barriers
- Teachers, whose expertise is most relevant, are rarely involved as researchers
The Concept Atlas is designed to support a different model: **lightweight,
reproducible, community-extensible analysis hosted on free public infrastructure.**
### HuggingFace Spaces as research infrastructure
HuggingFace Spaces provides:
| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |
### How to extend this work
**Adding more concepts:**
Edit the `FOCUS_CONCEPTS` list in `app.py`. The pipeline will automatically
process new concepts if matching rows exist in the CSV.
**Adding more states or subjects:**
Extend the `curriculum_excerpts.csv` with new rows following the same column
structure and re-run the pipeline.
**Using a different embedding model:**
Change the `MODEL_NAME` constant. Any model on the
[SentenceTransformers model hub](https://www.sbert.net/docs/pretrained_models.html)
with multilingual support can be substituted. Clear the embedding cache
(`cache/emb_*.npy`) to force recomputation.
**Comparative cross-national analysis:**
The pipeline is language-agnostic (the embedding model supports 50+ languages).
Providing curriculum excerpts from Austria, Switzerland, or other countries
in the same CSV format enables direct cross-national comparison.
**Contributing back:**
- Open an issue or discussion on the HuggingFace Space
- Fork the Space and submit a PR with your extension
- Publish your enriched corpus as a separate HuggingFace dataset with a
link back to this Space
### Minimal technical requirements for contributors
To run the pipeline locally or contribute new analysis:
```bash
# Clone the Space
git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
cd curriculum_analysis_german

# Install dependencies
pip install -r requirements.txt

# Run locally
python app.py
```
A standard laptop (8 GB RAM, no GPU) can run the full pipeline in
approximately 15–20 minutes on first run. GPU acceleration reduces this to
2–5 minutes. All subsequent runs load from cache in under 10 seconds.
---
## 14. Limitations & Ethical Considerations
### Technical limitations
**Embedding model biases**
The `paraphrase-multilingual-mpnet-base-v2` model was trained primarily on
paraphrase pairs and may not capture domain-specific curriculum jargon as
accurately as a model fine-tuned on educational text. Terms with
curriculum-specific meanings (e.g. *Kompetenz* in pedagogical vs. general
usage) may be represented according to their general-language distribution.
**UMAP non-determinism across runs**
While `random_state=42` ensures reproducibility within a session, UMAP
projections are not globally canonical — a different seed or different
`n_neighbors` value will produce a different (though structurally similar)
layout. Conclusions should not depend on the absolute position of clusters,
only on their relative proximity.
**BERTopic outlier sensitivity**
HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1).
With small corpora or very diverse text, the outlier rate can be high (>50%).
This is a signal that the data may be too heterogeneous for reliable topic
discovery rather than a failure of the method — but it limits interpretability.
**Corpus completeness**
The current corpus may not include all German states, all school types, or
all grade levels. Gaps in coverage mean that low entropy or low JSD for a
state may reflect missing data rather than genuine curricular convergence.
**No temporal dimension**
The current analysis treats curricula as static documents. It does not capture
revision histories or how concept framing has changed over time. A longitudinal
extension would require time-stamped corpus versions.
### Ethical considerations
**Curriculum documents are public, but context matters**
German curriculum documents are publicly available administrative texts.
However, analysis that identifies specific states as "outliers" or frames
curriculum differences in evaluative terms should be handled carefully.
The goal of this tool is descriptive analysis, not ranking or judgment.
**Automated analysis does not replace reading**
Computational methods reveal patterns at scale but cannot replace close
reading of the actual texts. Any finding from the Concept Atlas should be
verified by examining the underlying excerpts before drawing policy conclusions.
**Representation of marginalized perspectives**
If curriculum documents systematically underrepresent certain voices
(e.g. indigenous knowledge systems, minority cultural frameworks), those
absences will not appear in the semantic analysis — which only reflects
what is present in the text. The Concept Atlas can reveal *what is there*
but not *what is missing*.
**Open-source does not mean unbiased**
The choice of focus concepts, the threshold parameters, and the framing of
results all reflect research decisions made by the developers. We encourage
users to interrogate these choices and to adapt the tool to their own
research questions rather than treating the default configuration as neutral.
---
## 15. Glossary
| Term | Definition |
|---|---|
| **BERTopic** | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text |
| **Betweenness centrality** | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points |
| **Clustering coefficient** | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness |
| **Cosine similarity** | A measure of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| **c-TF-IDF** | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics |
| **Embedding** | A numerical vector representation of text that encodes semantic meaning |
| **Entropy (Shannon)** | A measure of uncertainty or diversity in a probability distribution; measured in bits |
| **HDBSCAN** | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape |
| **Jensen-Shannon divergence** | A symmetric, bounded measure of similarity between two probability distributions |
| **kNN graph** | A graph where each node is connected to its k nearest neighbours by some distance measure |
| **Louvain algorithm** | A community detection algorithm that maximizes modularity in a network |
| **Modularity** | A measure of the quality of a graph partition into communities |
| **PageRank** | A graph centrality measure that assigns importance based on the importance of connected nodes |
| **Silhouette score** | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) |
| **Sentence Transformer** | A neural network architecture optimized for producing sentence-level embeddings |
| **UMAP** | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure |
---
*Document version: May 2025*
*Space: [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)*
*License: Apache 2.0*