# Concept Atlas: Methods & Application Guide

### German Curriculum Semantic Analysis — Technical & Pedagogical Documentation

> **Space:** [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)
> **Focus concepts:** *mensch · verhalten · evolution*
> **Model:** `paraphrase-multilingual-mpnet-base-v2`

---

## Table of Contents

1. [Overview & Motivation](#1-overview--motivation)
2. [Data: The Curriculum Corpus](#2-data-the-curriculum-corpus)
3. [Pipeline Architecture](#3-pipeline-architecture)
4. [Multilingual Sentence Embeddings](#4-multilingual-sentence-embeddings)
5. [Dimensionality Reduction: UMAP](#5-dimensionality-reduction-umap)
6. [Topic Modeling: BERTopic](#6-topic-modeling-bertopic)
7. [Information-Theoretic Measures](#7-information-theoretic-measures)
8. [Graph-Theoretic Analysis](#8-graph-theoretic-analysis)
9. [Cross-Concept Comparison](#9-cross-concept-comparison)
10. [State-Level Variation](#10-state-level-variation)
11. [Caching & Reproducibility](#11-caching--reproducibility)
12. [Educational Applications](#12-educational-applications)
13. [Decentralized Research Model](#13-decentralized-research-model)
14. [Limitations & Ethical Considerations](#14-limitations--ethical-considerations)
15. [Glossary](#15-glossary)

---

## 1. Overview & Motivation

### What this project does

The **Concept Atlas** is a computational tool for exploring how key biological
and humanistic concepts are framed across German school curricula. Rather than
reading curriculum documents manually — a task that scales poorly across 16
federal states (*Bundesländer*), dozens of subjects, and multiple grade levels —
this tool uses modern Natural Language Processing (NLP) to:

- **Map** the semantic landscape of curriculum language around three focus
  concepts: *Mensch* (human), *Verhalten* (behaviour), and *Evolution*
- **Cluster** excerpts into coherent topics without any pre-defined categories
- **Compare** how these concepts relate to each other mathematically
- **Detect** variation in framing between federal states

### Why these three concepts?

*Mensch*, *Verhalten*, and *Evolution* occupy a uniquely contested intersection
in German science education. They appear across Biology, Ethics, Social Studies,
Religion, and Psychology curricula — often with very different emphases
depending on subject context and state. This makes them ideal test cases for
computational curriculum analysis:

| Concept | Why it matters |
|---|---|
| **Mensch** | Bridges biological and humanistic framings; appears in nearly every subject |
| **Verhalten** | Links ethological science to social norms and moral education |
| **Evolution** | Scientifically precise in Biology; contested or reframed in other subjects |

### Who is this for?

- **Curriculum researchers** seeking scalable, reproducible analysis tools
- **Science educators** interested in how their subject's language compares
  across states or disciplines
- **Policy analysts** investigating curricular coherence and equity
- **Graduate students** learning applied NLP for educational research
- **Open science advocates** interested in decentralized, community-driven
  research infrastructure

---
## 2. Data: The Curriculum Corpus

### Source

The corpus consists of text excerpts drawn from publicly available German school
curriculum documents (*Lehrpläne* and *Bildungspläne*) across multiple federal
states. Each excerpt was retrieved by keyword search and stored as a structured
CSV file.

### Structure

Each row in `curriculum_excerpts.csv` represents one curriculum excerpt:

| Column | Description |
|---|---|
| `search_term` | The keyword used to retrieve the excerpt (e.g. `mensch`) |
| `text_excerpt` | The raw curriculum text (sentence to paragraph length) |
| `state` | German federal state (*Bundesland*) |
| `subject` | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| `grade` | Target grade level or band |
| `school_type` | School type (e.g. Gymnasium, Realschule) |
### Preprocessing

Before analysis, the corpus undergoes the following cleaning steps:

    Raw CSV
      → Normalise column names (lowercase, underscores)
      → Fill missing values with empty strings
      → Add missing optional columns (state, subject, grade, school_type)
      → Strip whitespace from text_excerpt and search_term
      → Remove excerpts shorter than 20 characters
      → Derive search_term_lower for case-insensitive concept matching

Concept subsets are built by **exact match** on `search_term_lower`, with an
automatic fallback to **partial string match** if fewer than 10 exact matches
are found. This handles spelling variants and compound words.
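A minimal sketch of this exact-match-with-fallback selection, assuming corpus rows carry the derived `search_term_lower` field (the function name and the `MIN_EXACT` constant are illustrative, not the app's actual identifiers):

```python
# Illustrative sketch of concept-subset selection with partial-match fallback.
# Plain dicts stand in for DataFrame records; MIN_EXACT mirrors the threshold of 10.
MIN_EXACT = 10

def concept_subset(rows, concept):
    """Prefer exact matches on search_term_lower; fall back to substring match."""
    exact = [r for r in rows if r["search_term_lower"] == concept]
    if len(exact) >= MIN_EXACT:
        return exact
    # Partial match catches spelling variants and compounds like "menschenbild"
    return [r for r in rows if concept in r["search_term_lower"]]
```

With fewer than 10 exact hits, compounds such as *Menschenbild* are swept into the *mensch* subset by the substring fallback.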

### Scale considerations

The analysis is designed to work with corpora ranging from a few hundred to
tens of thousands of excerpts. Embedding and topic modeling parameters scale
automatically:

- `n_neighbors` in UMAP is capped at `min(15, n_samples - 1)`
- `min_cluster_size` in HDBSCAN is set to `min(5, max(2, n // 10))`
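These scaling rules translate directly into code; a small sketch (the helper name is illustrative, and `min_samples` follows the rule given in Section 6):

```python
# The adaptive parameter rules stated above (helper name is illustrative).
def adaptive_params(n_samples: int) -> dict:
    min_cluster_size = min(5, max(2, n_samples // 10))
    return {
        "n_neighbors": min(15, n_samples - 1),         # UMAP
        "min_cluster_size": min_cluster_size,          # HDBSCAN
        "min_samples": max(1, min_cluster_size // 2),  # HDBSCAN (see Section 6)
    }
```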

---

## 3. Pipeline Architecture

The full analysis pipeline runs sequentially in a single click. All
computationally expensive steps are cached to disk so that subsequent
exploration is instantaneous.

    CSV ingestion
      │
      ▼
    Sentence embeddings (per concept)
      │
      ├──► UMAP 2-D (visualization)
      ├──► UMAP 3-D (atlas & joint space)
      ├──► Cosine similarity (centroid comparison)
      └──► BERTopic
             │
             ├──► Topic labels per excerpt
             ├──► Top words per topic
             └──► Topic probability distributions
                    │
                    ├──► Shannon entropy
                    ├──► Jensen-Shannon divergence (cross-concept)
                    └──► Jensen-Shannon divergence (cross-state)

    Semantic kNN graph (per concept)
      │
      ├──► Betweenness centrality
      ├──► PageRank
      ├──► Closeness centrality
      ├──► Louvain communities
      ├──► Network density
      └──► Average clustering coefficient

    Enriched corpus export
      │
      └──► data/enriched_corpus.csv
           cache/enriched_corpus.parquet
### Caching strategy

Every expensive computation writes a `.npy` (arrays) or `.json` (metadata)
file to `./cache/`, keyed by a combination of:

- The logical name of the artefact (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string (prevents filename collisions)

On re-launch, the pipeline checks for cached files first and skips recomputation
entirely if they exist. This makes the Space fast for end users while keeping
the first-run cost affordable.

---
## 4. Multilingual Sentence Embeddings

### What is an embedding?

An **embedding** is a list of numbers (a vector) that represents the meaning
of a piece of text. Texts with similar meanings produce vectors that are close
together in mathematical space; texts with different meanings produce vectors
that are far apart.

The Concept Atlas uses vectors with **768 dimensions** — each excerpt becomes
a point in a 768-dimensional semantic space.

### Model: `paraphrase-multilingual-mpnet-base-v2`

This model is a **Sentence Transformer** — a neural network fine-tuned
specifically to produce high-quality sentence-level representations. Key
properties relevant to this project:

| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native — no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere) → cosine similarity = dot product |
| License | Apache 2.0 |
### Why multilingual?

German curriculum language is domain-specific and contains compound words,
technical terms, and pedagogical jargon that generic German models may handle
poorly. A multilingual model trained on diverse paraphrase data is better at
capturing *semantic equivalence* across different wordings than a generic
language-model baseline — which is exactly what is needed to group
thematically similar curriculum excerpts.

### Practical interpretation

> Two excerpts with a **high cosine similarity** (close to 1.0) make similar
> semantic claims or describe similar content — even if they use different words.
> Two excerpts with **low cosine similarity** (close to 0) occupy different
> regions of conceptual space.

---
## 5. Dimensionality Reduction: UMAP

### The dimensionality problem

768-dimensional vectors cannot be visualized directly. **UMAP** (Uniform
Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while
preserving as much of the neighbourhood structure as possible — meaning that
points that were close in 768-D tend to remain close after reduction.

### Two projections are computed

| Projection | Dimensions | Purpose |
|---|---|---|
| **UMAP 2-D** | 2 | Interactive scatter plots, BERTopic visualization |
| **UMAP 3-D** | 3 | Atlas visualization, joint concept space |

A **joint 3-D UMAP** is also computed across all three concept corpora combined,
placing *mensch*, *verhalten*, and *evolution* excerpts in a single shared
semantic space for direct comparison.

### Key parameters

| Parameter | Value | Effect |
|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global structure balance |
| `min_dist` | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| `metric` | cosine | Appropriate for normalized embeddings |
| `random_state` | 42 | Ensures reproducible layouts |
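As a configuration sketch, the table above maps onto a umap-learn reducer roughly as follows (assuming the `umap-learn` package; `embeddings` stands for the (n, 768) embedding array):

```python
import umap

# 3-D reducer with the parameters from the table above
reducer_3d = umap.UMAP(
    n_components=3,
    n_neighbors=15,    # local vs. global structure balance
    min_dist=0.1,      # 0.05 for the 2-D projection
    metric="cosine",   # appropriate for L2-normalized embeddings
    random_state=42,   # reproducible layout
)
# coords_3d = reducer_3d.fit_transform(embeddings)  # shape (n, 3)
```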

### Accessible interpretation

> Think of UMAP as making a **map of meaning**. Just as a geographic map
> compresses the curved surface of the Earth onto a flat page while preserving
> relative distances between cities, UMAP compresses the high-dimensional
> semantic space onto a 2-D or 3-D canvas while preserving which excerpts are
> semantically "nearby."
>
> **Clusters** of points in a UMAP plot indicate groups of excerpts that
> discuss similar ideas. **Gaps** between clusters indicate distinct conceptual
> sub-areas. The absolute position of a cluster has no meaning — only
> relative distances matter.

---
## 6. Topic Modeling: BERTopic

### What is topic modeling?

Topic modeling is an **unsupervised** method for discovering thematic groups
within a text collection — without being told in advance what the themes are.
Traditional methods (e.g. LDA) work on word co-occurrence statistics.
**BERTopic** uses pre-computed sentence embeddings, which means it works with
*meaning* rather than just *word frequency*.

### Pipeline within BERTopic

    Sentence embeddings (768-D)
      │
      ▼
    UMAP reduction (→ 5-D internal space)
      │
      ▼
    HDBSCAN clustering
      │
      ▼
    c-TF-IDF topic representation
    (class-based TF-IDF: finds words that
     distinguish this topic from all others)
      │
      ▼
    Topic labels + per-document probabilities
### HDBSCAN: density-based clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with
Noise) finds clusters as **dense regions** in the embedding space, separated
by sparser regions. Key advantages for curriculum text:

- Does **not** require specifying the number of clusters in advance
- Naturally handles **outliers** (assigned topic `-1`)
- Finds clusters of **variable size and shape**

### Parameters used

| Parameter | Value | Rationale |
|---|---|---|
| `min_cluster_size` | 5 (adaptive) | Minimum excerpts to form a topic |
| `min_samples` | `max(1, min_cluster_size // 2)` | Controls noise sensitivity |
| `cluster_selection_method` | `eom` | Excess of Mass — finds stable clusters |
| `n_gram_range` | (1, 2) | Single words and two-word phrases as features |
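A sketch of how these parameters could be wired into BERTopic (assuming the `bertopic` and `hdbscan` packages; variable names are illustrative, and `texts`/`embeddings` stand for the corpus and its precomputed vectors):

```python
from bertopic import BERTopic
from hdbscan import HDBSCAN

min_cluster_size = 5  # adaptive in the pipeline (see Section 2)
clusterer = HDBSCAN(
    min_cluster_size=min_cluster_size,
    min_samples=max(1, min_cluster_size // 2),
    cluster_selection_method="eom",
    prediction_data=True,  # needed for per-document probabilities
)
topic_model = BERTopic(
    hdbscan_model=clusterer,
    n_gram_range=(1, 2),
    calculate_probabilities=True,
    language="multilingual",
)
# topics, probs = topic_model.fit_transform(texts, embeddings)
```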

### Reading the results

Each topic is characterized by its **top words** — terms with the highest
c-TF-IDF scores for that cluster. These are words that appear frequently in
the topic *and* rarely in other topics, making them highly discriminating.

**Topic -1** is always the outlier category: excerpts that did not fit
confidently into any discovered cluster. A high outlier rate may indicate
either genuine semantic diversity or insufficient data for that concept.

### Silhouette score

The **silhouette score** measures how well-separated the discovered clusters
are, ranging from -1 (poor) to +1 (excellent):

- **> 0.5**: well-separated, meaningful topics
- **0.2–0.5**: moderate separation — topics exist but overlap
- **< 0.2**: clusters are not well-defined; interpret with caution
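To make the score concrete, here is a from-scratch silhouette computation on a toy example (pure Python with Euclidean distance; real pipelines would use an optimized implementation such as scikit-learn's `silhouette_score`):

```python
import math

def silhouette(points, labels):
    """Mean silhouette: s(i) = (b - a) / max(a, b), averaged over all points."""
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        same = [j for j in groups[lab] if j != i]
        if not same:           # singleton cluster: silhouette defined as 0
            scores.append(0.0)
            continue
        # a: mean distance to own cluster; b: mean distance to the nearest other cluster
        a = sum(math.dist(points[i], points[j]) for j in same) / len(same)
        b = min(sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
                for other, idxs in groups.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Two tight, well-separated clusters on a line -> score close to +1
pts = [(0.0,), (0.2,), (10.0,), (10.2,)]
score = silhouette(pts, [0, 0, 1, 1])  # ≈ 0.98
```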

---

## 7. Information-Theoretic Measures

Information theory provides a principled mathematical language for measuring
**diversity**, **surprise**, and **difference** in distributions. The Concept
Atlas applies three core measures.
### 7.1 Shannon Entropy

$$H(X) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$

**What it measures:** How evenly curriculum excerpts are spread across the
discovered topics. High entropy = many topics of roughly equal size (diverse
framing). Low entropy = one or two dominant topics (concentrated framing).

**Interpretation guide:**

| Entropy | Meaning |
|---|---|
| **Low** (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| **Medium** (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| **High** (> 2.5 bits) | Highly diverse; concept is used across many different framings |

> **Accessible analogy:** Imagine rolling a die. If it always lands on 6
> (entropy = 0), you learn nothing new from each roll. If all faces are equally
> likely (maximum entropy), each roll is maximally informative. Curriculum
> entropy works the same way — high entropy means the concept is used in
> many genuinely different ways.
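The formula translates directly into code; a minimal pure-Python version operating on raw topic counts:

```python
import math

def shannon_entropy(counts):
    """Entropy in bits of a distribution given as raw topic counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)

uniform = shannon_entropy([25, 25, 25, 25])  # four equal topics: 2 bits (maximum)
peaked = shannon_entropy([97, 1, 1, 1])      # one dominant topic: low entropy
```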

### 7.2 Jensen-Shannon Divergence (JSD)

$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.

**What it measures:** How different two probability distributions over topics
are. JSD is symmetric (order doesn't matter), bounded in [0, 1] when base-2
logarithms are used, and always defined (unlike raw KL divergence).

**Used in two contexts:**

1. **Cross-concept JSD:** Do *mensch*, *verhalten*, and *evolution* have
   similar topic distributions? JSD near 0 means yes; near 1 means they
   occupy entirely different topical spaces.

2. **Cross-state JSD:** Do two federal states frame the same concept similarly?
   High JSD between states indicates curricular divergence; low JSD indicates
   convergence.

**Interpretation:**

| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
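A direct pure-Python implementation of the definition, using base-2 logarithms so the result lies in [0, 1]:

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits (terms with p_i = 0 contribute 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

same = jsd([0.5, 0.3, 0.2], [0.5, 0.3, 0.2])  # 0.0: identical framing
disjoint = jsd([1.0, 0.0], [0.0, 1.0])        # 1.0: no topic overlap at all
```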

### 7.3 Cosine Similarity (Embedding Centroids)

$$\text{sim}(A, B) = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\| \|\bar{b}\|}$$

where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of
two concept corpora.

**What it measures:** Whether the *average semantic content* of two concept
corpora occupies the same region of embedding space. This is complementary to
JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD
operates on the topic distribution (post-clustering).

**Interpretation:**

| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic contexts |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
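A pure-Python sketch of the centroid comparison on toy 2-D "embeddings" (real embeddings have 768 dimensions; all names here are illustrative):

```python
import math

def centroid(vectors):
    """Mean vector of a corpus, dimension by dimension."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

corpus_a = [[1.0, 0.0], [0.8, 0.2]]  # toy 2-D stand-ins for 768-D embeddings
corpus_b = [[0.9, 0.1], [1.0, 0.0]]
sim = cosine(centroid(corpus_a), centroid(corpus_b))  # close to 1.0
```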

---

## 8. Graph-Theoretic Analysis

### Why model curriculum text as a graph?

A graph (network) makes the **relational structure** of a corpus explicit.
Instead of treating each excerpt independently, a graph reveals which excerpts
are semantically central, which bridge different topic areas, and how the
corpus is organized into communities.

### Construction: k-Nearest Neighbour Graph

For each concept corpus, a graph $G = (V, E)$ is constructed where:

- **Nodes** $V$: each curriculum excerpt
- **Edges** $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity
  meets a threshold (≥ 0.35) *and* $j$ is among the $k = 6$ nearest neighbours
  of $i$
- **Edge weights**: the cosine similarity value

This creates a **sparse similarity graph** that captures local semantic
neighbourhood structure.
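A sketch of this construction on unit-normalized toy vectors (pure Python; the app builds the same structure from the 768-D embeddings, and the density measure defined below applies directly):

```python
def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))  # valid because vectors are unit-length

def knn_graph(vectors, k=6, threshold=0.35):
    """Undirected edges (i, j): j among i's k nearest and similarity >= threshold."""
    edges = set()
    for i, vi in enumerate(vectors):
        sims = sorted(((cosine(vi, vj), j) for j, vj in enumerate(vectors) if j != i),
                      reverse=True)
        for sim, j in sims[:k]:
            if sim >= threshold:
                edges.add((min(i, j), max(i, j)))
    return edges

def density(n_nodes, edges):
    """Fraction of possible edges present: 2|E| / (|V| (|V| - 1))."""
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))

# Two near-duplicate excerpts and one unrelated excerpt
vecs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = knn_graph(vecs, k=2)  # only the duplicate pair is linked
```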

### Measures computed

#### Betweenness Centrality

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t|v)}{\sigma(s,t)}$$

Measures how often a node lies on the shortest path between other nodes.
**High betweenness** = the excerpt is a semantic "bridge" between different
topic areas. In curriculum terms, bridge excerpts often contain integrative
or interdisciplinary language.

#### PageRank

Iteratively assigns importance based on the importance of neighbours.
**High PageRank** = the excerpt is connected to many other important
excerpts. PageRank hubs represent semantically central curriculum
statements that many other excerpts are conceptually near.

#### Closeness Centrality

Measures how quickly a node can reach all others via the graph.
**High closeness** = the excerpt is semantically accessible from most others —
a "general" or bridging statement.

#### Network Density

$$d = \frac{2|E|}{|V|(|V|-1)}$$

The fraction of all possible edges that actually exist. Higher density
indicates a more semantically cohesive corpus (most excerpts are near most
others). Lower density indicates a more fragmented semantic space.

#### Average Clustering Coefficient

Measures the tendency of nodes to form tightly connected local groups.
High clustering = the corpus has tight semantic sub-communities.

#### Louvain Community Detection

Partitions the graph into communities that maximize **modularity** — the
degree to which within-community connections exceed what would be expected
by chance. Communities in a curriculum semantic graph often correspond to
distinct disciplinary or contextual framings of a concept.

### Accessible interpretation

> Think of the semantic graph as a **social network of curriculum excerpts**,
> where two excerpts are "friends" if they discuss similar ideas.
>
> - **Hubs** (high PageRank) are the popular, central ideas that most other
>   ideas are related to.
> - **Bridges** (high betweenness) are the connectors — excerpts that link
>   otherwise separate clusters of ideas.
> - **Communities** are cliques of mutually similar excerpts — effectively
>   the curriculum's implicit sub-topics.
---

## 9. Cross-Concept Comparison

The cross-concept analysis addresses the core research question:
**How do *mensch*, *verhalten*, and *evolution* relate to each other in
German curriculum language?**

Three complementary lenses are applied:

### Lens 1: Geometric (Cosine Similarity Matrix)

Computes the cosine similarity between the **mean embedding vectors** of each
concept corpus. This is a purely geometric measure — it asks whether the
average document in one concept space is semantically close to the average
document in another.

### Lens 2: Distributional (Jensen-Shannon Divergence Matrix)

Computes JSD between the **topic probability distributions** of each concept
pair. This asks whether the *thematic structure* (the pattern of which topics
are prominent) is similar across concepts, regardless of the raw embedding
geometry.

### Lens 3: Comparative Statistics

Side-by-side comparison of:

- Shannon entropy per concept (conceptual breadth)
- Corpus size (representation in curricula)
- Number of discovered topics (thematic complexity)
- Silhouette score (cluster quality)

### Why use multiple lenses?

Two concepts could be **geometrically close** (similar average embedding) but
**distributionally different** (very different topic structures). For example,
*mensch* and *verhalten* might both appear in social-science contexts (close
centroids), but *mensch* might span biology, ethics, and philosophy while
*verhalten* concentrates in behavioural science (different topic distributions).
Using both measures together gives a richer picture.
---

## 10. State-Level Variation

Germany's federal education system means that curriculum design is the
responsibility of each *Bundesland*. This creates natural variation that is
itself a research object.

### Bubble chart: Entropy × State

For each state-concept pair, Shannon entropy is computed over the topic
distribution of excerpts from that state. Plotting entropy against state
with bubble size proportional to corpus size reveals:

- Which states frame a concept more *uniformly* (low entropy)
- Which states frame it more *diversely* (high entropy)
- Whether small-corpus states should be interpreted cautiously

### State-pairwise JSD heatmaps

For each concept, a symmetric matrix of JSD values is computed between all
pairs of states. This reveals:

- **Clusters of similar states** (low inter-state JSD)
- **Outlier states** with distinctive curriculum framing
- **Concept-specific patterns**: states may converge on *evolution*
  (where scientific consensus constrains framing) but diverge on *mensch*
  (where philosophical tradition varies more)

---

## 11. Caching & Reproducibility

### Deterministic cache keys

Every artefact is identified by a key encoding:

1. The artefact type (e.g. `emb_mensch`)
2. The number of input texts (detects corpus changes)
3. An MD5 hash of the full key string

This means that if the corpus changes size, all downstream caches are
automatically invalidated (different key → different filename → recomputed).
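A minimal sketch of such a key scheme (`hashlib` is standard library; the exact string format in the app may differ, and the corpus size 412 is just an example):

```python
import hashlib

def cache_filename(artefact: str, n_texts: int, suffix: str = "npy") -> str:
    """Deterministic cache filename: artefact name, corpus size, MD5 of the key."""
    key = f"{artefact}_{n_texts}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()[:8]
    return f"{artefact}_{n_texts}_{digest}.{suffix}"

name = cache_filename("emb_mensch", 412)     # stable across runs
renamed = cache_filename("emb_mensch", 413)  # corpus grew -> new filename
```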

### Artefact manifest

| File pattern | Content | Format |
|---|---|---|
| `emb_{concept}_{n}_*.npy` | Sentence embeddings | NumPy float32 |
| `umap3d_{concept}_{n}_*.npy` | 3-D UMAP coordinates | NumPy float32 |
| `umap2d_{concept}_{n}_*.npy` | 2-D UMAP coordinates | NumPy float32 |
| `bertopic_{concept}_{n}_topics_*.json` | Topic assignment per excerpt | JSON list |
| `bertopic_{concept}_{n}_probs_*.npy` | Topic probabilities | NumPy float32 |
| `bertopic_{concept}_{n}_words_*.json` | Top words per topic | JSON dict |
| `bertopic_{concept}_{n}_info_*.json` | Full topic info table | JSON list |
| `joint_all_embs_*.npy` | Joint corpus embeddings | NumPy float32 |
| `joint_umap3d_*.npy` | Joint 3-D UMAP | NumPy float32 |
| `enriched_corpus.parquet` | Full enriched dataset | Parquet |
| `data/enriched_corpus.csv` | Full enriched dataset | CSV |

### Pushing to Hugging Face

    # Authenticate
    huggingface-cli login

    # Upload computed artefacts to the Space
    huggingface-cli upload deirdosh/curriculum_analysis_german \
        ./cache cache --repo-type=space

    huggingface-cli upload deirdosh/curriculum_analysis_german \
        ./data data --repo-type=space

The `enriched_corpus.csv` export adds the BERTopic `topic_id` and UMAP
coordinates (`umap2_x`, `umap2_y`, `umap3_x`, `umap3_y`, `umap3_z`) to every
excerpt, making the enriched dataset independently useful for downstream
analysis without re-running the pipeline.

---

## 12. Educational Applications

### For curriculum researchers

The Concept Atlas operationalizes several research questions that have
historically required manual content analysis:

**Q: Is the concept of *evolution* framed consistently across German states?**
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD
scores frame the concept in fundamentally different ways — follow up by reading
the excerpts in the high-JSD clusters.

**Q: How does *mensch* differ between Biology and Ethics curricula?**
→ Filter the UMAP scatter by subject metadata and look for spatial separation
between subject-coloured clusters.

**Q: Which curriculum excerpts are semantically central to the concept
of *verhalten*?**
→ Consult the PageRank and betweenness centrality rankings in the Graph
Theory tab to find the hub and bridge excerpts.

**Q: Are any states outliers in how they frame all three concepts?**
→ Compare state-level entropy across all three concept bubble charts. A state
that consistently shows either very low or very high entropy across all three
concepts may have a distinctive curriculum philosophy.
### For teachers and educators

You do not need to understand the mathematics to use the Concept Atlas
productively. Here is an accessible guide:

**The UMAP scatter plot** is a *map of meaning*. Points close together mean
the curriculum uses similar language in those excerpts. Click on any point to
read the actual excerpt. Ask: Do the clusters make sense intuitively? Are
there surprising neighbours?

**The topic word clouds** show you what each cluster is "about" — the most
distinctive words for that group of excerpts. Use these to name the implicit
sub-topics in your subject area's curriculum.

**The entropy score** is a single number summarizing how *diverse* curriculum
language is for that concept. Compare it across states: does your state have
higher or lower entropy than average? What might that mean for teaching?

**The state JSD heatmap** is a curriculum comparison tool. Find your state on
both axes and read across: which states treat this concept most similarly to
yours? Most differently? This can be a starting point for cross-state
curriculum exchange or dialogue.

### Classroom use

The Concept Atlas can support several classroom activities:

- **Curriculum literacy seminars**: pre-service teachers can explore how their
  subject area frames key concepts, developing meta-awareness of curriculum
  language
- **Cross-disciplinary projects**: students can investigate how *mensch* is
  framed differently in Biology vs. Religion, using the UMAP and topic plots
  as primary evidence
- **Federalism and education policy**: social studies courses can use the
  state comparison features to discuss German educational federalism concretely
- **Philosophy of science**: the *evolution* concept analysis can ground
  discussions of how scientific concepts travel (or don't) across subject
  boundaries

---

## 13. Decentralized Research Model

### The case for community-driven curriculum analysis

Curriculum analysis has traditionally been the province of large research
institutes with dedicated funding and staff. This creates several problems:

- Coverage is selective and often lags policy changes by years
- Methods are rarely shared, making replication difficult
- Researchers in smaller institutions or non-German-speaking countries face
  high access barriers
- Teachers, whose expertise is most relevant, are rarely involved as researchers

The Concept Atlas is designed to support a different model: **lightweight,
reproducible, community-extensible analysis hosted on free public infrastructure.**

### Hugging Face Spaces as research infrastructure

Hugging Face Spaces provides:

| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |

### How to extend this work

**Adding more concepts:**
Edit the `FOCUS_CONCEPTS` list in `app.py`. The pipeline will automatically
process new concepts if matching rows exist in the CSV.
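For example (illustrative only — `ökosystem` is a hypothetical addition, not part of the current Space):

```python
# In app.py: the focus-concept list with one hypothetical new entry appended
FOCUS_CONCEPTS = ["mensch", "verhalten", "evolution", "ökosystem"]
```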

**Adding more states or subjects:**
Extend `curriculum_excerpts.csv` with new rows following the same column
structure and re-run the pipeline.

**Using a different embedding model:**
Change the `MODEL_NAME` constant. Any model on the
[SentenceTransformers model hub](https://www.sbert.net/docs/pretrained_models.html)
with multilingual support can be substituted. Clear the embedding cache
(`cache/emb_*.npy`) to force recomputation.

**Comparative cross-national analysis:**
The pipeline is language-agnostic (the embedding model supports 50+ languages).
Providing curriculum excerpts from Austria, Switzerland, or other countries
in the same CSV format enables direct cross-national comparison.

**Contributing back:**

- Open an issue or discussion on the Hugging Face Space
- Fork the Space and submit a PR with your extension
- Publish your enriched corpus as a separate Hugging Face dataset with a
  link back to this Space
### Minimal technical requirements for contributors

To run the pipeline locally or contribute new analysis:

    # Clone the Space
    git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
    cd curriculum_analysis_german

    # Install dependencies
    pip install -r requirements.txt

    # Run locally
    python app.py

A standard laptop (8 GB RAM, no GPU) can run the full pipeline in
approximately 15–20 minutes on first run. GPU acceleration reduces this to
2–5 minutes. All subsequent runs load from cache in under 10 seconds.

---
|
| ## 14. Limitations & Ethical Considerations |
|
|
| ### Technical limitations |
|
|
| **Embedding model biases** |
| The `paraphrase-multilingual-mpnet-base-v2` model was trained primarily on |
| paraphrase pairs and may not capture domain-specific curriculum jargon as |
| accurately as a model fine-tuned on educational text. Terms with |
| curriculum-specific meanings (e.g. *Kompetenz* in pedagogical vs. general |
| usage) may be represented according to their general-language distribution. |
|
|
| **UMAP non-determinism across runs** |
| While `random_state=42` ensures reproducibility within a session, UMAP |
| projections are not globally canonical — a different seed or different |
| `n_neighbors` value will produce a different (though structurally similar) |
| layout. Conclusions should not depend on the absolute position of clusters, |
| only on their relative proximity. |
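One way to verify that a finding depends only on relative proximity is to compare the k-nearest-neighbour sets of each point across two layouts produced with different seeds. The following is a minimal stdlib sketch with toy coordinates, not part of the Space's pipeline; in the toy example the second layout is a rotation of the first, so absolute positions differ while every point keeps the same neighbours.

```python
import math

def knn(points, k):
    """Return the set of k nearest neighbour indices for each point."""
    out = []
    for i, p in enumerate(points):
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        out.append({j for _, j in dists[:k]})
    return out

def neighbour_overlap(layout_a, layout_b, k=2):
    """Mean Jaccard overlap of kNN sets between two 2-D layouts."""
    a, b = knn(layout_a, k), knn(layout_b, k)
    return sum(len(x & y) / len(x | y) for x, y in zip(a, b)) / len(a)

# Toy layouts: layout_b is layout_a rotated 90 degrees, so coordinates
# differ but neighbourhood structure is identical.
layout_a = [(0, 0), (0, 1), (5, 5), (5, 6)]
layout_b = [(-y, x) for x, y in layout_a]
print(neighbour_overlap(layout_a, layout_b, k=1))  # → 1.0
```

A low overlap between two seeds would indicate that a claimed cluster structure is an artefact of the projection rather than of the embeddings.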
|
|
| **BERTopic outlier sensitivity** |
| HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1). |
| With small corpora or very diverse text, the outlier rate can be high (>50%). |
| This is a signal that the data may be too heterogeneous for reliable topic |
| discovery rather than a failure of the method — but it limits interpretability. |
|
|
| **Corpus completeness** |
| The current corpus may not include all German states, all school types, or |
| all grade levels. Gaps in coverage mean that low entropy or low JSD for a |
| state may reflect missing data rather than genuine curricular convergence. |
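This caveat can be made concrete with Shannon entropy: a state represented by only one or two excerpts will show low entropy purely as a small-sample effect, regardless of how diverse its actual curriculum is. A stdlib sketch with made-up topic counts:

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return 0.0 - sum(p * math.log2(p) for p in probs)

# Made-up topic counts per state, not real corpus numbers.
well_covered = [10, 8, 7, 5]   # many excerpts spread across four topics
sparse = [2]                   # two excerpts, only one topic observed
print(round(shannon_entropy(well_covered), 2))
print(shannon_entropy(sparse))  # 0.0 — looks "convergent", but may just be missing data
```

A sensible safeguard is to report the excerpt count alongside any entropy or JSD figure, and to flag states below a minimum sample size.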
|
|
| **No temporal dimension** |
| The current analysis treats curricula as static documents. It does not capture |
| revision histories or how concept framing has changed over time. A longitudinal |
| extension would require time-stamped corpus versions. |
|
|
| ### Ethical considerations |
|
|
| **Curriculum documents are public, but context matters** |
| German curriculum documents are publicly available administrative texts. |
| However, analysis that identifies specific states as "outliers" or frames |
| curriculum differences in evaluative terms should be handled carefully. |
| The goal of this tool is descriptive analysis, not ranking or judgment. |
|
|
| **Automated analysis does not replace reading** |
| Computational methods reveal patterns at scale but cannot replace close |
| reading of the actual texts. Any finding from the Concept Atlas should be |
| verified by examining the underlying excerpts before drawing policy conclusions. |
|
|
| **Representation of marginalized perspectives** |
| If curriculum documents systematically underrepresent certain voices |
| (e.g. indigenous knowledge systems, minority cultural frameworks), those |
| absences will not appear in the semantic analysis — which only reflects |
| what is present in the text. The Concept Atlas can reveal *what is there* |
| but not *what is missing*. |
|
|
| **Open-source does not mean unbiased** |
| The choice of focus concepts, the threshold parameters, and the framing of |
| results all reflect research decisions made by the developers. We encourage |
| users to interrogate these choices and to adapt the tool to their own |
| research questions rather than treating the default configuration as neutral. |
|
|
| --- |
|
|
| ## 15. Glossary |
|
|
| | Term | Definition | |
| |---|---| |
| | **BERTopic** | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text | |
| | **Betweenness centrality** | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points | |
| | **Clustering coefficient** | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness | |
| **Cosine similarity** | The cosine of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| | **c-TF-IDF** | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics | |
| | **Embedding** | A numerical vector representation of text that encodes semantic meaning | |
| | **Entropy (Shannon)** | A measure of uncertainty or diversity in a probability distribution; measured in bits | |
| | **HDBSCAN** | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape | |
| **Jensen-Shannon divergence** | A symmetric, bounded measure of the difference between two probability distributions; 0 = identical |
| | **kNN graph** | A graph where each node is connected to its k nearest neighbours by some distance measure | |
| | **Louvain algorithm** | A community detection algorithm that maximizes modularity in a network | |
| | **Modularity** | A measure of the quality of a graph partition into communities | |
| | **PageRank** | A graph centrality measure that assigns importance based on the importance of connected nodes | |
| | **Silhouette score** | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) | |
| | **Sentence Transformer** | A neural network architecture optimized for producing sentence-level embeddings | |
| | **UMAP** | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure | |
|
|
| --- |
|
|
| *Document version: May 2025* |
| *Space: [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)* |
| *License: Apache 2.0* |
| ``` |