# Concept Atlas: Methods & Application Guide
### German Curriculum Semantic Analysis — Technical & Pedagogical Documentation
> **Space:** [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)
> **Focus concepts:** *mensch · verhalten · evolution*
> **Model:** `paraphrase-multilingual-mpnet-base-v2`
---
## Table of Contents
1. [Overview & Motivation](#1-overview--motivation)
2. [Data: The Curriculum Corpus](#2-data-the-curriculum-corpus)
3. [Pipeline Architecture](#3-pipeline-architecture)
4. [Multilingual Sentence Embeddings](#4-multilingual-sentence-embeddings)
5. [Dimensionality Reduction: UMAP](#5-dimensionality-reduction-umap)
6. [Topic Modeling: BERTopic](#6-topic-modeling-bertopic)
7. [Information-Theoretic Measures](#7-information-theoretic-measures)
8. [Graph-Theoretic Analysis](#8-graph-theoretic-analysis)
9. [Cross-Concept Comparison](#9-cross-concept-comparison)
10. [State-Level Variation](#10-state-level-variation)
11. [Caching & Reproducibility](#11-caching--reproducibility)
12. [Educational Applications](#12-educational-applications)
13. [Decentralized Research Model](#13-decentralized-research-model)
14. [Limitations & Ethical Considerations](#14-limitations--ethical-considerations)
15. [Glossary](#15-glossary)
---
## 1. Overview & Motivation
### What this project does
The **Concept Atlas** is a computational tool for exploring how key biological
and humanistic concepts are framed across German school curricula. Rather than
reading curriculum documents manually — a task that scales poorly across 16
federal states (*Bundesländer*), dozens of subjects, and multiple grade levels —
this tool uses modern Natural Language Processing (NLP) to:
- **Map** the semantic landscape of curriculum language around three focus
concepts: *Mensch* (human), *Verhalten* (behaviour), and *Evolution*
- **Cluster** excerpts into coherent topics without any pre-defined categories
- **Compare** how these concepts relate to each other mathematically
- **Detect** variation in framing between federal states
### Why these three concepts?
*Mensch*, *Verhalten*, and *Evolution* occupy a uniquely contested intersection
in German science education. They appear across Biology, Ethics, Social Studies,
Religion, and Psychology curricula — often with very different emphases
depending on subject context and state. This makes them ideal test cases for
computational curriculum analysis:
| Concept | Why it matters |
|---|---|
| **Mensch** | Bridges biological and humanistic framings; appears in nearly every subject |
| **Verhalten** | Links ethological science to social norms and moral education |
| **Evolution** | Scientifically precise in Biology; contested or reframed in other subjects |
### Who is this for?
- **Curriculum researchers** seeking scalable, reproducible analysis tools
- **Science educators** interested in how their subject's language compares
across states or disciplines
- **Policy analysts** investigating curricular coherence and equity
- **Graduate students** learning applied NLP for educational research
- **Open science advocates** interested in decentralized, community-driven
research infrastructure
---
## 2. Data: The Curriculum Corpus
### Source
The corpus consists of text excerpts drawn from publicly available German school
curriculum documents (*Lehrpläne* and *Bildungspläne*) across multiple federal
states. Each excerpt was retrieved by keyword search and stored as a structured
CSV file.
### Structure
Each row in `curriculum_excerpts.csv` represents one curriculum excerpt:
| Column | Description |
|---|---|
| `search_term` | The keyword used to retrieve the excerpt (e.g. `mensch`) |
| `text_excerpt` | The raw curriculum text (sentence to paragraph length) |
| `state` | German federal state (*Bundesland*) |
| `subject` | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| `grade` | Target grade level or band |
| `school_type` | School type (e.g. Gymnasium, Realschule) |
### Preprocessing
Before analysis, the corpus undergoes the following cleaning steps:
```text
Raw CSV
 → Normalise column names (lowercase, underscores)
 → Fill missing values with empty strings
 → Add missing optional columns (state, subject, grade, school_type)
 → Strip whitespace from text_excerpt and search_term
 → Remove excerpts shorter than 20 characters
 → Derive search_term_lower for case-insensitive concept matching
```
Concept subsets are built by **exact match** on `search_term_lower`, with
automatic fallback to **partial string match** if fewer than 10 exact matches
are found. This handles spelling variants and compound words.
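The exact-then-partial matching logic can be sketched in pandas. The function name `concept_subset` and the `min_exact` parameter below are illustrative, not the Space's actual code:

```python
import pandas as pd

def concept_subset(df: pd.DataFrame, concept: str, min_exact: int = 10) -> pd.DataFrame:
    """Select excerpts for one concept: exact match on search_term_lower,
    falling back to substring match if too few exact hits (hypothetical helper)."""
    exact = df[df["search_term_lower"] == concept]
    if len(exact) >= min_exact:
        return exact
    # Partial match catches compounds such as "menschenbild"
    return df[df["search_term_lower"].str.contains(concept, na=False)]

corpus = pd.DataFrame({
    "search_term_lower": ["mensch", "menschenbild", "evolution"],
    "text_excerpt": ["...", "...", "..."],
})
# Only one exact "mensch" hit, so the partial fallback also pulls in "menschenbild"
subset = concept_subset(corpus, "mensch", min_exact=2)
```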
### Scale considerations
The analysis is designed to work with corpora ranging from a few hundred to
tens of thousands of excerpts. Embedding and topic modeling parameters scale
automatically:
- `n_neighbors` in UMAP is capped at `min(15, n_samples - 1)`
- `min_cluster_size` in HDBSCAN is set to `min(5, max(2, n // 10))`
---
## 3. Pipeline Architecture
The full analysis pipeline runs sequentially in a single click. All
computationally expensive steps are cached to disk so that subsequent
exploration is instantaneous.
```text
CSV ingestion
   │
   ▼
Sentence embeddings (per concept)
   ├──► UMAP 2-D (visualization)
   ├──► UMAP 3-D (atlas & joint space)
   └──► BERTopic
          ├──► Topic labels per excerpt
          ├──► Top words per topic
          └──► Topic probability distributions
                 ├──► Shannon entropy
                 ├──► Jensen-Shannon divergence (cross-concept)
                 ├──► Jensen-Shannon divergence (cross-state)
                 └──► Cosine similarity (centroid comparison)
   │
   ▼
Semantic kNN graph (per concept)
   ├──► Betweenness centrality
   ├──► PageRank
   ├──► Closeness centrality
   ├──► Louvain communities
   ├──► Network density
   └──► Average clustering coefficient
   │
   ▼
Enriched parquet export
   └──► data/enriched_corpus.csv
        cache/enriched_corpus.parquet
```
### Caching strategy
Every expensive computation writes a `.npy` (arrays) or `.json` (metadata)
file to `./cache/`, keyed by a combination of:
- The logical name of the artefact (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string (prevents filename collisions)
On re-launch, the pipeline checks for cached files first and skips recomputation
entirely if they exist. This makes the Space fast for end users while keeping
the first-run cost affordable.
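A minimal sketch of such a key scheme, using only the standard library (the exact key format used by the Space may differ):

```python
import hashlib

def cache_key(name: str, n_texts: int) -> str:
    """Deterministic cache filename stem: artefact name + corpus size + MD5 digest.
    Illustrative sketch, not the Space's actual implementation."""
    raw = f"{name}_{n_texts}"
    digest = hashlib.md5(raw.encode("utf-8")).hexdigest()[:8]
    return f"{raw}_{digest}"

# Same inputs always yield the same filename; a changed corpus size yields a new one,
# which is what invalidates all downstream caches automatically.
key = cache_key("emb_mensch", 512)
```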
---
## 4. Multilingual Sentence Embeddings
### What is an embedding?
An **embedding** is a list of numbers (a vector) that represents the meaning
of a piece of text. Texts with similar meanings produce vectors that are close
together in mathematical space; texts with different meanings produce vectors
that are far apart.
The Concept Atlas uses vectors with **768 dimensions** — each excerpt becomes
a point in a 768-dimensional semantic space.
### Model: `paraphrase-multilingual-mpnet-base-v2`
This model is a **Sentence Transformer** — a neural network fine-tuned
specifically to produce high-quality sentence-level representations. Key
properties relevant to this project:
| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native — no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere) → cosine similarity = dot product |
| License | Apache 2.0 |
### Why multilingual?
German curriculum language is domain-specific and contains compound words,
technical terms, and pedagogical jargon that generic German models may handle
poorly. A model fine-tuned on multilingual paraphrase pairs captures
*semantic equivalence* across different wordings better than a generic
masked-language-model baseline, which is exactly what is needed to group
thematically similar curriculum excerpts.
### Practical interpretation
> Two excerpts with a **high cosine similarity** (close to 1.0) make similar
> semantic claims or describe similar content — even if they use different words.
> Two excerpts with **low cosine similarity** (close to 0) occupy different
> regions of conceptual space.
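The normalization property noted in the table above can be checked directly with NumPy. The random vectors below are stand-ins for real excerpt embeddings, which would come from `SentenceTransformer.encode(..., normalize_embeddings=True)`:

```python
import numpy as np

# Two stand-in 768-D excerpt embeddings, L2-normalized to the unit sphere
rng = np.random.default_rng(42)
a = rng.normal(size=768); a /= np.linalg.norm(a)
b = rng.normal(size=768); b /= np.linalg.norm(b)

# For unit vectors the full cosine formula collapses to a plain dot product
cos_full = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
cos_dot = a @ b
```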
---
## 5. Dimensionality Reduction: UMAP
### The dimensionality problem
768-dimensional vectors cannot be visualized directly. **UMAP** (Uniform
Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while
preserving as much of the neighbourhood structure as possible — meaning that
points that were close in 768-D tend to remain close after reduction.
### Two projections are computed
| Projection | Dimensions | Purpose |
|---|---|---|
| **UMAP 2-D** | 2 | Interactive scatter plots, BERTopic visualization |
| **UMAP 3-D** | 3 | Atlas visualization, joint concept space |
A **joint 3-D UMAP** is also computed across all three concept corpora combined,
placing *mensch*, *verhalten*, and *evolution* excerpts in a single shared
semantic space for direct comparison.
### Key parameters
| Parameter | Value | Effect |
|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global structure balance |
| `min_dist` | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| `metric` | cosine | Appropriate for normalized embeddings |
| `random_state` | 42 | Ensures reproducible layouts |
### Accessible interpretation
> Think of UMAP as making a **map of meaning**. Just as a geographic map
> compresses the curved surface of the Earth onto a flat page while preserving
> relative distances between cities, UMAP compresses the high-dimensional
> semantic space onto a 2-D or 3-D canvas while preserving which excerpts are
> semantically "nearby."
>
> **Clusters** of points in a UMAP plot indicate groups of excerpts that
> discuss similar ideas. **Gaps** between clusters indicate distinct conceptual
> sub-areas. The absolute position of a cluster has no meaning — only
> relative distances matter.
---
## 6. Topic Modeling: BERTopic
### What is topic modeling?
Topic modeling is an **unsupervised** method for discovering thematic groups
within a text collection — without being told in advance what the themes are.
Traditional methods (e.g. LDA) work on word co-occurrence statistics.
**BERTopic** uses pre-computed sentence embeddings, which means it understands
*meaning* rather than just *word frequency*.
### Pipeline within BERTopic
```text
Sentence embeddings (768-D)
        │
        ▼
UMAP reduction (→ 5-D internal space)
        │
        ▼
HDBSCAN clustering
        │
        ▼
c-TF-IDF topic representation
  (class-based TF-IDF: finds words that
   distinguish this topic from all others)
        │
        ▼
Topic labels + per-document probabilities
```
### HDBSCAN: density-based clustering
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with
Noise) finds clusters as **dense regions** in the embedding space, separated
by sparser regions. Key advantages for curriculum text:
- Does **not** require specifying the number of clusters in advance
- Naturally handles **outliers** (assigned topic `-1`)
- Finds clusters of **variable size and shape**
### Parameters used
| Parameter | Value | Rationale |
|---|---|---|
| `min_cluster_size` | 5 (adaptive) | Minimum excerpts to form a topic |
| `min_samples` | `max(1, min_cluster_size // 2)` | Controls noise sensitivity |
| `cluster_selection_method` | `eom` | Excess of Mass — finds stable clusters |
| `n_gram_range` | (1, 2) | Single words and two-word phrases as features |
### Reading the results
Each topic is characterized by its **top words** — terms with the highest
c-TF-IDF scores for that cluster. These are words that appear frequently in
the topic *and* rarely in other topics, making them highly discriminating.
**Topic -1** is always the outlier category: excerpts that did not fit
confidently into any discovered cluster. A high outlier rate may indicate
either genuine semantic diversity or insufficient data for that concept.
### Silhouette score
The **silhouette score** measures how well-separated the discovered clusters
are, ranging from -1 (poor) to +1 (excellent):
- **> 0.5**: well-separated, meaningful topics
- **0.2–0.5**: moderate separation — topics exist but overlap
- **< 0.2**: clusters are not well-defined; interpret with caution
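The score can be explored with scikit-learn on a toy two-cluster dataset. This uses the euclidean metric for simplicity on made-up 2-D points; the pipeline itself scores clusters in its reduced embedding space:

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two clearly separated toy clusters: stand-ins for well-defined topics
rng = np.random.default_rng(0)
pts = np.vstack([
    rng.normal(0.0, 0.05, size=(20, 2)),   # cluster 0, tight around the origin
    rng.normal(5.0, 0.05, size=(20, 2)),   # cluster 1, tight around (5, 5)
])
labels = np.array([0] * 20 + [1] * 20)

# Well-separated clusters score close to +1
score = silhouette_score(pts, labels, metric="euclidean")
```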
---
## 7. Information-Theoretic Measures
Information theory provides a principled mathematical language for measuring
**diversity**, **surprise**, and **difference** in distributions. The Concept
Atlas applies three core measures.
### 7.1 Shannon Entropy
$$H(X) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$
**What it measures:** How evenly curriculum excerpts are spread across the
discovered topics. High entropy = many topics of roughly equal size (diverse
framing). Low entropy = one or two dominant topics (concentrated framing).
**Interpretation guide:**
| Entropy | Meaning |
|---|---|
| **Low** (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| **Medium** (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| **High** (> 2.5 bits) | Highly diverse; concept is used across many different framings |
> **Accessible analogy:** Imagine rolling a die. If it always lands on 6
> (entropy = 0), you learn nothing new from each roll. If all faces are equally
> likely (maximum entropy), each roll is maximally informative. Curriculum
> entropy works the same way — high entropy means the concept is used in
> many genuinely different ways.
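The formula translates directly into NumPy; the two toy distributions below reproduce the extremes of the interpretation table:

```python
import numpy as np

def shannon_entropy(p) -> float:
    """Shannon entropy in bits; zero-probability topics are skipped."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

low = shannon_entropy([1.0, 0.0, 0.0])   # one dominant topic: 0 bits
high = shannon_entropy([0.25] * 4)       # four equally likely topics: 2 bits
```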
### 7.2 Jensen-Shannon Divergence (JSD)
$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$
where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.
**What it measures:** The *divergence* between two probability distributions
over topics. JSD is symmetric (order doesn't matter), bounded in [0, 1] when
computed with base-2 logarithms, and always finite (unlike raw KL divergence,
which is asymmetric and can be infinite).
**Used in two contexts:**
1. **Cross-concept JSD:** Do *mensch*, *verhalten*, and *evolution* have
similar topic distributions? JSD near 0 means yes; near 1 means they
occupy entirely different topical spaces.
2. **Cross-state JSD:** Do two federal states frame the same concept similarly?
High JSD between states indicates curricular divergence; low JSD indicates
convergence.
**Interpretation:**
| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
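With SciPy this is a two-liner, with one caveat: `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance* (the square root of the divergence), so it must be squared. The toy distributions are illustrative, and the Space's own implementation may differ:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.6, 0.3, 0.1])   # topic distribution, concept A
q = np.array([0.1, 0.3, 0.6])   # topic distribution, concept B

# base=2 gives the [0, 1] range used in the interpretation table;
# squaring converts the JS distance back into the divergence.
jsd = jensenshannon(p, q, base=2) ** 2
identical = jensenshannon(p, p, base=2) ** 2   # identical distributions: 0
```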
### 7.3 Cosine Similarity (Embedding Centroids)
$$\text{sim}(A, B) = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\| \|\bar{b}\|}$$
where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of
two concept corpora.
**What it measures:** Whether the *average semantic content* of two concept
corpora occupies the same region of embedding space. This is complementary to
JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD
operates on the topic distribution (post-clustering).
**Interpretation:**
| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic context |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
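The centroid comparison itself is a few lines of NumPy. The random matrices below are stand-ins for two real concept embedding matrices:

```python
import numpy as np

def centroid_cosine(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between the mean embedding vectors of two corpora."""
    a, b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two synthetic corpora that share a common semantic direction plus noise
rng = np.random.default_rng(7)
base = rng.normal(size=(1, 768))
corpus_a = base + 0.1 * rng.normal(size=(50, 768))
corpus_b = base + 0.1 * rng.normal(size=(50, 768))

sim_self = centroid_cosine(corpus_a, corpus_a)   # a corpus with itself: 1.0
sim_pair = centroid_cosine(corpus_a, corpus_b)   # near 1, the shared direction dominates
```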
---
## 8. Graph-Theoretic Analysis
### Why model curriculum text as a graph?
A graph (network) makes the **relational structure** of a corpus explicit.
Instead of treating each excerpt independently, a graph reveals which excerpts
are semantically central, which bridge different topic areas, and how the
corpus is organized into communities.
### Construction: k-Nearest Neighbour Graph
For each concept corpus, a graph $G = (V, E)$ is constructed where:
- **Nodes** $V$: each curriculum excerpt
- **Edges** $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity
exceeds a threshold (≥ 0.35) *and* $j$ is among the $k=6$ nearest neighbours
of $i$
- **Edge weights**: the cosine similarity value
This creates a **sparse similarity graph** that captures local semantic
neighbourhood structure.
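A minimal sketch of this construction with NetworkX, assuming L2-normalized embeddings so that cosine similarity is a dot product (the toy data and function name are illustrative):

```python
import numpy as np
import networkx as nx

def knn_graph(emb: np.ndarray, k: int = 6, threshold: float = 0.35) -> nx.Graph:
    """Sparse cosine kNN graph: edge i-j iff j is among i's k nearest
    neighbours AND their similarity clears the threshold."""
    sims = emb @ emb.T                     # cosine similarity (unit vectors)
    np.fill_diagonal(sims, -np.inf)        # never link a node to itself
    g = nx.Graph()
    g.add_nodes_from(range(len(emb)))
    for i, row in enumerate(sims):
        for j in np.argsort(row)[-k:]:     # indices of the k most similar nodes
            if row[j] >= threshold:
                g.add_edge(i, int(j), weight=float(row[j]))
    return g

# Toy corpus: two tight semantic clusters of 10 "excerpts" each
rng = np.random.default_rng(1)
centers = rng.normal(size=(2, 32))
pts = np.vstack([c + 0.05 * rng.normal(size=(10, 32)) for c in centers])
pts /= np.linalg.norm(pts, axis=1, keepdims=True)
g = knn_graph(pts, k=3)
```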
### Measures computed
#### Betweenness Centrality
$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t|v)}{\sigma(s,t)}$$
Measures how often a node lies on the shortest path between other nodes.
**High betweenness** = the excerpt is a semantic "bridge" between different
topic areas. In curriculum terms, bridge excerpts often contain integrative
or interdisciplinary language.
#### PageRank
Iteratively assigns importance based on the importance of neighbours.
**High PageRank** = the excerpt is connected to many other well-connected
excerpts. PageRank hubs represent semantically central curriculum statements
that many other excerpts are conceptually near.
#### Closeness Centrality
Measures how quickly a node can reach all others via the graph.
**High closeness** = the excerpt is semantically accessible from most others —
a "general" or bridging statement.
#### Network Density
$$d = \frac{2|E|}{|V|(|V|-1)}$$
The fraction of all possible edges that actually exist. Higher density
indicates a more semantically cohesive corpus (most excerpts are near most
others). Lower density indicates a more fragmented semantic space.
#### Average Clustering Coefficient
Measures the tendency of nodes to form tightly connected local groups.
High clustering = the corpus has tight semantic sub-communities.
#### Louvain Community Detection
Partitions the graph into communities that maximize **modularity** — the
degree to which within-community connections exceed what would be expected
by chance. Communities in a curriculum semantic graph often correspond to
distinct disciplinary or contextual framings of a concept.
### Accessible interpretation
> Think of the semantic graph as a **social network of curriculum excerpts**,
> where two excerpts are "friends" if they discuss similar ideas.
>
> - **Hubs** (high PageRank) are the popular, central ideas that most other
> ideas are related to.
> - **Bridges** (high betweenness) are the connectors — excerpts that link
> otherwise separate clusters of ideas.
> - **Communities** are groups of mutually similar excerpts — effectively
>   the curriculum's implicit sub-topics.
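The hub, bridge, and community readings can be verified on a toy NetworkX graph: two triangles joined by a single connector node, a miniature of a semantic graph with two topical communities:

```python
import networkx as nx

# Two triangles (communities A and B) linked only through node 6
g = nx.Graph([(0, 1), (1, 2), (2, 0),     # community A
              (3, 4), (4, 5), (5, 3),     # community B
              (2, 6), (6, 3)])            # node 6 bridges A and B

betw = nx.betweenness_centrality(g)
pr = nx.pagerank(g)
comms = nx.community.louvain_communities(g, seed=42)

# Every shortest path between the two triangles runs through node 6,
# so it gets the highest betweenness: the "bridge" excerpt.
bridge = max(betw, key=betw.get)
```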
---
## 9. Cross-Concept Comparison
The cross-concept analysis addresses the core research question:
**How do *mensch*, *verhalten*, and *evolution* relate to each other in
German curriculum language?**
Three complementary lenses are applied:
### Lens 1: Geometric (Cosine Similarity Matrix)
Computes the cosine similarity between the **mean embedding vectors** of each
concept corpus. This is a purely geometric measure — it asks whether the
average document in one concept space is semantically close to the average
document in another.
### Lens 2: Distributional (Jensen-Shannon Divergence Matrix)
Computes JSD between the **topic probability distributions** of each concept
pair. This asks whether the *thematic structure* (the pattern of which topics
are prominent) is similar across concepts, regardless of the raw embedding
geometry.
### Lens 3: Comparative Statistics
Side-by-side comparison of:
- Shannon entropy per concept (conceptual breadth)
- Corpus size (representation in curricula)
- Number of discovered topics (thematic complexity)
- Silhouette score (cluster quality)
### Why use multiple lenses?
Two concepts could be **geometrically close** (similar average embedding) but
**distributionally different** (very different topic structures). For example,
*mensch* and *verhalten* might both appear in social-science contexts (close
centroids) but *mensch* might span biology, ethics, and philosophy while
*verhalten* concentrates in behavioral science (different topic distributions).
Using both measures together gives a richer picture.
---
## 10. State-Level Variation
Germany's federal education system means that curriculum design is the
responsibility of each *Bundesland*. This creates natural variation that is
itself a research object.
### Bubble chart: Entropy × State
For each state-concept pair, Shannon entropy is computed over the topic
distribution of excerpts from that state. Plotting entropy against state
with bubble size proportional to corpus size reveals:
- Which states frame a concept more *uniformly* (low entropy)
- Which states frame it more *diversely* (high entropy)
- Whether small-corpus states should be interpreted cautiously
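The per-state entropy computation reduces to a pandas group-by. The topic assignments below are dummy values; in the pipeline they come from BERTopic:

```python
import numpy as np
import pandas as pd

def entropy_bits(counts: pd.Series) -> float:
    """Shannon entropy in bits of a topic-count distribution."""
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Dummy excerpts: one state with a single dominant topic, one with four
df = pd.DataFrame({
    "state": ["Bayern"] * 4 + ["Berlin"] * 4,
    "topic_id": [0, 0, 0, 0, 0, 1, 2, 3],
})
by_state = df.groupby("state")["topic_id"].apply(
    lambda s: entropy_bits(s.value_counts())
)
# Bayern: 0 bits (uniform framing); Berlin: 2 bits (diverse framing)
```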
### State-pairwise JSD heatmaps
For each concept, a symmetric matrix of JSD values is computed between all
pairs of states. This reveals:
- **Clusters of similar states** (low inter-state JSD)
- **Outlier states** with distinctive curriculum framing
- **Concept-specific patterns**: states may converge on *evolution*
(where scientific consensus constrains framing) but diverge on *mensch*
(where philosophical tradition varies more)
---
## 11. Caching & Reproducibility
### Deterministic cache keys
Every artefact is identified by a key encoding:
1. The artefact type (e.g. `emb_mensch`)
2. The number of input texts (detects corpus changes)
3. An MD5 hash of the full key string
This means that if the corpus changes size, all downstream caches are
automatically invalidated (different key → different filename → recomputed).
### Artefact manifest
| File pattern | Content | Format |
|---|---|---|
| `emb_{concept}_{n}_*.npy` | Sentence embeddings | NumPy float32 |
| `umap3d_{concept}_{n}_*.npy` | 3-D UMAP coordinates | NumPy float32 |
| `umap2d_{concept}_{n}_*.npy` | 2-D UMAP coordinates | NumPy float32 |
| `bertopic_{concept}_{n}_topics_*.json` | Topic assignment per excerpt | JSON list |
| `bertopic_{concept}_{n}_probs_*.npy` | Topic probabilities | NumPy float32 |
| `bertopic_{concept}_{n}_words_*.json` | Top words per topic | JSON dict |
| `bertopic_{concept}_{n}_info_*.json` | Full topic info table | JSON list |
| `joint_all_embs_*.npy` | Joint corpus embeddings | NumPy float32 |
| `joint_umap3d_*.npy` | Joint 3-D UMAP | NumPy float32 |
| `enriched_corpus.parquet` | Full enriched dataset | Parquet |
| `data/enriched_corpus.csv` | Full enriched dataset | CSV |
### Pushing to HuggingFace
```bash
# Authenticate
huggingface-cli login

# Upload computed artefacts to the Space
huggingface-cli upload deirdosh/curriculum_analysis_german \
    ./cache cache --repo-type=space
huggingface-cli upload deirdosh/curriculum_analysis_german \
    ./data data --repo-type=space
```
The `enriched_corpus.csv` adds BERTopic `topic_id` and UMAP coordinates
(`umap2_x`, `umap2_y`, `umap3_x`, `umap3_y`, `umap3_z`) to every excerpt,
making the enriched dataset independently useful for downstream analysis
without re-running the pipeline.
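Because the enriched CSV is self-contained, downstream analysis needs only pandas. The dummy frame below mirrors the documented columns; real data would instead be loaded with `pd.read_csv("data/enriched_corpus.csv")`:

```python
import pandas as pd

# Dummy row with the columns the enriched export is documented to carry
enriched = pd.DataFrame({
    "text_excerpt": ["Der Mensch als soziales Wesen ..."],
    "state": ["Berlin"],
    "topic_id": [2],
    "umap2_x": [1.3], "umap2_y": [-0.7],
    "umap3_x": [0.4], "umap3_y": [2.1], "umap3_z": [-1.5],
})

# Example downstream query: excerpt counts per discovered topic
per_topic = enriched.groupby("topic_id").size()
```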
---
## 12. Educational Applications
### For curriculum researchers
The Concept Atlas operationalizes several research questions that have
historically required manual content analysis:
**Q: Is the concept of *evolution* framed consistently across German states?**
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD
scores frame the concept in fundamentally different ways — follow up by reading
the excerpts in the high-JSD clusters.
**Q: How does *mensch* differ between Biology and Ethics curricula?**
→ Filter the UMAP scatter by subject metadata and look for spatial separation
between subject-coloured clusters.
**Q: Which curriculum excerpts are semantically central to the concept
of *verhalten*?**
→ Consult the PageRank and betweenness centrality rankings in the Graph
Theory tab to find the hub and bridge excerpts.
**Q: Are any states outliers in how they frame all three concepts?**
→ Compare state-level entropy across all three concept bubble charts. A state
that consistently shows either very low or very high entropy across all three
concepts may have a distinctive curriculum philosophy.
### For teachers and educators
You do not need to understand the mathematics to use the Concept Atlas
productively. Here is an accessible guide:
**The UMAP scatter plot** is a *map of meaning*. Points close together mean
the curriculum uses similar language in those excerpts. Click on any point to
read the actual excerpt. Ask: Do the clusters make sense intuitively? Are
there surprising neighbors?
**The topic word clouds** show you what each cluster is "about" — the most
distinctive words for that group of excerpts. Use these to name the implicit
sub-topics in your subject area's curriculum.
**The entropy score** is a single number summarizing how *diverse* curriculum
language is for that concept. Compare it across states: does your state have
higher or lower entropy than average? What might that mean for teaching?
**The state JSD heatmap** is a curriculum comparison tool. Find your state on
both axes and read across: which states treat this concept most similarly to
yours? Most differently? This can be a starting point for cross-state
curriculum exchange or dialogue.
### Classroom use
The Concept Atlas can support several classroom activities:
- **Curriculum literacy seminars**: pre-service teachers can explore how their
subject area frames key concepts, developing meta-awareness of curriculum
language
- **Cross-disciplinary projects**: students can investigate how *mensch* is
framed differently in Biology vs. Religion, using the UMAP and topic plots
as primary evidence
- **Federalism and education policy**: social studies courses can use the
state comparison features to discuss German educational federalism concretely
- **Philosophy of science**: the *evolution* concept analysis can ground
discussions of how scientific concepts travel (or don't) across subject
boundaries
---
## 13. Decentralized Research Model
### The case for community-driven curriculum analysis
Curriculum analysis has traditionally been the province of large research
institutes with dedicated funding and staff. This creates several problems:
- Coverage is selective and often lags policy changes by years
- Methods are rarely shared, making replication difficult
- Researchers in smaller institutions or non-German-speaking countries face
high access barriers
- Teachers, whose expertise is most relevant, are rarely involved as researchers
The Concept Atlas is designed to support a different model: **lightweight,
reproducible, community-extensible analysis hosted on free public infrastructure.**
### HuggingFace Spaces as research infrastructure
HuggingFace Spaces provides:
| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |
### How to extend this work
**Adding more concepts:**
Edit the `FOCUS_CONCEPTS` list in `app.py`. The pipeline will automatically
process new concepts if matching rows exist in the CSV.
**Adding more states or subjects:**
Extend the `curriculum_excerpts.csv` with new rows following the same column
structure and re-run the pipeline.
**Using a different embedding model:**
Change the `MODEL_NAME` constant. Any model on the
[SentenceTransformers model hub](https://www.sbert.net/docs/pretrained_models.html)
with multilingual support can be substituted. Clear the embedding cache
(`cache/emb_*.npy`) to force recomputation.
**Comparative cross-national analysis:**
The pipeline is language-agnostic (the embedding model supports 50+ languages).
Providing curriculum excerpts from Austria, Switzerland, or other countries
in the same CSV format enables direct cross-national comparison.
**Contributing back:**
- Open an issue or discussion on the HuggingFace Space
- Fork the Space and submit a PR with your extension
- Publish your enriched corpus as a separate HuggingFace dataset with a
link back to this Space
### Minimal technical requirements for contributors
To run the pipeline locally or contribute new analysis:
```bash
# Clone the Space
git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
cd curriculum_analysis_german

# Install dependencies
pip install -r requirements.txt

# Run locally
python app.py
```
A standard laptop (8 GB RAM, no GPU) can run the full pipeline in
approximately 15–20 minutes on first run. GPU acceleration reduces this to
2–5 minutes. All subsequent runs load from cache in under 10 seconds.
---
## 14. Limitations & Ethical Considerations
### Technical limitations
**Embedding model biases**
The `paraphrase-multilingual-mpnet-base-v2` model was trained primarily on
paraphrase pairs and may not capture domain-specific curriculum jargon as
accurately as a model fine-tuned on educational text. Terms with
curriculum-specific meanings (e.g. *Kompetenz* in pedagogical vs. general
usage) may be represented according to their general-language distribution.
**UMAP non-determinism across runs**
While `random_state=42` ensures reproducibility within a session, UMAP
projections are not globally canonical — a different seed or different
`n_neighbors` value will produce a different (though structurally similar)
layout. Conclusions should not depend on the absolute position of clusters,
only on their relative proximity.
**BERTopic outlier sensitivity**
HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1).
With small corpora or very diverse text, the outlier rate can be high (>50%).
This is a signal that the data may be too heterogeneous for reliable topic
discovery rather than a failure of the method — but it limits interpretability.
**Corpus completeness**
The current corpus may not include all German states, all school types, or
all grade levels. Gaps in coverage mean that low entropy or low JSD for a
state may reflect missing data rather than genuine curricular convergence.
**No temporal dimension**
The current analysis treats curricula as static documents. It does not capture
revision histories or how concept framing has changed over time. A longitudinal
extension would require time-stamped corpus versions.
### Ethical considerations
**Curriculum documents are public, but context matters**
German curriculum documents are publicly available administrative texts.
However, analysis that identifies specific states as "outliers" or frames
curriculum differences in evaluative terms should be handled carefully.
The goal of this tool is descriptive analysis, not ranking or judgment.
**Automated analysis does not replace reading**
Computational methods reveal patterns at scale but cannot replace close
reading of the actual texts. Any finding from the Concept Atlas should be
verified by examining the underlying excerpts before drawing policy conclusions.
**Representation of marginalized perspectives**
If curriculum documents systematically underrepresent certain voices
(e.g. indigenous knowledge systems, minority cultural frameworks), those
absences will not appear in the semantic analysis — which only reflects
what is present in the text. The Concept Atlas can reveal *what is there*
but not *what is missing*.
**Open-source does not mean unbiased**
The choice of focus concepts, the threshold parameters, and the framing of
results all reflect research decisions made by the developers. We encourage
users to interrogate these choices and to adapt the tool to their own
research questions rather than treating the default configuration as neutral.
---
## 15. Glossary
| Term | Definition |
|---|---|
| **BERTopic** | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text |
| **Betweenness centrality** | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points |
| **Clustering coefficient** | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness |
| **Cosine similarity** | A measure of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| **c-TF-IDF** | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics |
| **Embedding** | A numerical vector representation of text that encodes semantic meaning |
| **Entropy (Shannon)** | A measure of uncertainty or diversity in a probability distribution; measured in bits |
| **HDBSCAN** | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape |
| **Jensen-Shannon divergence** | A symmetric, bounded measure of similarity between two probability distributions |
| **kNN graph** | A graph where each node is connected to its k nearest neighbours by some distance measure |
| **Louvain algorithm** | A community detection algorithm that maximizes modularity in a network |
| **Modularity** | A measure of the quality of a graph partition into communities |
| **PageRank** | A graph centrality measure that assigns importance based on the importance of connected nodes |
| **Silhouette score** | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) |
| **Sentence Transformer** | A neural network architecture optimized for producing sentence-level embeddings |
| **UMAP** | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure |
---
*Document version: May 2025*
*Space: [deirdosh/curriculum_analysis_german](https://huggingface.co/spaces/deirdosh/curriculum_analysis_german)*
*License: Apache 2.0*