curriculum_analysis_german / methods_draft.md

Methods Document

Concept Atlas: Methods & Application Guide

German Curriculum Semantic Analysis — Technical & Pedagogical Documentation

Space: deirdosh/curriculum_analysis_german
Focus concepts: mensch · verhalten · evolution
Model: paraphrase-multilingual-mpnet-base-v2


Table of Contents

  1. Overview & Motivation
  2. Data: The Curriculum Corpus
  3. Pipeline Architecture
  4. Multilingual Sentence Embeddings
  5. Dimensionality Reduction: UMAP
  6. Topic Modeling: BERTopic
  7. Information-Theoretic Measures
  8. Graph-Theoretic Analysis
  9. Cross-Concept Comparison
  10. State-Level Variation
  11. Caching & Reproducibility
  12. Educational Applications
  13. Decentralized Research Model
  14. Limitations & Ethical Considerations
  15. Glossary

1. Overview & Motivation

What this project does

The Concept Atlas is a computational tool for exploring how key biological and humanistic concepts are framed across German school curricula. Rather than reading curriculum documents manually — a task that scales poorly across 16 federal states (Bundesländer), dozens of subjects, and multiple grade levels — this tool uses modern Natural Language Processing (NLP) to:

  • Map the semantic landscape of curriculum language around three focus concepts: Mensch (human), Verhalten (behaviour), and Evolution
  • Cluster excerpts into coherent topics without any pre-defined categories
  • Compare how these concepts relate to each other mathematically
  • Detect variation in framing between federal states

Why these three concepts?

Mensch, Verhalten, and Evolution occupy a uniquely contested intersection in German science education. They appear across Biology, Ethics, Social Studies, Religion, and Psychology curricula — often with very different emphases depending on subject context and state. This makes them ideal test cases for computational curriculum analysis:

| Concept | Why it matters |
|---|---|
| Mensch | Bridges biological and humanistic framings; appears in nearly every subject |
| Verhalten | Links ethological science to social norms and moral education |
| Evolution | Scientifically precise in Biology; contested or reframed in other subjects |

Who is this for?

  • Curriculum researchers seeking scalable, reproducible analysis tools
  • Science educators interested in how their subject's language compares across states or disciplines
  • Policy analysts investigating curricular coherence and equity
  • Graduate students learning applied NLP for educational research
  • Open science advocates interested in decentralized, community-driven research infrastructure

2. Data: The Curriculum Corpus

Source

The corpus consists of text excerpts drawn from publicly available German school curriculum documents (Lehrpläne and Bildungspläne) across multiple federal states. Each excerpt was retrieved by keyword search and stored as a structured CSV file.

Structure

Each row in curriculum_excerpts.csv represents one curriculum excerpt:

| Column | Description |
|---|---|
| search_term | The keyword used to retrieve the excerpt (e.g. mensch) |
| text_excerpt | The raw curriculum text (sentence to paragraph length) |
| state | German federal state (Bundesland) |
| subject | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| grade | Target grade level or band |
| school_type | School type (e.g. Gymnasium, Realschule) |

Preprocessing

Before analysis, the corpus undergoes the following cleaning steps:

Starting from the raw CSV:

  1. Normalise column names (lowercase, underscores)
  2. Fill missing values with empty strings
  3. Add missing optional columns (state, subject, grade, school_type)
  4. Strip whitespace from text_excerpt and search_term
  5. Remove excerpts shorter than 20 characters
  6. Derive search_term_lower for case-insensitive concept matching

Concept subsets are built by exact match on search_term_lower, with automatic fallback to partial string match if fewer than 10 exact matches are found. This handles spelling variants and compound words.
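This exact-then-fallback matching can be sketched in a few lines (a minimal reimplementation for illustration; the function name and row-dict representation are ours, not the Space's actual code):

```python
def concept_subset(rows, concept, min_exact=10):
    """Select the excerpts for one focus concept.

    Try an exact match on search_term_lower first; if fewer than
    min_exact rows match, fall back to a substring match so that
    spelling variants and German compounds (e.g. "menschenbild")
    are still captured.
    """
    concept = concept.lower()
    exact = [row for row in rows if row["search_term_lower"] == concept]
    if len(exact) >= min_exact:
        return exact
    return [row for row in rows if concept in row["search_term_lower"]]
```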

Scale considerations

The analysis is designed to work with corpora ranging from a few hundred to tens of thousands of excerpts. Embedding and topic modeling parameters scale automatically:

  • n_neighbors in UMAP is capped at min(15, n_samples - 1)
  • min_cluster_size in HDBSCAN is set to min(5, max(2, n // 10))
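In code, the adaptive scaling described above amounts to the following (the helper name is ours, for illustration):

```python
def adaptive_params(n_samples):
    """Scale UMAP / HDBSCAN parameters with corpus size, as described above."""
    n_neighbors = min(15, n_samples - 1)
    min_cluster_size = min(5, max(2, n_samples // 10))
    return n_neighbors, min_cluster_size

adaptive_params(8)     # tiny corpus → (7, 2)
adaptive_params(1000)  # large corpus → (15, 5)
```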

3. Pipeline Architecture

The full analysis pipeline runs sequentially in a single click. All computationally expensive steps are cached to disk so that subsequent exploration is instantaneous.

```
CSV ingestion
│
▼
Sentence embeddings (per concept)
│
├──► UMAP 2-D (visualization)
├──► UMAP 3-D (atlas & joint space)
└──► BERTopic
     ├──► Topic labels per excerpt
     ├──► Top words per topic
     └──► Topic probability distributions
          ├──► Shannon entropy
          ├──► Jensen-Shannon divergence (cross-concept)
          ├──► Jensen-Shannon divergence (cross-state)
          └──► Cosine similarity (centroid comparison)

Semantic kNN graph (per concept)
│
├──► Betweenness centrality
├──► PageRank
├──► Closeness centrality
├──► Louvain communities
├──► Network density
└──► Average clustering coefficient

Enriched parquet export
│
├──► data/enriched_corpus.csv
└──► cache/enriched_corpus.parquet
```

Caching strategy

Every expensive computation writes a .npy (arrays) or .json (metadata) file to ./cache/, keyed by a combination of:

  • The logical name of the artefact (e.g. emb_mensch)
  • The number of input texts (detects corpus changes)
  • An MD5 hash of the full key string (prevents filename collisions)

On re-launch, the pipeline checks for cached files first and skips recomputation entirely if they exist. This makes the Space fast for end users while keeping the first-run cost affordable.
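The scheme can be sketched as follows (illustrative helper names; the exact key format used by the Space may differ):

```python
import hashlib
import os

import numpy as np

CACHE_DIR = "./cache"

def cache_path(name, n_texts, ext="npy"):
    """Deterministic cache filename: logical artefact name + corpus size
    + a short MD5 digest of the full key string."""
    key = f"{name}_{n_texts}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()[:8]
    return os.path.join(CACHE_DIR, f"{key}_{digest}.{ext}")

def cached_compute(name, n_texts, compute_fn):
    """Return the cached array if present, otherwise compute and store it."""
    path = cache_path(name, n_texts)
    if os.path.exists(path):          # cache hit: skip recomputation entirely
        return np.load(path)
    result = compute_fn()
    os.makedirs(CACHE_DIR, exist_ok=True)
    np.save(path, result)
    return result
```

A corpus size change yields a different key, hence a different filename, so stale artefacts are never reused.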


4. Multilingual Sentence Embeddings

What is an embedding?

An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings produce vectors that are close together in mathematical space; texts with different meanings produce vectors that are far apart.

The Concept Atlas uses vectors with 768 dimensions — each excerpt becomes a point in a 768-dimensional semantic space.

Model: paraphrase-multilingual-mpnet-base-v2

This model is a Sentence Transformer — a neural network fine-tuned specifically to produce high-quality sentence-level representations. Key properties relevant to this project:

| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native; no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere), so cosine similarity = dot product |
| License | Apache 2.0 |

Why multilingual?

German curriculum language is domain-specific and contains compound words, technical terms, and pedagogical jargon that generic German models may handle poorly. A multilingual model trained on diverse paraphrase data learns to recognize when two texts mean the same thing even when they are worded differently, which is exactly what is needed to group thematically similar curriculum excerpts.

Practical interpretation

Two excerpts with a high cosine similarity (close to 1.0) make similar semantic claims or describe similar content — even if they use different words. Two excerpts with low cosine similarity (close to 0) occupy different regions of conceptual space.
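Because the model outputs L2-normalized vectors (see the table above), cosine similarity reduces to a plain dot product. A tiny numpy check of that equivalence, using made-up 3-D vectors in place of real 768-D embeddings:

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

a = l2_normalize(np.array([0.30, 0.40, 0.50]))   # stand-ins for two excerpt embeddings
b = l2_normalize(np.array([0.31, 0.39, 0.52]))

dot = float(a @ b)                                # dot product of unit vectors
full = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
# dot and full coincide because ‖a‖ = ‖b‖ = 1 after normalization
```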


5. Dimensionality Reduction: UMAP

The dimensionality problem

768-dimensional vectors cannot be visualized directly. UMAP (Uniform Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while preserving as much of the neighbourhood structure as possible — meaning that points that were close in 768-D tend to remain close after reduction.

Two projections are computed

| Projection | Dimensions | Purpose |
|---|---|---|
| UMAP 2-D | 2 | Interactive scatter plots, BERTopic visualization |
| UMAP 3-D | 3 | Atlas visualization, joint concept space |

A joint 3-D UMAP is also computed across all three concept corpora combined, placing mensch, verhalten, and evolution excerpts in a single shared semantic space for direct comparison.

Key parameters

| Parameter | Value | Effect |
|---|---|---|
| n_neighbors | 15 | Controls local vs. global structure balance |
| min_dist | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| metric | cosine | Appropriate for normalized embeddings |
| random_state | 42 | Ensures reproducible layouts |

Accessible interpretation

Think of UMAP as making a map of meaning. Just as a geographic map compresses the curved surface of the Earth onto a flat page while preserving relative distances between cities, UMAP compresses the high-dimensional semantic space onto a 2-D or 3-D canvas while preserving which excerpts are semantically "nearby."

Clusters of points in a UMAP plot indicate groups of excerpts that discuss similar ideas. Gaps between clusters indicate distinct conceptual sub-areas. The absolute position of a cluster has no meaning — only relative distances matter.


6. Topic Modeling: BERTopic

What is topic modeling?

Topic modeling is an unsupervised method for discovering thematic groups within a text collection — without being told in advance what the themes are. Traditional methods (e.g. LDA) work on word co-occurrence statistics. BERTopic uses pre-computed sentence embeddings, which means it understands meaning rather than just word frequency.

Pipeline within BERTopic

```
Sentence embeddings (768-D)
│
▼
UMAP reduction (→ 5-D internal space)
│
▼
HDBSCAN clustering
│
▼
c-TF-IDF topic representation
(class-based TF-IDF: finds words that distinguish this topic from all others)
│
▼
Topic labels + per-document probabilities
```

HDBSCAN: density-based clustering

HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions in the embedding space, separated by sparser regions. Key advantages for curriculum text:

  • Does not require specifying the number of clusters in advance
  • Naturally handles outliers (assigned topic -1)
  • Finds clusters of variable size and shape

Parameters used

| Parameter | Value | Rationale |
|---|---|---|
| min_cluster_size | 5 (adaptive) | Minimum excerpts to form a topic |
| min_samples | max(1, min_cluster_size // 2) | Controls noise sensitivity |
| cluster_selection_method | eom | Excess of Mass; finds stable clusters |
| n_gram_range | (1, 2) | Single words and two-word phrases as features |

Reading the results

Each topic is characterized by its top words — terms with the highest c-TF-IDF scores for that cluster. These are words that appear frequently in the topic and rarely in other topics, making them highly discriminating.

Topic -1 is always the outlier category: excerpts that did not fit confidently into any discovered cluster. A high outlier rate may indicate either genuine semantic diversity or insufficient data for that concept.

Silhouette score

The silhouette score measures how well-separated the discovered clusters are, ranging from -1 (poor) to +1 (excellent):

  • > 0.5: well-separated, meaningful topics
  • 0.2–0.5: moderate separation — topics exist but overlap
  • < 0.2: clusters are not well-defined; interpret with caution
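The upper band can be illustrated with scikit-learn (assumed available; it is a standard dependency of this kind of pipeline) on two cleanly separated synthetic clusters:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# two tight, well-separated synthetic "topics"
cluster_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
cluster_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
points = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 50 + [1] * 50)

silhouette_score(points, labels)  # well above 0.5: well-separated clusters
```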

7. Information-Theoretic Measures

Information theory provides a principled mathematical language for measuring diversity, surprise, and difference in distributions. The Concept Atlas applies three core measures.

7.1 Shannon Entropy

$$H(X) = -\sum_{i} p_i \log_2 p_i \quad \text{(bits)}$$

What it measures: How evenly curriculum excerpts are spread across the discovered topics. High entropy = many topics of roughly equal size (diverse framing). Low entropy = one or two dominant topics (concentrated framing).

Interpretation guide:

| Entropy | Meaning |
|---|---|
| Low (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| Medium (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| High (> 2.5 bits) | Highly diverse; concept is used across many different framings |

Accessible analogy: Imagine rolling a die. If it always lands on 6 (entropy = 0), you learn nothing new from each roll. If all faces are equally likely (maximum entropy), each roll is maximally informative. Curriculum entropy works the same way — high entropy means the concept is used in many genuinely different ways.
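The computation itself is a few lines over the per-topic excerpt counts (numpy sketch):

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a vector of per-topic excerpt counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                       # by convention 0 · log(0) = 0
    return float(-(p * np.log2(p)).sum())

shannon_entropy([25, 25, 25, 25])      # four equal topics → 2.0 bits
shannon_entropy([97, 1, 1, 1])         # one dominant topic → ≈ 0.24 bits
```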

7.2 Jensen-Shannon Divergence (JSD)

$$JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$$

where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.

What it measures: The similarity between two probability distributions over topics. JSD is symmetric (order doesn't matter), bounded in [0, 1], and always defined (unlike raw KL divergence).

Used in two contexts:

  1. Cross-concept JSD: Do mensch, verhalten, and evolution have similar topic distributions? JSD near 0 means yes; near 1 means they occupy entirely different topical spaces.

  2. Cross-state JSD: Do two federal states frame the same concept similarly? High JSD between states indicates curricular divergence; low JSD indicates convergence.

Interpretation:

| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
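A direct numpy implementation of the definition above (base-2 logs, so the result lands in [0, 1]):

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions (base 2, in [0, 1])."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0                   # KL terms with a_i = 0 contribute nothing
        return float((a[mask] * np.log2(a[mask] / b[mask])).sum())

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

js_divergence([0.5, 0.5], [0.5, 0.5])  # identical → 0.0
js_divergence([1.0, 0.0], [0.0, 1.0])  # disjoint  → 1.0
```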

7.3 Cosine Similarity (Embedding Centroids)

$$\text{sim}(A, B) = \frac{\bar{a} \cdot \bar{b}}{\|\bar{a}\|\,\|\bar{b}\|}$$

where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of two concept corpora.

What it measures: Whether the average semantic content of two concept corpora occupies the same region of embedding space. This is complementary to JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD operates on the topic distribution (post-clustering).

Interpretation:

| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic context |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
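Centroid construction and the similarity itself are short; a numpy sketch with toy 2-D "embeddings" standing in for real corpora:

```python
import numpy as np

def centroid_similarity(emb_a, emb_b):
    """Cosine similarity between the mean embedding vectors of two concept corpora."""
    a = emb_a.mean(axis=0)
    b = emb_b.mean(axis=0)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

corpus_a = np.array([[1.0, 0.0], [0.8, 0.2]])
corpus_b = np.array([[0.0, 1.0], [0.0, 0.9]])
centroid_similarity(corpus_a, corpus_a)          # identical corpora → 1.0
centroid_similarity(corpus_a, corpus_b)          # near-orthogonal centroids → ≈ 0.11
```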

8. Graph-Theoretic Analysis

Why model curriculum text as a graph?

A graph (network) makes the relational structure of a corpus explicit. Instead of treating each excerpt independently, a graph reveals which excerpts are semantically central, which bridge different topic areas, and how the corpus is organized into communities.

Construction: k-Nearest Neighbour Graph

For each concept corpus, a graph $G = (V, E)$ is constructed where:

  • Nodes $V$: each curriculum excerpt
  • Edges $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity exceeds a threshold (≥ 0.35) and $j$ is among the $k=6$ nearest neighbours of $i$
  • Edge weights: the cosine similarity value

This creates a sparse similarity graph that captures local semantic neighbourhood structure.
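Edge construction can be sketched with numpy alone (directed neighbour lists for clarity; an undirected graph would merge (i, j) and (j, i)):

```python
import numpy as np

def knn_edges(embeddings, k=6, threshold=0.35):
    """Weighted edge list: j is among the k nearest neighbours of i
    and cosine similarity sim(i, j) clears the threshold."""
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = X @ X.T                      # cosine similarity matrix
    np.fill_diagonal(sims, -np.inf)     # exclude self-loops
    edges = []
    for i in range(len(X)):
        for j in np.argsort(sims[i])[::-1][:k]:   # k most similar excerpts
            if sims[i, j] >= threshold:
                edges.append((i, int(j), float(sims[i, j])))
    return edges
```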

Measures computed

Betweenness Centrality

$$C_B(v) = \sum_{s \neq v \neq t} \frac{\sigma(s,t \mid v)}{\sigma(s,t)}$$

Measures how often a node lies on the shortest path between other nodes. High betweenness = the excerpt is a semantic "bridge" between different topic areas. In curriculum terms, bridge excerpts often contain integrative or interdisciplinary language.

PageRank

Iteratively assigns importance to each node based on the importance of its neighbours. High PageRank = the excerpt is linked to by many other important excerpts. PageRank hubs represent semantically central curriculum statements that many other excerpts are conceptually near.

Closeness Centrality

Measures how quickly a node can reach all others via the graph. High closeness = the excerpt is semantically accessible from most others — a "general" or bridging statement.

Network Density

$$d = \frac{2|E|}{|V|(|V|-1)}$$

The fraction of all possible edges that actually exist. Higher density indicates a more semantically cohesive corpus (most excerpts are near most others). Lower density indicates a more fragmented semantic space.

Average Clustering Coefficient

Measures the tendency of nodes to form tightly connected local groups. High clustering = the corpus has tight semantic sub-communities.

Louvain Community Detection

Partitions the graph into communities that maximize modularity — the degree to which within-community connections exceed what would be expected by chance. Communities in a curriculum semantic graph often correspond to distinct disciplinary or contextual framings of a concept.

Accessible interpretation

Think of the semantic graph as a social network of curriculum excerpts, where two excerpts are "friends" if they discuss similar ideas.

  • Hubs (high PageRank) are the popular, central ideas that most other ideas are related to.
  • Bridges (high betweenness) are the connectors — excerpts that link otherwise separate clusters of ideas.
  • Communities are cliques of mutually similar excerpts — effectively the curriculum's implicit sub-topics.
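All of these measures can be reproduced on a toy graph. The sketch below uses networkx (our choice for illustration; the Space's own implementation may differ), with two tight clusters joined by a single bridge node:

```python
import networkx as nx

# Two triangles ("communities") joined through a single bridge node 6.
G = nx.Graph()
G.add_edges_from([
    (0, 1), (0, 2), (1, 2),   # community A
    (3, 4), (3, 5), (4, 5),   # community B
    (2, 6), (6, 3),           # node 6 bridges A and B
])

betweenness = nx.betweenness_centrality(G)
pagerank = nx.pagerank(G)
closeness = nx.closeness_centrality(G)
density = nx.density(G)
avg_clustering = nx.average_clustering(G)
communities = nx.community.louvain_communities(G, seed=42)

max(betweenness, key=betweenness.get)  # → 6: every A-to-B shortest path crosses the bridge
```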

9. Cross-Concept Comparison

The cross-concept analysis addresses the core research question: How do mensch, verhalten, and evolution relate to each other in German curriculum language?

Three complementary lenses are applied:

Lens 1: Geometric (Cosine Similarity Matrix)

Computes the cosine similarity between the mean embedding vectors of each concept corpus. This is a purely geometric measure — it asks whether the average document in one concept space is semantically close to the average document in another.

Lens 2: Distributional (Jensen-Shannon Divergence Matrix)

Computes JSD between the topic probability distributions of each concept pair. This asks whether the thematic structure (the pattern of which topics are prominent) is similar across concepts, regardless of the raw embedding geometry.

Lens 3: Comparative Statistics

Side-by-side comparison of:

  • Shannon entropy per concept (conceptual breadth)
  • Corpus size (representation in curricula)
  • Number of discovered topics (thematic complexity)
  • Silhouette score (cluster quality)

Why use multiple lenses?

Two concepts could be geometrically close (similar average embedding) but distributionally different (very different topic structures). For example, mensch and verhalten might both appear in social-science contexts (close centroids) but mensch might span biology, ethics, and philosophy while verhalten concentrates in behavioral science (different topic distributions). Using both measures together gives a richer picture.


10. State-Level Variation

Germany's federal education system means that curriculum design is the responsibility of each Bundesland. This creates natural variation that is itself a research object.

Bubble chart: Entropy × State

For each state-concept pair, Shannon entropy is computed over the topic distribution of excerpts from that state. Plotting entropy against state with bubble size proportional to corpus size reveals:

  • Which states frame a concept more uniformly (low entropy)
  • Which states frame it more diversely (high entropy)
  • Whether small-corpus states should be interpreted cautiously

State-pairwise JSD heatmaps

For each concept, a symmetric matrix of JSD values is computed between all pairs of states. This reveals:

  • Clusters of similar states (low inter-state JSD)
  • Outlier states with distinctive curriculum framing
  • Concept-specific patterns: states may converge on evolution (where scientific consensus constrains framing) but diverge on mensch (where philosophical tradition varies more)

11. Caching & Reproducibility

Deterministic cache keys

Every artefact is identified by a key encoding:

  1. The artefact type (e.g. emb_mensch)
  2. The number of input texts (detects corpus changes)
  3. An MD5 hash of the full key string

This means that if the corpus changes size, all downstream caches are automatically invalidated (different key → different filename → recomputed).

Artefact manifest

| File pattern | Content | Format |
|---|---|---|
| emb_{concept}_{n}_*.npy | Sentence embeddings | NumPy float32 |
| umap3d_{concept}_{n}_*.npy | 3-D UMAP coordinates | NumPy float32 |
| umap2d_{concept}_{n}_*.npy | 2-D UMAP coordinates | NumPy float32 |
| bertopic_{concept}_{n}_topics_*.json | Topic assignment per excerpt | JSON list |
| bertopic_{concept}_{n}_probs_*.npy | Topic probabilities | NumPy float32 |
| bertopic_{concept}_{n}_words_*.json | Top words per topic | JSON dict |
| bertopic_{concept}_{n}_info_*.json | Full topic info table | JSON list |
| joint_all_embs_*.npy | Joint corpus embeddings | NumPy float32 |
| joint_umap3d_*.npy | Joint 3-D UMAP | NumPy float32 |
| enriched_corpus.parquet | Full enriched dataset | Parquet |
| data/enriched_corpus.csv | Full enriched dataset | CSV |

Pushing to HuggingFace

Authenticate:

```
huggingface-cli login
```

Upload computed artefacts to the Space:

```
huggingface-cli upload deirdosh/curriculum_analysis_german ./cache cache --repo-type=space
huggingface-cli upload deirdosh/curriculum_analysis_german ./data data --repo-type=space
```

The enriched_corpus.csv adds BERTopic topic_id and UMAP coordinates (umap2_x, umap2_y, umap3_x, umap3_y, umap3_z) to every excerpt, making the enriched dataset independently useful for downstream analysis without re-running the pipeline.

12. Educational Applications

For curriculum researchers

The Concept Atlas operationalizes several research questions that have historically required manual content analysis:

Q: Is the concept of evolution framed consistently across German states?
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD scores frame the concept in fundamentally different ways — follow up by reading the excerpts in the high-JSD clusters.

Q: How does mensch differ between Biology and Ethics curricula?
→ Filter the UMAP scatter by subject metadata and look for spatial separation between subject-coloured clusters.

Q: Which curriculum excerpts are semantically central to the concept of verhalten?
→ Consult the PageRank and betweenness centrality rankings in the Graph Theory tab to find the hub and bridge excerpts.

Q: Are any states outliers in how they frame all three concepts?
→ Compare state-level entropy across all three concept bubble charts. A state that consistently shows either very low or very high entropy across all three concepts may have a distinctive curriculum philosophy.

For teachers and educators

You do not need to understand the mathematics to use the Concept Atlas productively. Here is an accessible guide:

The UMAP scatter plot is a map of meaning. Points close together mean the curriculum uses similar language in those excerpts. Click on any point to read the actual excerpt. Ask: Do the clusters make sense intuitively? Are there surprising neighbors?

The topic word clouds show you what each cluster is "about" — the most distinctive words for that group of excerpts. Use these to name the implicit sub-topics in your subject area's curriculum.

The entropy score is a single number summarizing how diverse curriculum language is for that concept. Compare it across states: does your state have higher or lower entropy than average? What might that mean for teaching?

The state JSD heatmap is a curriculum comparison tool. Find your state on both axes and read across: which states treat this concept most similarly to yours? Most differently? This can be a starting point for cross-state curriculum exchange or dialogue.

Classroom use

The Concept Atlas can support several classroom activities:

  • Curriculum literacy seminars: pre-service teachers can explore how their subject area frames key concepts, developing meta-awareness of curriculum language
  • Cross-disciplinary projects: students can investigate how mensch is framed differently in Biology vs. Religion, using the UMAP and topic plots as primary evidence
  • Federalism and education policy: social studies courses can use the state comparison features to discuss German educational federalism concretely
  • Philosophy of science: the evolution concept analysis can ground discussions of how scientific concepts travel (or don't) across subject boundaries

13. Decentralized Research Model

The case for community-driven curriculum analysis

Curriculum analysis has traditionally been the province of large research institutes with dedicated funding and staff. This creates several problems:

  • Coverage is selective and often lags policy changes by years
  • Methods are rarely shared, making replication difficult
  • Researchers in smaller institutions or non-German-speaking countries face high access barriers
  • Teachers, whose expertise is most relevant, are rarely involved as researchers

The Concept Atlas is designed to support a different model: lightweight, reproducible, community-extensible analysis hosted on free public infrastructure.

HuggingFace Spaces as research infrastructure

HuggingFace Spaces provides:

| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |

How to extend this work

Adding more concepts:
Edit the FOCUS_CONCEPTS list in app.py. The pipeline will automatically process new concepts if matching rows exist in the CSV.

Adding more states or subjects:
Extend the curriculum_excerpts.csv with new rows following the same column structure and re-run the pipeline.

Using a different embedding model:
Change the MODEL_NAME constant. Any model on the SentenceTransformers model hub with multilingual support can be substituted. Clear the embedding cache (cache/emb_*.npy) to force recomputation.

Comparative cross-national analysis:
The pipeline is language-agnostic (the embedding model supports 50+ languages). Providing curriculum excerpts from Austria, Switzerland, or other countries in the same CSV format enables direct cross-national comparison.

Contributing back:

  • Open an issue or discussion on the HuggingFace Space
  • Fork the Space and submit a PR with your extension
  • Publish your enriched corpus as a separate HuggingFace dataset with a link back to this Space

Minimal technical requirements for contributors

To run the pipeline locally or contribute new analysis:

Clone the Space:

```
git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
cd curriculum_analysis_german
```

Install dependencies:

```
pip install -r requirements.txt
```

Run locally:

```
python app.py
```

A standard laptop (8 GB RAM, no GPU) can run the full pipeline in approximately 15–20 minutes on first run. GPU acceleration reduces this to 2–5 minutes. All subsequent runs load from cache in under 10 seconds.


14. Limitations & Ethical Considerations

Technical limitations

Embedding model biases
The paraphrase-multilingual-mpnet-base-v2 model was trained primarily on paraphrase pairs and may not capture domain-specific curriculum jargon as accurately as a model fine-tuned on educational text. Terms with curriculum-specific meanings (e.g. Kompetenz in pedagogical vs. general usage) may be represented according to their general-language distribution.

UMAP non-determinism across runs
While random_state=42 ensures reproducibility within a session, UMAP projections are not globally canonical — a different seed or different n_neighbors value will produce a different (though structurally similar) layout. Conclusions should not depend on the absolute position of clusters, only on their relative proximity.

BERTopic outlier sensitivity
HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1). With small corpora or very diverse text, the outlier rate can be high (>50%). This is a signal that the data may be too heterogeneous for reliable topic discovery rather than a failure of the method — but it limits interpretability.

Corpus completeness
The current corpus may not include all German states, all school types, or all grade levels. Gaps in coverage mean that low entropy or low JSD for a state may reflect missing data rather than genuine curricular convergence.

No temporal dimension
The current analysis treats curricula as static documents. It does not capture revision histories or how concept framing has changed over time. A longitudinal extension would require time-stamped corpus versions.

Ethical considerations

Curriculum documents are public, but context matters
German curriculum documents are publicly available administrative texts. However, analysis that identifies specific states as "outliers" or frames curriculum differences in evaluative terms should be handled carefully. The goal of this tool is descriptive analysis, not ranking or judgment.

Automated analysis does not replace reading
Computational methods reveal patterns at scale but cannot replace close reading of the actual texts. Any finding from the Concept Atlas should be verified by examining the underlying excerpts before drawing policy conclusions.

Representation of marginalized perspectives
If curriculum documents systematically underrepresent certain voices (e.g. indigenous knowledge systems, minority cultural frameworks), those absences will not appear in the semantic analysis — which only reflects what is present in the text. The Concept Atlas can reveal what is there but not what is missing.

Open-source does not mean unbiased
The choice of focus concepts, the threshold parameters, and the framing of results all reflect research decisions made by the developers. We encourage users to interrogate these choices and to adapt the tool to their own research questions rather than treating the default configuration as neutral.


15. Glossary

| Term | Definition |
|---|---|
| BERTopic | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text |
| Betweenness centrality | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points |
| Clustering coefficient | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness |
| Cosine similarity | A measure of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| c-TF-IDF | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics |
| Embedding | A numerical vector representation of text that encodes semantic meaning |
| Entropy (Shannon) | A measure of uncertainty or diversity in a probability distribution; measured in bits |
| HDBSCAN | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape |
| Jensen-Shannon divergence | A symmetric, bounded measure of similarity between two probability distributions |
| kNN graph | A graph where each node is connected to its k nearest neighbours by some distance measure |
| Louvain algorithm | A community detection algorithm that maximizes modularity in a network |
| Modularity | A measure of the quality of a graph partition into communities |
| PageRank | A graph centrality measure that assigns importance based on the importance of connected nodes |
| Silhouette score | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) |
| Sentence Transformer | A neural network architecture optimized for producing sentence-level embeddings |
| UMAP | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure |

Document version: May 2025
Space: deirdosh/curriculum_analysis_german
License: Apache 2.0