Methods Document
Concept Atlas: Methods & Application Guide
German Curriculum Semantic Analysis — Technical & Pedagogical Documentation
Space: deirdosh/curriculum_analysis_german
Focus concepts: mensch · verhalten · evolution
Model: paraphrase-multilingual-mpnet-base-v2
Table of Contents
- Overview & Motivation
- Data: The Curriculum Corpus
- Pipeline Architecture
- Multilingual Sentence Embeddings
- Dimensionality Reduction: UMAP
- Topic Modeling: BERTopic
- Information-Theoretic Measures
- Graph-Theoretic Analysis
- Cross-Concept Comparison
- State-Level Variation
- Caching & Reproducibility
- Educational Applications
- Decentralized Research Model
- Limitations & Ethical Considerations
- Glossary
1. Overview & Motivation
What this project does
The Concept Atlas is a computational tool for exploring how key biological and humanistic concepts are framed across German school curricula. Rather than reading curriculum documents manually — a task that scales poorly across 16 federal states (Bundesländer), dozens of subjects, and multiple grade levels — this tool uses modern Natural Language Processing (NLP) to:
- Map the semantic landscape of curriculum language around three focus concepts: Mensch (human), Verhalten (behaviour), and Evolution
- Cluster excerpts into coherent topics without any pre-defined categories
- Compare how these concepts relate to each other mathematically
- Detect variation in framing between federal states
Why these three concepts?
Mensch, Verhalten, and Evolution occupy a uniquely contested intersection in German science education. They appear across Biology, Ethics, Social Studies, Religion, and Psychology curricula — often with very different emphases depending on subject context and state. This makes them ideal test cases for computational curriculum analysis:
| Concept | Why it matters |
|---|---|
| Mensch | Bridges biological and humanistic framings; appears in nearly every subject |
| Verhalten | Links ethological science to social norms and moral education |
| Evolution | Scientifically precise in Biology; contested or reframed in other subjects |
Who is this for?
- Curriculum researchers seeking scalable, reproducible analysis tools
- Science educators interested in how their subject's language compares across states or disciplines
- Policy analysts investigating curricular coherence and equity
- Graduate students learning applied NLP for educational research
- Open science advocates interested in decentralized, community-driven research infrastructure
2. Data: The Curriculum Corpus
Source
The corpus consists of text excerpts drawn from publicly available German school curriculum documents (Lehrpläne and Bildungspläne) across multiple federal states. Each excerpt was retrieved by keyword search and stored as a structured CSV file.
Structure
Each row in curriculum_excerpts.csv represents one curriculum excerpt:
| Column | Description |
|---|---|
| `search_term` | The keyword used to retrieve the excerpt (e.g. `mensch`) |
| `text_excerpt` | The raw curriculum text (sentence to paragraph length) |
| `state` | German federal state (Bundesland) |
| `subject` | School subject (e.g. Biologie, Ethik, Sozialkunde) |
| `grade` | Target grade level or band |
| `school_type` | School type (e.g. Gymnasium, Realschule) |
Preprocessing
Before analysis, the corpus undergoes the following cleaning steps:
Starting from the raw CSV:

- Normalise column names (lowercase, underscores)
- Fill missing values with empty strings
- Add missing optional columns (`state`, `subject`, `grade`, `school_type`)
- Strip whitespace from `text_excerpt` and `search_term`
- Remove excerpts shorter than 20 characters
- Derive `search_term_lower` for case-insensitive concept matching
Concept subsets are built by exact match on search_term_lower, with
automatic fallback to partial string match if fewer than 10 exact matches
are found. This handles spelling variants and compound words.
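These cleaning steps can be sketched in pandas. This is an illustrative reimplementation, not the Space's actual code: the function name `preprocess` is hypothetical, but the column names and thresholds follow the schema described above.

```python
import pandas as pd

OPTIONAL_COLS = ["state", "subject", "grade", "school_type"]

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the preprocessing steps described above."""
    # Normalise column names: lowercase, underscores
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Add missing optional columns, fill missing values with empty strings
    for col in OPTIONAL_COLS:
        if col not in df.columns:
            df[col] = ""
    df = df.fillna("")
    # Strip whitespace from the text fields
    df["text_excerpt"] = df["text_excerpt"].str.strip()
    df["search_term"] = df["search_term"].str.strip()
    # Remove excerpts shorter than 20 characters
    df = df[df["text_excerpt"].str.len() >= 20].copy()
    # Case-insensitive key for concept matching
    df["search_term_lower"] = df["search_term"].str.lower()
    return df
```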
Scale considerations
The analysis is designed to work with corpora ranging from a few hundred to tens of thousands of excerpts. Embedding and topic modeling parameters scale automatically:
- `n_neighbors` in UMAP is capped at `min(15, n_samples - 1)`
- `min_cluster_size` in HDBSCAN is set to `min(5, max(2, n // 10))`
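This adaptive scaling can be expressed as a small helper (a sketch; `adaptive_params` is an illustrative name, not necessarily a function in `app.py`):

```python
def adaptive_params(n_samples: int) -> dict:
    """Scale UMAP/HDBSCAN parameters to corpus size, as described above."""
    return {
        # UMAP: never request more neighbours than there are other points
        "n_neighbors": min(15, n_samples - 1),
        # HDBSCAN: small corpora get proportionally smaller minimum clusters
        "min_cluster_size": min(5, max(2, n_samples // 10)),
    }
```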
3. Pipeline Architecture
The full analysis pipeline runs sequentially in a single click. All computationally expensive steps are cached to disk so that subsequent exploration is instantaneous.
    CSV ingestion
     │
     ▼
    Sentence embeddings (per concept)
     │
     ├──► UMAP 2-D (visualization)
     ├──► UMAP 3-D (atlas & joint space)
     └──► BERTopic
           ├──► Topic labels per excerpt
           ├──► Top words per topic
           └──► Topic probability distributions
                 ├──► Shannon entropy
                 ├──► Jensen-Shannon divergence (cross-concept)
                 ├──► Jensen-Shannon divergence (cross-state)
                 └──► Cosine similarity (centroid comparison)

    Semantic kNN graph (per concept)
     │
     ├──► Betweenness centrality
     ├──► PageRank
     ├──► Closeness centrality
     ├──► Louvain communities
     ├──► Network density
     └──► Average clustering coefficient

    Enriched parquet export
     │
     └──► data/enriched_corpus.csv
          cache/enriched_corpus.parquet
Caching strategy
Every expensive computation writes a `.npy` (arrays) or `.json` (metadata) file to `./cache/`, keyed by a combination of:

- The logical name of the artefact (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string (prevents filename collisions)
On re-launch, the pipeline checks for cached files first and skips recomputation entirely if they exist. This makes the Space fast for end users while keeping the first-run cost affordable.
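A minimal sketch of this keying scheme (function and constant names here are illustrative, not the Space's actual helpers):

```python
import hashlib
import os

CACHE_DIR = "./cache"  # illustrative; matches the ./cache/ location above

def cache_path(name: str, n_texts: int, ext: str = "npy") -> str:
    """Deterministic cache filename from artefact name and corpus size.

    Any change in corpus size yields a different key, hence a different
    filename, hence automatic invalidation of stale artefacts.
    """
    key = f"{name}_{n_texts}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, f"{key}_{digest}.{ext}")
```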
4. Multilingual Sentence Embeddings
What is an embedding?
An embedding is a list of numbers (a vector) that represents the meaning of a piece of text. Texts with similar meanings produce vectors that are close together in mathematical space; texts with different meanings produce vectors that are far apart.
The Concept Atlas uses vectors with 768 dimensions — each excerpt becomes a point in a 768-dimensional semantic space.
Model: paraphrase-multilingual-mpnet-base-v2
This model is a Sentence Transformer — a neural network fine-tuned specifically to produce high-quality sentence-level representations. Key properties relevant to this project:
| Property | Detail |
|---|---|
| Architecture | MPNet-base (Masked and Permuted Pre-training) |
| Training data | Paraphrase pairs in 50+ languages |
| German support | Native — no translation needed |
| Output dimension | 768 |
| Normalization | L2-normalized (unit sphere) → cosine similarity = dot product |
| License | Apache 2.0 |
Why multilingual?
German curriculum language is domain-specific and contains compound words, technical terms, and pedagogical jargon that generic German models may handle poorly. A multilingual model trained on diverse paraphrase data is optimized to recognize when two passages express the same meaning in different words, which is exactly what is needed to group thematically similar curriculum excerpts.
Practical interpretation
Two excerpts with a high cosine similarity (close to 1.0) make similar semantic claims or describe similar content — even if they use different words. Two excerpts with low cosine similarity (close to 0) occupy different regions of conceptual space.
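Because the embeddings are L2-normalized (see the table above), cosine similarity reduces to a dot product. A toy numpy illustration with made-up three-dimensional unit vectors (real embeddings have 768 dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity; for unit-length vectors this equals the dot product."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy unit "embeddings" pointing in similar directions...
a = np.array([0.8, 0.6, 0.0])
b = np.array([0.6, 0.8, 0.0])
# ...and one orthogonal to both (unrelated content)
c = np.array([0.0, 0.0, 1.0])

# For normalized vectors, np.dot(a, b) gives the same value directly
assert abs(cosine_sim(a, b) - np.dot(a, b)) < 1e-12
```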
5. Dimensionality Reduction: UMAP
The dimensionality problem
768-dimensional vectors cannot be visualized directly. UMAP (Uniform Manifold Approximation and Projection) reduces them to 2 or 3 dimensions while preserving as much of the neighbourhood structure as possible — meaning that points that were close in 768-D tend to remain close after reduction.
Two projections are computed
| Projection | Dimensions | Purpose |
|---|---|---|
| UMAP 2-D | 2 | Interactive scatter plots, BERTopic visualization |
| UMAP 3-D | 3 | Atlas visualization, joint concept space |
A joint 3-D UMAP is also computed across all three concept corpora combined, placing mensch, verhalten, and evolution excerpts in a single shared semantic space for direct comparison.
Key parameters
| Parameter | Value | Effect |
|---|---|---|
| `n_neighbors` | 15 | Controls local vs. global structure balance |
| `min_dist` | 0.1 (3-D) / 0.05 (2-D) | How tightly clusters pack |
| `metric` | cosine | Appropriate for normalized embeddings |
| `random_state` | 42 | Ensures reproducible layouts |
Accessible interpretation
Think of UMAP as making a map of meaning. Just as a geographic map compresses the curved surface of the Earth onto a flat page while preserving relative distances between cities, UMAP compresses the high-dimensional semantic space onto a 2-D or 3-D canvas while preserving which excerpts are semantically "nearby."
Clusters of points in a UMAP plot indicate groups of excerpts that discuss similar ideas. Gaps between clusters indicate distinct conceptual sub-areas. The absolute position of a cluster has no meaning — only relative distances matter.
6. Topic Modeling: BERTopic
What is topic modeling?
Topic modeling is an unsupervised method for discovering thematic groups within a text collection — without being told in advance what the themes are. Traditional methods (e.g. LDA) work on word co-occurrence statistics. BERTopic uses pre-computed sentence embeddings, which means it understands meaning rather than just word frequency.
Pipeline within BERTopic
    Sentence embeddings (768-D)
     │
     ▼
    UMAP reduction (→ 5-D internal space)
     │
     ▼
    HDBSCAN clustering
     │
     ▼
    c-TF-IDF topic representation
    (class-based TF-IDF: finds words that distinguish this topic from all others)
     │
     ▼
    Topic labels + per-document probabilities
HDBSCAN: density-based clustering
HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) finds clusters as dense regions in the embedding space, separated by sparser regions. Key advantages for curriculum text:
- Does not require specifying the number of clusters in advance
- Naturally handles outliers (assigned topic `-1`)
- Finds clusters of variable size and shape
Parameters used
| Parameter | Value | Rationale |
|---|---|---|
| `min_cluster_size` | 5 (adaptive) | Minimum excerpts to form a topic |
| `min_samples` | `max(1, min_cluster_size // 2)` | Controls noise sensitivity |
| `cluster_selection_method` | `eom` | Excess of Mass — finds stable clusters |
| `n_gram_range` | (1, 2) | Single words and two-word phrases as features |
Reading the results
Each topic is characterized by its top words — terms with the highest c-TF-IDF scores for that cluster. These are words that appear frequently in the topic and rarely in other topics, making them highly discriminating.
Topic -1 is always the outlier category: excerpts that did not fit confidently into any discovered cluster. A high outlier rate may indicate either genuine semantic diversity or insufficient data for that concept.
Silhouette score
The silhouette score measures how well-separated the discovered clusters are, ranging from -1 (poor) to +1 (excellent):
- > 0.5: well-separated, meaningful topics
- 0.2–0.5: moderate separation — topics exist but overlap
- < 0.2: clusters are not well-defined; interpret with caution
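To see what the score means in practice, here is a synthetic demonstration using scikit-learn's `silhouette_score` (the data is fabricated; the Space computes the score on the real topic clusters):

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(42)

# Two tight, well-separated synthetic "topic" clusters in 2-D
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.1, size=(30, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.1, size=(30, 2))

points = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 30 + [1] * 30)

# Well-separated clusters push the score toward +1
score = silhouette_score(points, labels)
```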
7. Information-Theoretic Measures
Information theory provides a principled mathematical language for measuring diversity, surprise, and difference in distributions. The Concept Atlas applies three core measures.
7.1 Shannon Entropy
What it measures: How evenly curriculum excerpts are spread across the discovered topics. High entropy = many topics of roughly equal size (diverse framing). Low entropy = one or two dominant topics (concentrated framing).
Interpretation guide:
| Entropy | Meaning |
|---|---|
| Low (< 1 bit) | One topic dominates; concept is used in a narrow, uniform way |
| Medium (1–2.5 bits) | Moderate diversity; concept appears in several distinct contexts |
| High (> 2.5 bits) | Highly diverse; concept is used across many different framings |
Accessible analogy: Imagine rolling a die. If it always lands on 6 (entropy = 0), you learn nothing new from each roll. If all faces are equally likely (maximum entropy), each roll is maximally informative. Curriculum entropy works the same way — high entropy means the concept is used in many genuinely different ways.
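Entropy over topic sizes can be computed directly with numpy (a sketch; `shannon_entropy` is an illustrative name, and the input is any list of topic counts):

```python
import numpy as np

def shannon_entropy(counts) -> float:
    """Shannon entropy in bits of a topic-size distribution: H = -sum(p * log2 p)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]  # by convention 0 * log(0) = 0, so zero-count topics drop out
    return float(-(p * np.log2(p)).sum())
```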
7.2 Jensen-Shannon Divergence (JSD)
$$\mathrm{JSD}(P \,\|\, Q) = \tfrac{1}{2} D_{KL}(P \,\|\, M) + \tfrac{1}{2} D_{KL}(Q \,\|\, M)$$

where $M = \frac{1}{2}(P + Q)$ and $D_{KL}$ is the Kullback-Leibler divergence.
What it measures: The similarity between two probability distributions over topics. JSD is symmetric (order doesn't matter), bounded in [0, 1], and always defined (unlike raw KL divergence).
Used in two contexts:
Cross-concept JSD: Do mensch, verhalten, and evolution have similar topic distributions? JSD near 0 means yes; near 1 means they occupy entirely different topical spaces.
Cross-state JSD: Do two federal states frame the same concept similarly? High JSD between states indicates curricular divergence; low JSD indicates convergence.
Interpretation:
| JSD | Interpretation |
|---|---|
| 0.0 | Identical topic distributions |
| 0.0–0.2 | Very similar framing |
| 0.2–0.5 | Moderate divergence |
| 0.5–1.0 | Substantially different framing |
| 1.0 | Completely non-overlapping distributions |
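A direct numpy implementation with log base 2, matching the [0, 1] bound above (function names are illustrative). Note that `scipy.spatial.distance.jensenshannon` returns the *square root* of this quantity, so an explicit implementation avoids a common pitfall:

```python
import numpy as np

def kl_divergence(p, q) -> float:
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

def jsd(p, q) -> float:
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)  # the mixture distribution M
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```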
7.3 Cosine Similarity (Embedding Centroids)
$$\mathrm{sim}(\bar{a}, \bar{b}) = \frac{\bar{a} \cdot \bar{b}}{\lVert \bar{a} \rVert \, \lVert \bar{b} \rVert}$$

where $\bar{a}$ and $\bar{b}$ are the mean embedding vectors (centroids) of two concept corpora.
What it measures: Whether the average semantic content of two concept corpora occupies the same region of embedding space. This is complementary to JSD: cosine similarity operates on raw embeddings (pre-clustering), while JSD operates on the topic distribution (post-clustering).
Interpretation:
| Cosine sim | Interpretation |
|---|---|
| > 0.9 | Concepts discussed in nearly identical semantic context |
| 0.7–0.9 | Related but distinct semantic regions |
| 0.5–0.7 | Moderate semantic overlap |
| < 0.5 | Concepts occupy largely separate semantic spaces |
8. Graph-Theoretic Analysis
Why model curriculum text as a graph?
A graph (network) makes the relational structure of a corpus explicit. Instead of treating each excerpt independently, a graph reveals which excerpts are semantically central, which bridge different topic areas, and how the corpus is organized into communities.
Construction: k-Nearest Neighbour Graph
For each concept corpus, a graph $G = (V, E)$ is constructed where:
- Nodes $V$: each curriculum excerpt
- Edges $E$: connect excerpt $i$ to excerpt $j$ if their cosine similarity exceeds a threshold (≥ 0.35) and $j$ is among the $k=6$ nearest neighbours of $i$
- Edge weights: the cosine similarity value
This creates a sparse similarity graph that captures local semantic neighbourhood structure.
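The construction can be sketched without any graph library (`knn_graph_edges` and `density` are illustrative names; embeddings are assumed L2-normalized as in Section 4, so dot products are cosine similarities):

```python
import numpy as np

def knn_graph_edges(embs: np.ndarray, k: int = 6, threshold: float = 0.35) -> set:
    """Undirected kNN edges: j is among i's k nearest and similarity >= threshold."""
    sims = embs @ embs.T                 # cosine similarities (rows unit-length)
    np.fill_diagonal(sims, -np.inf)      # exclude self-edges
    edges = set()
    for i in range(len(embs)):
        # indices of the k most similar excerpts to excerpt i
        for j in np.argsort(sims[i])[-k:]:
            if sims[i, j] >= threshold:
                edges.add((min(i, int(j)), max(i, int(j))))
    return edges

def density(n_nodes: int, edges) -> float:
    """Fraction of all possible undirected edges that actually exist."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0
```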
Measures computed
Betweenness Centrality
Measures how often a node lies on the shortest path between other nodes. High betweenness = the excerpt is a semantic "bridge" between different topic areas. In curriculum terms, bridge excerpts often contain integrative or interdisciplinary language.
PageRank
Iteratively assigns importance based on the importance of neighbours. High PageRank = the excerpt is cited (connected to) by many other important excerpts. PageRank hubs represent semantically central curriculum statements that many other excerpts are conceptually near.
Closeness Centrality
Measures how quickly a node can reach all others via the graph. High closeness = the excerpt is semantically accessible from most others — a "general" or bridging statement.
Network Density
The fraction of all possible edges that actually exist. Higher density indicates a more semantically cohesive corpus (most excerpts are near most others). Lower density indicates a more fragmented semantic space.
Average Clustering Coefficient
Measures the tendency of nodes to form tightly connected local groups. High clustering = the corpus has tight semantic sub-communities.
Louvain Community Detection
Partitions the graph into communities that maximize modularity — the degree to which within-community connections exceed what would be expected by chance. Communities in a curriculum semantic graph often correspond to distinct disciplinary or contextual framings of a concept.
Accessible interpretation
Think of the semantic graph as a social network of curriculum excerpts, where two excerpts are "friends" if they discuss similar ideas.
- Hubs (high PageRank) are the popular, central ideas that most other ideas are related to.
- Bridges (high betweenness) are the connectors — excerpts that link otherwise separate clusters of ideas.
- Communities are cliques of mutually similar excerpts — effectively the curriculum's implicit sub-topics.
9. Cross-Concept Comparison
The cross-concept analysis addresses the core research question: How do mensch, verhalten, and evolution relate to each other in German curriculum language?
Three complementary lenses are applied:
Lens 1: Geometric (Cosine Similarity Matrix)
Computes the cosine similarity between the mean embedding vectors of each concept corpus. This is a purely geometric measure — it asks whether the average document in one concept space is semantically close to the average document in another.
Lens 2: Distributional (Jensen-Shannon Divergence Matrix)
Computes JSD between the topic probability distributions of each concept pair. This asks whether the thematic structure (the pattern of which topics are prominent) is similar across concepts, regardless of the raw embedding geometry.
Lens 3: Comparative Statistics
Side-by-side comparison of:
- Shannon entropy per concept (conceptual breadth)
- Corpus size (representation in curricula)
- Number of discovered topics (thematic complexity)
- Silhouette score (cluster quality)
Why use multiple lenses?
Two concepts could be geometrically close (similar average embedding) but distributionally different (very different topic structures). For example, mensch and verhalten might both appear in social-science contexts (close centroids) but mensch might span biology, ethics, and philosophy while verhalten concentrates in behavioral science (different topic distributions). Using both measures together gives a richer picture.
10. State-Level Variation
Germany's federal education system means that curriculum design is the responsibility of each Bundesland. This creates natural variation that is itself a research object.
Bubble chart: Entropy × State
For each state-concept pair, Shannon entropy is computed over the topic distribution of excerpts from that state. Plotting entropy against state with bubble size proportional to corpus size reveals:
- Which states frame a concept more uniformly (low entropy)
- Which states frame it more diversely (high entropy)
- Whether small-corpus states should be interpreted cautiously
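The per-state entropy feeding this chart can be sketched with a pandas groupby. The DataFrame below is fabricated toy data; the column names `state` and `topic_id` follow the enriched corpus schema described in Section 11:

```python
import numpy as np
import pandas as pd

def entropy_bits(topic_ids: pd.Series) -> float:
    """Shannon entropy (bits) of a series of topic assignments."""
    p = topic_ids.value_counts(normalize=True).to_numpy()
    return float(-(p * np.log2(p)).sum())

# Toy enriched corpus: one row per excerpt with state and BERTopic topic_id
df = pd.DataFrame({
    "state": ["Bayern", "Bayern", "Bayern", "Berlin", "Berlin", "Berlin", "Berlin"],
    "topic_id": [0, 0, 0, 0, 1, 2, 3],
})

# Entropy and corpus size per state, as plotted in the bubble chart
per_state = df.groupby("state").agg(
    entropy=("topic_id", entropy_bits),
    n_excerpts=("topic_id", "size"),
)
```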
State-pairwise JSD heatmaps
For each concept, a symmetric matrix of JSD values is computed between all pairs of states. This reveals:
- Clusters of similar states (low inter-state JSD)
- Outlier states with distinctive curriculum framing
- Concept-specific patterns: states may converge on evolution (where scientific consensus constrains framing) but diverge on mensch (where philosophical tradition varies more)
11. Caching & Reproducibility
Deterministic cache keys
Every artefact is identified by a key encoding:
- The artefact type (e.g. `emb_mensch`)
- The number of input texts (detects corpus changes)
- An MD5 hash of the full key string
This means that if the corpus changes size, all downstream caches are automatically invalidated (different key → different filename → recomputed).
Artefact manifest
| File pattern | Content | Format |
|---|---|---|
| `emb_{concept}_{n}_*.npy` | Sentence embeddings | NumPy float32 |
| `umap3d_{concept}_{n}_*.npy` | 3-D UMAP coordinates | NumPy float32 |
| `umap2d_{concept}_{n}_*.npy` | 2-D UMAP coordinates | NumPy float32 |
| `bertopic_{concept}_{n}_topics_*.json` | Topic assignment per excerpt | JSON list |
| `bertopic_{concept}_{n}_probs_*.npy` | Topic probabilities | NumPy float32 |
| `bertopic_{concept}_{n}_words_*.json` | Top words per topic | JSON dict |
| `bertopic_{concept}_{n}_info_*.json` | Full topic info table | JSON list |
| `joint_all_embs_*.npy` | Joint corpus embeddings | NumPy float32 |
| `joint_umap3d_*.npy` | Joint 3-D UMAP | NumPy float32 |
| `enriched_corpus.parquet` | Full enriched dataset | Parquet |
| `data/enriched_corpus.csv` | Full enriched dataset | CSV |
Pushing to HuggingFace
Authenticate:

    huggingface-cli login

Upload computed artefacts to the Space:

    huggingface-cli upload deirdosh/curriculum_analysis_german ./cache cache --repo-type=space
    huggingface-cli upload deirdosh/curriculum_analysis_german ./data data --repo-type=space
The enriched_corpus.csv adds BERTopic topic_id and UMAP coordinates
(umap2_x, umap2_y, umap3_x, umap3_y, umap3_z) to every excerpt,
making the enriched dataset independently useful for downstream analysis
without re-running the pipeline.
12. Educational Applications
For curriculum researchers
The Concept Atlas operationalizes several research questions that have historically required manual content analysis:
Q: Is the concept of evolution framed consistently across German states?
→ Examine the state-pairwise JSD heatmap for evolution. States with high JSD
scores frame the concept in fundamentally different ways — follow up by reading
the excerpts in the high-JSD clusters.
Q: How does mensch differ between Biology and Ethics curricula?
→ Filter the UMAP scatter by subject metadata and look for spatial separation
between subject-coloured clusters.
Q: Which curriculum excerpts are semantically central to the concept
of verhalten?
→ Consult the PageRank and betweenness centrality rankings in the Graph
Theory tab to find the hub and bridge excerpts.
Q: Are any states outliers in how they frame all three concepts?
→ Compare state-level entropy across all three concept bubble charts. A state
that consistently shows either very low or very high entropy across all three
concepts may have a distinctive curriculum philosophy.
For teachers and educators
You do not need to understand the mathematics to use the Concept Atlas productively. Here is an accessible guide:
The UMAP scatter plot is a map of meaning. Points close together mean the curriculum uses similar language in those excerpts. Click on any point to read the actual excerpt. Ask: Do the clusters make sense intuitively? Are there surprising neighbors?
The topic word clouds show you what each cluster is "about" — the most distinctive words for that group of excerpts. Use these to name the implicit sub-topics in your subject area's curriculum.
The entropy score is a single number summarizing how diverse curriculum language is for that concept. Compare it across states: does your state have higher or lower entropy than average? What might that mean for teaching?
The state JSD heatmap is a curriculum comparison tool. Find your state on both axes and read across: which states treat this concept most similarly to yours? Most differently? This can be a starting point for cross-state curriculum exchange or dialogue.
Classroom use
The Concept Atlas can support several classroom activities:
- Curriculum literacy seminars: pre-service teachers can explore how their subject area frames key concepts, developing meta-awareness of curriculum language
- Cross-disciplinary projects: students can investigate how mensch is framed differently in Biology vs. Religion, using the UMAP and topic plots as primary evidence
- Federalism and education policy: social studies courses can use the state comparison features to discuss German educational federalism concretely
- Philosophy of science: the evolution concept analysis can ground discussions of how scientific concepts travel (or don't) across subject boundaries
13. Decentralized Research Model
The case for community-driven curriculum analysis
Curriculum analysis has traditionally been the province of large research institutes with dedicated funding and staff. This creates several problems:
- Coverage is selective and often lags policy changes by years
- Methods are rarely shared, making replication difficult
- Researchers in smaller institutions or non-German-speaking countries face high access barriers
- Teachers, whose expertise is most relevant, are rarely involved as researchers
The Concept Atlas is designed to support a different model: lightweight, reproducible, community-extensible analysis hosted on free public infrastructure.
HuggingFace Spaces as research infrastructure
HuggingFace Spaces provides:
| Feature | Research value |
|---|---|
| Free GPU/CPU hosting | Zero infrastructure cost for deployment |
| Git-based version control | Full reproducibility and change history |
| Public dataset repository | Findable, citable, reusable data |
| Community discussion | Peer feedback without formal publication gatekeeping |
| Fork-and-extend | Others can build on the analysis with one click |
How to extend this work
Adding more concepts:
Edit the FOCUS_CONCEPTS list in app.py. The pipeline will automatically
process new concepts if matching rows exist in the CSV.
Adding more states or subjects:
Extend the curriculum_excerpts.csv with new rows following the same column
structure and re-run the pipeline.
Using a different embedding model:
Change the MODEL_NAME constant. Any model on the
SentenceTransformers model hub
with multilingual support can be substituted. Clear the embedding cache
(cache/emb_*.npy) to force recomputation.
Comparative cross-national analysis:
The pipeline is language-agnostic (the embedding model supports 50+ languages).
Providing curriculum excerpts from Austria, Switzerland, or other countries
in the same CSV format enables direct cross-national comparison.
Contributing back:
- Open an issue or discussion on the HuggingFace Space
- Fork the Space and submit a PR with your extension
- Publish your enriched corpus as a separate HuggingFace dataset with a link back to this Space
Minimal technical requirements for contributors
To run the pipeline locally or contribute new analysis:
Clone the Space:

    git clone https://huggingface.co/spaces/deirdosh/curriculum_analysis_german
    cd curriculum_analysis_german

Install dependencies:

    pip install -r requirements.txt

Run locally:

    python app.py
A standard laptop (8 GB RAM, no GPU) can run the full pipeline in approximately 15–20 minutes on first run. GPU acceleration reduces this to 2–5 minutes. All subsequent runs load from cache in under 10 seconds.
14. Limitations & Ethical Considerations
Technical limitations
Embedding model biases
The paraphrase-multilingual-mpnet-base-v2 model was trained primarily on
paraphrase pairs and may not capture domain-specific curriculum jargon as
accurately as a model fine-tuned on educational text. Terms with
curriculum-specific meanings (e.g. Kompetenz in pedagogical vs. general
usage) may be represented according to their general-language distribution.
UMAP non-determinism across runs
While random_state=42 ensures reproducibility within a session, UMAP
projections are not globally canonical — a different seed or different
n_neighbors value will produce a different (though structurally similar)
layout. Conclusions should not depend on the absolute position of clusters,
only on their relative proximity.
BERTopic outlier sensitivity
HDBSCAN classifies excerpts that do not fit any cluster as outliers (topic -1).
With small corpora or very diverse text, the outlier rate can be high (>50%).
This is a signal that the data may be too heterogeneous for reliable topic
discovery rather than a failure of the method — but it limits interpretability.
Corpus completeness
The current corpus may not include all German states, all school types, or
all grade levels. Gaps in coverage mean that low entropy or low JSD for a
state may reflect missing data rather than genuine curricular convergence.
No temporal dimension
The current analysis treats curricula as static documents. It does not capture
revision histories or how concept framing has changed over time. A longitudinal
extension would require time-stamped corpus versions.
Ethical considerations
Curriculum documents are public, but context matters
German curriculum documents are publicly available administrative texts.
However, analysis that identifies specific states as "outliers" or frames
curriculum differences in evaluative terms should be handled carefully.
The goal of this tool is descriptive analysis, not ranking or judgment.
Automated analysis does not replace reading
Computational methods reveal patterns at scale but cannot replace close
reading of the actual texts. Any finding from the Concept Atlas should be
verified by examining the underlying excerpts before drawing policy conclusions.
Representation of marginalized perspectives
If curriculum documents systematically underrepresent certain voices
(e.g. indigenous knowledge systems, minority cultural frameworks), those
absences will not appear in the semantic analysis — which only reflects
what is present in the text. The Concept Atlas can reveal what is there
but not what is missing.
Open-source does not mean unbiased
The choice of focus concepts, the threshold parameters, and the framing of
results all reflect research decisions made by the developers. We encourage
users to interrogate these choices and to adapt the tool to their own
research questions rather than treating the default configuration as neutral.
15. Glossary
| Term | Definition |
|---|---|
| BERTopic | A topic modeling framework that uses pre-trained language model embeddings and density-based clustering to discover topics in text |
| Betweenness centrality | A graph measure of how often a node lies on the shortest path between other nodes; identifies semantic bridge points |
| Clustering coefficient | The tendency of a node's neighbours to also be connected to each other; measures local cohesiveness |
| Cosine similarity | A measure of the angle between two vectors; 1 = identical direction, 0 = orthogonal, -1 = opposite |
| c-TF-IDF | Class-based TF-IDF; identifies words that are distinctive for a topic relative to all other topics |
| Embedding | A numerical vector representation of text that encodes semantic meaning |
| Entropy (Shannon) | A measure of uncertainty or diversity in a probability distribution; measured in bits |
| HDBSCAN | Hierarchical Density-Based Spatial Clustering of Applications with Noise; finds clusters of arbitrary shape |
| Jensen-Shannon divergence | A symmetric, bounded measure of similarity between two probability distributions |
| kNN graph | A graph where each node is connected to its k nearest neighbours by some distance measure |
| Louvain algorithm | A community detection algorithm that maximizes modularity in a network |
| Modularity | A measure of the quality of a graph partition into communities |
| PageRank | A graph centrality measure that assigns importance based on the importance of connected nodes |
| Silhouette score | A measure of how well-separated clusters are; ranges from -1 (poor) to +1 (excellent) |
| Sentence Transformer | A neural network architecture optimized for producing sentence-level embeddings |
| UMAP | Uniform Manifold Approximation and Projection; a dimensionality reduction method that preserves neighbourhood structure |
Document version: May 2025
Space: deirdosh/curriculum_analysis_german
License: Apache 2.0