# DWD Clean URL Architecture & SEO System
This document describes the path-based URL system implemented for the DWD section of Climate Explorer. It serves as a **reference template** for implementing clean URLs on other sections of the site.
## URL Structure
```
/dwd/{resolution}/{state?}/{station?}/?{view}&{start}&{end}
```
### Path Segments
| Segment | Required | Example | Description |
|---------|----------|---------|-------------|
| `resolution` | Yes, except on the bare `/dwd/` page, which defaults to `daily` | `hourly` | Time resolution slug |
| `state` | No | `bayern` | German state (Bundesland) slug |
| `station` | No | `muenchen-flughafen` | Station name slug |
### Query Parameters (UI state only, not indexed)
| Param | Default | Example | Description |
|-------|---------|---------|-------------|
| `view` | `map` | `dashboard-plots` | Active tab |
| `start` | Resolution default | `2020-01-01` | Date range start |
| `end` | Resolution default | `2026-04-26` | Date range end |
### URL Examples
```
# Base landing page (defaults to Daily)
/dwd/
# Resolution pages
/dwd/daily/
/dwd/hourly/
/dwd/10-minutes/
/dwd/monthly/
/dwd/annual/
# State pages
/dwd/daily/bayern/
/dwd/hourly/sachsen/
/dwd/10-minutes/nordrhein-westfalen/
# Station pages
/dwd/daily/bayern/muenchen-flughafen/
/dwd/hourly/sachsen/leipzig-holzhausen/
# With UI state (query params)
/dwd/daily/bayern/muenchen-flughafen/?view=dashboard-plots&start=2020-01-01&end=2026-04-26
```
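A minimal sketch of how the pattern above could be parsed, in JavaScript. The function name `parseDwdUrl`, the placeholder origin, and the choice to return `null` for unknown resolutions are assumptions for illustration, not the code in `dwd-page.js` or `rewrite-meta.ts`.

```javascript
// Slugs taken from the Resolution Slugs table in this document.
const KNOWN_RESOLUTIONS = ["10-minutes", "hourly", "daily", "monthly", "annual"];

// Hypothetical helper: split a DWD URL into path segments and UI-state params.
function parseDwdUrl(href) {
  const url = new URL(href, "https://example.org"); // placeholder origin
  const segments = url.pathname.split("/").filter(Boolean); // drop empty parts
  if (segments[0] !== "dwd") return null; // not a DWD URL

  // A missing resolution falls back to "daily"; state and station are optional.
  const [, resolution = "daily", state = null, station = null] = segments;
  if (!KNOWN_RESOLUTIONS.includes(resolution)) return null; // unknown slug

  return {
    resolution,
    state,
    station,
    view: url.searchParams.get("view") ?? "map", // documented default
    start: url.searchParams.get("start"), // null means "use the resolution default"
    end: url.searchParams.get("end"),
  };
}
```

Calling `parseDwdUrl("/dwd/")` yields the `daily` resolution with no state or station, which matches the base landing page described above.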
## Resolution Slugs
| UI Label | URL Slug | Shiny Internal Value |
|----------|----------|---------------------|
| 10 Minutes | `10-minutes` | `10_minutes` |
| Hourly | `hourly` | `hourly` |
| Daily | `daily` | `daily` |
| Monthly | `monthly` | `monthly` |
| Annual | `annual` | `annual` |
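The table can be kept as a single lookup structure so that all three representations stay in sync. The sketch below is one way to do that; `RESOLUTION_TABLE` and `findResolution` are hypothetical names, and the case-insensitive matching is an assumption.

```javascript
// Rows transcribed from the Resolution Slugs table.
const RESOLUTION_TABLE = [
  { label: "10 Minutes", slug: "10-minutes", internal: "10_minutes" },
  { label: "Hourly", slug: "hourly", internal: "hourly" },
  { label: "Daily", slug: "daily", internal: "daily" },
  { label: "Monthly", slug: "monthly", internal: "monthly" },
  { label: "Annual", slug: "annual", internal: "annual" },
];

// Hypothetical helper: find a row by UI label, URL slug, or Shiny internal value.
function findResolution(value) {
  const needle = String(value).trim().toLowerCase();
  return (
    RESOLUTION_TABLE.find(
      (row) =>
        row.slug === needle ||
        row.internal === needle ||
        row.label.toLowerCase() === needle
    ) ?? null
  );
}
```

Only `10 Minutes` differs between the three columns, so a single table avoids scattering that special case across the code.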
## Slugify Algorithm
State and station names are slugified using the same algorithm across all three layers (R, JS, Edge Function):
```
1. Replace German umlauts: ü→ue, ö→oe, ä→ae, Ü→ue, Ö→oe, Ä→ae, ß→ss
2. Lowercase
3. Strip diacritics (R uses iconv ASCII//TRANSLIT; JS/TS use NFD + regex)
4. Replace non-alphanumeric chars with hyphens
5. Trim leading/trailing hyphens
```
Examples:
- `München-Flughafen` → `muenchen-flughafen`
- `Nordrhein-Westfalen` → `nordrhein-westfalen`
- `Thüringen` → `thueringen`
- `Baden-Württemberg` → `baden-wuerttemberg`
> **Critical**: The slugify function must produce identical output in R (`scripts/export_seo_metadata.R`), JavaScript (`dwd-page.js`), and TypeScript (`rewrite-meta.ts`). Any mismatch causes 404s or broken links. Note that R uses `iconv(..., to = "ASCII//TRANSLIT")` while JS/TS use `NFD normalize + strip combining marks`; both produce the same result for German text.
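
The five steps can be sketched in JavaScript as follows. This is an illustration written from the step list, not a copy of the function in `dwd-page.js`; collapsing a run of several non-alphanumeric characters into a single hyphen is an assumption, since the steps do not say how runs are handled.

```javascript
// Step 1 mapping, written with escapes so the source stays encoding-safe.
const UMLAUT_MAP = {
  "\u00fc": "ue", // ü
  "\u00f6": "oe", // ö
  "\u00e4": "ae", // ä
  "\u00dc": "ue", // Ü
  "\u00d6": "oe", // Ö
  "\u00c4": "ae", // Ä
  "\u00df": "ss", // ß
};
const UMLAUT_PATTERN = new RegExp("[" + Object.keys(UMLAUT_MAP).join("") + "]", "g");

function slugify(name) {
  return name
    .replace(UMLAUT_PATTERN, (ch) => UMLAUT_MAP[ch]) // 1. German umlauts
    .toLowerCase() // 2. lowercase
    .normalize("NFD") // 3. decompose remaining accented letters...
    .replace(/[\u0300-\u036f]/g, "") //    ...and strip the combining marks
    .replace(/[^a-z0-9]+/g, "-") // 4. non-alphanumerics to hyphens (runs collapsed: assumption)
    .replace(/^-+|-+$/g, ""); // 5. trim leading/trailing hyphens
}
```

Step 1 only matches precomposed characters. If a name could arrive in decomposed form (a base letter followed by a combining diaeresis), normalizing the input to NFC first would be needed to keep `ü` from becoming `u` instead of `ue`.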
## System Architecture
The URL system spans four layers:
```
┌────────────────────────────────────────────────┐
│ 1. SEO Metadata (Build Time)                   │
│    R script → dwd-seo-metadata.json            │
│    Generates slug→metadata mappings for all    │
│    stations, states, and resolutions           │
├────────────────────────────────────────────────┤
│ 2. Edge Function (Request Time)                │
│    rewrite-meta.ts                             │
│    Parses URL → injects HTML body content,     │
│    meta tags, JSON-LD, canonical URL           │
├────────────────────────────────────────────────┤
│ 3. Parent Page JS (Client Side)                │
│    dwd-page.js                                 │
│    Parses URL → configures iframe,             │
│    listens to Shiny broadcasts → updates URL,  │
│    title, and dynamic context block            │
├────────────────────────────────────────────────┤
│ 4. Shiny App (Iframe)                          │
│    server.R                                    │
│    Receives URL params → broadcasts state      │
│    changes via postMessage to parent page      │
└────────────────────────────────────────────────┘
```
### 1. SEO Metadata Generation (Build Time)
**Script**: `scripts/export_seo_metadata.R` (in the DWD project)
**Output**: `dwd-seo-metadata.json` (in `climateexplorer/netlify/edge-functions/`)
This R script reads all 5 resolution RDS cache files and generates a JSON file containing:
- **Stations**: `{resolution}/{state-slug}/{station-slug}` → `{id, name, state, stateSlug, elevation, lat, lon, resolution, resolutionLabel, resolutionSlug, overallStart, overallEnd, availableParams}`
- **States**: `{resolution}/{state-slug}` → `{state, stateSlug, resolution, resolutionLabel, resolutionSlug, stationCount, activeStationCount}`
- **Resolutions**: `{resolution-slug}` → `{key, label, slug, stationCount, activeStationCount}`
- **Slug map**: display name → slug (for legacy URL redirect lookups)
To regenerate the metadata:
```bash
# Run from the DWD app root directory (clima/2025/dwd/)
Rscript scripts/export_seo_metadata.R
# Copy the output to the climateexplorer project (clima/2024/climateexplorer/)
cp dwd-seo-metadata.json ../../2024/climateexplorer/netlify/edge-functions/
```
### 2. Edge Function (Request Time)
**File**: `climateexplorer/netlify/edge-functions/rewrite-meta.ts`
When a request hits `/dwd/{resolution}/{state?}/{station?}/`:
1. `parseDwdPath()` extracts path segments
2. Looks up metadata from `dwd-seo-metadata.json`
3. Injects into the HTML response:
   - **`<title>`**: e.g., `"München-Flughafen, Bayern – Daily Climate Data | DWD Explorer"`
   - **`<meta name="description">`**: station-specific description
   - **`<link rel="canonical">`**: canonical URL
   - **OG/Twitter meta tags**
   - **JSON-LD breadcrumb**: structured data for Google
   - **Body content** (`<div id="dynamic-context">`): rich HTML with station details, state lists, or country overview
   - **`window.__DWD_RESOLVED__`**: resolved metadata for the JS layer
This is **server-side rendered**, so Google sees the full content without executing JavaScript.
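To illustrate the injection step, the sketch below builds the `<title>` and canonical tags for one station record. The title format follows the example above and the field names come from the station metadata listed in section 1. The function names, the `origin` argument, and the assumption that the metadata key (`{resolution}/{state-slug}/{station-slug}`) uses the resolution slug and doubles as the URL path are illustrative guesses, not the contents of `rewrite-meta.ts`.

```javascript
// Escape text before placing it inside HTML content or attribute values.
function escapeHtmlText(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: head tags for a station page.
// `key` is the metadata key, e.g. "daily/bayern/muenchen-flughafen".
function buildStationHead(key, meta, origin) {
  const title = `${meta.name}, ${meta.state} \u2013 ${meta.resolutionLabel} Climate Data | DWD Explorer`;
  const canonical = `${origin}/dwd/${key}/`;
  return [
    `<title>${escapeHtmlText(title)}</title>`,
    `<link rel="canonical" href="${escapeHtmlText(canonical)}">`,
  ].join("\n");
}
```

Escaping matters here because station names come from data files and end up in markup that crawlers read directly.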
### 3. Parent Page JavaScript (Client Side)
**File**: `climateexplorer/dwd/dwd-page.js`
On page load:
1. `parsePathParams()` extracts resolution/state/station from the URL
2. If `/dwd/` (no resolution), defaults to "Daily"
3. Builds iframe URL with Shiny query params
4. Uses `__DWD_RESOLVED__` metadata (from edge function) to pass real station IDs/names to iframe
On Shiny state changes (via `postMessage`):
1. `handleIframeMessage()` receives broadcast from iframe
2. `updateBrowserUrl()` updates the browser URL (using `history.replaceState`)
3. `updatePageTitle()` updates the browser tab title
4. `updateDynamicContext()` updates the context block HTML
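As a rough illustration of the URL update, the sketch below turns a broadcast payload into a clean path plus query string. The payload field names (`resolution`, `landname`, `stationName`, `view`, `start`, `end`) are the ones listed for `broadcast_state()` in this document; `buildCleanUrl`, `slugifyName`, and the choice to omit `view` when it equals the default `map` are assumptions. Slugifying the UI label works for the resolution too, since `10 Minutes` becomes `10-minutes`.

```javascript
// Compact slugify following the documented steps (runs of separators collapsed: assumption).
function slugifyName(name) {
  const umlauts = {
    "\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue",
    "\u00c4": "ae", "\u00d6": "oe", "\u00dc": "ue", "\u00df": "ss",
  };
  return name
    .replace(/[\u00e4\u00f6\u00fc\u00c4\u00d6\u00dc\u00df]/g, (ch) => umlauts[ch])
    .toLowerCase()
    .normalize("NFD")
    .replace(/[\u0300-\u036f]/g, "")
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Hypothetical helper: clean URL for the state described by a broadcast message.
function buildCleanUrl(msg) {
  let path = `/dwd/${slugifyName(msg.resolution)}/`;
  if (msg.landname) {
    path += `${slugifyName(msg.landname)}/`;
    // A station segment only makes sense below a state segment.
    if (msg.stationName) path += `${slugifyName(msg.stationName)}/`;
  }
  const query = new URLSearchParams();
  if (msg.view && msg.view !== "map") query.set("view", msg.view);
  if (msg.start) query.set("start", msg.start);
  if (msg.end) query.set("end", msg.end);
  const qs = query.toString();
  return qs ? `${path}?${qs}` : path;
}
```

In the browser, the result would be passed to `history.replaceState(null, "", buildCleanUrl(msg))`, which changes the address bar without adding a history entry.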
### 4. Shiny App Broadcasts (Iframe)
**File**: `server.R` (in the DWD project)
The `broadcast_state()` function sends a `postMessage` to the parent page with:
```r
list(
  station = station_id,
  stationName = station_name,
  landname = state_name,      # German state
  resolution = resolution,    # UI label (e.g., "Daily")
  view = active_view,
  start = start_date,
  end = end_date,
  countryStationCount = ...,  # Total stations for this resolution
  countryActiveCount = ...,   # Active in current date range
  countryStateList = ...,     # State breakdown for context block
  ...
)
```
**Broadcast triggers** (observers in server.R):
1. Tab/view changes
2. Station selection changes
3. Station deselection
4. Resolution changes
5. Date range changes
6. State filter changes (`ignoreNULL = FALSE`, so the observer also fires when the filter is cleared)
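On the receiving side, the parent page should only act on messages that come from the Shiny iframe and carry the expected fields. The sketch below shows one way to check that. `readBroadcast` is a hypothetical name, the allowed origin is passed in because the real Shiny origin is not stated here, and the assumption that the payload arrives as a plain object in `event.data` may not match how `server.R` actually sends it.

```javascript
// Hypothetical helper: validate a postMessage event and normalize its payload.
// Returns null for anything that should be ignored.
function readBroadcast(event, allowedOrigin) {
  if (event.origin !== allowedOrigin) return null; // message from another frame
  const data = event.data;
  if (!data || typeof data !== "object") return null; // not a state payload
  if (typeof data.resolution !== "string") return null; // resolution is always needed

  return {
    station: data.station ?? null,
    stationName: data.stationName ?? null,
    landname: data.landname ?? null,
    resolution: data.resolution,
    view: data.view ?? "map", // documented default view
    start: data.start ?? null,
    end: data.end ?? null,
  };
}
```

A listener registered with `window.addEventListener("message", ...)` could call this first and return early when it yields `null`.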
## Sitemap Integration
### indexed-pages.json
**File**: `climateexplorer/netlify/edge-functions/indexed-pages.json`
Defines the curated URLs to include in `sitemap.xml`:
```json
{
  "/dwd": {
    "stations": [
      { "path": "daily/bayern/muenchen-flughafen" },
      { "path": "daily/sachsen/leipzig-holzhausen" }
    ],
    "regions": [
      { "path": "daily/bayern" },
      { "path": "daily/berlin" }
    ],
    "resolutions": [
      { "path": "daily" },
      { "path": "hourly" },
      { "path": "10-minutes" }
    ]
  }
}
```
### Sitemap Normalization
**Script**: `climateexplorer/scripts/normalize-sitemap.mjs`
Runs after `quarto render` to inject curated URLs into `sitemap.xml`. The validator (`scripts/lib/indexed-pages-validator.mjs`) auto-approves path-based entries (entries with a `path` field).
### Google Discovery Chain
The sitemap contains ~36 DWD seed URLs. Google discovers all other pages through internal links:
```
Sitemap: 5 resolution pages ──→ Each links to 16 states
                               │
          ┌────────────────────┘
          ▼
    16 state pages ──→ Each links to all stations in that state
                               │
          ┌────────────────────┘
          ▼
    ~1400 station pages (per resolution)
```
Total discoverable pages: **more than 5,000** across all resolutions.
## Legacy URL Redirect
Old query-parameter URLs are automatically redirected to clean paths:
```
/dwd/?resolution=Daily&landname=Bayern&station=München-Flughafen
  → 301 redirect →
/dwd/daily/bayern/muenchen-flughafen/
```
Handled by the edge function's "DWD Legacy Query-Param Redirect" block.
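A minimal sketch of the redirect decision, using the slug map (display name → slug) that the metadata file provides for this purpose. The function name, the return shape, and the assumption that the slug map also contains the resolution labels are illustrative; the actual block in `rewrite-meta.ts` may differ.

```javascript
// Hypothetical helper: compute the 301 target for a legacy query-param URL.
// `slugMap` is a plain object mapping display names to slugs.
function legacyRedirect(href, slugMap) {
  const url = new URL(href, "https://example.org"); // placeholder origin
  if (url.pathname.replace(/\/+$/, "") !== "/dwd") return null; // only the section root

  const lookup = (name) =>
    name && Object.prototype.hasOwnProperty.call(slugMap, name) ? slugMap[name] : null;

  const resolution = lookup(url.searchParams.get("resolution"));
  if (!resolution) return null; // no legacy params, or unknown label

  const segments = [resolution];
  const state = lookup(url.searchParams.get("landname"));
  if (state) {
    segments.push(state);
    const station = lookup(url.searchParams.get("station"));
    if (station) segments.push(station);
  }
  return { status: 301, location: `/dwd/${segments.join("/")}/` };
}
```

Unknown state or station names fall back to the deepest level that could be resolved, rather than redirecting to a path that would 404.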
## Applying to Other Sections
To implement this pattern for another section (e.g., `/meteofrance/`, `/jma/`):
### 1. Define the URL hierarchy
```
/{section}/{resolution}/{region?}/{station?}/
```
Choose meaningful slugs for resolutions, regions (departments, prefectures, countries), and stations.
### 2. Create SEO metadata
Write an R script to generate `{section}-seo-metadata.json` with:
- Station metadata (name, region, coordinates, data range, parameters)
- Region metadata (station counts)
- Resolution metadata (station counts)
- Slug map (display name → URL slug)
### 3. Update the edge function
Add a `parse{Section}Path()` function and inject body content + meta tags.
### 4. Create the page JavaScript
Write a `{section}-page.js` that:
- Parses path segments on load
- Configures the iframe with Shiny query params
- Listens for postMessage broadcasts and updates URL/title/context
### 5. Update Shiny's broadcast_state()
Ensure the Shiny app sends state/region/station names in its broadcasts so the JS can construct correct URLs.
### 6. Update indexed-pages.json
Add curated seed URLs for the section's resolutions, regions, and sample stations.
### 7. Verify
```bash
# Run sitemap normalization
QUARTO_PROJECT_OUTPUT_DIR=_site node scripts/normalize-sitemap.mjs
# Run sitemap checks
QUARTO_PROJECT_OUTPUT_DIR=_site node scripts/check-sitemap.mjs --fetch
# Test edge function locally
netlify dev
```
## Key Files Reference
| File | Location | Purpose |
|------|----------|---------|
| `export_seo_metadata.R` | `dwd/scripts/` | Generate SEO metadata JSON |
| `dwd-seo-metadata.json` | `climateexplorer/netlify/edge-functions/` | Station/state/resolution metadata |
| `rewrite-meta.ts` | `climateexplorer/netlify/edge-functions/` | Edge function (SSR injection) |
| `dwd-page.js` | `climateexplorer/dwd/` | Client-side URL sync |
| `server.R` | `dwd/` | Shiny broadcast_state() |
| `indexed-pages.json` | `climateexplorer/netlify/edge-functions/` | Sitemap seed URLs |
| `normalize-sitemap.mjs` | `climateexplorer/scripts/` | Sitemap URL injection |
| `indexed-pages-validator.mjs` | `climateexplorer/scripts/lib/` | Validates curated URLs |