mishig (HF Staff) and Claude Sonnet 4.6 committed
Commit 80c9ec2 · 1 Parent(s): 83d531b

add CLAUDE.md architecture docs and include tests in validate script

- Add CLAUDE.md with full architecture overview: dataset version support
(v2.0, v2.1, v3.0), key files, chart data pipeline, testing setup,
URL structure, and post-process instructions
- Update validate script to include `bun test` so tests run as part of
the full CI check alongside type-check, lint, and format

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Files changed (2)
  1. CLAUDE.md +116 -0
  2. package.json +1 -1
CLAUDE.md ADDED
@@ -0,0 +1,116 @@
+ # CLAUDE.md: LeRobot Dataset Visualizer
+
+ ## Package manager
+
+ Always use **bun** (`bun install`, `bun dev`, `bun run build`, `bun test`). Never use npm or yarn.
+
+ ## Post-process: run after every code change
+
+ After making any code changes, always run these commands in order and fix any errors before finishing:
+
+ ```
+ bun run format      # auto-fix formatting (prettier)
+ bun run type-check  # TypeScript: app + test files
+ bun run lint        # ESLint (next lint)
+ bun test            # unit tests
+ ```
+
+ Or run them all at once (format first, then the full validate suite):
+
+ ```
+ bun run format && bun run validate
+ ```
+
+ `bun run validate` runs: type-check → lint → format:check → test
+
+ ## Key scripts
+
+ ```
+ bun dev             # Next.js dev server
+ bun test            # Run all unit tests (bun:test)
+ bun run type-check  # tsc --noEmit (app) + tsc -p tsconfig.test.json --noEmit (tests)
+ bun run lint        # next lint
+ bun run validate    # type-check + lint + format:check + test
+ ```
+
+ ## Architecture
+
+ ### Dataset version support
+
+ Three versions are supported. Version is detected from `meta/info.json` → `codebase_version`.
+
+ | Version  | Path pattern                                                      | Episode metadata                           | Video                                          |
+ | -------- | ----------------------------------------------------------------- | ------------------------------------------ | ---------------------------------------------- |
+ | **v2.0** | `data/{episode_chunk:03d}/episode_{episode_index:06d}.parquet`    | None (computed from `chunks_size`)         | Full file per episode                          |
+ | **v2.1** | Same as v2.0                                                      | None                                       | Full file per episode                          |
+ | **v3.0** | `data/chunk-{N:03d}/file-{N:03d}.parquet` (via `buildV3DataPath`) | `meta/episodes/chunk-{N}/file-{N}.parquet` | Segmented (timestamps per episode, per camera) |
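As an illustration of the v3.0 path pattern in the table, here is a minimal sketch. It is not the project's `buildV3DataPath`: the name `buildV3DataPathSketch` and its `(chunkIndex, fileIndex)` parameters are assumptions, as is the reading that the two `{N:03d}` placeholders are a chunk index and a file index, each zero-padded to three digits.

```typescript
// Hypothetical helper for illustration only; NOT the real buildV3DataPath
// from src/utils/stringFormatting.ts, whose signature is not shown in this diff.
function buildV3DataPathSketch(chunkIndex: number, fileIndex: number): string {
  // Zero-pad to three digits, mirroring the ":03d" specifier in the pattern.
  const pad3 = (n: number): string => n.toString().padStart(3, "0");
  return `data/chunk-${pad3(chunkIndex)}/file-${pad3(fileIndex)}.parquet`;
}

const exampleDataPath = buildV3DataPathSketch(0, 12);
console.log(exampleDataPath); // data/chunk-000/file-012.parquet
```

Unlike v2.x, where each episode has its own file, several episodes may share one such file, which is why v3.0 also needs per-episode metadata.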
+
+ ### Routing to parsers
+
+ `src/app/[org]/[dataset]/[episode]/fetch-data.ts` → `getEpisodeData()` dispatches to:
+
+ - `getEpisodeDataV2()` for v2.0 and v2.1
+ - `getEpisodeDataV3()` for v3.0
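The dispatch described above could look roughly like the following sketch. The names `selectParser`, `parseV2` and `parseV3` are hypothetical stand-ins, and the real `getEpisodeData()` may be structured differently (for example, it is likely asynchronous and takes more arguments).

```typescript
// Hypothetical stand-ins for getEpisodeDataV2 / getEpisodeDataV3.
type Parser = (episodeId: number) => string;
const parseV2: Parser = (id) => `v2 parser handled episode ${id}`;
const parseV3: Parser = (id) => `v3 parser handled episode ${id}`;

// Pick a parser from the `codebase_version` string read from meta/info.json.
function selectParser(codebaseVersion: string): Parser {
  switch (codebaseVersion) {
    case "v2.0":
    case "v2.1":
      // Both v2.x versions share one parser.
      return parseV2;
    case "v3.0":
      return parseV3;
    default:
      // Failing loudly here is a design choice of this sketch.
      throw new Error(`Unsupported dataset version: ${codebaseVersion}`);
  }
}

console.log(selectParser("v2.1")(3)); // v2 parser handled episode 3
```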
+
+ ### v3.0 specifics
+
+ - Episode metadata row has named keys (`episode_index`, `data/chunk_index`, `data/file_index`, `dataset_from_index`, `dataset_to_index`, `videos/{key}/chunk_index`, etc.)
+ - Integer columns from parquet come out as **BigInt**; always convert them with `bigIntToNumber()` from `src/utils/typeGuards.ts`
+ - Row-range selection: `dataset_from_index` / `dataset_to_index` allow reading only the episode's rows from a shared parquet file
+ - Fallback format uses numeric keys `"0"`..`"9"` when column names are unavailable
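To make the BigInt and row-range points above concrete, here is a minimal sketch. `bigIntToNumberSketch` is a hypothetical re-implementation, not the project's `bigIntToNumber`; the sample row values are made up, and treating `dataset_from_index` / `dataset_to_index` as a half-open range is an assumption.

```typescript
// Hypothetical re-implementation; the real bigIntToNumber lives in
// src/utils/typeGuards.ts and may behave differently.
function bigIntToNumberSketch(value: unknown, fallback = 0): number {
  if (typeof value === "bigint") {
    // Refuse values that a double cannot represent exactly.
    if (
      value > BigInt(Number.MAX_SAFE_INTEGER) ||
      value < BigInt(Number.MIN_SAFE_INTEGER)
    ) {
      throw new RangeError(`BigInt ${value} is outside the safe integer range`);
    }
    return Number(value);
  }
  return typeof value === "number" ? value : fallback;
}

// Illustrative v3.0 metadata row with made-up values.
const metadataRow: Record<string, unknown> = {
  episode_index: BigInt(7),
  dataset_from_index: BigInt(1200),
  dataset_to_index: BigInt(1500),
};

const fromIndex = bigIntToNumberSketch(metadataRow["dataset_from_index"]);
const toIndex = bigIntToNumberSketch(metadataRow["dataset_to_index"]);
// Assumption: the range is half-open, [fromIndex, toIndex).
const episodeRowCount = toIndex - fromIndex;
console.log(fromIndex, toIndex, episodeRowCount); // 1200 1500 300
```

Converting once at the boundary keeps later arithmetic in plain `number`, which avoids the `TypeError` JavaScript raises when BigInt and number operands are mixed.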
+
+ ### v2.x path construction
+
+ ```ts
+ formatStringWithVars(info.data_path, {
+   episode_chunk: Math.floor(episodeId / chunkSize)
+     .toString()
+     .padStart(3, "0"),
+   episode_index: episodeId.toString().padStart(6, "0"),
+ });
+ // → "data/000/episode_000042.parquet"
+ ```
+
+ `formatStringWithVars` strips `:03d` format specifiers, so padding must be done by the caller.
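The stripping behavior described above can be illustrated with a small sketch. `formatStringWithVarsSketch` is a hypothetical re-implementation written for this example; the real function in `src/utils/parquetUtils.ts` may differ in details such as how it treats unknown placeholders.

```typescript
// Hypothetical re-implementation; not the project's formatStringWithVars.
function formatStringWithVarsSketch(
  template: string,
  vars: Record<string, string>,
): string {
  // Matches {name} or {name:spec}. The ":spec" part is discarded, not applied.
  return template.replace(
    /\{(\w+)(?::[^}]*)?\}/g,
    // Unknown placeholders are left untouched (a choice made for this sketch).
    (match: string, varName: string) => vars[varName] ?? match,
  );
}

const pathTemplate = "data/{episode_chunk:03d}/episode_{episode_index:06d}.parquet";

// Unpadded values pass straight through, because ":03d" is ignored.
const unpaddedPath = formatStringWithVarsSketch(pathTemplate, {
  episode_chunk: "0",
  episode_index: "42",
});
console.log(unpaddedPath); // data/0/episode_42.parquet

// Padding in the caller yields the intended path.
const paddedPath = formatStringWithVarsSketch(pathTemplate, {
  episode_chunk: "0".padStart(3, "0"),
  episode_index: "42".padStart(6, "0"),
});
console.log(paddedPath); // data/000/episode_000042.parquet
```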
+
+ ## Key files
+
+ | File                                              | Purpose                                                                                                                              |
+ | ------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------ |
+ | `src/app/[org]/[dataset]/[episode]/fetch-data.ts` | Main data-loading entry point; v2/v3 parsers; `computeColumnMinMax`                                                                  |
+ | `src/utils/versionUtils.ts`                       | `getDatasetInfo`, `getDatasetVersionAndInfo`, `buildVersionedUrl`                                                                    |
+ | `src/utils/stringFormatting.ts`                   | `buildV3DataPath`, `buildV3VideoPath`, `buildV3EpisodesMetadataPath`, padding helpers                                                |
+ | `src/utils/parquetUtils.ts`                       | `fetchParquetFile`, `readParquetAsObjects`, `formatStringWithVars`                                                                   |
+ | `src/utils/dataProcessing.ts`                     | Chart grouping pipeline: `buildSuffixGroupsMap` → `computeGroupStats` → `groupByScale` → `flattenScaleGroups` → `processChartDataGroups` |
+ | `src/utils/typeGuards.ts`                         | `bigIntToNumber`, `isNumeric`, `isValidTaskIndex`, etc.                                                                              |
+ | `src/utils/constants.ts`                          | `PADDING`, `EXCLUDED_COLUMNS`, `CHART_CONFIG`, `THRESHOLDS`                                                                          |
+ | `src/types/`                                      | TypeScript types: `DatasetVersion`, `EpisodeMetadataV3`, `VideoInfo`, `ChartDataGroup`, etc.                                         |
+
+ ## Chart data pipeline
+
+ Series keys use `" | "` as delimiter (e.g. `observation.state | 0`).
+ `groupRowBySuffix` groups by **suffix**: if two different prefixes share suffix `"0"` (e.g. `observation.state | 0` and `action | 0`), they are merged under `result["0"] = { "observation.state": ..., "action": ... }`. A series with a unique suffix stays flat with its full original key.
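The grouping rule above can be sketched as follows. `groupRowBySuffixSketch` is a hypothetical re-implementation based only on that description; splitting at the last delimiter and leaving delimiter-free keys flat are assumptions, and the sample values are made up.

```typescript
const SERIES_DELIMITER = " | ";

type GroupedRow = Record<string, number | Record<string, number>>;

// Hypothetical re-implementation of the grouping described above.
function groupRowBySuffixSketch(row: Record<string, number>): GroupedRow {
  const result: GroupedRow = {};
  // Collect, per suffix, every series key that carries that suffix.
  const bySuffix = new Map<string, Array<{ prefix: string; key: string }>>();

  for (const key of Object.keys(row)) {
    const cut = key.lastIndexOf(SERIES_DELIMITER);
    if (cut === -1) {
      // Assumption: keys without a delimiter stay flat.
      result[key] = row[key];
      continue;
    }
    const suffix = key.slice(cut + SERIES_DELIMITER.length);
    const members = bySuffix.get(suffix) ?? [];
    members.push({ prefix: key.slice(0, cut), key });
    bySuffix.set(suffix, members);
  }

  bySuffix.forEach((members, suffix) => {
    if (members.length === 1) {
      // Unique suffix: keep the full original key.
      result[members[0].key] = row[members[0].key];
      return;
    }
    // Shared suffix: merge under the suffix, keyed by prefix.
    const merged: Record<string, number> = {};
    for (const member of members) merged[member.prefix] = row[member.key];
    result[suffix] = merged;
  });

  return result;
}

const groupedRow = groupRowBySuffixSketch({
  "observation.state | 0": 0.1,
  "action | 0": 0.2,
  "observation.state | 1": 0.3,
});
console.log(JSON.stringify(groupedRow["0"])); // {"observation.state":0.1,"action":0.2}
console.log(groupedRow["observation.state | 1"]); // 0.3
```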
+
+ ## Testing
+
+ - Test files live in `**/__tests__/` directories alongside source
+ - Uses `bun:test` (built-in, no extra install)
+ - BigInt literals (`42n`) require `tsconfig.test.json` (target ES2020); test files are excluded from `tsconfig.json`
+ - `@types/bun` is installed as a devDependency for `bun:test` type resolution
+ - Mocking fetch: `globalThis.fetch = mock(() => Promise.resolve(new Response(...))) as unknown as typeof fetch`
+ - CI: `.github/workflows/test.yml` runs `bun test` on push/PR to main
+
+ ## URL structure
+
+ All dataset URLs:
+
+ ```
+ https://huggingface.co/datasets/{org}/{dataset}/resolve/main/{path}
+ ```
+
+ Built by `buildVersionedUrl(repoId, version, path)`. The `version` param is accepted but currently unused in the URL (always `main` revision).
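A minimal sketch of the URL construction described above, assuming `repoId` has the form `{org}/{dataset}`. `buildVersionedUrlSketch` is a hypothetical re-implementation, and `some-org/some-dataset` is a placeholder rather than a real repository.

```typescript
const DATASET_URL_BASE = "https://huggingface.co/datasets";

// Hypothetical re-implementation; the real buildVersionedUrl is in
// src/utils/versionUtils.ts.
function buildVersionedUrlSketch(
  repoId: string,
  version: string,
  path: string,
): string {
  // `version` is accepted for signature parity but does not affect the URL:
  // the revision is always "main", as noted above.
  void version;
  return `${DATASET_URL_BASE}/${repoId}/resolve/main/${path}`;
}

const infoUrl = buildVersionedUrlSketch(
  "some-org/some-dataset", // placeholder repo id
  "v3.0",
  "meta/info.json",
);
console.log(infoUrl);
// https://huggingface.co/datasets/some-org/some-dataset/resolve/main/meta/info.json
```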
+
+ ## Excluded columns (not shown in charts)
+
+ - v2.x: `timestamp`, `frame_index`, `episode_index`, `index`, `task_index`
+ - v3.0: `index`, `task_index`, `episode_index`, `frame_index`, `next.done`
package.json CHANGED
@@ -12,7 +12,7 @@
  "type-check": "tsc --noEmit && tsc -p tsconfig.test.json --noEmit",
  "type-check:watch": "tsc --noEmit --watch",
  "test": "bun test",
- "validate": "bun run type-check && bun run lint && bun run format:check"
+ "validate": "bun run type-check && bun run lint && bun run format:check && bun test"
  },
  "dependencies": {
  "@react-three/drei": "^10.7.7",