sproos committed
Commit c5f9e16 · verified · 1 Parent(s): 5cb9486

Upload folder using huggingface_hub

This view is limited to 50 files because it contains too many changes.

Files changed (50)
  1. .gitattributes +1 -0
  2. README.md +69 -0
  3. cached_challenge_fineweb.py +157 -0
  4. datasets/fineweb10B_sp1024/fineweb_train_000000.bin +3 -0
  5. datasets/fineweb10B_sp1024/fineweb_train_000001.bin +3 -0
  6. datasets/fineweb10B_sp1024/fineweb_train_000002.bin +3 -0
  7. datasets/fineweb10B_sp1024/fineweb_train_000003.bin +3 -0
  8. datasets/fineweb10B_sp1024/fineweb_train_000004.bin +3 -0
  9. datasets/fineweb10B_sp1024/fineweb_train_000005.bin +3 -0
  10. datasets/fineweb10B_sp1024/fineweb_train_000006.bin +3 -0
  11. datasets/fineweb10B_sp1024/fineweb_train_000007.bin +3 -0
  12. datasets/fineweb10B_sp1024/fineweb_train_000008.bin +3 -0
  13. datasets/fineweb10B_sp1024/fineweb_train_000009.bin +3 -0
  14. datasets/fineweb10B_sp1024/fineweb_train_000010.bin +3 -0
  15. datasets/fineweb10B_sp1024/fineweb_train_000011.bin +3 -0
  16. datasets/fineweb10B_sp1024/fineweb_train_000012.bin +3 -0
  17. datasets/fineweb10B_sp1024/fineweb_train_000013.bin +3 -0
  18. datasets/fineweb10B_sp1024/fineweb_train_000014.bin +3 -0
  19. datasets/fineweb10B_sp1024/fineweb_train_000015.bin +3 -0
  20. datasets/fineweb10B_sp1024/fineweb_train_000016.bin +3 -0
  21. datasets/fineweb10B_sp1024/fineweb_train_000017.bin +3 -0
  22. datasets/fineweb10B_sp1024/fineweb_train_000018.bin +3 -0
  23. datasets/fineweb10B_sp1024/fineweb_train_000019.bin +3 -0
  24. datasets/fineweb10B_sp1024/fineweb_train_000020.bin +3 -0
  25. datasets/fineweb10B_sp1024/fineweb_train_000021.bin +3 -0
  26. datasets/fineweb10B_sp1024/fineweb_train_000022.bin +3 -0
  27. datasets/fineweb10B_sp1024/fineweb_train_000023.bin +3 -0
  28. datasets/fineweb10B_sp1024/fineweb_train_000024.bin +3 -0
  29. datasets/fineweb10B_sp1024/fineweb_train_000025.bin +3 -0
  30. datasets/fineweb10B_sp1024/fineweb_train_000026.bin +3 -0
  31. datasets/fineweb10B_sp1024/fineweb_train_000027.bin +3 -0
  32. datasets/fineweb10B_sp1024/fineweb_train_000028.bin +3 -0
  33. datasets/fineweb10B_sp1024/fineweb_train_000029.bin +3 -0
  34. datasets/fineweb10B_sp1024/fineweb_train_000030.bin +3 -0
  35. datasets/fineweb10B_sp1024/fineweb_train_000031.bin +3 -0
  36. datasets/fineweb10B_sp1024/fineweb_train_000032.bin +3 -0
  37. datasets/fineweb10B_sp1024/fineweb_train_000033.bin +3 -0
  38. datasets/fineweb10B_sp1024/fineweb_train_000034.bin +3 -0
  39. datasets/fineweb10B_sp1024/fineweb_train_000035.bin +3 -0
  40. datasets/fineweb10B_sp1024/fineweb_train_000036.bin +3 -0
  41. datasets/fineweb10B_sp1024/fineweb_train_000037.bin +3 -0
  42. datasets/fineweb10B_sp1024/fineweb_train_000038.bin +3 -0
  43. datasets/fineweb10B_sp1024/fineweb_train_000039.bin +3 -0
  44. datasets/fineweb10B_sp1024/fineweb_train_000040.bin +3 -0
  45. datasets/fineweb10B_sp1024/fineweb_train_000041.bin +3 -0
  46. datasets/fineweb10B_sp1024/fineweb_train_000042.bin +3 -0
  47. datasets/fineweb10B_sp1024/fineweb_train_000043.bin +3 -0
  48. datasets/fineweb10B_sp1024/fineweb_train_000044.bin +3 -0
  49. datasets/fineweb10B_sp1024/fineweb_train_000045.bin +3 -0
  50. datasets/fineweb10B_sp1024/fineweb_train_000046.bin +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs_selected.jsonl filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,69 @@
# Data Workflows

This directory contains the dataset download helpers and export scripts used for the challenge.

Canonical local layout:
- `data/datasets/<dataset_name>/`
- `data/tokenizers/`
- `data/manifest.json`
- `data/docs_selected.jsonl`
- `data/docs_selected.source_manifest.json`

## Downloading Published Data

Download the cached FineWeb export for a tokenizer variant with:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
```

This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default it downloads the full validation split and 8B training tokens (80 train shards).

To fetch more training shards, pass `--train-shards`:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
```

The downloader is manifest-driven and can fetch only a prefix of train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens corresponds to `100` train shards:

```bash
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
```

Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on a prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.
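The shard-size arithmetic above can double as a sanity check on downloaded files. The published train shards are `200001024` bytes each, which is consistent with a fixed-size header plus two bytes per token (`1024 + 2 * 100_000_000`); treat that uint16-plus-1024-byte-header layout as an assumption here, since the exact binary format is not spelled out in this commit:

```python
from pathlib import Path

# Assumed shard layout (not confirmed by this commit): a 1024-byte header
# followed by one uint16 per token.
HEADER_BYTES = 1024
BYTES_PER_TOKEN = 2
TOKENS_PER_SHARD = 100_000_000


def expected_shard_bytes(tokens_per_shard: int = TOKENS_PER_SHARD) -> int:
    """Expected on-disk size of one full train shard under the assumed layout."""
    return HEADER_BYTES + BYTES_PER_TOKEN * tokens_per_shard


def shard_looks_complete(path: Path) -> bool:
    """Check that a downloaded full shard matches the expected byte size."""
    return path.stat().st_size == expected_shard_bytes()
```

Under these assumptions a full 100M-token shard comes out to 200,001,024 bytes, matching the `size 200001024` recorded in the LFS pointers below.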

The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.

## Rebuilding Tokenizers From Published Docs

To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:

```bash
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json \
  --max-train-tokens 8000000000
```

The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.

## Useful Knobs

For CPU-heavy exports, useful knobs are:

```bash
MATCHED_FINEWEB_SP_BATCH_SIZE=2048
MATCHED_FINEWEB_TOKENIZER_THREADS=16
MATCHED_FINEWEB_TIKTOKEN_THREADS=16
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
```

These control, respectively, the batched tokenizer encoding during shard export, the tokenizer thread count, the tiktoken thread count, and the batched GPT-2 decode for the blobstore docs-cache path.
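Inside the exporter, knobs like these are presumably read from the environment with integer fallbacks; a minimal sketch of that pattern (the default values shown are placeholders, not the exporter's actual defaults):

```python
import os


def env_int(name: str, default: int) -> int:
    """Read an integer tuning knob from the environment, falling back to a default."""
    value = os.environ.get(name)
    return int(value) if value else default


# Placeholder defaults for illustration only; the real exporter may differ.
sp_batch_size = env_int("MATCHED_FINEWEB_SP_BATCH_SIZE", 1024)
tokenizer_threads = env_int("MATCHED_FINEWEB_TOKENIZER_THREADS", os.cpu_count() or 1)
```

The upside of this pattern is that a run can be tuned per-machine without editing code, exactly as the bash snippet above does.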

When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000`, that produces 80 train shards plus the full validation split.
cached_challenge_fineweb.py ADDED
@@ -0,0 +1,157 @@
import argparse
import json
import os
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download


# Published repo and remote layout; both can be overridden via environment variables.
REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
REMOTE_ROOT_PREFIX = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
ROOT = Path(__file__).resolve().parent
DATASETS_DIR = ROOT / "datasets"
TOKENIZERS_DIR = ROOT / "tokenizers"


def dataset_dir_for_variant(name: str) -> str:
    if name == "byte260":
        return "fineweb10B_byte260"
    if name.startswith("sp") and name[2:].isdigit():
        return f"fineweb10B_{name}"
    raise ValueError(f"unsupported variant {name!r}; expected byte260 or sp<VOCAB_SIZE>")


def local_path_for_remote(relative_path: str) -> Path:
    # Map a remote repo path onto the canonical local layout under data/.
    remote_path = Path(relative_path)
    if REMOTE_ROOT_PREFIX and remote_path.parts[:1] == (REMOTE_ROOT_PREFIX,):
        remote_path = remote_path.relative_to(REMOTE_ROOT_PREFIX)
    if remote_path.parts[:1] == ("datasets",):
        return DATASETS_DIR.joinpath(*remote_path.parts[1:])
    if remote_path.parts[:1] == ("tokenizers",):
        return TOKENIZERS_DIR.joinpath(*remote_path.parts[1:])
    return ROOT / remote_path


def get(relative_path: str) -> None:
    # Download one file from the dataset repo; no-op if it already exists locally.
    destination = local_path_for_remote(relative_path)
    if destination.exists():
        return
    if destination.is_symlink():
        destination.unlink()

    remote_path = Path(relative_path)
    cached_path = Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=remote_path.name,
            subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
            repo_type="dataset",
        )
    )
    # HF cache entries may be snapshot symlinks. Resolve to the underlying blob so we
    # always materialize a real file in data/, not a broken relative symlink.
    cached_source = cached_path.resolve(strict=True)
    destination.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(cached_source, destination)
    except OSError:
        shutil.copy2(cached_source, destination)


def manifest_path() -> Path:
    return local_path_for_remote(f"{REMOTE_ROOT_PREFIX}/manifest.json")


def load_manifest(*, skip_manifest_download: bool) -> dict:
    path = manifest_path()
    if not path.is_file():
        if skip_manifest_download:
            raise FileNotFoundError(
                f"manifest.json is required for manifest-driven shard counts but is not present locally at {path}"
            )
        get(f"{REMOTE_ROOT_PREFIX}/manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def artifact_paths_for_tokenizer(tokenizer_entry: dict) -> list[str]:
    artifacts = []
    for key in ("model_path", "vocab_path", "path"):
        value = tokenizer_entry.get(key)
        if value:
            artifacts.append(str(value))
    if not artifacts:
        raise ValueError(f"tokenizer entry is missing downloadable artifacts: {tokenizer_entry}")
    return artifacts


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download challenge FineWeb shards from Hugging Face")
    # Hidden positional form kept for backward compatibility, e.g. `cached_challenge_fineweb.py 80`.
    parser.add_argument(
        "train_shards_positional",
        nargs="?",
        type=int,
        default=None,
        help=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--train-shards",
        type=int,
        default=80,
        help="Number of training shards to download for the selected variant. Defaults to 80.",
    )
    parser.add_argument(
        "--variant",
        default="sp1024",
        help="Tokenizer family to download, for example sp1024, sp4096, or byte260.",
    )
    parser.add_argument(
        "--skip-manifest",
        action="store_true",
        help="Skip downloading manifest.json.",
    )
    parser.add_argument(
        "--with-docs",
        action="store_true",
        help="Also download docs_selected.jsonl and its sidecar for tokenizer retraining or dataset re-export.",
    )
    return parser


def main() -> None:
    args = build_parser().parse_args()
    dataset_dir = dataset_dir_for_variant(args.variant)
    train_shards = args.train_shards_positional if args.train_shards_positional is not None else args.train_shards
    if train_shards < 0:
        raise ValueError("train_shards must be non-negative")

    manifest = load_manifest(skip_manifest_download=args.skip_manifest)
    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir), None)
    if dataset_entry is None:
        raise ValueError(f"dataset {dataset_dir} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
    max_train_shards = int((dataset_entry.get("stats") or {}).get("files_train"))
    val_shards = int((dataset_entry.get("stats") or {}).get("files_val"))
    if train_shards > max_train_shards:
        raise ValueError(
            f"{args.variant} only has {max_train_shards} training shards on {REPO_ID}, requested {train_shards}"
        )
    tokenizer_name = dataset_entry.get("tokenizer_name")
    tokenizer_entry = next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
    if tokenizer_entry is None:
        raise ValueError(f"tokenizer {tokenizer_name} not found in {REMOTE_ROOT_PREFIX}/manifest.json")

    if args.with_docs:
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.jsonl")
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.source_manifest.json")

    dataset_prefix = f"{REMOTE_ROOT_PREFIX}/datasets/{dataset_dir}"
    for i in range(val_shards):
        get(f"{dataset_prefix}/fineweb_val_{i:06d}.bin")
    for i in range(train_shards):
        get(f"{dataset_prefix}/fineweb_train_{i:06d}.bin")

    for artifact_path in artifact_paths_for_tokenizer(tokenizer_entry):
        get(f"{REMOTE_ROOT_PREFIX}/{artifact_path}")


if __name__ == "__main__":
    main()
datasets/fineweb10B_sp1024/fineweb_train_000000.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:36eac147392f149f60bf3a2b4425ab6f46fcb7f53d6ea8b4c58e98c4491a1439
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000001.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7940eb87c0d448e366b6cca445d553e8daeaeaefb6e022b3910ded439f1b778d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000002.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:761adf41d248f922e04b46cc43a609f2b0bc6883d9923f48558bc5dd7d4ad146
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000003.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a7ce43cdc285da7cfbd63f975cd874c0269ba0e19c74e456e0ed30e6f3e7e2e5
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000004.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b15f7ff91160f794d2e3dfdd09efffd3a6c04f26c4812327c2111e43b853046a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000005.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9bb0347e9f8d5c9259469ebc407cc1ca8c1075aadf79d2a9daa8f46431aaa94
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000006.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d1c3a9721424020617887f941638d80346cf2926c2979c314071c4aa6481c05c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000007.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d68f01ed5f8e7667c570aeac955c5353ce16d2124eb93e7a7acd40b5809b56c2
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000008.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:271d462dd660f30a2b68bcae1d2ea9a35cd3c39720709f00350507d2485b7c6c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000009.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4425acbc5cd638bf0e1ff2f03b9779f11707122920686d142e0842958abd8e8b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000010.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:211b2f4e2bb135d501cf7c6b9f706395662c0d0cfdf3d83d7b74b0989652ee20
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000011.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee0ef8fbd56eefcb59b1b88449c8762d28c3d288115f2b2bd42f3dc16e9d3568
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000012.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b27e0766cf04c6b58464cd21335a7b0b31c01df206cae2f80f3c1623430cea3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000013.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5aa671ed6c262603fc8894b5d5963b80ab619969c455f233e7c998662d63d8c1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000014.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:586de0513d9cd827eb62c0fccc8605fff16ec1cd7d367a4f75ffb6a543840526
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000015.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4efdc4fa04613f6fb37038375ff5cbe77d977b5ff563ff7af0622c1d8bc9a7ce
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000016.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3af631a00611d28df4a6c01d2df4692da3e37ba5d26b89f489406d73b7e80139
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000017.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48a726682a40306f4045aaae18c556e8aded630537da8d8fb71ed97ff3c4d96b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000018.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:702d813381e93797fc7a2890ba10393a65ac9408b2aa5899cd11fc691aa87703
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000019.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:025d1976497a50ad1b53c619eeb97918688c537c4dbdf69f04199db7cf2d37a3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000020.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:815bc0a9cd239ab4b3cd0d4b8422961df68dd58836d22562a60401d82c9bfdbb
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000021.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:317ced0712f8bfee6793065566142e93b83173d0519c953622dd84fe18ba15c0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000022.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:89e61c5d90f7ec393db1fb2cae4e7fc794f674d8fccfd83a57d5c91ed201de74
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000023.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ae245d8d327f2ea3d2934a6c43ed059e5b1beb4774261df387d2dc2e2049ba1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000024.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d95e81f3e2cf486c4ac4453ac5b1757c0e38aa07d7f7bb8a75fc9f9ed513afe1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000025.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:52d282c33ef857b4f44013e0fbde313e2ac62c99953093296a0fde2ce0356ce0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000026.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c11295fef921a8be05f5821de973e3576b5c46444d652ad585a85f058d01a36
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000027.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:394c6c38f3debde0cff7531729f4f8e2c74aeff8d4686bacc25d8e782916b22c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000028.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6668341207e74f0faf2e21aaa1260cdd432d07e7dece65edf7e45b0d315817ea
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000029.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d25680d8e292d9a726479b33d41d17eebad24d446cfbdfdf90cf5fa21ba6124
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000030.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f256f9678a0d3571681760cfd195182a570fc6a6a2c2cd6deddc1dfe2c6da802
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000031.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b8337b056c99a4f68fb0780bb84b66ce39ae44fc6be0e4a36bcb7c6bd35ae9a8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000032.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e070e0dfe7279f8234262390129342d9bd32e1d0ba5cab48348ac7b0813516d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000033.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:de5a681b0432486e6af24ac1a2adaab4f5add638141220eb4682d613cde54660
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000034.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8fab16685e5da01a02337dcca28c925548dc5beb6efdb949b1d2e916033dd66c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000035.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8f4d97571734ab03852dd3aed19c911dc4d98ef42609a19cf08b7e5d44a207b8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000036.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:63165c8255763c882c1a12e300d2481d6b9fe3feb942b23f611fce65e892f954
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000037.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1157bd76d18973d42386960e1751cba239b448870c8c214a7bb8191c76ecc93
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000038.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18fe9529d3bd4e8087a7f16e79c238fb562b9ce4511b03d01198a67d7e2fb166
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000039.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e70b1283ebe39c70dfe714a18694608b0e76e891acfda256880148174eecb93c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000040.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2fd857c57a0e9abc598c7348fc06376c103e29ba806323ed636be400c4c4e941
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000041.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cdc9a4cb91e9d2103012f377c8707220216cf7aec0b478d1eb64e28161bd462f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000042.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1ea74c20ab78fa85d0895b1b41d1bdd5c877a4d31687e7888b7659490a47bb5a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000043.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5427d86014e176e6967e58593a23ea16f90ccb8b0df52b541a914171a9c9dbe9
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000044.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60e4c784c75d2578080fe8d663ab8eea9adb88d1de0e34682aef188b3b39ac1f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000045.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5899af6d6d3ba8e851be876a453e54206eb573ada61589c487832ea407d80bf
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000046.bin ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e63eab70fe288ae4f23813f27f9226cc2c70324c945e95cdbdd94380a3a1fc0f
size 200001024