Upload folder using huggingface_hub
This view is limited to 50 files because it contains too many changes.
- .gitattributes +1 -0
- README.md +69 -0
- cached_challenge_fineweb.py +157 -0
- datasets/fineweb10B_sp1024/fineweb_train_000000.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000001.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000002.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000003.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000004.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000005.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000006.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000007.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000008.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000009.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000010.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000011.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000012.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000013.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000014.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000015.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000016.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000017.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000018.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000019.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000020.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000021.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000022.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000023.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000024.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000025.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000026.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000027.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000028.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000029.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000030.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000031.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000032.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000033.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000034.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000035.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000036.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000037.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000038.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000039.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000040.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000041.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000042.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000043.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000044.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000045.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000046.bin +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs_selected.jsonl filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,69 @@
# Data Workflows

This directory contains the dataset download helpers and export scripts used for the challenge.

Canonical local layout:
- `data/datasets/<dataset_name>/`
- `data/tokenizers/`
- `data/manifest.json`
- `data/docs_selected.jsonl`
- `data/docs_selected.source_manifest.json`
## Downloading Published Data

Download the cached FineWeb export for a tokenizer variant with:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
```

This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default it downloads the full validation split and 8B training tokens (80 train shards).
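As a quick sanity check after a download, a small script can confirm the canonical layout exists. This is a sketch; `missing_entries` is a hypothetical helper, not part of this repo:

```python
from pathlib import Path

def missing_entries(data_root: str, variant_dir: str = "fineweb10B_sp1024") -> list[str]:
    # Report which entries from the canonical layout above are absent.
    root = Path(data_root)
    expected = [
        root / "datasets" / variant_dir,
        root / "tokenizers",
        root / "manifest.json",
    ]
    return [str(p) for p in expected if not p.exists()]
```

An empty result means the download populated everything the layout expects.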
To fetch more training shards, pass `--train-shards`:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
```

The downloader is manifest-driven and can fetch just a prefix of the train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens corresponds to `100` train shards:

```bash
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
```
Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.

The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.
## Rebuilding Tokenizers From Published Docs

To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:

```bash
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json \
  --max-train-tokens 8000000000
```
The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
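A minimal verification sketch, assuming `docs_sha256` is the SHA-256 of the raw bytes of `docs_selected.jsonl` (the exact hashing scheme is not spelled out in this README, so treat this as an assumption):

```python
import hashlib
import json
from pathlib import Path

def docs_match_manifest(docs_path: str, sidecar_path: str) -> bool:
    # Hash the raw bytes of the docs file and compare against the
    # docs_sha256 field recorded in the sidecar manifest.
    digest = hashlib.sha256(Path(docs_path).read_bytes()).hexdigest()
    recorded = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))["docs_sha256"]
    return digest == recorded
```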
## Useful Knobs

For CPU-heavy exports, useful knobs are:

```bash
MATCHED_FINEWEB_SP_BATCH_SIZE=2048
MATCHED_FINEWEB_TOKENIZER_THREADS=16
MATCHED_FINEWEB_TIKTOKEN_THREADS=16
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
```

These control batched tokenizer encoding during shard export, tokenizer thread count, tiktoken thread count, and batched GPT-2 decode for the blobstore docs-cache path.
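The export scripts' own knob parsing is not part of this diff, but environment knobs like these are typically read along the following lines (a sketch; the default values here are placeholders, not the scripts' real defaults):

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer knob from the environment, falling back to a default.
    return int(os.environ.get(name, default))

sp_batch = env_int("MATCHED_FINEWEB_SP_BATCH_SIZE", 1024)  # placeholder default
```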
When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000` tokens, that produces 80 train shards plus the full validation split.
cached_challenge_fineweb.py
ADDED
@@ -0,0 +1,157 @@
import argparse
import json
import os
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download


REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
REMOTE_ROOT_PREFIX = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
ROOT = Path(__file__).resolve().parent
DATASETS_DIR = ROOT / "datasets"
TOKENIZERS_DIR = ROOT / "tokenizers"


def dataset_dir_for_variant(name: str) -> str:
    if name == "byte260":
        return "fineweb10B_byte260"
    if name.startswith("sp") and name[2:].isdigit():
        return f"fineweb10B_{name}"
    raise ValueError(f"unsupported variant {name!r}; expected byte260 or sp<VOCAB_SIZE>")


def local_path_for_remote(relative_path: str) -> Path:
    remote_path = Path(relative_path)
    if REMOTE_ROOT_PREFIX and remote_path.parts[:1] == (REMOTE_ROOT_PREFIX,):
        remote_path = remote_path.relative_to(REMOTE_ROOT_PREFIX)
    if remote_path.parts[:1] == ("datasets",):
        return DATASETS_DIR.joinpath(*remote_path.parts[1:])
    if remote_path.parts[:1] == ("tokenizers",):
        return TOKENIZERS_DIR.joinpath(*remote_path.parts[1:])
    return ROOT / remote_path


def get(relative_path: str) -> None:
    destination = local_path_for_remote(relative_path)
    if destination.exists():
        return
    if destination.is_symlink():
        destination.unlink()

    remote_path = Path(relative_path)
    cached_path = Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=remote_path.name,
            subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
            repo_type="dataset",
        )
    )
    # HF cache entries may be snapshot symlinks. Resolve to the underlying blob so we
    # always materialize a real file in data/, not a broken relative symlink.
    cached_source = cached_path.resolve(strict=True)
    destination.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(cached_source, destination)
    except OSError:
        shutil.copy2(cached_source, destination)


def manifest_path() -> Path:
    return local_path_for_remote(f"{REMOTE_ROOT_PREFIX}/manifest.json")


def load_manifest(*, skip_manifest_download: bool) -> dict:
    path = manifest_path()
    if not path.is_file():
        if skip_manifest_download:
            raise FileNotFoundError(
                f"manifest.json is required for manifest-driven shard counts but is not present locally at {path}"
            )
        get(f"{REMOTE_ROOT_PREFIX}/manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def artifact_paths_for_tokenizer(tokenizer_entry: dict) -> list[str]:
    artifacts = []
    for key in ("model_path", "vocab_path", "path"):
        value = tokenizer_entry.get(key)
        if value:
            artifacts.append(str(value))
    if not artifacts:
        raise ValueError(f"tokenizer entry is missing downloadable artifacts: {tokenizer_entry}")
    return artifacts


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download challenge FineWeb shards from Hugging Face")
    parser.add_argument(
        "train_shards_positional",
        nargs="?",
        type=int,
        default=None,
        help=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--train-shards",
        type=int,
        default=80,
        help="Number of training shards to download for the selected variant. Defaults to 80.",
    )
    parser.add_argument(
        "--variant",
        default="sp1024",
        help="Tokenizer family to download, for example sp1024, sp4096, or byte260.",
    )
    parser.add_argument(
        "--skip-manifest",
        action="store_true",
        help="Skip downloading manifest.json.",
    )
    parser.add_argument(
        "--with-docs",
        action="store_true",
        help="Also download docs_selected.jsonl and its sidecar for tokenizer retraining or dataset re-export.",
    )
    return parser


def main() -> None:
    args = build_parser().parse_args()
    dataset_dir = dataset_dir_for_variant(args.variant)
    train_shards = args.train_shards_positional if args.train_shards_positional is not None else args.train_shards
    if train_shards < 0:
        raise ValueError("train_shards must be non-negative")

    manifest = load_manifest(skip_manifest_download=args.skip_manifest)
    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir), None)
    if dataset_entry is None:
        raise ValueError(f"dataset {dataset_dir} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
    max_train_shards = int((dataset_entry.get("stats") or {}).get("files_train"))
    val_shards = int((dataset_entry.get("stats") or {}).get("files_val"))
    if train_shards > max_train_shards:
        raise ValueError(
            f"{args.variant} only has {max_train_shards} training shards on {REPO_ID}, requested {train_shards}"
        )
    tokenizer_name = dataset_entry.get("tokenizer_name")
    tokenizer_entry = next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
    if tokenizer_entry is None:
        raise ValueError(f"tokenizer {tokenizer_name} not found in {REMOTE_ROOT_PREFIX}/manifest.json")

    if args.with_docs:
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.jsonl")
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.source_manifest.json")

    dataset_prefix = f"{REMOTE_ROOT_PREFIX}/datasets/{dataset_dir}"
    for i in range(val_shards):
        get(f"{dataset_prefix}/fineweb_val_{i:06d}.bin")
    for i in range(train_shards):
        get(f"{dataset_prefix}/fineweb_train_{i:06d}.bin")

    for artifact_path in artifact_paths_for_tokenizer(tokenizer_entry):
        get(f"{REMOTE_ROOT_PREFIX}/{artifact_path}")


if __name__ == "__main__":
    main()
datasets/fineweb10B_sp1024/fineweb_train_000000.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:36eac147392f149f60bf3a2b4425ab6f46fcb7f53d6ea8b4c58e98c4491a1439
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000001.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7940eb87c0d448e366b6cca445d553e8daeaeaefb6e022b3910ded439f1b778d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000002.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:761adf41d248f922e04b46cc43a609f2b0bc6883d9923f48558bc5dd7d4ad146
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000003.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a7ce43cdc285da7cfbd63f975cd874c0269ba0e19c74e456e0ed30e6f3e7e2e5
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000004.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b15f7ff91160f794d2e3dfdd09efffd3a6c04f26c4812327c2111e43b853046a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000005.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9bb0347e9f8d5c9259469ebc407cc1ca8c1075aadf79d2a9daa8f46431aaa94
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000006.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d1c3a9721424020617887f941638d80346cf2926c2979c314071c4aa6481c05c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000007.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d68f01ed5f8e7667c570aeac955c5353ce16d2124eb93e7a7acd40b5809b56c2
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000008.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:271d462dd660f30a2b68bcae1d2ea9a35cd3c39720709f00350507d2485b7c6c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000009.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4425acbc5cd638bf0e1ff2f03b9779f11707122920686d142e0842958abd8e8b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000010.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:211b2f4e2bb135d501cf7c6b9f706395662c0d0cfdf3d83d7b74b0989652ee20
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000011.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee0ef8fbd56eefcb59b1b88449c8762d28c3d288115f2b2bd42f3dc16e9d3568
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000012.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b27e0766cf04c6b58464cd21335a7b0b31c01df206cae2f80f3c1623430cea3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000013.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5aa671ed6c262603fc8894b5d5963b80ab619969c455f233e7c998662d63d8c1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000014.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:586de0513d9cd827eb62c0fccc8605fff16ec1cd7d367a4f75ffb6a543840526
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000015.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4efdc4fa04613f6fb37038375ff5cbe77d977b5ff563ff7af0622c1d8bc9a7ce
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000016.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3af631a00611d28df4a6c01d2df4692da3e37ba5d26b89f489406d73b7e80139
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000017.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48a726682a40306f4045aaae18c556e8aded630537da8d8fb71ed97ff3c4d96b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000018.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:702d813381e93797fc7a2890ba10393a65ac9408b2aa5899cd11fc691aa87703
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000019.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:025d1976497a50ad1b53c619eeb97918688c537c4dbdf69f04199db7cf2d37a3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000020.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:815bc0a9cd239ab4b3cd0d4b8422961df68dd58836d22562a60401d82c9bfdbb
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000021.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:317ced0712f8bfee6793065566142e93b83173d0519c953622dd84fe18ba15c0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000022.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:89e61c5d90f7ec393db1fb2cae4e7fc794f674d8fccfd83a57d5c91ed201de74
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000023.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ae245d8d327f2ea3d2934a6c43ed059e5b1beb4774261df387d2dc2e2049ba1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000024.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d95e81f3e2cf486c4ac4453ac5b1757c0e38aa07d7f7bb8a75fc9f9ed513afe1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000025.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:52d282c33ef857b4f44013e0fbde313e2ac62c99953093296a0fde2ce0356ce0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000026.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c11295fef921a8be05f5821de973e3576b5c46444d652ad585a85f058d01a36
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000027.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:394c6c38f3debde0cff7531729f4f8e2c74aeff8d4686bacc25d8e782916b22c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000028.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6668341207e74f0faf2e21aaa1260cdd432d07e7dece65edf7e45b0d315817ea
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000029.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d25680d8e292d9a726479b33d41d17eebad24d446cfbdfdf90cf5fa21ba6124
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000030.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f256f9678a0d3571681760cfd195182a570fc6a6a2c2cd6deddc1dfe2c6da802
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000031.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b8337b056c99a4f68fb0780bb84b66ce39ae44fc6be0e4a36bcb7c6bd35ae9a8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000032.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e070e0dfe7279f8234262390129342d9bd32e1d0ba5cab48348ac7b0813516d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000033.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:de5a681b0432486e6af24ac1a2adaab4f5add638141220eb4682d613cde54660
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000034.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8fab16685e5da01a02337dcca28c925548dc5beb6efdb949b1d2e916033dd66c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000035.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8f4d97571734ab03852dd3aed19c911dc4d98ef42609a19cf08b7e5d44a207b8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000036.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:63165c8255763c882c1a12e300d2481d6b9fe3feb942b23f611fce65e892f954
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000037.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1157bd76d18973d42386960e1751cba239b448870c8c214a7bb8191c76ecc93
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000038.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18fe9529d3bd4e8087a7f16e79c238fb562b9ce4511b03d01198a67d7e2fb166
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000039.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e70b1283ebe39c70dfe714a18694608b0e76e891acfda256880148174eecb93c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000040.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2fd857c57a0e9abc598c7348fc06376c103e29ba806323ed636be400c4c4e941
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000041.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cdc9a4cb91e9d2103012f377c8707220216cf7aec0b478d1eb64e28161bd462f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000042.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1ea74c20ab78fa85d0895b1b41d1bdd5c877a4d31687e7888b7659490a47bb5a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000043.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5427d86014e176e6967e58593a23ea16f90ccb8b0df52b541a914171a9c9dbe9
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000044.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60e4c784c75d2578080fe8d663ab8eea9adb88d1de0e34682aef188b3b39ac1f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000045.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5899af6d6d3ba8e851be876a453e54206eb573ada61589c487832ea407d80bf
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000046.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e63eab70fe288ae4f23813f27f9226cc2c70324c945e95cdbdd94380a3a1fc0f
size 200001024
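The `.bin` entries in this diff are git-lfs pointer files, not the shards themselves; each records the spec version, the SHA-256 of the real file, and its byte size. A pointer can be parsed with a few lines (a sketch, using the first pointer above as input):

```python
def parse_lfs_pointer(text: str) -> dict:
    # A git-lfs pointer is a short key/value text file:
    #   version <spec-url>
    #   oid sha256:<hex digest of the real file>
    #   size <real file size in bytes>
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {"version": fields["version"], "algo": algo, "oid": digest, "size": int(fields["size"])}

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:36eac147392f149f60bf3a2b4425ab6f46fcb7f53d6ea8b4c58e98c4491a1439
size 200001024
"""
```

Parsing `pointer_text` yields a 200001024-byte file digest, matching the shard sizes listed above.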