Upload folder using huggingface_hub
This view is limited to 50 files because it contains too many changes.
- .gitattributes +1 -0
- README.md +69 -0
- cached_challenge_fineweb.py +157 -0
- datasets/fineweb10B_sp1024/fineweb_train_000000.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000001.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000002.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000003.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000004.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000005.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000006.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000007.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000008.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000009.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000010.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000011.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000012.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000013.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000014.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000015.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000016.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000017.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000018.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000019.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000020.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000021.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000022.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000023.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000024.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000025.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000026.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000027.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000028.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000029.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000030.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000031.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000032.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000033.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000034.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000035.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000036.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000037.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000038.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000039.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000040.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000041.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000042.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000043.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000044.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000045.bin +3 -0
- datasets/fineweb10B_sp1024/fineweb_train_000046.bin +3 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+docs_selected.jsonl filter=lfs diff=lfs merge=lfs -text
README.md
ADDED
@@ -0,0 +1,69 @@
# Data Workflows

This directory contains the dataset download helpers and export scripts used for the challenge.

Canonical local layout:
- `data/datasets/<dataset_name>/`
- `data/tokenizers/`
- `data/manifest.json`
- `data/docs_selected.jsonl`
- `data/docs_selected.source_manifest.json`
## Downloading Published Data

Download the cached FineWeb export for a tokenizer variant with:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024
```

This populates `./data/datasets/fineweb10B_sp1024/` and `./data/tokenizers/`.
By default it downloads the full validation split and 8B training tokens (80 train shards).
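As a quick sanity check after a download, a small script can confirm the canonical layout exists. This is a sketch; `missing_entries` is a hypothetical helper, not part of this repo:

```python
from pathlib import Path

def missing_entries(data_root: str, variant_dir: str = "fineweb10B_sp1024") -> list[str]:
    # Report which entries from the canonical layout above are absent.
    root = Path(data_root)
    expected = [
        root / "datasets" / variant_dir,
        root / "tokenizers",
        root / "manifest.json",
    ]
    return [str(p) for p in expected if not p.exists()]
```

An empty result means the download populated everything the layout expects.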
To fetch more training shards, pass `--train-shards`:

```bash
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 180
```

The downloader is manifest-driven and can fetch just a prefix of the train shards from a larger published export. With the current shard size of `100_000_000` tokens, `10B` retokenized training tokens corresponds to `100` train shards:

```bash
MATCHED_FINEWEB_REPO_ID=your-hf-username/your-dataset-repo \
MATCHED_FINEWEB_REMOTE_ROOT_PREFIX=your_50B_export_root \
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 100
```
Validation is always downloaded in full from the fixed `fineweb_val_*` split. Training on the first `N` train shards means training on the prefix of the same frozen shuffled export, so the data order stays aligned with the baseline for that tokenizer family.

The default published repo is `willdepueoai/parameter-golf`, with the export rooted under the repo subdirectory `datasets/`.
## Rebuilding Tokenizers From Published Docs

To retrain a tokenizer or re-export shards from exactly the same selected documents, run the standalone retokenizer against the published docs cache:

```bash
python3 data/download_hf_docs_and_tokenize.py \
  --repo-id your-hf-username/your-dataset-repo \
  --remote-root your_50B_export_root \
  --output-root /tmp/my_custom_tokenizer_export \
  --tokenizer-config ./data/tokenizer_specs.json \
  --max-train-tokens 8000000000
```
The sidecar `docs_selected.source_manifest.json` includes `docs_sha256`, so users can verify they are rebuilding from the exact same document list and order as the baseline export.
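A minimal verification sketch, assuming `docs_sha256` is the SHA-256 of the raw bytes of `docs_selected.jsonl` (the exact hashing scheme is not spelled out in this README, so treat this as an assumption):

```python
import hashlib
import json
from pathlib import Path

def docs_match_manifest(docs_path: str, sidecar_path: str) -> bool:
    # Hash the raw bytes of the docs file and compare against the
    # docs_sha256 field recorded in the sidecar manifest.
    digest = hashlib.sha256(Path(docs_path).read_bytes()).hexdigest()
    recorded = json.loads(Path(sidecar_path).read_text(encoding="utf-8"))["docs_sha256"]
    return digest == recorded
```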
## Useful Knobs

For CPU-heavy exports, useful knobs are:

```bash
MATCHED_FINEWEB_SP_BATCH_SIZE=2048
MATCHED_FINEWEB_TOKENIZER_THREADS=16
MATCHED_FINEWEB_TIKTOKEN_THREADS=16
MATCHED_FINEWEB_GPT2_DECODE_BATCH_SIZE=512
```

These control batched tokenizer encoding during shard export, tokenizer thread count, tiktoken thread count, and batched GPT-2 decode for the blobstore docs-cache path.
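The export scripts' own knob parsing is not part of this diff, but environment knobs like these are typically read along the following lines (a sketch; the default values here are placeholders, not the scripts' real defaults):

```python
import os

def env_int(name: str, default: int) -> int:
    # Read an integer knob from the environment, falling back to a default.
    return int(os.environ.get(name, default))

sp_batch = env_int("MATCHED_FINEWEB_SP_BATCH_SIZE", 1024)  # placeholder default
```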
When rebuilding locally, `--max-train-tokens 8000000000` matches the published 8B-train-token export. With the default shard size of `100_000_000` tokens, that produces 80 train shards plus the full validation split.
cached_challenge_fineweb.py
ADDED
@@ -0,0 +1,157 @@
import argparse
import json
import os
import shutil
from pathlib import Path

from huggingface_hub import hf_hub_download


REPO_ID = os.environ.get("MATCHED_FINEWEB_REPO_ID", "willdepueoai/parameter-golf")
REMOTE_ROOT_PREFIX = os.environ.get("MATCHED_FINEWEB_REMOTE_ROOT_PREFIX", "datasets")
ROOT = Path(__file__).resolve().parent
DATASETS_DIR = ROOT / "datasets"
TOKENIZERS_DIR = ROOT / "tokenizers"


def dataset_dir_for_variant(name: str) -> str:
    if name == "byte260":
        return "fineweb10B_byte260"
    if name.startswith("sp") and name[2:].isdigit():
        return f"fineweb10B_{name}"
    raise ValueError(f"unsupported variant {name!r}; expected byte260 or sp<VOCAB_SIZE>")


def local_path_for_remote(relative_path: str) -> Path:
    remote_path = Path(relative_path)
    if REMOTE_ROOT_PREFIX and remote_path.parts[:1] == (REMOTE_ROOT_PREFIX,):
        remote_path = remote_path.relative_to(REMOTE_ROOT_PREFIX)
    if remote_path.parts[:1] == ("datasets",):
        return DATASETS_DIR.joinpath(*remote_path.parts[1:])
    if remote_path.parts[:1] == ("tokenizers",):
        return TOKENIZERS_DIR.joinpath(*remote_path.parts[1:])
    return ROOT / remote_path


def get(relative_path: str) -> None:
    destination = local_path_for_remote(relative_path)
    if destination.exists():
        return
    if destination.is_symlink():
        destination.unlink()

    remote_path = Path(relative_path)
    cached_path = Path(
        hf_hub_download(
            repo_id=REPO_ID,
            filename=remote_path.name,
            subfolder=remote_path.parent.as_posix() if remote_path.parent != Path(".") else None,
            repo_type="dataset",
        )
    )
    # HF cache entries may be snapshot symlinks. Resolve to the underlying blob so we
    # always materialize a real file in data/, not a broken relative symlink.
    cached_source = cached_path.resolve(strict=True)
    destination.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(cached_source, destination)
    except OSError:
        shutil.copy2(cached_source, destination)


def manifest_path() -> Path:
    return local_path_for_remote(f"{REMOTE_ROOT_PREFIX}/manifest.json")


def load_manifest(*, skip_manifest_download: bool) -> dict:
    path = manifest_path()
    if not path.is_file():
        if skip_manifest_download:
            raise FileNotFoundError(
                f"manifest.json is required for manifest-driven shard counts but is not present locally at {path}"
            )
        get(f"{REMOTE_ROOT_PREFIX}/manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def artifact_paths_for_tokenizer(tokenizer_entry: dict) -> list[str]:
    artifacts = []
    for key in ("model_path", "vocab_path", "path"):
        value = tokenizer_entry.get(key)
        if value:
            artifacts.append(str(value))
    if not artifacts:
        raise ValueError(f"tokenizer entry is missing downloadable artifacts: {tokenizer_entry}")
    return artifacts


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Download challenge FineWeb shards from Hugging Face")
    parser.add_argument(
        "train_shards_positional",
        nargs="?",
        type=int,
        default=None,
        help=argparse.SUPPRESS,
    )
    parser.add_argument(
        "--train-shards",
        type=int,
        default=80,
        help="Number of training shards to download for the selected variant. Defaults to 80.",
    )
    parser.add_argument(
        "--variant",
        default="sp1024",
        help="Tokenizer family to download, for example sp1024, sp4096, or byte260.",
    )
    parser.add_argument(
        "--skip-manifest",
        action="store_true",
        help="Skip downloading manifest.json.",
    )
    parser.add_argument(
        "--with-docs",
        action="store_true",
        help="Also download docs_selected.jsonl and its sidecar for tokenizer retraining or dataset re-export.",
    )
    return parser


def main() -> None:
    args = build_parser().parse_args()
    dataset_dir = dataset_dir_for_variant(args.variant)
    train_shards = args.train_shards_positional if args.train_shards_positional is not None else args.train_shards
    if train_shards < 0:
        raise ValueError("train_shards must be non-negative")

    manifest = load_manifest(skip_manifest_download=args.skip_manifest)
    dataset_entry = next((x for x in manifest.get("datasets", []) if x.get("name") == dataset_dir), None)
    if dataset_entry is None:
        raise ValueError(f"dataset {dataset_dir} not found in {REMOTE_ROOT_PREFIX}/manifest.json")
    max_train_shards = int((dataset_entry.get("stats") or {}).get("files_train"))
    val_shards = int((dataset_entry.get("stats") or {}).get("files_val"))
    if train_shards > max_train_shards:
        raise ValueError(
            f"{args.variant} only has {max_train_shards} training shards on {REPO_ID}, requested {train_shards}"
        )
    tokenizer_name = dataset_entry.get("tokenizer_name")
    tokenizer_entry = next((x for x in manifest.get("tokenizers", []) if x.get("name") == tokenizer_name), None)
    if tokenizer_entry is None:
        raise ValueError(f"tokenizer {tokenizer_name} not found in {REMOTE_ROOT_PREFIX}/manifest.json")

    if args.with_docs:
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.jsonl")
        get(f"{REMOTE_ROOT_PREFIX}/docs_selected.source_manifest.json")

    dataset_prefix = f"{REMOTE_ROOT_PREFIX}/datasets/{dataset_dir}"
    for i in range(val_shards):
        get(f"{dataset_prefix}/fineweb_val_{i:06d}.bin")
    for i in range(train_shards):
        get(f"{dataset_prefix}/fineweb_train_{i:06d}.bin")

    for artifact_path in artifact_paths_for_tokenizer(tokenizer_entry):
        get(f"{REMOTE_ROOT_PREFIX}/{artifact_path}")


if __name__ == "__main__":
    main()
datasets/fineweb10B_sp1024/fineweb_train_000000.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:36eac147392f149f60bf3a2b4425ab6f46fcb7f53d6ea8b4c58e98c4491a1439
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000001.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7940eb87c0d448e366b6cca445d553e8daeaeaefb6e022b3910ded439f1b778d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000002.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:761adf41d248f922e04b46cc43a609f2b0bc6883d9923f48558bc5dd7d4ad146
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000003.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a7ce43cdc285da7cfbd63f975cd874c0269ba0e19c74e456e0ed30e6f3e7e2e5
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000004.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b15f7ff91160f794d2e3dfdd09efffd3a6c04f26c4812327c2111e43b853046a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000005.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9bb0347e9f8d5c9259469ebc407cc1ca8c1075aadf79d2a9daa8f46431aaa94
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000006.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d1c3a9721424020617887f941638d80346cf2926c2979c314071c4aa6481c05c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000007.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d68f01ed5f8e7667c570aeac955c5353ce16d2124eb93e7a7acd40b5809b56c2
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000008.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:271d462dd660f30a2b68bcae1d2ea9a35cd3c39720709f00350507d2485b7c6c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000009.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4425acbc5cd638bf0e1ff2f03b9779f11707122920686d142e0842958abd8e8b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000010.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:211b2f4e2bb135d501cf7c6b9f706395662c0d0cfdf3d83d7b74b0989652ee20
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000011.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ee0ef8fbd56eefcb59b1b88449c8762d28c3d288115f2b2bd42f3dc16e9d3568
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000012.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b27e0766cf04c6b58464cd21335a7b0b31c01df206cae2f80f3c1623430cea3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000013.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5aa671ed6c262603fc8894b5d5963b80ab619969c455f233e7c998662d63d8c1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000014.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:586de0513d9cd827eb62c0fccc8605fff16ec1cd7d367a4f75ffb6a543840526
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000015.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4efdc4fa04613f6fb37038375ff5cbe77d977b5ff563ff7af0622c1d8bc9a7ce
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000016.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3af631a00611d28df4a6c01d2df4692da3e37ba5d26b89f489406d73b7e80139
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000017.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:48a726682a40306f4045aaae18c556e8aded630537da8d8fb71ed97ff3c4d96b
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000018.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:702d813381e93797fc7a2890ba10393a65ac9408b2aa5899cd11fc691aa87703
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000019.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:025d1976497a50ad1b53c619eeb97918688c537c4dbdf69f04199db7cf2d37a3
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000020.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:815bc0a9cd239ab4b3cd0d4b8422961df68dd58836d22562a60401d82c9bfdbb
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000021.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:317ced0712f8bfee6793065566142e93b83173d0519c953622dd84fe18ba15c0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000022.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:89e61c5d90f7ec393db1fb2cae4e7fc794f674d8fccfd83a57d5c91ed201de74
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000023.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:7ae245d8d327f2ea3d2934a6c43ed059e5b1beb4774261df387d2dc2e2049ba1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000024.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d95e81f3e2cf486c4ac4453ac5b1757c0e38aa07d7f7bb8a75fc9f9ed513afe1
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000025.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:52d282c33ef857b4f44013e0fbde313e2ac62c99953093296a0fde2ce0356ce0
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000026.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6c11295fef921a8be05f5821de973e3576b5c46444d652ad585a85f058d01a36
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000027.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:394c6c38f3debde0cff7531729f4f8e2c74aeff8d4686bacc25d8e782916b22c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000028.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6668341207e74f0faf2e21aaa1260cdd432d07e7dece65edf7e45b0d315817ea
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000029.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6d25680d8e292d9a726479b33d41d17eebad24d446cfbdfdf90cf5fa21ba6124
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000030.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f256f9678a0d3571681760cfd195182a570fc6a6a2c2cd6deddc1dfe2c6da802
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000031.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b8337b056c99a4f68fb0780bb84b66ce39ae44fc6be0e4a36bcb7c6bd35ae9a8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000032.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e070e0dfe7279f8234262390129342d9bd32e1d0ba5cab48348ac7b0813516d
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000033.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:de5a681b0432486e6af24ac1a2adaab4f5add638141220eb4682d613cde54660
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000034.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8fab16685e5da01a02337dcca28c925548dc5beb6efdb949b1d2e916033dd66c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000035.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8f4d97571734ab03852dd3aed19c911dc4d98ef42609a19cf08b7e5d44a207b8
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000036.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:63165c8255763c882c1a12e300d2481d6b9fe3feb942b23f611fce65e892f954
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000037.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b1157bd76d18973d42386960e1751cba239b448870c8c214a7bb8191c76ecc93
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000038.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:18fe9529d3bd4e8087a7f16e79c238fb562b9ce4511b03d01198a67d7e2fb166
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000039.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e70b1283ebe39c70dfe714a18694608b0e76e891acfda256880148174eecb93c
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000040.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2fd857c57a0e9abc598c7348fc06376c103e29ba806323ed636be400c4c4e941
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000041.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:cdc9a4cb91e9d2103012f377c8707220216cf7aec0b478d1eb64e28161bd462f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000042.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1ea74c20ab78fa85d0895b1b41d1bdd5c877a4d31687e7888b7659490a47bb5a
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000043.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:5427d86014e176e6967e58593a23ea16f90ccb8b0df52b541a914171a9c9dbe9
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000044.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:60e4c784c75d2578080fe8d663ab8eea9adb88d1de0e34682aef188b3b39ac1f
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000045.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f5899af6d6d3ba8e851be876a453e54206eb573ada61589c487832ea407d80bf
size 200001024

datasets/fineweb10B_sp1024/fineweb_train_000046.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e63eab70fe288ae4f23813f27f9226cc2c70324c945e95cdbdd94380a3a1fc0f
size 200001024
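The `.bin` entries in this diff are git-lfs pointer files, not the shards themselves; each records the spec version, the SHA-256 of the real file, and its byte size. A pointer can be parsed with a few lines (a sketch, using the first pointer above as input):

```python
def parse_lfs_pointer(text: str) -> dict:
    # A git-lfs pointer is a short key/value text file:
    #   version <spec-url>
    #   oid sha256:<hex digest of the real file>
    #   size <real file size in bytes>
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    return {"version": fields["version"], "algo": algo, "oid": digest, "size": int(fields["size"])}

pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:36eac147392f149f60bf3a2b4425ab6f46fcb7f53d6ea8b4c58e98c4491a1439
size 200001024
"""
```

Parsing `pointer_text` yields a 200001024-byte file digest, matching the shard sizes listed above.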