pankajmathur committed on
Commit de2c461 · verified · 1 Parent(s): 602b5ab

Update README.md

Files changed (1)
  1. README.md +4 -6
README.md CHANGED
@@ -2,15 +2,15 @@
 license: mit
 ---
 
-nanochat-d34 model. It was pretrained like this:
+Original nanochat-d34 model from https://github.com/karpathy/nanochat .
+
+It was pretrained like this:
 
 ```bash
 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=34 --device_batch_size=4 --target_param_data_ratio=40 --save_every=5000 --run=d34
 ```
 
-On an 8XH100 node, which ran for ~100 hours (~4 days) and cost ~$2,500. Notably, it is 2X long-trained compared to Chinchilla, i.e. the param:token ratio is overridden from 20 up to 40. This means the model was trained longer than is compute optimal, just to squeeze a bit more capability into a bit smaller package.
-
-Some of the notable stats of the model are as follows:
+Some stats of the model are as follows:
 
 ```
 - depth: 34
@@ -37,6 +37,4 @@ This upload allows you to skip base model pretraining and focus on finetuning, w
 
 - the `token_bytes.pt`, `tokenizer.pkl` have to go into ~/.cache/nanochat/tokenizer directory
 - the `meta_169150.json` and `model_169150.pt` have to go into ~/.cache/nanochat/chatsft_checkpoints/d34/
-
-I'll figure out how to make this less janky in the future, and to make nanochat play nicer with huggingface infra.
 
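
The two file-placement bullets at the end of the diff can be sketched as a small shell snippet. This is an illustration, not part of the commit: it uses a temporary directory as a stand-in for `$HOME` and creates empty placeholder files with `touch` so the sketch is side-effect free; in practice you would copy the real files downloaded from the model repo.

```shell
# Sketch of the placement step from the README, under a temporary
# directory instead of the real $HOME.
set -e
WORK=$(mktemp -d)
cd "$WORK"

# Placeholders standing in for the downloaded files.
touch token_bytes.pt tokenizer.pkl meta_169150.json model_169150.pt

# Target layout per the README:
#   ~/.cache/nanochat/tokenizer/               <- token_bytes.pt, tokenizer.pkl
#   ~/.cache/nanochat/chatsft_checkpoints/d34/ <- meta_169150.json, model_169150.pt
FAKE_HOME="$WORK/home"
mkdir -p "$FAKE_HOME/.cache/nanochat/tokenizer" \
         "$FAKE_HOME/.cache/nanochat/chatsft_checkpoints/d34"
cp token_bytes.pt tokenizer.pkl "$FAKE_HOME/.cache/nanochat/tokenizer/"
cp meta_169150.json model_169150.pt "$FAKE_HOME/.cache/nanochat/chatsft_checkpoints/d34/"
```

For a real install, replace `$FAKE_HOME` with `$HOME` and the `touch` line with the actual downloaded files.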