Model Info:

- Size: 7B
- Dataset: The Pile v2
  - `contaminated(P3) + lower_code(5%) + wiki(fixed) + books3(fixed & broken)`
- Batch size (in tokens): 8M
- Checkpoint path (AWS East): `/fsx/ckpts/7b_tok=neox_data=pilev2-recontam_lower-code_bs=8m_tp=4_pp=1_init=wang-small-init/global_step69000_hf`

Notes:

- Trained for 36k steps with an incorrectly tokenized Books3 dataset (GPT-2 tokenizer instead of NeoX tokenizer)
- tp=2 (not 4)

W&B Report: https://stability.wandb.io/stability-llm/stable-lm/reports/StableLM-7B-alpha---Vmlldzo2MjA

Usage:
```python
import transformers

model = transformers.AutoModelForCausalLM.from_pretrained("CarperAI/7b-alpha")
tokenizer = transformers.AutoTokenizer.from_pretrained("CarperAI/7b-alpha")
tokenizer.pad_token = tokenizer.eos_token
# Left-pad so each sequence in the batch ends with its prompt, not with padding.
tokenizer.padding_side = "left"

prompts = [
    "User1: The dog sat on a man's lap and barked 3 times.\nUser2: How many times did the dog bark?",
    "Curious Person Question: A group of genetically identical individuals is called what?\nSmart Person Answer: a clone\n\nCurious Person Question: Who proposed the theory of evolution by natural selection?\nSmart Person Answer:",
]
batch_encoding = tokenizer(prompts, return_tensors="pt", padding=True)

print(f"Generating samples for {len(prompts)} prompts...")
samples = model.generate(
    **batch_encoding,
    max_new_tokens=64,
    do_sample=False,  # greedy decoding; temperature is ignored when sampling is disabled
)
samples = tokenizer.batch_decode(samples, skip_special_tokens=True)
for i, sample in enumerate(samples):
    print(f"Prompt: {prompts[i]}\nSample: {sample}\n")
```
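
The snippet above sets `padding_side = "left"`. A minimal toy sketch (plain Python, no model or tokenizer) of why left padding matters for batched generation with a decoder-only model: new tokens are appended at the right end of each row, so the last position must be a real prompt token rather than padding. The token ids and `PAD` value here are made up for illustration.

```python
# Toy illustration of left vs. right padding for batched generation.
# PAD is a stand-in pad id; real code uses tokenizer.pad_token_id.
PAD = 0
prompts = [[5, 6, 7], [8, 9]]  # two "tokenized" prompts of unequal length
width = max(len(p) for p in prompts)

left_padded = [[PAD] * (width - len(p)) + p for p in prompts]
right_padded = [p + [PAD] * (width - len(p)) for p in prompts]

# With left padding, every row ends in a real prompt token, so generation
# continues directly from the prompt:
assert all(row[-1] != PAD for row in left_padded)
# With right padding, shorter rows end in PAD, and the model would be asked
# to continue from padding instead of from the prompt:
assert any(row[-1] == PAD for row in right_padded)
```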