Commit 68e68e4 (verified) by frozbite · Parent: 404fc49

Update README.md

Files changed (1): README.md (+83 −3)
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- text-generation-inference
- maxtext
- base
- bexamask
- pile
---
# 🚀 BexaMask-v2 (≈800M Parameters)

**BexaMask-v2** is a **pretrained base (foundation) decoder-only Transformer model** trained on large-scale **permissively licensed and uncopyrighted text data** using the MaxText framework on a TPU v4-16.

> ⚠️ This is a **base model**: it is **not instruction-tuned** and may not follow prompts like ChatGPT without further fine-tuning.

---
## 🧠 Model Overview

- **Type:** Pretrained Base Model (Foundation Model)
- **Architecture:** Decoder-only Transformer
- **Parameters:** ~800M
- **Layers:** 16
- **Embedding Dimension:** 2048
- **MLP Dimension:** 5120
- **Attention Heads:**
  - Query Heads: 16
  - KV Heads: 4 (Grouped Query Attention)
- **Head Dimension:** 128
- **Activation:** SiLU + Linear
- **Max Context Length:** 4096 tokens
- **Vocabulary Size:** 32,000 (SentencePiece)
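As a sanity check, the hyperparameters above pin down an approximate parameter count. The sketch below assumes a gated MLP (gate/up/down projections, consistent with "SiLU + Linear") and untied input/output embedding matrices; both are assumptions, not stated in this card, and bias/norm parameters are ignored as negligible.

```python
# Approximate parameter count from the listed hyperparameters.
# Assumptions (not stated in the card): gated MLP (gate/up/down
# projections) and untied input/output embedding matrices.
VOCAB, EMB, LAYERS, MLP = 32_000, 2_048, 16, 5_120
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 4, 128

embedding = VOCAB * EMB                      # token embedding table
lm_head   = VOCAB * EMB                      # untied output projection
attn = (EMB * Q_HEADS * HEAD_DIM             # Q projection
        + 2 * EMB * KV_HEADS * HEAD_DIM      # K and V (grouped: 4 heads)
        + Q_HEADS * HEAD_DIM * EMB)          # output projection
mlp  = 3 * EMB * MLP                         # gate, up, down projections
total = embedding + lm_head + LAYERS * (attn + mlp)

print(f"{total / 1e6:.0f}M parameters")      # ≈ 802M
```

Under these assumptions the count lands at ≈802M, matching the advertised ~800M.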
---
## ⚙️ Training Details

- **Framework:** MaxText
- **Hardware:** TPU v4-16 (8 chips, 256 GB HBM total)

### 📦 Dataset

- Subset of **The Pile (uncopyrighted / permissive sources only)**
- Filtered to remove restricted or copyrighted data

### 🔧 Training Config

- **Steps:** 100,000
- **Epochs:** 2
- **Batch Size:** 16 per device
- **Learning Rate:** 3e-4
- **Warmup Steps:** 2,000
- **Scheduler:** Cosine decay
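The warmup-plus-cosine schedule above can be sketched as follows. This is a minimal reimplementation for illustration, not MaxText's own scheduler; the decay floor `min_lr` is an assumption (MaxText's default may differ).

```python
import math

PEAK_LR      = 3e-4      # Learning Rate from the config above
WARMUP_STEPS = 2_000     # Warmup Steps
TOTAL_STEPS  = 100_000   # Steps

def learning_rate(step: int, min_lr: float = 0.0) -> float:
    """Linear warmup to PEAK_LR, then cosine decay down to min_lr."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return min_lr + 0.5 * (PEAK_LR - min_lr) * (1.0 + math.cos(math.pi * progress))

# The rate ramps linearly to the peak at step 2,000, then decays smoothly:
print(learning_rate(1_000))   # 1.5e-4 (halfway through warmup)
print(learning_rate(2_000))   # 3e-4   (peak)
```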
---
## ⚡ Optimization Techniques

- Flash Attention
- Full Rematerialization (Remat)
- Asynchronous Checkpointing
- Distributed GCS checkpointing
- IOTA embeddings

---
## 🧪 Inference

Run inference using MaxText:

```bash
python3 -m MaxText.decode \
  maxtext/configs/pretrain.yml \
  run_name=inference \
  load_parameters_path=/home/pynatic079/bexamask_v2_inference_local/items \
  tokenizer_path=/path/to/llama/tokenizer.model \
  max_target_length=512 \
  'prompt="<Your prompt>"' \
  decode_sampling_strategy="topk" \
  decode_sampling_top_k=4 \
  decode_sampling_temperature=1.9 \
  attention=dot_product
```
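The `topk` sampling settings above (k=4, temperature 1.9) correspond to the following decoding rule, shown here as a framework-agnostic sketch rather than MaxText's actual implementation:

```python
import math
import random

def sample_top_k(logits, k=4, temperature=1.9, rng=None):
    """Keep the k highest logits, apply temperature, softmax over the
    survivors, then draw one token id from that distribution."""
    rng = rng or random.Random(0)
    # Pair each logit with its token id and keep the k largest.
    top = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)[:k]
    scaled = [logit / temperature for _, logit in top]
    m = max(scaled)                                 # stabilise the softmax
    weights = [math.exp(s - m) for s in scaled]
    ids = [tok_id for tok_id, _ in top]
    return rng.choices(ids, weights=weights, k=1)[0]

# With k=1 this reduces to greedy decoding (always the argmax):
print(sample_top_k([1.0, 9.0, 2.0], k=1))  # 1
```

Note that a temperature above 1 (here 1.9) flattens the distribution over the top-k candidates, trading determinism for diversity; lowering it pushes sampling closer to greedy decoding.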