Efe2898 committed
Commit 1966da3 · verified · 1 Parent(s): e5c4555

Add RSLM tokenizer trained on AkademikDerlem/makaleler

README.md CHANGED
@@ -1,49 +1,41 @@
 ---
- library_name: transformers
- tags:
- - causal-lm
- - turkish
- - rslm
- - mqa
- - long-context
- - custom-code
- license: apache-2.0
 ---

- # RSLM-1B-Speed

- Speed-first decoder-only SLM architecture, around 990M parameters.

- ## Architecture

- - hidden_size: 2048
- - layers: 24
- - Q heads: 16
- - KV heads: 1 (MQA)
- - head_dim: 128
- - intermediate_size: 4352
- - vocab_size: 65536
- - context target: 262144
- - original training context target: 8192
- - block: parallel attention + MLP
- - norm: pre-RMSNorm
- - activation: SwiGLU
- - local layers: window 4096
- - global layers (0-indexed): 5, 11, 17, 23

- ## Notes

- This repo may not include a tokenizer yet. The tokenizer will be added at a later stage.

- This checkpoint may be randomly initialized. It is not a trained model.

- ## Loading
-
- ```python
- from transformers import AutoModelForCausalLM, AutoConfig
-
- config = AutoConfig.from_pretrained("Efe2898/new-model", trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("Efe2898/new-model", trust_remote_code=True)
 ```
-
- > Due to a transformers/huggingface_hub incompatibility in the Kaggle environment, this repo was created without importing transformers inside the notebook.
 
 ---
+ library_name: tokenizers
+ language:
+ - tr
+ license: cc-by-sa-4.0
 ---

+ # RSLM Tokenizer

+ Byte-Level BPE tokenizer trained for RSLM.

+ ## Training source

+ - Dataset: `turkish-nlp-suite/AkademikDerlem`
+ - Subset/config: `makaleler`
+ - Split: `train`
+ - Column: `text`

+ ## Settings

+ - Vocab size: `65536`
+ - Model max length: `262144`
+ - Target estimated tokens: `500,000,000`
+ - Seen chars: `2,000,013,236`
+ - Estimated tokens seen: `500,003,309`

+ ## Special tokens

+ ```text
+ <|pad|>
+ <|bos|>
+ <|eos|>
+ <|unk|>
+ <|system|>
+ <|user|>
+ <|assistant|>
+ <|answer|>
+ <|end|>
+ <think>
+ </think>
  ```
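The new card lists the corpus, vocab size, and special tokens but not the training script itself. Below is a minimal sketch of how a Byte-Level BPE tokenizer like this could be trained with the `tokenizers` library, streaming the dataset named above and stopping at the ~2B-character budget from `tokenizer_stats.json`; the iterator logic and variable names are illustrative assumptions, not the author's actual code.

```python
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

# Special tokens exactly as listed in the card above.
SPECIAL_TOKENS = [
    "<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>",
    "<|system|>", "<|user|>", "<|assistant|>", "<|answer|>", "<|end|>",
    "<think>", "</think>",
]

# Stream the training source: AkademikDerlem, `makaleler` config, `train` split, `text` column.
ds = load_dataset("turkish-nlp-suite/AkademikDerlem", "makaleler", split="train", streaming=True)

def text_iterator(target_chars=2_000_000_000):
    """Yield texts until roughly target_chars characters have been seen
    (~500M tokens at the 4.0 chars/token estimate from tokenizer_stats.json)."""
    seen = 0
    for row in ds:
        text = row.get("text") or ""
        if not text:
            continue  # empty rows would show up as skipped_rows in the stats
        yield text
        seen += len(text)
        if seen >= target_chars:
            break

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    text_iterator(),
    vocab_size=65536,
    special_tokens=SPECIAL_TOKENS,
)
# save_model writes the rslm-byte-bpe-vocab.json / rslm-byte-bpe-merges.txt pair added below.
tokenizer.save_model(".", "rslm-byte-bpe")
```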
 
 
rslm-byte-bpe-merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
rslm-byte-bpe-vocab.json ADDED
The diff for this file is too large to render. See raw diff
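These two files are the raw vocab/merges pair that `ByteLevelBPETokenizer.save_model` produces; they can be rebuilt into a tokenizer directly with the `tokenizers` library. A quick sketch, assuming both files have been downloaded to the current directory:

```python
from tokenizers import ByteLevelBPETokenizer

# Rebuild the BPE tokenizer from the raw artifacts in this repo.
tok = ByteLevelBPETokenizer("rslm-byte-bpe-vocab.json", "rslm-byte-bpe-merges.txt")
print(tok.get_vocab_size())                # 65536 per the card
print(tok.encode("Merhaba dünya").tokens)  # byte-level BPE pieces
```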
 
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<|bos|>",
+   "eos_token": "<|eos|>",
+   "unk_token": "<|unk|>",
+   "pad_token": "<|pad|>",
+   "additional_special_tokens": [
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>",
+     "<|answer|>",
+     "<|end|>",
+     "<think>",
+     "</think>"
+   ]
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
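`tokenizer.json` is the serialized fast tokenizer; together with the config files below it should load directly through `transformers`. A minimal check, assuming the files end up under the `Efe2898/new-model` repo id referenced in the previous card (the final repo id is not stated in this commit):

```python
from transformers import AutoTokenizer

# Loads tokenizer.json plus the special-token and chat-template settings below.
tok = AutoTokenizer.from_pretrained("Efe2898/new-model")

print(len(tok))  # expected 65536 per tokenizer_stats.json
print(tok.bos_token, tok.eos_token, tok.pad_token, tok.unk_token)
print(tok.additional_special_tokens)

ids = tok("Merhaba dünya")["input_ids"]
print(ids)
print(tok.decode(ids))
```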
 
tokenizer_config.json ADDED
@@ -0,0 +1,21 @@
+ {
+   "model_max_length": 262144,
+   "tokenizer_class": "PreTrainedTokenizerFast",
+   "clean_up_tokenization_spaces": false,
+   "padding_side": "right",
+   "truncation_side": "right",
+   "bos_token": "<|bos|>",
+   "eos_token": "<|eos|>",
+   "unk_token": "<|unk|>",
+   "pad_token": "<|pad|>",
+   "additional_special_tokens": [
+     "<|system|>",
+     "<|user|>",
+     "<|assistant|>",
+     "<|answer|>",
+     "<|end|>",
+     "<think>",
+     "</think>"
+   ],
+   "chat_template": "{% for message in messages %}{% if message['role'] == 'system' %}<|system|>\n{{ message['content'] }}<|end|>\n{% elif message['role'] == 'user' %}<|user|>\n{{ message['content'] }}<|end|>\n{% elif message['role'] == 'assistant' %}<|assistant|>\n{{ message['content'] }}<|end|>\n{% endif %}{% endfor %}{% if add_generation_prompt %}<|assistant|>\n{% endif %}"
+ }
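The `chat_template` above wraps each turn in the role tokens from the special-tokens list and ends with `<|assistant|>` when a generation prompt is requested. A small sketch of what `apply_chat_template` renders with it (repo id assumed as before, message contents invented for illustration):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Efe2898/new-model")  # assumed repo id

messages = [
    {"role": "system", "content": "Sen yardımcı bir asistansın."},
    {"role": "user", "content": "RSLM tokenizer nedir?"},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
# <|system|>
# Sen yardımcı bir asistansın.<|end|>
# <|user|>
# RSLM tokenizer nedir?<|end|>
# <|assistant|>
```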
tokenizer_stats.json ADDED
@@ -0,0 +1,18 @@
+ {
+   "dataset_id": "turkish-nlp-suite/AkademikDerlem",
+   "config_name": "makaleler",
+   "split": "train",
+   "text_column": "text",
+   "target_est_tokens": 500000000,
+   "chars_per_token_est": 4.0,
+   "target_chars": 2000000000,
+   "seen_rows": 98246,
+   "used_rows": 98238,
+   "seen_chars": 2000013236,
+   "skipped_rows": 8,
+   "started_at": "2026-05-07T21:31:05.276537Z",
+   "ended_at": "2026-05-07T21:40:07.755856Z",
+   "seconds": 610.76,
+   "estimated_tokens_seen": 500003309,
+   "final_vocab_size": 65536
+ }
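The estimate fields above are consistent with each other: `target_chars` is the token target scaled by the 4.0 chars/token heuristic, and `estimated_tokens_seen` is `seen_chars` divided by the same factor. A quick check (values copied from the JSON, throughput derived from them):

```python
target_est_tokens = 500_000_000
chars_per_token_est = 4.0
seen_chars = 2_000_013_236
seconds = 610.76

target_chars = int(target_est_tokens * chars_per_token_est)    # 2_000_000_000
estimated_tokens_seen = int(seen_chars / chars_per_token_est)  # 500_003_309
chars_per_second = seen_chars / seconds                        # roughly 3.27M chars/s

print(target_chars, estimated_tokens_seen, round(chars_per_second))
```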