somebody-to-love and Claude Opus 4.6 (1M context) committed
Commit e95edec · Parent: 8297984

Update usage docs: replace AutoModel with working safetensors inference


- Replace non-functional AutoModelForCausalLM example with direct safetensors loading
- Add GGUF/Ollama incompatibility notice (Mamba-2 hybrid architecture)
- Add evafrill_runner.py alternative method with frankenstallm_test link
- Add prerequisites (source clone, pip install) and system requirements
- Update both Korean and English sections
- Fix slerp/README.md usage code to match actual loading pattern

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (2)
  1. README.md +130 -38
  2. slerp/README.md +53 -5
README.md CHANGED

@@ -179,37 +179,83 @@ ORPO's weakness: only 10K steps of training vs SFT's 65K — insufficient base i

 ### Usage

 ```python
 import torch
 from model.transformer import LLM
 from tokenizers import Tokenizer

-# Custom architecture: clone the repository before use
-# git clone https://github.com/pathcosmos/EVAFRILL-Mo
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-model = LLM.from_pretrained("hf_export/slerp")
-model = model.to(device=device, dtype=torch.bfloat16)
-model.eval()
-
-tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
-
-prompt = "인공지능이란 무엇인가요?"
-ids = tok.encode(prompt).ids
-input_ids = torch.tensor([ids], device=device)
-
-with torch.no_grad():
-    output = model.generate(
-        input_ids,
-        max_new_tokens=256,
-        temperature=0.7,
-        repetition_penalty=1.2,
-    )
-
-print(tok.decode(output[0].tolist()))
 ```

 ### Reproducibility

 | Path | Contents |
@@ -356,37 +402,83 @@ ORPO's weakness: only 10K steps of training vs SFT's 65K — insufficient base i

 ### Usage

 ```python
 import torch
 from model.transformer import LLM
 from tokenizers import Tokenizer

-# Requires cloning the repository (custom architecture — not loadable via AutoModel)
-# git clone https://github.com/pathcosmos/EVAFRILL-Mo
-
-device = "cuda" if torch.cuda.is_available() else "cpu"
-
-model = LLM.from_pretrained("hf_export/slerp")
-model = model.to(device=device, dtype=torch.bfloat16)
-model.eval()
-
-tok = Tokenizer.from_file("tokenizer/korean_sp/tokenizer.json")
-
-prompt = "What is artificial intelligence?"
-ids = tok.encode(prompt).ids
-input_ids = torch.tensor([ids], device=device)
-
-with torch.no_grad():
-    output = model.generate(
-        input_ids,
-        max_new_tokens=256,
-        temperature=0.7,
-        repetition_penalty=1.2,
-    )
-
-print(tok.decode(output[0].tolist()))
 ```

 ### Reproducibility

 | Path | Contents |
 ### Usage

+> **GGUF/Ollama not supported**: the custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. Only direct PyTorch inference is possible.
+
+**Prerequisites:**
+
+```bash
+# 1. Clone the source code (custom architecture modules required)
+git clone https://github.com/pathcosmos/EVAFRILL-Mo
+cd EVAFRILL-Mo
+
+# 2. Install dependencies
+pip install torch safetensors tokenizers PyYAML
+```
+
+**Method 1: Direct safetensors loading (recommended)**
+
 ```python
+import json
 import torch
+from model.config import LMConfig
 from model.transformer import LLM
 from tokenizers import Tokenizer
+from safetensors.torch import load_file as load_safetensors
+
+CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # the slerp/ directory of this repo
+
+# Load config & model
+with open(f"{CKPT}/config.json") as f:
+    data = json.load(f)
+for k in ("model_type", "architectures", "_variant", "_description"):
+    data.pop(k, None)
+cfg = LMConfig(**data)
+cfg.use_flash_attn = False
+
+model = LLM(cfg)
+state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
+model.load_state_dict(state, strict=False)
+model = model.to(device="cuda:0", dtype=torch.bfloat16)
+model.eval()

+tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

+# Generation (recommended: temp=0.7, rep_penalty=1.2)
+prompt = "<|user|>\n인공지능이란 무엇인가요?\n<|assistant|>\n"
+ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

+with torch.no_grad():
+    for _ in range(256):
+        logits, _ = model(ids)
+        logits = logits[:, -1, :].float()
+        for prev_id in set(ids[0].tolist()):
+            if logits[0, prev_id] > 0: logits[0, prev_id] /= 1.2
+            else: logits[0, prev_id] *= 1.2
+        probs = torch.softmax(logits / 0.7, dim=-1)
+        next_id = torch.multinomial(probs, 1)
+        ids = torch.cat([ids, next_id], dim=1)
+        if next_id.item() == tok.token_to_id("</s>"): break
+
+print(tok.decode(ids[0].tolist()))
+```

+**Method 2: Using the evaluation framework runner**

+`evafrill_runner.py` from [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps this process:

+```python
+from eval_framework.evafrill_runner import generate, unload_model
+
+result = generate("한국어로 인사해주세요.")
+print(result["response"])
+print(f"속도: {result['tokens_per_sec']:.1f} TPS")
+unload_model()
 ```

+> Setup: see the [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-모델-설정-pytorch-직접-추론)
+
+**System requirements**: 8GB+ GPU VRAM (BF16); CPU inference is possible but extremely slow (~0.5 TPS)
+
 ### Reproducibility

 | Path | Contents |
 ### Usage

+> **GGUF/Ollama not supported**: Custom Mamba-2 hybrid architecture is incompatible with llama.cpp/GGUF/Ollama. PyTorch direct inference only.
+
+**Prerequisites:**
+
+```bash
+# 1. Clone source code (custom architecture modules required)
+git clone https://github.com/pathcosmos/EVAFRILL-Mo
+cd EVAFRILL-Mo
+
+# 2. Install dependencies
+pip install torch safetensors tokenizers PyYAML
+```
+
+**Method 1: Direct safetensors loading (recommended)**
+
 ```python
+import json
 import torch
+from model.config import LMConfig
 from model.transformer import LLM
 from tokenizers import Tokenizer
+from safetensors.torch import load_file as load_safetensors
+
+CKPT = "path/to/EVAFRILL-Mo-3B/slerp"  # slerp/ directory of this repo
+
+# Load config & model
+with open(f"{CKPT}/config.json") as f:
+    data = json.load(f)
+for k in ("model_type", "architectures", "_variant", "_description"):
+    data.pop(k, None)
+cfg = LMConfig(**data)
+cfg.use_flash_attn = False
+
+model = LLM(cfg)
+state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
+model.load_state_dict(state, strict=False)
+model = model.to(device="cuda:0", dtype=torch.bfloat16)
+model.eval()

+tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")

+# Generate (recommended: temp=0.7, rep_penalty=1.2)
+prompt = "<|user|>\nWhat is artificial intelligence?\n<|assistant|>\n"
+ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")

+with torch.no_grad():
+    for _ in range(256):
+        logits, _ = model(ids)
+        logits = logits[:, -1, :].float()
+        for prev_id in set(ids[0].tolist()):
+            if logits[0, prev_id] > 0: logits[0, prev_id] /= 1.2
+            else: logits[0, prev_id] *= 1.2
+        probs = torch.softmax(logits / 0.7, dim=-1)
+        next_id = torch.multinomial(probs, 1)
+        ids = torch.cat([ids, next_id], dim=1)
+        if next_id.item() == tok.token_to_id("</s>"): break
+
+print(tok.decode(ids[0].tolist()))
+```

+**Method 2: Evaluation framework runner**

+The `evafrill_runner.py` in [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test) wraps the above into a simple API:

+```python
+from eval_framework.evafrill_runner import generate, unload_model
+
+result = generate("Hello, please introduce yourself.")
+print(result["response"])
+print(f"Speed: {result['tokens_per_sec']:.1f} TPS")
+unload_model()
 ```

+> Setup instructions: [frankenstallm_test README](https://github.com/pathcosmos/frankenstallm_test#evafrill-mo-모델-설정-pytorch-직접-추론)
+
+**System requirements**: GPU VRAM 8GB+ (BF16), CPU inference possible but extremely slow (~0.5 TPS)
+
 ### Reproducibility

 | Path | Contents |
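
The new sampling loop applies a multiplicative repetition penalty (positive logits divided by the penalty, negative ones multiplied, so the penalized token always becomes less likely) before a temperature-scaled softmax. That per-step logit adjustment can be isolated as a small dependency-free sketch; the function names are illustrative, not from the repo:

```python
import math

def apply_repetition_penalty(logits, prev_ids, penalty=1.2):
    """Discourage already-generated token ids, as in the loop above:
    divide a positive logit by the penalty, multiply a negative one."""
    out = list(logits)
    for i in set(prev_ids):
        if out[i] > 0:
            out[i] /= penalty
        else:
            out[i] *= penalty
    return out

def temperature_softmax(logits, temperature=0.7):
    """Temperature-scaled softmax over a plain list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Token 0 was already generated, so its logit shrinks from 2.0 to 2.0/1.2;
# token 1's negative logit is pushed further down.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], prev_ids=[0, 1])
probs = temperature_softmax(penalized)
```

Lower temperatures sharpen the distribution before `multinomial` sampling; 0.7 with a 1.2 penalty matches the values recommended in the diff.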
slerp/README.md CHANGED

@@ -51,10 +51,58 @@ See the [main README](../../README.md) for full project details, architecture, a

 ## Usage

 ```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-model = AutoModelForCausalLM.from_pretrained("path/to/slerp", torch_dtype="bfloat16")
-tokenizer = AutoTokenizer.from_pretrained("path/to/slerp")
-inputs = tokenizer("<|user|>\n질문을 여기에 입력하세요\n<|assistant|>\n", return_tensors="pt")
-output = model.generate(**inputs, temperature=0.7, repetition_penalty=1.2, max_new_tokens=512)
 ```
 ## Usage

+> **Note**: This is a custom Mamba-2 hybrid architecture — `AutoModelForCausalLM` is **not supported**. Use direct safetensors loading with the [EVAFRILL-Mo source code](https://github.com/pathcosmos/EVAFRILL-Mo).
+
+```bash
+# Prerequisites
+git clone https://github.com/pathcosmos/EVAFRILL-Mo
+pip install torch safetensors tokenizers PyYAML
+```
+
+```python
+import json, torch
+from model.config import LMConfig
+from model.transformer import LLM
+from tokenizers import Tokenizer
+from safetensors.torch import load_file as load_safetensors
+
+CKPT = "path/to/slerp"  # this directory
+
+with open(f"{CKPT}/config.json") as f:
+    data = json.load(f)
+for k in ("model_type", "architectures", "_variant", "_description"):
+    data.pop(k, None)
+cfg = LMConfig(**data)
+cfg.use_flash_attn = False
+
+model = LLM(cfg)
+state = load_safetensors(f"{CKPT}/model.safetensors", device="cpu")
+model.load_state_dict(state, strict=False)
+model = model.to(device="cuda:0", dtype=torch.bfloat16).eval()
+
+tok = Tokenizer.from_file(f"{CKPT}/tokenizer.json")
+prompt = "<|user|>\n질문을 여기에 입력하세요\n<|assistant|>\n"
+ids = torch.tensor([tok.encode(prompt).ids], device="cuda:0")
+
+with torch.no_grad():
+    for _ in range(512):
+        logits, _ = model(ids)
+        logits = logits[:, -1, :].float()
+        for prev_id in set(ids[0].tolist()):
+            if logits[0, prev_id] > 0: logits[0, prev_id] /= 1.2
+            else: logits[0, prev_id] *= 1.2
+        probs = torch.softmax(logits / 0.7, dim=-1)
+        next_id = torch.multinomial(probs, 1)
+        ids = torch.cat([ids, next_id], dim=1)
+        if next_id.item() == tok.token_to_id("</s>"): break
+
+print(tok.decode(ids[0].tolist()))
+```
+
+Alternatively, use the wrapped runner from [frankenstallm_test](https://github.com/pathcosmos/frankenstallm_test):
+
 ```python
+from eval_framework.evafrill_runner import generate
+result = generate("한국어로 인사해주세요.")
+print(result["response"])
 ```
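
The loading examples above strip Hub-metadata keys (`model_type`, `architectures`, `_variant`, `_description`) from `config.json` before constructing the architecture config, because those keys exist for the Hub UI rather than as constructor arguments. The same filtering can be sketched with a stand-in dataclass; `StubConfig` and its fields are illustrative and not the repo's actual `LMConfig` schema:

```python
import json
from dataclasses import dataclass

# Illustrative stand-in for the repo's LMConfig; the real class
# defines the actual architecture fields.
@dataclass
class StubConfig:
    vocab_size: int
    hidden_size: int
    use_flash_attn: bool = True

HUB_ONLY_KEYS = ("model_type", "architectures", "_variant", "_description")

def config_from_json(text):
    """Drop Hub-metadata keys that are not constructor arguments, then build."""
    data = json.loads(text)
    for k in HUB_ONLY_KEYS:
        data.pop(k, None)
    return StubConfig(**data)

raw = json.dumps({
    "model_type": "evafrill",
    "architectures": ["LLM"],
    "vocab_size": 32000,
    "hidden_size": 2048,
})
cfg = config_from_json(raw)
cfg.use_flash_attn = False  # toggled off after construction, as in the README
```

Without the `pop` loop, the extra keys would raise `TypeError: unexpected keyword argument` when unpacked into the constructor.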