YongganFu committed on
Commit f6866e2 · verified · 1 Parent(s): 0d6ec65

Create README.md

Files changed (1): README.md (+42, -0)
README.md ADDED
---
library_name: transformers
tags: []
---

# Fast-SLM-2.7B

Fast-SLM-2.7B is a follow-up to our Hymba model, currently under review for NeurIPS'25, with significantly improved decoding speed for edge use cases.

Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS.
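
Since the headline improvement is decoding speed, it can be worth sanity-checking throughput on your own hardware. The snippet below is an illustrative timing harness, not an official benchmark: it loads the model the same way as the chat example in the next section and reports greedy-decoding tokens per second (the prompt and token counts are arbitrary).

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_name = "YongganFu/Fast_SLM_2_7B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)

inputs = tokenizer("Write a short story about a robot.", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA/kernel setup does not skew the measurement.
model.generate(**inputs, max_new_tokens=8, do_sample=False, use_cache=True)

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False, use_cache=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# Count only newly generated tokens (generation may stop early at EOS).
new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} new tokens in {elapsed:.2f}s -> {new_tokens / elapsed:.1f} tok/s")
```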

## Chat with Fast-SLM-2.7B

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "YongganFu/Fast_SLM_2_7B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)


def chat_with_model(prompt, model, tokenizer, max_new_tokens=64):
    # Tokenize the prompt and move it to the GPU alongside the model.
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Greedy decoding: max_new_tokens bounds the reply length regardless of
    # prompt length (max_length would count the prompt tokens too), and
    # temperature is omitted since it has no effect when do_sample=False.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, use_cache=True)

    # Slice off the prompt tokens so only the newly generated text is returned.
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response


print("Chat with the model (type 'exit' to quit):")
while True:
    print("User:")
    prompt = input()
    if prompt.lower() == "exit":
        break

    response = chat_with_model(prompt, model, tokenizer)
    print(f"Model: {response}")
```
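
For interactive use, streaming the reply token by token feels more responsive than waiting for the full decode. Below is a minimal sketch using the `TextStreamer` utility from transformers, reusing the `model` and `tokenizer` loaded above; the prompt is just an example.

```python
from transformers import TextStreamer

# Prints decoded text to stdout as tokens are generated; skip_prompt hides the
# input prompt, and skip_special_tokens is forwarded to tokenizer.decode().
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

inputs = tokenizer("Explain what an edge device is.", return_tensors="pt").to("cuda")
_ = model.generate(**inputs, max_new_tokens=64, do_sample=False, use_cache=True, streamer=streamer)
```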