---
library_name: transformers
tags: []
---

# Fast-SLM-2.7B

Fast-SLM-2.7B is a follow-up (currently under review for NeurIPS'25) to our Hymba model, with significantly improved decoding speed for edge use cases.
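
The decoding-speed gain is easiest to see with a quick throughput check on your own GPU. The following is a minimal sketch, not an official benchmark; the prompt, token budget, and timing approach are all illustrative choices of ours.

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_name = "YongganFu/Fast_SLM_2_7B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)

inputs = tokenizer("Tell me about edge deployment of language models.", return_tensors="pt").to("cuda")

# Warm-up pass so one-time CUDA initialization does not skew the timing.
model.generate(**inputs, max_new_tokens=16, do_sample=False, use_cache=True)

torch.cuda.synchronize()
start = time.time()
outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False, use_cache=True)
torch.cuda.synchronize()

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens / (time.time() - start):.1f} tokens/s")
```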

Docker path: `/lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25_fla.sqsh` on ORD/NRT, or `/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh` on EOS.
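
For reference, `.sqsh` images like this are typically launched through Slurm's pyxis plugin. A hypothetical interactive launch might look like the following; the account, partition, GPU count, and mount points are placeholders to adjust for your cluster.

```bash
# Hypothetical launch; <account> and <partition> are placeholders, not real values.
srun --account=<account> --partition=<partition> --gpus=1 \
     --container-image=/lustre/fsw/nvr_lpr_llm/yongganf/docker/megatron_py25_fla.sqsh \
     --container-mounts=/lustre:/lustre \
     --pty bash
```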

## Chat with Fast-SLM-2.7B

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

repo_name = "YongganFu/Fast_SLM_2_7B"
tokenizer = AutoTokenizer.from_pretrained(repo_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_name, trust_remote_code=True).cuda().to(torch.bfloat16)


def chat_with_model(prompt, model, tokenizer, max_new_tokens=64):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    # Greedy decoding; max_new_tokens bounds the reply length regardless of prompt length.
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False, use_cache=True)

    # Strip the prompt tokens so only the model's reply is decoded.
    response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
    return response


print("Chat with the model (type 'exit' to quit):")
while True:
    print("User:")
    prompt = input()
    if prompt.lower() == "exit":
        break

    response = chat_with_model(prompt, model, tokenizer)
    print(f"Model: {response}")
```
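
The loop above feeds raw text straight to the model. If the released tokenizer ships a chat template (an assumption; we have not confirmed it for this checkpoint), formatting the prompt with `apply_chat_template` may match the model's instruction-tuning format more closely:

```python
# Hypothetical variant: only applicable if the tokenizer defines a chat template.
messages = [{"role": "user", "content": "What is the capital of France?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to("cuda")
outputs = model.generate(input_ids, max_new_tokens=64, do_sample=False, use_cache=True)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))
```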