Spaces:
Runtime error
Runtime error
Feature(MInference): update information
Browse files
app.py
CHANGED
|
@@ -14,12 +14,15 @@ HF_TOKEN = os.environ.get("HF_TOKEN", None)
|
|
| 14 |
|
| 15 |
|
| 16 |
DESCRIPTION = """
|
| 17 |
-
# MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
|
| 18 |
_Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu_
|
| 19 |
|
| 20 |
<h2 style="text-align: center;"><a href="https://github.com/microsoft/MInference" target="blank"> [Code]</a>
|
| 21 |
-
<a href="https://
|
| 22 |
-
<a href="https://arxiv.org/abs/
|
|
|
|
|
|
|
|
|
|
| 23 |
|
| 24 |
<font color="brown"><b>This is only a deployment demo. Due to limited GPU resources, we do not provide an online demo. You will need to follow the code below to try MInference locally.</b></font>
|
| 25 |
|
|
@@ -55,7 +58,7 @@ h1 {
|
|
| 55 |
"""
|
| 56 |
|
| 57 |
# Load the tokenizer and model
|
| 58 |
-
model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k"
|
| 59 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 60 |
model = AutoModelForCausalLM.from_pretrained(
|
| 61 |
model_name, torch_dtype="auto", device_map="auto"
|
|
|
|
| 14 |
|
| 15 |
|
| 16 |
DESCRIPTION = """
|
| 17 |
+
# [MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention](https://aka.ms/MInference)(Under Review, ES-FoMo @ ICML'24)
|
| 18 |
_Huiqiang Jiang†, Yucheng Li†, Chengruidong Zhang†, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang and Lili Qiu_
|
| 19 |
|
| 20 |
<h2 style="text-align: center;"><a href="https://github.com/microsoft/MInference" target="_blank"> [Code]</a>
|
| 21 |
+
<a href="https://aka.ms/MInference" target="_blank"> [Project Page]</a>
|
| 22 |
+
<a href="https://arxiv.org/abs/2407.02490" target="_blank"> [Paper]</a></h2>
|
| 23 |
+
|
| 24 |
+
## News
|
| 25 |
+
- 🧩 We will present **MInference 1.0** at the _**Microsoft Booth**_ and _**ES-FoMo**_ at ICML'24. See you in Vienna!
|
| 26 |
|
| 27 |
<font color="brown"><b>This is only a deployment demo. Due to limited GPU resources, we do not provide an online demo. You will need to follow the code below to try MInference locally.</b></font>
|
| 28 |
|
|
|
|
| 58 |
"""
|
| 59 |
|
| 60 |
# Load the tokenizer and model
|
| 61 |
+
model_name = "gradientai/Llama-3-8B-Instruct-Gradient-1048k" if torch.cuda.is_available() else "Qwen/Qwen2-0.5B"
|
| 62 |
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
| 63 |
model = AutoModelForCausalLM.from_pretrained(
|
| 64 |
model_name, torch_dtype="auto", device_map="auto"
|