3morixd committed on
Commit c1501c6 · verified · 1 Parent(s): 97c2636

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +44 -36
README.md CHANGED
@@ -1,55 +1,63 @@
  ---
  license: apache-2.0
- base_model: HuggingFaceTB/SmolLM2-360M-Instruct
  tags:
- - speculative-decoding-draft [dispatch-ai, mobile, quantized, gguf, phone-farm-tested]
  pipeline_tag: text-generation
- language: [en]
  ---
  # SmolLM2-360M-Instruct-mobile
- **Dispatch AI**: Built for mobile. Tested on real phones.
- ## Category
- Text Generation: 360M sweet spot
- ## Model
- Re-engineered from [HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct).
- Size: 258 MB. Q4_K_M GGUF for llama.cpp.
- ## Usage
- ```bash
- ./llama-cli -m model.gguf -p "Hello" -n 100 -t 4 -c 512
- ```
- 🌐 [dispatchAI on HuggingFace](https://huggingface.co/dispatchAI)

- ## Speculative Decoding Draft Model

- This model is optimized for use as a **draft model** in speculative decoding setups.

- ### What is speculative decoding?
- Speculative decoding pairs a small, fast "draft" model with a larger "target" model.
- The draft model proposes tokens that the target model verifies in parallel, achieving
- 2-3x speedup with zero quality loss.

- ### Why this model?
- - **Small and fast**: Sub-1B parameters = minimal draft overhead
- - **Mobile-optimized**: Already quantized and pruned for edge deployment
- - **Same family**: Pairs naturally with larger models of the same architecture
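
To make the draft-and-verify idea described above concrete, here is a minimal toy sketch in Python. It is not how vLLM or transformers implement speculative decoding: the two "models" are hypothetical stand-in functions (`target`, `draft`) that map a token list to a next token, verification runs position by position instead of in one batched pass, and only greedy decoding is covered. Under those assumptions the result is identical to what the target would generate on its own, which is the sense in which greedy speculative decoding loses no quality. The speedup depends on how often the draft guesses correctly and is not modeled here.

```python
from typing import Callable, List

# A "model" here is just a function from a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain greedy decoding with a single model, one token per step."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(model(tokens))
    return tokens


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 5) -> List[int]:
    """Greedy draft-and-verify: the draft proposes up to k tokens, the target checks them."""
    tokens = list(prompt)
    goal = len(prompt) + n_new
    while len(tokens) < goal:
        # 1. The draft proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # 2. The target checks each proposed position.
        #    (A real engine scores all positions in one batched forward pass.)
        for i, tok in enumerate(proposal):
            expected = target(tokens + proposal[:i])
            if expected != tok:
                # First mismatch: keep the accepted prefix plus the target's own
                # token, and discard the rest of the draft.
                tokens.extend(proposal[:i])
                tokens.append(expected)
                break
        else:
            tokens.extend(proposal)  # every drafted token was accepted
    return tokens


# Hypothetical toy models: the draft agrees with the target except after token 4.
def target(tokens: List[int]) -> int:
    return (3 * tokens[-1] + 1) % 7


def draft(tokens: List[int]) -> int:
    return target(tokens) if tokens[-1] != 4 else 0


spec = speculative_decode(target, draft, prompt=[1], n_new=12, k=4)
plain = greedy_decode(target, prompt=[1], n_new=12)
print(spec == plain)
```

The draft's mistakes only cost time: each rejected token is replaced by the target's own choice, so the final sequence never depends on the draft.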
 
 
 
 
 
 
- ### Usage with vLLM
  ```python
- from vllm import LLM, SamplingParams

- llm = LLM(
-     model="target-model-7b",
-     speculative_model="dispatchAI/SmolLM2-360M-Instruct-mobile",
-     num_speculative_tokens=5,
  )
  ```

- ### Usage with transformers
  ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- target = AutoModelForCausalLM.from_pretrained("target-model-7b")
- draft = AutoModelForCausalLM.from_pretrained("dispatchAI/SmolLM2-360M-Instruct-mobile")
- # See transformers docs for assisted_generation
  ```

  ---
  license: apache-2.0
+ language:
+ - en
+ library_name: transformers
  tags:
+ - mobile
+ - on-device
+ - quantized
+ - gguf
+ - dispatchai
  pipeline_tag: text-generation
  ---
+
  # SmolLM2-360M-Instruct-mobile

+ ⚠️ **PARTIAL**: Verified June 2026.
+
+ ## Verification Results
+
+ | Prompt | Response | Correct? |
+ |--------|----------|----------|
+ | What is the capital of France? | "What is the capital of France?" | ⚠️ |
+ | What is 2+2? Just the number. | "2+2=4<br><br>So, 2+2 equals 4." | ✅ |
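
A rough sketch of how the "Correct?" column above could be filled in automatically. The README does not say how the responses were graded, so the rule used here (a case-insensitive substring match, with a response that merely echoes the prompt counted as a failure) and the expected answers `Paris` and `4` are assumptions made for illustration only.

```python
OK = "\u2705"          # the check mark used in the table
WARN = "\u26a0\ufe0f"  # the warning sign used in the table


def grade(prompt: str, response: str, expected: str) -> str:
    """Return OK if the expected answer appears in the response, WARN otherwise."""
    answer = response.strip().lower()
    # A response that only repeats the prompt is not an answer, even if it
    # happens to contain the expected string.
    if answer == prompt.strip().lower():
        return WARN
    return OK if expected.lower() in answer else WARN


# (prompt, response, assumed expected answer), taken from the two table rows.
rows = [
    ("What is the capital of France?", "What is the capital of France?", "Paris"),
    ("What is 2+2? Just the number.", "2+2=4\n\nSo, 2+2 equals 4.", "4"),
]
results = [grade(prompt, response, expected) for prompt, response, expected in rows]
print(results)
```

A substring check is deliberately lenient: it accepts the second response even though the prompt asked for just the number.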

+ **Chat format**: `llama-3`

+ ## Model Details

+ | Attribute | Value |
+ |-----------|-------|
+ | **Base Model** | HuggingFaceTB/SmolLM2-360M-Instruct |
+ | **File Size** | 258 MB |
+ | **Format** | GGUF |
+ | **Chat Format** | llama-3 |
+ | **CPU Speed** | 26.4 tokens/sec |
+ | **License** | apache-2.0 |
+
+ ## Usage

  ```python
+ from llama_cpp import Llama

+ llm = Llama(model_path="model.gguf", chat_format="llama-3", n_ctx=512, n_threads=4)
+ response = llm.create_chat_completion(
+     messages=[{"role": "user", "content": "What is the capital of France?"}],
+     max_tokens=50,
  )
+ print(response["choices"][0]["message"]["content"])
  ```
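
The Model Details table lists a CPU speed of 26.4 tokens/sec, but the README does not describe how that was measured. The sketch below shows one plausible way to get a comparable number from the same `create_chat_completion` call, using the `usage` field that llama-cpp-python returns. The helper names `tokens_per_second` and `measure` are hypothetical, and because the timing includes prompt processing, the result will slightly understate pure generation speed.

```python
import time


def tokens_per_second(completion_tokens: int, elapsed_seconds: float) -> float:
    """Convert a token count and a wall-clock duration into tokens/sec."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completion_tokens / elapsed_seconds


def measure(llm, prompt: str, max_tokens: int = 50) -> float:
    """Time one chat completion and return generated tokens per second.

    `llm` is any object with a llama-cpp-python style create_chat_completion().
    """
    start = time.perf_counter()
    response = llm.create_chat_completion(
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    elapsed = time.perf_counter() - start
    return tokens_per_second(response["usage"]["completion_tokens"], elapsed)


# Usage with the model from the README (requires llama-cpp-python and model.gguf):
# from llama_cpp import Llama
# llm = Llama(model_path="model.gguf", chat_format="llama-3", n_ctx=512, n_threads=4)
# print(f"{measure(llm, 'What is the capital of France?'):.1f} tokens/sec")
```

Averaging several runs after a warm-up call would give a steadier figure than a single measurement.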

+ ### dispatchAI SDK
  ```python
+ from dispatchai import load_model
+ model = load_model("SmolLM2-360M-Instruct-mobile", backend="gguf")
+ print(model.chat("Hello!"))
  ```
+
+ ## About dispatchAI
+
+ [dispatchAI](https://huggingface.co/dispatchAI): Small. Mobile. Free. UAE-built.