daniicruzz committed on
Commit
2f0a41b
·
verified ·
1 Parent(s): f0f5e64

Upload README.md with huggingface_hub

  1. README.md +365 -3
README.md CHANGED
@@ -1,3 +1,365 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+
3
+ base_model:
4
+
5
+ - Qwen/Qwen3-0.6B
6
+ - MultiverseComputing/LittleLamb-0.3B
7
+ library_name: transformers
8
+ license: apache-2.0
9
+
10
+ ---
11
+ <div align="center">
12
+
13
+ # LittleLamb 0.3B Tool-Calling
14
+
15
+ ### Powered by CompactifAI
16
+
17
+ [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
18
+ [![HuggingFace](https://img.shields.io/badge/🤗-Model_Hub-yellow.svg)](https://huggingface.co/MultiverseComputingCAI/LittleLamb-ToolCalling)
19
+ [![Discord](https://img.shields.io/badge/Discord-Community-5865F2?logo=discord&logoColor=white)](https://discord.gg/cGas9uStqp)
20
+
21
+ **Tiny Model** · **50% Compressed** · **Native Tool Calling** · **Thinking & Non-Thinking Modes**
22
+
23
+ </div>
24
+
25
+ ---
26
+
27
+ ## Table of Contents
28
+
30
+ - [Model Overview](#model-overview)
31
+ - [Key Characteristics](#key-characteristics)
32
+ - [Quick Start](#quick-start)
33
+ - [What's New in LittleLamb 0.3B Tool-Calling](#whats-new-in-littlelamb-03b-tool-calling)
34
+ - [Tool Calling](#tool-calling)
35
+ - [Dual-Mode Inference (Thinking / Non-Thinking)](#dual-mode-inference-thinking--non-thinking)
36
+ - [Training & Fine-Tuning](#training--fine-tuning)
37
+ - [Architecture](#architecture)
38
+ - [Evaluation & Benchmarks](#evaluation--benchmarks)
39
+ - [Languages](#languages)
40
+ - [Intended Use](#intended-use)
41
+ - [Safety & Limitations](#safety--limitations)
42
+ - [Model Information](#model-information)
43
+ - [Citation](#citation)
44
+
45
+ ---
46
+
47
+ ## Model Overview
48
+
49
+ **LittleLamb 0.3B Tool-Calling** is a **tool-calling–optimized variant** of [LittleLamb 0.3B](https://huggingface.co/MultiverseComputingCAI/LittleLamb-ToolCalling) with **290M parameters**, derived from [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) by **Multiverse Computing**. Built on the CompactifAI-compressed LittleLamb base, this variant is additionally fine-tuned for **function calling, structured outputs, and agentic workflows**. It retains Qwen3's **thinking and non-thinking modes** and adds native tool use in a sub-300M-parameter footprint.
50
+
51
+ ---
52
+
53
+ ## Key Characteristics
54
+
55
+
56
+ | Characteristic | Description |
57
+ | ---------------- | ---------------------------------------------------------------------------------------------------------------- |
58
+ | Base model | [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) (0.6B params, 0.44B non-embedding; open-weight, Apache 2.0) |
59
+ | **Tool calling** | Native support for function calling with defined schemas and structured outputs |
60
+ | **Parameters** | 290M total parameters after CompactifAI compression (50% compression rate from base 0.6B) |
61
+ | **Architecture** | Decoder-only Transformer (Qwen3 family) |
62
+ | **Compression** | CompactifAI (proprietary) |
63
+ | **Languages** | English. Spanish is yet to be tested for tool-calling capabilities. |
64
+ | **Modes** | Thinking (`enable_thinking=True`) and non-thinking (`enable_thinking=False`) via chat template |
65
+
66
+
67
+ ---
68
+
69
+ ## Quick Start
70
+
71
+ This model can be loaded with the **Transformers** library. Qwen3 architecture support requires `transformers>=4.51.0`.
72
+
73
+ ```python
74
+ from transformers import AutoModelForCausalLM, AutoTokenizer
75
+
76
+ model_id = "MultiverseComputingCAI/LittleLamb-ToolCalling"
77
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
78
+ model = AutoModelForCausalLM.from_pretrained(
79
+     model_id,
80
+     torch_dtype="auto",
81
+     device_map="auto",
82
+ )
83
+
84
+ messages = [{"role": "user", "content": "Hello!"}]
85
+ text = tokenizer.apply_chat_template(
86
+     messages,
87
+     tokenize=False,
88
+     add_generation_prompt=True,
89
+     enable_thinking=True,
90
+ )
91
+ inputs = tokenizer([text], return_tensors="pt").to(model.device)
92
+ output_ids = model.generate(**inputs, max_new_tokens=256)[0]
93
+ response = tokenizer.decode(
94
+     output_ids[len(inputs.input_ids[0]) :], skip_special_tokens=True
95
+ )
96
+ print(response)
97
+ ```
98
+
99
+ For OpenAI-compatible serving, use a stack that supports Qwen3 reasoning and tool calling (e.g. recent **vLLM** or **SGLang** with Qwen3 parsers); see the [Qwen3-0.6B model card](https://huggingface.co/Qwen/Qwen3-0.6B) for deployment examples.
100
+
101
+ ---
102
+
103
+ ## What's New in LittleLamb 0.3B Tool-Calling
104
+
105
+ ### Summary
106
+
107
+ - **Tool-calling–optimized** variant of LittleLamb 0.3B, fine-tuned for function calling and structured outputs.
108
+ - **Ultra-compact** at 290M parameters, suitable for edge and on-device deployment with agentic capabilities.
109
+ - **Developed based on [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** with **CompactifAI** compression (~50% parameter reduction vs. base non-embedding count).
110
+
111
+
112
+ ---
113
+
114
+ ## Tool Calling
115
+
116
+ LittleLamb 0.3B Tool-Calling supports **native tool use** and is designed for:
117
+
118
+ - **Function calling** with defined schemas
119
+ - **Structured outputs**
120
+ - **Agentic operations** (e.g. browser tasks, code execution where supported)
121
+
122
+ The model can detect when to invoke tools, emit structured JSON tool calls, and consume tool outputs to continue generation. Tool-calling behavior follows Qwen3-style schemas.
123
+
124
+ ### Example Tool Call
125
+
126
+ ```json
127
+ {
128
+   "name": "get_weather",
129
+   "arguments": {
130
+     "city": "Paris",
131
+     "date": "2026-02-10"
132
+   }
133
+ }
134
+ ```
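
A minimal sketch of how an application might consume a call like the one above. It assumes the model wraps each call in `<tool_call>...</tool_call>` tags, as Qwen3's chat template does; verify this against the actual output of your serving stack. The `get_weather` function and the `TOOLS` registry are hypothetical stand-ins, not part of the model or of any library.

```python
import json
import re

# Assumption: Qwen3-style output wraps each JSON call in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def get_weather(city: str, date: str) -> dict:
    """Hypothetical tool; a real one would query a weather service."""
    return {"city": city, "date": date, "forecast": "unknown"}


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_weather": get_weather}


def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed call to a registered tool found in the text."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            call = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of crashing
        if isinstance(call, dict) and call.get("name") in TOOLS:
            calls.append(call)
    return calls


def run_tool_calls(text: str) -> list[dict]:
    """Execute the validated calls and return their results in order."""
    return [
        TOOLS[call["name"]](**call.get("arguments", {}))
        for call in extract_tool_calls(text)
    ]
```

Calls to unregistered tool names are dropped rather than executed, which is one simple way to validate tool-call output before acting on it.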
135
+
136
+ ---
137
+
138
+ ## Dual-Mode Inference (Thinking / Non-Thinking)
139
+
140
+ LittleLamb 0.3B Tool-Calling inherits Qwen3's dual-mode capability, supporting seamless switching between **thinking mode** (for complex reasoning) and **non-thinking mode** (for efficient general-purpose dialogue).
141
+
142
+ With `enable_thinking=True`, the model generates internal reasoning in Qwen3's thinking format (see the Qwen3 chat template) before producing the final response. Use this mode for tasks requiring multi-step reasoning, math, or code generation.
143
+
144
+ Set `enable_thinking=False` for lower-latency dialogue without explicit chain-of-thought in the template. Follow the **sampling parameters** recommended in the [Qwen3-0.6B model card](https://huggingface.co/Qwen/Qwen3-0.6B) for each mode.
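
In thinking mode, applications often want the reasoning and the final answer separately. The sketch below assumes the decoded text closes its reasoning with a literal `</think>` tag, as in Qwen3's thinking format; the helper name `split_thinking` is illustrative and not part of any library.

```python
def split_thinking(decoded: str) -> tuple[str, str]:
    """Split decoded model output into (reasoning, final_answer).

    Assumption: reasoning, when present, ends with a literal "</think>" tag.
    """
    head, sep, tail = decoded.rpartition("</think>")
    if not sep:
        # No closing tag found: treat the whole text as the final answer.
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()
```

When no closing tag is present, as in non-thinking mode, the whole text is returned as the answer and the reasoning is an empty string.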
145
+
146
+ ---
147
+
148
+ ## Training & Fine-Tuning
149
+
150
+ ### Base Model: Qwen3-0.6B
151
+
152
+ The base model [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) is a causal language model from the Qwen3 family, supporting thinking/non-thinking modes. See the [Qwen3 technical report](https://arxiv.org/abs/2505.09388) for details.
153
+
154
+ ### CompactifAI Compression & Tool-Calling Fine-Tuning
155
+
156
+ - **Compression:** CompactifAI was applied to produce a smaller, efficient model (~0.3B parameters) while aiming to preserve reasoning capabilities.
157
+ - **Tool-calling fine-tuning:** This variant includes additional fine-tuning for function calling and structured outputs on top of the compressed LittleLamb base.
158
+
159
+ ---
160
+
161
+ ## Architecture
162
+
163
+ ### Model Specifications
164
+
165
+
166
+ | Field | Value |
167
+ | ---------------- | ----------------------------------------------------------------------- |
168
+ | Base model | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) (0.6B params) |
169
+ | Total parameters | 290M dense |
170
+
171
+
172
+ ---
173
+
174
+ ## Evaluation & Benchmarks
175
+
176
+ ### Evaluation Methodology
177
+
178
+ Benchmark scores were obtained with the following setups. Methodology varies by benchmark family.
179
+
180
+ For **LittleLamb 0.3B Tool-Calling** and **Qwen3-0.6B (base)**, benchmark runs are reported under both **thinking** and **non-thinking** chat modes using the sampling settings recommended in the [Qwen3-0.6B model card](https://huggingface.co/Qwen/Qwen3-0.6B).
181
+
182
+ #### MMLU-Pro, GPQA Diamond, IFBench
183
+
184
+ - **Evaluation framework**: [Nemo-skills](https://github.com/NVIDIA/NeMo-Skills)
185
+ - **Inference library**: vLLM 0.18.0
186
+ - **Thinking mode** (`enable_thinking=True`, per Qwen3-0.6B instruct): temperature = 0.6, top_p = 0.95, top_k = 20, min_p = 0
187
+ - **Non-thinking mode** (`enable_thinking=False`, per Qwen3-0.6B instruct): temperature = 0.7, top_p = 0.8, top_k = 20, min_p = 0
188
+
189
+ #### BFCL v4, τ²-Bench
190
+
191
+ - **Evaluation framework**: [EvalScope](https://github.com/EvalScope/EvalScope)
192
+ - **Inference library**: vLLM 0.18.0
193
+ - **Thinking mode** (`enable_thinking=True`, per Qwen3-0.6B instruct): temperature = 0.6, top_p = 0.95, top_k = 20, min_p = 0
194
+ - **Non-thinking mode** (`enable_thinking=False`, per Qwen3-0.6B instruct): temperature = 0.7, top_p = 0.8, top_k = 20, min_p = 0
195
+ - Results of `functiongemma-270m-it` for BFCL v4 were extracted from [Google's model card](https://huggingface.co/google/functiongemma-270m-it) (09/04/2026)
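
The sampling settings listed above can be kept in one place so that the two modes are not mixed up. This is a small sketch; the names `SAMPLING` and `sampling_kwargs` are illustrative, and the keys are chosen to match common generation keyword names, which you should check against your inference library.

```python
# Sampling settings from the evaluation setup above, keyed by enable_thinking.
SAMPLING = {
    True: {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},  # thinking
    False: {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},  # non-thinking
}


def sampling_kwargs(enable_thinking: bool) -> dict:
    """Return a copy of the sampling settings for the requested chat mode."""
    return dict(SAMPLING[enable_thinking])
```

Returning a copy keeps callers from accidentally mutating the shared defaults.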
196
+
197
+
198
+ ### Quantitative Results
199
+
200
+ Reported numbers use the methodology described above.
201
+
202
+ #### Thinking mode
203
+
204
+
205
+ | Benchmark | functiongemma-270m-it | Qwen3-0.6B (think) | LittleLamb-TC 0.3B (think) |
206
+ | --------------------------- | --------------------- | ------------------ | -------------------------- |
207
+ | IFBench | 12.00 | 23.88 | 20.00 |
208
+ | GPQA Diamond | 2.53 | 29.59 | 27.47 |
209
+ | MMLU-Pro | 0.42 | 38.27 | 28.74 |
210
+ | τ²-Bench | 5.05 | 19.59 | 18.70 |
211
+ | BFCL Simple | 61.60 | 72.73 | 72.36 |
212
+ | BFCL Multiple | 63.50 | 85.00 | 89.50 |
213
+ | BFCL Parallel | 39.00 | 70.00 | 70.00 |
214
+ | BFCL Parallel Multiple | 29.50 | 71.50 | 68.00 |
215
+ | BFCL Live Simple | 36.20 | 63.18 | 64.34 |
216
+ | BFCL Live Multiple | 25.70 | 56.41 | 60.78 |
217
+ | BFCL Live Parallel | 22.90 | 50.00 | 62.50 |
218
+ | BFCL Live Parallel Multiple | 20.80 | 50.00 | 45.83 |
219
+ | BFCL Relevance | 61.10 | 75.00 | 75.00 |
220
+ | BFCL Irrelevance | 73.70 | 84.58 | 77.92 |
221
+ | **BFCL v4** | 27.03 | 54.08 | 51.55 |
222
+
223
+
224
+ #### Non-thinking mode
225
+
226
+
227
+ | Benchmark | functiongemma-270m-it | Qwen3-0.6B (no think) | LittleLamb-TC 0.3B (no think) |
228
+ | --------------------------- | --------------------- | --------------------- | ----------------------------- |
229
+ | IFBench | 12.00 | 23.80 | 21.00 |
230
+ | GPQA Diamond | 2.53 | 27.77 | 27.37 |
231
+ | MMLU-Pro | 0.42 | 25.72 | 23.71 |
232
+ | τ²-Bench | 5.05 | 15.50 | 26.67 |
233
+ | BFCL Simple | 61.60 | 12.73 | 70.55 |
234
+ | BFCL Multiple | 63.50 | 20.00 | 80.50 |
235
+ | BFCL Parallel | 39.00 | 18.00 | 71.50 |
236
+ | BFCL Parallel Multiple | 29.50 | 30.50 | 70.50 |
237
+ | BFCL Live Simple | 36.20 | 4.65 | 62.02 |
238
+ | BFCL Live Multiple | 25.70 | 11.02 | 50.43 |
239
+ | BFCL Live Parallel | 22.90 | 0.00 | 43.75 |
240
+ | BFCL Live Parallel Multiple | 20.80 | 12.50 | 29.17 |
241
+ | BFCL Relevance | 61.10 | 12.50 | 75.00 |
242
+ | BFCL Irrelevance | 73.70 | 97.50 | 87.50 |
243
+ | **BFCL v4** | 27.03 | 29.17 | 50.51 |
244
+
245
+
246
+ ![Intelligence Thinking](assets/littlelamb-tc-intelligence-family.png)
247
+
248
+ ### Quantitative Results (Inference Performance)
249
+
250
+ #### Metrics reported
251
+ - **System Output Throughput:** Mean output tokens per second across all concurrent requests over the benchmarking phase.
252
+ - **End-to-End Latency per Query:** Median end-to-end response time for each query, measured from the time the query is sent.
253
+ - **Output Speed per Query:** Median output tokens per second after the first token is received for each query.
254
+ - **Time to First Token (TTFT):** Median time from when a query is sent until its first output token is received.
255
+ - **Estimated Peak Memory Usage:** KV cache utilization is monitored during the phase, and memory usage is estimated as $\mathrm{model\_weights}_{\mathrm{GB}} + \mathrm{kv\_cache}_{\mathrm{usage\_pct}} \times (\mathrm{nvml\_used}_{\mathrm{GB}} - \mathrm{model\_weights}_{\mathrm{GB}})$
256
+ - **Model weights:**
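
The peak-memory estimate above is a one-line calculation. The sketch below assumes that KV cache utilization is expressed as a fraction between 0 and 1 and that all sizes are in GB; the function name is illustrative.

```python
def estimate_peak_memory_gb(
    model_weights_gb: float,
    kv_cache_usage_pct: float,
    nvml_used_gb: float,
) -> float:
    """Estimate peak memory: weights plus the used share of the remaining GPU memory.

    Assumption: kv_cache_usage_pct is a fraction in [0, 1], not a percentage.
    """
    if not 0.0 <= kv_cache_usage_pct <= 1.0:
        raise ValueError("kv_cache_usage_pct must be a fraction between 0 and 1")
    return model_weights_gb + kv_cache_usage_pct * (nvml_used_gb - model_weights_gb)
```

With a utilization of 0 the estimate reduces to the weights alone, and with a utilization of 1 it equals the memory reported as used by NVML.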
257
+
258
+
259
+
260
+ #### Performance evaluation conditions
261
+
262
+ Our performance evaluation follows the spirit of the [Artificial Analysis system load test methodology](https://artificialanalysis.ai/methodology/system-load-test).
263
+
264
+ - **Inference library**: vLLM 0.18.0
265
+ - **Monitoring libraries**: GuideLLM 0.6.0, nvidia-ml-py 13.590.48
266
+ - **Hardware**: 1× NVIDIA L4 GPU
267
+ - **Conditions**: concurrency=16
268
+ - **Phase duration**: Each phase lasts 3 minutes (excluding ramp-up and cool-down periods).
269
+ - **Workload shape**: 1,000 input tokens and 1,000 output tokens per query.
270
+ - **Streaming**: Benchmarking is conducted with streaming enabled.
271
+
272
+
273
+ **Summary of improvements:** LittleLamb shows a slight performance improvement over the original Qwen3-0.6B model. Only a small gain is expected because, for models this small, VRAM usage is dominated by the KV cache rather than by the model weights.
274
+
275
+ ![Performance](assets/littlelamb-tc-performance-family.png)
276
+
277
+
278
+
279
+ ---
280
+
281
+ ## Languages
282
+
283
+ - **Primary language**: English. Spanish has not yet been tested for tool-calling capabilities.
284
+
285
+ ---
286
+
287
+ ## Intended Use
288
+
289
+ ### Recommended Use Cases
290
+
291
+ Aligned with [Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) use cases, with the added benefit of tool-calling capabilities in a smaller footprint suitable for edge and on-device deployment:
292
+
293
+ - **Function calling and agentic workflows** in resource-constrained environments
294
+ - **On-device and edge inference** where memory and compute are constrained
295
+ - **Structured output generation** (JSON, schemas)
296
+ - **Reasoning tasks** with configurable thinking/non-thinking modes
297
+ - **Chatbots and virtual assistants** with tool integration
298
+
299
+ ### Out-of-Scope Uses
300
+
301
+ - Harmful, illegal, or deceptive content generation
302
+ - Impersonation of real individuals without consent
303
+ - High-risk decision-making without human oversight
304
+ - Surveillance or tracking of individuals
305
+ - Any use that violates applicable laws or regulations
306
+
307
+ ---
308
+
309
+ ## Safety & Limitations
310
+
311
+ ### Known Limitations
312
+
313
+ - **Model scale:** At ~0.3B parameters, this is an ultra-compact model. Several frontier-scale benchmarks (GDPval-AA, Terminal-Bench Hard, AA-LCR, CritPt) produce no discriminative signal at this model size, as the base Qwen3-0.6B itself scores near zero on them.
314
+ - **Thinking mode:** Performance differs substantially between thinking and non-thinking modes across benchmarks. Users should evaluate both modes for their specific use case.
315
+ - **Tool calling:** While fine-tuned for tool use, accuracy and reliability of tool calls should be validated for production use cases given the model's compact size.
316
+
317
+ ### Recommendations
318
+
319
+ - Use human oversight for critical applications
320
+ - Perform task-specific evaluation prior to deployment
321
+ - Test both thinking and non-thinking modes for your use case
322
+ - Validate tool-call outputs before executing them in production
323
+
324
+ ---
325
+
326
+ ## Model Information
327
+
328
+
329
+ | Field | Value |
330
+ | ------------ | --------------------------------------------------------------------------- |
331
+ | Model name | LittleLamb Tool-Calling |
332
+ | Based on | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
333
+ | Version | 2604 |
334
+ | Release date | 28/04/2026 |
335
+ | Developed by | Multiverse Computing |
336
+ | License | Apache 2.0 |
337
+ | Contact | [business@multiversecomputing.com](mailto:business@multiversecomputing.com) |
338
+
339
+
340
+ ---
341
+
342
+ ## Citation
343
+
344
+ If you use this model, please cite the base model and this variant:
345
+
346
+ ```bibtex
347
+ @misc{qwen3technicalreport,
348
+ title = {Qwen3 Technical Report},
349
+ author = {Qwen Team},
350
+ year = {2025},
351
+ eprint = {2505.09388},
352
+ archivePrefix = {arXiv},
353
+ primaryClass = {cs.CL},
354
+ url = {https://arxiv.org/abs/2505.09388}
355
+ }
356
+ @misc{littlelambtc,
357
+ title = {LittleLamb Tool-Calling: Compressed Qwen3-0.6B with Tool-Use via CompactifAI},
358
+ author = {Multiverse Computing},
359
+ year = {2026},
360
+ url = {https://huggingface.co/MultiverseComputingCAI/LittleLamb-ToolCalling},
361
+ note = {Model developed based on Qwen/Qwen3-0.6B using CompactifAI technology, fine-tuned for tool calling}
362
+ }
363
+ ```
364
+
365
+ **Built by [Multiverse Computing](https://www.multiversecomputing.com)** · [Report an issue](https://huggingface.co/MultiverseComputingCAI/LittleLamb-ToolCalling/discussions) · [Discord](https://discord.gg/cGas9uStqp)