alirezashirmarz committed 22 days ago

Commit

b52b298

verified

1 Parent(s): a27537d

Update README.md

Files changed (1)

README.md +15 -13

README.md CHANGED

@@ -16,24 +16,26 @@ tags:
 NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.
-It predicts:
 - ECN
 - Current Profile (CP)
 - Next Profile (NP)
 from RTP packetization and queue telemetry using compact symbolic prompting.
-Optimized for:
   - low-latency inference
   - edge deployment
   - GGUF quantization
   - deterministic structured outputs
-Applications:
   - WebRTC adaptive streaming
   - congestion-aware real-time video encoding adaptation
   - in-Network QoE Optimization
   - edge AI networking
 ---
 # Profiles
@@ -50,21 +52,21 @@ The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottl
 # Prompt Format
-Input order:
 ```text
 PS FS IFGS IFGR CQ LQ E
 ```
-Output order:
 ```text
 E C N
 ```
-Example:
 ```text
 I:PS FS IFGS IFGR CQ LQ E
@@ -81,7 +83,7 @@ Expected output:
 ```
 ---
-# Hugging Face Usage
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
@@ -150,12 +152,12 @@ A:" \
 | 16 | 1043 | 0.96 | 132.46 |
 | 32 | 1432 | 0.70 | 104.29 |
-Best CPU deployment:
 - 4 threads
 - 343 ms response time
 - 2.91 decisions/sec
-Compact symbolic prompting significantly reduces:
 - prompt tokens
 - KV-cache usage
 - inference latency
@@ -169,10 +171,10 @@ compared to verbose natural-language prompting.
 Available quantization:
 - Q4_K_M (recommended)
-Runtime:
 - llama.cpp
-Designed for:
 - edge deployment
 - CPU inference
 - bounded symbolic control inference
@@ -192,5 +194,5 @@ Designed for:
 If you use this model, please cite the NICoLE paper and repository.
-  - Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz,*"NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?"*,
-IEEE/IFIP Networking, Switzerland 2026.

 NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.
+It **predicts**:
 - ECN
 - Current Profile (CP)
 - Next Profile (NP)
 from RTP packetization and queue telemetry using compact symbolic prompting.
+**Optimized** for:
   - low-latency inference
   - edge deployment
   - GGUF quantization
   - deterministic structured outputs
+**Applications**:
   - WebRTC adaptive streaming
   - congestion-aware real-time video encoding adaptation
   - in-Network QoE Optimization
   - edge AI networking
 ---
 # Profiles
 # Prompt Format
+**Input order**:
 ```text
 PS FS IFGS IFGR CQ LQ E
 ```
+**Output order**:
 ```text
 E C N
 ```
+**Example**:
 ```text
 I:PS FS IFGS IFGR CQ LQ E
 ```
 ---
+# Hugging Face Usage (Python code)
 ```python
 from transformers import AutoModelForCausalLM, AutoTokenizer
 | 16 | 1043 | 0.96 | 132.46 |
 | 32 | 1432 | 0.70 | 104.29 |
+**Best CPU deployment:**
 - 4 threads
 - 343 ms response time
 - 2.91 decisions/sec
+## Compact symbolic prompting significantly reduces:
 - prompt tokens
 - KV-cache usage
 - inference latency
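
The "Best CPU deployment" figures in this hunk appear to be related by simple arithmetic: if requests are handled one at a time, decisions per second is roughly 1000 divided by the response time in milliseconds. The sketch below checks that relation against the numbers visible in the diff. It assumes that the second and third columns of the table rows are response time (ms) and decisions/sec, which the excerpt does not state because the table header is outside the hunk; the function name `decisions_per_second` is hypothetical.

```python
def decisions_per_second(response_time_ms: float) -> float:
    """Sequential throughput implied by a per-decision response time.

    Assumption: one request is processed at a time, so throughput is the
    reciprocal of latency. This is not taken from the NICoLE benchmark code.
    """
    if response_time_ms <= 0:
        raise ValueError("response time must be positive")
    return 1000.0 / response_time_ms


# Response times (ms) and reported decisions/sec as they appear in the diff.
# The column meaning for the 1043 and 1432 rows is an assumption.
reported = {343: 2.91, 1043: 0.96, 1432: 0.70}

for ms, claimed in reported.items():
    implied = decisions_per_second(ms)
    print(f"{ms} ms: implied {implied:.3f}, reported {claimed}")
```

The implied values agree with the reported ones to within about 0.01, which is what would be expected if the response times were rounded to whole milliseconds before publication.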
 Available quantization:
 - Q4_K_M (recommended)
+**Runtime**:
 - llama.cpp
+**Designed** for:
 - edge deployment
 - CPU inference
 - bounded symbolic control inference
 If you use this model, please cite the NICoLE paper and repository.
+  - **Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz,*"NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?"*,
+IEEE/IFIP Networking, Switzerland 2026.**
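
The "Prompt Format" section touched by this commit gives an input order (`PS FS IFGS IFGR CQ LQ E`) and an output order (`E C N`). A minimal sketch of how a caller might serialize telemetry into that order and parse the reply is shown below. The `I:` prefix and the `A:` marker both appear in the README excerpts, but the exact layout (a newline between them, space-separated values) is an assumption, and the helper names `build_prompt` and `parse_decision` and the sample values are hypothetical.

```python
# Field orders copied from the README's "Input order" and "Output order" lines.
INPUT_FIELDS = ("PS", "FS", "IFGS", "IFGR", "CQ", "LQ", "E")
OUTPUT_FIELDS = ("E", "C", "N")


def build_prompt(telemetry: dict) -> str:
    """Serialize telemetry values in README order.

    Assumed layout: 'I:<space-separated values>' followed by a newline
    and 'A:'. The README diff does not show the full template.
    """
    missing = [name for name in INPUT_FIELDS if name not in telemetry]
    if missing:
        raise KeyError(f"missing telemetry fields: {missing}")
    values = " ".join(str(telemetry[name]) for name in INPUT_FIELDS)
    return f"I:{values}\nA:"


def parse_decision(completion: str) -> dict:
    """Map the first line of the model's reply onto E, C, N (kept as strings)."""
    lines = completion.strip().splitlines()
    tokens = lines[0].split() if lines else []
    if len(tokens) != len(OUTPUT_FIELDS):
        raise ValueError(f"expected {len(OUTPUT_FIELDS)} fields, got {tokens!r}")
    return dict(zip(OUTPUT_FIELDS, tokens))


# Made-up sample values, for illustration only.
sample = {"PS": 1200, "FS": 30, "IFGS": 5, "IFGR": 6, "CQ": 10, "LQ": 8, "E": 0}
print(build_prompt(sample))
print(parse_decision("1 3 2"))
```

Rejecting replies that do not contain exactly three fields keeps a malformed generation from being read as a valid profile decision, which matters for the "deterministic structured outputs" goal the README lists.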