---
license: apache-2.0
tags:
- networking
- webrtc
- congestion-control
- edge-ai
- gguf
- llama.cpp
- qos
- qoe
---
# NICoLE-LLM
NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.
From RTP packetization and queue telemetry, encoded as a compact symbolic prompt, it **predicts**:
- ECN (Explicit Congestion Notification)
- Current Profile (CP)
- Next Profile (NP)
**Optimized** for:
- low-latency inference
- edge deployment
- GGUF quantization
- deterministic structured outputs
**Applications**:
- WebRTC adaptive streaming
- congestion-aware real-time video encoding adaptation
- in-network QoE optimization
- edge AI networking
---
# Profiles
| Profile | Resolution | FPS | GoP |
|---|---|---|---|
| P0 | 3840×2160 (4K) | 30 / 60 / 90 / 120 | 2 s |
| P1 | 1920×1080 | 30 / 60 / 90 / 120 | 2 s |
| P2 | 1280×720 | 30 / 60 / 90 / 120 | 2 s |
| P3 | 640×360 | 30 / 60 / 90 / 120 | 2 s |
The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.
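For code that consumes the model's profile predictions, the table above can be kept as a small lookup. This is only an illustrative sketch: `PROFILES`, `FPS_OPTIONS`, `GOP_SECONDS`, and `describe_profile` are hypothetical names, and it assumes that the integer profile values in the model output index P0 to P3 in table order, which this card does not state explicitly.

```python
# Profile table from this card, keyed by an integer index.
# ASSUMPTION: index k corresponds to profile Pk; the card does not confirm this.
PROFILES = {
    0: ("P0", 3840, 2160),
    1: ("P1", 1920, 1080),
    2: ("P2", 1280, 720),
    3: ("P3", 640, 360),
}

# Frame rates and GoP duration are the same for every profile in the table.
FPS_OPTIONS = (30, 60, 90, 120)
GOP_SECONDS = 2


def describe_profile(index: int) -> str:
    """Return a short human-readable label for a profile index."""
    name, width, height = PROFILES[index]
    return f"{name}: {width}x{height}"


if __name__ == "__main__":
    for idx in sorted(PROFILES):
        print(describe_profile(idx))
```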
---
# Prompt Format
**Input order**:
```text
PS FS IFGS IFGR CQ LQ E
```
**Output order**:
```text
E C N
```
**Example**:
```text
I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:
```
Expected output:
```text
0,1,1
```
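A minimal sketch of how a caller might assemble this prompt and parse the reply. The names `build_prompt`, `parse_reply`, and `Decision` are hypothetical, and the parser assumes the model returns three comma-separated integers on the first line, as in the example above. Reading E, C, N as ECN, Current Profile, and Next Profile is inferred from the prediction list in this card.

```python
from typing import NamedTuple

# Fixed header lines of the compact symbolic prompt, as given in this card.
HEADER = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\n"


class Decision(NamedTuple):
    e: int  # E: ECN (inferred from the card's prediction list)
    c: int  # C: Current Profile
    n: int  # N: Next Profile


def build_prompt(ps, fs, ifgs, ifgr, cq, lq, e) -> str:
    """Format one telemetry sample in the card's input order."""
    values = ",".join(str(v) for v in (ps, fs, ifgs, ifgr, cq, lq, e))
    return f"{HEADER}U:{values}\nA:"


def parse_reply(text: str) -> Decision:
    """Parse a reply such as '0,1,1' into a Decision.

    ASSUMPTION: the answer is on the first non-empty line and contains
    exactly three comma-separated integers.
    """
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty model reply")
    parts = [p.strip() for p in lines[0].split(",")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {lines[0]!r}")
    e, c, n = (int(p) for p in parts)
    return Decision(e, c, n)


if __name__ == "__main__":
    print(build_prompt(1400, 40, 34, 33, 2, 0, 0))
    print(parse_reply("0,1,1"))
```

Raising on malformed replies, rather than guessing, lets a controller fall back to its previous profile when the model output is unusable.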
---
# Hugging Face Usage (Python code)
```python
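from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with the actual Hub repository ID of the model.
model_id = "YOUR_USERNAME/NICoLE-LLM"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # requires the `accelerate` package
)

# Compact symbolic prompt: input header, output header, telemetry values, answer slot.
prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the output deterministic.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False,
)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))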
```
---
# GGUF / llama.cpp Usage
```bash
./llama-cli \
-no-cnv \
-t 4 \
-m nicole-q4.gguf \
-p "I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:" \
-n 6 \
--temp 0 \
--top-k 1
```
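To call the same command from a controller process, one option is a thin Python wrapper around the CLI. The sketch below reuses only the flags shown above; `build_llama_cmd` and `run_llama` are hypothetical helper names, and the default binary and model paths are simply the ones from the example. The raw stdout may include the echoed prompt, so a caller would probably need to strip it before parsing.

```python
import subprocess


def build_llama_cmd(prompt, model_path="nicole-q4.gguf", threads=4,
                    binary="./llama-cli"):
    """Build the argument list for the llama-cli call shown in this card."""
    return [
        binary,
        "-no-cnv",             # single-shot completion, no conversation mode
        "-t", str(threads),    # CPU threads
        "-m", model_path,      # GGUF model file
        "-p", prompt,          # compact symbolic prompt
        "-n", "6",             # at most 6 new tokens
        "--temp", "0",         # deterministic decoding
        "--top-k", "1",
    ]


def run_llama(prompt, **kwargs):
    """Run llama-cli once and return its raw stdout.

    ASSUMPTION: the binary and model exist at the given paths.
    """
    result = subprocess.run(
        build_llama_cmd(prompt, **kwargs),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout


if __name__ == "__main__":
    demo = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\nU:1400,40,34,33,2,0,0\nA:"
    print(build_llama_cmd(demo))
```

Passing the arguments as a list avoids shell quoting problems with the multi-line prompt.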
---
# Runtime Configuration
| Parameter | Value |
|---|---|
| Runtime | llama.cpp |
| Quantization | Q4_K_M |
| Model Size | 636 MB |
| Context Length | 4096 |
| Inference | Deterministic |
| Prompting | Compact Symbolic |
---
# CPU Core Benchmark
| Threads | Response (ms) | Decisions/sec | Tokens/sec |
|---|---|---|---|
| 1 | 1325 | 0.75 | 52.71 |
| 2 | 624 | 1.60 | 113.30 |
| **4** | **343** | **2.91** | **203.35** |
| 8 | 904 | 1.11 | 60.70 |
| 16 | 1043 | 0.96 | 132.46 |
| 32 | 1432 | 0.70 | 104.29 |
**Best CPU deployment:**
- 4 threads
- 343 ms response time
- 2.91 decisions/sec
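The decisions/sec column appears consistent with the reciprocal of the response time (1000 / response in ms), which suggests one decision per sequential inference call. The short check below recomputes it from the table values, to within rounding; the variable names are illustrative.

```python
# Response times (ms) per thread count, copied from the benchmark table.
response_ms = {1: 1325, 2: 624, 4: 343, 8: 904, 16: 1043, 32: 1432}

# Decisions/sec as reported in the table, for comparison.
reported = {1: 0.75, 2: 1.60, 4: 2.91, 8: 1.11, 16: 0.96, 32: 0.70}

# One decision per call: rate = 1000 ms / response time in ms.
decisions_per_sec = {t: 1000.0 / ms for t, ms in response_ms.items()}

# The best configuration is the one with the lowest response time.
best_threads = min(response_ms, key=response_ms.get)

if __name__ == "__main__":
    for t, rate in decisions_per_sec.items():
        print(f"{t:>2} threads: {rate:.2f} computed vs {reported[t]:.2f} reported")
    print(f"best thread count: {best_threads}")
```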
Compared to verbose natural-language prompting, compact symbolic prompting significantly reduces:
- prompt tokens
- KV-cache usage
- inference latency
- deployment overhead
---
# Quantized Models
Available quantization:
- Q4_K_M (recommended)
**Runtime**:
- llama.cpp
**Designed** for:
- edge deployment
- CPU inference
- bounded symbolic control inference
- real-time congestion-aware adaptation
---
# Limitations
- Trained under a 40 Mbps bottleneck scenario
- Designed for bounded RTP/WebRTC streaming tasks
- Not intended for open-ended conversational generation
---
# Citation
If you use this model, please cite the NICoLE paper and repository.
- **Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, *"NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?"*, IEEE/IFIP Networking, Switzerland, 2026.**