---
license: apache-2.0

tags:
- networking
- webrtc
- congestion-control
- edge-ai
- gguf
- llama.cpp
- qos
- qoe
---

# NICoLE-LLM

NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming. 

From RTP packetization and queue telemetry, encoded as a compact symbolic prompt, it **predicts**:
- ECN (Explicit Congestion Notification)
- Current Profile (CP)
- Next Profile (NP)

**Optimized** for:
  - low-latency inference
  - edge deployment
  - GGUF quantization
  - deterministic structured outputs

**Applications**:

  - WebRTC adaptive streaming
  - congestion-aware real-time video encoding adaptation
  - in-network QoE optimization
  - edge AI networking
    
---

# Profiles
| Profile | Resolution | FPS | GoP |
|---|---|---|---|
| P0 | 3840×2160 (4K) | 30 / 60 / 90 / 120 | 2 s |
| P1 | 1920×1080 | 30 / 60 / 90 / 120 | 2 s |
| P2 | 1280×720 | 30 / 60 / 90 / 120 | 2 s |
| P3 | 640×360 | 30 / 60 / 90 / 120 | 2 s |

The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.
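
The profile table above can be kept as a small lookup so that a predicted profile can be turned into encoder settings. This is a minimal sketch, not part of the released code: the names `Profile`, `PROFILES`, and `describe_profile` are hypothetical, and the mapping of integer indices 0 to 3 onto P0 to P3 is an assumption.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    name: str
    width: int
    height: int


# Resolutions copied from the profile table. The integer keys are an assumed
# encoding of P0..P3; the model card does not state the encoding explicitly.
PROFILES = {
    0: Profile("P0", 3840, 2160),
    1: Profile("P1", 1920, 1080),
    2: Profile("P2", 1280, 720),
    3: Profile("P3", 640, 360),
}

SUPPORTED_FPS = (30, 60, 90, 120)  # every profile lists the same frame rates
GOP_SECONDS = 2                    # every profile uses a 2 s GoP


def describe_profile(index: int) -> str:
    """Return a readable label for a profile index, or raise on unknown values."""
    if index not in PROFILES:
        raise ValueError(f"unknown profile index: {index}")
    p = PROFILES[index]
    return f"{p.name}: {p.width}x{p.height}, GoP {GOP_SECONDS} s"


print(describe_profile(1))  # P1: 1920x1080, GoP 2 s
```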

---

# Prompt Format

**Input order**:

```text
PS FS IFGS IFGR CQ LQ E
```

**Output order**:

```text
E C N
```

**Example**:

```text
I:PS FS IFGS IFGR CQ LQ E
O:E C N

U:1400,40,34,33,2,0,0

A:
```

Expected output:
```text
0,1,1
```
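
The prompt layout and the comma-separated reply shown above can be wrapped in two small helpers. This is a sketch under stated assumptions: `build_prompt` and `parse_decision` are hypothetical names, the field abbreviations are treated as opaque labels, all values are assumed to be integers, and the blank lines follow the example prompt.

```python
# Header lines copied from the example prompt.
HEADER = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N"
INPUT_FIELDS = ("PS", "FS", "IFGS", "IFGR", "CQ", "LQ", "E")
OUTPUT_FIELDS = ("E", "C", "N")


def build_prompt(values):
    """Render one telemetry sample in the layout of the example prompt."""
    if len(values) != len(INPUT_FIELDS):
        raise ValueError(f"expected {len(INPUT_FIELDS)} values, got {len(values)}")
    # Assumption: every field is an integer, as in the example sample.
    user_line = "U:" + ",".join(str(int(v)) for v in values)
    return f"{HEADER}\n\n{user_line}\n\nA:"


def parse_decision(text):
    """Parse a reply such as '0,1,1' into a dict keyed by E, C, N."""
    lines = text.strip().splitlines()
    if not lines:
        raise ValueError("empty reply")
    parts = lines[0].split(",")
    if len(parts) != len(OUTPUT_FIELDS):
        raise ValueError(f"malformed reply: {text!r}")
    # int() raises ValueError on non-numeric fields, which rejects bad replies.
    return dict(zip(OUTPUT_FIELDS, (int(p) for p in parts)))


print(build_prompt([1400, 40, 34, 33, 2, 0, 0]))
print(parse_decision("0,1,1"))  # {'E': 0, 'C': 1, 'N': 1}
```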
---

# Hugging Face Usage (Python code)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YOUR_USERNAME/NICoLE-LLM"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

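inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding (do_sample=False) keeps the output deterministic.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False
)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True).strip())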
```
---

# GGUF / llama.cpp Usage

```bash
./llama-cli \
  -no-cnv \
  -t 4 \
  -m nicole-q4.gguf \
  -p "I:PS FS IFGS IFGR CQ LQ E
O:E C N

U:1400,40,34,33,2,0,0

A:" \
  -n 6 \
  --temp 0 \
  --top-k 1
```
---

# Runtime Configuration

| Parameter | Value |
|---|---|
| Runtime | llama.cpp |
| Quantization | Q4_K_M |
| Model Size | 636 MB |
| Context Length | 4096 |
| Inference | Deterministic |
| Prompting | Compact Symbolic |

---
# CPU Core Benchmark

| Threads | Response (ms) | Decisions/sec | Tokens/sec |
|---|---|---|---|
| 1 | 1325 | 0.75 | 52.71 |
| 2 | 624 | 1.60 | 113.30 |
| **4** | **343** | **2.91** | **203.35** |
| 8 | 904 | 1.11 | 60.70 |
| 16 | 1043 | 0.96 | 132.46 |
| 32 | 1432 | 0.70 | 104.29 |

**Best CPU deployment:**
- 4 threads
- 343 ms response time
- 2.91 decisions/sec
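
The Decisions/sec column is consistent with the reciprocal of the response time, assuming decisions are served one at a time. The sketch below recomputes it from the table and picks the fastest thread count; `RESPONSE_MS` and `decisions_per_second` are hypothetical names, and small differences in the last digit (for example 2.92 versus the reported 2.91) likely come from rounding of the response times.

```python
# Response time per decision in milliseconds, keyed by thread count,
# copied from the benchmark table.
RESPONSE_MS = {1: 1325, 2: 624, 4: 343, 8: 904, 16: 1043, 32: 1432}


def decisions_per_second(response_ms: float) -> float:
    """Assuming one decision at a time, throughput is 1000 ms / latency."""
    return 1000.0 / response_ms


# The best configuration is the thread count with the lowest response time.
best_threads = min(RESPONSE_MS, key=RESPONSE_MS.get)

for threads, ms in RESPONSE_MS.items():
    marker = " <- best" if threads == best_threads else ""
    rate = decisions_per_second(ms)
    print(f"{threads:>2} threads: {ms:>4} ms, {rate:.2f} decisions/s{marker}")
```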

Compared to verbose natural-language prompting, compact symbolic prompting significantly reduces:
- prompt tokens
- KV-cache usage
- inference latency
- deployment overhead

---

# Quantized Models

Available quantization:
- Q4_K_M (recommended)

**Runtime**:
- llama.cpp

**Designed** for:
- edge deployment
- CPU inference
- bounded symbolic control inference
- real-time congestion-aware adaptation

---

# Limitations

- Trained under a 40 Mbps bottleneck scenario
- Designed for bounded RTP/WebRTC streaming tasks
- Not intended for open-ended conversational generation

---

# Citation

If you use this model, please cite the NICoLE paper and repository.

  - Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, *"NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?"*, IEEE/IFIP Networking, Switzerland, 2026.