# NICoLE-LLM

NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.

From RTP packetization and queue telemetry, it uses compact symbolic prompting to predict:
- ECN
- Current Profile (CP)
- Next Profile (NP)
Optimized for:
- low-latency inference
- edge deployment
- GGUF quantization
- deterministic structured outputs
Applications:
- WebRTC adaptive streaming
- congestion-aware real-time video encoding adaptation
- in-network QoE optimization
- edge AI networking
## Profiles
| Profile | Resolution | FPS | GoP |
|---|---|---|---|
| P0 | 3840×2160 (4K) | 30 / 60 / 90 / 120 | 2 s |
| P1 | 1920×1080 | 30 / 60 / 90 / 120 | 2 s |
| P2 | 1280×720 | 30 / 60 / 90 / 120 | 2 s |
| P3 | 640×360 | 30 / 60 / 90 / 120 | 2 s |
The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.
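For illustration, here is a minimal sketch of applying a predicted Next Profile index to the ladder in the table above; the `PROFILES` mapping and `resolution_for` helper are assumed names for this example, not part of the released code:

```python
# Hypothetical representation of the profile ladder above. Every profile keeps
# a 2 s GoP and may run at 30/60/90/120 FPS, so only the resolution is mapped.
PROFILES = {
    0: (3840, 2160),  # P0: 4K
    1: (1920, 1080),  # P1: 1080p
    2: (1280, 720),   # P2: 720p
    3: (640, 360),    # P3: 360p
}

def resolution_for(next_profile: int) -> tuple[int, int]:
    """Return the (width, height) the encoder should switch to."""
    return PROFILES[next_profile]

print(resolution_for(1))  # -> (1920, 1080)
```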
## Prompt Format

Input order: `PS FS IFGS IFGR CQ LQ E`

Output order: `E C N`

Example:

```text
I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:
```

Expected output:

```text
0,1,1
```
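As a sketch, the prompt can be assembled from a telemetry sample and the reply split back into the three predicted fields: ECN (E), Current Profile (C), and Next Profile (N). The `build_prompt` and `parse_reply` helpers below are illustrative names, not part of the model card:

```python
def build_prompt(ps, fs, ifgs, ifgr, cq, lq, e):
    # Fields follow the documented input order: PS FS IFGS IFGR CQ LQ E.
    return (
        "I:PS FS IFGS IFGR CQ LQ E\n"
        "O:E C N\n"
        f"U:{ps},{fs},{ifgs},{ifgr},{cq},{lq},{e}\n"
        "A:"
    )

def parse_reply(text):
    # The model answers with three comma-separated integers: E, C, N.
    e, c, n = (int(v) for v in text.strip().split(","))
    return e, c, n

print(build_prompt(1400, 40, 34, 33, 2, 0, 0))
print(parse_reply("0,1,1"))  # -> (0, 1, 1)
```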
## Hugging Face Usage (Python code)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YOUR_USERNAME/NICoLE-LLM"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
)

prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding: the answer is a short, structured "E,C,N" triple.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False,
)

print(tok.decode(out[0], skip_special_tokens=True))
```
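Continuing from the variables defined above, you can decode only the newly generated tokens and split them into the E, C, N integers (a small illustrative addition, not part of the original example):

```python
# Decode only the tokens generated after the prompt, then split into E, C, N.
answer = tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
e, c, n = (int(v) for v in answer.strip().split(","))
print(e, c, n)  # e.g. 0 1 1
```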
## GGUF / llama.cpp Usage

```bash
./llama-cli \
  -no-cnv \
  -t 4 \
  -m nicole-q4.gguf \
  -p "I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:" \
  -n 6 \
  --temp 0 \
  --top-k 1
```
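If you prefer driving the GGUF file from Python rather than the CLI, a minimal sketch using the third-party llama-cpp-python bindings (an assumption; these bindings are not mentioned on this card) with the same file name and settings as above:

```python
from llama_cpp import Llama

# Mirror the CLI settings: 4 threads, 4096-token context, greedy decoding.
llm = Llama(model_path="nicole-q4.gguf", n_ctx=4096, n_threads=4, verbose=False)

prompt = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\nU:1400,40,34,33,2,0,0\nA:"
out = llm(prompt, max_tokens=6, temperature=0.0, top_k=1)

print(out["choices"][0]["text"].strip())  # expected: 0,1,1
```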
## Runtime Configuration
| Parameter | Value |
|---|---|
| Runtime | llama.cpp |
| Quantization | Q4_K_M |
| Model Size | 636 MB |
| Context Length | 4096 |
| Inference | Deterministic |
| Prompting | Compact Symbolic |
## CPU Core Benchmark
| Threads | Response (ms) | Decisions/sec | Tokens/sec |
|---|---|---|---|
| 1 | 1325 | 0.75 | 52.71 |
| 2 | 624 | 1.60 | 113.30 |
| 4 | 343 | 2.91 | 203.35 |
| 8 | 904 | 1.11 | 60.70 |
| 16 | 1043 | 0.96 | 132.46 |
| 32 | 1432 | 0.70 | 104.29 |
Best CPU deployment:
- 4 threads
- 343 ms response time
- 2.91 decisions/sec
Compared to verbose natural-language prompting, compact symbolic prompting significantly reduces:
- prompt tokens
- KV-cache usage
- inference latency
- deployment overhead
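A quick way to check the prompt-token reduction on your own machine is to tokenize the compact prompt next to a verbose paraphrase; the verbose wording below is hypothetical and only serves as a comparison point:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("YOUR_USERNAME/NICoLE-LLM")

compact = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\nU:1400,40,34,33,2,0,0\nA:"
# Hypothetical natural-language phrasing of the same request, for comparison
# only; the model is trained on the compact format, not on this wording.
verbose = ("Given the telemetry sample PS=1400, FS=40, IFGS=34, IFGR=33, "
           "CQ=2, LQ=0, E=0, predict the ECN flag, the current profile and "
           "the next profile as three comma-separated integers.")

print("compact prompt tokens:", len(tok(compact)["input_ids"]))
print("verbose prompt tokens:", len(tok(verbose)["input_ids"]))
```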
## Quantized Models
Available quantization:
- Q4_K_M (recommended)
Runtime:
- llama.cpp
Designed for:
- edge deployment
- CPU inference
- bounded symbolic control inference
- real-time congestion-aware adaptation
## Limitations
- Trained under a 40 Mbps bottleneck scenario
- Designed for bounded RTP/WebRTC streaming tasks
- Not intended for open-ended conversational generation
## Citation

If you use this model, please cite the NICoLE paper and repository.

- Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, "NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?", IEEE/IFIP Networking, Switzerland, 2026.