NICoLE-LLM

NICoLE is a compact LLM-based controller for congestion-aware RTP/WebRTC adaptive video streaming.

It predicts:

  • ECN (Explicit Congestion Notification)
  • Current Profile (CP)
  • Next Profile (NP)

from RTP packetization and queue telemetry using compact symbolic prompting.

Optimized for:

  • low-latency inference
  • edge deployment
  • GGUF quantization
  • deterministic structured outputs

Applications:

  • WebRTC adaptive streaming
  • congestion-aware real-time video encoding adaptation
  • in-network QoE optimization
  • edge AI networking

Profiles

Profile   Resolution        FPS                  GoP
P0        3840×2160 (4K)    30 / 60 / 90 / 120   2 s
P1        1920×1080         30 / 60 / 90 / 120   2 s
P2        1280×720          30 / 60 / 90 / 120   2 s
P3        640×360           30 / 60 / 90 / 120   2 s
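
The predicted profile index (the C and N fields in the output) selects one of these rows. A minimal lookup sketch in Python; the dictionary, names, and fps handling are illustrative assumptions, not part of the model:

# Illustrative mapping from a predicted profile index to encoder settings.
# Resolutions and the 2 s GoP come from the table above.
PROFILES = {
    0: {"resolution": (3840, 2160), "gop_s": 2},  # P0, 4K
    1: {"resolution": (1920, 1080), "gop_s": 2},  # P1
    2: {"resolution": (1280, 720),  "gop_s": 2},  # P2
    3: {"resolution": (640, 360),   "gop_s": 2},  # P3
}
VALID_FPS = (30, 60, 90, 120)  # every profile supports all four rates

def apply_profile(index: int, fps: int = 30) -> dict:
    # Return the encoder settings for a predicted profile index.
    assert fps in VALID_FPS
    return {**PROFILES[index], "fps": fps}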

The dataset was generated using real-time WebRTC streaming under a 40 Mbps bottleneck shared between background traffic and adaptive RTP video streaming.


Prompt Format

Input order:

PS FS IFGS IFGR CQ LQ E

Output order:

E C N

Example:

I:PS FS IFGS IFGR CQ LQ E
O:E C N

U:1400,40,34,33,2,0,0

A:

Expected output: 0,1,1
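
A small helper can serialize one telemetry sample into this layout. A minimal sketch; the function and argument names are assumptions for illustration, and the blank lines before U: and A: follow the example above:

HEADER = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N"

def build_prompt(ps, fs, ifgs, ifgr, cq, lq, e):
    # Serialize the seven input fields, in order, as a comma-separated U: line.
    user = ",".join(str(v) for v in (ps, fs, ifgs, ifgr, cq, lq, e))
    return f"{HEADER}\n\nU:{user}\n\nA:"

print(build_prompt(1400, 40, 34, 33, 2, 0, 0))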

Hugging Face Usage (Python)

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "YOUR_USERNAME/NICoLE-LLM"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto"
)

# Compact symbolic prompt: inputs PS FS IFGS IFGR CQ LQ E, outputs E C N.
prompt = """I:PS FS IFGS IFGR CQ LQ E
O:E C N
U:1400,40,34,33,2,0,0
A:"""

inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding (do_sample=False) keeps the structured output deterministic.
out = model.generate(
    **inputs,
    max_new_tokens=6,
    do_sample=False
)
print(tok.decode(out[0], skip_special_tokens=True))
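
generate returns the prompt tokens followed by the completion, so a caller typically strips the prompt before splitting the answer into its three fields. A minimal sketch continuing the snippet above:

# Keep only the newly generated tokens, then split into E, C, N.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
answer = tok.decode(new_tokens, skip_special_tokens=True).strip()
ecn, current_profile, next_profile = answer.split(",")  # e.g. "0,1,1"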

GGUF / llama.cpp Usage

./llama-cli \
  -no-cnv \
  -t 4 \
  -m nicole-q4.gguf \
  -p "I:PS FS IFGS IFGR CQ LQ E
O:E C N

U:1400,40,34,33,2,0,0

A:" \
  -n 6 \
  --temp 0 \
  --top-k 1
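
The same GGUF file can also be driven from Python through the llama-cpp-python bindings rather than the CLI. A minimal sketch, assuming the package is installed and the model file sits in the working directory; the settings mirror the CLI flags above:

from llama_cpp import Llama

llm = Llama(model_path="nicole-q4.gguf", n_ctx=4096, n_threads=4)

prompt = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\n\nU:1400,40,34,33,2,0,0\n\nA:"

# temperature=0 and top_k=1 reproduce the deterministic CLI decoding.
out = llm(prompt, max_tokens=6, temperature=0.0, top_k=1)
print(out["choices"][0]["text"])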

Runtime Configuration

Parameter        Value
Runtime          llama.cpp
Quantization     Q4_K_M
Model Size       636 MB
Context Length   4096
Inference        Deterministic
Prompting        Compact Symbolic

CPU Core Benchmark

Threads   Response (ms)   Decisions/sec   Tokens/sec
1         1325            0.75            52.71
2         624             1.60            113.30
4         343             2.91            203.35
8         904             1.11            60.70
16        1043            0.96            132.46
32        1432            0.70            104.29

Best CPU deployment:

  • 4 threads
  • 343 ms response time
  • 2.91 decisions/sec
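
The Decisions/sec column is, to within rounding, the reciprocal of the response time. A quick check:

# decisions/sec ≈ 1000 / response_ms
for threads, ms in [(1, 1325), (2, 624), (4, 343), (8, 904), (16, 1043), (32, 1432)]:
    print(f"{threads:>2} threads: {1000 / ms:.2f} decisions/sec")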

Compared to verbose natural-language prompting, compact symbolic prompting significantly reduces:

  • prompt tokens (see the sketch below)
  • KV-cache usage
  • inference latency
  • deployment overhead
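
The token saving is easy to measure with the tokenizer from the Hugging Face example above; the verbose prompt here is only an illustrative stand-in, not a format the model was trained on:

# tok is the tokenizer loaded in the Hugging Face example.
compact = "I:PS FS IFGS IFGR CQ LQ E\nO:E C N\nU:1400,40,34,33,2,0,0\nA:"
verbose = (
    "You are a congestion controller for RTP video streaming. "
    "The telemetry values are 1400, 40, 34, 33, 2, 0 and 0. "
    "Answer with the ECN flag, the current profile and the next "
    "profile, separated by commas."
)
print("compact tokens:", len(tok(compact)["input_ids"]))
print("verbose tokens:", len(tok(verbose)["input_ids"]))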

Quantized Models

The base checkpoint is published as F16 safetensors (1B parameters).

Available quantization:

  • Q4_K_M (recommended)

Runtime:

  • llama.cpp

Designed for:

  • edge deployment
  • CPU inference
  • bounded symbolic control inference
  • real-time congestion-aware adaptation

Limitations

  • Trained under a 40 Mbps bottleneck scenario
  • Designed for bounded RTP/WebRTC streaming tasks
  • Not intended for open-ended conversational generation

Citation

If you use this model, please cite the NICoLE paper and repository.

  • Alireza Shirmarz, Fabio Luciano Verdi, Gyanesh Patra, Gergely Pongracz, "NICoLE: Are In-Network LLM-Based Agents Cost-Feasible for RTP Video Streaming?", IEEE/IFIP Networking, Switzerland, 2026.