NLProxy Service Module Reference
This document describes the compression orchestration service in service/compression.py.
Purpose
CompressionService coordinates the full prompt compression workflow, including shielding, segmentation, compression, reconstruction, and safety validation.
Primary Class
CompressionService
Responsibilities
- Orchestrates prompt transformation across multiple core modules.
- Executes shielding, semantic segmentation, compression, and reconstruction stages.
- Provides thread pool parallelism for batch workloads.
- Optionally integrates Redis-backed semantic caching.
- Controls privacy mode and NLI refinement.
Constructor
CompressionService(
    use_cache: bool = True,
    device: Optional[str] = None,
    redis_url: Optional[str] = None,
    nli_refinement_fn: Optional = None,
    privacy_mode: bool = False,
    models_dir: Optional[Path] = None,
    llm_default_model: Optional[str] = None,
    thread_pool_workers: Optional[int] = None,
)
Key Behaviors
- Builds a thread pool via ThreadPoolExecutor(max_workers=self.thread_pool_workers).
- Reads NLPROXY_COMPRESSION_WORKERS to override the default worker count.
- Initializes PromptShield, SemanticSegmenter, SemanticCompressor, PromptReconstructor, and SafetyChecker.
- Optionally initializes SemanticLLMCache if Redis is configured.
- Caches shield and embedding results in memory when use_cache=True.
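The worker-count behavior can be sketched as follows. This is a minimal illustration, not the module's actual code: the helper names resolve_worker_count and build_pool, the fallback of 4 workers, and the rule that an explicit argument wins over the environment variable are all assumptions; only NLPROXY_COMPRESSION_WORKERS and ThreadPoolExecutor(max_workers=...) come from the description above.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional

ENV_VAR = "NLPROXY_COMPRESSION_WORKERS"


def resolve_worker_count(explicit: Optional[int] = None, fallback: int = 4) -> int:
    """Pick a worker count: explicit argument, then env var, then fallback.

    The precedence order and the fallback value are assumptions made for
    this sketch; the real service may resolve them differently.
    """
    if explicit is not None:
        return max(1, explicit)
    raw = os.environ.get(ENV_VAR)
    if raw is not None:
        try:
            return max(1, int(raw))
        except ValueError:
            pass  # malformed value: ignore it and use the fallback
    return fallback


def build_pool(explicit: Optional[int] = None) -> ThreadPoolExecutor:
    """Create the thread pool with the resolved worker count."""
    return ThreadPoolExecutor(max_workers=resolve_worker_count(explicit))
```

Clamping to at least one worker avoids the ValueError that ThreadPoolExecutor raises for max_workers <= 0.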
Pipeline Stages
- Shielding: PromptShield protects sensitive text and extracts restrictions.
- Segmentation: SemanticSegmenter splits text into sentences and encodes them.
- Compression: SemanticCompressor selects representative sentence clusters.
- Reconstruction: PromptReconstructor rebuilds prompt text and computes metrics.
- Safety: SafetyChecker validates intent preservation and optional perplexity.
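To make the stage order concrete, here is a toy sketch of the same five steps using only the standard library. Every function below is a hypothetical stand-in: e-mail masking replaces PromptShield, punctuation splitting replaces SemanticSegmenter (no embeddings), exact-duplicate removal replaces the cluster-based SemanticCompressor, and the safety check only verifies that shielded values survive. It shows how the stages hand data to each other, not how the real modules work.

```python
import re
from typing import Dict, List, Tuple

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def shield(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace e-mail addresses with placeholders (stand-in for shielding)."""
    mapping: Dict[str, str] = {}

    def _mask(match: "re.Match[str]") -> str:
        key = f"[[P{len(mapping)}]]"
        mapping[key] = match.group(0)
        return key

    return EMAIL.sub(_mask, text), mapping


def segment(text: str) -> List[str]:
    """Split after sentence-ending punctuation (stand-in for segmentation)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def compress(sentences: List[str]) -> List[str]:
    """Drop exact duplicate sentences (stand-in for cluster selection)."""
    seen = set()
    kept = []
    for sentence in sentences:
        norm = sentence.lower()
        if norm not in seen:
            seen.add(norm)
            kept.append(sentence)
    return kept


def reconstruct(sentences: List[str], mapping: Dict[str, str]) -> str:
    """Join kept sentences and restore the shielded values."""
    text = " ".join(sentences)
    for key, original in mapping.items():
        text = text.replace(key, original)
    return text


def run_pipeline(prompt: str) -> Dict[str, object]:
    """Run the stages in order: shield, segment, compress, reconstruct, check."""
    shielded, mapping = shield(prompt)
    sentences = segment(shielded)
    kept = compress(sentences)
    rebuilt = reconstruct(kept, mapping)
    # Toy safety check: every protected value must appear in the output.
    safe = all(value in rebuilt for value in mapping.values())
    return {"text": rebuilt, "safe": safe, "kept": len(kept), "total": len(sentences)}
```

Shielding runs first so that later stages never see the protected values, and reconstruction restores them only after compression has finished.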
Parallel Execution
- Uses ThreadPoolExecutor for parallel shield and compression tasks.
- Submits _shield_with_cache and _process_single jobs concurrently.
- Collects results with as_completed().
- Ensures blocking CPU-bound operations do not stall the event loop.
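The submit-and-collect pattern described above might look like the sketch below. The functions process_single_sketch, batch_sketch, and batch_sketch_async are hypothetical stand-ins, not the service's _process_single or compress_batch_async; the per-prompt work here is just whitespace normalization. The ThreadPoolExecutor, as_completed, and run_in_executor calls are standard-library APIs.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def process_single_sketch(prompt: str) -> str:
    """Hypothetical per-prompt job; the real service runs the full pipeline."""
    return " ".join(prompt.split())


def batch_sketch(prompts: List[str], workers: int = 4) -> List[str]:
    """Submit one job per prompt and return results in input order."""
    results: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            pool.submit(process_single_sketch, prompt): index
            for index, prompt in enumerate(prompts)
        }
        for future in as_completed(futures):
            # as_completed yields in finish order, so map back to the input index.
            results[futures[future]] = future.result()
    return [results[i] for i in range(len(prompts))]


async def batch_sketch_async(prompts: List[str], workers: int = 4) -> List[str]:
    """Run the blocking batch in a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, batch_sketch, prompts, workers)
```

From synchronous code the async variant can be driven with asyncio.run(batch_sketch_async(prompts)); inside an already running loop it is awaited directly.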
Performance Characteristics
- Latency is dominated by embedding generation and LLM inference.
- Batch complexity is roughly O(N · T_stage / M), where N is the prompt count, M is the worker count, and T_stage is the per-prompt cost of a pipeline stage.
- Effective compression aggressiveness adapts based on NLI confidence and domain mode.
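As a rough worked example of the batch estimate, the sketch below treats T_stage as a fixed per-prompt time and assumes perfectly parallel workers. Both are simplifying assumptions: real runs also pay for model contention, cache hits, and scheduling overhead. The function name is hypothetical.

```python
import math


def estimate_batch_seconds(n_prompts: int, stage_seconds: float, workers: int) -> float:
    """Idealized wall time: each worker handles ceil(N / M) prompts in sequence."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return math.ceil(n_prompts / workers) * stage_seconds
```

With 100 prompts, 0.2 s per prompt, and 8 workers, the busiest worker handles 13 prompts, so the estimate is about 2.6 s, slightly above the ideal N · T_stage / M of 2.5 s because the prompts do not divide evenly across workers.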
Configuration
- NLPROXY_COMPRESSION_WORKERS controls thread pool size.
- privacy_mode toggles strict handling of protected entities.
- redis_url enables distributed semantic cache.
- models_dir defines the local model artifact directory.
Dependencies
- numpy
- redis (optional)
- sentence_transformers
- optimum.onnxruntime (for ONNX segmenter backends)
Edge Cases
- Empty prompts return a result with zero tokens and a safety alert.
- If Redis is unavailable, the service falls back to running with the semantic cache disabled.
- compress_batch_async must be called within an async event loop.
- Compression failures are retried up to configured limits in the API layer.
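The Redis fallback could be handled along these lines. This is a sketch under assumptions, not the service's code: connect_cache_sketch is a hypothetical helper, the 0.2 s timeout is arbitrary, and returning None stands in for "semantic cache disabled". It uses redis-py's Redis.from_url and ping, and tolerates the package being absent since the dependency is optional.

```python
import logging
from typing import Any, Optional

logger = logging.getLogger(__name__)


def connect_cache_sketch(redis_url: Optional[str], timeout: float = 0.2) -> Optional[Any]:
    """Return a connected Redis client, or None when the cache should stay off."""
    if not redis_url:
        return None  # no URL configured: cache is simply not enabled
    try:
        import redis  # optional dependency, imported lazily
    except ImportError:
        logger.warning("redis package not installed; semantic cache disabled")
        return None
    try:
        client = redis.Redis.from_url(
            redis_url,
            socket_connect_timeout=timeout,
            socket_timeout=timeout,
        )
        client.ping()  # force a round trip so failures surface here
        return client
    except (redis.RedisError, ValueError, OSError) as exc:
        logger.warning("Redis unavailable (%s); semantic cache disabled", exc)
        return None
```

Pinging at construction time means an unreachable server is detected once, up front, instead of on the first cache lookup during a request.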