NLProxy Service Module Reference

This document describes the compression orchestration service in service/compression.py.

Purpose

CompressionService coordinates the full prompt compression workflow, including shielding, segmentation, compression, reconstruction, and safety validation.

Primary Class

CompressionService

Responsibilities

  • Orchestrates prompt transformation across multiple core modules.
  • Executes shielding, semantic segmentation, compression, and reconstruction stages.
  • Provides thread pool parallelism for batch workloads.
  • Optionally integrates Redis-backed semantic caching.
  • Controls privacy mode and NLI refinement.

Constructor

CompressionService(
    use_cache: bool = True,
    device: Optional[str] = None,
    redis_url: Optional[str] = None,
    nli_refinement_fn: Optional = None,
    privacy_mode: bool = False,
    models_dir: Optional[Path] = None,
    llm_default_model: Optional[str] = None,
    thread_pool_workers: Optional[int] = None,
)

Key Behaviors

  • Builds a thread pool via ThreadPoolExecutor(max_workers=self.thread_pool_workers).
  • Reads the NLPROXY_COMPRESSION_WORKERS environment variable to override the default worker count.
  • Initializes PromptShield, SemanticSegmenter, SemanticCompressor, PromptReconstructor, and SafetyChecker.
  • Optionally initializes SemanticLLMCache if Redis is configured.
  • Caches shield and embedding results in memory when use_cache=True.
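The worker-count override and pool construction can be sketched as follows. This is a minimal illustration, not the service's actual code: the helper name `resolve_worker_count`, the fallback default of 4, and the precedence (explicit argument, then environment variable, then default) are all assumptions.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


def resolve_worker_count(explicit: Optional[int] = None, default: int = 4) -> int:
    """Hypothetical helper: explicit argument, then env var, then default."""
    if explicit is not None:
        return max(1, explicit)
    raw = os.environ.get("NLPROXY_COMPRESSION_WORKERS")
    if raw is not None:
        try:
            return max(1, int(raw))
        except ValueError:
            pass  # malformed value: ignore it and fall through to the default
    return default


# Build the pool the same way the service is described as doing.
with ThreadPoolExecutor(max_workers=resolve_worker_count()) as pool:
    future = pool.submit(len, "hello")
    print(future.result())
```

Clamping to at least one worker avoids the `ValueError` that `ThreadPoolExecutor` raises for `max_workers <= 0`.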

Pipeline Stages

  1. Shielding: PromptShield protects sensitive text and extracts restrictions.
  2. Segmentation: SemanticSegmenter splits text into sentences and encodes them.
  3. Compression: SemanticCompressor selects representative sentence clusters.
  4. Reconstruction: PromptReconstructor rebuilds prompt text and computes metrics.
  5. Safety: SafetyChecker validates intent preservation and, optionally, perplexity.
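The order of the five stages can be illustrated with a toy pipeline. Every function below is a hypothetical stand-in, not the NLProxy implementation: shielding only masks e-mail addresses, segmentation is a regex split with no embeddings, compression drops exact duplicate sentences instead of clustering, and the safety check only verifies that shielded values survive.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Tuple

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


@dataclass
class SketchResult:
    text: str
    original_tokens: int
    compressed_tokens: int
    safe: bool


def shield(prompt: str) -> Tuple[str, Dict[str, str]]:
    """Stage 1 stand-in: swap e-mail addresses for placeholders."""
    protected: Dict[str, str] = {}

    def swap(match: re.Match) -> str:
        key = f"__SHIELD_{len(protected)}__"
        protected[key] = match.group(0)
        return key

    return EMAIL.sub(swap, prompt), protected


def segment(text: str) -> List[str]:
    """Stage 2 stand-in: split on sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def compress(sentences: List[str]) -> List[str]:
    """Stage 3 stand-in: keep the first of each group of identical sentences."""
    seen, kept = set(), []
    for sentence in sentences:
        key = sentence.lower()
        if key not in seen:
            seen.add(key)
            kept.append(sentence)
    return kept


def reconstruct(sentences: List[str], protected: Dict[str, str]) -> str:
    """Stage 4 stand-in: rejoin sentences and restore shielded values."""
    text = " ".join(sentences)
    for key, value in protected.items():
        text = text.replace(key, value)
    return text


def run_pipeline(prompt: str) -> SketchResult:
    shielded, protected = shield(prompt)
    kept = compress(segment(shielded))
    text = reconstruct(kept, protected)
    # Stage 5 stand-in: every shielded value must appear in the output.
    safe = all(value in text for value in protected.values())
    # Whitespace-separated words serve as a crude token count.
    return SketchResult(text, len(prompt.split()), len(text.split()), safe)


result = run_pipeline(
    "Email bob@example.com today. Keep it short. Keep it short. "
    "Do not share the address."
)
print(result)
```

Shielding runs first so that later stages never see the protected text, and reconstruction restores it only after compression has finished.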

Parallel Execution

  • Uses ThreadPoolExecutor for parallel shield and compression tasks.
  • Submits _shield_with_cache and _process_single jobs concurrently.
  • Collects results with as_completed().
  • Offloads blocking, CPU-bound operations to worker threads so they do not stall the event loop.
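A minimal sketch of this submit-and-collect pattern, using only the standard library. The functions `process_single`, `compress_batch`, and `compress_batch_async_sketch` are hypothetical stand-ins written for illustration. They are not the service's `_process_single` or `compress_batch_async`, and the real work per prompt is far heavier than whitespace normalization.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Dict, List


def process_single(prompt: str) -> str:
    """Hypothetical stand-in for the blocking per-prompt work."""
    return " ".join(prompt.split())


def compress_batch(prompts: List[str], workers: int = 4) -> List[str]:
    """Submit one job per prompt and collect results as they finish."""
    results: Dict[int, str] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its input index, because as_completed()
        # yields futures in completion order, not submission order.
        futures = {pool.submit(process_single, p): i for i, p in enumerate(prompts)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return [results[i] for i in range(len(prompts))]


async def compress_batch_async_sketch(prompts: List[str], workers: int = 4) -> List[str]:
    """Run the blocking batch in the default executor so the loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, compress_batch, prompts, workers)


print(compress_batch(["a   b", "  c  "]))
print(asyncio.run(compress_batch_async_sketch(["x    y"])))
```

Keying results by input index restores the caller's ordering, which `as_completed()` does not preserve.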

Performance Characteristics

  • Latency is dominated by embedding generation and LLM inference.
  • Batch complexity is roughly O(N · T_stage / M), where N is the prompt count, M is the worker count, and T_stage is the per-prompt time of a pipeline stage.
  • Effective compression aggressiveness adapts based on NLI confidence and domain mode.
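As a rough worked example of the batch estimate, the sketch below treats the batch as running in waves of M prompts. It is an idealized model under stated assumptions: uniform per-prompt time, no cache hits, and no contention between threads. The function name and the numbers are illustrative only.

```python
import math


def estimate_batch_ms(n_prompts: int, stage_ms: int, workers: int) -> int:
    """Idealized wall-clock estimate: ceil(N / M) waves of T_stage each."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    waves = math.ceil(n_prompts / workers)
    return waves * stage_ms


# 100 prompts at 200 ms each on 8 workers: ceil(100 / 8) = 13 waves.
print(estimate_batch_ms(100, 200, 8))  # 2600
# The same batch on a single worker runs fully serially.
print(estimate_batch_ms(100, 200, 1))  # 20000
```

In practice, Python threads only speed up the portions of a stage that release the GIL, such as native inference or I/O, so measured speedups may fall short of this estimate.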

Configuration

  • NLPROXY_COMPRESSION_WORKERS controls thread pool size.
  • privacy_mode toggles strict handling of protected entities.
  • redis_url enables distributed semantic cache.
  • models_dir defines the local model artifact directory.

Dependencies

  • numpy
  • redis (optional)
  • sentence_transformers
  • optimum.onnxruntime (for ONNX segmenter backends)

Edge Cases

  • Empty prompts return a result with zero tokens and a safety alert.
  • If Redis is unavailable, the service falls back to running with the semantic cache disabled.
  • compress_batch_async must be awaited from within a running async event loop.
  • Compression failures are retried in the API layer, up to the configured limits.
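The Redis fallback can be sketched as below. This is an assumption-laden illustration, not the service's code: `init_semantic_cache` and its `connect` parameter are hypothetical, and the client factory is injected so the example runs without a Redis server. With the redis-py package, `redis.Redis.from_url` could plausibly be passed as `connect`, since the returned client exposes `ping()`.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("nlproxy.sketch")


def init_semantic_cache(
    redis_url: Optional[str],
    connect: Callable[[str], Any],
) -> Optional[Any]:
    """Return a live cache client, or None when caching should be disabled."""
    if not redis_url:
        return None  # Redis not configured: semantic cache stays off
    try:
        client = connect(redis_url)
        client.ping()  # fail fast if the server is unreachable
        return client
    except Exception as exc:  # broad on purpose: any failure disables the cache
        logger.warning("Semantic cache disabled: %s", exc)
        return None


class _UnreachableClient:
    """Hypothetical client that simulates a Redis outage."""

    def ping(self) -> bool:
        raise ConnectionError("redis is down")


cache = init_semantic_cache("redis://localhost:6379/0", lambda url: _UnreachableClient())
print(cache is None)
```

Returning `None` rather than raising lets the rest of the pipeline treat the cache as optional with a single check.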