Implemented robust type handling and tokenizer exception safety

#2
by sk16er - opened

Description

This PR addresses critical deserialization errors, type-mismatch vulnerabilities, and low-level boundary crashes within the custom tiktoken backend.

Key Changes

  • Configuration Serialization Alignment: Added explicit integer coercion for added_tokens_decoder keys loaded via JSON configuration files. This eliminates runtime KeyError exceptions on special token bounds (such as [BOS]) caused by unmapped stringified dictionary keys.
  • Upstream Type Coercion: Implemented conditional casting (.tolist()) for inputs passed to the encoder/decoder interfaces. This ensures native Python integer sequences are supplied to the underlying Rust binary, completely mitigating PyO3 serialization panics when handling PyTorch or NumPy tensors.
  • Robust Exception Handling: Wrapped the core BPE byte decoding routine in defensive exception boundaries, providing fallback paths for corrupt or out-of-bounds byte tokens to prevent terminal thread panics.
Ready to merge
This branch is ready to get merged automatically.

Sign up or log in to comment