Implemented robust type handling and tokenizer exception safety
#2
by sk16er - opened
Description
This PR addresses critical deserialization errors, type-mismatch vulnerabilities, and low-level boundary crashes within the custom tiktoken backend.
Key Changes
- Configuration Serialization Alignment: Added explicit integer coercion for
added_tokens_decoderkeys loaded via JSON configuration files. This eliminates runtimeKeyErrorexceptions on special token bounds (such as[BOS]) caused by unmapped stringified dictionary keys. - Upstream Type Coercion: Implemented conditional casting (
.tolist()) for inputs passed to the encoder/decoder interfaces. This ensures native Python integer sequences are supplied to the underlying Rust binary, completely mitigating PyO3 serialization panics when handling PyTorch or NumPy tensors. - Robust Exception Handling: Wrapped the core BPE byte decoding routine in defensive exception boundaries, providing fallback paths for corrupt or out-of-bounds byte tokens to prevent terminal thread panics.