kashif HF Staff committed on
Commit 54d1dc0 · verified · 1 Parent(s): 19273a6

tokenizer: expose .vocab property for fast-tokenizer-style callers

Adds a `vocab` property mirroring `get_vocab()` so downstream tools that expect the fast-tokenizer interface (e.g. llama.cpp's `convert_hf_to_gguf.py` which does `tokenizer.vocab`) work without a fallback. No behavior change.
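
As a rough illustration of the caller pattern this targets, the sketch below uses a hypothetical stand-in class (`ToyTokenizer`, not part of this repository or of transformers) and a hypothetical `build_reverse_vocab` helper that reads `.vocab` as an attribute rather than calling `get_vocab()`. It is only meant to show why a class exposing `get_vocab()` alone would fail for such a caller.

```python
from typing import Dict


class ToyTokenizer:
    """Hypothetical stand-in for a slow tokenizer; not the real HybridDNATokenizer."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self._vocab = dict(vocab)

    def get_vocab(self) -> Dict[str, int]:
        # Method-style accessor, as in the diff.
        return self._vocab.copy()

    @property
    def vocab(self) -> Dict[str, int]:
        # Attribute-style accessor added by this commit.
        return self._vocab


def build_reverse_vocab(tokenizer) -> Dict[int, str]:
    # Hypothetical downstream helper: reads `.vocab` directly, so it raises
    # AttributeError on a tokenizer that only defines get_vocab().
    return {token_id: token for token, token_id in tokenizer.vocab.items()}


tok = ToyTokenizer({"<pad>": 0, "A": 1, "C": 2})
reverse = build_reverse_vocab(tok)
```

With the property in place, the helper works unchanged and both accessors report the same mapping.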

Files changed (1)
  1. tokenizer.py +8 -0
tokenizer.py CHANGED
@@ -144,6 +144,14 @@ class HybridDNATokenizer(PreTrainedTokenizer):
     def get_vocab(self) -> Dict[str, int]:
         return self._vocab.copy()
 
+    @property
+    def vocab(self) -> Dict[str, int]:
+        # Compatibility shim: fast tokenizers (PreTrainedTokenizerFast) expose
+        # `tokenizer.vocab` as a property; slow PreTrainedTokenizer subclasses
+        # like this one only expose `get_vocab()`. Some downstream tools
+        # (e.g. llama.cpp's convert_hf_to_gguf.py) read `.vocab` directly.
+        return self._vocab
+
     def __len__(self):
         # Override default (len(get_vocab())) because get_vocab() deduplicates
         # CCCCCC which exists as both BPE (ID 91443) and DNA 6-mer (ID 154402).
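
One detail visible in the diff: `get_vocab()` returns `self._vocab.copy()`, while the new property returns `self._vocab` itself. The sketch below uses a hypothetical `SketchTokenizer` stand-in (not the real class) to show what that difference would mean for a caller that mutates the returned dict. Whether any real caller does so is not stated in the commit.

```python
from typing import Dict


class SketchTokenizer:
    """Hypothetical stand-in mirroring only the two accessors from the diff."""

    def __init__(self, vocab: Dict[str, int]) -> None:
        self._vocab = dict(vocab)

    def get_vocab(self) -> Dict[str, int]:
        return self._vocab.copy()  # fresh dict on every call

    @property
    def vocab(self) -> Dict[str, int]:
        return self._vocab  # the tokenizer's own dict, not a copy


tok = SketchTokenizer({"A": 0, "C": 1})

copied = tok.get_vocab()
copied["G"] = 2  # changes only the caller's copy
size_after_copy_edit = len(tok.get_vocab())

live = tok.vocab
live["T"] = 3  # changes the tokenizer's internal mapping
size_after_live_edit = len(tok.get_vocab())
```

Callers that only read the mapping, for example by iterating `.items()`, see the same contents through either accessor.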