Prose2Tags 4B
#2256
by redaihf - opened
Weird, but converting Phi4-Mini-Prose2Tags-4B fails with:
`ValueError: Error: Missing Phi4-Mini-Prose2Tags-4B/tokenizer.model`
5 days ago
llama.cpp HF-to-GGUF conversion error?
The `Phi3MiniModel` class in the conversion script is hardcoded to look for a `tokenizer_class` string of exactly `'GPT2Tokenizer'`.
In Phi-4, that field in `tokenizer_config.json` might instead be `LlamaTokenizer`, `TiktokenTokenizer`, or even just `PreTrainedTokenizer`, so the script skips the `_set_vocab_gpt2()` branch and falls straight through to the `ValueError`.
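To see why the strict equality check falls through, here is a minimal standalone sketch (the function name and JSON snippets are hypothetical, but the comparison mirrors the script's logic):

```python
import json

def picks_gpt2_branch(config_text: str) -> bool:
    """Mirror of the script's check: only the exact string 'GPT2Tokenizer' matches."""
    tokenizer_class = json.loads(config_text)['tokenizer_class']
    return tokenizer_class == 'GPT2Tokenizer'

print(picks_gpt2_branch('{"tokenizer_class": "GPT2Tokenizer"}'))   # True  -> GPT2 branch taken
print(picks_gpt2_branch('{"tokenizer_class": "LlamaTokenizer"}'))  # False -> falls into the ValueError path
```

Any other class name, however close, fails the comparison and sends the converter hunting for a `tokenizer.model` that Phi-4 repos don't ship.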
Fix
In the script, replace this:

```python
class Phi3MiniModel(TextModel):
    model_arch = gguf.MODEL_ARCH.PHI3

    def set_vocab(self):
        # Phi-4 model uses GPT2Tokenizer
        tokenizer_config_file = self.dir_model / 'tokenizer_config.json'
        if tokenizer_config_file.is_file():
            with open(tokenizer_config_file, "r", encoding="utf-8") as f:
                tokenizer_config_json = json.load(f)
                tokenizer_class = tokenizer_config_json['tokenizer_class']
                if tokenizer_class == 'GPT2Tokenizer':
                    return self._set_vocab_gpt2()

        from sentencepiece import SentencePieceProcessor
```
with this:

```python
    def set_vocab(self):
        # NEW LOGIC: If we have a tokenizer.json, use GPT2/HFFT logic immediately.
        # This bypasses the need for the elusive tokenizer.model.
        if (self.dir_model / 'tokenizer.json').is_file():
            print("Detected tokenizer.json. Using GPT2/BPE vocabulary logic for Phi-4.")
            return self._set_vocab_gpt2()

        # Fallback for older Phi-3 models that actually use SentencePiece
        tokenizer_path = self.dir_model / 'tokenizer.model'
        if not tokenizer_path.is_file():
            raise ValueError(f'Error: Missing {tokenizer_path}. If this is a Phi-4 model, ensure tokenizer.json is in the folder.')

        from sentencepiece import SentencePieceProcessor
```
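The file-presence priority the patch introduces can be exercised in isolation like this (a sketch; `vocab_strategy` is a hypothetical stand-in for `set_vocab`, returning a label instead of building the vocab):

```python
from pathlib import Path
import tempfile

def vocab_strategy(dir_model: Path) -> str:
    # tokenizer.json present -> take the GPT2/BPE route, no tokenizer.model needed
    if (dir_model / 'tokenizer.json').is_file():
        return 'gpt2-bpe'
    # legacy Phi-3 route: SentencePiece model file required
    if (dir_model / 'tokenizer.model').is_file():
        return 'sentencepiece'
    raise ValueError(f"Error: Missing {dir_model / 'tokenizer.model'}")

# A Phi-4-style folder (tokenizer.json only) now converts cleanly:
with tempfile.TemporaryDirectory() as d:
    (Path(d) / 'tokenizer.json').write_text('{}')
    print(vocab_strategy(Path(d)))  # gpt2-bpe
```

Checking `tokenizer.json` first means the converter never even looks for `tokenizer.model` on Phi-4 repos, while older Phi-3 folders still go through SentencePiece.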
Sorry, we work only with mainline llama.cpp and do not allow modifications/forks =(
Static quants are available here: USS-Inferprise/Phi4-Mini-Prose2Tags-4B-GGUF