---
license: mit
library_name: mlx
pipeline_tag: text-generation
tags:
- transformers
- mlx
base_model: meituan-longcat/LongCat-Flash-Chat
---

# TOTORONG/LongCat-Flash-3.5bits

This model [TOTORONG/LongCat-Flash-3.5bits](https://huggingface.co/TOTORONG/LongCat-Flash-3.5bits) was converted to MLX format from [meituan-longcat/LongCat-Flash-Chat](https://huggingface.co/meituan-longcat/LongCat-Flash-Chat) using mlx-lm version **0.27.1**.

## Quantization details

The model is quantized to an average of 3.516 bits per weight so that it fits on an M3 Ultra with 256 GB of unified memory.

**Selected layers (the precision-bump mask).** A layer is considered early, late, or periodic if its index `i` (from `model.layers.i`) satisfies any of:

- `i < num_layers // 8`, or
- `i >= 7 * num_layers // 8`, or
- `(i - num_layers // 8) % 3 == 2`

These layers receive a precision bump:

- Q/K/V projections: 3b → 4b
- O projection: 4b → 6b
- Experts (`.mlps..*`): 2b → 3b

The switch MLP remains at 3b across all layers.

This mask preserves prompt sensitivity (front layers) and output stability (tail layers), with a periodic boost in between to reduce worst-case error accumulation. A code sketch of this mask appears at the end of this card.

## Use with mlx

```bash
pip install mlx-lm
```

```python
from mlx_lm import load, generate

model, tokenizer = load("TOTORONG/LongCat-Flash-3.5bits")

prompt = "hello"

if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True
    )

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
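You can also generate from the command line; this assumes the `mlx_lm.generate` console script that ships with mlx-lm:

```bash
# One-off generation from the command line (console script from mlx-lm).
mlx_lm.generate --model TOTORONG/LongCat-Flash-3.5bits --prompt "hello"
```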
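## Precision mask sketch

For reference, here is a minimal sketch of the layer-selection logic described under "Quantization details". The function names and module-path substrings are illustrative assumptions; only the index arithmetic and bit widths come from the notes above.

```python
# Illustrative sketch of the precision-bump mask described in this card.
# Function names and module-path substrings are assumptions; the index
# arithmetic and bit widths follow the "Quantization details" section.

def is_bumped(i: int, num_layers: int) -> bool:
    """True if layer i falls in the early/late/periodic set."""
    front = num_layers // 8
    tail = 7 * num_layers // 8
    return i < front or i >= tail or (i - front) % 3 == 2

def bits_for(i: int, path: str, num_layers: int) -> int:
    """Bits per weight for a module `path` inside model.layers.{i}."""
    bumped = is_bumped(i, num_layers)
    if any(p in path for p in ("q_proj", "k_proj", "v_proj")):
        return 4 if bumped else 3      # Q/K/V: 3b -> 4b on bumped layers
    if "o_proj" in path:
        return 6 if bumped else 4      # O-proj: 4b -> 6b on bumped layers
    if ".mlps." in path:
        return 3 if bumped else 2      # experts: 2b -> 3b on bumped layers
    return 3                           # switch MLP stays at 3b everywhere

# Example: count bumped layers in a hypothetical 64-layer model.
print(sum(is_bumped(i, 64) for i in range(64)))
```

If you were reproducing the conversion, a predicate like `bits_for` could be adapted to the `quant_predicate` hook that recent versions of mlx-lm expose on `mlx_lm.convert` (not shown here).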