geetu040 committed on
Commit f3a92fc · verified · 1 Parent(s): f8acd96

Add files using upload-large-folder tool

Files changed (1)
  1. README.md +6 -2
README.md CHANGED
@@ -55,12 +55,12 @@ The checkpoint configuration records the following quantization settings:
  "quant_method": "bitsandbytes",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_quant_storage": "uint8",
- "bnb_4bit_compute_dtype": "float32",
+ "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_use_double_quant": false
  }
  ```
 
- The model config also sets `use_cache` to `false`, matching the local quantized checkpoint.
+ The model config also sets `use_cache` to `true`, matching the local quantized checkpoint. For lower memory usage during generation, set `model.config.use_cache = False` after loading the model.
 
  ## Quickstart
 
@@ -82,6 +82,10 @@ model = AutoModelForCausalLM.from_pretrained(
  device_map="auto",
  )
 
+ # Optional: reduce generation memory usage by disabling the KV cache.
+ # This can be useful on smaller GPUs or for longer lookback windows.
+ model.config.use_cache = False
+
  batch_size, lookback_length = 1, 2880
  seqs = torch.randn(batch_size, lookback_length).to(model.device)
 
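
For orientation, the quantization settings on the new side of the first hunk could be mapped onto keyword arguments for `transformers.BitsAndBytesConfig` roughly as sketched below. This is an illustrative sketch, not part of the commit: the helper names `to_bnb_kwargs` and `build_quantization_config` are hypothetical, and `load_in_4bit=True` is an assumption, since the hunk starts mid-object and does not show that key. Recent versions of `transformers` accept string dtype names for the `bnb_4bit_*` dtype arguments; a checkpoint that already stores these settings typically does not need the config passed explicitly.

```python
import json

# Settings visible on the new side of the first hunk (README.md lines 55-59).
RECORDED_SETTINGS = json.loads("""
{
  "quant_method": "bitsandbytes",
  "bnb_4bit_quant_type": "fp4",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_use_double_quant": false
}
""")


def to_bnb_kwargs(settings):
    """Hypothetical helper: turn recorded settings into BitsAndBytesConfig kwargs."""
    if settings.get("quant_method") != "bitsandbytes":
        raise ValueError("expected a bitsandbytes quantization config")
    # Keep only the 4-bit options that the README excerpt records.
    kwargs = {k: v for k, v in settings.items() if k.startswith("bnb_4bit_")}
    # Assumption: the excerpt does not show this key, but the bnb_4bit_*
    # options only apply to 4-bit loading.
    kwargs["load_in_4bit"] = True
    return kwargs


def build_quantization_config(settings):
    """Hypothetical helper: build the config object (requires transformers and torch)."""
    from transformers import BitsAndBytesConfig  # imported lazily on purpose

    return BitsAndBytesConfig(**to_bnb_kwargs(settings))
```

Calling `to_bnb_kwargs(RECORDED_SETTINGS)` needs only the standard library, while `build_quantization_config` is kept separate so that the heavier `transformers` import happens only when a config object is actually wanted.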