---
library_name: transformers
tags: []
---
# MNLP_M3_quantized_model
This model is a quantized version of the best-performing MCQA model from our CS-552 Modern NLP project (Milestone 3). It was optimized for efficient inference while maintaining strong accuracy on STEM multiple-choice question answering (MCQA) tasks.
## Model Summary
- **Base model**: [hssawhney/Best-Performing-Model](https://huggingface.co/hssawhney/Best-Performing-Model)
- **Quantization type**: Post-Training Quantization (PTQ)
- **Precision**: W8A8
- **Method**: SmoothQuant + GPTQ via [LLMCompressor](https://github.com/vllm-project/llm-compressor)
- **Excluded layers**: `lm_head` (to preserve logits quality)
- **Final model size**: ~717 MB
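To make the W8A8 precision concrete: both weights and activations are stored as signed 8-bit integers, each tensor carrying a floating-point scale. The sketch below illustrates plain symmetric per-tensor int8 round-trip quantization; it is a simplified stand-in, not the SmoothQuant + GPTQ pipeline actually used, which additionally migrates activation outliers into weights and minimizes layer-wise reconstruction error.

```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor int8 quantization: q = round(x / scale)."""
    scale = float(np.abs(x).max()) / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map int8 values back to float32 using the stored scale."""
    return q.astype(np.float32) * scale

# Toy weight tensor; values chosen so the round trip is exact.
w = np.array([0.5, -1.27, 0.0, 1.27], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize_int8(q, s)
```

In the real model, the `lm_head` is excluded from this process so the final logits are computed at full precision.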
## Calibration Details
- **Calibration dataset**: 512 samples randomly selected from [`zay25/MNLP_M3_quantized_dataset`](https://huggingface.co/datasets/zay25/MNLP_M3_quantized_dataset)
- The calibration set preserves the original format (STEM MCQA) and was selected to represent a broad distribution of question types.
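The calibration draw itself is simple: 512 examples sampled uniformly at random from the dataset. A minimal sketch, using a hypothetical in-memory pool in place of the actual `zay25/MNLP_M3_quantized_dataset` download:

```python
import random

NUM_CALIBRATION_SAMPLES = 512

# Hypothetical pool standing in for the MCQA calibration dataset;
# each record keeps the original question/choices format.
pool = [
    {"question": f"Question {i}", "choices": ["A", "B", "C", "D"]}
    for i in range(10_000)
]

random.seed(0)  # fix the seed so the calibration split is reproducible
calibration_set = random.sample(pool, NUM_CALIBRATION_SAMPLES)
```

Keeping the calibration samples in the same MCQA format as the evaluation data matters: PTQ scales are fitted to the activation statistics of whatever text passes through the model during calibration.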
## Intended Use
This model is intended for:
- STEM-focused multiple-choice question answering
- Educational assistant systems
- Low-resource inference environments (e.g., CPU, edge devices)
It is not intended for free-form generation or for use outside the MCQA format.
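Because the model expects MCQA-style inputs, prompts should present the question followed by lettered options. A minimal formatting sketch; the exact A/B/C/D template below is an assumption for illustration, not the verbatim template used during training:

```python
def format_mcqa(question: str, choices: list[str]) -> str:
    """Render a question and its options as a single MCQA prompt.

    NOTE: the lettered layout and trailing "Answer:" cue are an
    assumed convention, not the model's confirmed prompt template.
    """
    letters = "ABCD"
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    lines.append("Answer:")
    return "\n".join(lines)

prompt = format_mcqa(
    "What is the SI unit of force?",
    ["Joule", "Newton", "Pascal", "Watt"],
)
```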
## License
This model inherits the license of the base model; see the [hssawhney/Best-Performing-Model](https://huggingface.co/hssawhney/Best-Performing-Model) repository for the license terms.