---
license: apache-2.0
language:
- en
base_model: Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
tags:
- qwen3
- sft
- general-knowledge
- multiple-choice
- cs-552
datasets:
- cais/mmlu
metrics:
- accuracy
---

# General Knowledge Model

This model is a fine-tuned version of [`Qwen/Qwen3-1.7B`](https://huggingface.co/Qwen/Qwen3-1.7B) for the CS-552 Modern NLP course project.

The model targets the **General Knowledge** benchmark, where it answers closed-book multiple-choice factual and reasoning questions. It was trained to return the final answer as a single option letter inside a LaTeX `\boxed{}` expression.

## Intended output format

The model should produce answers in the following format:

```text
\boxed{C}
```

Anything outside `\boxed{}` is treated as reasoning and is not used for scoring by the evaluation pipeline.

## Training procedure

This checkpoint was trained using **Supervised Fine-Tuning (SFT)** with LoRA on top of `Qwen/Qwen3-1.7B`.

The SFT data was formatted as instruction-style multiple-choice examples:

```text
Q: ...
A) ...
B) ...
C) ...
D) ...
Answer: \boxed{C}
```

The current checkpoint was trained on a processed General Knowledge dataset derived from MMLU-style multiple-choice examples.

## Model behavior

The model is optimized for:

- closed-book factual question answering
- multiple-choice reasoning
- final-answer extraction through `\boxed{}`
- concise option-letter responses

The tokenizer chat template was configured with non-thinking mode to encourage concise answers.

## Local validation

On the provided General Knowledge validation snapshot from the course starter repository, this checkpoint achieved:

- Extraction rate: `10/10`
- Accuracy: `6/10`

These validation samples are only a small sanity-check set and are not the hidden evaluation benchmark.

## Framework versions

- Transformers
- PEFT
- PyTorch
- Datasets
- Hugging Face Hub

## Limitations

This is an intermediate SFT baseline, not the final model.
It was trained mainly to establish a working General Knowledge pipeline and verify that the model can produce extractable boxed answers. Performance may vary on broader or harder factual reasoning tasks.
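
## Example: parsing boxed answers

The prompt layout and the `\boxed{}` convention described above can be illustrated with a small sketch. It is not the course evaluation pipeline: the helper names (`format_example`, `extract_boxed_letter`, `score`) are hypothetical, and taking the last `\boxed{}` match in an output is an assumption about how extraction might work. Extraction rate and accuracy are computed as simple counts, in the spirit of the `10/10` and `6/10` figures reported under Local validation.

```python
import re

# Assumption: an answer is a single capital letter inside \boxed{...}.
BOXED_RE = re.compile(r"\\boxed\{\s*([A-Z])\s*\}")


def format_example(question, options):
    """Build a prompt in the 'Q: ... A) ... Answer:' layout (hypothetical helper)."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [f"Q: {question}"]
    lines += [f"{letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


def extract_boxed_letter(text):
    """Return the letter from the last boxed expression, or None if there is none."""
    matches = BOXED_RE.findall(text)
    # Assumption: if several boxed answers appear, the last one is the final answer.
    return matches[-1] if matches else None


def score(outputs, gold):
    """Return (extracted_count, correct_count) for paired model outputs and gold letters."""
    preds = [extract_boxed_letter(o) for o in outputs]
    extracted = sum(p is not None for p in preds)
    correct = sum(p == g for p, g in zip(preds, gold))
    return extracted, correct
```

Dividing each count by the number of samples gives the extraction rate and accuracy. An output with no boxed letter counts as not extracted and as incorrect.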