TuanNguyen2003's picture
Create README.md
515e131 verified
metadata
license: apache-2.0
language:
  - en
base_model: Qwen/Qwen3-1.7B
pipeline_tag: text-generation
library_name: transformers
tags:
  - qwen3
  - sft
  - general-knowledge
  - multiple-choice
  - cs-552
datasets:
  - cais/mmlu
metrics:
  - accuracy

General Knowledge Model

This model is a fine-tuned version of Qwen/Qwen3-1.7B for the CS-552 Modern NLP course project.

The model targets the General Knowledge benchmark, where it answers closed-book multiple-choice factual and reasoning questions. It was trained to return the final answer as a single option letter inside a LaTeX \boxed{} expression.

Intended output format

The model should produce answers in the following format:

\boxed{C}

Anything outside \boxed{} is treated as reasoning and is not used for scoring by the evaluation pipeline.

Training procedure

This checkpoint was trained using Supervised Fine-Tuning (SFT) with LoRA on top of Qwen/Qwen3-1.7B.

The SFT data was formatted as instruction-style multiple-choice examples:

Q: ...
A) ...
B) ...
C) ...
D) ...

Answer: \boxed{C}

The current checkpoint was trained on a processed General Knowledge dataset derived from MMLU-style multiple-choice examples.

Model behavior

The model is optimized for:

  • closed-book factual question answering
  • multiple-choice reasoning
  • final-answer extraction through \boxed{}
  • concise option-letter responses

The tokenizer chat template was configured with non-thinking mode to encourage concise answers.

Local validation

On the provided General Knowledge validation snapshot from the course starter repository, this checkpoint achieved:

  • Extraction rate: 10/10
  • Accuracy: 6/10

These validation samples are only a small sanity-check set and are not the hidden evaluation benchmark.

Framework versions

  • Transformers
  • PEFT
  • PyTorch
  • Datasets
  • Hugging Face Hub

Limitations

This is an intermediate SFT baseline, not the final model. It was trained mainly to establish a working General Knowledge pipeline and verify that the model can produce extractable boxed answers. Performance may vary on broader or harder factual reasoning tasks.