---
language:
- ko
base_model:
- mistralai/Mistral-Nemo-Instruct-2407
pipeline_tag: text-generation
---
## Usage Guide
Individuals may use this model freely.
Companies and institutions are asked to use it for non-commercial purposes only.
To facilitate future collaboration and network building, please email us your institution's information and the contact details of the person responsible for using the AI model, and we will get in touch.
## CONTACT : kistep_ax@kistep.re.kr
### 1. Description
**SPARK-Report** is a specialized report-writing model developed by the Korea Institute of S&T Evaluation and Planning (KISTEP). The model is trained to generate reports in two steps: it first creates a table of contents, then generates the main content.
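The two-step flow can be sketched as below. This is a minimal illustration, not KISTEP's pipeline: `generate` stands for any function that sends a prompt to the model and returns text, and `write_report`, `build_toc_prompt`, `build_section_prompt`, and the one-heading-per-line parsing of the table of contents are all hypothetical choices made for this example.

```python
from typing import Callable, Dict, List


def write_report(
    title: str,
    keywords: str,
    length: str,
    generate: Callable[[str], str],
    build_toc_prompt: Callable[[str, str, str], str],
    build_section_prompt: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Run the two steps: table of contents first, then one call per section."""
    # Step 1: ask the model for a table of contents.
    toc = generate(build_toc_prompt(title, keywords, length))

    # Assumption: the model returns one heading per non-empty line.
    sections: List[str] = [line.strip() for line in toc.splitlines() if line.strip()]

    # Step 2: generate each section, passing the full outline as context.
    report: Dict[str, str] = {}
    for section in sections:
        report[section] = generate(build_section_prompt(title, toc, section))
    return report


# Stand-in for the real model so the sketch runs without downloading weights.
def fake_generate(prompt: str) -> str:
    if prompt.startswith("TOC"):
        return "1. Introduction\n2. Body\n3. Conclusion"
    return "text for: " + prompt.splitlines()[-1]


report = write_report(
    title="AI policy trends",
    keywords="AI, policy",
    length="5 pages",
    generate=fake_generate,
    build_toc_prompt=lambda t, k, n: f"TOC\n{t}\n{k}\n{n}",
    build_section_prompt=lambda t, toc, s: f"SECTION\n{t}\n{toc}\n{s}",
)
print(list(report))
```

In practice `generate` would wrap the model call shown in the Usage section, and the two prompt builders would fill the recommended templates.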
### 2. Key Features
- **Content Structure:** Generates report outlines based on title, keywords, and desired length
- **Writing Styles:** Capable of producing reports in both descriptive and bullet-point formats
- **Structured Output:** Delivers well-formatted content for enhanced readability
- **Base Model:** Built on Mistral-Nemo-Instruct-2407 (`mistralai/Mistral-Nemo-Instruct-2407`)
- **Training Method:** Trained with Supervised Fine-Tuning (SFT)
- **Context Length:** Training samples have a maximum context length of 16,384 tokens
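Because training samples were capped at a context length of 16,384 (presumably tokens), it may help to check the prompt size before generation. The sketch below is one possible guard, not part of the model's tooling: `fits_context` and `trim_documents` are hypothetical helpers, and `count_tokens` is any callable that returns a token count, for example `lambda s: len(tokenizer(s)["input_ids"])` with a loaded tokenizer. This card does not state how the model behaves beyond that length.

```python
from typing import Callable, List

MAX_CONTEXT_TOKENS = 16_384  # maximum training context length stated above


def fits_context(prompt_tokens: int, max_new_tokens: int,
                 limit: int = MAX_CONTEXT_TOKENS) -> bool:
    """True if the prompt plus the requested output stays within the limit."""
    return prompt_tokens + max_new_tokens <= limit


def trim_documents(docs: List[str], budget: int,
                   count_tokens: Callable[[str], int]) -> List[str]:
    """Keep documents in order until the token budget would be exceeded."""
    kept: List[str] = []
    used = 0
    for doc in docs:
        cost = count_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept


# Whitespace splitting stands in for a real tokenizer in this example.
count = lambda text: len(text.split())
docs = ["a b c", "d e", "f g h i"]
print(fits_context(prompt_tokens=16_000, max_new_tokens=512))  # False
print(trim_documents(docs, budget=5, count_tokens=count))      # ['a b c', 'd e']
```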
### 3. Data
| Source | Count |
|---|---|
| KISTEP documents | 31,058 |
### 4. Usage
- With Ollama, you can use the **Modelfile**.
- Python code
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "kistepAI/SPARK-Report"

# Load the tokenizer and the model in bfloat16, placing weights automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

messages = [
    {"role": "user", "content": "안녕하세요."}  # "Hello." in Korean
]

# Apply the chat template and move the token IDs to the model's device.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Stop generation at the end-of-sequence token.
terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("</s>"),
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.3,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
- Recommended Prompt Template (Table of Contents)
`(input: {TITLE}, {KEYWORDS}, {LENGTH})`
```yaml
# Korean prompt. English gloss: "You are an expert at generating report tables
# of contents. Based on the given report title and keywords, generate a
# systematic table of contents." Inputs: title, keywords, length. Guidelines:
# head1 is introduction/body/conclusion; 3-4 head2 items per head1; use the
# keywords for distinctive wording; avoid time expressions and special
# characters; stay consistent with the title; use a terse, noun-ending style.
prompt_template: |
  당신은 보고서 목차 생성 전문가입니다. 주어진 보고서 제목과 키워드를 바탕으로 체계적인 목차를 생성해 주세요.
  # 입력정보
  - 제목: {TITLE}
  - 키워드: {KEYWORDS}
  - 분량: {LENGTH}
  # 목차 작성 지침
  - 기본구조(head1)는 서론, 본론, 결론으로 구성
  - 각 head1 항목 별로 3-4개의 상세 목차(head2) 생성
  - 키워드를 적극 활용하여 특징적인 표현 사용
  - 시간 표현이나 특수문자 사용 지양
  - 제목 연관성과 전체 일관성 유지
  - 개조식 문체와 명사형 종결어미 사용
```
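As a minimal sketch of how the table-of-contents template might be filled in, the example below uses plain `str.format`. The helper name `build_toc_messages`, the comma-separated keyword format, and the sample inputs are assumptions made for illustration; this card does not specify the expected format of `{KEYWORDS}` or `{LENGTH}`.

```python
from typing import Dict, List

# Table-of-contents template from this card, with {TITLE}, {KEYWORDS}, {LENGTH} slots.
TOC_TEMPLATE = """\
당신은 보고서 목차 생성 전문가입니다. 주어진 보고서 제목과 키워드를 바탕으로 체계적인 목차를 생성해 주세요.
# 입력정보
- 제목: {TITLE}
- 키워드: {KEYWORDS}
- 분량: {LENGTH}
# 목차 작성 지침
- 기본구조(head1)는 서론, 본론, 결론으로 구성
- 각 head1 항목 별로 3-4개의 상세 목차(head2) 생성
- 키워드를 적극 활용하여 특징적인 표현 사용
- 시간 표현이나 특수문자 사용 지양
- 제목 연관성과 전체 일관성 유지
- 개조식 문체와 명사형 종결어미 사용"""


def build_toc_messages(title: str, keywords: List[str], length: str) -> List[Dict[str, str]]:
    """Fill the template and wrap it as a single user message."""
    prompt = TOC_TEMPLATE.format(
        TITLE=title,
        KEYWORDS=", ".join(keywords),  # assumption: comma-separated keywords
        LENGTH=length,                 # assumption: free-text length such as "10쪽"
    )
    return [{"role": "user", "content": prompt}]


# Hypothetical inputs: "AI policy trends", keywords "AI" / "policy", "10 pages".
messages = build_toc_messages("인공지능 정책 동향", ["인공지능", "정책"], "10쪽")
print(messages[0]["content"])
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in the same way as the greeting in the Python code above.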
- Recommended Prompt Template (Main Content)
`(input: {TITLE}, {SECTIONS}, {SECTION}, {TYPE}, {DOCUMENTS})`
```yaml
# Korean prompt. English gloss: "You are a report-writing expert. The report
# title is {TITLE} and the full table of contents is: {SECTIONS}. The section
# to write now is {SECTION}, in the {TYPE} style."
# Main guidelines: answer using only the information in {DOCUMENTS}; before the
# body, explain the reasoning inside <reason> in at most 10 sentences (chunk
# indices used, what each chunk contributes, optional suggestions flagged as
# needing fact-checking); reply as <reason>...</reason> <answer>...</answer>,
# or, if the documents are insufficient, with the fixed refusal shown below.
# Descriptive rules: at most 30 sentences; 1-3 paragraphs; limit connectives
# and keyword repetition.
# Bullet-point rules: three levels (□, ◦, -) with 1-3 Level 2 items; omit
# periods; concise, clear sentences; technical terms; split paragraphs by meaning.
prompt_template: |
  당신은 보고서 작성 전문가입니다. 보고서 제목은 {TITLE}이며, 전체 목차는 아래와 같습니다:
  {SECTIONS}
  현재 작성할 섹션은 {SECTION}입니다. 작성 스타일은 {TYPE}을 사용합니다.
  주요 지침:
  1. 답변은 반드시 아래의 정보만을 사용하여 작성해야 합니다.
  {DOCUMENTS}
  1. 본문 작성 전에 <reason> 태그 안에 아래 내용을 포함하여 추론 과정을 최대 10문장 이내로 설명하세요:
     - 본문 작성에 사용한 청크 인덱스 표기
     - 각 청크의 관련 정보 설명
     - 필요시 부가적인 제안 사항 (단, 사실확인 필요성 언급 필수)
  2. 다음 형식으로 답변을 작성하세요:
     - 작성 가능한 경우: <reason>추론 과정</reason> <answer>본문 내용</answer>
     - 작성 불가능한 경우: <reason>관련 내용 부재</reason> <answer>제공된 문서를 바탕으로 답변할 수 없습니다.</answer>
  서술형(descriptive) 작성 규칙:
  - 최대 30문장 이내로 작성
  - 문단은 내용에 따라 1~3개로 구성
  - 연결어 사용 제한
  - 키워드 반복 제한
  개조식(bullet_point) 작성 규칙:
  □ Level 1: 핵심 내용
    ◦ Level 2: 하위 내용(1~3개 항목)
      - Level 3: 부연 설명
  - 마침표 생략
  - 간결하고 명확한 문장 구성
  - 전문적인 용어 사용
  - 의미 단위로 단락 구분
```
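The main-content template asks the model to reply with `<reason>` and `<answer>` tags and to cite chunk indices, so a small amount of pre- and post-processing may be useful. The sketch below is an assumption-laden example rather than KISTEP's code: `format_documents`, `parse_response`, and `is_refusal` are hypothetical helpers, and the `[i] text` layout for `{DOCUMENTS}` is only one plausible way to expose chunk indices, since this card does not define the format.

```python
import re
from typing import List, NamedTuple, Optional


class ParsedResponse(NamedTuple):
    reason: Optional[str]
    answer: Optional[str]


def format_documents(chunks: List[str]) -> str:
    """Number each chunk so the model can cite chunk indices in <reason>.

    Assumption: the "[i] text" layout is illustrative only.
    """
    return "\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))


def parse_response(text: str) -> ParsedResponse:
    """Extract the contents of the <reason> and <answer> tags, if present."""
    def grab(tag: str) -> Optional[str]:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        return match.group(1).strip() if match else None

    return ParsedResponse(reason=grab("reason"), answer=grab("answer"))


# Fixed refusal sentence from the template ("cannot answer from the provided documents").
REFUSAL = "제공된 문서를 바탕으로 답변할 수 없습니다."


def is_refusal(parsed: ParsedResponse) -> bool:
    """True if the model reported that the documents were insufficient."""
    return parsed.answer == REFUSAL


sample = "<reason>청크 1 사용</reason> <answer>본문 내용</answer>"
parsed = parse_response(sample)
print(parsed.answer)
```

Returning `None` for a missing tag lets the caller distinguish a malformed reply from an explicit refusal.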
### 5. Benchmark
TBD