DataPilot
/

ArrowIdeative-13b-Instruct-test-llm-jp-v0.2

@@ -1,199 +1,236 @@
 ---
 library_name: transformers
-tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+license: apache-2.0
+datasets:
+- TeamDelta/bare-ja-v0.1
+language:
+- ja
+base_model:
+- llm-jp/llm-jp-3-13b
+pipeline_tag: text-generation
 ---
+# ArrowIdeative-13b-NeoBase-ZERO-llm-jp
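+## Overview
+**ArrowIdeative-13b-NeoBase-ZERO-llm-jp** is a Japanese-language LLM designed around post-training the base model with **GRPO (RL) alone**. Rather than pushing all the way toward typical strong instruction following (Instruct), it aims to keep the **output freedom of a base model** while also achieving **the minimum format compliance needed for chat use** and **a higher floor on answer quality**.
+Its positioning in short:
+- **A base model that responds reasonably well to prompt engineering**
+- But **not a full Instruct model** (it does not aim for excessive agreeableness or excessive boilerplate)
+---
+## Key points
+- **Training method**: created directly from the base model with **GRPO only** (SFT is deliberately not the main axis)
+- **Goals**:
+  1. **Chat template compliance** (e.g. emitting the end token, suppressing format breakage)
+  2. **Better answer quality** (a scalar reward from a reward model)
+- **Character**: designed to stay close to the base model's behavior (that is, to limit the homogenization that instruction tuning tends to cause)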
+---
+## Inference code
+```python
+import torch
+from copy import deepcopy
+from transformers import AutoTokenizer, AutoModelForCausalLM, StoppingCriteria, StoppingCriteriaList
+# ===== Model =====
+model_path = "DataPilot/ArrowIdeative-13b-NeoBase-ZERO-llm-jp-v0.2"
+tokenizer = AutoTokenizer.from_pretrained(model_path)
+model = AutoModelForCausalLM.from_pretrained(
+    model_path,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+model.eval()
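+# System prompt: "You are a capable assistant. Please answer politely in Japanese."
+system_prompt = """あなたは有能なアシスタントです。日本語で丁寧に答えてください。"""
+# User prompt: "Please explain the difference between a CPU and a GPU."
+prompt = """CPUとGPUの違いについて教えてください。"""
+# (Keeps the ChatML format of the original code.)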
+text = f"""<|im_start|>system
+{system_prompt}<|im_end|>
+<|im_start|>user
+{prompt}<|im_end|>
+<|im_start|>assistant
+"""
+inputs = tokenizer(text, add_special_tokens=False, return_tensors="pt", return_token_type_ids=False).to(model.device)
+prompt_len = inputs["input_ids"].shape[1]
+# Token sequence for "<|im_end|>" (it may not be a single token, so handle it as a sequence)
+stop_ids = tokenizer.encode("<|im_end|>", add_special_tokens=False)
+stop_ids = torch.tensor(stop_ids, device=model.device, dtype=inputs["input_ids"].dtype)
+class StopOnImEnd(StoppingCriteria):
+    def __init__(self, stop_ids_tensor: torch.Tensor):
+        super().__init__()
+        self.stop_ids = stop_ids_tensor
+    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
+        k = int(self.stop_ids.numel())
+        if k == 0 or input_ids.shape[1] < k:
+            return False
+        return torch.equal(input_ids[0, -k:], self.stop_ids)
+stopping_criteria = StoppingCriteriaList([StopOnImEnd(stop_ids)])
+# Do not stop on the default EOS (stop only on "<|im_end|>")
+gen_config = deepcopy(model.generation_config)
+gen_config.eos_token_id = None
+gen_config.pad_token_id = tokenizer.pad_token_id if tokenizer.pad_token_id is not None else model.config.eos_token_id
+with torch.inference_mode():
+    output = model.generate(
+        **inputs,
+        generation_config=gen_config,
+        stopping_criteria=stopping_criteria,
+        max_new_tokens=1024,
+        do_sample=True,
+        top_p=0.95,
+        temperature=0.5,
+        repetition_penalty=1.05,
+    )
+generated = tokenizer.decode(output[0, prompt_len:], skip_special_tokens=False)
+print(generated.split("<|im_end|>", 1)[0])
+```
+---
+## Base model
+- Base: **llm-jp-3-13b**
+  https://huggingface.co/llm-jp/llm-jp-3-13b
+---
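+## Training data (overview)
+- Dataset: only the **question (prompt) portion** of **TeamDelta/bare-ja-v0.1**, and only a subset of it
+  https://huggingface.co/datasets/TeamDelta/bare-ja-v0.1
+This dataset was created with the following synthesis flow (summary):
+1. Draft questions and answers generated with a **base model (Sarashina2-70b)**
+2. Quality curation (selection and cleanup) with **Microsoft Phi-4-mini**
+3. Diversity filtering with **Multilingual E5** (removing near-identical questions, reducing duplicates)
+- Reference: Sarashina2-70b
+  https://www.sbintuitions.co.jp/blog/entry/2024/08/21/144254
+- Prompt used for BARE: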
+  https://github.com/foxn2000/sdg/blob/main/prompts/bare.txt
+---
+## Training setup
+### Training and inference frameworks
+- Training: **Unsloth**
+- Reward inference: **SGLang**
+### Devices
+- **NVIDIA RTX 5090 (32GB)**: main training
+- **NVIDIA RTX 4060 Ti (16GB)**: reward model inference
+### Reward model
+- **cyberagent/ca-reward-3b-ja**
+  https://huggingface.co/cyberagent/ca-reward-3b-ja
+---
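+## Reward design (overview)
+The reward is built from the following five reward functions, which steer training from several angles:
+### 1. **Chat template compliance**
+   - Checks that the end token (`<|im_end|>`) is emitted properly and that the output follows the format
+   - **Compliant**: +1.0 × length factor (discourages answers that are too short)
+   - **Non-compliant**: -5.0 (strong penalty)
+   - **Extremely short answer**: -5.0 (hard rejection below 15 characters)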
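The template-compliance rule above can be sketched roughly as follows. This is an illustration only, not the training code: the model card does not define the length factor or the exact compliance check, so the linear ramp in `length_factor`, its `full_credit_chars=200` parameter, and the "exactly one terminator at the very end" test are all assumptions.

```python
IM_END = "<|im_end|>"


def length_factor(body: str, full_credit_chars: int = 200) -> float:
    """Hypothetical ramp: grows linearly with length and is capped at 1.0."""
    return min(1.0, len(body) / full_credit_chars)


def template_reward(completion: str, min_chars: int = 15) -> float:
    """Score one completion for chat-template compliance."""
    stripped = completion.rstrip()
    # Assumed compliance check: exactly one terminator, and nothing after it.
    if stripped.count(IM_END) != 1 or not stripped.endswith(IM_END):
        return -5.0  # non-compliant: strong penalty
    body = stripped[: -len(IM_END)].strip()
    if len(body) < min_chars:
        return -5.0  # hard rejection of extremely short answers
    return 1.0 * length_factor(body)
```

With these assumed parameters, a compliant 100-character answer earns 0.5 and anything of 200 characters or more earns the full 1.0.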
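+### 2. **Repetition penalty**
+   - Detects looping output from the repetition rate of n-grams (default: 6 characters)
+   - Penalty: -0.5 × repetition rate (at most -2.0)
+   - Prevents RM hacking (earning a high score through verbose repetition)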
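A minimal sketch of a character n-gram repetition penalty follows. The card does not say how the repetition rate is computed; here it is assumed to be the fraction of n-grams that are repeats (`1 - unique / total`). The function names are hypothetical.

```python
def repetition_rate(text: str, n: int = 6) -> float:
    """Assumed definition: share of character n-grams that are repeats."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    if not grams:
        return 0.0  # text shorter than n characters: nothing to measure
    return 1.0 - len(set(grams)) / len(grams)


def repetition_penalty(text: str, n: int = 6,
                       scale: float = 0.5, floor: float = -2.0) -> float:
    """Penalty of -scale * rate, never lower than floor."""
    return max(floor, -scale * repetition_rate(text, n))
```

Under this definition the rate never exceeds 1.0, so the -2.0 floor only becomes active with a larger `scale` or a rate that can exceed 1.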
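+### 3. **Overlong suppression**
+   - Penalizes in stages near max_completion_length (from 85% onward)
+   - Soft penalty: -0.8 × (progress)^2.0 (DAPO-style)
+   - Hard penalty: -1.5 (when the output is cut off at 100% or more)
+   - Prevents drift toward filling the output up to the maximum length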
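The staged length penalty could look like the sketch below. It assumes that "progress" means how far the completion has advanced through the soft zone between 85% and 100% of `max_completion_length`; the card does not spell this out, and the function name and `truncated` flag are hypothetical.

```python
def overlong_penalty(n_tokens: int, max_len: int, truncated: bool,
                     soft_start: float = 0.85, soft_scale: float = 0.8,
                     power: float = 2.0, hard: float = -1.5) -> float:
    """Length penalty: zero, then a soft quadratic ramp, then a hard penalty."""
    if truncated and n_tokens >= max_len:
        return hard  # generation was cut off at the maximum length
    start = soft_start * max_len
    if n_tokens <= start:
        return 0.0  # comfortably below the soft zone
    # Assumed meaning of "progress": position inside the soft zone, in [0, 1].
    progress = min(1.0, (n_tokens - start) / (max_len - start))
    return -soft_scale * progress ** power
```

For `max_len=1000`, a 925-token completion sits halfway through the soft zone and receives about -0.2, while a completion truncated at 1000 tokens receives -1.5.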
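+### 4. **Within-group diversity**
+   - Detects duplication and similarity among the multiple generations for the same prompt
+   - **Exact duplicate**: -0.3 (from the second copy onward)
+   - **High similarity (Jaccard ≥ 0.85)**: -0.2 × similarity
+   - Countermeasure against entropy collapse (mode collapse)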
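One way to picture the within-group check is the sketch below. The card does not say which units the Jaccard similarity is computed over, so character trigram sets are an assumption, as is comparing each completion only against earlier ones in the same group.

```python
def char_ngrams(text: str, n: int = 3) -> set:
    """Set of character n-grams (assumed unit for the Jaccard similarity)."""
    return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}


def jaccard(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def diversity_penalties(completions: list, threshold: float = 0.85) -> list:
    """One penalty per completion, comparing each with the earlier ones."""
    penalties = []
    for i, text in enumerate(completions):
        earlier = completions[:i]
        if text in earlier:
            penalties.append(-0.3)  # exact duplicate, second copy onward
            continue
        grams = char_ngrams(text)
        best = max((jaccard(grams, char_ngrams(e)) for e in earlier),
                   default=0.0)
        penalties.append(-0.2 * best if best >= threshold else 0.0)
    return penalties
```

The first completion in a group is never penalized here; only later copies or near-copies are.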
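+### 5. **Answer quality (reward model)**
+   - Evaluated only when the template is followed (gating)
+   - Uses the scalar from an external RM (cyberagent/ca-reward-3b-ja)
+   - Scale: 1.0 × RM score, clip range: ±10.0
+   - The length factor is applied **only to positive values** (limits the reward for short answers)
+   - If the RM call fails, the reward is `None` (masked), ignored, and does not affect training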
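The gating, clipping, and masking logic can be sketched as below. `score_fn` is a hypothetical callable standing in for the request to the reward model server; the real client code is not shown in the card. Returning 0.0 for non-compliant completions is also an assumption, since the card only says the RM is not evaluated in that case.

```python
from typing import Callable, Optional


def quality_reward(completion: str,
                   score_fn: Callable[[str], float],
                   compliant: bool,
                   length_factor: float = 1.0,
                   scale: float = 1.0,
                   clip: float = 10.0) -> Optional[float]:
    """Gated reward-model score; None means "mask this sample"."""
    if not compliant:
        return 0.0  # assumed value when the gate is closed
    try:
        raw = score_fn(completion)  # hypothetical RM call
    except Exception:
        return None  # RM failure: masked so it does not affect training
    score = max(-clip, min(clip, scale * raw))
    if score > 0:
        score *= length_factor  # damp rewards for short answers only
    return score
```

Applying the length factor only to positive scores means that a short answer cannot shrink a negative score toward zero.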
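+### Combining the rewards
+- TRL GRPO sums the outputs of all reward functions (optionally weighted)
+- Computes the group-relative advantage and from it the policy gradient
+- Adaptive KL control (adjusting beta) limits divergence from the reference model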
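The summation and group-relative advantage can be illustrated in plain Python. This is a simplified sketch rather than TRL's implementation: it treats `None` as "skip this term", uses the population standard deviation, and the helper names and `eps` value are assumptions.

```python
import statistics
from typing import List, Optional, Sequence


def combine_rewards(per_function: Sequence[Sequence[Optional[float]]],
                    weights: Optional[Sequence[float]] = None) -> List[float]:
    """Weighted sum over reward functions for each completion; None is skipped."""
    weights = weights or [1.0] * len(per_function)
    n_completions = len(per_function[0])
    totals = []
    for j in range(n_completions):
        totals.append(sum(w * rewards[j]
                          for w, rewards in zip(weights, per_function)
                          if rewards[j] is not None))
    return totals


def group_advantages(totals: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Center and scale rewards within one prompt's group of completions."""
    mean = statistics.fmean(totals)
    std = statistics.pstdev(totals)
    return [(t - mean) / (std + eps) for t in totals]
```

For example, `combine_rewards([[1.0, -5.0], [None, 2.0]])` gives `[1.0, -3.0]`, and `group_advantages` then assigns the first completion a positive advantage and the second a negative one.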
+---
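+## Usage (recommended)
+### Intended use cases
+- Zero-to-one ideation, exploratory thinking, draft generation
+- Dialogue that does not pin instructions down too tightly (uses where you steer through prompt design)
+- Situations where you want minimal chat usability while keeping the base model's "interestingness" and diversity
+### Caveats
+- This model does not prioritize **strong safety alignment or strict instruction following**
+- Output varies widely with prompt design (both a strength and a weakness)
+- When using a chat template, **inputs and outputs that match the template specification** are recommended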
+---
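+## Guidance on generation quality and behavior
+- **Base-leaning**: the goal is not to converge on overly safe, consensus-style template answers
+- **Prompt sensitivity**: designed so that results change readily with how instructions are written (the granularity of instructions matters)
+- **Output personality**: aims to avoid the homogenization that SFT-heavy training tends to cause and to keep exploratory behavior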
+---
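+## Known limitations
+- Even with improved format compliance, **strict instruction following** and **automatic safety guarantees** are not assured
+- The model inherits the reward model's biases (values and style)
+- Compared naively on the same axes as typical Instruct models, it may be at a disadvantage for some uses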
+---
+## License
+- Subject to the licenses of the base model and related datasets.
+  See the following for details:
+  - llm-jp-3-13b: https://huggingface.co/llm-jp/llm-jp-3-13b
+  - TeamDelta/bare-ja-v0.1: https://huggingface.co/datasets/TeamDelta/bare-ja-v0.1
+  - ca-reward-3b-ja: https://huggingface.co/cyberagent/ca-reward-3b-ja
+---
+## Acknowledgments
+- The llm-jp project
+- TeamDelta / bare-ja-v0.1
+- CyberAgent (ca-reward-3b-ja)
+- Unsloth / SGLang and related OSS
+---
+## Citation (as needed)
+When citing this repository or model card, adapt the following as a starting point:
+```bibtex
+@misc{arrowideative_13b_neobase_zero_llm_jp,
+  title        = {ArrowIdeative-13b-NeoBase-ZERO-llm-jp},
+  author       = {holy-fox},
+  year         = {2026},
+}
+```