Publish subtitle postprocessor v12

Files changed (5)

README.md +20 -129
adapter_config.json +6 -6
adapter_model.safetensors +1 -1
special_tokens_map.json +7 -1
training_args.bin +2 -2

README.md CHANGED

@@ -1,147 +1,38 @@
 ---
 license: apache-2.0
 base_model: HuggingFaceTB/SmolLM2-135M-Instruct
-library_name: peft
-pipeline_tag: text-generation
-language:
-- zh
-- en
 tags:
-- base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
-- lora
-- peft
-- sft
-- transformers
-- trl
-- code-tape
-- subtitle-correction
-- chapter-generation
 ---
-# code-tape subtitle postprocessor LoRA
-This repository contains the LoRA adapter used by code-tape for subtitle post-processing experiments. It is fine-tuned from `HuggingFaceTB/SmolLM2-135M-Instruct` for a narrow browser-local workflow:
-- correct ASR subtitle text for frontend/code terminology, identifiers, component names, package names, and mixed Chinese/English narration;
-- preserve unchanged subtitle segments by returning a sparse `segments` change set;
-- generate playback chapter jump points from subtitle content and timestamps;
-- output one strict JSON object that the code-tape web app can validate.
-This model is not an ASR model. It expects subtitle segments that were already produced by an ASR pipeline such as Whisper.
-## Repository role
-code-tape publishes the same experiment in three forms:
-| Repository | Purpose |
-| --- | --- |
-| [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
-| [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | Full merged Hugging Face model after applying this adapter to the base model. |
-| [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | Transformers.js-compatible ONNX export used by the browser app. |
-For the code-tape application, use the ONNX repository. Use this LoRA repository only if you want to inspect, merge, or continue training the adapter.
-## Intended input and output
-The model is trained on chat-style records. The user message should contain JSON with code-tape subtitle context:
-```json
-{
-  "context": {
-    "fileName": "Counter.tsx",
-    "code": "const [count, setCount] = useState(0);",
-    "runtimeOutput": "",
-    "glossary": ["React", "useState", "setCount", "render"]
-  },
-  "segments": [
-    { "id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count" },
-    { "id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render" }
-  ]
-}
-```
-Expected assistant output:
 ```json
-{
-  "segments": [
-    { "id": "subtitle-1", "text": "这里用 useState 维护 count" },
-    { "id": "subtitle-2", "text": "然后 setCount 触发 render" }
-  ],
-  "chapters": [
-    { "title": "使用 useState 维护状态", "startMs": 0, "endMs": 1200 },
-    { "title": "调用 setCount 触发渲染", "startMs": 1200, "endMs": 2600 }
-  ]
-}
 ```
-`segments` may be sparse: unchanged subtitle segments can be omitted, and the application keeps their original text. Returned segment ids must come from the input exactly once. `chapters` must stay inside the input subtitle timeline.
-## Usage
-```python
-from transformers import AutoModelForCausalLM, AutoTokenizer
-from peft import PeftModel
-base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"
-adapter_id = "ceilf6/code-tape-subtitle-postprocessor-lora"
-tokenizer = AutoTokenizer.from_pretrained(adapter_id)
-base = AutoModelForCausalLM.from_pretrained(base_model)
-model = PeftModel.from_pretrained(base, adapter_id)
-messages = [
-    {
-        "role": "system",
-        "content": (
-            "You are the code-tape subtitle post-processing model.\n"
-            "Only output one JSON object.\n"
-            "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
-        ),
-    },
-    {"role": "user", "content": "{\"context\":{},\"segments\":[]}"},
-]
-prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = tokenizer(prompt, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=384, do_sample=False)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-## Training data
-The adapter was trained from code-tape subtitle post-processing records. Each record contains:
-- ASR-like subtitle segments with ids and timestamps;
-- frontend/code context such as file name, source snippet, runtime output, and glossary terms;
-- an assistant JSON response with sparse subtitle corrections and chapter jump points.
-The seed examples are intentionally narrow and project-specific. They cover React, TypeScript, Monaco/editor events, replay scheduler terminology, IndexedDB subtitle storage, Vite/GitHub Pages routing, Tailwind theme tokens, and repo-guard/code-review phrasing.
-## Evaluation
-This repository does not claim broad language-model benchmark performance. code-tape evaluates this model family with project-specific checks:
-- JSON parseability;
-- valid sparse segment references with no unknown or duplicate ids;
-- preservation of frontend/code glossary terms after applying sparse corrections;
-- chapter ordering, overlap, and timeline bounds.
-The application must still validate model output before applying it.
-## Limitations
-- Designed for short subtitle batches, not long-form document summarization.
-- Optimized for code-tape frontend/code explanation scenarios; quality outside that domain is not guaranteed.
-- Small local model behavior can be brittle. Always parse, validate, and fall back to original subtitles on invalid output.
-- It does not transcribe audio and does not replace Whisper/ASR.
-## Privacy and security
-The intended application path is browser-local inference through the ONNX export. No Hugging Face token is required for public model loading, and user audio/subtitles do not need to be uploaded to a hosted inference API.
-Do not include secrets, private source code, access tokens, or credentials in prompts unless you control the full inference environment and storage path.
-## License
-Apache-2.0, following the base model license.

 ---
 license: apache-2.0
 base_model: HuggingFaceTB/SmolLM2-135M-Instruct
 tags:
+  - code-tape
+  - subtitle
+  - lora
+  - text-generation
 ---
+# code-tape Subtitle Postprocessor LoRA v12
+LoRA adapter for the code-tape browser-local subtitle postprocessor. It is trained to correct ASR subtitles for frontend/code terminology and generate playback chapter jump points.
+## Contract
+Input messages contain:
+- `context`: file name, code/runtime snippets, and glossary.
+- `inputSegments`: subtitle `id` and `text` only.
+- `timeline`: subtitle `id`, `startMs`, and `endMs`.
+Output must be one JSON object:
 ```json
+{"segments":[{"id":"subtitle-1","text":"这里用 useState 维护 count"}],"chapters":[{"title":"状态设计","startMs":0,"endMs":1000}]}
 ```
+`segments` should be sparse and contain only changed subtitles.
+## Training Notes
+- Base: `HuggingFaceTB/SmolLM2-135M-Instruct`
+- Records: 450 curated/distilled examples
+- Epochs: 2
+- Final train loss: 0.2545
+- Corpus gates: JSON valid rate 1.0, sparse output rate 0.9333, unknown segment reference rate 0
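
Review note: the v12 contract above, together with the checks listed in the removed Evaluation section (parseable JSON, no unknown or duplicate segment ids, chapters ordered and inside the timeline), could be enforced on the application side roughly as follows. This is a minimal sketch, not the code-tape validator: the function name `validate_output` and the error strings are hypothetical, and it assumes `timeline` entries carry `id`, `startMs`, and `endMs` as described in the Contract section.

```python
import json


def validate_output(raw: str, timeline: list[dict]) -> list[str]:
    """Return contract violations; an empty list means the output passed.

    Hypothetical helper: the name and the messages are illustrative only.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    errors: list[str] = []

    # Sparse segments: every returned id must exist in the input, at most once.
    known_ids = {item["id"] for item in timeline}
    seen: set[str] = set()
    segments = data.get("segments", [])
    if not isinstance(segments, list):
        errors.append("segments is not a list")
        segments = []
    for seg in segments:
        seg_id = seg.get("id") if isinstance(seg, dict) else None
        if seg_id not in known_ids:
            errors.append(f"unknown segment id: {seg_id}")
        elif seg_id in seen:
            errors.append(f"duplicate segment id: {seg_id}")
        else:
            seen.add(seg_id)

    # Chapters: ordered, non-overlapping, and inside the subtitle timeline.
    start = min((item["startMs"] for item in timeline), default=0)
    end = max((item["endMs"] for item in timeline), default=0)
    prev_end = start
    chapters = data.get("chapters", [])
    if not isinstance(chapters, list):
        errors.append("chapters is not a list")
        chapters = []
    for ch in chapters:
        s = ch.get("startMs") if isinstance(ch, dict) else None
        e = ch.get("endMs") if isinstance(ch, dict) else None
        if not isinstance(s, int) or not isinstance(e, int) or s >= e:
            errors.append(f"invalid chapter range: {s}..{e}")
            continue
        if s < start or e > end:
            errors.append(f"chapter outside timeline: {s}..{e}")
        if s < prev_end:
            errors.append(f"chapter overlaps or is out of order: {s}..{e}")
        prev_end = max(prev_end, e)

    return errors
```

If the returned list is non-empty, the caller can fall back to the original subtitles, which is the behavior the previous README recommended for invalid output.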

adapter_config.json CHANGED

@@ -3,7 +3,7 @@
   "alpha_pattern": {},
   "arrow_config": null,
   "auto_mapping": null,
-  "base_model_name_or_path": "HuggingFaceTB/SmolLM2-135M-Instruct",
   "bias": "none",
   "corda_config": null,
   "ensure_weight_tying": false,
@@ -30,13 +30,13 @@
   "rank_pattern": {},
   "revision": null,
   "target_modules": [
-    "up_proj",
-    "gate_proj",
-    "o_proj",
     "k_proj",
-    "down_proj",
     "v_proj",
-    "q_proj"
   ],
   "target_parameters": null,
   "task_type": "CAUSAL_LM",

   "alpha_pattern": {},
   "arrow_config": null,
   "auto_mapping": null,
+  "base_model_name_or_path": "/Users/ceilf6/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M-Instruct/snapshots/12fd25f77366fa6b3b4b768ec3050bf629380bac",
   "bias": "none",
   "corda_config": null,
   "ensure_weight_tying": false,
   "rank_pattern": {},
   "revision": null,
   "target_modules": [
     "k_proj",
+    "q_proj",
     "v_proj",
+    "up_proj",
+    "down_proj",
+    "gate_proj",
+    "o_proj"
   ],
   "target_parameters": null,
   "task_type": "CAUSAL_LM",
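
Review note: this hunk replaces the Hub id in `base_model_name_or_path` with an absolute path into the author's local Hugging Face cache. Tools that resolve the base model from `adapter_config.json` may not find that path on another machine. The sketch below recovers the Hub id from such a path, assuming the usual cache layout `models--{org}--{name}/snapshots/{revision}`. The function name `hub_id_from_cache_path` is hypothetical, and the first `--` after the prefix is assumed to separate organization and model name.

```python
from pathlib import PurePosixPath


def hub_id_from_cache_path(path: str) -> str:
    """Map a local hub-cache snapshot path back to an 'org/name' repo id.

    Hypothetical helper. Values that do not contain a 'models--' directory
    (for example a plain Hub id) are returned unchanged.
    """
    for part in PurePosixPath(path).parts:
        if part.startswith("models--"):
            # 'models--org--name' -> 'org/name'; only the first '--' is a separator.
            return part[len("models--"):].replace("--", "/", 1)
    return path


cached = (
    "/Users/ceilf6/.cache/huggingface/hub/"
    "models--HuggingFaceTB--SmolLM2-135M-Instruct/"
    "snapshots/12fd25f77366fa6b3b4b768ec3050bf629380bac"
)
print(hub_id_from_cache_path(cached))  # HuggingFaceTB/SmolLM2-135M-Instruct
```

Writing the recovered id back into `adapter_config.json` before publishing would restore the portable value that the previous revision had.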

adapter_model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:455ae313667c729be9d4eb44d665834ed5a1088ad87b7f5fda757857a7f9453e
 size 19593064

 version https://git-lfs.github.com/spec/v1
+oid sha256:43a72e030fe66a18463420dcc707486aba066071cd76f0d485add9fcc1efaeb9
 size 19593064

special_tokens_map.json CHANGED

@@ -17,7 +17,13 @@
     "rstrip": false,
     "single_word": false
   },
-  "pad_token": "<|im_end|>",
   "unk_token": {
     "content": "<|endoftext|>",
     "lstrip": false,

     "rstrip": false,
     "single_word": false
   },
+  "pad_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
   "unk_token": {
     "content": "<|endoftext|>",
     "lstrip": false,

training_args.bin CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7c5bc73070a4694ced65d8556f33cc194bd42f217765fe591da87caf7e93210d
-size 6353

 version https://git-lfs.github.com/spec/v1
+oid sha256:80cb1e39785edc42b22dc6bf02b41659c6629e26715e2924028840dc7fcab1a8
+size 5841