ceilf6 committed on
Commit 1ed2b2d · verified · 1 Parent(s): 7586591

Publish subtitle postprocessor v12

README.md CHANGED
@@ -1,147 +1,38 @@
1
  ---
2
  license: apache-2.0
3
  base_model: HuggingFaceTB/SmolLM2-135M-Instruct
4
- library_name: peft
5
- pipeline_tag: text-generation
6
- language:
7
- - zh
8
- - en
9
  tags:
10
- - base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
11
- - lora
12
- - peft
13
- - sft
14
- - transformers
15
- - trl
16
- - code-tape
17
- - subtitle-correction
18
- - chapter-generation
19
  ---
20
 
21
- # code-tape subtitle postprocessor LoRA
22
 
23
- This repository contains the LoRA adapter used by code-tape for subtitle post-processing experiments. It is fine-tuned from `HuggingFaceTB/SmolLM2-135M-Instruct` for a narrow browser-local workflow:
24
 
25
- - correct ASR subtitle text for frontend/code terminology, identifiers, component names, package names, and mixed Chinese/English narration;
26
- - preserve unchanged subtitle segments by returning a sparse `segments` change set;
27
- - generate playback chapter jump points from subtitle content and timestamps;
28
- - output one strict JSON object that the code-tape web app can validate.
29
 
30
- This model is not an ASR model. It expects subtitle segments that were already produced by an ASR pipeline such as Whisper.
31
 
32
- ## Repository role
 
 
33
 
34
- code-tape publishes the same experiment in three forms:
35
-
36
- | Repository | Purpose |
37
- | --- | --- |
38
- | [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
39
- | [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | Full merged Hugging Face model after applying this adapter to the base model. |
40
- | [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | Transformers.js-compatible ONNX export used by the browser app. |
41
-
42
- For the code-tape application, use the ONNX repository. Use this LoRA repository only if you want to inspect, merge, or continue training the adapter.
43
-
44
- ## Intended input and output
45
-
46
- The model is trained on chat-style records. The user message should contain JSON with code-tape subtitle context:
47
-
48
- ```json
49
- {
50
- "context": {
51
- "fileName": "Counter.tsx",
52
- "code": "const [count, setCount] = useState(0);",
53
- "runtimeOutput": "",
54
- "glossary": ["React", "useState", "setCount", "render"]
55
- },
56
- "segments": [
57
- { "id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count" },
58
- { "id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render" }
59
- ]
60
- }
61
- ```
62
-
63
- Expected assistant output:
64
 
65
  ```json
66
- {
67
- "segments": [
68
- { "id": "subtitle-1", "text": "这里用 useState 维护 count" },
69
- { "id": "subtitle-2", "text": "然后 setCount 触发 render" }
70
- ],
71
- "chapters": [
72
- { "title": "使用 useState 维护状态", "startMs": 0, "endMs": 1200 },
73
- { "title": "调用 setCount 触发渲染", "startMs": 1200, "endMs": 2600 }
74
- ]
75
- }
76
  ```
77
 
78
- `segments` may be sparse: unchanged subtitle segments can be omitted, and the application keeps their original text. Returned segment ids must come from the input exactly once. `chapters` must stay inside the input subtitle timeline.
79
-
80
- ## Usage
81
-
82
- ```python
83
- from transformers import AutoModelForCausalLM, AutoTokenizer
84
- from peft import PeftModel
85
-
86
- base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"
87
- adapter_id = "ceilf6/code-tape-subtitle-postprocessor-lora"
88
-
89
- tokenizer = AutoTokenizer.from_pretrained(adapter_id)
90
- base = AutoModelForCausalLM.from_pretrained(base_model)
91
- model = PeftModel.from_pretrained(base, adapter_id)
92
-
93
- messages = [
94
- {
95
- "role": "system",
96
- "content": (
97
- "You are the code-tape subtitle post-processing model.\n"
98
- "Only output one JSON object.\n"
99
- "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
100
- ),
101
- },
102
- {"role": "user", "content": "{\"context\":{},\"segments\":[]}"},
103
- ]
104
-
105
- prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
106
- inputs = tokenizer(prompt, return_tensors="pt")
107
- outputs = model.generate(**inputs, max_new_tokens=384, do_sample=False)
108
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
109
- ```
110
-
111
- ## Training data
112
-
113
- The adapter was trained from code-tape subtitle post-processing records. Each record contains:
114
-
115
- - ASR-like subtitle segments with ids and timestamps;
116
- - frontend/code context such as file name, source snippet, runtime output, and glossary terms;
117
- - an assistant JSON response with sparse subtitle corrections and chapter jump points.
118
-
119
- The seed examples are intentionally narrow and project-specific. They cover React, TypeScript, Monaco/editor events, replay scheduler terminology, IndexedDB subtitle storage, Vite/GitHub Pages routing, Tailwind theme tokens, and repo-guard/code-review phrasing.
120
-
121
- ## Evaluation
122
-
123
- This repository does not claim broad language-model benchmark performance. code-tape evaluates this model family with project-specific checks:
124
-
125
- - JSON parseability;
126
- - valid sparse segment references with no unknown or duplicate ids;
127
- - preservation of frontend/code glossary terms after applying sparse corrections;
128
- - chapter ordering, overlap, and timeline bounds.
129
-
130
- The application must still validate model output before applying it.
131
-
132
- ## Limitations
133
-
134
- - Designed for short subtitle batches, not long-form document summarization.
135
- - Optimized for code-tape frontend/code explanation scenarios; quality outside that domain is not guaranteed.
136
- - Small local model behavior can be brittle. Always parse, validate, and fall back to original subtitles on invalid output.
137
- - It does not transcribe audio and does not replace Whisper/ASR.
138
-
139
- ## Privacy and security
140
-
141
- The intended application path is browser-local inference through the ONNX export. No Hugging Face token is required for public model loading, and user audio/subtitles do not need to be uploaded to a hosted inference API.
142
 
143
- Do not include secrets, private source code, access tokens, or credentials in prompts unless you control the full inference environment and storage path.
144
 
145
- ## License
 
 
 
 
146
 
147
- Apache-2.0, following the base model license.
 
1
  ---
2
  license: apache-2.0
3
  base_model: HuggingFaceTB/SmolLM2-135M-Instruct
4
  tags:
5
+ - code-tape
6
+ - subtitle
7
+ - lora
8
+ - text-generation
9
  ---
10
 
11
+ # code-tape Subtitle Postprocessor LoRA v12
12
 
13
+ LoRA adapter for the code-tape browser-local subtitle postprocessor. It is trained to correct ASR subtitles for frontend/code terminology and generate playback chapter jump points.
14
 
15
+ ## Contract
 
 
 
16
 
17
+ Input messages contain:
18
 
19
+ - `context`: file name, code/runtime snippets, and glossary.
20
+ - `inputSegments`: subtitle `id` and `text` only.
21
+ - `timeline`: subtitle `id`, `startMs`, and `endMs`.
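
The three-part input contract above can be sketched as follows. This is a minimal illustration, not the code-tape implementation: the helper name `build_user_message` is hypothetical, and the keys inside `context` (`fileName`, `code`, `runtimeOutput`, `glossary`) are assumed from the previous README revision rather than confirmed for v12.

```python
import json


def build_user_message(context, subtitles):
    """Split full subtitle records into a text-only view and a timing-only view.

    Hypothetical helper: only the top-level keys `context`, `inputSegments`,
    and `timeline` come from the README contract.
    """
    input_segments = [{"id": s["id"], "text": s["text"]} for s in subtitles]
    timeline = [
        {"id": s["id"], "startMs": s["startMs"], "endMs": s["endMs"]}
        for s in subtitles
    ]
    payload = {
        "context": context,
        "inputSegments": input_segments,
        "timeline": timeline,
    }
    # ensure_ascii=False keeps Chinese subtitle text readable in the prompt.
    return json.dumps(payload, ensure_ascii=False)


# Assumed context keys, taken from the earlier README example.
context = {
    "fileName": "Counter.tsx",
    "code": "const [count, setCount] = useState(0);",
    "runtimeOutput": "",
    "glossary": ["React", "useState", "setCount", "render"],
}
subtitles = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count"},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render"},
]
user_message = build_user_message(context, subtitles)
```

Keeping timestamps out of `inputSegments` means the text the model is asked to rewrite carries no timing fields, while `timeline` still lets it place chapter boundaries.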
22
 
23
+ Output must be one JSON object:
24
 
25
  ```json
26
+ {"segments":[{"id":"subtitle-1","text":"这里用 useState 维护 count"}],"chapters":[{"title":"状态设计","startMs":0,"endMs":1000}]}
27
  ```
28
 
29
+ `segments` should be sparse and contain only changed subtitles.
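
A sparse `segments` list only makes sense if the caller merges it back onto the original subtitles. The sketch below shows one way to do that, assuming the rules stated in the previous README revision (ids must come from the input, each at most once, and invalid output falls back to the original text). The function name `apply_model_output` is hypothetical.

```python
import json


def apply_model_output(raw_output, subtitles):
    """Return {id: text} with sparse corrections applied.

    Falls back to the original texts when the output is not valid JSON,
    references an unknown id, or repeats an id.
    """
    originals = {s["id"]: s["text"] for s in subtitles}
    try:
        parsed = json.loads(raw_output)
        changes = parsed["segments"]
        seen = set()
        for change in changes:
            seg_id = change["id"]
            text = change["text"]
            if seg_id not in originals or seg_id in seen or not isinstance(text, str):
                return dict(originals)
            seen.add(seg_id)
    except (json.JSONDecodeError, KeyError, TypeError):
        # Malformed JSON or an unexpected shape: keep the ASR text untouched.
        return dict(originals)

    merged = dict(originals)
    for change in changes:
        merged[change["id"]] = change["text"]
    return merged


subtitles = [
    {"id": "subtitle-1", "text": "这里用 use state 维护 count"},
    {"id": "subtitle-2", "text": "点击按钮"},
]
raw = '{"segments":[{"id":"subtitle-1","text":"这里用 useState 维护 count"}],"chapters":[]}'
merged = apply_model_output(raw, subtitles)
```

Segments the model omits keep their original text, so an empty `segments` list is a valid "no changes" answer.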
30
 
31
+ ## Training Notes
32
 
33
+ - Base: `HuggingFaceTB/SmolLM2-135M-Instruct`
34
+ - Records: 450 curated/distilled examples
35
+ - Epochs: 2
36
+ - Final train loss: 0.2545
37
+ - Corpus gates: JSON valid rate 1.0, sparse output rate 0.9333, unknown segment reference rate 0
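
The README does not define how the corpus gates are computed. The sketch below is one plausible reading, with every definition an assumption: "JSON valid rate" is the share of records whose assistant output parses and has a `segments` list, "sparse output rate" is the share of all records that return fewer segments than they received, and "unknown segment reference rate" is the share of returned ids that are absent from the input. The function name `corpus_gates` is hypothetical.

```python
import json


def corpus_gates(records):
    """Compute gate metrics over (input_ids, raw_assistant_output) pairs.

    All three definitions are assumptions; the source only names the gates.
    """
    total = len(records)
    valid = sparse = 0
    unknown_refs = total_refs = 0
    for input_ids, raw in records:
        try:
            parsed = json.loads(raw)
            ids = [segment["id"] for segment in parsed["segments"]]
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # counts against the JSON valid rate only
        valid += 1
        if len(ids) < len(input_ids):
            sparse += 1
        total_refs += len(ids)
        unknown_refs += sum(1 for seg_id in ids if seg_id not in input_ids)
    return {
        "json_valid_rate": valid / total if total else 0.0,
        "sparse_output_rate": sparse / total if total else 0.0,
        "unknown_reference_rate": unknown_refs / total_refs if total_refs else 0.0,
    }


sample = [
    (["a", "b"], '{"segments":[{"id":"a","text":"x"}],"chapters":[]}'),
    (["a"], '{"segments":[{"id":"a","text":"x"}],"chapters":[]}'),
    (["a"], "not json"),
    (["a", "b"], '{"segments":[{"id":"z","text":"x"}],"chapters":[]}'),
]
gates = corpus_gates(sample)
```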
38
 
 
adapter_config.json CHANGED
@@ -3,7 +3,7 @@
3
  "alpha_pattern": {},
4
  "arrow_config": null,
5
  "auto_mapping": null,
6
- "base_model_name_or_path": "HuggingFaceTB/SmolLM2-135M-Instruct",
7
  "bias": "none",
8
  "corda_config": null,
9
  "ensure_weight_tying": false,
@@ -30,13 +30,13 @@
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
33
- "up_proj",
34
- "gate_proj",
35
- "o_proj",
36
  "k_proj",
37
- "down_proj",
38
  "v_proj",
39
- "q_proj"
 
 
 
40
  ],
41
  "target_parameters": null,
42
  "task_type": "CAUSAL_LM",
 
3
  "alpha_pattern": {},
4
  "arrow_config": null,
5
  "auto_mapping": null,
6
+ "base_model_name_or_path": "/Users/ceilf6/.cache/huggingface/hub/models--HuggingFaceTB--SmolLM2-135M-Instruct/snapshots/12fd25f77366fa6b3b4b768ec3050bf629380bac",
7
  "bias": "none",
8
  "corda_config": null,
9
  "ensure_weight_tying": false,
 
30
  "rank_pattern": {},
31
  "revision": null,
32
  "target_modules": [
 
 
 
33
  "k_proj",
34
+ "q_proj",
35
  "v_proj",
36
+ "up_proj",
37
+ "down_proj",
38
+ "gate_proj",
39
+ "o_proj"
40
  ],
41
  "target_parameters": null,
42
  "task_type": "CAUSAL_LM",
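
In this commit `base_model_name_or_path` changes from a Hub id to a path inside one machine's Hugging Face cache, which will not exist elsewhere. Loading the base model explicitly and passing it to `PeftModel.from_pretrained`, as the earlier README usage snippet did, should sidestep the recorded path. For tooling that reads the config directly, the sketch below recovers the Hub id and revision from the cache layout `models--<org>--<name>/snapshots/<revision>`. The helper name `hub_id_from_cache_path` is hypothetical, and it assumes the organization name itself contains no `--`.

```python
import re

# Matches the tail of a Hugging Face cache snapshot path.
CACHE_PATTERN = re.compile(
    r"models--(?P<repo>[^/\\]+)[/\\]snapshots[/\\](?P<rev>[0-9a-f]+)[/\\]?$"
)


def hub_id_from_cache_path(path):
    """Return (repo_id, revision) for a cache snapshot path, else None."""
    match = CACHE_PATTERN.search(path)
    if match is None:
        return None
    # Assumption: the first "--" separates organization from model name.
    org, sep, name = match.group("repo").partition("--")
    repo_id = f"{org}/{name}" if sep else org
    return repo_id, match.group("rev")


local_path = (
    "/Users/ceilf6/.cache/huggingface/hub/"
    "models--HuggingFaceTB--SmolLM2-135M-Instruct/"
    "snapshots/12fd25f77366fa6b3b4b768ec3050bf629380bac"
)
resolved = hub_id_from_cache_path(local_path)
```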
adapter_model.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:455ae313667c729be9d4eb44d665834ed5a1088ad87b7f5fda757857a7f9453e
3
  size 19593064
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:43a72e030fe66a18463420dcc707486aba066071cd76f0d485add9fcc1efaeb9
3
  size 19593064
special_tokens_map.json CHANGED
@@ -17,7 +17,13 @@
17
  "rstrip": false,
18
  "single_word": false
19
  },
20
- "pad_token": "<|im_end|>",
 
 
 
 
 
 
21
  "unk_token": {
22
  "content": "<|endoftext|>",
23
  "lstrip": false,
 
17
  "rstrip": false,
18
  "single_word": false
19
  },
20
+ "pad_token": {
21
+ "content": "<|im_end|>",
22
+ "lstrip": false,
23
+ "normalized": false,
24
+ "rstrip": false,
25
+ "single_word": false
26
+ },
27
  "unk_token": {
28
  "content": "<|endoftext|>",
29
  "lstrip": false,
training_args.bin CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:7c5bc73070a4694ced65d8556f33cc194bd42f217765fe591da87caf7e93210d
3
- size 6353
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:80cb1e39785edc42b22dc6bf02b41659c6629e26715e2924028840dc7fcab1a8
3
+ size 5841