ceilf6 committed on
Commit a97fe9c · 1 Parent(s): 798a65f

docs: add code-tape model card

Files changed (1)
  1. README.md +156 -0
README.md ADDED
@@ -0,0 +1,156 @@
---
license: apache-2.0
base_model: ceilf6/code-tape-subtitle-postprocessor-merged
library_name: transformers.js
pipeline_tag: text-generation
language:
- zh
- en
tags:
- onnx
- transformers.js
- webgpu
- wasm
- code-tape
- subtitle-correction
- chapter-generation
---

# code-tape subtitle postprocessor ONNX

This is the browser-local ONNX export of the code-tape subtitle post-processing model. It is the default LLM used by the code-tape web app for the "纠错并生成章节" ("correct errors and generate chapters") workflow.

The model receives ASR subtitle segments plus code context and returns strict JSON:

- sparse subtitle corrections for frontend/code terminology;
- playback chapter jump points derived from subtitle timestamps;
- no Markdown, no explanation, no extra wrapper text.

This model is not an ASR model. In code-tape, ASR is handled separately by Whisper; this ONNX model only post-processes the resulting subtitle text.

## Repository role

code-tape publishes this model family in three forms:

| Repository | Purpose |
| --- | --- |
| [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
| [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | Full merged Hugging Face model. |
| [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | This Transformers.js-compatible ONNX export for browser-local inference. |

Use this repository when integrating with the browser app.

## Intended contract

Input payload (the sample subtitle texts are Chinese ASR output mixed with English terms):

```json
{
  "context": {
    "fileName": "SubtitlePanel.tsx",
    "code": "await postProcessor.process({ track, context });",
    "runtimeOutput": "",
    "glossary": ["SubtitlePanel", "postProcessor", "chapters"]
  },
  "segments": [
    { "id": "subtitle-1", "startMs": 0, "endMs": 1600, "text": "这里创建 hugging face 字幕 post processor" },
    { "id": "subtitle-2", "startMs": 1600, "endMs": 3300, "text": "最后生成 corrections 和 chapters" }
  ]
}
```

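A payload of this shape might be assembled as in the sketch below before it is serialized into the user message. The helper name `buildUserPayload` and the character budgets are illustrative assumptions; the code-tape app is described as budgeting source code and runtime output, but its actual limits are not stated here.

```javascript
// Hypothetical helper: assembles the input payload in the documented shape.
// The default budget numbers are placeholders, not code-tape's real limits.
function buildUserPayload(context, segments, budget = { maxCodeChars: 2000, maxRuntimeChars: 500 }) {
  // Truncate a value to at most `max` characters; missing values become "".
  const clip = (value, max) => String(value ?? "").slice(0, max);

  return {
    context: {
      fileName: context.fileName ?? "",
      code: clip(context.code, budget.maxCodeChars),
      runtimeOutput: clip(context.runtimeOutput, budget.maxRuntimeChars),
      glossary: Array.isArray(context.glossary) ? [...context.glossary] : [],
    },
    // Keep only the fields the contract names; extra fields are dropped.
    segments: segments.map(({ id, startMs, endMs, text }) => ({ id, startMs, endMs, text })),
  };
}

const payload = buildUserPayload(
  { fileName: "Counter.tsx", code: "const [n, setN] = useState(0);", glossary: ["useState"] },
  [{ id: "subtitle-1", startMs: 0, endMs: 1200, text: "这里用 use state", speaker: "a" }],
  { maxCodeChars: 10, maxRuntimeChars: 500 },
);
console.log(JSON.stringify(payload));
```

Clipping by character count is only a rough proxy for token count; a tokenizer-based budget would be more precise if the tokenizer is already loaded.
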
Expected output shape:

```json
{
  "segments": [
    { "id": "subtitle-1", "text": "这里创建 Hugging Face 字幕 postProcessor" }
  ],
  "chapters": [
    { "title": "创建字幕后处理器", "startMs": 0, "endMs": 1600 },
    { "title": "生成纠错和章节", "startMs": 1600, "endMs": 3300 }
  ]
}
```

The two chapter titles in this example read "create the subtitle post-processor" and "generate corrections and chapters".

`segments` is a sparse change set: the application treats any subtitle segment omitted from it as unchanged, as with `subtitle-2` in the example above.

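The sparse change set can be merged back into the full track roughly as follows. This is a minimal sketch; `applyCorrections` is a hypothetical name, not a function from the code-tape codebase.

```javascript
// Hypothetical helper: merges a sparse change set into the full subtitle track.
// Segments without a matching correction are returned unchanged.
function applyCorrections(sourceSegments, corrections) {
  const textById = new Map(corrections.map((c) => [c.id, c.text]));
  return sourceSegments.map((segment) =>
    textById.has(segment.id) ? { ...segment, text: textById.get(segment.id) } : segment,
  );
}

const source = [
  { id: "subtitle-1", startMs: 0, endMs: 1600, text: "这里创建 hugging face 字幕 post processor" },
  { id: "subtitle-2", startMs: 1600, endMs: 3300, text: "最后生成 corrections 和 chapters" },
];
const merged = applyCorrections(source, [
  { id: "subtitle-1", text: "这里创建 Hugging Face 字幕 postProcessor" },
]);
console.log(merged.map((s) => s.text));
```

The helper copies corrected segments instead of mutating them, so the original track stays available if the response later fails validation.
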
## Browser usage

```javascript
import { pipeline } from "@huggingface/transformers";

// Load the ONNX export on WebGPU using the q4f16 quantized weights.
const generator = await pipeline(
  "text-generation",
  "ceilf6/code-tape-subtitle-postprocessor-onnx",
  { device: "webgpu", dtype: "q4f16" },
);

const messages = [
  {
    // The system prompt pins the model to the strict JSON contract.
    role: "system",
    content: [
      "You are the code-tape subtitle post-processing model.",
      "Only output one JSON object.",
      "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points.",
      'Output shape: {"segments":[{"id":"subtitle-1","text":"corrected text"}],"chapters":[{"title":"问题分析","startMs":0,"endMs":1000}]}',
    ].join("\n"),
  },
  {
    // The user message is the serialized input payload.
    role: "user",
    content: JSON.stringify({
      context: { fileName: "Counter.tsx", code: "", runtimeOutput: "", glossary: ["useState"] },
      segments: [{ id: "subtitle-1", startMs: 0, endMs: 1200, text: "这里用 use state" }],
    }),
  },
];

// Greedy decoding (do_sample: false) keeps the output deterministic.
const output = await generator(messages, {
  max_new_tokens: 384,
  do_sample: false,
  return_full_text: false,
});
```

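The reply still has to be pulled out of `output` and parsed. The sketch below assumes the pipeline returns an array whose first item has a `generated_text` field that is either a plain string or a list of chat messages ending with the assistant reply; that shape is an assumption to verify against the installed `@huggingface/transformers` version. Both helper names are hypothetical.

```javascript
// Assumption: `output[0].generated_text` is a string, or an array of chat
// messages whose last entry is the assistant reply.
function extractGeneratedText(output) {
  const generated = output?.[0]?.generated_text;
  if (typeof generated === "string") return generated;
  if (Array.isArray(generated)) {
    const last = generated[generated.length - 1];
    return typeof last?.content === "string" ? last.content : "";
  }
  return "";
}

// Returns the parsed object, or null when the text is not a single JSON object.
function parseModelJson(text) {
  try {
    const parsed = JSON.parse(text.trim());
    const isPlainObject = parsed !== null && typeof parsed === "object" && !Array.isArray(parsed);
    return isPlainObject ? parsed : null;
  } catch {
    return null;
  }
}

// Example with a chat-shaped result (fabricated for illustration only).
const chatShaped = [
  {
    generated_text: [
      { role: "user", content: "..." },
      { role: "assistant", content: '{"segments":[],"chapters":[]}' },
    ],
  },
];
const parsedReply = parseModelJson(extractGeneratedText(chatShaped));
console.log(parsedReply);
```

Returning `null` instead of throwing lets the caller treat malformed output as "no changes", which matches the safe-fallback requirement described for the app.
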
In production, code-tape tries WebGPU first and falls back to WASM/CPU-compatible settings when needed. The application also handles browser cache write failures and validates every model response before applying it.

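A WebGPU-first fallback could be structured like the sketch below. It is not code-tape's implementation: `loadWithFallback` is a hypothetical helper, the loader is injected so the sketch does not depend on a particular library call, and the WASM `dtype` is an assumed value that may not match the files published in this repository.

```javascript
// Hypothetical helper: tries each candidate option set in order and resolves
// with the first pipeline that loads successfully.
async function loadWithFallback(loadPipeline, candidates) {
  const errors = [];
  for (const options of candidates) {
    try {
      const generator = await loadPipeline(options);
      return { generator, options };
    } catch (error) {
      errors.push(error); // remember why this backend failed, then try the next
    }
  }
  const failure = new Error("No backend could load the model");
  failure.causes = errors;
  throw failure;
}

// Candidate order mirrors the text: WebGPU first, then WASM.
// The WASM dtype is an assumption, not a documented code-tape setting.
const candidates = [
  { device: "webgpu", dtype: "q4f16" },
  { device: "wasm", dtype: "q8" },
];
```

In the browser, the injected loader would typically be something like `(options) => pipeline("text-generation", "ceilf6/code-tape-subtitle-postprocessor-onnx", options)`.
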
## Integration notes

- Public browser loading does not require a Hugging Face token.
- Keep prompts short. The code-tape app budgets source code, runtime output, and output token count to keep local inference responsive.
- Validate JSON before use. On invalid JSON, unknown segment ids, duplicate ids, empty text, overlapping chapters, or chapters outside the subtitle timeline, the application must fall back safely.
- Run this model after ASR, not before it.

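The segment-related checks in the list above (unknown ids, duplicate ids, empty text) might look like the following. `validateCorrections` is a hypothetical name and the problem messages are illustrative; the app's real validator may differ.

```javascript
// Hypothetical validator for the sparse `segments` change set.
// Returns a list of problems; an empty list means the corrections can be applied.
function validateCorrections(sourceSegments, corrections) {
  if (!Array.isArray(corrections)) return ["segments is not an array"];
  const knownIds = new Set(sourceSegments.map((s) => s.id));
  const seen = new Set();
  const problems = [];
  for (const item of corrections) {
    const id = item?.id;
    if (!knownIds.has(id)) problems.push(`unknown segment id: ${id}`);
    if (seen.has(id)) problems.push(`duplicate segment id: ${id}`);
    seen.add(id);
    if (typeof item?.text !== "string" || item.text.trim() === "") {
      problems.push(`empty text for segment: ${id}`);
    }
  }
  return problems;
}

const track = [{ id: "subtitle-1" }, { id: "subtitle-2" }];
const problems = validateCorrections(track, [
  { id: "subtitle-1", text: "ok" },
  { id: "subtitle-1", text: "again" },
  { id: "subtitle-9", text: " " },
]);
console.log(problems);
```

A caller can treat any non-empty result as a reason to discard the whole response and keep the original subtitles.
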
## Training and export lineage

1. Fine-tune a LoRA adapter from `HuggingFaceTB/SmolLM2-135M-Instruct`.
2. Merge the adapter into a full Hugging Face model.
3. Export/quantize the merged model to ONNX for `@huggingface/transformers` browser inference.

## Evaluation

code-tape evaluates this model family with project-specific checks:

- JSON parseability;
- sparse segment reference validity;
- glossary preservation after sparse corrections are applied to the source subtitles;
- chapter ordering, overlap, and bounds within the subtitle timeline.

No broad general-purpose benchmark score is claimed.

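As an illustration of the chapter check (ordering, overlap, and bounds), the sketch below shows one possible implementation. It assumes chapters must be listed in time order, may share a boundary timestamp with the previous chapter, and must lie within the span from the earliest subtitle start to the latest subtitle end. `validateChapters` is a hypothetical name, not the project's evaluation code.

```javascript
// Hypothetical check for chapter ordering, overlap, and timeline bounds.
// Returns a list of problems; an empty list means the chapters look valid.
function validateChapters(chapters, sourceSegments) {
  if (!Array.isArray(chapters)) return ["chapters is not an array"];
  if (sourceSegments.length === 0) {
    return chapters.length === 0 ? [] : ["chapters without subtitles"];
  }
  const timelineStart = Math.min(...sourceSegments.map((s) => s.startMs));
  const timelineEnd = Math.max(...sourceSegments.map((s) => s.endMs));
  const problems = [];
  let previousEnd = -Infinity;
  chapters.forEach((chapter, index) => {
    const { startMs, endMs } = chapter ?? {};
    if (!Number.isFinite(startMs) || !Number.isFinite(endMs) || startMs >= endMs) {
      problems.push(`chapter ${index}: invalid time range`);
      return;
    }
    if (startMs < timelineStart || endMs > timelineEnd) {
      problems.push(`chapter ${index}: outside the subtitle timeline`);
    }
    // A chapter may start exactly where the previous one ended, but not earlier.
    if (startMs < previousEnd) problems.push(`chapter ${index}: overlaps or is out of order`);
    previousEnd = Math.max(previousEnd, endMs);
  });
  return problems;
}

const timeline = [
  { id: "subtitle-1", startMs: 0, endMs: 1600 },
  { id: "subtitle-2", startMs: 1600, endMs: 3300 },
];
const chapterProblems = validateChapters(
  [
    { title: "A", startMs: 0, endMs: 2000 },
    { title: "B", startMs: 1600, endMs: 3300 },
  ],
  timeline,
);
console.log(chapterProblems);
```
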
## Limitations

- The model is small and domain-specific; malformed JSON is possible.
- It is optimized for frontend/code explanation subtitles, not arbitrary subtitles.
- It cannot transcribe audio.
- Long subtitle tracks should be split before local browser inference.

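Splitting a long track can be as simple as the sketch below. `chunkSegments` is a hypothetical helper, and the chunk size of 20 segments is an arbitrary assumption; the limit that works in practice depends on subtitle length and the token budget.

```javascript
// Hypothetical helper: splits a long track into consecutive chunks of at most
// `maxSegments` segments so each chunk can be sent as its own request.
function chunkSegments(segments, maxSegments = 20) {
  if (!Number.isInteger(maxSegments) || maxSegments < 1) {
    throw new RangeError("maxSegments must be a positive integer");
  }
  const chunks = [];
  for (let i = 0; i < segments.length; i += maxSegments) {
    chunks.push(segments.slice(i, i + maxSegments));
  }
  return chunks;
}

// A synthetic 45-segment track, one second per segment.
const longTrack = Array.from({ length: 45 }, (_, i) => ({
  id: `subtitle-${i + 1}`,
  startMs: i * 1000,
  endMs: (i + 1) * 1000,
  text: "",
}));
const chunks = chunkSegments(longTrack, 20);
console.log(chunks.map((c) => c.length)); // chunk sizes: 20, 20, 5
```

Because each chunk is processed independently, the chapters returned per chunk would still need to be concatenated, and possibly merged at chunk boundaries, by the caller.
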
## Privacy and security

The intended path is browser-local inference. Audio transcription, subtitle correction, and chapter generation can run without sending media or subtitles to a hosted inference API.

Do not include secrets, private source code, credentials, or access tokens in prompts unless you control the full runtime and storage environment.

## License

Apache-2.0, following the base model license.