ceilf6 committed on
Commit 7586591 · 1 Parent(s): 8277ab1

docs: add code-tape model card

Files changed (1)
  1. README.md +100 -162
README.md CHANGED
@@ -1,209 +1,147 @@
  ---
  base_model: HuggingFaceTB/SmolLM2-135M-Instruct
  library_name: peft
  pipeline_tag: text-generation
  tags:
  - base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
  - lora
  - sft
  - transformers
  - trl
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->



- ## Model Details

- ### Model Description

- <!-- Provide a longer summary of what this model is. -->



- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

- ### Model Sources [optional]

- <!-- Provide the basic links for the model. -->

- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]

- ## Uses

- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

- [More Information Needed]

- ### Downstream Use [optional]

- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- [More Information Needed]

- ### Out-of-Scope Use

- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]

  ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
-
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]

- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->

- [More Information Needed]

- ## More Information [optional]

- [More Information Needed]

- ## Model Card Authors [optional]

- [More Information Needed]

- ## Model Card Contact

- [More Information Needed]
- ### Framework versions

- - PEFT 0.19.1
 
  ---
+ license: apache-2.0
  base_model: HuggingFaceTB/SmolLM2-135M-Instruct
  library_name: peft
  pipeline_tag: text-generation
+ language:
+ - zh
+ - en
  tags:
  - base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
  - lora
+ - peft
  - sft
  - transformers
  - trl
+ - code-tape
+ - subtitle-correction
+ - chapter-generation
  ---

+ # code-tape subtitle postprocessor LoRA

+ This repository contains the LoRA adapter used by code-tape for subtitle post-processing experiments. It is fine-tuned from `HuggingFaceTB/SmolLM2-135M-Instruct` for a narrow browser-local workflow:

+ - correct ASR subtitle text for frontend/code terminology, identifiers, component names, package names, and mixed Chinese/English narration;
+ - preserve unchanged subtitle segments by returning a sparse `segments` change set;
+ - generate playback chapter jump points from subtitle content and timestamps;
+ - output one strict JSON object that the code-tape web app can validate.

+ This model is not an ASR model. It expects subtitle segments that were already produced by an ASR pipeline such as Whisper.

+ ## Repository role

+ code-tape publishes the same experiment in three forms:

+ | Repository | Purpose |
+ | --- | --- |
+ | [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
+ | [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | Full merged Hugging Face model after applying this adapter to the base model. |
+ | [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | Transformers.js-compatible ONNX export used by the browser app. |

+ For the code-tape application, use the ONNX repository. Use this LoRA repository only if you want to inspect, merge, or continue training the adapter.

+ ## Intended input and output

+ The model is trained on chat-style records. The user message should contain JSON with code-tape subtitle context:

+ ```json
+ {
+   "context": {
+     "fileName": "Counter.tsx",
+     "code": "const [count, setCount] = useState(0);",
+     "runtimeOutput": "",
+     "glossary": ["React", "useState", "setCount", "render"]
+   },
+   "segments": [
+     { "id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count" },
+     { "id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render" }
+   ]
+ }
+ ```
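
As a rough illustration of how such a user message might be assembled, the sketch below serializes a `context` object and a `segments` list into one JSON string. The helper name `build_user_message` is hypothetical and not part of code-tape, and the card does not say whether the training records used compact or pretty-printed JSON, so the serialization settings here are an assumption.

```python
import json


def build_user_message(context: dict, segments: list) -> str:
    """Serialize subtitle context and segments into one JSON string (hypothetical helper)."""
    payload = {"context": context, "segments": segments}
    # ensure_ascii=False keeps the Chinese narration as readable characters
    # instead of \uXXXX escapes. Whether training data did the same is an assumption.
    return json.dumps(payload, ensure_ascii=False)


context = {
    "fileName": "Counter.tsx",
    "code": "const [count, setCount] = useState(0);",
    "runtimeOutput": "",
    "glossary": ["React", "useState", "setCount", "render"],
}
segments = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count"},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render"},
]

user_message = build_user_message(context, segments)
# The string can then be used as the content of the chat-style user turn.
messages = [{"role": "user", "content": user_message}]
```
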

+ Expected assistant output:

+ ```json
+ {
+   "segments": [
+     { "id": "subtitle-1", "text": "这里用 useState 维护 count" },
+     { "id": "subtitle-2", "text": "然后 setCount 触发 render" }
+   ],
+   "chapters": [
+     { "title": "使用 useState 维护状态", "startMs": 0, "endMs": 1200 },
+     { "title": "调用 setCount 触发渲染", "startMs": 1200, "endMs": 2600 }
+   ]
+ }
+ ```

+ `segments` may be sparse: unchanged subtitle segments can be omitted, and the application keeps their original text. Every returned segment id must match an input segment id, and no id may appear more than once. Every chapter must stay inside the input subtitle timeline.
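
The sparse-merge rule can be sketched as follows. This is a minimal illustration, not the code-tape implementation: `apply_sparse_segments` is a hypothetical name, and raising `ValueError` on a bad id is an assumed policy (an application could equally discard the whole response).

```python
def apply_sparse_segments(original: list, corrections: list) -> list:
    """Merge sparse corrections into the original segments (hypothetical helper).

    Segments without a correction keep their original text. Unknown or
    duplicate ids are rejected with ValueError.
    """
    known_ids = {seg["id"] for seg in original}
    replacements = {}
    for correction in corrections:
        seg_id = correction.get("id")
        text = correction.get("text")
        if seg_id not in known_ids:
            raise ValueError(f"unknown segment id: {seg_id!r}")
        if seg_id in replacements:
            raise ValueError(f"duplicate segment id: {seg_id!r}")
        if not isinstance(text, str):
            raise ValueError(f"segment {seg_id!r} has no text string")
        replacements[seg_id] = text
    # Build new dicts so the caller's original segments are left untouched.
    return [{**seg, "text": replacements.get(seg["id"], seg["text"])} for seg in original]


original = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count"},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render"},
]
# Only subtitle-1 is corrected; subtitle-2 is omitted and keeps its original text.
merged = apply_sparse_segments(original, [{"id": "subtitle-1", "text": "这里用 useState 维护 count"}])
```
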

+ ## Usage

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ from peft import PeftModel

+ base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"
+ adapter_id = "ceilf6/code-tape-subtitle-postprocessor-lora"

+ # Load the tokenizer from the adapter repo, then attach the LoRA weights to the base model.
+ tokenizer = AutoTokenizer.from_pretrained(adapter_id)
+ base = AutoModelForCausalLM.from_pretrained(base_model)
+ model = PeftModel.from_pretrained(base, adapter_id)

+ messages = [
+     {
+         "role": "system",
+         "content": (
+             "You are the code-tape subtitle post-processing model.\n"
+             "Only output one JSON object.\n"
+             "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
+         ),
+     },
+     {"role": "user", "content": "{\"context\":{},\"segments\":[]}"},
+ ]

+ # Render the chat template to text, then generate greedily (do_sample=False).
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=384, do_sample=False)
+ # outputs[0] holds the prompt tokens followed by the completion, so this prints both.
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
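
Because the decoded text from the snippet above contains the prompt as well as the completion, a caller that wants the JSON object has to isolate it first. The sketch below is one possible approach, not something the card prescribes: `extract_first_json_object` is a hypothetical helper that scans a string for the first parseable JSON object using the standard library's `json.JSONDecoder.raw_decode`.

```python
import json
from typing import Optional


def extract_first_json_object(text: str) -> Optional[dict]:
    """Return the first JSON object found in text, or None (hypothetical helper)."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores what follows.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            value = None
        if isinstance(value, dict):
            return value
        start = text.find("{", start + 1)
    return None


sample_completion = 'Sure. {"segments": [], "chapters": []} extra tokens'
parsed = extract_first_json_object(sample_completion)
```

The user turn itself contains a JSON object, so this should be applied to the completion only, for example by decoding `outputs[0][inputs["input_ids"].shape[1]:]` rather than the full sequence.
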

+ ## Training data

+ The adapter was trained on code-tape subtitle post-processing records. Each record contains:

+ - ASR-like subtitle segments with ids and timestamps;
+ - frontend/code context such as file name, source snippet, runtime output, and glossary terms;
+ - an assistant JSON response with sparse subtitle corrections and chapter jump points.

+ The seed examples are intentionally narrow and project-specific. They cover React, TypeScript, Monaco/editor events, replay scheduler terminology, IndexedDB subtitle storage, Vite/GitHub Pages routing, Tailwind theme tokens, and repo-guard/code-review phrasing.

  ## Evaluation

+ This repository does not claim broad language-model benchmark performance. code-tape evaluates this model family with project-specific checks:

+ - JSON parseability;
+ - valid sparse segment references with no unknown or duplicate ids;
+ - preservation of frontend/code glossary terms after applying sparse corrections;
+ - chapter ordering, overlap, and timeline bounds.

+ The application must still validate model output before applying it.
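
The chapter checks listed above (ordering, overlap, timeline bounds) could look roughly like the sketch below. The function name `validate_chapters` and the exact rules are assumptions for illustration: it treats the timeline as running from the earliest `startMs` to the latest `endMs` of the input segments, allows adjacent chapters to touch (as in the example output, where one chapter ends at 1200 and the next starts at 1200), and rejects empty chapters. The real code-tape validator may differ.

```python
def validate_chapters(chapters: list, segments: list) -> list:
    """Return a list of problem descriptions; an empty list means all checks passed.

    Hypothetical validator, not the code-tape implementation.
    """
    problems = []
    if not segments:
        return ["no input segments to define a timeline"] if chapters else problems
    timeline_start = min(seg["startMs"] for seg in segments)
    timeline_end = max(seg["endMs"] for seg in segments)
    previous_end = timeline_start
    for index, chapter in enumerate(chapters):
        start, end = chapter.get("startMs"), chapter.get("endMs")
        if not isinstance(start, int) or not isinstance(end, int):
            problems.append(f"chapter {index}: startMs/endMs must be integers")
            continue
        if start >= end:
            problems.append(f"chapter {index}: startMs must be before endMs")
        if start < timeline_start or end > timeline_end:
            problems.append(f"chapter {index}: outside the subtitle timeline")
        # Chapters must be in order; touching boundaries are allowed, overlap is not.
        if start < previous_end:
            problems.append(f"chapter {index}: overlaps or precedes the previous chapter")
        previous_end = max(previous_end, end)
    return problems


segments = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600},
]
good = [
    {"title": "使用 useState 维护状态", "startMs": 0, "endMs": 1200},
    {"title": "调用 setCount 触发渲染", "startMs": 1200, "endMs": 2600},
]
good_problems = validate_chapters(good, segments)
```

An application following the card's advice would fall back to the original subtitles and skip chapter creation whenever the returned list is non-empty.
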

+ ## Limitations

+ - Designed for short subtitle batches, not long-form document summarization.
+ - Optimized for code-tape frontend/code explanation scenarios; quality outside that domain is not guaranteed.
+ - Small local model behavior can be brittle. Always parse, validate, and fall back to original subtitles on invalid output.
+ - It does not transcribe audio and does not replace Whisper/ASR.

+ ## Privacy and security

+ The intended application path is browser-local inference through the ONNX export. No Hugging Face token is required for public model loading, and user audio/subtitles do not need to be uploaded to a hosted inference API.

+ Do not include secrets, private source code, access tokens, or credentials in prompts unless you control the full inference environment and storage path.

+ ## License

+ Apache-2.0, following the base model license.