ceilf6 committed on
Commit c4b85b5 · 1 Parent(s): d4efdb1

docs: add code-tape model card

Files changed (1)
  1. README.md +149 -0
README.md ADDED
@@ -0,0 +1,149 @@
+ ---
+ license: apache-2.0
+ base_model: HuggingFaceTB/SmolLM2-135M-Instruct
+ library_name: transformers
+ pipeline_tag: text-generation
+ language:
+ - zh
+ - en
+ tags:
+ - safetensors
+ - llama
+ - transformers
+ - code-tape
+ - subtitle-correction
+ - chapter-generation
+ ---
+
+ # code-tape subtitle postprocessor merged model
+
+ This repository contains the full merged Hugging Face model for code-tape subtitle post-processing. It was produced by applying the project LoRA adapter to `HuggingFaceTB/SmolLM2-135M-Instruct`.
+
+ The model is specialized for a narrow post-ASR task:
+
+ - fix frontend/code terminology in subtitle text;
+ - keep code identifiers, package names, function names, and component names stable;
+ - return only changed subtitle segments as a sparse `segments` array;
+ - create timestamped playback chapters;
+ - output one strict JSON object.
+
+ This model is not an audio transcription model. It should receive subtitle segments that already have ids, start/end timestamps, and ASR text.
+
+ ## Repository role
+
+ code-tape publishes the same model family in three forms:
+
+ | Repository | Purpose |
+ | --- | --- |
+ | [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
+ | [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | This full merged model, useful for Python/Transformers inspection or re-export. |
+ | [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | Transformers.js-compatible ONNX export used by the browser app. |
+
+ For browser-local inference in code-tape, use the ONNX repository. Use this repository when you need a standard Transformers checkpoint.
+
+ ## Intended contract
+
+ Input is a chat message containing JSON:
+
+ ```json
+ {
+   "context": {
+     "fileName": "ReplayControls.tsx",
+     "code": "const canSeek = durationMs > 0;",
+     "runtimeOutput": "",
+     "glossary": ["ReplayControls", "canSeek", "durationMs"]
+   },
+   "segments": [
+     { "id": "subtitle-1", "startMs": 0, "endMs": 1400, "text": "这里先判断 can seek 是否可用" }
+   ]
+ }
+ ```
+
+ Expected output shape:
+
+ ```json
+ {
+   "segments": [
+     { "id": "subtitle-1", "text": "这里先判断 canSeek 是否可用" }
+   ],
+   "chapters": [
+     { "title": "判断回放是否可 seek", "startMs": 0, "endMs": 1400 }
+   ]
+ }
+ ```
+
+ Rules expected by the code-tape application:
+
+ - output JSON only, with no Markdown or explanation;
+ - `segments` contains only changed segments and may be empty;
+ - every returned segment id must exist in the input and must not be duplicated;
+ - chapter times must be monotonic, non-overlapping, and inside the subtitle timeline;
+ - invalid output is discarded by the application.
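The rules above can be checked mechanically before a result is accepted. The following is a minimal sketch, not the code-tape implementation: `validate_output` and `_is_number` are hypothetical names, and the sketch assumes that both `segments` and `chapters` must be present as lists, that each chapter has positive duration, and that the subtitle timeline runs from the earliest `startMs` to the latest `endMs` of the input segments.

```python
import json
from typing import Optional


def _is_number(value) -> bool:
    # bool is a subclass of int, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_output(raw: str, input_segments: list) -> Optional[dict]:
    """Return the parsed output if it passes the rules above, else None."""
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        return None  # not JSON at all (e.g. Markdown or an explanation)
    if not isinstance(data, dict):
        return None

    segments, chapters = data.get("segments"), data.get("chapters")
    # Assumption: both keys are required and must hold lists.
    if not isinstance(segments, list) or not isinstance(chapters, list):
        return None

    # Returned ids must exist in the input and must not repeat.
    known_ids = {seg["id"] for seg in input_segments}
    seen = set()
    for seg in segments:
        if not isinstance(seg, dict):
            return None
        seg_id, text = seg.get("id"), seg.get("text")
        if not isinstance(seg_id, str) or not isinstance(text, str):
            return None
        if seg_id not in known_ids or seg_id in seen:
            return None
        seen.add(seg_id)

    if chapters and not input_segments:
        return None  # no timeline to place chapters on
    # Assumption: the timeline spans earliest startMs to latest endMs.
    previous_end = min((s["startMs"] for s in input_segments), default=0)
    timeline_end = max((s["endMs"] for s in input_segments), default=0)
    for chapter in chapters:
        if not isinstance(chapter, dict) or not isinstance(chapter.get("title"), str):
            return None
        start, end = chapter.get("startMs"), chapter.get("endMs")
        if not (_is_number(start) and _is_number(end)):
            return None
        # Ordered, non-overlapping, positive duration, inside the timeline.
        if not (previous_end <= start < end <= timeline_end):
            return None
        previous_end = end
    return data
```

A caller would treat a `None` result as "discard the model output and keep the original subtitles", which matches the last rule in the list.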
+
+ ## Usage with Transformers
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_id = "ceilf6/code-tape-subtitle-postprocessor-merged"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id)
+
+ messages = [
+     {
+         "role": "system",
+         "content": (
+             "You are the code-tape subtitle post-processing model.\n"
+             "Only output one JSON object.\n"
+             "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
+         ),
+     },
+     # The user message carries the input JSON as a string (an empty payload here).
+     {"role": "user", "content": "{\"context\":{},\"segments\":[]}"},
+ ]
+
+ prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(prompt, return_tensors="pt")
+ # Greedy decoding keeps the output deterministic.
+ outputs = model.generate(**inputs, max_new_tokens=384, do_sample=False)
+ # Decode only the newly generated tokens so the prompt is not echoed back.
+ generated = outputs[0][inputs["input_ids"].shape[1]:]
+ print(tokenizer.decode(generated, skip_special_tokens=True))
+ ```
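The usage example sends an empty payload. A sketch of building a realistic user message from the contract's `context` and `segments` fields could look like the following. `build_messages` is a hypothetical helper, and the serialization settings (compact separators, unescaped non-ASCII text) are assumptions, since the model card does not say how the application serializes its input.

```python
import json

# System prompt copied from the usage example in this model card.
SYSTEM_PROMPT = (
    "You are the code-tape subtitle post-processing model.\n"
    "Only output one JSON object.\n"
    "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
)


def build_messages(context: dict, segments: list) -> list:
    """Wrap the context and subtitle segments into system + user chat messages."""
    payload = {"context": context, "segments": segments}
    # Assumption: compact, unescaped JSON keeps the prompt short and keeps
    # Chinese subtitle text readable instead of \uXXXX escapes.
    user_content = json.dumps(payload, ensure_ascii=False, separators=(",", ":"))
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


# Example values taken from the "Intended contract" section.
messages = build_messages(
    context={
        "fileName": "ReplayControls.tsx",
        "code": "const canSeek = durationMs > 0;",
        "runtimeOutput": "",
        "glossary": ["ReplayControls", "canSeek", "durationMs"],
    },
    segments=[
        {"id": "subtitle-1", "startMs": 0, "endMs": 1400, "text": "这里先判断 can seek 是否可用"},
    ],
)
```

The resulting `messages` list can be passed to `tokenizer.apply_chat_template` in place of the hand-written one.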
+
+ ## Training and conversion
+
+ The model was created from the code-tape subtitle post-processing LoRA workflow:
+
+ 1. prepare seed records with ASR-like subtitles, code context, runtime output, and glossary terms;
+ 2. distill strict JSON correction/chapter examples;
+ 3. fine-tune a LoRA adapter on `HuggingFaceTB/SmolLM2-135M-Instruct`;
+ 4. merge the adapter into a full model;
+ 5. export the merged model to ONNX for browser use.
+
+ The merged checkpoint is mainly an intermediate artifact for reproducibility and export.
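Step 4 of the workflow, merging the adapter into a full model, is commonly done with the `peft` library. The sketch below assumes that approach; the model card does not state which tooling the project actually used, and `merge_adapter` and its parameters are hypothetical names.

```python
def merge_adapter(base_id: str, adapter_id: str, output_dir: str) -> None:
    """Merge a LoRA adapter into its base model and save a standalone checkpoint."""
    # Imported lazily so that defining this function does not load heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id)
    # Attach the adapter weights, then fold them into the base weights.
    merged = PeftModel.from_pretrained(base, adapter_id).merge_and_unload()
    merged.save_pretrained(output_dir, safe_serialization=True)
    # Save the tokenizer alongside so the directory loads as a normal checkpoint.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)


# Example call (downloads both repositories, so it is left commented out):
# merge_adapter(
#     "HuggingFaceTB/SmolLM2-135M-Instruct",
#     "ceilf6/code-tape-subtitle-postprocessor-lora",
#     "./merged",
# )
```

After merging, the output directory no longer depends on `peft` and can be loaded with `AutoModelForCausalLM.from_pretrained` like any other checkpoint.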
+
+ ## Evaluation
+
+ code-tape evaluates this model family with project-specific checks instead of broad language-model benchmarks:
+
+ - valid JSON object output;
+ - valid sparse segment references;
+ - glossary preservation after sparse corrections are applied back to the source subtitles;
+ - non-empty, ordered, non-overlapping chapter supervision for training/evaluation records;
+ - chapter bounds inside the subtitle timeline.
+
+ The model output must always be validated by the caller.
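The glossary check in the list above depends on first applying the sparse corrections back onto the source subtitles. A minimal sketch of both steps follows. The function names are hypothetical, the second subtitle is made-up illustration data, and "glossary preservation" is read here as "a glossary term present in the source text must still be present after correction", which is one plausible interpretation rather than the project's documented definition.

```python
def apply_sparse_corrections(source_segments: list, corrections: list) -> list:
    """Overlay changed texts onto copies of the source segments, matched by id."""
    new_text = {c["id"]: c["text"] for c in corrections}
    # Unchanged segments keep their original text; timestamps are never touched.
    return [{**seg, "text": new_text.get(seg["id"], seg["text"])} for seg in source_segments]


def lost_glossary_terms(source_segments: list, merged_segments: list, glossary: list) -> list:
    """List (segment id, term) pairs where a correction removed a glossary term."""
    lost = []
    for before, after in zip(source_segments, merged_segments):
        for term in glossary:
            if term in before["text"] and term not in after["text"]:
                lost.append((before["id"], term))
    return lost


source = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1400, "text": "这里先判断 can seek 是否可用"},
    # Hypothetical second segment, added only to show an untouched entry.
    {"id": "subtitle-2", "startMs": 1400, "endMs": 2600, "text": "durationMs 大于 0"},
]
corrections = [{"id": "subtitle-1", "text": "这里先判断 canSeek 是否可用"}]

merged = apply_sparse_corrections(source, corrections)
problems = lost_glossary_terms(source, merged, ["ReplayControls", "canSeek", "durationMs"])
```

An empty `problems` list means no glossary term was dropped by the corrections.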
+
+ ## Limitations
+
+ - Narrowly trained for code-tape subtitle correction and chapter generation.
+ - Not suitable as a general chat assistant or general summarizer.
+ - Not an ASR model and cannot process audio directly.
+ - Small local models may produce malformed JSON; callers must keep a fallback path.
+
+ ## Privacy and security
+
+ The intended production path is the ONNX export running in the browser with `@huggingface/transformers`. Public browser loading does not require a Hugging Face token.
+
+ Do not put secrets, credentials, private code, or access tokens in prompts unless your inference environment is trusted.
+
+ ## License
+
+ Apache-2.0, following the base model license.