ceilf6/code-tape-subtitle-postprocessor-lora

@@ -1,209 +1,147 @@
 ---
 base_model: HuggingFaceTB/SmolLM2-135M-Instruct
 library_name: peft
 pipeline_tag: text-generation
 tags:
 - base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
 - lora
 - sft
 - transformers
 - trl
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]
-### Framework versions
-- PEFT 0.19.1

 ---
+license: apache-2.0
 base_model: HuggingFaceTB/SmolLM2-135M-Instruct
 library_name: peft
 pipeline_tag: text-generation
+language:
+- zh
+- en
 tags:
 - base_model:adapter:HuggingFaceTB/SmolLM2-135M-Instruct
 - lora
+- peft
 - sft
 - transformers
 - trl
+- code-tape
+- subtitle-correction
+- chapter-generation
 ---
+# code-tape subtitle postprocessor LoRA
+This repository contains the LoRA adapter used by code-tape for subtitle post-processing experiments. It is fine-tuned from `HuggingFaceTB/SmolLM2-135M-Instruct` for a narrow browser-local workflow:
+- correct ASR subtitle text for frontend/code terminology, identifiers, component names, package names, and mixed Chinese/English narration;
+- preserve unchanged subtitle segments by returning a sparse `segments` change set;
+- generate playback chapter jump points from subtitle content and timestamps;
+- output one strict JSON object that the code-tape web app can validate.
+This model is not an ASR model. It expects subtitle segments that were already produced by an ASR pipeline such as Whisper.
+## Repository role
+code-tape publishes the same experiment in three forms:
+| Repository | Purpose |
+| --- | --- |
+| [`ceilf6/code-tape-subtitle-postprocessor-lora`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-lora) | LoRA adapter for reproducibility and continued fine-tuning. |
+| [`ceilf6/code-tape-subtitle-postprocessor-merged`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-merged) | Full merged Hugging Face model after applying this adapter to the base model. |
+| [`ceilf6/code-tape-subtitle-postprocessor-onnx`](https://huggingface.co/ceilf6/code-tape-subtitle-postprocessor-onnx) | Transformers.js-compatible ONNX export used by the browser app. |
+For the code-tape application, use the ONNX repository. Use this LoRA repository only if you want to inspect, merge, or continue training the adapter.
+## Intended input and output
+The model is trained on chat-style records. The user message should contain JSON with code-tape subtitle context:
+```json
+{
+  "context": {
+    "fileName": "Counter.tsx",
+    "code": "const [count, setCount] = useState(0);",
+    "runtimeOutput": "",
+    "glossary": ["React", "useState", "setCount", "render"]
+  },
+  "segments": [
+    { "id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count" },
+    { "id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render" }
+  ]
+}
+```
+Expected assistant output:
+```json
+{
+  "segments": [
+    { "id": "subtitle-1", "text": "这里用 useState 维护 count" },
+    { "id": "subtitle-2", "text": "然后 setCount 触发 render" }
+  ],
+  "chapters": [
+    { "title": "使用 useState 维护状态", "startMs": 0, "endMs": 1200 },
+    { "title": "调用 setCount 触发渲染", "startMs": 1200, "endMs": 2600 }
+  ]
+}
+```
+`segments` may be sparse: unchanged subtitle segments can be omitted, and the application keeps their original text. Every returned segment id must match an id from the input and may appear at most once. Each chapter's `startMs` and `endMs` must stay inside the input subtitle timeline.
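The sparse change set described above can be applied with a small merge step. The following is a minimal sketch, not the code-tape implementation: the function name `apply_sparse_segments` and the choice to raise `ValueError` on a bad id are assumptions made for illustration.

```python
def apply_sparse_segments(original: list[dict], changes: list[dict]) -> list[dict]:
    """Return a copy of `original` with the corrected texts from `changes` applied.

    Hypothetical helper: the name and the error handling are illustrative only.
    """
    known_ids = {seg["id"] for seg in original}
    corrections = {}
    for change in changes:
        seg_id = change["id"]
        if seg_id not in known_ids:
            raise ValueError(f"unknown segment id: {seg_id}")
        if seg_id in corrections:
            raise ValueError(f"duplicate segment id: {seg_id}")
        corrections[seg_id] = change["text"]
    # Segments the model omitted keep their original text; timestamps are never changed.
    return [{**seg, "text": corrections.get(seg["id"], seg["text"])} for seg in original]


original = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "这里用 use state 维护 count"},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "然后 set count 触发 render"},
]
# Only subtitle-1 is corrected here, so subtitle-2 passes through untouched.
merged = apply_sparse_segments(
    original, [{"id": "subtitle-1", "text": "这里用 useState 维护 count"}]
)
```

Building a new list instead of mutating `original` keeps the ASR text available if the application later decides to discard the model output.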
+## Usage
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from peft import PeftModel
+base_model = "HuggingFaceTB/SmolLM2-135M-Instruct"
+adapter_id = "ceilf6/code-tape-subtitle-postprocessor-lora"
+# Load the tokenizer from the adapter repo, then attach the LoRA weights to the base model.
+tokenizer = AutoTokenizer.from_pretrained(adapter_id)
+base = AutoModelForCausalLM.from_pretrained(base_model)
+model = PeftModel.from_pretrained(base, adapter_id)
+messages = [
+    {
+        "role": "system",
+        "content": (
+            "You are the code-tape subtitle post-processing model.\n"
+            "Only output one JSON object.\n"
+            "Goal: correct ASR subtitle text for frontend/code terms and create playback chapter jump points."
+        ),
+    },
+    # Placeholder payload; replace it with real context and subtitle segments.
+    {"role": "user", "content": "{\"context\":{},\"segments\":[]}"},
+]
+prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+inputs = tokenizer(prompt, return_tensors="pt")
+# Greedy decoding keeps the JSON output deterministic.
+outputs = model.generate(**inputs, max_new_tokens=384, do_sample=False)
+# Decode only the newly generated tokens so the prompt is not echoed back.
+new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
+print(tokenizer.decode(new_tokens, skip_special_tokens=True))
+```
+## Training data
+The adapter was trained on code-tape subtitle post-processing records. Each record contains:
+- ASR-like subtitle segments with ids and timestamps;
+- frontend/code context such as file name, source snippet, runtime output, and glossary terms;
+- an assistant JSON response with sparse subtitle corrections and chapter jump points.
+The seed examples are intentionally narrow and project-specific. They cover React, TypeScript, Monaco/editor events, replay scheduler terminology, IndexedDB subtitle storage, Vite/GitHub Pages routing, Tailwind theme tokens, and repo-guard/code-review phrasing.
 ## Evaluation
+This repository does not claim broad language-model benchmark performance. code-tape evaluates this model family with project-specific checks:
+- JSON parseability;
+- valid sparse segment references with no unknown or duplicate ids;
+- preservation of frontend/code glossary terms after applying sparse corrections;
+- chapter ordering, overlap, and timeline bounds.
+The application must still validate model output before applying it.
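The chapter checks listed above (ordering, overlap, and timeline bounds) could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's evaluation code: the name `chapter_errors` is hypothetical, and rejecting a chapter whose `startMs` is not before its `endMs` is an extra assumption not stated in this card.

```python
def chapter_errors(chapters: list[dict], segments: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the chapters pass.

    Hypothetical helper: the name and the messages are illustrative only.
    """
    if not segments:
        return ["no input segments to bound the timeline"] if chapters else []
    timeline_start = min(seg["startMs"] for seg in segments)
    timeline_end = max(seg["endMs"] for seg in segments)
    errors = []
    previous_end = None
    for index, chapter in enumerate(chapters):
        start, end = chapter["startMs"], chapter["endMs"]
        if start >= end:  # assumption: empty or reversed chapters are invalid
            errors.append(f"chapter {index}: startMs must be before endMs")
        if start < timeline_start or end > timeline_end:
            errors.append(f"chapter {index}: outside the subtitle timeline")
        # One comparison covers both out-of-order and overlapping chapters.
        if previous_end is not None and start < previous_end:
            errors.append(f"chapter {index}: overlaps or precedes the previous chapter")
        previous_end = end
    return errors


segments = [
    {"id": "subtitle-1", "startMs": 0, "endMs": 1200, "text": "a"},
    {"id": "subtitle-2", "startMs": 1200, "endMs": 2600, "text": "b"},
]
good = [
    {"title": "A", "startMs": 0, "endMs": 1200},
    {"title": "B", "startMs": 1200, "endMs": 2600},
]
bad = [
    {"title": "A", "startMs": 0, "endMs": 1500},
    {"title": "B", "startMs": 1200, "endMs": 3000},
]
good_errors = chapter_errors(good, segments)
bad_errors = chapter_errors(bad, segments)
```

Returning a list of messages rather than a boolean makes it easier to log why a batch was rejected.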
+## Limitations
+- Designed for short subtitle batches, not long-form document summarization.
+- Optimized for code-tape frontend/code explanation scenarios; quality outside that domain is not guaranteed.
+- Small local model behavior can be brittle. Always parse, validate, and fall back to original subtitles on invalid output.
+- It does not transcribe audio and does not replace Whisper/ASR.
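The parse-and-fall-back advice above can be sketched as follows. The names `parse_model_output`, `result_or_fallback`, and `EMPTY_RESULT` are hypothetical, and treating a missing `segments` or `chapters` key as an empty list is an assumption rather than documented code-tape behavior.

```python
import json

# An empty change set: no corrections and no chapters.
EMPTY_RESULT = {"segments": [], "chapters": []}


def parse_model_output(raw_output):
    """Parse the model's text into a result dict, or return None if it is unusable."""
    try:
        payload = json.loads(raw_output)
    except (TypeError, ValueError):  # json.JSONDecodeError is a ValueError subclass
        return None
    if not isinstance(payload, dict):
        return None
    # Assumption: a missing key is treated the same as an empty list.
    segments = payload.get("segments", [])
    chapters = payload.get("chapters", [])
    if not isinstance(segments, list) or not isinstance(chapters, list):
        return None
    for seg in segments:
        if not (
            isinstance(seg, dict)
            and isinstance(seg.get("id"), str)
            and isinstance(seg.get("text"), str)
        ):
            return None
    return {"segments": segments, "chapters": chapters}


def result_or_fallback(raw_output):
    # Falling back to an empty change set leaves every original subtitle untouched.
    return parse_model_output(raw_output) or EMPTY_RESULT


valid = result_or_fallback('{"segments": [{"id": "subtitle-1", "text": "useState"}], "chapters": []}')
invalid = result_or_fallback("Sure! Here is the JSON you asked for")
```

Because the change set is sparse, the fallback does not need a copy of the input: an empty `segments` list already means "keep the original subtitles".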
+## Privacy and security
+The intended application path is browser-local inference through the ONNX export. No Hugging Face token is required for public model loading, and user audio/subtitles do not need to be uploaded to a hosted inference API.
+Do not include secrets, private source code, access tokens, or credentials in prompts unless you control the full inference environment and storage path.
+## License
+Apache-2.0, following the base model license.