BASE_CORPUS = """
System: You are a tiny local assistant.
User: hello
Assistant: Hello. I am a small local language model built from scratch in Python.

User: what can you do
Assistant: I can generate short text, continue patterns, and answer simple prompts based on my tiny training data.

User: explain machine learning
Assistant: Machine learning is a way to build models from data so they can make predictions or find patterns.

User: explain deep learning
Assistant: Deep learning uses neural networks with many layers to learn useful representations from data.

User: explain natural language processing
Assistant: Natural language processing studies how computers read, generate, and understand human language.

User: what is regression
Assistant: Regression predicts a continuous value, such as price, score, temperature, or demand.

User: what is classification
Assistant: Classification predicts a label or category, such as spam or not spam.

User: what is clustering
Assistant: Clustering groups similar data points without needing labeled targets.

User: what is preprocessing
Assistant: Preprocessing cleans and transforms raw data before modeling.

User: what is model evaluation
Assistant: Model evaluation checks how well a model performs on data it did not train on.

User: write a short greeting
Assistant: Hello and welcome. I am ready to help with a short answer.

User: write a short poem
Assistant: Soft morning light on quiet ground, small thoughts wake without a sound.

User: what is Python
Assistant: Python is a popular programming language known for readable syntax and strong libraries.

User: what is a neural network
Assistant: A neural network is a model made of layers that transform input data into useful predictions.

User: how do I learn coding
Assistant: Start small, practice every day, build mini projects, and review your mistakes carefully.

User: summarize good study habits
Assistant: Good study habits include clear goals, spaced practice, breaks, active recall, and consistent review.

User: give motivation
Assistant: Progress grows from small repeated effort, not from waiting for the perfect moment.

User: what is overfitting
Assistant: Overfitting happens when a model memorizes training data and performs poorly on new data.

User: what is underfitting
Assistant: Underfitting happens when a model is too simple to capture important patterns.

User: what is a dataset
Assistant: A dataset is a collection of examples or records used for analysis or modeling.

User: what is a feature
Assistant: A feature is an input variable used by a model to make a prediction.

User: what is a target
Assistant: A target is the value or label the model tries to predict.

User: give a coding tip
Assistant: Keep functions small, name things clearly, and test simple cases first.

User: tell me about local models
Assistant: Local models run on your own machine, giving you direct control over code, files, and training.

User: how should I debug code
Assistant: Reproduce the issue, isolate the failing step, inspect inputs and outputs, and test one fix at a time.

User: what is attention
Assistant: Attention lets a model focus on the most relevant parts of earlier tokens when predicting the next token.

User: what is a tokenizer
Assistant: A tokenizer turns text into smaller pieces so a model can process language as numbers.

User: what is a transformer
Assistant: A transformer is a neural architecture that uses attention to model relationships between tokens.

User: write a short answer about data science
Assistant: Data science combines statistics, programming, and domain knowledge to learn from data.
""".strip()


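def build_training_text(extra_text: str = "") -> str:
    """Return BASE_CORPUS, optionally followed by caller-supplied text.

    Whitespace runs in extra_text (newlines included) collapse to single spaces.
    """
    # Treat None like an empty string, then normalize whitespace.
    extra = " ".join((extra_text or "").split())
    if not extra:
        return BASE_CORPUS
    return BASE_CORPUS + "\n\n" + extra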
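

# Illustrative sketch, not part of the original module: one possible way to
# split the "User:" / "Assistant:" layout used in BASE_CORPUS into
# (prompt, reply) pairs, for example to spot-check the corpus by eye.
# The name `split_prompt_pairs` is hypothetical. The sketch assumes every turn
# sits on its own line, as in the corpus above; text that has been collapsed
# onto a single line would not be split into turns.
def split_prompt_pairs(text: str) -> list:
    pairs = []
    prompt = None  # most recent "User:" line that has no reply yet
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("User:"):
            prompt = line[len("User:"):].strip()
        elif line.startswith("Assistant:") and prompt is not None:
            pairs.append((prompt, line[len("Assistant:"):].strip()))
            prompt = None
        # Any other line (blank lines, the "System:" line) is skipped.
    return pairs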