---
language:
- si
- ta
- en
license: other
tags:
- qwen
- qlora
- cpt
- sri-lanka
- sinhala
- tamil
model_type: causal-lm
pipeline_tag: text-generation
base_model: qwen3.5-4b
---

# Chat2Find-CPT: Qwen 3.5 4B (Sri Lankan Continued Pre-Training)

Chat2Find-CPT is a specialized version of the Qwen 3.5 4B model, enhanced via **Continued Pre-Training (CPT)** using **QLoRA (4-bit)** to excel in Sri Lankan linguistic and cultural contexts. It features robust proficiency in **Sinhala**, **Tamil**, and English.

## Model Details

- **Developed by:** Sentient (Chat2Find)
- **Base Model:** Qwen 3.5-4B
- **Training Method:** Continued Pre-Training (CPT) with QLoRA
- **Languages:** Sinhala (si), Tamil (ta), English (en)
- **Quantization:** 4-bit (bitsandbytes)

## Technical Specifications

### Training Stack
- **Frameworks:** Unsloth, Hugging Face Transformers, PEFT

### Training Hyperparameters
- **Method:** QLoRA (Rank 32, Alpha 32)
- **Learning Rate:** 5e-5 (Cosine Scheduler)
- **Optimizer:** AdamW (8-bit)
- **Epochs:** 1.0
- **Sequence Length:** 2048 tokens
- **Batch Size:** 2 per device / 8 effective (4 gradient accumulation steps)
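The hyperparameters above can be sketched as a PEFT/Transformers configuration. This is an illustration, not the exact training script; in particular, `target_modules` is an assumption (the projection layers typically adapted in Qwen-family QLoRA runs):

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA adapter config matching the card: rank 32, alpha 32.
# target_modules is an assumption, not confirmed by the card.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Trainer arguments matching the card's stated hyperparameters.
training_args = TrainingArguments(
    output_dir="outputs",
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    optim="adamw_bnb_8bit",          # 8-bit AdamW via bitsandbytes
    num_train_epochs=1.0,
    per_device_train_batch_size=2,   # local batch size 2
    gradient_accumulation_steps=4,   # 2 x 4 = effective batch size 8
    max_steps=-1,
)
```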

### Dataset
The model underwent Continued Pre-Training on a 1.38 GB unstructured text corpus, densely packed into fixed-length sequences:
- **Size:** 270,000 packed sequences of 2048 tokens each (~550 million total Qwen tokens / approx. 255 million words).
- **Epochs:** 1 epoch (standard pre-training practice to reduce overfitting).
- **Content:** Sri Lankan news & media, cultural context, and domain-specific raw web data.
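The packing arithmetic behind the quoted token count:

```python
# Sanity-check the dataset figures quoted above:
sequences = 270_000   # packed sequences
seq_len = 2048        # tokens per sequence

total_tokens = sequences * seq_len
print(f"{total_tokens:,} tokens")  # 552,960,000 -> the "~550 million" figure
```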

## Capabilities

Chat2Find-CPT excels at:
1. **Sinhala & Tamil Generation:** Fluent and contextually relevant text generation.
2. **Code-Switching:** Handling natural language mixes common in Sri Lankan communication.
3. **Local Knowledge:** Understanding entities, locations, and cultural references specific to Sri Lanka.

## Usage

### Using Unsloth (Recommended for Speed)

```python
from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Chat2Find/Chat2Find-CPT",
    max_seq_length = 2048,
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)

prompt = "ශ්‍රී ලංකාව ගැන කෙටි විස්තරයක්:"

inputs = tokenizer(
    text=[prompt],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode the generated text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Using Standard Transformers (GPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Chat2Find/Chat2Find-CPT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Note: the published weights are a merged 16-bit set; they can also be
# loaded in 4-bit/8-bit via bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "ශ්‍රී ලංකාව ගැන කෙටි විස්තරයක්:"
inputs = tokenizer(text=[prompt], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
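Since the weights are merged 16-bit, they can be re-quantized at load time. A minimal sketch using `BitsAndBytesConfig`; the NF4 settings shown are typical QLoRA-style defaults, not values confirmed by this card:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization config (assumed typical settings, not from the card)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Chat2Find/Chat2Find-CPT",
    quantization_config=bnb_config,
    device_map="auto",
)
```

For 8-bit instead, pass `BitsAndBytesConfig(load_in_8bit=True)`.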

### Running on CPU Only

If you do not have a dedicated GPU, you can explicitly map the model to CPU. Note that inference will be significantly slower.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Chat2Find/Chat2Find-CPT"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Force the model to load into CPU RAM
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map="cpu", 
    torch_dtype="auto"  # Loads the checkpoint's stored dtype (e.g. bfloat16) to save RAM
)

prompt = "ශ්‍රී ලංකාව ගැන කෙටි විස්තරයක්:"
inputs = tokenizer(text=[prompt], return_tensors="pt").to("cpu")

outputs = model.generate(**inputs, max_new_tokens=128)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

## Limitations & Bias

While Chat2Find-CPT is significantly better at local languages than the base Qwen model, it may still exhibit biases present in the training data or the base model's internal knowledge. Users are encouraged to perform their own safety checks for specific deployment scenarios.

## License

This model is subject to the original **Qwen License Agreement**.