---
license: apache-2.0
language: c++
tags:
- code-generation
- codellama
- peft
- unit-tests
- causal-lm
- text-generation
- lora
base_model: codellama/CodeLlama-7b-hf
model_type: llama
pipeline_tag: text-generation
---

# 🧪 CodeLLaMA Unit Test Generator — LoRA Adapter (v3)

This is a **LoRA adapter** trained on embedded C/C++ functions and their corresponding unit tests using the [`athrv/Embedded_Unittest2`](https://huggingface.co/datasets/athrv/Embedded_Unittest2) dataset.

The adapter is meant to be used **with `codellama/CodeLlama-7b-hf`** and enhances its ability to generate **production-ready C/C++ unit tests**, especially for embedded systems.

---

## 🚀 Key Improvements in `v3`

- ✅ Enhanced instruction prompt tuning using the `<|system|>`, `<|user|>`, and `<|assistant|>` special tokens
- 🧹 Stripped `#include` directives, `main()`, and framework boilerplate from training targets
- 🔚 Appended `// END_OF_TESTS` to each training target so the model learns to terminate cleanly
- 🧠 Fine-tuned with a sequence length of 4096 tokens for long-context unit tests
- 🤖 Optimized for frameworks such as **CppUTest** and **GoogleTest**

---

## 🔧 How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model_id = "codellama/CodeLlama-7b-hf"
adapter_id = "Utkarsh524/codellama_utests_embedded_v3"

# Load tokenizer (the adapter repo ships the tokenizer with the added special tokens)
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
tokenizer.pad_token = tokenizer.eos_token

# Load base model
base = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

# Resize embeddings to match the tokenizer with the added special tokens
base.resize_token_embeddings(len(tokenizer))

# Attach LoRA adapter
model = PeftModel.from_pretrained(base, adapter_id)

# Prepare prompt
prompt = """<|system|>
Generate comprehensive unit tests for C/C++ code. Cover all edge cases, boundary conditions, and error scenarios.
Output Constraints:
1. ONLY include test code (no explanations, headers, or main functions)
2. Start directly with TEST(...)
3. End after last test case
4. Never include framework boilerplate
<|user|>
Create tests for:
int factorial(int n) { return (n <= 1) ? 1 : n * factorial(n - 1); }
<|assistant|>
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Note: "// END_OF_TESTS" tokenizes to several tokens, so it cannot be passed
# as eos_token_id; generate normally and truncate at the marker afterwards.
outputs = model.generate(**inputs, max_new_tokens=512)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text.split("// END_OF_TESTS")[0])
```
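Since the `// END_OF_TESTS` terminator spans multiple tokens, the cleanest way to recover just the tests is a small post-processing step. Below is a minimal sketch; `extract_tests` is a hypothetical helper name, not part of the released code:

```python
# Hypothetical post-processing helper: keep only the generated tests,
# dropping the echoed prompt and the `// END_OF_TESTS` terminator.
END_MARKER = "// END_OF_TESTS"

def extract_tests(decoded: str, prompt: str = "") -> str:
    """Return the test code between the prompt echo and the end marker."""
    if prompt and decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.split(END_MARKER)[0].strip()

sample = "<|assistant|>\nTEST(FactorialTest, HandlesZero) { }\n// END_OF_TESTS extra"
print(extract_tests(sample, prompt="<|assistant|>\n"))
```

Remember that the adapter was trained with `#include` lines, `main()`, and framework boilerplate stripped, so you must wrap the extracted tests in your own CppUTest or GoogleTest harness before compiling.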