Commit 174bbf8 (verified) by my2000cup · Parent: c75e3ea · Update README.md
---
library_name: transformers
license: other
base_model: Qwen/Qwen3-4B
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: train_2025-05-04-15-25-21
  results: []
---

# train_2025-05-02-18-36-44

This model is a fine-tuned version of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the wikipedia_zh and petro_books datasets.

## Model description

Gaia-Petro-LLM is a large language model specialized for the oil and gas industry, fine-tuned from Qwen/Qwen3-4B. It was further pre-trained on a curated ~20GB corpus of petroleum engineering texts, including technical documents, academic papers, and domain literature. The model is designed to support domain experts, researchers, and engineers in petroleum-related tasks, providing high-quality, domain-specific language understanding and generation.

## Model Details

- Base Model: Qwen/Qwen3-4B
- Domain: Oil & Gas / Petroleum Engineering
- Corpus Size: ~20GB (petroleum engineering)
- Languages: Primarily Chinese; domain-specific English supported
- Repository: my2000cup/Gaia-Petro-LLM

## Intended uses & limitations

Intended uses:

- Technical Q&A in petroleum engineering
- Document summarization for oil & gas reports
- Knowledge extraction from unstructured domain texts
- Education & training in oil & gas technologies

Limitations:

- Not suitable for general-domain tasks outside oil & gas.
- May not reflect the latest industry developments (post-2023).
- Should not be used for critical, real-time decision-making without expert review.

## Training and evaluation data

The model was further pre-trained on an in-house text corpus (~20GB) collected from:

- Wikipedia (Chinese, petroleum-related entries)
- Open petroleum engineering books and literature
- Technical standards and manuals

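As a rough illustration of how such a corpus can be prepared, the sketch below packages raw documents as a list of `{"text": ...}` records, the plain-text layout commonly used for LLaMA-Factory pre-training data. The file name, example documents, and field names are hypothetical, not taken from the actual training run:

```python
import json

# Hypothetical sketch: package raw documents as {"text": ...} records,
# one common layout for plain-text pre-training corpora.
# The documents and file name below are illustrative only.
documents = [
    "采油工程是石油工程的一个分支。",  # e.g. a wikipedia_zh petroleum entry
    "Enhanced oil recovery (EOR) methods include thermal, gas, and chemical flooding.",
]

records = [{"text": doc} for doc in documents]
with open("petro_corpus.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Reload to verify the round trip
with open("petro_corpus.json", encoding="utf-8") as f:
    loaded = json.load(f)
print(len(loaded))  # 2
```

With LLaMA-Factory, a file like this would additionally be registered in `dataset_info.json` before training; consult that project's documentation for the exact schema.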
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repository
model_name = "my2000cup/Gaia-Petro-LLM"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare a petroleum engineering prompt
prompt = "What are the main challenges in enhanced oil recovery (EOR) methods?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Optional: enables the model's 'thinking' mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the model's response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024  # adjust as needed
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# Optional: parse 'thinking' content, if your template uses it
try:
    # Find the index after the last </think> token (ID may differ in your tokenizer!)
    think_token_id = 151668  # double-check this ID in your tokenizer
    index = len(output_ids) - output_ids[::-1].index(think_token_id)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking content:", thinking_content)
print("Answer:", content)
```
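The `</think>` parsing at the end of the snippet can be sanity-checked in isolation, without loading the model. A minimal sketch (the token ID 151668 is assumed from Qwen3's tokenizer and should be verified against yours; the helper name is ours):

```python
THINK_END = 151668  # assumed </think> token ID; verify with your tokenizer

def split_thinking(output_ids):
    """Split generated IDs at the last </think> token, mirroring the
    try/except logic in the usage snippet above."""
    try:
        index = len(output_ids) - output_ids[::-1].index(THINK_END)
    except ValueError:
        index = 0  # no thinking block emitted
    return output_ids[:index], output_ids[index:]

thinking, answer = split_thinking([11, 22, THINK_END, 33, 44])
print(thinking)  # [11, 22, 151668]
print(answer)    # [33, 44]
```

Searching the reversed list finds the *last* `</think>`, so any stray occurrences inside the thinking block are handled correctly; when the token is absent, everything is treated as the answer.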

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 16
- num_epochs: 3.0

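For reference, the effective batch size and learning-rate schedule implied by these values can be reproduced with a short sketch (a simplified cosine-with-warmup, not the exact transformers implementation; the total step count is illustrative):

```python
import math

# Values from the hyperparameter list above
train_batch_size = 1
gradient_accumulation_steps = 8
total_train_batch_size = train_batch_size * gradient_accumulation_steps

learning_rate = 2e-05
warmup_steps = 16

def lr_at(step, total_steps):
    """Linear warmup followed by cosine decay (simplified sketch of a
    cosine scheduler; transformers' version differs in minor details)."""
    if step < warmup_steps:
        return learning_rate * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return learning_rate * 0.5 * (1.0 + math.cos(math.pi * progress))

print(total_train_batch_size)   # 8
print(lr_at(16, 1000))          # peak LR after warmup: 2e-05
print(lr_at(1000, 1000))        # decays to ~0.0 at the end of training
```

With per-device batch size 1 and 8 accumulation steps, each optimizer update sees 8 samples, matching the reported total_train_batch_size.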
### Training results



### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1