---
library_name: transformers
license: other
base_model: Qwen/Qwen3-4B
tags:
- llama-factory
- full
- generated_from_trainer
model-index:
- name: train_2025-05-04-15-25-21
  results: []
---


# train_2025-05-04-15-25-21

This model is a fine-tuned version of [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B) on the wikipedia_zh and petro_books datasets.

## Model description

Gaia-Petro-LLM is a large language model specialized in the oil and gas industry, fine-tuned from Qwen/Qwen3-4B. It was further pre-trained on a curated 20GB corpus of petroleum engineering texts, including technical documents, academic papers, and domain literature. The model is designed to support domain experts, researchers, and engineers in petroleum-related tasks, providing high-quality, domain-specific language understanding and generation.

## Model Details

- Base Model: Qwen/Qwen3-4B
- Domain: Oil & Gas / Petroleum Engineering
- Corpus Size: ~20GB (petroleum engineering)
- Languages: Primarily Chinese; domain-specific English supported
- Repository: my2000cup/Gaia-Petro-LLM

## Intended uses & limitations

Intended uses:

- Technical Q&A in petroleum engineering
- Document summarization for oil & gas reports
- Knowledge extraction from unstructured domain texts
- Education & training in oil & gas technologies

Limitations:

- Not suitable for general-domain tasks outside oil & gas.
- May not be up to date with the latest industry developments (post-2023).
- Not to be used for critical, real-time decision-making without expert review.

## Training and evaluation data

The model was further pre-trained on an in-house text corpus (~20GB) collected from:

- Wikipedia (Chinese, petroleum-related entries)
- Open petroleum engineering books and literature
- Technical standards and manuals

## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Replace with your model repository
model_name = "my2000cup/Gaia-LLM-4B"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare a petroleum engineering prompt
prompt = "What are the main challenges in enhanced oil recovery (EOR) methods?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Optional: enables model's 'thinking' mode
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate the model's response
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=1024  # adjust as needed
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

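# Optional: split off the 'thinking' content, if your template emits it
# Look up the </think> token ID instead of hard-coding it (151668 in the Qwen3 tokenizer)
think_token_id = tokenizer.convert_tokens_to_ids("</think>")
try:
    # Index just past the last </think> token in the generated output
    index = len(output_ids) - output_ids[::-1].index(think_token_id)
except ValueError:
    # No </think> token found: treat the whole output as the answer
    index = 0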

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("Thinking content:", thinking_content)
print("Answer:", content)
```
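
The split between thinking content and the final answer can also be factored into a small helper that works on plain token ID lists, which makes it easy to check without loading the model. This is a minimal sketch: the name `split_thinking` is hypothetical, and the `</think>` token ID passed in should be verified against your tokenizer.

```python
from typing import List, Tuple


def split_thinking(output_ids: List[int], think_token_id: int) -> Tuple[List[int], List[int]]:
    """Split generated token IDs at the last occurrence of the </think> token.

    Returns (thinking_ids, answer_ids). If the token is absent,
    thinking_ids is empty and answer_ids is the full output.
    """
    try:
        # Search the reversed list so that the last occurrence wins
        index = len(output_ids) - output_ids[::-1].index(think_token_id)
    except ValueError:
        index = 0
    return output_ids[:index], output_ids[index:]
```

Each returned list can then be passed to `tokenizer.decode(..., skip_special_tokens=True)` as in the snippet above.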

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 2e-05
- train_batch_size: 1
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 8
- optimizer: AdamW (OptimizerNames.ADAMW_TORCH) with betas=(0.9,0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 16
- num_epochs: 3.0
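
To make the arithmetic behind these values concrete, the sketch below shows how the total batch size and a cosine schedule with linear warmup are commonly computed. It assumes the usual Trainer convention (per-device batch size × gradient accumulation steps × number of devices), under which a total of 8 would imply a single device. The function names and the `total_steps` value in the usage note are hypothetical, since the card does not report the number of training steps.

```python
import math


def effective_batch_size(per_device: int, grad_accum: int, num_devices: int = 1) -> int:
    """Total examples per optimizer step (num_devices=1 is an assumption)."""
    return per_device * grad_accum * num_devices


def cosine_lr(step: int, total_steps: int, base_lr: float = 2e-5, warmup_steps: int = 16) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr over the warmup steps
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule that has elapsed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))
```

For example, `effective_batch_size(1, 8)` reproduces the reported total of 8, and `cosine_lr(16, total_steps=1000)` returns the peak rate of 2e-05 at the end of warmup, with `1000` being a placeholder step count.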

### Training results



### Framework versions

- Transformers 4.51.3
- Pytorch 2.6.0+cu124
- Datasets 3.5.0
- Tokenizers 0.21.1