siddhantoon committed
Commit a2ab625 · verified · 1 Parent(s): df3ce44

Upload folder using huggingface_hub

__pycache__/server.cpython-310.pyc CHANGED
Binary files a/__pycache__/server.cpython-310.pyc and b/__pycache__/server.cpython-310.pyc differ
 
attention_mask_research.md ADDED
@@ -0,0 +1,186 @@
+ # Attention Masks and Pad Tokens in Transformer Generation: Research Questions
+
+ ## Core Problem Statement
+
+ When running transformer models (specifically Llama-3.2-1B-Instruct) for text generation, we see warnings about a missing attention mask and pad token, even for a single input sequence. Generation outputs are also inconsistent across runs despite identical inputs. Because the calls below sample with `do_sample=True`, some run-to-run variation is expected regardless of the mask, so the mask's contribution still has to be isolated.
+
+ ### Warning Messages Observed
+ ```
+ The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
+ Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
+ The attention mask is not set and cannot be inferred from input because pad token is same as eos token.
+ ```
+
+ ## Key Research Questions
+
+ ### 1. Why do single inputs require attention masks?
+ **Initial Assumption**: Single sequences without padding shouldn't need attention masks.
+ **Observed Reality**: Even single inputs show different generation outputs when attention masks are missing.
+
+ ### 2. What is the relationship between pad tokens and attention masks?
+ **Question**: How do `pad_token_id` and `attention_mask` work together in the generation process?
+
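One way to make question 2 concrete is a minimal, library-free sketch of how a padded batch and its 0/1 mask are built together. The helper name `pad_batch`, the toy token ids, and the choice of left padding are assumptions made for illustration; only the pad id 128001 comes from the warning above.

```python
from typing import List, Tuple


def pad_batch(seqs: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Left-pad every sequence to the longest length and build the matching 0/1 mask."""
    width = max(len(s) for s in seqs)
    input_ids, attention_mask = [], []
    for s in seqs:
        n_pad = width - len(s)
        # Padding goes on the left so the last position is always a real token.
        input_ids.append([pad_id] * n_pad + s)
        # 0 marks a padding position, 1 marks a real token.
        attention_mask.append([0] * n_pad + [1] * len(s))
    return input_ids, attention_mask


PAD_ID = 128001  # assumption: reuse the EOS id reported in the warning
ids, mask = pad_batch([[11, 12, 13], [21]], PAD_ID)
print(ids)   # [[11, 12, 13], [128001, 128001, 21]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
```

The pad id only fills the empty slots; the mask is what tells the attention layers to ignore them. For a single unpadded sequence the mask is all ones.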
+ ### 3. Why does pad_token_id = eos_token_id cause issues?
+ **Specific Issue**: When the padding token equals the end-of-sequence token, what ambiguity does this create?
+
+ ## Code Analysis
+
+ ### Current Implementation (Problematic)
+ ```python
+ def chat_current(system_prompt: str, user_prompt: str) -> str:
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt},
+     ]
+
+     # Only returns input_ids tensor
+     input_ids = tok.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         return_tensors="pt"
+     ).to(lm.device)
+
+     with torch.inference_mode():
+         output_ids = lm.generate(
+             input_ids,  # Missing: attention_mask, pad_token_id
+             max_new_tokens=2048,
+             do_sample=True,
+             temperature=0.2,
+             repetition_penalty=1.1,
+             top_k=100,
+             top_p=0.95,
+         )
+
+     return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
+ ```
+
+ ### Fixed Implementation
+ ```python
+ def chat_fixed(system_prompt: str, user_prompt: str) -> str:
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt},
+     ]
+
+     # Returns dictionary with input_ids AND attention_mask
+     inputs = tok.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         return_tensors="pt",
+         return_dict=True  # KEY CHANGE: Get both components
+     )
+
+     input_ids = inputs["input_ids"].to(lm.device)
+     attention_mask = inputs["attention_mask"].to(lm.device)
+
+     with torch.inference_mode():
+         output_ids = lm.generate(
+             input_ids=input_ids,
+             attention_mask=attention_mask,  # Explicit attention guidance
+             pad_token_id=tok.eos_token_id,  # Explicit pad token
+             max_new_tokens=2048,
+             do_sample=True,
+             temperature=0.2,
+             repetition_penalty=1.1,
+             top_k=100,
+             top_p=0.95,
+         )
+
+     return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
+ ```
+
+ ### Model and Tokenizer Setup
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ model_name = "models/Llama-3.2-1B-Instruct"
+ tok = AutoTokenizer.from_pretrained(model_name)
+ # Critical: Set pad token if not available
+ if tok.pad_token is None:
+     tok.pad_token = tok.eos_token
+
+ lm = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+ ).eval()
+ ```
+
+ ## Observed Behavioral Differences
+
+ ### Input Structure Analysis
+ ```python
+ # Single input contains multiple components:
+ messages = [
+     {"role": "system", "content": "You are a helpful assistant..."},
+     {"role": "user", "content": "What is the capital of France?"},
+ ]
+
+ # After apply_chat_template, becomes token sequence:
+ # [system_tokens, user_tokens, assistant_start_token]
+ ```
+
+ ## Technical Hypotheses for Investigation
+
+ ### Hypothesis 1: Internal Masking Ambiguity
+ When `attention_mask` is missing, the model cannot distinguish between:
+ - Real input tokens that should influence generation
+ - Structural tokens (system prompts, role markers)
+ - Token boundaries between different message roles
+
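A quick way to probe Hypothesis 1 is to look at what a mask does to a single row of attention scores. In the usual formulation the mask only marks padding positions; it carries no information about roles or message boundaries. The sketch below is a plain-Python softmax, not the model's actual kernel, and the function name and scores are made up for illustration.

```python
import math
from typing import List, Optional


def attention_weights(scores: List[float], mask: Optional[List[int]] = None) -> List[float]:
    """Softmax over one row of attention scores; positions with mask == 0 get weight 0."""
    if mask is None:
        mask = [1] * len(scores)  # no mask supplied: treat every position as real
    # Masked positions are pushed to -inf so exp() maps them to exactly 0.
    masked = [s if m else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


scores = [0.5, 1.5, -0.3, 2.0]  # hypothetical scores for one query position
no_mask = attention_weights(scores)
all_ones = attention_weights(scores, [1, 1, 1, 1])
padded = attention_weights(scores, [0, 1, 1, 1])

print(no_mask == all_ones)  # True
print(padded[0])            # 0.0
```

Under these assumptions an all-ones mask and no mask give the same weights, which suggests that for a single unpadded prompt the mask alone should not change the logits. Whether that holds for the real model is exactly what the comparison script needs to test with sampling disabled or seeded.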
+ ### Hypothesis 2: EOS Token Dual Purpose Confusion
+ When `pad_token_id == eos_token_id`, the model faces ambiguity:
+ ```python
+ # Same token (128001) serves dual purposes:
+ # 1. End of sequence marker
+ # 2. Padding token for batch processing
+ # From input_ids alone, generate() cannot tell which purpose a given
+ # occurrence serves, so it cannot infer the attention mask
+ ```
+
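The ambiguity in Hypothesis 2 can be sketched without any library. The function below is loosely modelled on the fallback described by the third warning message; the exact logic inside any given `transformers` version is an assumption here, and `infer_attention_mask` is a hypothetical name.

```python
from typing import List, Optional


def infer_attention_mask(input_ids: List[int], pad_id: Optional[int], eos_id: Optional[int]) -> List[int]:
    """Guess a 0/1 mask from token ids alone; fall back to all ones when the guess is unsafe."""
    can_infer = pad_id is not None and pad_id in input_ids and pad_id != eos_id
    if can_infer:
        # Every occurrence of the pad id is unambiguous padding.
        return [0 if t == pad_id else 1 for t in input_ids]
    # pad == eos (or no pad id): an occurrence could be real, so attend to everything.
    return [1] * len(input_ids)


EOS = 128001  # id taken from the warning message

# Distinct pad id: the two leading pads are masked out.
print(infer_attention_mask([0, 0, 5, 6, 7], pad_id=0, eos_id=EOS))        # [0, 0, 1, 1, 1]

# pad == eos: the same layout can no longer be recognised as padding.
print(infer_attention_mask([EOS, EOS, 5, 6, 7], pad_id=EOS, eos_id=EOS))  # [1, 1, 1, 1, 1]
```

Passing the mask explicitly, as the fixed implementation does, sidesteps the guess entirely.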
+ ### Hypothesis 3: Autoregressive Generation Context Boundary Issues
+ During generation, the model needs to know:
+ - Which input tokens provide valid context for next-token prediction
+ - Where the "prompt" ends and "generation" begins
+ - How to weight attention across different input components
+
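On the prompt/generation boundary: the implementations above recover the generated part by slicing the output at the prompt length (`output_ids[0][input_ids.shape[-1]:]`). A minimal list-based sketch of that step follows, with made-up token ids and a hypothetical helper name.

```python
from typing import List, Set


def new_tokens(output_ids: List[int], prompt_len: int, special_ids: Set[int]) -> List[int]:
    """Drop the echoed prompt, then drop special tokens such as EOS or padding."""
    generated = output_ids[prompt_len:]
    return [t for t in generated if t not in special_ids]


EOS_ID = 128001             # id taken from the warning message
prompt = [101, 102, 103]    # hypothetical prompt token ids
output = prompt + [7, 8, 9, EOS_ID]

print(new_tokens(output, len(prompt), {EOS_ID}))  # [7, 8, 9]
```

The boundary here is purely positional: it comes from the prompt length, not from the attention mask.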
+ ## Research Objectives
+
+ ### Primary Questions
+ 1. **Mechanism Analysis**: How exactly does a missing `attention_mask` affect the internal attention computation?
+ 2. **Consistency Impact**: Why do identical inputs produce different outputs without proper masking?
+ 3. **Single vs Batch Behavior**: What differences exist between single sequence and batched sequence processing?
+
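For the consistency question, it helps to separate sampling randomness from any masking effect. The sketch below is a stdlib-only temperature sampler with an explicit seed, using the `temperature=0.2` value from the implementations above; the logits and function names are invented for illustration and this is not how `generate()` is implemented.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def sample_sequence(logits: List[float], temperature: float, seed: int, steps: int = 20) -> List[int]:
    """Sample `steps` tokens from fixed logits using a private, seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, temperature, rng) for _ in range(steps)]


logits = [2.0, 1.8, 1.5, 0.2]  # hypothetical next-token logits
run_a = sample_sequence(logits, temperature=0.2, seed=0)
run_b = sample_sequence(logits, temperature=0.2, seed=0)

print(run_a == run_b)  # True
```

With the seed fixed the two runs match exactly; without one, identical inputs can yield different samples. A fair comparison of the masked and unmasked paths would therefore fix the seed before each call or turn sampling off.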
+ ### Secondary Questions
+ 1. **Model-Specific Behavior**: Do different transformer architectures handle missing attention masks differently?
+ 2. **Generation Parameter Interaction**: How do attention mask issues interact with sampling parameters (temperature, top_p, etc.)?
+ 3. **Performance Impact**: What computational overhead does proper attention masking add?
+
+ ## Key Technical Areas for Deep Research
+
+ ### Attention Mechanism Internals
+ - How attention weights are computed with/without explicit masks
+ - Impact on multi-head attention distributions
+ - Interaction with causal masking in autoregressive models
+
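The interaction between the padding mask and the causal mask can be sketched as a boolean matrix. This is a simplified illustration and not the layout any particular library uses internally; `combined_mask` is a hypothetical helper.

```python
from typing import List


def combined_mask(attention_mask: List[int]) -> List[List[bool]]:
    """allowed[q][k] is True when query position q may attend to key position k."""
    n = len(attention_mask)
    return [
        # Causal rule: k <= q. Padding rule: key k must be a real token.
        [k <= q and attention_mask[k] == 1 for k in range(n)]
        for q in range(n)
    ]


# Unpadded sequence: the result is the plain lower-triangular causal mask.
print(combined_mask([1, 1, 1]))
# [[True, False, False], [True, True, False], [True, True, True]]

# One left pad: column 0 is blocked for every query.
print(combined_mask([0, 1, 1]))
# [[False, False, False], [False, True, False], [False, True, True]]
```

In the second case the row for the padded query is empty; how such fully masked rows are handled is implementation-specific and is not modelled here.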
163
+ ### Tokenizer Behavior
164
+ - How `apply_chat_template` constructs input sequences
165
+ - Default attention mask generation behavior
166
+ - Role of special tokens in attention computation
167
+
168
+ ### Generation Process
169
+ - How `model.generate()` handles missing parameters
170
+ - Internal assumptions and fallback behaviors
171
+ - Impact on sampling and beam search algorithms
172
+
173
+ ## Expected Research Outcomes
174
+
175
+ Understanding of:
176
+ 1. Exact mechanism causing output inconsistency
177
+ 2. Best practices for single sequence generation
178
+ 3. Relationship between attention masking and generation quality
179
+ 4. Guidelines for production transformer deployment
180
+
181
+ ## References for Deep Research
182
+
183
+ - Hugging Face Transformers documentation on attention masks
184
+ - Technical blogs on transformer attention mechanisms (2024)
185
+ - Community discussions on pad token vs attention mask differences
186
+ - Official model documentation for Llama architecture attention handling
compare_generation.py ADDED
@@ -0,0 +1,129 @@
+ #!/usr/bin/env python3
+
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer (same as server.py)
+ model_name = "models/Llama-3.2-1B-Instruct"
+ tok = AutoTokenizer.from_pretrained(model_name)
+ lm = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,
+     device_map="cuda",
+ ).eval()
+
+ def chat_current(system_prompt: str, user_prompt: str) -> str:
+     """
+     Current implementation (same as server.py) - will show warnings
+     """
+     print("🔴 Running CURRENT implementation (with warnings)...")
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt},
+     ]
+
+     input_ids = tok.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         return_tensors="pt"
+     ).to(lm.device)
+
+     with torch.inference_mode():
+         output_ids = lm.generate(
+             input_ids,  # No attention_mask, no pad_token_id
+             max_new_tokens=2048,
+             do_sample=True,
+             temperature=0.2,
+             repetition_penalty=1.1,
+             top_k=100,
+             top_p=0.95,
+         )
+
+     answer = tok.decode(
+         output_ids[0][input_ids.shape[-1]:],
+         skip_special_tokens=True,
+         clean_up_tokenization_spaces=True,
+     )
+     return answer.strip()
+
+
+ def chat_fixed(system_prompt: str, user_prompt: str) -> str:
+     """
+     Fixed implementation - proper attention mask and pad token
+     """
+     print("🟢 Running FIXED implementation (no warnings)...")
+
+     messages = [
+         {"role": "system", "content": system_prompt},
+         {"role": "user", "content": user_prompt},
+     ]
+
+     # Get both input_ids and attention_mask
+     inputs = tok.apply_chat_template(
+         messages,
+         add_generation_prompt=True,
+         return_tensors="pt",
+         return_dict=True  # Returns dict with input_ids and attention_mask
+     )
+
+     # Move to device
+     input_ids = inputs["input_ids"].to(lm.device)
+     attention_mask = inputs["attention_mask"].to(lm.device)
+
+     with torch.inference_mode():
+         output_ids = lm.generate(
+             input_ids=input_ids,
+             attention_mask=attention_mask,  # Proper attention mask
+             pad_token_id=tok.eos_token_id,  # Explicit pad token
+             max_new_tokens=2048,
+             do_sample=True,
+             temperature=0.2,
+             repetition_penalty=1.1,
+             top_k=100,
+             top_p=0.95,
+         )
+
+     answer = tok.decode(
+         output_ids[0][input_ids.shape[-1]:],
+         skip_special_tokens=True,
+         clean_up_tokenization_spaces=True,
+     )
+     return answer.strip()
+
+
+ def compare_generations():
+     """Compare both implementations"""
+     system_prompt = "You are a helpful assistant who tries to help answer the user's question."
+     user_prompt = "Create a report on anxiety in work. How do I manage time and stress effectively?"
+
+     print("=" * 60)
+     print("COMPARING GENERATION METHODS")
+     print("=" * 60)
+     print(f"System: {system_prompt}")
+     print(f"User: {user_prompt}")
+     print("=" * 60)
+
+     # Test current implementation
+     print("\n" + "=" * 60)
+     current_output = chat_current(system_prompt, user_prompt)
+     print(f"CURRENT OUTPUT:\n{current_output}")
+
+     print("\n" + "=" * 60)
+     # Test fixed implementation
+     fixed_output = chat_fixed(system_prompt, user_prompt)
+     print(f"FIXED OUTPUT:\n{fixed_output}")
+
+     print("\n" + "=" * 60)
+     print("COMPARISON:")
+     print(f"Outputs are identical: {current_output == fixed_output}")
+     print(f"Current length: {len(current_output)} chars")
+     print(f"Fixed length: {len(fixed_output)} chars")
+
+
+ if __name__ == "__main__":
+     # Set pad token for the fixed version
+     if tok.pad_token is None:
+         tok.pad_token = tok.eos_token
+
+     compare_generations()
server.py CHANGED
@@ -51,15 +51,23 @@ def chat(system_prompt: str, user_prompt: str) -> str:
 
      # `add_generation_prompt=True` automatically appends the
      # <|start_header_id|>assistant … header so the model knows to respond.
-     input_ids = tok.apply_chat_template(
+     # Get both input_ids and attention_mask
+     inputs = tok.apply_chat_template(
          messages,
          add_generation_prompt=True,
-         return_tensors="pt"
-     ).to(lm.device)
+         return_tensors="pt",
+         return_dict=True  # Returns dict with input_ids and attention_mask
+     )
+
+     # Move to device
+     input_ids = inputs["input_ids"].to(lm.device)
+     attention_mask = inputs["attention_mask"].to(lm.device)
 
      with torch.inference_mode():
          output_ids = lm.generate(
-             input_ids,
+             input_ids=input_ids,
+             attention_mask=attention_mask,  # Proper attention mask
+             pad_token_id=tok.eos_token_id,  # Explicit pad token
              max_new_tokens=2048,
              do_sample=True,
              temperature=0.2,