Upload folder using huggingface_hub
- __pycache__/server.cpython-310.pyc +0 -0
- attention_mask_research.md +186 -0
- compare_generation.py +129 -0
- server.py +12 -4
__pycache__/server.cpython-310.pyc CHANGED
Binary files a/__pycache__/server.cpython-310.pyc and b/__pycache__/server.cpython-310.pyc differ
attention_mask_research.md ADDED
@@ -0,0 +1,186 @@
+# Attention Masks and Pad Tokens in Transformer Generation: Research Questions
+
+## Core Problem Statement
+
+When running transformer models (specifically Llama-3.2-1B-Instruct) for text generation, we encounter warnings about missing attention masks and pad tokens, even for single input sequences. This leads to inconsistent generation outputs despite identical inputs.
+
+### Warning Messages Observed
+```
+The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
+Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
+The attention mask is not set and cannot be inferred from input because pad token is same as eos token.
+```
+
+## Key Research Questions
+
+### 1. Why do single inputs require attention masks?
+**Initial Assumption**: Single sequences without padding shouldn't need attention masks.
+**Observed Reality**: Even single inputs show different generation outputs when attention masks are missing.
+
+### 2. What is the relationship between pad tokens and attention masks?
+**Question**: How do pad_token_id and attention_mask work together in the generation process?
+
+### 3. Why does pad_token_id = eos_token_id cause issues?
+**Specific Issue**: When padding token equals end-of-sequence token, what ambiguity does this create?
+
+## Code Analysis
+
+### Current Implementation (Problematic)
+```python
+def chat_current(system_prompt: str, user_prompt: str) -> str:
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+
+    # Only returns input_ids tensor
+    input_ids = tok.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_tensors="pt"
+    ).to(lm.device)
+
+    with torch.inference_mode():
+        output_ids = lm.generate(
+            input_ids,  # Missing: attention_mask, pad_token_id
+            max_new_tokens=2048,
+            do_sample=True,
+            temperature=0.2,
+            repetition_penalty=1.1,
+            top_k=100,
+            top_p=0.95,
+        )
+
+    return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
+```
+
+### Fixed Implementation
+```python
+def chat_fixed(system_prompt: str, user_prompt: str) -> str:
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+
+    # Returns dictionary with input_ids AND attention_mask
+    inputs = tok.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_tensors="pt",
+        return_dict=True  # KEY CHANGE: Get both components
+    )
+
+    input_ids = inputs["input_ids"].to(lm.device)
+    attention_mask = inputs["attention_mask"].to(lm.device)
+
+    with torch.inference_mode():
+        output_ids = lm.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,  # Explicit attention guidance
+            pad_token_id=tok.eos_token_id,  # Explicit pad token
+            max_new_tokens=2048,
+            do_sample=True,
+            temperature=0.2,
+            repetition_penalty=1.1,
+            top_k=100,
+            top_p=0.95,
+        )
+
+    return tok.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
+```
+
+### Model and Tokenizer Setup
+```python
+model_name = "models/Llama-3.2-1B-Instruct"
+tok = AutoTokenizer.from_pretrained(model_name)
+# Critical: Set pad token if not available
+if tok.pad_token is None:
+    tok.pad_token = tok.eos_token
+
+lm = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda",
+).eval()
+```
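+
+A quick sanity check after this setup (illustrative; the exact ids depend on the checkpoint's tokenizer config) shows the collision the warnings refer to:
+```python
+# After the fallback above, the pad token is literally the EOS token
+print(tok.pad_token_id, tok.eos_token_id)    # same id for both
+print(tok.pad_token_id == tok.eos_token_id)  # True -> padding cannot be told apart from EOS
+```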
+
+## Observed Behavioral Differences
+
+### Input Structure Analysis
+```python
+# Single input contains multiple components:
+messages = [
+    {"role": "system", "content": "You are a helpful assistant..."},
+    {"role": "user", "content": "What is the capital of France?"},
+]
+
+# After apply_chat_template, becomes token sequence:
+# [system_tokens, user_tokens, assistant_start_token]
+```
+
+## Technical Hypotheses for Investigation
+
+### Hypothesis 1: Internal Masking Ambiguity
+When attention_mask is missing, the model cannot distinguish between:
+- Real input tokens that should influence generation
+- Structural tokens (system prompts, role markers)
+- Token boundaries between different message roles
+
+### Hypothesis 2: EOS Token Dual Purpose Confusion
+When `pad_token_id == eos_token_id`, the model faces ambiguity:
+```python
+# Same token (128001) serves dual purposes:
+# 1. End of sequence marker
+# 2. Padding token for batch processing
+# Model cannot infer which purpose applies in context
+```
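+
+One plausible reading of the last warning above (a sketch of the idea, not the library's exact code): without an explicit mask, the only way to infer one is to flag every position equal to `pad_token_id` as padding, and with `pad_token_id == eos_token_id` that rule would also mask any genuine EOS token, so generation has to fall back to attending to everything:
+```python
+import torch
+
+def infer_attention_mask(input_ids: torch.Tensor, pad_token_id, eos_token_id) -> torch.Tensor:
+    """Illustrative only: how a mask *could* be inferred when none is passed."""
+    if pad_token_id is not None and pad_token_id != eos_token_id:
+        # Pads are unambiguous: everything that is not a pad is a real token.
+        return (input_ids != pad_token_id).long()
+    # pad == eos (or no pad at all): a real EOS in the prompt would look like
+    # padding, so the safe fallback is "treat every position as real".
+    return torch.ones_like(input_ids)
+```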
+
+### Hypothesis 3: Autoregressive Generation Context Boundary Issues
+During generation, the model needs to know:
+- Which input tokens provide valid context for next token prediction
+- Where the "prompt" ends and "generation" begins
+- How to weight attention across different input components
+
+## Research Objectives
+
+### Primary Questions
+1. **Mechanism Analysis**: How exactly does a missing attention_mask affect the internal attention computation?
+2. **Consistency Impact**: Why do identical inputs produce different outputs without proper masking?
+3. **Single vs Batch Behavior**: What differences exist between single-sequence and batched-sequence processing?
+
+### Secondary Questions
+1. **Model-Specific Behavior**: Do different transformer architectures handle missing attention masks differently?
+2. **Generation Parameter Interaction**: How do attention mask issues interact with sampling parameters (temperature, top_p, etc.)?
+3. **Performance Impact**: What computational overhead does proper attention masking add?
+
+## Key Technical Areas for Deep Research
+
+### Attention Mechanism Internals
+- How attention weights are computed with/without explicit masks (see the sketch below)
+- Impact on multi-head attention distributions
+- Interaction with causal masking in autoregressive models
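+
+A minimal sketch of how the two masks combine (simplified single-head scaled dot-product attention, no dropout; illustrative rather than the model's actual implementation):
+```python
+import torch
+import torch.nn.functional as F
+
+def masked_attention(q, k, v, attention_mask):
+    # q, k, v: (batch, seq, dim); attention_mask: (batch, seq), 1 = real token, 0 = pad
+    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)             # (batch, seq, seq)
+    causal = torch.tril(torch.ones(q.shape[1], q.shape[1],
+                                   device=q.device)).bool()             # no attending to the future
+    padding = attention_mask[:, None, :].bool()                         # broadcast over query positions
+    scores = scores.masked_fill(~(causal & padding), float("-inf"))     # disallowed keys get -inf
+    return F.softmax(scores, dim=-1) @ v
+```
+If the padding term is dropped, positions that should be ignored keep non-zero weight after the softmax and can shift the next-token distribution.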
+
+### Tokenizer Behavior
+- How `apply_chat_template` constructs input sequences (see the snippet below)
+- Default attention mask generation behavior
+- Role of special tokens in attention computation
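+
+A quick way to inspect both behaviors (assuming the `tok` tokenizer loaded above):
+```python
+messages = [
+    {"role": "system", "content": "You are a helpful assistant."},
+    {"role": "user", "content": "What is the capital of France?"},
+]
+
+# Text view: shows the role headers / special tokens the template inserts
+print(tok.apply_chat_template(messages, add_generation_prompt=True, tokenize=False))
+
+# Tensor view: return_dict=True yields the mask alongside the ids;
+# for a single, unpadded sequence the mask should be all ones
+enc = tok.apply_chat_template(messages, add_generation_prompt=True,
+                              return_tensors="pt", return_dict=True)
+print(enc["input_ids"].shape, enc["attention_mask"])
+```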
+
+### Generation Process
+- How `model.generate()` handles missing parameters
+- Internal assumptions and fallback behaviors
+- Impact on sampling and beam search algorithms
+
+## Expected Research Outcomes
+
+Understanding of:
+1. Exact mechanism causing output inconsistency
+2. Best practices for single sequence generation
+3. Relationship between attention masking and generation quality
+4. Guidelines for production transformer deployment
+
+## References for Deep Research
+
+- Hugging Face Transformers documentation on attention masks
+- Technical blogs on transformer attention mechanisms (2024)
+- Community discussions on pad token vs attention mask differences
+- Official model documentation for Llama architecture attention handling
compare_generation.py ADDED
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+# Load model and tokenizer (same as server.py)
+model_name = "models/Llama-3.2-1B-Instruct"
+tok = AutoTokenizer.from_pretrained(model_name)
+lm = AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="cuda",
+).eval()
+
+def chat_current(system_prompt: str, user_prompt: str) -> str:
+    """
+    Current implementation (same as server.py) - will show warnings
+    """
+    print("🔴 Running CURRENT implementation (with warnings)...")
+
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+
+    input_ids = tok.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_tensors="pt"
+    ).to(lm.device)
+
+    with torch.inference_mode():
+        output_ids = lm.generate(
+            input_ids,  # No attention_mask, no pad_token_id
+            max_new_tokens=2048,
+            do_sample=True,
+            temperature=0.2,
+            repetition_penalty=1.1,
+            top_k=100,
+            top_p=0.95,
+        )
+
+    answer = tok.decode(
+        output_ids[0][input_ids.shape[-1]:],
+        skip_special_tokens=True,
+        clean_up_tokenization_spaces=True,
+    )
+    return answer.strip()
+
+
+def chat_fixed(system_prompt: str, user_prompt: str) -> str:
+    """
+    Fixed implementation - proper attention mask and pad token
+    """
+    print("🟢 Running FIXED implementation (no warnings)...")
+
+    messages = [
+        {"role": "system", "content": system_prompt},
+        {"role": "user", "content": user_prompt},
+    ]
+
+    # Get both input_ids and attention_mask
+    inputs = tok.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        return_tensors="pt",
+        return_dict=True  # Returns dict with input_ids and attention_mask
+    )
+
+    # Move to device
+    input_ids = inputs["input_ids"].to(lm.device)
+    attention_mask = inputs["attention_mask"].to(lm.device)
+
+    with torch.inference_mode():
+        output_ids = lm.generate(
+            input_ids=input_ids,
+            attention_mask=attention_mask,  # Proper attention mask
+            pad_token_id=tok.eos_token_id,  # Explicit pad token
+            max_new_tokens=2048,
+            do_sample=True,
+            temperature=0.2,
+            repetition_penalty=1.1,
+            top_k=100,
+            top_p=0.95,
+        )
+
+    answer = tok.decode(
+        output_ids[0][input_ids.shape[-1]:],
+        skip_special_tokens=True,
+        clean_up_tokenization_spaces=True,
+    )
+    return answer.strip()
+
+
+def compare_generations():
+    """Compare both implementations"""
+    system_prompt = "You are a helpful assistant who tries to help answer the user's question."
+    user_prompt = "Create a report on anxiety in work. How do I manage time and stress effectively?"
+
+    print("=" * 60)
+    print("COMPARING GENERATION METHODS")
+    print("=" * 60)
+    print(f"System: {system_prompt}")
+    print(f"User: {user_prompt}")
+    print("=" * 60)
+
+    # Test current implementation
+    print("\n" + "=" * 60)
+    current_output = chat_current(system_prompt, user_prompt)
+    print(f"CURRENT OUTPUT:\n{current_output}")
+
+    print("\n" + "=" * 60)
+    # Test fixed implementation
+    fixed_output = chat_fixed(system_prompt, user_prompt)
+    print(f"FIXED OUTPUT:\n{fixed_output}")
+
+    print("\n" + "=" * 60)
+    print("COMPARISON:")
+    print(f"Outputs are identical: {current_output == fixed_output}")
+    print(f"Current length: {len(current_output)} chars")
+    print(f"Fixed length: {len(fixed_output)} chars")
+
+
+if __name__ == "__main__":
+    # Set pad token for the fixed version
+    if tok.pad_token is None:
+        tok.pad_token = tok.eos_token
+
+    compare_generations()
server.py CHANGED
@@ -51,15 +51,23 @@ def chat(system_prompt: str, user_prompt: str) -> str:
 
     # `add_generation_prompt=True` automatically appends the
     # <|start_header_id|>assistant … header so the model knows to respond.
-    input_ids = tok.apply_chat_template(
+    # Get both input_ids and attention_mask
+    inputs = tok.apply_chat_template(
         messages,
         add_generation_prompt=True,
-        return_tensors="pt"
-    ).to(lm.device)
+        return_tensors="pt",
+        return_dict=True  # Returns dict with input_ids and attention_mask
+    )
+
+    # Move to device
+    input_ids = inputs["input_ids"].to(lm.device)
+    attention_mask = inputs["attention_mask"].to(lm.device)
 
     with torch.inference_mode():
         output_ids = lm.generate(
-            input_ids,
+            input_ids=input_ids,
+            attention_mask=attention_mask,  # Proper attention mask
+            pad_token_id=tok.eos_token_id,  # Explicit pad token
             max_new_tokens=2048,
             do_sample=True,
             temperature=0.2,