# Local Setup Guide - Lineage Graph Extractor

This guide provides detailed instructions for setting up and running the Lineage Graph Extractor agent locally.

## Table of Contents
1. [System Requirements](#system-requirements)
2. [Installation Methods](#installation-methods)
3. [Configuration](#configuration)
4. [Usage Scenarios](#usage-scenarios)
5. [Advanced Configuration](#advanced-configuration)
6. [Troubleshooting](#troubleshooting)

## System Requirements

### Minimum Requirements
- **OS**: Windows 10+, macOS 10.15+, or Linux
- **Python**: 3.9 or higher
- **Memory**: 2GB RAM minimum
- **Disk Space**: 100MB for agent files

### Recommended Requirements
- **Python**: 3.10+
- **Memory**: 4GB RAM
- **Internet**: Stable connection for API calls

## Installation Methods

### Method 1: Standalone Use (Recommended)

This method uses the agent configuration files with any platform that supports the Anthropic API.

1. **Download the agent**
   ```bash
   # If you have a git repository
   git clone <repository-url>
   cd local_clone
   
   # Or extract from downloaded archive
   unzip lineage-graph-extractor.zip
   cd lineage-graph-extractor
   ```

2. **Set up environment**
   ```bash
   # Copy environment template
   cp .env.example .env
   ```

3. **Edit .env file**
   ```bash
   # Edit with your preferred editor
   nano .env
   # or
   vim .env
   # or
   code .env  # VS Code
   ```

   Add your credentials:
   ```
   ANTHROPIC_API_KEY=sk-ant-your-key-here
   GOOGLE_CLOUD_PROJECT=your-gcp-project
   GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
   ```

4. **Install Python dependencies** (optional, for examples)
   ```bash
   pip install anthropic google-cloud-bigquery requests pyyaml
   ```

### Method 2: Claude Desktop Integration

If you're using Claude Desktop or similar platforms:

1. **Locate your agent configuration directory**
   - Claude Desktop: `~/.config/claude/agents/` (Linux/Mac) or `%APPDATA%\claude\agents\` (Windows)
   - Other platforms: Check platform documentation

2. **Copy the memories folder**
   ```bash
   # Linux/Mac
   cp -r memories ~/.config/claude/agents/lineage-extractor/
   
   # Windows
   xcopy /E /I memories %APPDATA%\claude\agents\lineage-extractor\
   ```

3. **Configure API credentials** in your platform's settings

4. **Restart the application**

### Method 3: Python Integration

To integrate into your own Python application:

1. **Install dependencies**
   ```bash
   pip install anthropic python-dotenv
   ```

2. **Use the integration example**
   ```python
   from anthropic import Anthropic
   from dotenv import load_dotenv
   import os
   
   # Load environment variables
   load_dotenv()
   
   # Initialize client
   client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
   
   # Load agent configuration
   with open("memories/agent.md", "r") as f:
       system_prompt = f.read()
   
   # Use the agent
   response = client.messages.create(
       model="claude-3-5-sonnet-20241022",
       max_tokens=4000,
       system=system_prompt,
       messages=[{
           "role": "user",
           "content": "Extract lineage from this metadata: ..."
       }]
   )
   
   print(response.content[0].text)
   ```
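
The example above reads `response.content[0].text`, which assumes the first content block is a text block. A minimal sketch of a more defensive reader follows, assuming each block exposes a `type` attribute and that text blocks carry a `text` attribute. `collect_text` is a hypothetical helper name, and the stand-in blocks exist only so the snippet runs without an API call.

```python
from types import SimpleNamespace


def collect_text(blocks):
    """Join the text of every text block, skipping other block types."""
    return "\n".join(
        block.text for block in blocks if getattr(block, "type", None) == "text"
    )


# Stand-in blocks for illustration; a real call would pass response.content.
sample_blocks = [
    SimpleNamespace(type="text", text="Lineage summary"),
    SimpleNamespace(type="tool_use", name="bigquery_execute_query"),
    SimpleNamespace(type="text", text="orders -> revenue"),
]
print(collect_text(sample_blocks))
```

In your own code this would be called as `collect_text(response.content)`.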

## Configuration

### API Keys Setup

#### Anthropic API Key
1. Go to https://console.anthropic.com/
2. Create an account or sign in
3. Navigate to API Keys
4. Create a new key
5. Copy the key into your `.env` file as `ANTHROPIC_API_KEY`

#### Google Cloud (for BigQuery)
1. Go to https://console.cloud.google.com/
2. Create a project or select existing
3. Enable BigQuery API
4. Create a service account:
   - Go to IAM & Admin → Service Accounts
   - Create service account
   - Grant "BigQuery Data Viewer" role
   - Create JSON key
5. Download the JSON key and set `GOOGLE_APPLICATION_CREDENTIALS` in `.env` to its path

#### Tavily (for web search)
1. Go to https://tavily.com/
2. Sign up for an account
3. Get your API key
4. Add to `.env` file
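
Once the keys are in place, a quick check can catch missing values before the first API call. The sketch below is one possible approach, using only the variable names shown in the `.env` example of this guide. `check_env` is a hypothetical helper, and the Tavily variable is left out because this guide does not name it.

```python
import os

REQUIRED_VARS = ("ANTHROPIC_API_KEY",)
# Only needed for BigQuery sources, per the setup steps above.
BIGQUERY_VARS = ("GOOGLE_CLOUD_PROJECT", "GOOGLE_APPLICATION_CREDENTIALS")


def check_env(env=None, need_bigquery=False):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    names = REQUIRED_VARS + (BIGQUERY_VARS if need_bigquery else ())
    problems = [f"{name} is not set" for name in names if not env.get(name)]
    # The credentials variable must point at an existing key file.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if need_bigquery and key_path and not os.path.isfile(key_path):
        problems.append(f"service account key not found: {key_path}")
    return problems


# Example with an explicit mapping instead of the real environment.
print(check_env({"ANTHROPIC_API_KEY": "sk-ant-your-key-here"}))
```

Calling `check_env(need_bigquery=True)` with no mapping inspects the real environment, so run it after `load_dotenv()` if the values live in `.env`.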

### Tool Configuration

Edit `memories/tools.json` to customize available tools:

```json
{
  "tools": [
    "bigquery_execute_query",
    "read_url_content",
    "google_sheets_read_range",
    "tavily_web_search"
  ],
  "interrupt_config": {
    "bigquery_execute_query": false,
    "read_url_content": false,
    "google_sheets_read_range": false,
    "tavily_web_search": false
  }
}
```

**Available Tools:**
- `bigquery_execute_query`: Execute SQL queries on BigQuery
- `read_url_content`: Fetch content from URLs/APIs
- `google_sheets_read_range`: Read data from Google Sheets
- `tavily_web_search`: Perform web searches
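
After editing `memories/tools.json`, it can help to confirm that the file still parses and is internally consistent. The following is a minimal sketch, not part of the agent itself. It assumes, based only on the sample above, that every tool should have a true/false entry in `interrupt_config`. `validate_tools_config` is a hypothetical helper name. Note that Python's `json` module rejects `//` comments, so a commented file is reported as invalid.

```python
import json


def validate_tools_config(text):
    """Parse tools.json content and report inconsistencies as a list of strings."""
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg} (line {exc.lineno})"]
    if not isinstance(config, dict):
        return ["top-level value must be a JSON object"]
    tools = config.get("tools", [])
    interrupts = config.get("interrupt_config", {})
    problems = [f"no interrupt_config entry for {t}" for t in tools if t not in interrupts]
    problems += [f"interrupt_config names unknown tool {t}" for t in interrupts if t not in tools]
    problems += [
        f"interrupt_config[{t}] is not true/false"
        for t, v in interrupts.items()
        if not isinstance(v, bool)
    ]
    return problems


sample = '{"tools": ["read_url_content"], "interrupt_config": {"read_url_content": false}}'
print(validate_tools_config(sample))
```

To check the real file, pass its contents: `validate_tools_config(open("memories/tools.json").read())`.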

### Subagent Configuration

Customize subagents by editing their configuration files:

**Metadata Parser** (`memories/subagents/metadata_parser/`)
- `agent.md`: Instructions for parsing metadata
- `tools.json`: Tools available to parser

**Graph Visualizer** (`memories/subagents/graph_visualizer/`)
- `agent.md`: Instructions for creating visualizations
- `tools.json`: Tools available to visualizer
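
A short script can confirm that the configuration files named in this guide are present before the agent is run. This is a sketch under the assumption that your checkout uses exactly the paths listed above. `missing_agent_files` is a hypothetical helper name.

```python
from pathlib import Path

# Paths taken from this guide; adjust if your layout differs.
EXPECTED_FILES = [
    "memories/agent.md",
    "memories/tools.json",
    "memories/subagents/metadata_parser/agent.md",
    "memories/subagents/metadata_parser/tools.json",
    "memories/subagents/graph_visualizer/agent.md",
    "memories/subagents/graph_visualizer/tools.json",
]


def missing_agent_files(base_dir="."):
    """Return the expected configuration files that do not exist under base_dir."""
    base = Path(base_dir)
    return [rel for rel in EXPECTED_FILES if not (base / rel).is_file()]


# Report anything missing relative to the current working directory.
for rel in missing_agent_files():
    print(f"missing: {rel}")
```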

## Usage Scenarios

### Scenario 1: BigQuery Lineage Extraction

```python
from anthropic import Anthropic
import os

client = Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))

with open("memories/agent.md", "r") as f:
    system_prompt = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from BigQuery project: my-project, dataset: analytics"
    }]
)

print(response.content[0].text)
```

### Scenario 2: File-Based Metadata

```python
# Reuses `client` and `system_prompt` from Scenario 1

# Read metadata from file
with open("dbt_manifest.json", "r") as f:
    metadata = f.read()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": f"Extract lineage from this dbt manifest:\n\n{metadata}"
    }]
)

print(response.content[0].text)
```
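
A full dbt manifest can be large, so one option is to reduce it to dependency pairs before sending it. The sketch below assumes the usual manifest layout, in which each entry under `nodes` lists its parents in `depends_on.nodes`; verify this against your dbt version. `dependency_edges` and the sample node names are hypothetical, and whether the agent does better with the reduced input is untested.

```python
import json


def dependency_edges(manifest):
    """Return sorted (upstream, downstream) pairs from a dbt manifest dict."""
    edges = set()
    for node_id, node in manifest.get("nodes", {}).items():
        for parent_id in node.get("depends_on", {}).get("nodes", []):
            edges.add((parent_id, node_id))
    return sorted(edges)


# Tiny hand-written manifest fragment for illustration only.
sample_manifest = {
    "nodes": {
        "model.shop.stg_orders": {"depends_on": {"nodes": ["source.shop.raw.orders"]}},
        "model.shop.revenue": {"depends_on": {"nodes": ["model.shop.stg_orders"]}},
    }
}
compact = json.dumps(dependency_edges(sample_manifest))
print(compact)
```

The `compact` string can then replace `metadata` in the prompt.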

### Scenario 3: API Metadata

```python
# Reuses `client` and `system_prompt` from Scenario 1

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4000,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "Extract lineage from API: https://api.example.com/metadata"
    }]
)

print(response.content[0].text)
```

## Advanced Configuration

### Custom Visualization Formats

To add custom visualization formats, edit `memories/subagents/graph_visualizer/agent.md`:

```markdown
### 4. Custom Format
Generate a custom format with:
- Your specific requirements
- Custom styling rules
- Special formatting needs
```

### Adding New Metadata Sources

To support new metadata sources:

1. Add tool to `memories/tools.json`
2. Update `memories/agent.md` with source-specific instructions
3. Update `memories/subagents/metadata_parser/agent.md` if needed
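
Step 1 above can be scripted. The following sketch registers a tool in a `tools.json`-shaped dictionary, assuming the structure shown under Tool Configuration. `add_tool` is a hypothetical helper, and `snowflake_execute_query` is a made-up tool name used only for illustration.

```python
import json


def add_tool(config, tool_name, interrupt=False):
    """Return a copy of the tools.json config with tool_name registered."""
    updated = json.loads(json.dumps(config))  # deep copy via JSON round-trip
    if tool_name not in updated.setdefault("tools", []):
        updated["tools"].append(tool_name)
    updated.setdefault("interrupt_config", {})[tool_name] = interrupt
    return updated


config = {"tools": ["read_url_content"], "interrupt_config": {"read_url_content": False}}
# "snowflake_execute_query" is a made-up name for illustration.
print(json.dumps(add_tool(config, "snowflake_execute_query"), indent=2))
```

The function returns a new dictionary and leaves its input unchanged, so the result can be reviewed before it is written back to `memories/tools.json`.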

### MCP Integration

To integrate with Model Context Protocol servers:

1. Check whether MCP tools are available in the `/tools` directory
2. Add MCP tools to `memories/tools.json`
3. Configure MCP server connection
4. See `memories/mcp_integration.md` (if available)

## Troubleshooting

### Common Issues

#### 1. Authentication Errors

**Problem**: API authentication fails

**Solutions**:
- Verify API key is correct in `.env`
- Check key hasn't expired
- Ensure environment variables are loaded
- Try regenerating the API key

```bash
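# Test Anthropic API key with a minimal one-token request
# (creating the client alone makes no network call, so it cannot detect a bad key)
python -c "from anthropic import Anthropic; import os; from dotenv import load_dotenv; load_dotenv(); client = Anthropic(api_key=os.getenv('ANTHROPIC_API_KEY')); client.messages.create(model='claude-3-5-sonnet-20241022', max_tokens=1, messages=[{'role': 'user', 'content': 'ping'}]); print('✓ API key works')"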
```

#### 2. BigQuery Access Issues

**Problem**: Cannot access BigQuery

**Solutions**:
- Verify service account has BigQuery permissions
- Check project ID is correct
- Ensure JSON key file path is correct
- Test credentials:

```bash
# Test BigQuery access
gcloud auth activate-service-account --key-file=/path/to/key.json
bq ls --project_id=your-project-id
```

#### 3. Import Errors

**Problem**: `ModuleNotFoundError`

**Solutions**:
```bash
# Install missing packages
pip install anthropic google-cloud-bigquery requests pyyaml python-dotenv

# Or install from a requirements file, if you have created one
pip install -r requirements.txt
```

#### 4. Environment Variables Not Loading

**Problem**: `.env` file not being read

**Solutions**:
```python
# Explicitly load .env
from dotenv import load_dotenv
load_dotenv()

# Or specify path
load_dotenv(".env")

# Verify loading without printing the secret itself
import os
print(os.getenv("ANTHROPIC_API_KEY") is not None)  # Should print True
```

#### 5. File Path Issues

**Problem**: Cannot find `memories/agent.md`

**Solutions**:
```python
# Use absolute path
import os
base_dir = os.path.dirname(os.path.abspath(__file__))
agent_path = os.path.join(base_dir, "memories", "agent.md")

# Or change working directory
os.chdir("/path/to/local_clone")
```

### Performance Issues

#### Slow Response Times

**Causes**:
- Large metadata files
- Complex lineage graphs
- Network latency

**Solutions**:
- Break large metadata into chunks
- Use filtering to focus on specific entities
- Increase API timeout settings
- Cache frequently used results
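
Breaking large metadata into chunks can be sketched as follows. This is an illustrative approach, not the agent's built-in behavior. `chunk_by_size` is a hypothetical helper, the size limit is an arbitrary character count, and the batch size is approximate because it ignores list separators.

```python
import json


def chunk_by_size(entries, max_chars=20_000):
    """Yield lists of entries whose combined JSON length stays within max_chars.

    A single entry larger than max_chars is yielded alone rather than split.
    """
    batch, size = [], 0
    for entry in entries:
        length = len(json.dumps(entry))
        if batch and size + length > max_chars:
            yield batch
            batch, size = [], 0
        batch.append(entry)
        size += length
    if batch:
        yield batch


# Made-up table metadata for illustration.
tables = [{"name": f"table_{i}", "columns": ["id", "value"]} for i in range(5)]
for batch in chunk_by_size(tables, max_chars=120):
    print([t["name"] for t in batch])
```

Each batch can be sent in its own request. Lineage that spans batches then has to be merged afterwards, which this sketch does not attempt.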

### Debugging Tips

1. **Enable verbose logging**
   ```python
   import logging
   logging.basicConfig(level=logging.DEBUG)
   ```

2. **Test each component separately**
   - Test API connection first
   - Test metadata retrieval
   - Test parsing separately
   - Test visualization separately

3. **Validate metadata format**
   - Ensure JSON is valid
   - Check for required fields
   - Verify structure matches expected format

4. **Check agent configuration**
   - Verify `memories/agent.md` is readable
   - Check `tools.json` syntax
   - Ensure subagent files exist
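
The metadata checks in step 3 can be automated with a short script. This sketch assumes the metadata is a JSON object. `validate_metadata` is a hypothetical helper, and the default required field `nodes` is only a placeholder, since this guide does not list the fields the agent expects.

```python
import json


def validate_metadata(text, required_fields=("nodes",)):
    """Return (data, problems) for a JSON metadata document."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON at line {exc.lineno}, column {exc.colno}: {exc.msg}"]
    if not isinstance(data, dict):
        return data, ["top-level value is not a JSON object"]
    missing = [
        f"missing required field: {name}"
        for name in required_fields
        if name not in data
    ]
    return data, missing


data, problems = validate_metadata('{"nodes": {}}')
print(problems or "metadata looks valid")
```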

## Getting Help

### Documentation
- Agent instructions: `memories/agent.md`
- Subagent docs: `memories/subagents/*/agent.md`
- Anthropic API: https://docs.anthropic.com/

### Testing Your Setup

Run this complete test:

```python
from anthropic import Anthropic
from dotenv import load_dotenv
import os

# Load environment
load_dotenv()

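# Test 1: API key present and client initialized
# (creating the client makes no network call; Test 3 verifies the key against the API)
api_key = os.getenv("ANTHROPIC_API_KEY")
if not api_key:
    print("✗ ANTHROPIC_API_KEY is not set; check your .env file")
    exit(1)
client = Anthropic(api_key=api_key)
print("✓ Anthropic client initialized")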

# Test 2: Load Agent Config
try:
    with open("memories/agent.md", "r") as f:
        system_prompt = f.read()
    print("✓ Agent configuration loaded")
except Exception as e:
    print(f"✗ Failed to load agent config: {e}")
    exit(1)

# Test 3: Simple Query
try:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1000,
        system=system_prompt,
        messages=[{
            "role": "user",
            "content": "Hello, what can you help me with?"
        }]
    )
    print("✓ Agent response successful")
    print(f"\nAgent says: {response.content[0].text}")
except Exception as e:
    print(f"✗ Agent query failed: {e}")
    exit(1)

print("\n✓ All tests passed! Your setup is ready.")
```

Save as `test_setup.py` and run:
```bash
python test_setup.py
```

## Next Steps

1. ✅ Complete setup
2. ✅ Test with sample metadata
3. 📊 Extract your first lineage
4. 🎨 Customize visualization preferences
5. 🔧 Integrate with your workflow

---

**Setup complete?** Try the usage examples in README.md or run your own lineage extraction!