# Cancer@Home v2 - User Guide

## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Getting Started](#getting-started)
4. [Dashboard Guide](#dashboard-guide)
5. [Working with Data](#working-with-data)
6. [Analysis Pipeline](#analysis-pipeline)
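7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Support & Resources](#support--resources)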

---

## Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
- **BOINC**: Distributed computing for computationally intensive tasks
- **GDC Portal**: Access to comprehensive cancer genomics datasets
- **Neo4j**: Graph database for modeling complex relationships
- **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling

### Key Features
✓ Interactive web dashboard  
✓ Real-time graph visualization  
✓ GraphQL API for flexible data queries  
✓ Distributed task processing  
✓ Cancer genomics data integration  

---

## System Architecture

```
┌─────────────────────────────────────────────────┐
│              Web Dashboard (Port 5000)          │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline │
└────────────────────┬────────────────────────────┘
                     │
┌────────────────────┴────────────────────────────┐
│           FastAPI Backend (REST + GraphQL)      │
└──────┬───────┬───────┬──────┬─────────┬─────────┘
       │       │       │      │         │
    ┌──┴──┐ ┌──┴───┐ ┌─┴─┐ ┌──┴──┐ ┌────┴────┐
    │Neo4j│ │BOINC │ │GDC│ │FASTQ│ │BLAST/VCF│
    │7687 │ │Client│ │API│ │Proc │ │ Caller  │
    └─────┘ └──────┘ └───┘ └─────┘ └─────────┘
```

---

## Getting Started

### Quick Installation (5 minutes)

**Windows:**
```powershell
.\setup.ps1
python run.py
```

**Linux/Mac:**
```bash
./setup.sh
python run.py
```

### Access Points
- **Main Application**: http://localhost:5000
- **API Documentation**: http://localhost:5000/docs
- **GraphQL Playground**: http://localhost:5000/graphql
- **Neo4j Browser**: http://localhost:7474 (neo4j/cancer123)

---

## Dashboard Guide

### 1. Overview Tab
Shows key statistics:
- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued

**Chart**: Mutation distribution across cancer types

### 2. Neo4j Visualization Tab
Interactive graph showing:
- **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.)
- **Purple nodes**: Patients
- **Pink nodes**: Cancer types
- **Lines**: Relationships between entities

**Navigation**:
- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel

### 3. BOINC Tasks Tab
Manage distributed computing workloads:

**Submit Task**:
1. Select task type (Variant Calling, BLAST, Alignment)
2. Enter input file path
3. Click "Submit Task"

**Monitor Tasks**:
- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics

### 4. GDC Data Tab
Browse available cancer projects:
- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.

### 5. Pipeline Tab
Quick access to bioinformatics tools:
- **FASTQ QC**: Quality control for sequencing data
- **BLAST Search**: Sequence alignment and homology
- **Variant Calling**: Identify genetic variants

---

## Working with Data

### Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

**Example 1: Find mutations in TP53 gene**
```graphql
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
```

**Example 2: Get patient information**
```graphql
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
```

**Example 3: Cancer statistics**
```graphql
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
```
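
The queries above can also be sent from a script rather than the playground. The following is a minimal sketch, assuming the `/graphql` endpoint accepts the conventional GraphQL-over-HTTP JSON POST body (`{"query": ...}`); this guide does not confirm that, and `build_graphql_request` and `run_query` are hypothetical helper names, not part of the project.

```python
import json
import urllib.request

# Assumption: the endpoint listed under Access Points accepts JSON POST bodies.
GRAPHQL_URL = "http://localhost:5000/graphql"

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, url=GRAPHQL_URL):
    """Wrap a GraphQL query string in a JSON POST request (hypothetical helper)."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=GRAPHQL_URL, timeout=30):
    """Send the query to a running server and return the decoded JSON response."""
    request = build_graphql_request(query, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server running, `run_query(TP53_QUERY)` should return a dictionary; by GraphQL convention the results sit under its `"data"` key and any problems under `"errors"`.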

### Using the REST API

**Get database summary:**
```bash
curl http://localhost:5000/api/neo4j/summary
```

**Search GDC files:**
```bash
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
```

**Submit BOINC task:**
```bash
curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
```

---

## Analysis Pipeline

### 1. FASTQ Processing

**Quality Control:**
```python
from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
```

**Filter by quality:**
```python
# Reuses the `processor` instance created in the previous example
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
```
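
To make the idea of a quality filter concrete, here is a small sketch of how a per-read check could work. It assumes Phred+33 encoded quality strings (the usual encoding in current FASTQ files) and a mean-quality cutoff; how `quality_filter` actually decides is not documented in this guide, and every function name below is hypothetical.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, offset=33):
    """Average Phred score of one read; 0.0 for an empty read."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def passes_filter(sequence, quality_line, min_quality, min_length):
    """Keep a read only if it is long enough and its mean quality meets the cutoff."""
    return len(sequence) >= min_length and mean_quality(quality_line) >= min_quality


# 'I' encodes Phred 40 and '!' encodes Phred 0 under Phred+33.
print(mean_quality("IIII"))
print(passes_filter("ACGT", "!!!!", min_quality=25, min_length=4))
```

A Phred score Q corresponds to an error probability of 10^(-Q/10), so a cutoff of 25 tolerates roughly a 0.3% per-base error rate on average.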

### 2. BLAST Alignment

**Run BLAST search:**
```python
from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")
```

**Filter high-quality hits:**
```python
# Reuses `blast` and `hits` from the previous example
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
```
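
For readers who want to see what identity filtering involves, the sketch below parses BLAST+ tabular output (`-outfmt 6`), whose default columns are `qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore`. It is an assumption that the pipeline produces this format; the structure returned by `parse_results` is not documented here, and `BlastHit`, `parse_outfmt6`, and `filter_by_identity` are hypothetical names. Note that `pident` is a percentage (0 to 100), so a `min_identity` of 0.95 corresponds to 95.0.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    """A subset of the twelve default outfmt 6 columns."""
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    evalue: float
    bit_score: float


def parse_outfmt6(text):
    """Parse tab-separated BLAST output, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        hits.append(
            BlastHit(
                query_id=cols[0],
                subject_id=cols[1],
                percent_identity=float(cols[2]),
                alignment_length=int(cols[3]),
                evalue=float(cols[10]),
                bit_score=float(cols[11]),
            )
        )
    return hits


def filter_by_identity(hits, min_identity=0.95):
    """Keep hits whose identity, expressed as a fraction, meets the threshold."""
    return [h for h in hits if h.percent_identity / 100.0 >= min_identity]
```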

### 3. Variant Calling

**Identify variants:**
```python
from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
```
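
A quality threshold such as `min_quality=30` most naturally maps to the QUAL column of a VCF file (the sixth tab-separated field, after CHROM, POS, ID, REF, and ALT). The sketch below filters on that column; it is an assumption that `filter_variants` works this way, and `iter_high_quality_records` is a hypothetical helper. Records with a missing QUAL (`.`) are skipped.

```python
def iter_high_quality_records(vcf_lines, min_quality=30.0):
    """Yield simplified records whose QUAL value is at least min_quality."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # meta-information and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF data line
        qual = fields[5]
        if qual == ".":
            continue  # QUAL missing, cannot be compared
        if float(qual) >= min_quality:
            yield {
                "chrom": fields[0],
                "pos": int(fields[1]),
                "ref": fields[3],
                "alt": fields[4],
                "qual": float(qual),
            }
```

Because the function accepts any iterable of lines, it can be used directly on an open file handle: `with open(path) as handle: records = list(iter_high_quality_records(handle))`.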

**Find cancer-associated variants:**
```python
from backend.pipeline import VariantAnalyzer

# Reuses `variants` from the previous example
analyzer = VariantAnalyzer()
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```
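
Tumor mutation burden is reported in mutations per megabase, that is, the mutation count divided by the size of the sequenced region in millions of bases. The sketch below shows only that arithmetic; how `calculate_mutation_burden` chooses the region size is not stated in this guide, so `covered_bases` is an explicit, hypothetical parameter.

```python
def tumor_mutation_burden(mutation_count, covered_bases):
    """Return mutations per megabase for a region of covered_bases bases."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    megabases = covered_bases / 1_000_000
    return mutation_count / megabases


# 150 mutations over a 30 Mb region gives 5 mutations/Mb.
print(f"{tumor_mutation_burden(150, 30_000_000):.2f} mutations/Mb")
```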

---

## Advanced Usage

### Custom Neo4j Queries

**Direct Cypher queries:**
```python
from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    # Release the connection even if the query fails
    db.close()
```

### Batch Data Import

**Import GDC data:**
```python
from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
```

### Custom BOINC Tasks

**Submit custom analysis:**
```python
from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
```
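
The monitoring loop above checks each task once. For long-running work it can help to poll until every task finishes. The following is a sketch of such a loop, written against a plain callable so it does not depend on the client's internals. The status names `Pending`, `Running`, and `Completed` come from the BOINC Tasks tab; `Failed` is an assumed extra terminal state, and `wait_for_tasks` is a hypothetical helper.

```python
import time

# "Failed" is an assumption; this guide only lists Pending, Running, and Completed.
TERMINAL_STATES = {"Completed", "Failed"}


def wait_for_tasks(get_status, task_ids, poll_interval=10.0, timeout=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll get_status(task_id) until every task reaches a terminal state.

    Returns a dict mapping task ID to its final state, or raises TimeoutError
    if some tasks are still unfinished after `timeout` seconds.
    """
    pending = set(task_ids)
    final = {}
    deadline = clock() + timeout
    while pending:
        for task_id in list(pending):
            state = get_status(task_id)
            if state in TERMINAL_STATES:
                final[task_id] = state
        pending.difference_update(final)  # drop tasks that just finished
        if not pending:
            break
        if clock() >= deadline:
            raise TimeoutError(f"{len(pending)} task(s) still unfinished")
        sleep(poll_interval)
    return final
```

With the client from the example above, a call might look like `wait_for_tasks(lambda tid: client.get_task_status(tid).status, task_ids)`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.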

### Configuration Customization

Edit `config.yml`:

```yaml
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length

  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
```

---

## Troubleshooting

### Neo4j Connection Issues
```bash
# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j
```
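
If the container is running but the application still cannot connect, a quick reachability check of the two Neo4j ports mentioned in this guide (7687 for Bolt, 7474 for the browser) can narrow things down. This is a generic TCP probe using only the standard library, not a project utility; the function names are hypothetical.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def neo4j_ports_report(host="localhost"):
    """Check the Bolt and HTTP ports used elsewhere in this guide."""
    ports = {"bolt": 7687, "http": 7474}
    return {name: port_is_open(host, port) for name, port in ports.items()}
```

Calling `neo4j_ports_report()` returns something like `{"bolt": True, "http": True}` when both ports accept connections. An open port only shows that something is listening; it does not confirm that the credentials are valid.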

### Memory Issues
Increase Docker memory allocation:
1. Open Docker Desktop Settings
2. Resources → Memory
3. Increase to at least 8GB
4. Click "Apply & Restart"

### API Errors
Check logs:
```bash
# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log
```

---

## Best Practices

1. **Data Management**: Regularly clean up downloaded data to free space
2. **Task Monitoring**: Check BOINC tasks periodically for failures
3. **Database Backup**: Back up the Neo4j data volume regularly
4. **Resource Limits**: Monitor system resources when running large analyses
5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads
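
One common way to stay within rate limits during bulk downloads is to retry failed requests with exponentially growing pauses. The sketch below is a generic pattern, not the project's implementation: whether `GDCClient` already retries internally (the `max_retries` setting in `config.yml` suggests it may) is not stated here, and `retry_with_backoff` is a hypothetical helper.

```python
import time


def retry_with_backoff(operation, max_retries=3, base_delay=1.0,
                       sleep=time.sleep, retry_on=(OSError,)):
    """Call operation(); on a listed exception, wait and try again.

    The delay doubles after each failure (base_delay, 2x, 4x, ...). After
    max_retries failed retries the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return operation()
        except retry_on:
            if attempt >= max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

A download could then be wrapped as `retry_with_backoff(lambda: gdc.download_file(file.file_id))`. The default `retry_on=(OSError,)` covers standard-library network errors such as `ConnectionError` and `urllib.error.URLError`; a client that raises its own exception types would need those listed instead.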

---

## Support & Resources

- **Documentation**: See README.md and QUICKSTART.md
- **API Reference**: http://localhost:5000/docs
- **GraphQL Examples**: See GRAPHQL_EXAMPLES.md
- **Logs**: Check `logs/cancer_at_home.log`

### Useful Cypher Queries

**Most common mutations:**
```cypher
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
```

**Genes with most mutations:**
```cypher
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
```

**Patient mutation profile:**
```cypher
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
```
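
When the patient ID comes from user input, it is safer to pass it as a Cypher parameter (`$patient_id`) than to paste it into the query text. The sketch below only builds the query and its parameter dictionary. Whether `DatabaseManager.execute_query` accepts parameters is not documented in this guide, so treat that part as an assumption; `patient_profile` is a hypothetical helper.

```python
PATIENT_PROFILE_QUERY = """
MATCH (p:Patient {patient_id: $patient_id})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
"""


def patient_profile(patient_id):
    """Return the parameterized query and its parameters for one patient."""
    if not isinstance(patient_id, str) or not patient_id:
        raise ValueError("patient_id must be a non-empty string")
    return PATIENT_PROFILE_QUERY, {"patient_id": patient_id}


query, params = patient_profile("TCGA-A1-001")
print(params)
```

With the official Neo4j Python driver, the pair can be passed as `session.run(query, params)`, which sends the value separately from the query text.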