# Cancer@Home v2 - User Guide

## Table of Contents
1. [Introduction](#introduction)
2. [System Architecture](#system-architecture)
3. [Getting Started](#getting-started)
4. [Dashboard Guide](#dashboard-guide)
5. [Working with Data](#working-with-data)
6. [Analysis Pipeline](#analysis-pipeline)
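7. [Advanced Usage](#advanced-usage)
8. [Troubleshooting](#troubleshooting)
9. [Best Practices](#best-practices)
10. [Support & Resources](#support--resources)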

---

## Introduction

Cancer@Home v2 is a distributed computing platform for cancer genomics research that combines:
- **BOINC**: Distributed computing for computationally intensive tasks
- **GDC Portal**: Access to comprehensive cancer genomics datasets
- **Neo4j**: Graph database for modeling complex relationships
- **Bioinformatics Pipeline**: FASTQ processing, BLAST alignment, and variant calling

### Key Features
✓ Interactive web dashboard  
✓ Real-time graph visualization  
✓ GraphQL API for flexible data queries  
✓ Distributed task processing  
✓ Cancer genomics data integration  

---

## System Architecture

```
┌─────────────────────────────────────────────────┐
│            Web Dashboard (Port 5000)            │
│  Dashboard | Neo4j Viz | BOINC | GDC | Pipeline │
└────────────────────────┬────────────────────────┘
                         │
┌────────────────────────┴────────────────────────┐
│        FastAPI Backend (REST + GraphQL)         │
└──────┬────────┬────────┬────────┬─────────┬─────┘
       │        │        │        │         │
   ┌───┴──┐ ┌───┴──┐ ┌───┴──┐ ┌───┴──┐ ┌────┴────┐
   │Neo4j │ │BOINC │ │ GDC  │ │FASTQ │ │BLAST/VCF│
   │ 7687 │ │Client│ │ API  │ │ Proc │ │ Caller  │
   └──────┘ └──────┘ └──────┘ └──────┘ └─────────┘
```

---

## Getting Started

### Quick Installation (5 minutes)

**Windows:**
```powershell
.\setup.ps1
python run.py
```

**Linux/Mac:**
```bash
./setup.sh
python run.py
```

### Access Points
- **Main Application**: http://localhost:5000
- **API Documentation**: http://localhost:5000/docs
- **GraphQL Playground**: http://localhost:5000/graphql
- **Neo4j Browser**: http://localhost:7474 (login: `neo4j` / `cancer123`)

---

## Dashboard Guide

### 1. Overview Tab
Shows key statistics:
- Total genes in database
- Total mutations identified
- Number of patients
- Cancer types catalogued

**Chart**: Mutation distribution across cancer types

### 2. Neo4j Visualization Tab
Interactive graph showing:
- **Blue nodes**: Genes (TP53, BRCA1, KRAS, etc.)
- **Purple nodes**: Patients
- **Pink nodes**: Cancer types
- **Lines**: Relationships between entities

**Navigation**:
- Click and drag nodes to rearrange
- Hover over nodes for details
- Zoom in/out with mouse wheel

### 3. BOINC Tasks Tab
Manage distributed computing workloads:

**Submit Task**:
1. Select task type (Variant Calling, BLAST, Alignment)
2. Enter input file path
3. Click "Submit Task"

**Monitor Tasks**:
- View all tasks with status (Pending, Running, Completed)
- See task creation time and type
- Check overall statistics

### 4. GDC Data Tab
Browse available cancer projects:
- TCGA-BRCA: Breast Cancer (1,098 cases)
- TCGA-LUAD: Lung Adenocarcinoma (585 cases)
- TCGA-COAD: Colon Adenocarcinoma (461 cases)
- TCGA-GBM: Glioblastoma (617 cases)
- TARGET-AML: Acute Myeloid Leukemia (238 cases)

Click on a project to explore available datasets.

### 5. Pipeline Tab
Quick access to bioinformatics tools:
- **FASTQ QC**: Quality control for sequencing data
- **BLAST Search**: Sequence alignment and homology
- **Variant Calling**: Identify genetic variants

---

## Working with Data

### Querying with GraphQL

Access the GraphQL playground at http://localhost:5000/graphql

**Example 1: Find mutations in TP53 gene**
```graphql
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
```

**Example 2: Get patient information**
```graphql
query {
  patients(project_id: "TCGA-BRCA", limit: 10) {
    patient_id
    age
    gender
    vital_status
  }
}
```

**Example 3: Cancer statistics**
```graphql
query {
  cancerStatistics(cancer_type_id: "BRCA") {
    total_patients
    total_mutations
    avg_mutations_per_patient
  }
}
```
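
The same queries can also be sent from a script. The following is a minimal sketch using only the Python standard library. It assumes the `/graphql` endpoint accepts the conventional GraphQL-over-HTTP POST body (`{"query": ..., "variables": ...}`), which this guide does not confirm; the helper names `build_graphql_request` and `run_query` are illustrative and not part of Cancer@Home.

```python
import json
import urllib.request

GRAPHQL_URL = "http://localhost:5000/graphql"  # endpoint listed under Access Points

TP53_QUERY = """
query {
  mutations(gene: "TP53") {
    mutation_id
    chromosome
    position
    consequence
  }
}
"""


def build_graphql_request(query, variables=None, url=GRAPHQL_URL):
    """Build (but do not send) a POST request carrying a GraphQL query."""
    payload = {"query": query}
    if variables:
        payload["variables"] = variables
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, variables=None, timeout=30):
    """Send the query and decode the JSON response (needs a running server)."""
    request = build_graphql_request(query, variables)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the application running, `run_query(TP53_QUERY)` should return the decoded response; its exact shape depends on the server's schema.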

### Using the REST API

**Get database summary:**
```bash
curl http://localhost:5000/api/neo4j/summary
```

**Search GDC files:**
```bash
curl "http://localhost:5000/api/gdc/files/TCGA-BRCA?limit=10"
```

**Submit BOINC task:**
```bash
curl -X POST http://localhost:5000/api/boinc/submit \
  -H "Content-Type: application/json" \
  -d '{"workunit_type": "variant_calling", "input_file": "data/sample.fastq"}'
```
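
For scripted use, the `curl` calls above can be mirrored in Python. This is a sketch with the standard library only; the endpoint paths and JSON fields are taken from the examples above, while the helper names (`gdc_files_url`, `boinc_submit_request`, `send`) are hypothetical and the response format is not specified in this guide.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000"


def gdc_files_url(project_id, limit=10, base_url=BASE_URL):
    """Return the URL that lists GDC files for one project."""
    path = "/api/gdc/files/" + urllib.parse.quote(project_id, safe="")
    return f"{base_url}{path}?{urllib.parse.urlencode({'limit': limit})}"


def boinc_submit_request(workunit_type, input_file, base_url=BASE_URL):
    """Build (but do not send) the POST request that submits a BOINC task."""
    body = {"workunit_type": workunit_type, "input_file": input_file}
    return urllib.request.Request(
        f"{base_url}/api/boinc/submit",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request_or_url, timeout=30):
    """Send a request (or GET a URL) and decode the JSON response."""
    with urllib.request.urlopen(request_or_url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(gdc_files_url("TCGA-BRCA"))` performs the file search and `send(boinc_submit_request("variant_calling", "data/sample.fastq"))` submits a task.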

---

## Analysis Pipeline

### 1. FASTQ Processing

**Quality Control:**
```python
from backend.pipeline import FASTQProcessor

processor = FASTQProcessor()
stats = processor.calculate_statistics("input.fastq")
print(f"Total reads: {stats['total_reads']}")
print(f"Average quality: {stats['avg_quality']}")
```

**Filter by quality:**
```python
# Reuses the `processor` instance created in the previous example
filtered = processor.quality_filter("input.fastq", "filtered.fastq")
print(f"Pass rate: {filtered['pass_rate']:.2%}")
```
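
To make the idea of quality filtering concrete, here is a small standalone sketch. It is not the `FASTQProcessor` implementation: it assumes four-line FASTQ records with Phred+33 quality encoding, keeps reads by mean quality and minimum length, and uses illustrative default thresholds and hypothetical function names.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of input (or a blank line)
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not quality:
        return 0.0
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def quality_filter(records, quality_threshold=20, min_length=50):
    """Return the reads that pass both thresholds, plus the pass rate."""
    kept, total = [], 0
    for header, sequence, quality in records:
        total += 1
        if len(sequence) >= min_length and mean_phred(quality) >= quality_threshold:
            kept.append((header, sequence, quality))
    pass_rate = len(kept) / total if total else 0.0
    return kept, pass_rate


# Made-up reads: one good, one low quality ('!' is Phred 0), one too short
SAMPLE = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n!!!!!!!!\n"
    "@read3\nACG\n+\nIII\n"
)
kept, pass_rate = quality_filter(
    read_fastq(io.StringIO(SAMPLE)), quality_threshold=25, min_length=5
)
print([header for header, _, _ in kept], f"{pass_rate:.2%}")
```

Only `@read1` survives here, so the pass rate is one in three.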

### 2. BLAST Alignment

**Run BLAST search:**
```python
from backend.pipeline import BLASTRunner

blast = BLASTRunner()
results = blast.run_blastn("query.fasta")
hits = blast.parse_results(results)

print(f"Found {len(hits)} alignments")
```

**Filter high-quality hits:**
```python
# Keep hits with at least 95% identity; reuses `blast` and `hits` from above
filtered_hits = blast.filter_hits(hits, min_identity=0.95)
```
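
As an illustration of what hit filtering involves, the sketch below parses BLAST tabular output (`-outfmt 6`, default twelve columns) and filters on identity and e-value. Whether `BLASTRunner` uses this output format is an assumption, the `Hit` type and function names are hypothetical, and the `max_evalue` parameter is added here for illustration. Note that BLAST reports identity as a percentage, while `min_identity` above is a fraction.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float  # 0-100, as reported by BLAST
    alignment_length: int
    evalue: float
    bit_score: float


def parse_tabular(text):
    """Parse BLAST tabular output (-outfmt 6, default 12 columns)."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10]), float(f[11])))
    return hits


def filter_hits(hits, min_identity=0.95, max_evalue=1e-4):
    """Keep hits whose identity fraction and e-value pass the thresholds."""
    return [
        h for h in hits
        if h.percent_identity / 100.0 >= min_identity and h.evalue <= max_evalue
    ]


# Made-up example rows: one strong hit and one weaker hit
SAMPLE = (
    "q1\tchr17\t99.00\t200\t2\t0\t1\t200\t1000\t1199\t1e-100\t350\n"
    "q1\tchr3\t88.50\t120\t14\t0\t1\t120\t500\t619\t2e-20\t95.3\n"
)
hits = parse_tabular(SAMPLE)
strong = filter_hits(hits)
print(f"{len(strong)} of {len(hits)} hits kept")
```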

### 3. Variant Calling

**Identify variants:**
```python
from backend.pipeline import VariantCaller

caller = VariantCaller()
vcf_file = caller.call_variants("alignment.bam", "reference.fa")
variants = caller.filter_variants(vcf_file, min_quality=30)

print(f"Identified {len(variants)} high-quality variants")
```
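
The quality filter corresponds to a threshold on the `QUAL` column of a VCF file. The following standalone sketch shows that idea on plain VCF text; it is not the `VariantCaller` code, the function name is hypothetical, and the sample records are made up for illustration.

```python
def filter_vcf_lines(lines, min_quality=30.0):
    """Yield (chrom, pos, ref, alt, qual) for records with QUAL >= min_quality."""
    for line in lines:
        # Skip blank lines, '##' meta-information lines, and the '#CHROM' header
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt, qual = fields[:6]
        if qual == ".":  # missing QUAL cannot be compared, so skip the record
            continue
        if float(qual) >= min_quality:
            yield chrom, int(pos), ref, alt, float(qual)


# Made-up records for illustration only
VCF = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr17\t7674220\t.\tC\tT\t58.2\tPASS\t.
chr12\t25245350\t.\tC\tA\t12.0\tPASS\t.
chr13\t32340301\t.\tG\tA\t.\tPASS\t.
"""

high_quality = list(filter_vcf_lines(VCF.splitlines()))
print(high_quality)
```

Records with a missing `QUAL` are dropped in this sketch; a real pipeline might treat them differently.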

**Find cancer-associated variants:**
```python
from backend.pipeline import VariantAnalyzer

analyzer = VariantAnalyzer()
# `variants` is the filtered list from the previous example
cancer_variants = analyzer.identify_cancer_variants(variants)
tmb = analyzer.calculate_mutation_burden(variants)

print(f"Cancer variants: {len(cancer_variants)}")
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```
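
Tumor mutation burden is reported in mutations per megabase, that is, the number of counted variants divided by the size of the sequenced region in Mb. The sketch below shows only that arithmetic. How `calculate_mutation_burden` chooses which variants to count and which region size to use is not described in this guide, so the function name and the 30 Mb figure are assumptions for illustration.

```python
def tumor_mutation_burden(variant_count, covered_bases):
    """Return mutations per megabase for a region of `covered_bases` bases."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    return variant_count / (covered_bases / 1_000_000)


# Illustrative numbers: 150 variants over a 30 Mb target region
tmb = tumor_mutation_burden(150, 30_000_000)
print(f"Tumor Mutation Burden: {tmb:.2f} mutations/Mb")
```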

---

## Advanced Usage

### Custom Neo4j Queries

**Direct Cypher queries:**
```python
from backend.neo4j import DatabaseManager

db = DatabaseManager()

# Find patients with TP53 mutations
query = """
MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene {symbol: 'TP53'})
RETURN p.patient_id, m.position, m.consequence
"""

try:
    results = db.execute_query(query)
    for result in results:
        print(result)
finally:
    # Release the connection even if the query fails
    db.close()
```
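
When the gene symbol comes from user input, a Cypher parameter (`$symbol`) is safer than pasting the value into the query string. The sketch below only builds the query and its parameter dictionary. Whether `DatabaseManager.execute_query` accepts a parameters argument is not shown in this guide, so that part is an assumption; the helper name is hypothetical.

```python
def patients_with_gene_mutation(symbol):
    """Return a (query, parameters) pair for patients with mutations in a gene."""
    query = (
        "MATCH (p:Patient)-[:HAS_MUTATION]->(m:Mutation)"
        "-[:AFFECTS]->(g:Gene {symbol: $symbol}) "
        "RETURN p.patient_id, m.position, m.consequence"
    )
    # The value travels separately from the query text
    return query, {"symbol": symbol}


query, params = patients_with_gene_mutation("TP53")
print(query)
print(params)
```

If the database wrapper forwards parameters to the driver, the pair could be passed along together; otherwise the official Neo4j Python driver accepts it directly as `session.run(query, params)`.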

### Batch Data Import

**Import GDC data:**
```python
from backend.gdc import GDCClient
from backend.neo4j import DataImporter

# Download mutation data
gdc = GDCClient()
files = gdc.get_mutation_data("TCGA-BRCA", limit=10)

for file in files:
    gdc.download_file(file.file_id)

# Import to Neo4j
importer = DataImporter()
importer.import_gdc_data(files)
```
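
Bulk downloads can fail transiently, so it may help to wrap each download in a retry with exponential backoff. The following is a generic sketch, not part of `GDCClient`: the `with_retries` helper is hypothetical, and it assumes download failures surface as `OSError` subclasses (as network errors from the standard library do), which may not match what `download_file` actually raises.

```python
import time


def with_retries(action, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `action()`; after a failure wait base_delay * 2**attempt and retry.

    The action runs at most max_retries + 1 times; the last error is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return action()
        except OSError:
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demonstration with a stand-in action that fails twice, then succeeds
attempts = {"count": 0}


def flaky_download():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays = []
result = with_retries(flaky_download, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

In the import loop above, the call would become `with_retries(lambda: gdc.download_file(file.file_id))`.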

### Custom BOINC Tasks

**Submit custom analysis:**
```python
from backend.boinc import BOINCClient

client = BOINCClient()

# Submit multiple tasks
input_files = ["sample1.fastq", "sample2.fastq", "sample3.fastq"]
task_ids = []

for file in input_files:
    task_id = client.submit_task("variant_calling", file)
    task_ids.append(task_id)

# Monitor progress
for task_id in task_ids:
    status = client.get_task_status(task_id)
    print(f"Task {task_id}: {status.status}")
```
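
The monitoring loop above checks each task once. To wait until all tasks finish, a polling loop along the following lines could be used. This is a sketch under assumptions: it treats the status strings shown in the dashboard (`Pending`, `Running`, `Completed`) as the values returned by the client, it knows of no failure state, and the `wait_for_tasks` helper is hypothetical.

```python
import time


def wait_for_tasks(task_ids, get_status, poll_interval=5.0, timeout=600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every task reports "Completed" or the timeout expires.

    Returns a dict mapping each task id to its last observed status.
    """
    deadline = clock() + timeout
    statuses = {task_id: "Pending" for task_id in task_ids}
    while True:
        for task_id in task_ids:
            if statuses[task_id] != "Completed":  # skip tasks already done
                statuses[task_id] = get_status(task_id)
        if all(s == "Completed" for s in statuses.values()) or clock() >= deadline:
            return statuses
        sleep(poll_interval)


# Demonstration with canned status sequences instead of a real client
canned = {
    "a": iter(["Running", "Completed"]),
    "b": iter(["Pending", "Running", "Completed"]),
}
final = wait_for_tasks(["a", "b"], lambda t: next(canned[t]),
                       poll_interval=0, sleep=lambda seconds: None)
print(final)
```

With the client from the example above, the status callback would be `lambda task_id: client.get_task_status(task_id).status`.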

### Configuration Customization

Edit `config.yml`:

```yaml
neo4j:
  uri: "bolt://localhost:7687"
  password: "your_password"

gdc:
  download_dir: "./data/gdc"
  max_retries: 3

pipeline:
  fastq:
    quality_threshold: 25  # Increase quality threshold
    min_length: 75         # Increase minimum read length

  blast:
    evalue: 0.0001         # More stringent e-value
    num_threads: 8         # Use more CPU cores
```

---

## Troubleshooting

### Neo4j Connection Issues
```bash
# Check Neo4j status
docker ps | grep neo4j

# Restart Neo4j
docker-compose restart neo4j

# View Neo4j logs
docker-compose logs neo4j
```

### Memory Issues
Increase Docker memory allocation:
1. Open Docker Desktop Settings
2. Resources → Memory
3. Increase to at least 8 GB
4. Click "Apply & Restart"

### API Errors
Check logs:
```bash
# View application logs
cat logs/cancer_at_home.log

# Follow logs in real-time
tail -f logs/cancer_at_home.log
```

---

## Best Practices

1. **Data Management**: Regularly clean up downloaded data to free space
2. **Task Monitoring**: Check BOINC tasks periodically for failures
3. **Database Backup**: Back up the Neo4j data volume regularly
4. **Resource Limits**: Monitor system resources when running large analyses
5. **API Rate Limits**: Be mindful of GDC API rate limits for bulk downloads

---

## Support & Resources

- **Documentation**: See README.md and QUICKSTART.md
- **API Reference**: http://localhost:5000/docs
- **GraphQL Examples**: See GRAPHQL_EXAMPLES.md
- **Logs**: Check `logs/cancer_at_home.log`

### Useful Cypher Queries

**Most common mutations:**
```cypher
MATCH (m:Mutation)<-[:HAS_MUTATION]-(p:Patient)
WITH m, count(p) as patient_count
RETURN m.mutation_id, patient_count
ORDER BY patient_count DESC
LIMIT 10
```

**Genes with most mutations:**
```cypher
MATCH (g:Gene)<-[:AFFECTS]-(m:Mutation)
WITH g, count(m) as mutation_count
RETURN g.symbol, mutation_count
ORDER BY mutation_count DESC
LIMIT 10
```

**Patient mutation profile:**
```cypher
MATCH (p:Patient {patient_id: 'TCGA-A1-001'})-[:HAS_MUTATION]->(m:Mutation)-[:AFFECTS]->(g:Gene)
RETURN g.symbol, m.consequence, m.position
```