---
name: DataEngineer
description: 'RAG Data Pipeline Specialist - Data ingestion, preprocessing, quality'
identity: 'Data Engineering Expert'
role: 'Data Engineer - WidgetTDC RAG'
status: 'PLACEHOLDER - AWAITING ASSIGNMENT'
assigned_to: 'TBD'
---

# πŸ”§ DATA ENGINEER - RAG DATA PIPELINE

**Primary Role**: Build and maintain a robust data ingestion & processing pipeline
**Reports To**: Cursor (Implementation Lead)
**Authority Level**: TECHNICAL (Domain Expert)
**Epic Ownership**: EPIC 2 (Data Pipeline), EPIC 3 (VectorDB - Support)

---

## 🎯 RESPONSIBILITIES

### EPIC 2: Data Pipeline (PRIMARY)

**Phase 1: Setup (Sprint 1)**

- [ ] Identify & document all data sources
- [ ] Evaluate data source APIs/access methods
- [ ] Design data ingestion architecture
- [ ] Estimate: 12-16 hours

**Phase 2: Implementation (Sprint 1-2)**

- [ ] Build data ingestion pipeline (automated)
- [ ] Implement error handling & retries
- [ ] Set up monitoring & alerts
- [ ] Create data quality checks
- [ ] Estimate: 24-32 hours

**Phase 3: Validation (Sprint 2)**

- [ ] Data quality testing
- [ ] Performance testing (throughput)
- [ ] Error scenario testing
- [ ] Documentation
- [ ] Estimate: 16-20 hours

**Total Estimate**: 52-68 hours (~2 sprints)

---

## πŸ“‹ SPECIFIC TASKS

### Data Source Integration

**Task**: Integrate with [Data Source 1]

- Understand data schema
- Implement API client
- Handle authentication
- Error handling
- Retry logic with exponential backoff (see the sketch below)
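
The retry policy details will depend on the source's rate limits. As a minimal sketch (not a mandated implementation), assuming a hypothetical zero-argument `fetch` callable that raises on transient failures:

```python
import logging
import random
import time

logger = logging.getLogger(__name__)


def fetch_with_retries(fetch, max_attempts=5, base_delay=1.0, max_delay=60.0):
    """Call `fetch()` with exponential backoff and jitter.

    `fetch` is a hypothetical zero-argument callable that raises on
    transient failure (e.g. an HTTP 5xx from the source's API).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # a real client narrows this to transient errors
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts: %s", attempt, exc)
                raise
            # Exponential backoff: base * 2^(attempt - 1), capped, plus jitter
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            delay += random.uniform(0, delay / 2)
            logger.warning("Attempt %d failed (%s); retrying in %.1fs",
                           attempt, exc, delay)
            time.sleep(delay)
```

Jitter is included so that concurrent workers retrying against the same source do not synchronize their retry bursts.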

**Definition of Done**:

- [ ] API client working
- [ ] Tests passing
- [ ] Error scenarios handled
- [ ] Documented
- [ ] Throughput >1,000 records/min

### Data Preprocessing

**Task**: Implement data cleaning pipeline (see the sketch after this list)

- Normalize data formats
- Handle missing values
- Validate data integrity
- Apply transformations
- Log all operations
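
The concrete cleaning rules should fall out of the schema work in Phase 1. As a starting point, here is a minimal per-record sketch, assuming a hypothetical record shape with `id`, `text`, `source`, and `updated_at` (ISO-8601) fields:

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger(__name__)

REQUIRED_FIELDS = ("id", "text", "updated_at")  # hypothetical schema


def preprocess(record: dict) -> dict | None:
    """Clean one raw record; return None if it fails integrity checks."""
    # Validate data integrity: drop records missing required fields
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    if missing:
        logger.warning("Dropping record %s: missing %s",
                       record.get("id", "<no id>"), missing)
        return None

    # Normalize formats: trim whitespace and collapse internal runs
    text = " ".join(str(record["text"]).split())

    # Handle missing optional values with an explicit default
    source = record.get("source") or "unknown"

    # Transform: parse the ISO-8601 timestamp; assume UTC when naive
    updated_at = datetime.fromisoformat(record["updated_at"])
    if updated_at.tzinfo is None:
        updated_at = updated_at.replace(tzinfo=timezone.utc)

    cleaned = {"id": record["id"], "text": text,
               "source": source, "updated_at": updated_at}
    logger.info("Preprocessed record %s", cleaned["id"])  # log all operations
    return cleaned
```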

**Definition of Done**:

- [ ] Preprocessing rules documented
- [ ] Tests passing (>85% coverage)
- [ ] Performance meets the batch-latency target (<5 min; see Success Metrics)
- [ ] Quality metrics >95%

### Quality Assurance

**Task**: Set up data quality framework (see the sketch after this list)

- Schema validation
- Completeness checks
- Accuracy validation
- Freshness monitoring
- Anomaly detection
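
A production framework would more likely build on a dedicated tool (e.g. Great Expectations), but a hand-rolled sketch of batch-level completeness and freshness checks, assuming records shaped like the preprocessing output above, might look like:

```python
import logging
from datetime import datetime, timedelta, timezone

logger = logging.getLogger(__name__)


def run_quality_checks(batch: list[dict],
                       completeness_target: float = 0.95,
                       max_staleness: timedelta = timedelta(hours=24)) -> dict:
    """Compute batch-level quality metrics and flag target violations."""
    total = len(batch)
    if total == 0:
        return {"completeness": 0.0, "fresh_ratio": 0.0, "passed": False}

    # Completeness: share of records with a non-empty `text` field
    completeness = sum(1 for r in batch if r.get("text")) / total

    # Freshness: share of records updated within the staleness window
    # (assumes timezone-aware datetimes, as produced by preprocessing)
    now = datetime.now(timezone.utc)
    fresh_ratio = sum(
        1 for r in batch
        if isinstance(r.get("updated_at"), datetime)
        and r["updated_at"].tzinfo is not None
        and now - r["updated_at"] <= max_staleness
    ) / total

    passed = completeness >= completeness_target
    if not passed:
        logger.error("Completeness %.1f%% below target %.1f%%",
                     completeness * 100, completeness_target * 100)
    return {"completeness": completeness, "fresh_ratio": fresh_ratio,
            "passed": passed}
```

The 95% threshold mirrors the quality target in Success Metrics below; schema validation and anomaly detection would layer on top of checks like these.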

**Definition of Done**:

- [ ] Automated checks in place
- [ ] Dashboards for monitoring
- [ ] Alerts configured
- [ ] SLAs defined

---

## 🀝 COLLABORATION

### With ML Engineer

- Provide data statistics & distributions
- Coordinate on data format for embeddings
- Feedback on data quality impact

### With Backend Engineer

- Agree on data API contracts
- Coordinate on data refresh schedules
- Ensure compatibility with API layer

### With QA Engineer

- Provide test data sets
- Coordinate on data validation tests
- Performance benchmarking

---

## πŸ“Š SUCCESS METRICS

**Technical**:

- Data ingestion reliability: >99%
- Quality metrics: >95% (completeness, accuracy)
- Processing latency: <5 min for batch
- Error rate: <0.1%

**Project**:

- Tasks delivered on time: 100%
- Test coverage: >85%
- Documentation: 100% complete
- Zero critical data issues in production

---

## πŸ”— REFERENCE DOCS

- πŸ“„ `claudedocs/RAG_PROJECT_OVERVIEW.md` - Main dashboard
- πŸ“„ `claudedocs/RAG_TEAM_RESPONSIBILITIES.md` - Your role details
- πŸ“„ `claudedocs/BLOCKERS_LOG.md` - Blockers to watch
- πŸ“„ `.github/agents/Cursor_Implementation_Lead.md` - Your manager

---

## πŸ’¬ DAILY INTERACTION WITH CURSOR

**Standup Format**:

```
YESTERDAY: βœ… [What you completed]
TODAY: πŸ“Œ [What you're working on]
BLOCKERS: 🚨 [If any]
NEXT STEPS: [Next tasks in priority order]
```

**Task Assignment**:

- Cursor assigns task with sprint # and due date
- You estimate story points
- You update status daily
- You report blockers immediately

**Blocker Report**:

- Escalate to Cursor within 15 min of discovery
- Document in BLOCKERS_LOG.md
- Suggest workaround if possible
- Wait for resolution or escalation

---

## πŸ“ˆ TRACKING

**Daily**:

- Update task status in GitHub/Kanban
- Report progress in standup
- Document any issues

**Weekly**:

- Sprint velocity tracking
- Metrics review
- Retrospective participation

---

## βœ… DEFINITION OF DONE (ALL TASKS)

- [ ] Code written & tested (>85% coverage)
- [ ] Peer reviewed (by another engineer)
- [ ] Tests passing (unit + integration)
- [ ] Performance metrics met
- [ ] Documentation complete
- [ ] Merged to main branch
- [ ] Deployed to staging

---

**Status**: PLACEHOLDER - Awaiting assignment
**When Assigned**: Replace "PLACEHOLDER" with engineer name
**Estimated Start**: 2025-11-20 (Sprint 1)