musaw committed f725a8a (0 parents)

chore: initial community foundation structure
.github/ISSUE_TEMPLATE/bug_report.md ADDED
---
name: Bug report
about: Report a reproducible bug
title: "[Bug] "
labels: bug
assignees: ''
---

## Description

## Steps to reproduce
1.
2.
3.

## Expected behavior

## Environment
- OS:
- Branch/commit:

## Logs / screenshots
.github/ISSUE_TEMPLATE/dataset_task.md ADDED
---
name: Dataset task
about: Propose/track a data collection or curation task
title: "[Data] "
labels: data
assignees: ''
---

## Task type
- [ ] Collection
- [ ] Validation
- [ ] Normalization
- [ ] Metadata QA

## Scope

## Acceptance criteria

## Notes
.github/ISSUE_TEMPLATE/feature_request.md ADDED
---
name: Feature request
about: Suggest an improvement
title: "[Feature] "
labels: enhancement
assignees: ''
---

## Problem

## Proposed solution

## Alternatives considered

## Additional context
.github/PULL_REQUEST_TEMPLATE.md ADDED
## Summary
- What changed and why

## Type of change
- [ ] Data
- [ ] ASR
- [ ] TTS
- [ ] Benchmark
- [ ] Docs

## Validation
- Steps used to validate
- Key results/metrics

## Checklist
- [ ] Linked issue
- [ ] Reproducible steps included
- [ ] Docs updated if needed
.gitignore ADDED
# Python
__pycache__/
*.py[cod]
.venv/

# Data/model artifacts
*.wav
*.mp3
*.flac
*.m4a
*.mp4
*.mkv
*.mov
*.avi
*.zip

# Large/generated folders
outputs/
checkpoints/
artifacts/

# OS/editor
.DS_Store
Thumbs.db
.vscode/
.idea/
CODE_OF_CONDUCT.md ADDED
# Code of Conduct

We are committed to a welcoming, respectful, and inclusive community.

## Expected behavior
- Be respectful and constructive.
- Assume good intent and communicate clearly.
- Give actionable feedback, not personal criticism.

## Unacceptable behavior
- Harassment, hate speech, or discrimination.
- Doxxing, threats, or abusive language.
- Repeated disruptive behavior after warnings.

## Enforcement
- Maintainers may warn, mute, or remove participants for violations.
- Serious cases may be escalated and documented.
CONTRIBUTING.md ADDED
# Contributing

Thanks for helping build open Pashto AI resources.

## Ways to contribute
- Data recording and validation
- Text normalization and terminology fixes
- Model training/evaluation scripts
- Documentation, issue triage, and testing

## Contribution flow
1. Open or pick an issue.
2. Comment with your plan.
3. Create a branch and make focused changes.
4. Open a PR with a clear summary and testing notes.

## Standards
- Keep changes small and reviewable.
- Include reproducible steps for data/model changes.
- Document assumptions, limitations, and risks.
- Respect contributors and community guidelines.

## Priority labels (recommended)
- `good first issue`
- `data`
- `asr`
- `tts`
- `benchmark`
- `docs`
- `help wanted`
GOVERNANCE.md ADDED
# Governance

## Model
Lightweight maintainer model with transparent decision-making.

## Roles
- **Maintainers**: review/merge PRs, release planning, quality control.
- **Contributors**: submit code/data/docs, review, and improve workflows.
- **Community moderators**: keep discussion spaces healthy and productive.

## Decision process
- Default: consensus in issue/PR discussion.
- If blocked: maintainer vote with rationale posted publicly.
- Major changes: RFC issue with at least 7 days for feedback.

## Release ownership
- Each release has one responsible maintainer and one backup reviewer.

## Conflict resolution
- Follow `CODE_OF_CONDUCT.md`.
- Report issues privately to maintainers when needed.
LICENSE_POLICY.md ADDED
# License Policy (Draft)

Use separate licenses for:
- Code
- Datasets
- Model weights

Recommended defaults:
- Code: Apache-2.0
- Datasets: a clear open data license with attribution terms
- Models: aligned with training data and dependency licenses

Finalize this file before the first public release.
PROJECT_PURPOSE.md ADDED
# Project Purpose

## Why this project exists
Pashto remains underrepresented in open AI speech/language resources. This project exists to close that gap through community collaboration.

## Mission
Create high-quality open resources that enable Pashto to work reliably in:
- Speech recognition (ASR)
- Text-to-speech (TTS)
- Translation and NLP tooling

## What success looks like
- Public Pashto datasets with clear quality standards
- Reproducible baseline models and training pipelines
- A public benchmark/leaderboard for fair model comparison
- Open desktop/API demos that real users can run

## Non-commercial commitment
This initiative is community-first and public-benefit oriented. The project is not being built for proprietary lock-in or short-term commercialization.

## Principles
- Openness: data/model/process transparency
- Inclusivity: dialect and accent diversity
- Quality: strong labeling/review standards
- Reproducibility: scripts, configs, and documented experiments
- Continuity: release cadence and long-term maintenance

## Scope (v1 foundation)
- Build the core repository and contributor workflows
- Launch the Pashto data collection and validation pipeline
- Publish ASR and TTS baselines
- Publish the first benchmark set and metrics

## Out of scope (for now)
- Closed paid APIs as the only path
- Private datasets without reproducible provenance
- Productization before core language quality is established
README.md ADDED
# Pukhto/Pashto Open Language Project

Community-led open-source project to make Pashto a first-class language in AI speech and language tooling.

## Core Goals
- Build open datasets, benchmarks, and models for Pashto ASR, TTS, and NLP.
- Keep work reproducible, transparent, and contribution-friendly.
- Focus on public good and broad accessibility.

## Start Here
- Purpose: `PROJECT_PURPOSE.md`
- Contributing: `CONTRIBUTING.md`
- Roadmap: `ROADMAP.md`
- Governance: `GOVERNANCE.md`
- Community coordination: `community/COMMUNICATION.md`

## Initial Workstreams
- `data/` Pashto data collection, cleaning, metadata
- `asr/` speech-to-text baselines and experiments
- `tts/` text-to-speech baselines and experiments
- `benchmarks/` fixed test sets and evaluation scripts
- `apps/desktop/` desktop app integration references
ROADMAP.md ADDED
# Roadmap

## Phase 1: Foundation (0-2 months)
- Finalize governance and contribution docs
- Define Pashto text normalization policy
- Prepare data schema and validation checklist
- Publish baseline ASR/TTS experiment templates

## Phase 2: Data Scale (2-4 months)
- Community data campaigns (recording + validation)
- Curate and release dataset versions (`v0.1`, `v0.2`)
- Improve metadata quality (speaker, dialect, environment)

## Phase 3: Baseline Models (4-6 months)
- Train and release the first open ASR baseline
- Train and release the first open TTS baseline
- Publish reproducible training/eval scripts

## Phase 4: Benchmark & Demos (6-9 months)
- Release a fixed evaluation benchmark
- Launch a public leaderboard (WER/CER + TTS quality eval)
- Integrate models into desktop/app demos

## Phase 5: Community Maturity (9+ months)
- Regular release cadence
- Contributor mentoring and review rotations
- Long-term maintenance and quality governance
apps/desktop/README.md ADDED
# Desktop Integration

Tracks desktop app integration for ASR/TTS/translation pipelines.
asr/README.md ADDED
# ASR Workspace

Place ASR baselines, training configs, and evaluation scripts here.
benchmarks/README.md ADDED
# Benchmarks

Define fixed test sets, metrics, and leaderboard generation scripts.
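A leaderboard needs a reference scorer everyone trusts. A minimal word error rate sketch is below; note that whitespace tokenization and any pre-scoring normalization are assumptions here, not settled policy for Pashto text:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

CER is the same recurrence applied to characters instead of tokens, which matters for Pashto where spelling variation can dominate word-level scores.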
community/COMMUNICATION.md ADDED
# Community Communication

## Channels
- GitHub Issues/Discussions for technical decisions
- Community chat for coordination and quick support

## Meeting rhythm
- Weekly async update thread
- Monthly community review call

## Rules
- Keep technical decisions in public threads
- Summarize outcomes after meetings
- Tag maintainers only when blocked
community/RECOGNITION.md ADDED
# Contributor Recognition

## How contributors are recognized
- Release notes mention key contributors
- Monthly spotlight for impactful community work
- Maintainer nomination path for sustained contributions
data/README.md ADDED
# Data Workspace

- `raw/` incoming source files
- `processed/` cleaned/aligned artifacts
- `metadata/` manifests, speaker/dialect info, QA reports
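As a sketch of what a QA check over a `metadata/` manifest could look like, assuming a row-per-utterance layout; the field names below are illustrative, not a settled schema:

```python
# Hypothetical manifest fields; the real schema is defined in Phase 1.
REQUIRED_FIELDS = {"utterance_id", "audio_path", "transcript",
                   "speaker_id", "dialect", "sample_rate"}

def validate_manifest_row(row: dict) -> list:
    """Return a list of problems for one manifest row (empty list = pass)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - row.keys())]
    if not str(row.get("transcript", "")).strip():
        problems.append("empty transcript")
    rate = row.get("sample_rate")
    if not isinstance(rate, int) or rate < 16000:
        problems.append("sample_rate missing, non-integer, or below 16000")
    return problems
```

Running this over every row before a dataset release would make the "validation checklist" item in the roadmap mechanical rather than manual.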
data/metadata/.gitkeep ADDED
data/processed/.gitkeep ADDED
data/raw/.gitkeep ADDED
docs/dataset_guidelines.md ADDED
# Dataset Guidelines

## Minimum metadata
- Speaker ID (anonymized)
- Approximate age band
- Gender (optional/self-declared)
- Dialect/region
- Recording environment and device class

## Audio quality basics
- Prefer clean speech at a 16 kHz or higher sample rate
- Avoid clipping and heavy background noise
- Keep the transcript aligned with the spoken content

## Text policy
- Use agreed normalization rules
- Keep punctuation consistent
- Track alternate spellings in a glossary
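The audio basics above can be machine-checked at submission time. A rough sketch for 16-bit mono WAV files using only the Python standard library; the thresholds are illustrative, not agreed policy:

```python
import array
import wave

def audio_qa(path, min_rate=16000, clip_level=32767):
    """Flag a low sample rate and hard clipping in a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    peak = max((abs(s) for s in samples), default=0)
    return {
        "sample_rate_ok": rate >= min_rate,
        # samples pinned at full scale suggest the recording clipped
        "clipped": peak >= clip_level,
    }
```

A fuller check would also look at channel count, duration bounds, and sustained background noise, but even this catches the most common rejects.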
docs/platforms.md ADDED
# Platforms

## Primary platforms
- GitHub: code, issues, pull requests, releases
- Hugging Face Hub: models, datasets, demos
- Community chat (Discord/Matrix): contributor coordination

## Publishing expectations
- Every release links to a changelog and benchmark snapshot
- Every model links to dataset provenance and eval metrics
docs/release_process.md ADDED
# Release Process

## Cadence
- Monthly milestone release
- Hotfix releases as needed

## Required for release
- Changelog summary
- Benchmark snapshot
- Known limitations
- Reproducible commands/scripts

## Versioning
- Use semantic-style tags for major milestones (e.g., `v0.1`, `v0.2`)
docs/workstreams.md ADDED
# Workstreams

## Data
- Collection guides, consent, validation, and metadata policy.

## ASR
- Baselines, fine-tuning recipes, and evaluation scripts.

## TTS
- Baselines, speaker/style control, and quality assessment.

## Benchmarks
- Fixed test set, metric definitions, and leaderboard process.

## Applications
- Desktop and API integrations for real-user testing.
models/asr/.gitkeep ADDED
models/tts/.gitkeep ADDED
scripts/README.md ADDED
# Scripts

Automation scripts for setup, data checks, training, and evaluation.
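As an example of the kind of data-check script that belongs here, a hypothetical Pashto text normalization pass; the actual rules are still to be agreed in the Phase 1 normalization policy:

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic tatweel (kashida): visual stretching, no meaning
DIACRITICS = {chr(c) for c in range(0x064B, 0x0660)}  # Arabic harakat/marks

def normalize_text(text: str) -> str:
    """Illustrative normalization: NFC form, drop tatweel and optional
    diacritics, collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch != TATWEEL and ch not in DIACRITICS)
    return " ".join(text.split())
```

Whether diacritics should be dropped or preserved (e.g., for TTS training text) is exactly the kind of decision the normalization policy needs to pin down before data campaigns start.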
tts/README.md ADDED
# TTS Workspace

Place TTS baselines, training configs, and quality-evaluation scripts here.