musaw committed f725a8a (0 parents)

chore: initial community foundation structure
.github/ISSUE_TEMPLATE/bug_report.md ADDED
---
name: Bug report
about: Report a reproducible bug
title: "[Bug] "
labels: bug
assignees: ''
---

## Description

## Steps to reproduce
1.
2.
3.

## Expected behavior

## Environment
- OS:
- Branch/commit:

## Logs / screenshots
.github/ISSUE_TEMPLATE/dataset_task.md ADDED
---
name: Dataset task
about: Propose/track a data collection or curation task
title: "[Data] "
labels: data
assignees: ''
---

## Task type
- [ ] Collection
- [ ] Validation
- [ ] Normalization
- [ ] Metadata QA

## Scope

## Acceptance criteria

## Notes
.github/ISSUE_TEMPLATE/feature_request.md ADDED
---
name: Feature request
about: Suggest an improvement
title: "[Feature] "
labels: enhancement
assignees: ''
---

## Problem

## Proposed solution

## Alternatives considered

## Additional context
.github/PULL_REQUEST_TEMPLATE.md ADDED
## Summary
- What changed and why

## Type of change
- [ ] Data
- [ ] ASR
- [ ] TTS
- [ ] Benchmark
- [ ] Docs

## Validation
- Steps used to validate
- Key results/metrics

## Checklist
- [ ] Linked issue
- [ ] Reproducible steps included
- [ ] Docs updated if needed
.gitignore ADDED
# Python
__pycache__/
*.py[cod]
.venv/

# Data/model artifacts
*.wav
*.mp3
*.flac
*.m4a
*.mp4
*.mkv
*.mov
*.avi
*.zip

# Large/generated folders
outputs/
checkpoints/
artifacts/

# OS/editor
.DS_Store
Thumbs.db
.vscode/
.idea/
CODE_OF_CONDUCT.md ADDED
# Code of Conduct

We are committed to a welcoming, respectful, and inclusive community.

## Expected behavior
- Be respectful and constructive.
- Assume good intent and communicate clearly.
- Give actionable feedback, not personal criticism.

## Unacceptable behavior
- Harassment, hate speech, or discrimination.
- Doxxing, threats, or abusive language.
- Repeated disruptive behavior after warnings.

## Enforcement
- Maintainers may warn, mute, or remove participants for violations.
- Serious cases may be escalated and documented.
CONTRIBUTING.md ADDED
# Contributing

Thanks for helping build open Pashto AI resources.

## Ways to contribute
- Data recording and validation
- Text normalization and terminology fixes
- Model training/evaluation scripts
- Documentation, issue triage, and testing

## Contribution flow
1. Open or pick an issue.
2. Comment with your plan.
3. Create a branch and make focused changes.
4. Open a PR with a clear summary and testing notes.

## Standards
- Keep changes small and reviewable.
- Include reproducible steps for data/model changes.
- Document assumptions, limitations, and risks.
- Respect contributors and community guidelines.

## Priority labels (recommended)
- `good first issue`
- `data`
- `asr`
- `tts`
- `benchmark`
- `docs`
- `help wanted`
GOVERNANCE.md ADDED
# Governance

## Model
Lightweight maintainer model with transparent decision-making.

## Roles
- **Maintainers**: review/merge PRs, release planning, quality control.
- **Contributors**: submit code/data/docs, review, and improve workflows.
- **Community moderators**: keep discussion spaces healthy and productive.

## Decision process
- Default: consensus in issue/PR discussion.
- If blocked: maintainer vote with rationale posted publicly.
- Major changes: RFC issue with at least 7 days for feedback.

## Release ownership
- Each release has one responsible maintainer and one backup reviewer.

## Conflict resolution
- Follow `CODE_OF_CONDUCT.md`.
- Report issues privately to maintainers when needed.
LICENSE_POLICY.md ADDED
# License Policy (Draft)

Use separate licenses for:
- Code
- Datasets
- Model weights

Recommended defaults:
- Code: Apache-2.0
- Datasets: a clear open data license with attribution terms
- Models: aligned with training data and dependency licenses

Finalize this file before the first public release.
PROJECT_PURPOSE.md ADDED
# Project Purpose

## Why this project exists
Pashto remains underrepresented in open AI speech/language resources. This project exists to close that gap through community collaboration.

## Mission
Create high-quality open resources that enable Pashto to work reliably in:
- Speech recognition (ASR)
- Text-to-speech (TTS)
- Translation and NLP tooling

## What success looks like
- Public Pashto datasets with clear quality standards
- Reproducible baseline models and training pipelines
- A public benchmark/leaderboard for fair model comparison
- Open desktop/API demos that real users can run

## Non-commercial commitment
This initiative is community-first and public-benefit oriented. The project is not being built for proprietary lock-in or short-term commercialization.

## Principles
- Openness: data/model/process transparency
- Inclusivity: dialect and accent diversity
- Quality: strong labeling/review standards
- Reproducibility: scripts, configs, and documented experiments
- Continuity: release cadence and long-term maintenance

## Scope (v1 foundation)
- Build the core repository and contributor workflows
- Launch the Pashto data collection and validation pipeline
- Publish ASR and TTS baselines
- Publish the first benchmark set and metrics

## Out of scope (for now)
- Closed paid APIs as the only path
- Private datasets without reproducible provenance
- Productization before core language quality is established
README.md ADDED
# Pukhto/Pashto Open Language Project

Community-led open-source project to make Pashto a first-class language in AI speech and language tooling.

## Core Goals
- Build open datasets, benchmarks, and models for Pashto ASR, TTS, and NLP.
- Keep work reproducible, transparent, and contribution-friendly.
- Focus on public good and broad accessibility.

## Start Here
- Purpose: `PROJECT_PURPOSE.md`
- Contributing: `CONTRIBUTING.md`
- Roadmap: `ROADMAP.md`
- Governance: `GOVERNANCE.md`
- Community coordination: `community/COMMUNICATION.md`

## Initial Workstreams
- `data/` Pashto data collection, cleaning, metadata
- `asr/` speech-to-text baselines and experiments
- `tts/` text-to-speech baselines and experiments
- `benchmarks/` fixed test sets and evaluation scripts
- `apps/desktop/` desktop app integration references
ROADMAP.md ADDED
# Roadmap

## Phase 1: Foundation (0-2 months)
- Finalize governance and contribution docs
- Define Pashto text normalization policy
- Prepare data schema and validation checklist
- Publish baseline ASR/TTS experiment templates

## Phase 2: Data Scale (2-4 months)
- Community data campaigns (recording + validation)
- Curate and release dataset versions (`v0.1`, `v0.2`)
- Improve metadata quality (speaker, dialect, environment)

## Phase 3: Baseline Models (4-6 months)
- Train and release the first open ASR baseline
- Train and release the first open TTS baseline
- Publish reproducible training/eval scripts

## Phase 4: Benchmark & Demos (6-9 months)
- Release a fixed evaluation benchmark
- Launch a public leaderboard (WER/CER + TTS quality eval)
- Integrate models into desktop/app demos

## Phase 5: Community Maturity (9+ months)
- Regular release cadence
- Contributor mentoring and review rotations
- Long-term maintenance and quality governance
apps/desktop/README.md ADDED
# Desktop Integration

Tracks desktop app integration for ASR/TTS/translation pipelines.
asr/README.md ADDED
# ASR Workspace

Place ASR baselines, training configs, and evaluation scripts here.
benchmarks/README.md ADDED
# Benchmarks

Define fixed test sets, metrics, and leaderboard generation scripts.
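A leaderboard needs a reference scorer everyone trusts. A minimal word error rate sketch is below; note that whitespace tokenization and any pre-scoring normalization are assumptions here, not settled policy for Pashto text:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: Levenshtein distance over whitespace tokens,
    divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

CER is the same recurrence applied to characters instead of tokens, which matters for Pashto where spelling variation can dominate word-level scores.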
community/COMMUNICATION.md ADDED
# Community Communication

## Channels
- GitHub Issues/Discussions for technical decisions
- Community chat for coordination and quick support

## Meeting rhythm
- Weekly async update thread
- Monthly community review call

## Rules
- Keep technical decisions in public threads
- Summarize outcomes after meetings
- Tag maintainers only when blocked
community/RECOGNITION.md ADDED
# Contributor Recognition

## How contributors are recognized
- Release notes mention key contributors
- Monthly spotlight for impactful community work
- Maintainer nomination path for sustained contributions
data/README.md ADDED
# Data Workspace

- `raw/` incoming source files
- `processed/` cleaned/aligned artifacts
- `metadata/` manifests, speaker/dialect info, QA reports
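As a sketch of what a QA check over a `metadata/` manifest could look like, assuming a row-per-utterance layout; the field names below are illustrative, not a settled schema:

```python
# Hypothetical manifest fields; the real schema is defined in Phase 1.
REQUIRED_FIELDS = {"utterance_id", "audio_path", "transcript",
                   "speaker_id", "dialect", "sample_rate"}

def validate_manifest_row(row: dict) -> list:
    """Return a list of problems for one manifest row (empty list = pass)."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - row.keys())]
    if not str(row.get("transcript", "")).strip():
        problems.append("empty transcript")
    rate = row.get("sample_rate")
    if not isinstance(rate, int) or rate < 16000:
        problems.append("sample_rate missing, non-integer, or below 16000")
    return problems
```

Running this over every row before a dataset release would make the "validation checklist" item in the roadmap mechanical rather than manual.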
data/metadata/.gitkeep ADDED
data/processed/.gitkeep ADDED
data/raw/.gitkeep ADDED
docs/dataset_guidelines.md ADDED
# Dataset Guidelines

## Minimum metadata
- Speaker ID (anonymized)
- Approximate age band
- Gender (optional/self-declared)
- Dialect/region
- Recording environment and device class

## Audio quality basics
- Prefer clean speech at a 16 kHz or higher sample rate
- Avoid clipping and heavy background noise
- Keep the transcript aligned with the spoken content

## Text policy
- Use agreed normalization rules
- Keep punctuation consistent
- Track alternate spellings in a glossary
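The audio basics above can be machine-checked at submission time. A rough sketch for 16-bit mono WAV files using only the Python standard library; the thresholds are illustrative, not agreed policy:

```python
import array
import wave

def audio_qa(path, min_rate=16000, clip_level=32767):
    """Flag a low sample rate and hard clipping in a 16-bit mono WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        samples = array.array("h", w.readframes(w.getnframes()))
    peak = max((abs(s) for s in samples), default=0)
    return {
        "sample_rate_ok": rate >= min_rate,
        # samples pinned at full scale suggest the recording clipped
        "clipped": peak >= clip_level,
    }
```

A fuller check would also look at channel count, duration bounds, and sustained background noise, but even this catches the most common rejects.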
docs/platforms.md ADDED
# Platforms

## Primary platforms
- GitHub: code, issues, pull requests, releases
- Hugging Face Hub: models, datasets, demos
- Community chat (Discord/Matrix): contributor coordination

## Publishing expectations
- Every release links to a changelog and benchmark snapshot
- Every model links to dataset provenance and eval metrics
docs/release_process.md ADDED
# Release Process

## Cadence
- Monthly milestone release
- Hotfix releases as needed

## Required for release
- Changelog summary
- Benchmark snapshot
- Known limitations
- Reproducible commands/scripts

## Versioning
- Use semantic-style tags for major milestones (e.g., `v0.1`, `v0.2`)
docs/workstreams.md ADDED
# Workstreams

## Data
- Collection guides, consent, validation, and metadata policy.

## ASR
- Baselines, fine-tuning recipes, and evaluation scripts.

## TTS
- Baselines, speaker/style control, and quality assessment.

## Benchmarks
- Fixed test set, metric definitions, and leaderboard process.

## Applications
- Desktop and API integrations for real-user testing.
models/asr/.gitkeep ADDED
models/tts/.gitkeep ADDED
scripts/README.md ADDED
# Scripts

Automation scripts for setup, data checks, training, and evaluation.
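As an example of the kind of data-check script that belongs here, a hypothetical Pashto text normalization pass; the actual rules are still to be agreed in the Phase 1 normalization policy:

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic tatweel (kashida): visual stretching, no meaning
DIACRITICS = {chr(c) for c in range(0x064B, 0x0660)}  # Arabic harakat/marks

def normalize_text(text: str) -> str:
    """Illustrative normalization: NFC form, drop tatweel and optional
    diacritics, collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = "".join(ch for ch in text if ch != TATWEEL and ch not in DIACRITICS)
    return " ".join(text.split())
```

Whether diacritics should be dropped or preserved (e.g., for TTS training text) is exactly the kind of decision the normalization policy needs to pin down before data campaigns start.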
tts/README.md ADDED
# TTS Workspace

Place TTS baselines, training configs, and quality-evaluation scripts here.