	# How We Use Codex To Build Pozify

	This is a short field note on how we used Codex while building Pozify, our small-model workout form
	coach. Pozify takes a short exercise video and turns it into a structured form-review report: pose
	analysis, exercise routing, rep counting, issue markers, annotated clips, and a grounded coach
	summary.

	That kind of product has a lot of moving pieces. It is not only a UI. It is not only a model. It has
	data preparation, computer vision, small-model inference, deterministic rules, safety wording,
	training scripts, deployment constraints, and docs. Codex helped because it could move across those
	layers with the repo open, while still letting us stay in control of product direction.

	Codex did not replace engineering judgment. It made the loop between idea, implementation, review,
	and documentation much tighter.

	## Why Codex Was Useful For This Project

	The biggest advantage was context. A normal chatbot can answer a question, but Codex can inspect the
	actual project: `app.py`, `src/pozify/`, `web/`, `scripts/`, `configs/`, `tests/`, and `docs/`.
	That matters because the best answer for Pozify is usually not the most generic answer. It has to fit
	the current pipeline.

	For example, if we want to improve push-up feedback, the right starting point is not "write a new
	fitness AI feature." The right starting point is:

	- read the existing push-up analyzer
	- read the shared rep counter and issue-marker helpers
	- check the tests that define current behavior
	- understand the JSON contracts used by the UI and coach summary
	- then propose the smallest useful change

	Codex is good at that kind of grounded work. It can keep the current codebase in view while it
	brainstorms, implements, reviews, and documents.

	## Brainstorming: From Vague Ideas To Scoped Work

	Early in the project, many ideas started rough:

	- Can the app explain bad reps better?
	- Should unsupported exercises be rejected or forced into the closest label?
	- How should we show confidence without making medical claims?
	- What is the smallest useful coach summary model we can ship?

	Codex was useful because we could ask it to brainstorm inside the project constraints. A good prompt
	was not just "give me ideas." It was closer to:

	```text
	Read the current Pozify pipeline and brainstorm three ways to improve form feedback. Keep the ideas
	compatible with the existing analyzer structure and avoid medical claims.
	```

	The result was more useful than a blank-page brainstorm. Codex could separate product ideas from
	engineering tasks, identify likely files, and call out risk. That helped us avoid turning every idea
	into a large rewrite.

	The best brainstorming output usually had this shape:

	- what the user problem is
	- what the smallest version could be
	- which files would change
	- what tests would prove it works
	- what wording needs human review

	That is why Codex is good for early-stage product work: it can turn messy intent into a concrete
	engineering path without pretending the path is risk-free.

	## Deep Research: Faster Learning, Better Tradeoffs

	Pozify touches several systems that change over time: Hugging Face Spaces, Modal, MediaPipe,
	Gradio, small-model inference, and model publishing. We used Codex for deep research when we needed
	to understand a tool before changing code.

	The useful pattern was to ask Codex to separate facts from recommendations:

	```text
	Research the current deployment constraints that matter for this Gradio app. Prefer official
	sources. Summarize the facts, explain what they mean for Pozify, then recommend changes only if they
	are justified.
	```

	That made research actionable. We did not want a long pile of links. We wanted to know what affected
	the project:

	- Does this runtime support the dependency we need?
	- Should this model run through hosted inference or local inference?
	- Does this training job belong locally, in CI, or on Modal?
	- Which setting affects GPU time, startup time, or reliability?

	Codex was good here because it could connect research back to the repo. It could say, "this affects
	the provider code," or "this belongs in the Modal script," or "this should be documented in the
	training report."
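To make the hosted-versus-local question concrete, here is a minimal sketch of provider code that tries inference backends in order and falls back when one fails. The names `SummaryProvider`, `deterministic_summary`, and `summarize_with_fallback` are hypothetical and are not Pozify's actual provider API. The sketch only shows the shape: try each provider, record which one answered, and fall back to a rule-based summary if every model path fails.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


# Hypothetical provider shape: a name plus a callable that turns structured
# evidence into summary text. This is not Pozify's real provider interface.
@dataclass
class SummaryProvider:
    name: str
    generate: Callable[[dict], str]


def deterministic_summary(evidence: dict) -> str:
    """Rule-based fallback that only restates fields present in the evidence."""
    return f"{evidence['exercise']}: {evidence['rep_count']} reps analyzed."


def summarize_with_fallback(
    evidence: dict, providers: Sequence[SummaryProvider]
) -> tuple[str, str]:
    """Return (summary_text, provider_name), trying each provider in order."""
    for provider in providers:
        try:
            return provider.generate(evidence), provider.name
        except Exception:
            # Any provider failure (timeout, missing weights, quota) moves on
            # to the next provider instead of failing the whole report.
            continue
    return deterministic_summary(evidence), "deterministic"
```

Returning the provider name alongside the text is one way to let the UI and tests see which path actually produced a summary.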

	We still treated research as research. Codex could make recommendations, but we converted those
	recommendations into scoped implementation tasks before changing the project.

	## Collaboration: Keeping The Team In The Same Thread

	Codex was also useful as a collaboration tool. When multiple people touch a fast-moving project, the
	hard part is often not writing code. The hard part is remembering why a change exists, what is still
	untested, and what another teammate needs to know.

	We used Codex to create handoff notes like:

	```text
	Summarize the current branch for another contributor. Include what changed, why it changed, how to
	test it, and what still needs review.
	```

	That was especially helpful around training and deployment work. A branch might include a script
	change, a config update, a docs update, and a model artifact note. Codex could inspect the diff and
	turn it into a readable handoff.

The collaboration rule we kept was simple: Codex can help explain and organize work, but it should
not overwrite another team member's changes. Before editing, we ask it to run `git status --short` and
read the relevant diffs. That keeps the workflow respectful of everyone else's in-progress work.
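The same check can be scripted so that a person or an agent sees which files already have uncommitted changes before touching anything. This is a minimal sketch and not part of the Pozify repo. It uses `git status --porcelain`, the script-stable form of the short format. The function names are hypothetical, and the parser does not unescape paths that git quotes because they contain whitespace or special characters.

```python
import subprocess


def parse_porcelain(output: str) -> list[tuple[str, str]]:
    """Parse `git status --porcelain` (v1) output into (status, path) pairs.

    Each line is two status characters, a space, then the path. Renames
    ("R  old -> new") report the new path. Quoted paths are kept as-is.
    """
    entries = []
    for line in output.splitlines():
        if not line.strip():
            continue
        # Do not strip the line itself: a leading space is part of the status.
        status, path = line[:2], line[3:]
        if " -> " in path:
            path = path.split(" -> ", 1)[1]
        entries.append((status, path))
    return entries


def dirty_paths(repo_dir: str = ".") -> list[tuple[str, str]]:
    """Run git in repo_dir and return modified, staged, or untracked paths."""
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_porcelain(result.stdout)
```

Keeping the parser separate from the `subprocess` call makes it testable without a real repository.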

	## Code Review: A Fast Second Pass

One of the best uses of Codex was code review. It was not a replacement for human review, but a fast
second pass before asking someone else to look.

	The review prompt we used most often was direct:

	```text
	Review the current diff. Focus on correctness, regressions, missing tests, grounding, and user
	safety. Put findings first with file and line references.
	```

	That framing matters. We did not ask Codex to nitpick style. We asked it to look for things that
	could break the product:

	- a pipeline contract changed but the UI still expects the old field
	- a fallback path no longer works when a model provider fails
	- a test fixture covers only the happy path
	- a coach summary can say more than the structured evidence supports
	- a deployment setting works locally but not on Hugging Face Spaces
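The first risk in that list, a pipeline contract drifting away from what the UI reads, is cheap to guard with a focused test. Here is a minimal sketch, assuming a hypothetical report payload and a hypothetical set of UI-required field names. Pozify's real contracts live in `src/pozify/contracts.py`, and its actual fields may differ.

```python
# Hypothetical field names the web UI reads from a report payload.
UI_REQUIRED_FIELDS: frozenset[str] = frozenset({"exercise", "rep_count", "issues", "summary"})


def missing_ui_fields(report: dict, required: frozenset[str] = UI_REQUIRED_FIELDS) -> list[str]:
    """Return the required fields that the report payload does not provide."""
    return sorted(required.difference(report))


def test_report_matches_ui_contract():
    # In a real test this payload would come from running the pipeline on a fixture video.
    report = {"exercise": "push-up", "rep_count": 8, "issues": [], "summary": "..."}
    assert missing_ui_fields(report) == []
```

If a pipeline change renames `rep_count`, this test fails with the missing name. That is more useful than discovering a blank field in the rendered report.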

	Codex is good at review because it can inspect related files quickly. If a change touches
	`src/pozify/steps/coach_summary.py`, it can also check the verifier, fallback summary, provider
	tests, and docs. That is the kind of cross-file attention that catches practical regressions.
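The grounding risk is the one a reviewer most needs to see enforced in code. Below is a deliberately simple, hypothetical verifier. It is not the verifier in `src/pozify/steps/coach_summary.py`. It flags a summary that mentions a number absent from the structured evidence, or that uses a term from a small placeholder list of medical words. Real grounding checks need more than this, but even a crude rule catches a summary that invents a rep count.

```python
import re

# Hypothetical blocklist for illustration; the real safety wording is human-owned.
MEDICAL_TERMS = {"injury", "diagnose", "diagnosis", "treatment", "rehab"}


def evidence_numbers(evidence: dict) -> set[str]:
    """Collect every int or float value in the (nested) evidence, normalized as strings."""
    found = set()

    def walk(value):
        if isinstance(value, bool):
            return  # bools are ints in Python; they are not counts or angles
        if isinstance(value, (int, float)):
            found.add(f"{value:g}")
        elif isinstance(value, dict):
            for item in value.values():
                walk(item)
        elif isinstance(value, (list, tuple)):
            for item in value:
                walk(item)

    walk(evidence)
    return found


def verify_summary(summary: str, evidence: dict) -> list[str]:
    """Return a list of problems; an empty list means the summary passed."""
    problems = []
    allowed = evidence_numbers(evidence)
    for number in re.findall(r"\d+(?:\.\d+)?", summary):
        if f"{float(number):g}" not in allowed:
            problems.append(f"unsupported number: {number}")
    words = set(re.findall(r"[a-z]+", summary.lower()))
    for term in sorted(MEDICAL_TERMS & words):
        problems.append(f"medical term: {term}")
    return problems
```

Returning a list of problems instead of a bare boolean makes review comments and test failures easier to read.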

	## Implementing UI And Code

	For implementation, Codex was most helpful when the task was scoped. "Improve the app" is too broad.
	"Update the result view to show summary provider metadata and add a focused test" is a good Codex
	task.

For Python pipeline work, we ask Codex to follow the existing project structure:

	- pipeline steps live under `src/pozify/steps/`
	- exercise logic lives under `src/pozify/exercises/`
	- shared contracts live in `src/pozify/contracts.py`
	- training and publishing workflows live in `scripts/` and `configs/`
	- behavior should be covered in `tests/`

	For UI work, we ask it to inspect both the Gradio entrypoint and static assets:

	- `app.py`
	- `web/index.html`
	- `web/app.js`
	- `web/report.js`
	- `web/styles.css`

	The strongest Codex implementation loop looks like this:

	1. Read the relevant files.
	2. Explain the smallest safe change.
	3. Make the edit.
	4. Add or update focused tests.
	5. Run the relevant checks.
	6. Summarize what changed and what remains uncertain.

	That loop is where Codex feels different from autocomplete. It is not only producing lines of code.
	It is helping maintain the whole change: code, tests, docs, and verification.

	## Plugins: Bringing The Right Tool Into The Same Flow

	One thing that made Codex more effective was using plugins for tasks that needed more than plain code
	editing. The value was not "more tools for the sake of tools." The value was staying in one
	development flow while Codex used the right capability at the right time.

	For Pozify, the most useful plugin pattern was UI verification. When we changed the app interface,
	Codex could edit the frontend code, start the local app, open it in a browser, inspect the result,
	and then come back to the code with a concrete fix. That is much better than only reading CSS and
	guessing whether the page looks right.

	Plugins also helped with artifact-heavy work. Pozify has reports, model-card style docs, demo notes,
	and training writeups. When the output is a document, presentation, spreadsheet, screenshot, or PDF,
	it is useful for Codex to work with the artifact directly instead of treating everything like raw
	text.

	The practical lesson was simple: use a plugin when the task has a real environment or artifact to
	inspect.

	- For UI work, use browser inspection instead of trusting code alone.
	- For docs and reports, use document-aware workflows when layout or structure matters.
	- For product design work, use design-oriented workflows before jumping into implementation.
	- For generated artifacts, ask Codex to render or verify the result when possible.

	That made Codex feel less like a detached assistant and more like a teammate sitting inside the same
	workspace.

	## Skills: Turning Good Habits Into Repeatable Playbooks

	Skills were useful for a different reason. A plugin gives Codex a capability. A skill gives Codex a
	way of working.

	In this project, we used skills as repeatable playbooks for work that needed a consistent standard.
	For example, documentation should not be a random dump of notes. It should have a clear audience,
	scope, and structure. UI work should not only "compile"; it should be checked for layout, responsive
	behavior, and product fit. Code review should start with bugs and regressions, not style opinions.

	Skills helped encode those expectations. Instead of re-explaining the standard every time, we could
	ask Codex to use the relevant skill and then let it follow that workflow:

	- documentation skills for clear project docs, reports, and handoff notes
	- frontend/design skills for UI changes that need visual quality and responsive behavior
	- code review behavior for focused review comments and missing-test analysis
	- product or research skills when we needed to compare options before implementation

	The important habit was to invoke the skill before the work starts. That makes Codex read the right
	instructions first, then inspect the project, then act. The result is more consistent than asking for
	a one-off answer each time.

	For a fast project like Pozify, that consistency mattered. We were moving between model training,
	UI, docs, deployment, and tests. Skills helped keep the quality bar stable while the task type kept
	changing.

	## Automation: Turning Repeated Work Into Scripts

	Pozify has repeated workflows: running fast tests, preparing data, training routers, training coach
	summaries, publishing artifacts, and keeping docs in sync. Codex helped us turn some of those
	manual steps into explicit scripts and checklists.

	Automation is a good Codex task because the desired behavior can be made concrete:

	```text
	Add a script that runs the fast Pozify validation checks before a PR. Reuse existing commands, avoid
	network-dependent steps, and document how to run it.
	```

	The important part is that automation should be boring. It should log clearly, fail clearly, and avoid
	surprising side effects. For this project, anything involving credentials, model uploads, dataset
	publishing, or GPU spend still needs human approval.

	Codex is good at automation because it can inspect how the project already runs. It can reuse
	`uv run pytest`, `uv run ruff check .`, Modal scripts, existing config files, and docs instead of
	inventing a parallel workflow.
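Here is a minimal sketch of the kind of pre-PR script described above. It assumes the fast checks are exactly the two commands already named, `uv run ruff check .` and `uv run pytest`. The file name (say, a hypothetical `scripts/check.py`) and any extra flags Pozify actually uses are not shown here. The script runs each command in order, prints what it is running, and stops at the first failure with that command's exit code.

```python
import subprocess
import sys
from typing import Sequence

# Assumed fast checks, reusing the project's existing commands.
FAST_CHECKS: list[list[str]] = [
    ["uv", "run", "ruff", "check", "."],
    ["uv", "run", "pytest"],
]


def run_checks(checks: Sequence[Sequence[str]]) -> int:
    """Run each command in order; return the first non-zero exit code, else 0."""
    for command in checks:
        print(f"$ {' '.join(command)}", flush=True)
        try:
            result = subprocess.run(list(command))
        except FileNotFoundError:
            # Fail clearly when the tool itself is missing, instead of a traceback.
            print(f"FAILED: {command[0]} not found on PATH", file=sys.stderr)
            return 127
        if result.returncode != 0:
            print(f"FAILED ({result.returncode}): {' '.join(command)}", file=sys.stderr)
            return result.returncode
    print("All fast checks passed.")
    return 0
```

As a script, the entry point would be `sys.exit(run_checks(FAST_CHECKS))`. Lint runs first because it is usually the quicker of the two, so a lint error fails before the test suite starts.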

	We also used Codex to decide what should not be automated. Some actions are too expensive or risky to
	run silently: uploading a model, publishing a dataset, spending GPU time, changing public demo
	behavior, or rewriting safety wording. For those, the better automation is a checklist or a command
	with an explicit approval step.
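"A command with an explicit approval step" can be as small as a guard that refuses to run unless a person types a confirmation phrase back. This is a hypothetical sketch. The function names and the confirmation phrase are placeholders, and the real upload or publish call is deliberately left out.

```python
from typing import Callable


def require_approval(action: str, ask: Callable[[str], str] = input) -> bool:
    """Ask a human to confirm a risky action by typing its description back.

    Returns True only on an exact match, so pressing Enter or typing "y"
    is not enough to trigger an upload, a publish, or a GPU job.
    """
    answer = ask(f"Type '{action}' to confirm, or anything else to cancel: ")
    return answer.strip() == action


def publish_model(repo_id: str, ask: Callable[[str], str] = input) -> None:
    """Hypothetical publishing entry point guarded by require_approval."""
    if not require_approval(f"publish {repo_id}", ask):
        raise SystemExit("Publishing cancelled: no explicit approval.")
    print(f"Approved: publishing {repo_id} (upload call omitted in this sketch).")
```

Injecting `ask` keeps the guard testable without a terminal. Requiring the full phrase makes the approval a deliberate act rather than a reflexive keypress.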

	That split made automation more useful:

	- automate local checks that are cheap and repeatable
	- script data and training setup when the inputs and outputs are clear
	- document manual approval points for publishing and public claims
	- use reminders or handoff notes for follow-up work that should not block a coding session

	The best automations were small. A good script saved a few minutes every time and made failure
	obvious. A good checklist prevented a risky release mistake. Codex helped build both.

	## How Plugins, Skills, And Automation Fit Together

	The most effective Codex workflow combined all three.

	Plugins gave Codex access to the working surface. Skills gave it the right operating style.
	Automation made the repeated parts cheap.

	For example, a UI change could flow like this:

	1. Use a frontend or product-design skill to frame the change.
	2. Ask Codex to inspect `app.py` and `web/`.
	3. Implement the smallest UI update.
	4. Use the browser plugin to open the local app and check the rendered result.
	5. Run focused tests or linting.
	6. Update docs or write a handoff note.

	A training workflow looked different:

	1. Use Codex to research or review the training goal.
	2. Inspect `scripts/`, `configs/`, and the relevant training docs.
	3. Update the script or config in a scoped way.
	4. Automate only the safe local checks.
	5. Keep model upload, dataset publishing, and GPU-heavy runs behind human approval.
	6. Record metrics and artifact paths in the docs.
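The last step is easy to skip under time pressure, so it helps to make the record mechanical. Here is a minimal, hypothetical helper that turns one training run's metrics and artifact path into a Markdown table row for a training report. The column layout and metric names are illustrative, not Pozify's actual report format.

```python
# Hypothetical header for a training-report table.
REPORT_HEADER = "| Run | Metrics | Artifact |\n| --- | --- | --- |"


def format_run_row(run_name: str, metrics: dict[str, float], artifact_path: str) -> str:
    """Render one training run as a Markdown table row, with metrics sorted by name."""
    metric_text = ", ".join(f"{key}={value:.3f}" for key, value in sorted(metrics.items()))
    return f"| {run_name} | {metric_text} | `{artifact_path}` |"
```

Sorting metric names keeps rows stable across runs, so diffs of the report show real changes instead of reordering.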

	That is where Codex became especially effective. It was not one magic prompt. It was a repeatable
	system: choose the right playbook, use the right tool, automate the boring part, and keep human
	judgment on the decisions that matter.

	## What Makes Codex Good

	For this project, Codex was good for eight practical reasons.

	First, it works with the real repo. It can read the current files, not just guess from a description.
	That makes its suggestions more grounded.

	Second, it moves across layers. Pozify needs Python, web UI, ML scripts, configs, tests, and docs.
	Codex can connect those pieces in one task.

	Third, it is good at turning ambiguity into a plan. When an idea is vague, Codex can propose options,
	tradeoffs, affected files, and a smallest useful version.

	Fourth, it is good at review. It can look at a diff and check related files faster than a human can
	manually scan the whole repo.

	Fifth, it helps preserve momentum. Instead of stopping to remember the exact test command, docs
	location, or helper API, we can ask Codex to inspect and continue.

	Sixth, it improves documentation while the context is still fresh. After implementing a change, Codex
	can update the relevant docs and write a handoff note before details are forgotten.

	Seventh, plugins let it inspect real outputs. That is important for UI, documents, generated
	artifacts, and local app behavior.

	Eighth, skills and automation make the workflow repeatable. The team does not have to rebuild the
	same process from memory each time.

	## What We Still Keep Human-Owned

	The biggest lesson is that Codex works best when responsibility stays clear.

	Humans still own:

	- product direction
	- user safety language
	- fitness and health-related claims
	- dataset choices and licensing judgment
	- public model and dataset publishing
	- final code review and merge approval
	- whether a feature is actually useful to users

	Codex helps us move faster, but it does not decide what Pozify should be. That distinction matters a
	lot for a product that gives workout feedback. The app should be grounded in evidence, and the
	development process should be grounded too.

	## Our Current Codex Workflow

	The workflow we settled into is simple:

	1. Use Codex to inspect the repo and understand the current shape.
	2. Brainstorm or research with project constraints in view.
	3. Pick a small, human-approved direction.
	4. Ask Codex to implement the scoped change.
	5. Ask Codex to review the diff.
	6. Run tests, linting, local app checks, or browser checks.
	7. Update docs and write a handoff note.

	That workflow made Codex valuable throughout the build. It helped us think, research, collaborate,
	review, implement, and automate without turning the project into a black box.

	The short version: Codex is good because it compresses the distance between intent and verified
	change. For Pozify, that meant more time spent on product judgment and less time lost to mechanical
	work, context switching, and stale documentation.