Welz Presentation Notes

Presentation: Gary Welz | CopernicusAI / CUNY Graduate Center (PoI)
Date: February 27, 2026
Live deck: TDA_Seminar_Slides.html
Preprint: HTML

Presentation Script (29 Slides)

Slide 1: Title

Feedback Loops as Loops — Topological Data Analysis of Genetic Regulatory Circuits. Gary Welz, CopernicusAI / CUNY Graduate Center (PoI). February 27, 2026.

Slide 2: From papers to flowcharts

The first attempt at a beta-galactosidase flow chart was made in 1995 and appeared in an article in The X Advisor, an online magazine for Unix developers, entitled “Is the Genome Like a Computer Program?” The article contained excerpts from conversations with biologists on the bionet.genome.chromosome newsgroup. The article is archived at the Internet Archive; the newsgroup discussions are archived by Google. The 1995 chart was created from text alone—the same process that large language models (LLMs) use today. The source was Berg & Singer (1992, pp. 71-73). This illustrates that diagrams are only as detailed and reliable as their source material; using different sources for the same process can yield different charts. In the original bionet thread, the genome was proposed as a flowchart with genes connected by logical “and” and “or.” Robert Robbins replied that flow charts require careful interpretation but that bringing computer-science insights to bear on the genome has potentially huge payoffs. G. Dellaire emphasized that genome structure, not just linear sequence, encodes how the code is read—context that is spatial or temporal. The original chart is shown in the slide image.

Slide 3: Same chart, 30 years later

The same Lac operon / beta-galactosidase idea is now generated with LLMs and Mermaid Markdown. The original chart was so time-consuming to produce that the approach lay dormant for decades. It is now possible to produce any of these flowcharts from a single prompt in seconds. The Lac Operon flowchart can be viewed in the GLMP viewer via the link on the slide.

Slide 4: The Innovation: Text to Visual Data

Traditional topological data analysis (TDA) starts from numerical data. In this work, the starting point is text—paper descriptions—which is converted into visual flowcharts first. That shift is what makes the rest possible. The pipeline is: text (papers) to visual flowcharts to features to topology. Mermaid Markdown converts textual process descriptions into structured flowcharts. Flowcharts become visual data, and TDA reveals structure. Topology is extracted from descriptions, not from direct measurements. Novel aspects include: a text-to-visual-to-topology pipeline; five features (nodes, conditionals, OR gates, AND gates, loops); feedback loops corresponding literally to H1 loops in homology; and LLM-assisted curation at scale. The approach is conceptually similar to the Politics case study in Carlsson & Vejdemo-Johansson (2021, pp. 199-201) but exhibits these distinct characteristics.

Slide 5: The Question

The central question is whether the shape of these circuits—as captured by topology—aligns with what biologists already know: feedback loops, cascades, and regulatory motifs. Can regulatory structure (feedback, cascades) be detected from circuit topology? Feedback loops are literally loops; they should appear in H1. The work asks whether text-derived visual data can support that.

Slide 6: The GLMP Database

The Genome Logic Modeling Project (GLMP) provides 108 processes—each one a Mermaid flowchart with nodes, conditionals, OR/AND gates, and loops (back-edges). We extract five features per process: nodes, conditionals (aka edges), AND gates, OR gates, loops. The set includes 66 from E. coli, 38 from S. cerevisiae, and 4 from Bacillus subtilis. Examples include lac operon, SOS response, and two-component signaling. A link to the full database table allows any process to be opened for its flowchart. Code is available at github.com/garywelz/glmp.
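As a minimal sketch of how the five counts could be read off a Mermaid flowchart, the snippet below parses a toy chart and counts loops as back-edges found by a depth-first search. The chart, the regular expression, and the conventions (decision nodes written with `{}`, gates as nodes labeled AND or OR, edges written only as `-->`) are assumptions made for illustration; the actual GLMP extraction code may use different rules.

```python
import re
from collections import defaultdict

# Hypothetical toy flowchart for illustration; not taken from GLMP.
MERMAID = """
flowchart TD
    A[Lactose present] --> B{Repressor bound?}
    B -->|no| C[Transcribe lacZYA]
    B -->|yes| A
    C --> D((AND))
    E[cAMP-CRP bound] --> D
    D --> F[High expression]
    F --> A
"""

# node id, optional shape opener, optional label text
NODE = re.compile(r"^\s*(\w+)\s*(\{|\[|\(\()?\s*([^\]\}\)]*)")

def parse_node(text):
    """Return (node_id, opener, label) for one side of an edge."""
    m = NODE.match(text)
    return m.group(1), m.group(2), m.group(3).strip()

def count_back_edges(adj, order):
    """Count edges that point to a node still on the DFS stack."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in order}
    back = 0

    def visit(u):
        nonlocal back
        color[u] = GRAY
        for v in adj[u]:
            if color[v] == GRAY:      # v is an ancestor: a feedback edge
                back += 1
            elif color[v] == WHITE:
                visit(v)
        color[u] = BLACK

    for n in order:
        if color[n] == WHITE:
            visit(n)
    return back

def extract_features(mermaid_text):
    adj, order, shape = defaultdict(list), [], {}
    for line in mermaid_text.splitlines():
        if "-->" not in line:
            continue
        left, right = line.split("-->", 1)
        right = re.sub(r"^\s*\|[^|]*\|", "", right)  # drop an edge label like |no|
        ends = []
        for side in (left, right):
            node_id, opener, label = parse_node(side)
            if node_id not in order:
                order.append(node_id)
            if opener:                # shape is only known where it is declared
                shape[node_id] = (opener, label.upper())
            ends.append(node_id)
        adj[ends[0]].append(ends[1])
    return {
        "nodes": len(order),
        "conditionals": sum(1 for o, _ in shape.values() if o == "{"),
        "and_gates": sum(1 for _, l in shape.values() if l == "AND"),
        "or_gates": sum(1 for _, l in shape.values() if l == "OR"),
        "loops": count_back_edges(adj, order),
    }

features = extract_features(MERMAID)
print(features)
```

In general the number of back-edges can depend on the order in which the search visits nodes, so a production version would need a fixed traversal convention.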

Slide 7: GLMP: References in JSON and Feedback

Each process in GLMP is grounded in the literature: the JSON holds PubMed and DOI. The viewer accepts feedback so that flowcharts can be corrected or improved. Flowcharts are thus citable and correctable. In the viewer, Sources & Citations, Metadata, and the Improve-this-process form appear below each flowchart.

Slide 8: From Flowcharts to Features

The full graph structure is not used for TDA. Instead, each flowchart is summarized into five features: nodes, conditionals (aka edges), AND gates, OR gates, and loops (back-edges). Features are standardized to zero mean and unit variance. The matrix is 108 processes × 5 features. These capture circuit complexity and logic structure.
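The standardization step can be sketched in a few lines. The four rows below are made-up counts, not GLMP data, and the use of the population standard deviation (dividing by n) is an assumption, since the notes do not say which convention was used.

```python
from statistics import mean, pstdev

# Hypothetical counts for four processes; columns are
# nodes, conditionals, OR gates, AND gates, loops.
RAW = [
    [6, 1, 0, 1, 2],
    [12, 3, 1, 0, 1],
    [20, 5, 2, 2, 4],
    [9, 2, 0, 0, 0],
]

def standardize(rows):
    """Z-score each column: subtract its mean, divide by its std deviation."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # Population std (divide by n); a constant column would give 0, so use 1.
    stds = [pstdev(c) or 1.0 for c in columns]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

Z = standardize(RAW)
for row in Z:
    print([round(v, 2) for v in row])
```

After this step every feature contributes on the same scale, so a large raw node count does not dominate the distances computed later.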

Slide 9: TDA Pipeline

From the feature matrix, a distance is built between every pair of processes. A Vietoris-Rips filtration is run and Ripser is used to obtain persistence diagrams. Cocycles are extracted; they indicate which processes sit on which topological loop. Output includes persistence diagrams for H0, H1, and H2, plus the membership of each H1 loop.
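To show what the filtration computes in the simplest case, here is a standard-library sketch of H0 persistence only: it sorts all pairwise distances and records the scale at which each component merges into another. It is not Ripser and does not compute H1 or H2; the Euclidean metric and the four sample points are assumptions for illustration.

```python
from itertools import combinations
from math import dist

# Hypothetical 2-D points standing in for rows of the feature matrix.
POINTS = [(0, 0), (1, 0), (5, 0), (5, 2)]

def h0_persistence(points):
    """Return the death scales of H0 classes in a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies at the length of the
    first edge that merges it into another component (single linkage).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted(
        (dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:               # this edge joins two components
            parent[ri] = rj
            deaths.append(d)
    return deaths                  # n - 1 deaths; one component never dies

deaths = h0_persistence(POINTS)
print(deaths)
```

With n points there are always n − 1 finite deaths, which matches the description of H0 starting at one component per process and collapsing as the radius grows.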

Slide 10: What Are We Counting? H₀, H₁, H₂

H₀ counts connected components—are the pieces connected? In GLMP, H₀ starts at 108 and collapses as the Vietoris–Rips radius grows. H₁ counts loops—closed cycles with no filled face; in gene regulation, feedback loops. The 33 H₁ features are these unfilled cycles; biologically richest for GLMP. H₂ counts enclosed voids (hollow cavities). In cancer GRN work (Masoomy et al., 2021), H₂ in healthy cells = redundant regulatory structures. GLMP yields H₂ = 1.

Slide 11: Mathematical Note (1) — Betti Numbers: History & Geometry

Before looking at our results, a brief mathematical grounding; skip this if you'd prefer and come back to it. Betti numbers are named for Enrico Betti (1823–1892); the underlying theory was formalized by Poincaré in the 1890s. They count topological "holes" of each dimension: β₀ = connected components (pieces), β₁ = independent loops that don't bound any filled region (1-dimensional holes), β₂ = enclosed voids (2-dimensional holes). Euler's formula for connected planar graphs is χ = V − E + F = 2, where F includes the outer, unbounded face: the triangle has V=3, E=3, F=2, giving χ=2; the tetrahedron has V=4, E=6, F=4, also χ=2; the cube has V=8, E=12, F=6, also χ=2. This generalizes via Betti numbers to χ = β₀ − β₁ + β₂ − …, the Euler characteristic. The key point for this talk: feedback loops in biology should show up as β₁ features, that is, loops in H₁. That is exactly what we find.

Slide 12: Mathematical Note (2) — Faces, 2-Simplices, and H₁

This slide explains why some loops persist and others don't. In a planar graph, a face is a region bounded by edges, including the outer, unbounded region. In homology, the analogous objects are 2-simplices (filled triangles): when three processes are mutually close enough in feature space, the Vietoris–Rips complex inserts a solid triangle among them. When a cycle of edges exists but no set of 2-simplices fills it in, that loop is not a boundary, so it cannot be "explained away," and it persists as an H₁ feature until a larger radius finally fills it. Our 33 H₁ loops are exactly those cycles that stay unfilled over some range of radii. By analogy: a feedback circuit A→B→C→A stays open as a loop when there is no shortcut pathway that cuts across it and fills it in. The literal correspondence between feedback in biology and loops in homology is the conceptual core of this work.
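The fill-in mechanism can be checked on four points at the corners of a unit square. The sketch below builds the Vietoris–Rips complex at one fixed radius and computes β₀ and β₁ from the ranks of the boundary matrices over GF(2). The choice of GF(2) coefficients, the function names, and the sample points are assumptions for illustration; this is not the Ripser algorithm.

```python
from itertools import combinations
from math import dist

SQUARE = [(0, 0), (1, 0), (1, 1), (0, 1)]  # hypothetical sample points

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are Python-int bitmasks."""
    basis = {}                      # leading bit -> reduced row
    for r in rows:
        while r:
            lead = r.bit_length() - 1
            if lead not in basis:
                basis[lead] = r
                break
            r ^= basis[lead]        # eliminate the leading bit
    return len(basis)

def betti_0_1(points, eps):
    """Return (beta0, beta1) of the Vietoris-Rips complex at radius eps."""
    n = len(points)
    edges = [e for e in combinations(range(n), 2)
             if dist(points[e[0]], points[e[1]]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    # A triangle is included as soon as all three of its edges are present.
    tris = [t for t in combinations(range(n), 3)
            if all(p in edge_index for p in combinations(t, 2))]
    d1 = [(1 << i) | (1 << j) for i, j in edges]            # edge -> endpoints
    d2 = [sum(1 << edge_index[p] for p in combinations(t, 2))
          for t in tris]                                    # triangle -> edges
    r1, r2 = rank_gf2(d1), rank_gf2(d2)
    return n - r1, len(edges) - r1 - r2

print(betti_0_1(SQUARE, 1.0))   # sides present, diagonals absent
print(betti_0_1(SQUARE, 1.5))   # diagonals present, triangles fill the loop
```

At radius 1.0 the four sides form an unfilled cycle, so β₁ = 1; at radius 1.5 the diagonals appear, triangles are inserted, and β₁ drops to 0. Note that in a Vietoris–Rips complex three mutually connected points always receive a filled triangle, so the smallest loop that can persist has four points, as in this square.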

Slide 13: Persistence Diagram

The persistence diagram shows one component per process in H0 and 33 loops in H1. The question is whether those H1 loops align with known biology—feedback circuits, stress responses, and so on. H2 yields 1 void (expected—few points form persistent 2D cavities).

Slide 14: What Do the Loops Look Like? (1) PCA + Cocycle Edges

The persistence diagram tells us H1 has 33 loops but not where they sit in the data. To make homology visible, the 5D feature space is projected to 2D via PCA (principal component analysis, which finds the directions of maximum variance; the projection preserves pairwise distances only approximately, which is adequate for visualization), then the cocycle edges, the pairs of processes that form each cycle, are drawn. Each colored loop is one H1 cycle: red (#1), blue (#2), green (#3), purple (#4), orange (#5). Lac operon, two-component, and SOS are labeled. Interactive version (hover for process names): glmp_h1_loops_interactive.html

Slide 15: What Do the Loops Look Like? (2) Mapper Graph

The Mapper algorithm builds a simplicial complex: cluster nearby processes, then connect clusters that overlap. Each node is a cluster of similar processes (node size = process count); edges connect overlapping clusters. Cycles in this graph correspond to topological loops—so the loops in the Mapper graph visualize H1 structure in a different way, complementing the persistence diagram and the cocycle-in-PCA view. Current parameters: n_cubes=12, perc_overlap=0.65 → 18 nodes, 45 edges. Interactive version (click nodes to see processes, search by name): glmp_mapper_graph_interactive_v2.html
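The cluster-then-connect idea can be sketched on a toy dataset: sixteen points on a circle, with the x coordinate as the lens. The parameter names `n_cubes` and `perc_overlap` are borrowed from the notes, but the cover construction, the single-linkage clustering, and the `link_dist` threshold are simplifying assumptions and may differ from the library used for the actual analysis.

```python
from itertools import combinations
from math import cos, sin, pi, dist

# Hypothetical data: 16 points evenly spaced on the unit circle.
CIRCLE = [(cos(2 * pi * k / 16), sin(2 * pi * k / 16)) for k in range(16)]

def single_linkage(points, members, link_dist):
    """Group members into clusters; points within link_dist are linked."""
    unseen, clusters = set(members), []
    while unseen:
        stack, cluster = [unseen.pop()], set()
        while stack:
            i = stack.pop()
            cluster.add(i)
            near = {j for j in unseen
                    if dist(points[i], points[j]) <= link_dist}
            unseen -= near
            stack.extend(near)
        clusters.append(cluster)
    return clusters

def toy_mapper(points, lens, n_cubes, perc_overlap, link_dist):
    """Return (nodes, edges): clusters per cover interval, linked on overlap."""
    lo, hi = min(lens), max(lens)
    width = (hi - lo) / n_cubes
    pad = perc_overlap * width / 2          # widen each interval on both sides
    nodes = []
    for c in range(n_cubes):
        a, b = lo + c * width - pad, lo + (c + 1) * width + pad
        members = [i for i, v in enumerate(lens) if a <= v <= b]
        nodes.extend(single_linkage(points, members, link_dist))
    # Two clusters are connected when they share at least one point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges

nodes, edges = toy_mapper(CIRCLE, [p[0] for p in CIRCLE],
                          n_cubes=3, perc_overlap=0.4, link_dist=0.5)
print(len(nodes), "nodes,", len(edges), "edges")
```

For this toy input the graph has four nodes joined in a ring, so the single loop of the circle shows up as a single cycle in the Mapper graph. The result is sensitive to the cover and clustering parameters, which is why the notes report the parameter values alongside the node and edge counts.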

Slide 16: Top H1 Loop #1 (Persistence = 0.563)

The most persistent loop aggregates stress response, protein quality control, and DNA repair: SOS response, quorum sensing, biofilm formation, BER, BAM, ribosome assembly, RNA pol recycling, Type III secretion, ubiquitin-proteasome, UPR. E. coli and yeast; shared “stress + quality control + feedback” character.

Slide 17: Example: SOS Response (Loop #1)

The SOS response is E. coli’s emergency DNA repair system: damage activates RecA, which inactivates LexA repressor, inducing repair genes. Classic feedback — repair turns genes off. SOS sits in the top H1 loop alongside quorum sensing, biofilm, UPR, and protein quality control.

Slide 18: Top H1 Loop #2 (Persistence = 0.443)

Six processes: antibiotic efflux pumps, arginine biosynthesis, osmotic stress response, tryptophan biosynthesis, peroxisome biogenesis, vacuolar protein sorting. Metabolic regulation and organelle biogenesis—E. coli and yeast.

Slide 19: Top H1 Loop #3 (Persistence = 0.306)

Six processes: biofilm formation, DNA replication elongation, flagellar assembly, osmotic stress, sigma factor competition, peroxisome biogenesis. Gene regulation, replication, motility, stress—E. coli and yeast.

Slide 20: Top H1 Loop #4 (Persistence = 0.279)

Six processes: phosphate regulation, translation elongation, translation termination, tryptophan biosynthesis, osmotic stress response, sporulation initiation. Gene regulation, translation, stress, developmental—E. coli, yeast, Bacillus.

Slide 21: Top H1 Loop #5 (Persistence = 0.198)

Five processes: ara operon, maltose regulon, Pho regulon, nitrogen catabolite repression (NCR/TORC1), competence development. Nutrient and developmental regulation—ara and Pho are classic feedback circuits. E. coli, yeast, Bacillus.

Slide 22: Example: Ara Operon (Loop #5)

AraC acts as repressor or activator depending on arabinose; DNA looping and CRP-cAMP integration. Ara sits in Loop #5 with Pho regulon, maltose regulon, nitrogen catabolite repression, and competence—all nutrient-sensing or developmental decisions with shared regulatory logic.

Slide 23: Biological Coherence Check

With the new loop-based feature set, known feedback circuits cluster coherently: SOS, quorum sensing, biofilm in Loop #1; ara and Pho in Loop #5; trp biosynthesis in Loops #2 and #4. Topology recovers stress, nutrient-sensing, and feedback architecture.

Slide 24: Organism Patterns

All top five loops mix organisms. Loop #1, #2, #3: E. coli and yeast. Loop #4 and #5: E. coli, yeast, and Bacillus. Regulatory logic transcends organism boundaries.

Slide 25: Why These Features Work

Using loop (back-edge) features instead of NOT gates yields richer persistence values and clearer biological groupings. Stress circuits → Loop #1; nutrient-sensing (ara, Pho) → Loop #5; metabolic feedback (trp) → Loops #2 and #4. The new feature set is a distinct experiment; results are richer and more interpretable.

Slide 26: Limitations and Caveats

Sample size: 108 processes—enough to reveal structure; scaling to 200-500+ is a priority. Five features (nodes, conditionals, OR gates, AND gates, loops); ablation shows node_count carries the most weight. Graph-theoretic features (cycle rank, longest path, gate ratios) are planned. LLM-generated flowcharts require expert fact-checking. Open question: Does topology predict function or correlate with known biology? The coherence check supports the latter.

Slide 27: Next Steps

Directions include Mapper, ablation and null-model validation, and richer features. Longer-term goal: flowcharts and TDA as a Rosetta Stone linking topology to genetic “machine code” (sequence motifs for AND/OR). The hypothesis is falsifiable: it predicts that circuits in the same H1 loop share enriched motifs, and it fails if they do not. Null model permutation test: p = 0.022. Graph-theoretic features, persistent cohomology, and scaling to 200-500+ processes are planned. Code: github.com/garywelz/glmp/tree/main/tda-analysis.

Slide 28: References

Carlsson & Vejdemo-Johansson (2021). Bauer (2021). Berg & Singer (1992). Masoomy et al. (2021). Rivera-Cancel et al. (2014). Swingle et al. (2025). Tralie et al. (2018). Welz (1995).

Slide 29: Acknowledgments and Questions

Jordan Matuszewski; CUNY Graduate Center TDA seminar group; Kevin Gardner and colleagues (ASRC, CCNY). GLMP and TDA analysis: github.com/garywelz/glmp. Contact: Gary Welz | 917-593-2537.

Glossary of Terms

TDA Terms

Persistent Homology (H0, H1, H2): H₀ counts connected components—the number of disconnected pieces. In GLMP, H₀ begins at 108 (one per process) and collapses as the Vietoris–Rips radius grows. H₁ counts loops—closed cycles with no filled-in face. In a gene regulatory network, this corresponds to a feedback loop: gene A activates B, B activates C, C represses A; the loop persists because no 2-simplex (filled triangle) caps it off. The 33 H₁ features in GLMP are precisely these unfilled cycles. H₂ counts enclosed voids—hollow cavities, like the interior of a sphere. In cancer GRN work (Masoomy et al., 2021), H₂ features in healthy cells were interpreted as redundant regulatory structures. GLMP yields H₂ = 1. The intuitive ladder: H₀ asks “are the pieces connected?”; H₁ asks “are there feedback loops?”; H₂ asks “are there enclosed cavities?” For GLMP, H₁ is biologically richest: feedback loops are literally loops.

Persistence (birth, death): Birth = distance scale where a feature appears. Death = scale where it disappears. Persistence = death minus birth = significance. Loop #1: persistence 0.563; Loop #2: 0.443; Loop #3: 0.306.

Vietoris-Rips Complex: Points connect when within distance epsilon; epsilon increases gradually. At each scale, shapes (clusters, loops, voids) form. Loops persisting across scales are treated as real structure.

Betti numbers (β₀, β₁, β₂, …): The ranks of the homology groups; they count “holes” of each dimension. Named for Enrico Betti (1823–1892), formalized by Henri Poincaré in the 1890s. Geometrically: β₀ = connected components (pieces); β₁ = loops, closed paths that do not bound any filled region (1-D holes); β₂ = enclosed voids (2-D holes). A cycle is a closed path with no boundary; a cycle that is not the boundary of any filled region represents a hole. Betti numbers are topological invariants, stable under continuous deformation. In our work, the persistence diagrams contain 108 H₀ features (components), 33 H₁ features (loops), and 1 H₂ feature (void); these are counts over the whole filtration, so the Betti numbers at any single radius are at most these values.

Euler characteristic (χ): For connected planar graphs, χ = V − E + F = 2, where F includes the outer (unbounded) face. Examples: triangle (V=3, E=3, F=2) → χ=2; square (V=4, E=4, F=2) → χ=2; tetrahedron (4,6,4) → χ=2; cube (8,12,6) → χ=2. This generalizes to χ = β₀ − β₁ + β₂ − … via the Betti numbers. The Euler characteristic sits at the foundation of persistent homology.

Faces and 2-simplices: In a planar graph, a face is a region bounded by edges—including the outer region. In homology, faces correspond to 2-simplices (filled triangles): three vertices within the distance threshold form a triangle whose interior fills in the loop. When a loop is not bounded by any 2-simplex—no triangle fills it in—that loop persists as an H₁ feature. Our 33 H₁ loops are exactly those cycles that fail to be filled; they are the β₁ contribution to χ.

Cocycles: Mathematical representation of which points form a loop. Used to identify which processes (e.g., lac operon) form each H1 loop. For visualization: project the 5D feature space to 2D (PCA), then draw edges between each (process A, process B) pair in the cocycle; the resulting polygon is the H1 loop.

PCA + Cocycle view: Makes homology visible by showing where loops sit in a 2D projection. Each colored polygon corresponds to one H1 cycle.

Mapper graph: Nodes = clusters of similar processes; edges = overlapping clusters. Cycles in the Mapper graph correspond to H1 loops in homology. Current implementation (n_cubes=12, perc_overlap=0.65): 18 nodes, 45 edges. Interactive version: glmp_mapper_graph_interactive_v2.html.

Biological Terms

Lac Operon: Classic gene regulation in E. coli. Controls lactose digestion genes. Demonstrates negative feedback; textbook feedback circuit.

Two-Component Signaling (EnvZ-OmpR): Sensor (EnvZ) detects signal; response (OmpR) controls genes. Feedback: response affects sensor. Paradigm bacterial signaling with feedback.

SOS Response: E. coli emergency DNA repair. Feedback: damage turns genes on; repair turns them off. SOS appears in Loop #1 with other stress responses.

Operon: Group of genes controlled together. Often has feedback. Lac, trp, ara are examples.

Quorum Sensing: Bacteria coordinate by signaling. At quorum, behavior changes (biofilms, toxins). Positive feedback. Appears in Loop #1.

Featured Notes

Extended discussions for readers who want to go deeper. Each note is self-contained and can be read independently of the others.

📐 Note for Mathematicians: Euler's Formula, Betti Numbers, and the Shape of Regulatory Space

You've probably seen Euler's formula for polyhedra:

V − E + F = 2

Vertices minus edges plus faces equals 2. It holds for a cube (8 − 12 + 6 = 2), a tetrahedron (4 − 6 + 4 = 2), and a triangle treated as a flat polyhedron with an outer face (3 − 3 + 2 = 2). It looks like a curiosity until you understand what it is actually counting: a running tally of holes at different dimensions, with alternating signs. V counts 0-dimensional things (points). E counts 1-dimensional things (edges, which can close into loops). F counts 2-dimensional things (faces, which can cap loops or bound voids). The alternating + and − is the algebraic signature of homology.

Enrico Betti generalized this in the 1870s; Poincaré formalized it in the 1890s. Instead of just V, E, F, you get a sequence of numbers β₀, β₁, β₂, … — Betti numbers, one per dimension — and the Euler characteristic generalizes to:

χ = β₀ − β₁ + β₂ − β₃ + …

Each Betti number counts independent topological features of that dimension:

β₀ — connected components (pieces). For a single connected shape, β₀ = 1.
β₁ — independent loops not bounding any filled region; genuine 1-dimensional holes.
β₂ — enclosed voids; genuine 2-dimensional holes, like the interior of a hollow sphere.

A few examples to build intuition:

Solid triangle: β₀ = 1, β₁ = 0 (loop is filled), β₂ = 0. χ = 1.
Hollow triangle (three edges, no interior): β₀ = 1, β₁ = 1 (loop unfilled), β₂ = 0. χ = 0.
Hollow sphere: β₀ = 1, β₁ = 0, β₂ = 1. χ = 2.
Torus: β₀ = 1, β₁ = 2 (one loop around the tube, one through the hole), β₂ = 1. χ = 0.

Euler's original V − E + F = 2 is simply χ = β₀ − β₁ + β₂ = 2 for convex polyhedra, where β₁ = 0 (no through-holes) and β₀ = β₂ = 1.
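The arithmetic in the examples above can be verified directly. The sketch below computes χ both from cell counts and from Betti numbers; the function names are illustrative.

```python
def euler_from_counts(v, e, f):
    """Euler characteristic from vertex, edge, and face counts."""
    return v - e + f

def euler_from_betti(betti):
    """Alternating sum beta0 - beta1 + beta2 - ..."""
    return sum((-1) ** k * b for k, b in enumerate(betti))

POLYHEDRA = {"cube": (8, 12, 6), "tetrahedron": (4, 6, 4)}
SHAPES = {
    "solid triangle": [1, 0, 0],
    "hollow triangle": [1, 1, 0],
    "hollow sphere": [1, 0, 1],
    "torus": [1, 2, 1],
}

for name, (v, e, f) in POLYHEDRA.items():
    print(name, euler_from_counts(v, e, f))
for name, betti in SHAPES.items():
    print(name, euler_from_betti(betti))
```

Both polyhedra give 2, which agrees with the hollow sphere's Betti numbers, as expected for the surface of a convex polyhedron.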

Connection to GLMP. The 108 processes are points in a 5-dimensional feature space, and the Vietoris–Rips filtration turns them into a nested family of simplicial complexes, each with its own Betti numbers. Ripser tracks how those numbers change across the filtration:

β₀ = 108 at birth (one component per process), collapsing toward 1 as ε grows and processes connect.
β₁ = 33 persistent loops: 33 independent cycles, each left uncapped by 2-simplices between its birth and death radii.
β₂ = 1 enclosed void — consistent with expectations for 108 points in 5D; few such hollow structures form and persist.

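The alternating sum of these counts, 108 − 33 + 1 = 76, gives a rough fingerprint of the dataset. Strictly, it is not the Euler characteristic of any single complex in the filtration, because the 108 separate components exist only at the start of the filtration, before any loop has formed. The number would differ under a different feature set, a different organism distribution, or a different pipeline, so it characterizes the data's shape only as we have encoded it.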

Why β₁ = 33 is the productive dimension. β₀ describes clustering — informative but unsurprising; processes group by complexity. β₂ = 1 is expected given dimensionality and sample size. β₁ = 33 is where the structure lives: 33 independent ways the circuit data "goes around a hole" without filling it in. Each represents a family of regulatory circuits sharing a structural niche, arranged in a ring with a gap at the center that no single process bridges. Persistence filters signal from noise — the five loops with the highest death-minus-birth values are the ones that resist filling across the widest range of ε, and those are the loops with biological interpretations that hold up.

The deepest point. Euler's formula says that for any convex polyhedron, however complex, V − E + F = 2. The topology is invariant — it doesn't depend on how you draw the shape. Betti numbers extend this invariance to arbitrary shapes in arbitrary dimensions. They are not sensitive to embedding, orientation, or continuous deformation — only to fundamental topological structure. This is why TDA is a principled choice for data analysis: you are finding invariants, not clusters whose boundaries depend on a threshold, and not model fits that depend on distributional assumptions. The 33 loops are a property of the shape of the data. And that shape, it turns out, reflects the shape of regulatory logic in living cells.

🔬 Notes for Biologists

Five short discussions on the biological meaning and assumptions behind the TDA results.

2. What Do the Five Features Actually Capture?

Biologists sometimes worry that reducing a regulatory circuit to five numbers loses everything important. That worry is legitimate — and it is exactly why the five features were chosen carefully. Node count captures overall circuit complexity: how many molecular players are involved. Conditional count (edges) captures connectivity: how densely those players communicate. OR gates capture circuits that respond to any one of several signals — alternative pathway logic. AND gates capture circuits that require coincidence of multiple signals — conjunction logic, often associated with tighter control. Loops (back-edges in the Mermaid flowchart) directly count explicit feedback structure. Together these five numbers encode the logical architecture of a circuit — not its molecular identity, but its computational shape. Two circuits from different organisms with the same architecture will be neighbors in feature space. That is a hypothesis, and the coherence check tests it.

5. What the Null Model Result Means in Plain Language

The null model permutation test asks: if we randomly shuffled which process gets which label, scrambling the biological identity of every circuit while keeping the feature values, how often would we get a coherence score as high as 0.750? In 1,000 random shuffles, a score that high occurred only about 22 times (p = 0.022). In plain language: the coherent clustering of known feedback circuits in our H₁ loops is unlikely to be a coincidence, with roughly a 2% chance under random labeling. A skeptic might object that the feature set was chosen to favor feedback detection (loops/back-edges are one of the five features), and that objection is fair, which is why the feature ablation study matters. Dropping the loops feature reduces coherence sharply, but so does dropping node count and conditional count. The signal is distributed across features, not manufactured by a single one. The null model result and the ablation study together make the case that the topology is reflecting something real, not something we built in by construction.
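The shuffle-and-count logic of a permutation test can be sketched as follows. The coherence score used here (the average share of the most common label within each group) is a hypothetical stand-in, since these notes do not define the score behind 0.750; the groups and labels are toy data.

```python
import random
from collections import Counter

# Toy data: three groups of four processes, labels perfectly aligned.
GROUPS = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
LABELS = ["stress"] * 4 + ["nutrient"] * 4 + ["metabolic"] * 4

def coherence(groups, labels):
    """Hypothetical score: mean share of the majority label per group."""
    shares = []
    for g in groups:
        counts = Counter(labels[i] for i in g)
        shares.append(counts.most_common(1)[0][1] / len(g))
    return sum(shares) / len(shares)

def permutation_p_value(groups, labels, n_shuffles=1000, seed=0):
    """Fraction of label shuffles scoring at least as high as observed."""
    rng = random.Random(seed)           # fixed seed for reproducibility
    observed = coherence(groups, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)           # break the label-to-process link
        if coherence(groups, shuffled) >= observed:
            hits += 1
    return observed, hits / n_shuffles

observed, p_value = permutation_p_value(GROUPS, LABELS)
print(observed, p_value)
```

Reporting hits divided by the number of shuffles matches the "22 in 1,000" reading of p = 0.022; some practitioners instead report (hits + 1) / (shuffles + 1) so that the estimate is never exactly zero.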

Key Discoveries and Innovations

Methodological contribution: A pipeline was demonstrated: text to visual flowcharts (Mermaid) to features to topology. Topology is extracted from descriptions, not direct measurements.

Novel aspects: (1) Text-to-visual-to-topology pipeline; (2) Five features: nodes, conditionals, OR gates, AND gates, loops; (3) Feedback loops = H1 loops (literal correspondence); (4) LLM-assisted curation at scale.

Main finding: With the loop-based feature set, known feedback circuits cluster coherently: SOS, quorum sensing, biofilm in Loop #1; ara and Pho