Navv Studio’s Newsletter: AI Builder Stack

My Market Judgement Used to Decay in Translation. Now It Ends Up in the Product.

Dickson Lai — Tue, 21 Apr 2026 01:30:44 GMT

The signal: I’m the founder of Navv Studio, a venture builder operating across Singapore and Indonesia. My core competence is data-driven market research: reading public data, bank filings, government announcements, local coverage, and operator conversations, then forming a specific read of the market. Not just “what is happening” but “what matters, what doesn’t, and what the product implication is.” That judgement is what I’m supposed to bring to a venture. It is the thing that is hardest to copy and hardest to delegate. For most of the last decade, it was also the thing that decayed fastest in translation. gstack and oh-my-claudecode let me carry that judgement all the way into the product itself.

What caught my eye

The classic venture-builder problem is not generating insight. It is that the insight, and more importantly the judgement built on top of it, has to pass through other hands before it shows up in the product. I form a view. I write a memo. A product lead reads the memo and forms a related but slightly different view. They brief an engineer, who forms another related but slightly different view. The engineer makes hundreds of micro-decisions while building: which field to prioritize in a form, which label to use on a button, which edge case to surface first. Every one of those micro-decisions is a judgement call, and every one of them is being made by someone further from the data than I was when I formed the original view.

This isn’t a criticism of the team. It’s the nature of handoffs. Each translation loses something. By the time a thesis becomes a shipping feature, the judgement that originally drove it has been diluted four or five times.

gstack, from Garry Tan, is a set of twenty-plus slash commands, each one a structured prompt that packages a professional role into something I can run. /office-hours pressure-tests a product thesis. /design-shotgun generates UI variants for the same underlying concept. /review audits what I build. /qa drives a browser session against a prototype and flags what breaks. Each command lets me operate in a role I don’t formally occupy: product refiner, UI explorer, QA lead, code reviewer.

oh-my-claudecode, or omc, is Yeachan Heo’s parallel orchestration layer. Shared task lists, coordinated parallel agents. It is what lets me build three product directions simultaneously instead of having to pick one and defend it.

Together, these tools are not a way to do research faster. They are a way to stop my judgement from decaying between the moment I form it and the moment it shows up in the product.

How we’re actually using this

The clearest way to show this is to walk through AutoKon, our Indonesian construction-tech venture, starting from the insight and judgement I brought to it.

The research that grounded AutoKon surfaced a specific pattern inside the Indonesian property development value chain. Payments between property developers and their contractors don’t get stuck because capital is missing. They get stuck because nobody has a tamper-resistant record that the work actually got done. Progress sits in WhatsApp threads, photos live on individual phones, and disbursement decisions happen in meetings where nobody can prove what’s real. That much is insight.

My judgement on top of that insight is more specific. The wedge is not project management. It is not progress tracking for its own sake. The wedge is a financial verification pipeline: photo evidence from the field, captured at the moment of work, gating the release of payment. I believe, based on field conversations with Indonesian developers, that Pelaksana (the field workers doing the actual construction) will never reliably use a dedicated mobile app, so capture has to happen inside WhatsApp where they already live. I believe the web app should be read-only for Finance and Owners, because giving Finance teams edit rights on progress data is the surface where disputes get manufactured. I believe the approval chain has to be rigid, SM then PM, two steps in that order, because flexible chains get abused in Indonesian site operations. I believe the artifact that triggers money is a PDF report, not a dashboard view, because Finance releases payment against documents that match the shape of their existing disbursement packages. I believe progress has to be weighted by stage rather than binary, because disbursement is tranched against specific milestones. These are not things you get from a data pull. They are things you form by reading market data together with operator interviews and deciding what actually matters.
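To make the "weighted by stage rather than binary" point concrete, here is a minimal sketch in Python. The stage names, weights, and completion figures are hypothetical illustrations, not AutoKon data, and the function names are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical stage label
    weight: float      # share of contract value tied to this stage
    completion: float  # verified completion, from 0.0 to 1.0


def weighted_progress(stages: list[Stage]) -> float:
    """Progress as the weight-averaged completion across stages."""
    total_weight = sum(s.weight for s in stages)
    if total_weight <= 0:
        raise ValueError("stage weights must sum to a positive number")
    return sum(s.weight * s.completion for s in stages) / total_weight


def binary_progress(stages: list[Stage]) -> float:
    """Progress as the fraction of stages that are fully complete."""
    if not stages:
        raise ValueError("at least one stage is required")
    return sum(1 for s in stages if s.completion >= 1.0) / len(stages)


# Illustrative numbers only (assumption, not field data).
stages = [
    Stage("foundation", 0.30, 1.0),
    Stage("structure", 0.50, 0.5),
    Stage("finishing", 0.20, 0.0),
]

print(f"weighted: {weighted_progress(stages):.2f}")
print(f"binary:   {binary_progress(stages):.2f}")
```

With these made-up numbers the weighted view credits the half-finished structure stage, while the binary view only counts the one stage that is fully done. That difference is what matters when disbursement is tranched against milestones.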

Before. This judgement would go into a memo. The memo would brief a product lead and an engineer. The product lead would agree broadly but make their own calls on flow structure. The engineer would build against the spec but make their own calls on which screens to prioritize, which labels to use, which edge cases to handle first. Six months later we would have shipped something that looked like a generic construction-tech SaaS with a progress tracker, a project management module, and an approval tab tacked on. The 30 to 40 percent that got lost in translation would be exactly the judgement layer: that the product is specifically a financial verification pipeline and not a project management tool, that capture has to live inside WhatsApp, that Finance gets read-only access, that the PDF is the artifact that matters. The things that are hardest to spec and easiest to lose.

Now. The same judgement ends up in the build.

/office-hours first. Not to generate the judgement, which comes from the research, but to stress-test it against itself. I walk through the assumptions behind each judgement call: why photos and not IoT sensors, why WhatsApp and not a dedicated mobile app, why read-only for Finance, why a rigid two-step approval over configurable chains. The tool pushes back on the weakest links. What comes out is not new judgement. It is my original judgement with its soft spots identified.

/design-shotgun next. I describe the specific flows I think matter: the Pelaksana WhatsApp submission confirmation, the SM approval queue at Step 1, the PM approval and report generation at Step 2, the Finance download flow, and the Owner dashboard with S-curve and SPI grades from SANGAT_CEPAT down to KRITIS. I tell Claude exactly which fields matter, which state transitions I care about (the full seven-state model from belum_mulai through pending_sm, pending_pm, ditolak_pm, all the way to selesai), which print states have to work, and which conditions the product will operate under. The command generates three variants of the same shape, each of which reflects the specific parameters I set. These are not generic mockups. Every one of them has my judgement encoded into its specifics, right down to the Indonesian status labels and the approval queue semantics.
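A rigid approval chain of this kind can be sketched as a small state machine. This is an illustration under stated assumptions, not AutoKon's actual model: only the five states named above are included (the other two in the seven-state model are not listed here, so they are left out), and the transition table, including where a PM rejection leads, is my guess for the sake of the example.

```python
from enum import Enum


class Status(Enum):
    BELUM_MULAI = "belum_mulai"
    PENDING_SM = "pending_sm"
    PENDING_PM = "pending_pm"
    DITOLAK_PM = "ditolak_pm"
    SELESAI = "selesai"


# Assumed transitions: SM approval always precedes PM approval, and a PM
# rejection sends the item back to the SM queue on resubmission.
ALLOWED = {
    Status.BELUM_MULAI: {Status.PENDING_SM},
    Status.PENDING_SM: {Status.PENDING_PM},
    Status.PENDING_PM: {Status.SELESAI, Status.DITOLAK_PM},
    Status.DITOLAK_PM: {Status.PENDING_SM},
    Status.SELESAI: set(),
}


def advance(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the chain would be bypassed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


# Walk the happy path: SM step first, then PM step, then done.
status = Status.BELUM_MULAI
for step in (Status.PENDING_SM, Status.PENDING_PM, Status.SELESAI):
    status = advance(status, step)
print(status.value)
```

The point of encoding the chain as a fixed table is that "skip the SM" is not a configuration option. Any attempt to jump straight from belum_mulai to pending_pm raises an error instead of quietly succeeding.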

omc runs the parallel build. I kick off three parallel agents, each building out one of the three variants as a working artifact in the Next.js stack AutoKon actually runs on. I am in the loop on each one, making the micro-decisions as they come up: what the Pelaksana sees in their WhatsApp confirmation response, what the SM sees in the Step 1 queue, what the PDF report that Finance downloads actually looks like compared to the paper disbursement packages their teams already process. These are the micro-decisions that used to get made by whoever was downstream of my memo. They are now made by me, because I am the one building.

/qa walks each variant through a real browser session. It flags flows that break, states that don’t render, cases where my judgement and the actual implementation diverged without me noticing. /review audits each build for structural issues and surfaces the things that would fall apart under real load.

A week later, three testable artifacts exist, each one built with my specific market judgement intact. When we take them to a real Indonesian property developer, the artifact they react to is the artifact I had in my head when I read the research. Not a diluted version. Their feedback tests my thesis directly, not someone else’s interpretation of it.

The second example is positioning. Research surfaced a specific competitive gap. Most construction-tech vendors in Southeast Asia position on generic SaaS benefits: project efficiency, collaboration, digital transformation. My judgement is that AutoKon is not a construction-tech SaaS. It is a financial verification pipeline, and the buyer is the property developer’s CFO, not their head of construction. The copy should lead with disbursement risk and tamper-resistant evidence, proof points should be Indonesian bank and developer names rather than global enterprise logos, and the hero should show the PDF report that triggers payment, not a dashboard. Previously this would become a brief for a marketing team who would form their own related view and probably default back to the construction-tech SaaS frame. Now /design-shotgun produces three landing page variants, each one carrying my specific positioning judgement into the page copy, the hero framing, and the proof point selection. The discussion with the team happens against a real page in the repo, not an abstract positioning doc.

The third example is the PDF report itself, the artifact that actually triggers money. My judgement, grounded in operator conversations, is that the report has to mimic the shape of the paper disbursement packages Finance teams already process. Signature blocks in the positions they expect. A photo grid that maps to the work breakdown structure they audit against. The S-curve and earned value analytics (PV, EV, SPI, AC, CPI, BAC) presented as a single-page narrative they can read without scrolling, not a multi-tab dashboard export. A generic engineer briefed from a spec would have built a clean modern report that looks great and that a Finance team would have rejected, because unfamiliar formats get rejected. I prototyped the report shape myself, using the specific bank-package conventions I had picked up in the field, pushed it as a PR, and the engineers reviewed and merged it. The report that ships measures up against the artifacts it is actually replacing, not against a designer’s abstraction of what a report should look like.
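For readers who don't live in earned value analysis, the acronyms in that report reduce to a few ratios. The sketch below uses the standard textbook definitions (SPI = EV / PV, CPI = EV / AC, percent complete = EV / BAC). The function name and the sample figures are illustrative assumptions, not values from an AutoKon report.

```python
def earned_value_metrics(pv: float, ev: float, ac: float, bac: float) -> dict[str, float]:
    """Compute standard earned value ratios.

    pv:  planned value (budgeted cost of work scheduled to date)
    ev:  earned value (budgeted cost of work actually verified)
    ac:  actual cost incurred to date
    bac: budget at completion
    """
    if pv <= 0 or ac <= 0 or bac <= 0:
        raise ValueError("PV, AC and BAC must be positive")
    return {
        "SPI": ev / pv,                # below 1.0 means behind schedule
        "CPI": ev / ac,                # below 1.0 means over budget
        "percent_complete": ev / bac,  # share of total budget earned so far
    }


# Made-up figures in arbitrary currency units.
metrics = earned_value_metrics(pv=400.0, ev=300.0, ac=375.0, bac=1000.0)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

In this made-up case the project has earned 75 percent of the value it planned to have earned by now and is getting 0.80 of value per unit of cost, which is the kind of one-glance reading a single-page narrative has to support.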

What has not changed

The judgement still has to be grounded in the research. If the research is weak, everything downstream is confidently wrong. /office-hours will stress-test framing but it cannot tell me my underlying data is misread. It cannot know that I missed a restated number in a bank annual report or that I over-indexed on one noisy operator conversation. The quality of what comes out depends entirely on the quality of the judgement going in, and the judgement depends on the research.

Engineers still merge, and their judgement still matters. My prototypes are reviewable, not production-ready. My specs sometimes prescribe approaches that don’t fit how a specific service is wired, and the engineers on our ventures redirect me in review. The tools don’t replace engineering judgement. They let my market judgement coexist with engineering judgement at the level of the build, instead of being translated away before the build starts.

The bigger pattern

Judgement is the scarce asset in venture building. Insights can be replicated once someone else reads the same sources. Data can be rebought. Judgement, the specific read of what matters in this market at this moment for this user, is what I am supposed to bring. Historically, that judgement got diluted every time it passed through another hand between formation and shipment. I accepted that as a fixed cost of how products get built.

It turns out it was not a fixed cost. It was a translation cost, and translation costs can be removed when the person holding the judgement can operate in the medium of the build. gstack and omc let me operate in that medium. The things that used to be lost between memo and prototype, the specific judgement calls that made the difference between a generic product and a product that actually fits a market, now end up in the artifact itself.

That is the quiet thing I think a lot of tool commentary misses. The conversation about AI in product keeps orbiting “non-engineers can now code.” That is a shallow framing. The deeper shift is that the people with market judgement can now carry that judgement all the way into the product, intact. For a venture builder, that is the whole game.

Builder’s shortcut: The next time your research produces a specific judgement call, don’t write a memo. Open Claude Code with gstack installed. Describe the specific judgement to /design-shotgun with the parameters that matter to you, and have omc build one variant as a real artifact. Compare what gets built to what you would have written in the memo. The gap between the two is the judgement that historically got lost. That gap is the thing worth closing.

Tools mentioned:

gstack. Role-based slash commands for Claude Code, from Garry Tan. MIT.
oh-my-claudecode. Parallel multi-agent orchestration for Claude Code, from Yeachan Heo.

Five Tools, Zero Overlap: The AI Dev Pipeline We Actually Ship With

Dickson Lai — Tue, 14 Apr 2026 00:30:45 GMT

The signal: The AI developer tools space has gone from scarce to overwhelming in about six months. Superpowers just crossed 149k GitHub stars. Garry Tan open-sourced gstack and it shot to 70k. OpenAI acquired Astral to bet big on developer tooling. And yet most teams I talk to are drowning in options, not shipping faster. The problem isn’t finding good tools. It’s finding tools that don’t overlap, don’t conflict, and actually compose into something greater than the sum of their parts.

What caught my eye

The breakthrough wasn’t any single repo. It was realizing that the best AI dev tools each solve exactly one concern in the development lifecycle, and the real discipline is in how you combine them.

Here are the five tools and the five distinct concerns they address:

OpenSpec (39k stars) solves “what to build.” It’s a spec-driven development framework that enforces a strict three-phase state machine: proposal, apply, archive. Before any code gets written, you and your AI assistant agree on the spec. It works with 25+ coding assistants, so it’s not locked to a single tool. Think of it as the contract layer. No ambiguity about intent.

gstack (70k stars) solves “who reviews.” Garry Tan’s open-source Claude Code setup gives you 23 opinionated skills that act as CEO, Designer, Eng Manager, Release Manager, Doc Engineer, and QA. The /plan-ceo-review command runs layered review automatically. It’s organizational discipline encoded as tooling. When you’re a small team wearing every hat, this is how you get the rigor of a full org without the headcount.

Superpowers (149k stars) solves “how to build.” The brainstorm-plan-TDD-review methodology that turned coding agents from reckless autocomplete into disciplined engineers. It forces a strict workflow: brainstorm before coding, create isolated git worktrees, write failing tests first, then implement. RED-GREEN-REFACTOR, enforced by the framework, not by willpower.

oh-my-claudecode (28k stars) solves “how to execute.” Multi-agent orchestration for Claude Code. When you have a complex feature that needs parallel work streams, this handles intelligent agent delegation, persistent execution, and cost optimization through smart model routing. It’s the difference between running one agent at a time and running a coordinated team.

Optio (864 stars) solves “how to ship.” Workflow orchestration from ticket to merged PR. You submit a task, Optio provisions an isolated environment, runs the agent, opens a PR, monitors CI, triggers code review, auto-fixes failures, and merges when everything passes. The feedback loop is what matters: when CI fails, the agent gets the failure context and tries again. When a reviewer requests changes, the agent picks up the comments and pushes a fix.
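The feedback loop described above, stripped to its core, looks something like the sketch below. It is a generic illustration, not Optio's API: `ship_with_feedback`, `fake_agent`, and `fake_ci` are hypothetical names, and the agent and CI system are stand-in functions.

```python
from typing import Callable, Optional


def ship_with_feedback(
    run_agent: Callable[[Optional[str]], str],
    run_ci: Callable[[str], tuple[bool, str]],
    max_attempts: int = 3,
) -> tuple[int, str]:
    """Retry the agent with CI failure context until CI passes."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        patch = run_agent(feedback)    # agent sees the last failure log, if any
        passed, log = run_ci(patch)
        if passed:
            return attempt, patch
        feedback = log                 # feed the failure back into the next try
    raise RuntimeError(f"CI still failing after {max_attempts} attempts")


# Stand-ins for a real coding agent and a real CI system (assumptions).
def fake_agent(feedback: Optional[str]) -> str:
    return "patch-v1" if feedback is None else "patch-v2"


def fake_ci(patch: str) -> tuple[bool, str]:
    ok = patch == "patch-v2"
    return ok, "" if ok else "test_report failed"


result = ship_with_feedback(fake_agent, fake_ci)
print(result)
```

The cap on attempts is the design choice worth keeping in any real version: a loop that can self-correct also needs a point at which it stops and hands the problem to a person.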

The star counts tell an interesting story. Superpowers and gstack are mainstream. OpenSpec is gaining fast. oh-my-claudecode fills a real gap for teams doing parallel agent work. Optio is early but solves the last-mile problem that everyone ignores: actually getting code merged without babysitting.

How we’re actually using this

I want to be honest about where we are with this. We’re not running all five tools on every feature at Navv Studio. We’re experimenting with the combination, testing which handoffs work smoothly and which still have friction.

The tools we’ve been running day-to-day are gstack, oh-my-claudecode, and claude-mem (52k stars). gstack’s layered review process has become central to how we work. For a team our size, having automated CEO/design/eng review perspectives catches things I’d normally miss because I’m context-switching between five different priorities. It’s not a replacement for real code review. But it’s a forcing function that makes sure we at least ask the right questions before shipping.

oh-my-claudecode handles the orchestration layer. When we’re building AutoKon features that touch multiple services, spinning up parallel agent work streams instead of running one agent at a time makes a real difference. The smart model routing also keeps costs sane, which matters when you’re a venture studio, not a BigCo with unlimited API credits.

claude-mem solves a problem that doesn’t show up in the five-concern model but matters just as much: context persistence. Every time you start a new Claude Code session, you lose everything the agent learned in the last one. claude-mem captures what happened, compresses it, and injects the relevant context back when you start fresh. It’s the difference between working with a colleague who remembers last week’s decisions and one who shows up every morning with amnesia.

Superpowers has been the backbone of our actual coding workflow for longer. The TDD discipline it enforces is genuinely useful. When you let an AI agent just write code without tests, you get code that looks right but breaks in subtle ways. When you force it through RED-GREEN-REFACTOR, you get code whose behavior is pinned down by tests that failed before the implementation existed.

OpenSpec and Optio are the pieces we haven’t adopted yet but are watching closely. OpenSpec’s spec-driven approach is compelling, especially for features where the requirements are ambiguous. Optio’s ticket-to-merged-PR automation is promising, but we’re still validating it on smaller tasks before trusting it with anything critical.

The bigger pattern

Here’s what I think most people get wrong about AI dev tools: they evaluate them individually. “Is this tool good?” is the wrong question. The right question is: “Does this tool solve exactly one concern, and does it compose cleanly with the tools solving the other concerns?”

The five-concern model maps to real failure modes I’ve seen in every team that uses AI coding agents:

Built the wrong thing (no spec)
Shipped without review (no organizational discipline)
Code works in demo, breaks in production (no methodology)
Single-threaded execution bottleneck (no orchestration)
PR sits open for days, agent can’t self-correct (no shipping pipeline)

Each tool addresses exactly one failure mode. Zero overlap means zero conflict.

For venture builders and small teams in SEA, this matters even more. We can’t afford the luxury of dedicated QA teams, release engineers, and project managers. But we also can’t afford to ship broken code. The answer isn’t hiring. It’s encoding the discipline into your toolchain.

Builder’s shortcut: Start with just OpenSpec + Superpowers. The spec layer plus the TDD discipline gives you 80% of the value. Add the other tools as your workflow stabilizes and you feel the specific pain each one solves.

Tools mentioned:

OpenSpec - Spec-driven development framework (39k stars)
gstack - Garry Tan’s Claude Code review stack (70k stars)
Superpowers - Agentic skills framework and dev methodology (149k stars)
oh-my-claudecode - Multi-agent orchestration for Claude Code (28k stars)
claude-mem - Persistent memory across Claude Code sessions (52k stars)
Optio - Workflow orchestration from ticket to merged PR (864 stars)

The Three-Layer Knowledge Stack That Stops Your AI Agents From Forgetting Everything

Dickson Lai — Tue, 07 Apr 2026 01:30:52 GMT

The signal: The open-source AI memory space just had its breakout week. claude-mem surged past 45k GitHub stars, Onyx crossed 25k as the go-to organizational knowledge layer, and Microsoft’s VibeVoice quietly became the default open-source voice AI. Three tools, three timescales of memory, one stack that finally solves the amnesia problem every builder running AI agents has been hitting.

What caught my eye

If you run AI agents for anything beyond a single session, you know the Monday Morning problem. Your agent was brilliant on Friday. It knew your codebase, your preferences, your project context. Then Monday comes, and it’s a brand new intern who needs the full tour again.

This isn’t a minor annoyance. It’s the core bottleneck for anyone trying to build real workflows on top of AI agents. We tracked the problem to three distinct timescales, and the interesting thing is that the open-source ecosystem just shipped a tool for each one.

Session memory: claude-mem (45.6k stars). This is the one that caught fire. Built by thedotmack, it’s a Claude Code plugin that automatically captures everything your agent does during a session, compresses it using Claude’s agent-sdk, and injects relevant context back into future sessions. The key insight is progressive disclosure: instead of dumping your entire history into context, it retrieves only what’s relevant, getting roughly 10x token efficiency. It replaced claude-subconscious almost overnight, which tells you the market was desperate for a better solution to session persistence.
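The progressive disclosure idea can be illustrated with a toy retriever: score stored memories against the current query, then include only the best matches that fit a token budget. This is not how claude-mem is implemented. The word-overlap scoring, the whitespace token estimate, and every name below are simplifying assumptions.

```python
def tokenize(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real tokenization."""
    return set(text.lower().split())


def select_context(query: str, memories: list[str], token_budget: int) -> list[str]:
    """Pick the most relevant memories that fit inside the budget."""
    query_terms = tokenize(query)
    scored = []
    for memory in memories:
        overlap = len(query_terms & tokenize(memory))
        if overlap > 0:                      # drop memories with no relevance
            scored.append((overlap, memory))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    selected: list[str] = []
    used = 0
    for _, memory in scored:
        cost = len(memory.split())           # rough token estimate
        if used + cost <= token_budget:
            selected.append(memory)
            used += cost
    return selected


# Hypothetical session notes.
memories = [
    "fixed pdf report signature block layout",
    "renamed the owner dashboard route",
    "pdf report photo grid now follows the work breakdown structure",
]

print(select_context("pdf report layout", memories, token_budget=8))
print(select_context("pdf report layout", memories, token_budget=20))
```

With a tight budget only the single best match is injected. With a looser one the second relevant note comes along too, and the unrelated note never does.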

Organizational memory: Onyx (25.1k stars). If claude-mem handles what your agent did last Tuesday, Onyx handles what your company knows. It’s an MIT-licensed AI platform backed by Y Combinator and Khosla Ventures, with 50+ connectors to Slack, Google Drive, GitHub, Confluence, Salesforce, and more. The real power is hybrid indexing: it combines semantic search, keyword search, and knowledge graphs to surface exactly the organizational context your agents need. Self-hosted, so your data stays yours.
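One common way to combine results from different retrievers, such as a semantic search and a keyword search, is reciprocal rank fusion. The sketch below shows that technique in general terms. I am not claiming it is what Onyx does internally, and the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists into one.

    Each document scores 1 / (k + rank) per list it appears in, so items
    ranked well by more than one retriever rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from two retrievers.
semantic_hits = ["doc_b", "doc_a", "doc_c"]
keyword_hits = ["doc_a", "doc_d", "doc_b"]

fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
print(fused)
```

Documents that both retrievers agree on (doc_a and doc_b here) outrank documents only one of them found, which is the practical benefit of a hybrid index over either search method alone.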

Audio capture: VibeVoice (36.5k stars). Microsoft open-sourced this frontier voice AI, and the ASR model is genuinely impressive: it processes up to 60 minutes of audio in a single pass, with speaker identification, timestamps, and customizable hotwords across 50+ languages. This matters because a huge amount of organizational knowledge never gets typed. It lives in meetings, calls, and hallway conversations. VibeVoice captures that layer.

How we’re thinking about this

At Navv Studio, we run specialised agents across our ventures. The re-onboarding problem is real. Every Monday, I spend time getting Claude back up to speed on where we left off with AutoKon’s pipeline, which leads were warm, what the n8n workflows were doing. The productivity drain adds up.

We’re experimenting with claude-mem right now. Early days, but the idea is simple: if session context carries over automatically, that’s one less thing I have to reconstruct manually. The web viewer dashboard is a nice touch. You can see what your agent remembers, which builds trust in a way that black-box memory never does. We’re not over-complicating this. The goal is to find the simplest open-source tools that solve specific bottlenecks, not to build an elaborate infrastructure stack.

Onyx is on our radar for the same reason. We have context scattered across Slack, Google Docs, GitHub repos, and Supabase dashboards for each venture. Pulling the right context into an agent session is still manual. Whether Onyx is the right fit or whether a simpler connector setup works better, we’re still figuring out.

VibeVoice’s ASR is the one I’m most curious about for the Indonesia context. We run meetings in a mix of English, Bahasa Indonesia, and occasionally Javanese. The 50+ language support and speaker diarization could help capture meeting context that currently just evaporates. The TTS code was pulled by Microsoft over responsible-use concerns, but the ASR model is fully available, and that’s the piece that matters for knowledge capture.

The bigger pattern

What’s happening here is the emergence of a knowledge infrastructure layer beneath AI agents. Last year, the conversation was about making agents smarter. This year, it’s about making agents remember. That’s a more important problem.

For small teams and venture builders, this is especially relevant. We don’t have the luxury of dedicated AI infrastructure teams. We need tools that compose together simply and run on modest hardware. The fact that all three of these are open-source (claude-mem under AGPL, Onyx and VibeVoice under MIT) means you can deploy the entire stack without a single vendor dependency.

The broader signal from our GitHub AI Briefs this week confirms it. The agent ecosystem is splitting into two layers: agents that do things and agents that remember things. The memory layer is where the real platform plays are being made. If you’re building on AI agents and not thinking about knowledge persistence, you’re building on sand.

Builder’s shortcut: Start with claude-mem. Install it, run it for a week, and see how much re-onboarding time disappears. Then layer Onyx when you’re ready to connect your organizational knowledge. That’s the 80/20 path.

Tools mentioned:

claude-mem — Persistent memory compression for Claude Code (45.6k stars, AGPL-3.0)
Onyx — Open-source AI platform with 50+ connectors (25.1k stars, MIT)
VibeVoice — Open-source frontier voice AI by Microsoft (36.5k stars, MIT)

The Specialisation Wave: When AI Stops Trying to Do Everything and Starts Doing One Thing Brilliantly

Dickson Lai — Tue, 31 Mar 2026 01:30:48 GMT

The signal: Last week, four repos hit escape velocity on GitHub at the same time — and none of them are trying to be “the one AI tool to rule them all.” Each one does exactly one thing, and does it better than any general-purpose agent can. That’s the pattern worth paying attention to.

What caught my eye

The GitHub trending page has been telling the same story for about two weeks now: the generalist layer is commoditizing. What’s growing fastest are the tools that pick a lane and stay in it.

oh-my-claudecode racked up +598 stars in a single day — not because it’s another Claude wrapper, but because it turns Claude Code into a team orchestration layer. Parallel autopilot mode. Multiple agents running simultaneously on different parts of the same codebase. This isn’t “AI pair programming” anymore; it’s AI team management.

Dexter (+210 stars/day) automates financial research end-to-end: SEC filings, market data, company analysis, all with a self-validation loop that catches its own mistakes. It doesn’t try to write your code or manage your calendar. It reads 10-Ks so you don’t have to.

Chandra (+557 stars/day) is a 4-billion-parameter OCR model that handles tables, forms, handwriting, and 43 languages. Not a chatbot. Not a code assistant. Just the best document reader I’ve seen in open source.

And then there’s Flash-MoE — a 397-billion-parameter model running on a MacBook via mixture-of-experts sparsity. It scored 393 points on Hacker News in a day. The strategic signal here isn’t the benchmark scores; it’s that local inference at this scale is now possible on consumer hardware. That changes the economics of every AI-powered product on a 12-18 month horizon.
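The reason sparsity makes this possible is that a mixture-of-experts layer only activates a few experts per token, so most parameters sit idle on any given step. Below is a minimal sketch of generic top-k gating in plain Python. It illustrates the general idea only; I am not describing Flash-MoE's actual implementation, and the router logits are made up.

```python
import math


def top_k_gate(logits: list[float], k: int) -> dict[int, float]:
    """Select the k highest-scoring experts and softmax over just those.

    Returns a mapping from expert index to mixing weight. Experts outside
    the top k get no weight and would not be evaluated at all.
    """
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the number of experts")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in top)                  # for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}


# Hypothetical router scores for 8 experts on one token.
router_logits = [0.1, 2.0, -1.0, 1.0, 0.5, -0.3, 0.0, 0.2]
weights = top_k_gate(router_logits, k=2)
print(weights)
```

Here only two of the eight experts would run for this token. Scale that ratio up and it becomes clearer why a model with a very large total parameter count can have a much smaller per-token compute footprint.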

How we’re actually using this

At Navv Studio, the tool that grabbed us first was oh-my-claudecode. We’re implementing it right now for product builds — specifically testing parallel autopilot mode, where multiple agents work on different parts of a feature simultaneously. The promise is cutting a 2-hour feature build down to 30 minutes. We’re measuring wall-clock time against our usual sequential Claude Code workflow, and early results are genuinely encouraging. If it holds up over the next couple of weeks, this changes how a small team ships software.
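The measurement itself is simple enough to sketch. The harness below times the same set of tasks run one after another and then concurrently, and computes the speedup. The tasks are `time.sleep` stand-ins rather than real agent runs, and all function names are illustrative assumptions.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def build_part(seconds: float) -> float:
    """Stand-in for one agent's work; just waits."""
    time.sleep(seconds)
    return seconds


def timed(fn: Callable[[], None]) -> float:
    """Wall-clock seconds taken by fn()."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_sequential(durations: list[float]) -> None:
    for d in durations:
        build_part(d)


def run_parallel(durations: list[float]) -> None:
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        list(pool.map(build_part, durations))


def speedup(baseline_seconds: float, new_seconds: float) -> float:
    return baseline_seconds / new_seconds


durations = [0.05, 0.05, 0.05, 0.05]  # four equal chunks of simulated work
sequential = timed(lambda: run_sequential(durations))
parallel = timed(lambda: run_parallel(durations))

print(f"sequential: {sequential:.2f}s, parallel: {parallel:.2f}s")
print(f"target speedup for 2 hours down to 30 minutes: {speedup(120, 30):.1f}x")
```

A 2-hour build dropping to 30 minutes is a 4x speedup, which is roughly the ceiling for four perfectly independent work streams. Real features have dependencies between parts, so the measured number is the one to trust, not the theoretical one.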

The other tools on this list are on our radar but we haven’t tried them yet — and I think that’s worth being honest about. Dexter is the one I’m most curious to test for venture due diligence. We spend a lot of time pulling financial data manually when evaluating companies, and the self-validation loop (it flags its own inconsistencies) is exactly the kind of feature that matters when you’re making investment decisions. That’s next on the list.

Chandra is the one I’m watching for our construction tech work at AutoKon. Indonesia’s construction industry runs on paper — permits, inspection reports, material certifications, all in a mix of Bahasa Indonesia and English, often handwritten. A 4B-parameter OCR model that handles 43 languages and messy table layouts could be a game-changer for document processing in that world. We haven’t tested it yet, but if it handles bilingual government forms better than the general OCR tools we’ve tried (which have all broken on those layouts), it becomes a core part of the stack.

The bigger pattern

Here’s what I think is actually happening: we’ve passed the “can AI code?” phase and entered the “which AI for which job?” phase. The research modality stack is now effectively complete — social intelligence, academic papers, financial data, document processing, web crawling. Five tools covering every research angle a venture builder needs.

For small teams in Southeast Asia, this is a massive unlock. You don’t need a 15-person research department. You need five well-chosen specialised agents and someone who knows which questions to ask. The leverage ratio is getting absurd.

The Flash-MoE development deserves special attention if you’re building AI-powered products for markets where data sensitivity matters — and in Indonesia, government and enterprise clients care deeply about where their data goes. Local inference at 397B parameters means you can offer AI capabilities without the “but our data leaves the country” objection. That’s not a technical detail; that’s a sales unlock.

Builder’s shortcut: Pick one domain-specific agent from the list above and test it on a real task this week — not a demo dataset, a real one. The gap between “general AI can sort of do this” and “a specialised tool nails this” is wider than you think. Start with Dexter if you do any financial analysis, Chandra if you process documents.

Tools mentioned:

oh-my-claudecode — Multi-agent Claude Code orchestration with parallel autopilot
Dexter — Automated financial research with self-validation
Flash-MoE — 397B parameter model running locally via MoE sparsity
Chandra — SOTA OCR at 4B params, 43 languages, tables and handwriting