Don't Vibe Code. Engineer Code.
Writing syntax is a solved commodity.
Communicating intent and defining requirements is where the real engineering is in 2026. If you really break it down, YOU (as an engineer) are the architect dealing with strategy, and the AI agents are the implementers.
I've been extremely interested in agentic coding (also sometimes called "agentic harnessing") and building/optimizing my own workflow. This is my attempt to synthesize it all.
Choosing Your Harness
If you’re building something that is the infrastructure for other things, your tool choice dictates your leverage.
Poor tools yield poor leverage.
So, you need to choose the right "harness" for your agents.
The options currently look like this:
- Opencode: This is my personal favorite (and it has a nice free tier).
- GitHub Copilot: Built by GitHub/Microsoft, but honestly, it's pretty "ehh" when it comes to true agentic leverage.
- Claude Code: Anthropic's proprietary offering.
- Claw Code: An open-source fork of Claude Code.
- Codex: OpenAI's open-source option.
- Cline: Another strong contender in the space.
If you want to get started right away and give it a whirl, Opencode is my recommendation. You can grab it via curl -fsSL https://opencode.ai/install | bash on Mac, or npm install -g opencode-ai on Windows.
The Anatomy of an Agent
To understand how to get ROI from these systems, you have to understand what an agent actually is. In the workshop, I mapped out the core architecture: The Agent sits at the center. It is powered vertically by the LLM (the brain) and its State. Horizontally, it reaches out to Tools and MCP. Diagonally, it relies on its Lifecycle, Loops, Memory, and Subagents.
               LLM
                ^
                |
                |
      LIFECYCLE |   LOOPS
          ^     |     ^
           \    |    /
            \   |   /
TOOLS <---- [ AGENT ] ----> MCP
            /   |   \
           /    |    \
          v     |     v
        MEMORY  |  SUBAGENTS
                v
              STATE
If you leave out the systems-thinking required to manage these components, you're going to get systems that don't really work at scale.
Prompt Engineering 101: R-TECCS
So far, we've seen companies try to optimize for this through very clever prompt engineering. To standardize this, I teach the R-TECCS framework:
- Role/Persona
- Task
- Expected Outcome
- Constraints
- Context
- Scope
Each letter in R-TECCS addresses a specific failure mode in how people talk to AI. Think of it like a lens system; each component focuses the light differently, and without all of them, you get distortion.
R is for Role/Persona
The first thing an AI needs to know is who it should be in this conversation. Give it a title and you get generic output. Give it an identity and you get focused reasoning.
When you say "you are a senior backend engineer," you're invoking an entire worldview. You're saying: this person thinks in systems, cares about scalability, has strong opinions on database normalization, and gets uneasy when business logic leaks into route handlers.
A developer writing Python scripts has a fundamentally different mental model than a platform engineer designing CI/CD pipelines. A data scientist thinking in pandas DataFrames will approach the same problem completely differently than a systems programmer thinking in memory allocation.
This is why "be a helpful assistant" doesn't work. It's a role with no edges, no priorities, no history. It's like hiring someone and saying "do stuff." You get generic output because you gave generic identity.
When you write a role definition, think about the person who does this professionally. What keeps them up at night? What decisions do they make in the first five minutes of a new codebase? What do they roll their eyes at?
Write for that person.
The role is the foundation. Everything else sits on top of it.
T is for Task
The task is the verb. It's what you're asking the AI to actually do.
This sounds simple, and people still get it wrong. "Do something with the auth code" is not a task. "Refactor the authentication module to use JWT stateless sessions and extract the token validation logic into a separate utility file" is a task.
The difference is specificity. A task should answer: who does what, with what inputs, producing what outputs?
One common mistake is conflating the task with the expected outcome. "Write a function" is the task. "The function should validate emails according to RFC 5322 and return either a parsed email or a validation error" is the expected outcome. Don't mix them.
The task should be atomic. If you find yourself writing "and also..." you're probably trying to stuff two tasks into one prompt. Split them.
A well-defined task is a gift to the AI. It removes ambiguity, sets clear boundaries, and dramatically increases the likelihood that you get exactly what you asked for.
E is for Expected Outcome
If the role is the identity and the task is the verb, the expected outcome is the definition of done.
This is where most prompts fail. People say "fix the bug" but never define what "fixed" looks like. The AI cannot read your mind. If you don't tell it what success looks like, it will guess, and its guess might not match yours.
Compare "the code is working" with something like this: "the auth middleware validates the JWT signature, checks expiration, attaches user_id to req.user, and returns 401 with an error payload for any validation failure." The first leaves everything to chance. The second is a contract.
Be explicit about what the AI should produce. If it's a function, describe its signature. If it's a refactor, describe the before and after. If it's a test, describe what scenarios need coverage.
Think of the expected outcome as a contract. You're telling the AI: if you produce this, I will accept it. If you produce something else, I will reject it.
This is also where you encode your standards. "Write clean code" is meaningless. "Follow the existing naming conventions in this codebase, keep functions under 50 lines, and add JSDoc comments to all exported functions" is meaningful.
The expected outcome is your quality gate. Write it accordingly.
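To show how much the JWT outcome statement above actually pins down, here's a minimal TypeScript sketch of it. It is not tied to Express or to any JWT library: `Claims`, `Req`, `Res`, `TokenVerifier`, and `authMiddleware` are hypothetical names, and signature checking is delegated to an injected verifier so the example stays self-contained.

```typescript
// Hypothetical, framework-agnostic shapes; a real app would use its framework's types.
interface Claims {
  userId: string;
  exp: number; // expiry, in seconds since the epoch
}
interface Req {
  headers: { authorization?: string };
  user?: { user_id: string };
}
interface Res {
  status: number;
  body?: { error: string };
}

// Assumed to return the claims when the signature is valid, or null when it is not.
type TokenVerifier = (token: string) => Claims | null;

function authMiddleware(req: Req, verify: TokenVerifier, nowSeconds: number): Res {
  const header = req.headers.authorization || "";
  const token = header.indexOf("Bearer ") === 0 ? header.slice(7) : "";
  if (token === "") return { status: 401, body: { error: "missing token" } };

  const claims = verify(token); // 1. validate the signature
  if (claims === null) return { status: 401, body: { error: "invalid signature" } };

  // 2. check expiration
  if (claims.exp <= nowSeconds) return { status: 401, body: { error: "token expired" } };

  req.user = { user_id: claims.userId }; // 3. attach user_id to req.user
  return { status: 200 };
}
```

Every clause of the outcome maps to one branch, which is what makes it checkable. "The code is working" maps to nothing.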
C is for Constraints (first C)
Constraints are the "you cannot do X" guardrails. They're negative definitions of the solution space.
This component is underrated. People think they're being helpful by not limiting the AI. "Just do whatever you think is best." But unbounded freedom produces unpredictable results. The AI will make assumptions you didn't know it was making, and you'll spend hours untangling them.
Constraints tell the AI what to leave alone. What patterns should it not break? What existing APIs must it maintain compatibility with? What architectural decisions are off-limits?
CONSTRAINTS IN PRACTICE
───────────────────────
Bad: "Implement the new feature"
Good: "Implement the new feature
without modifying auth/ or api/v1/
and maintain backwards compatibility
with existing client SDKs"
─────────────────────────────────
Constraints answer:
• What should NOT change?
• What must remain untouched?
• What patterns must be preserved?
• What are the non-negotiables?
The best constraints are specific. "Don't touch the legacy code" is vague. "The refactor must not modify anything in lib/legacy/ or break the external API contract documented in api-spec.yaml" is specific.
Constraints are not about limiting creativity. They're about protecting the things that should not change while giving freedom in the areas that can evolve.
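A specific constraint has a nice side effect: you can check it mechanically. Here's a rough sketch that flags any changed file under a protected path. The function name and the sample paths are hypothetical, and treating "don't break the contract in api-spec.yaml" as "don't edit api-spec.yaml" is a simplifying assumption.

```typescript
// Paths the agent must not modify, taken from the example constraint above.
const PROTECTED_PREFIXES: string[] = ["lib/legacy/", "api-spec.yaml"];

// Returns every changed path that falls under a protected prefix.
function findConstraintViolations(changedPaths: string[], protectedPrefixes: string[]): string[] {
  return changedPaths.filter((path) =>
    protectedPrefixes.some((prefix) => path.indexOf(prefix) === 0)
  );
}

const violations = findConstraintViolations(
  ["src/auth/session.ts", "lib/legacy/db.ts"],
  PROTECTED_PREFIXES
);
// violations is ["lib/legacy/db.ts"]
```

A check like this could run against the agent's diff before you review it. "Don't touch the legacy code" gives you nothing to run.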
C is for Context (second C)
Context is the information the AI needs to do the task correctly. It's the difference between starting from scratch and starting from a foundation.
Think of context as the "given" in "given X, do Y." Without X, the AI is guessing. With X, the AI is reasoning.
There are different layers of context. The immediate context is the specific code you're working on. The architectural context is how this code fits into the larger system. The business context is why this system exists, who uses it, and what problem it solves.
Most people only give the immediate context. They say "fix the bug in auth.ts" without explaining that this auth system is used by three different frontend applications and a mobile client, each with different session handling expectations. Then they wonder why the fix breaks something in production.
Context is not about dumping everything you know. It's about giving the AI enough to make informed decisions. What does it need to know to avoid creating new problems while solving the current one?
A good test: if the AI had to guess this information, would its guess be wrong? Then include it.
S is for Scope
Scope defines the boundaries of the work. What's in, what's out, and how do you know when you're done?
This is where you prevent feature creep, both from yourself and from the AI. "While you're in there, could you also..." is the enemy of focused execution. Scope is the answer to "what does this task actually include?"
The best scope statements are specific: concrete deliverables, defined endpoints, explicit exclusions. Compare "refactor the backend" with "extract the user authentication logic from the Express routes into a dedicated auth middleware module in middleware/auth.ts, keeping the existing route interfaces intact." The difference is specificity. Being intentional with scope matters more than keeping it small. You can scope a large task just as easily as a small one. The point is that you define what "done" means by defining what the work includes.
Without scope, tasks expand. The AI sees an opportunity to improve something, or notices a related problem, and the problem it came to solve becomes a different solution solving a different problem. Scope prevents this.
R-TECCS in Practice
Here's how these pieces fit together in a real prompt:
ROLE: You are a senior platform engineer with deep
Kubernetes experience and strong opinions
on infrastructure as code.
TASK: Create a Helm chart for the checkout service
that follows our existing chart patterns.
OUTCOME: A deployable Helm chart in deploy/checkout/
with values.yaml, templates/, and Chart.yaml.
Must be compatible with our ArgoCD workflow
and pass 'helm lint' without warnings.
CONSTRAINTS: Do not modify existing infrastructure code.
Do not create new Terraform resources.
Must use our standard labels and annotations.
CONTEXT: The checkout service is a Node.js app with
3 replicas, environment-based config, and
needs external Redis and PostgreSQL connections.
Our existing charts use _helpers.tpl extensively.
SCOPE: Only the checkout service Helm chart.
No related services, no CI/CD changes.
Exclude local development documentation.
Notice how each piece does its job. The role sets the mental model. The task is specific and atomic. The outcome defines done. The constraints protect the surrounding system. The context enables informed decisions. The scope prevents expansion.
That's R-TECCS. That's how you talk to AI so it actually helps you.
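If you write prompts like this often, it may help to treat R-TECCS as a data structure instead of a habit. Below is a minimal sketch, assuming a hypothetical `RteccsPrompt` interface and `renderPrompt` helper (neither comes from any tool), that refuses to render a prompt with a missing component.

```typescript
// One field per R-TECCS component; the interface name is illustrative.
interface RteccsPrompt {
  role: string;
  task: string;
  outcome: string;
  constraints: string[];
  context: string;
  scope: string;
}

// Renders the six sections in R-TECCS order and rejects any empty component.
function renderPrompt(p: RteccsPrompt): string {
  const sections: Array<[string, string]> = [
    ["ROLE", p.role],
    ["TASK", p.task],
    ["OUTCOME", p.outcome],
    ["CONSTRAINTS", p.constraints.join(" ")],
    ["CONTEXT", p.context],
    ["SCOPE", p.scope],
  ];
  const missing = sections.filter((s) => s[1].trim() === "").map((s) => s[0]);
  if (missing.length > 0) {
    throw new Error("Missing R-TECCS sections: " + missing.join(", "));
  }
  return sections.map((s) => s[0] + ": " + s[1]).join("\n");
}
```

The thrown error is the useful part: it forces the "what does done look like?" conversation before the agent starts, not after.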
A massive cheat code here is to use AI to generate and improve your prompts. Draft a prompt, and then ask ChatGPT to critique and optimize it. (For example, I have a custom "Head of Product" prompt I use at Kynd, created with this exact technique).
Here's an example of a conversation between me and Perplexity to illustrate the value of what I mean.
... and a prompt I use to get AI to generate me exceptional prompts (with follow-up questions included to ensure it gets the right info it needs).
Agentic Basics: Compaction, Skills, Tools, and Commands
There are four fundamental building blocks to agentic development:
1. Compaction
The premise here is simple: you want the AI to reference the CliffsNotes instead of the novel. As a session progresses (Msg 1, Msg 2, Msg 3, Msg 4), the context window fills up. Compaction distills this session's history into a more compact format (like a summarized bio or concise rules) that takes up fewer tokens for the next model to use.
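Here's a rough sketch of the idea, not how any particular harness implements it. It assumes a crude 4-characters-per-token estimate in place of a real tokenizer, and `summarize` is a hypothetical stand-in for the LLM call that writes the summary (kept synchronous for brevity).

```typescript
interface Message {
  role: "user" | "assistant" | "system";
  content: string;
}

// Rough heuristic (an assumption, not a real tokenizer): about 4 characters per token.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((sum, m) => sum + m.content.length, 0);
  return Math.ceil(chars / 4);
}

// Once the history exceeds the budget, replace everything except the most recent
// messages with a single summary message.
function compact(
  history: Message[],
  tokenBudget: number,
  keepRecent: number,
  summarize: (older: Message[]) => string
): Message[] {
  if (estimateTokens(history) <= tokenBudget || history.length <= keepRecent) {
    return history;
  }
  const cut = history.length - keepRecent;
  const older = history.slice(0, cut);
  const recent = history.slice(cut);
  const summary: Message = { role: "system", content: summarize(older) };
  return [summary].concat(recent);
}
```

Keeping the last few messages verbatim is a design choice: the most recent turns usually carry the live task, so only the older history gets turned into notes.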
2. Skills
Think of skills as cookbooks. They contain the instructions (your SKILL.md files) and the ingredients (the scripts, sub-instructions, etc.). You can even pull from a skill library or marketplace (like skills.sh). For example, a deployment skill might have a SKILL.md that says "use scripts/vercel_deploy to..." and then references a deploy_to_vercel.sh script.
[ Agent ] <--> [ Skill Library ]
                       |
           +-----------+-----------+
           |           |           |
      [ Deploy ]   [ Test ]  [ Refactor ]
deploy_skill/
  SKILL.md
  scripts/deploy_to_vercel.sh
3. Tools
If skills are the cookbooks, tools are the "pots", "pans", and "stoves" that the cookbook relies on to actually make the food. This is your Tool Executor giving the Agent access to Read, Write, and the Terminal.
4. Slash Commands
A slash command is just an easy way to inject instructions. That's it. Commands like /review_pr, /research_codebase, or /create_handoff are all just custom prompts that get injected into the user input.
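To make "just custom prompts" concrete, here's a small sketch of a slash-command expander. The templates, the `$ARGUMENTS` placeholder convention, and the function name are assumptions for illustration; check your harness's docs for how it actually stores and expands commands.

```typescript
// Hypothetical command templates; real harnesses usually load these from files.
const COMMANDS: { [name: string]: string } = {
  review_pr: "Review the pull request described below for bugs and style issues.\n\n$ARGUMENTS",
  create_handoff: "Summarize the current session into a handoff note.\n\n$ARGUMENTS",
};

// Expands "/name args" into the stored prompt; anything else passes through unchanged.
function expandSlashCommand(input: string): string {
  const match = /^\/(\w+)\s*([\s\S]*)$/.exec(input);
  if (match === null) return input;
  const name = match[1];
  const args = match[2];
  if (!Object.prototype.hasOwnProperty.call(COMMANDS, name)) return input;
  return COMMANDS[name].split("$ARGUMENTS").join(args);
}
```

There's no magic in it: the model never sees `/review_pr`, only the expanded text.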
The Three Pillars of Agentic Development
With the basics out of the way, we can now get to the core pillars that undergird effective agentic development. There are three of them: Context Management, Human in the Loop, and Evaluation.
Pillar #1: Context Management
Your context window is short-term memory; the AI forgets what it can't hold. That's not a character flaw, it's a hard limit of the architecture. When you cram too much in, the model starts hallucinating to fill the gaps; it's not being difficult, it's trying to be helpful with incomplete information.
Think of the context window like a fuel tank with a gauge. You can fill it, but once it's full, something gets pushed out, and the AI doesn't get to choose what.
When the context is nearly full, earlier information gets lost to make room for newer content. That's when you get contradictions and the feeling that the AI forgot what you told it five minutes ago.
CONTEXT CAPACITY
─────────────────
EMPTY         SWEET SPOT      FULL        OVERFLOW
  |               |             |             |
  v               v             v             v
  0%            40-50%         80%          100%+
  |<--- good --->|              |<-- trouble -->|
target: ~36% for a 264k context window
The sweet spot is 40 to 50 percent. Below that and you're paying for capacity you're not using; above that and you're gambling with quality.
For reference, a 95,502-token automated PR review in a 264k-token context window sits at about 36 percent (95,502 / 264,000), just under the sweet spot. That's the kind of number to aim for.
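The arithmetic is simple enough to sketch. This assumes "264k" means 264,000 tokens; the function names and the idea of a configurable threshold are illustrative, not any harness's actual setting.

```typescript
// Share of the context window in use, rounded to a whole-number percentage.
function contextUsagePercent(usedTokens: number, windowTokens: number): number {
  if (windowTokens <= 0) throw new Error("windowTokens must be positive");
  return Math.round((usedTokens / windowTokens) * 100);
}

// True once usage reaches the chosen threshold (for example 40 or 50 percent).
function shouldCompact(usedTokens: number, windowTokens: number, thresholdPercent: number): boolean {
  return contextUsagePercent(usedTokens, windowTokens) >= thresholdPercent;
}

const prReviewUsage = contextUsagePercent(95502, 264000); // 36
const compactNow = shouldCompact(95502, 264000, 40); // false: still under the threshold
```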
I manage this through the "3 C's": Compaction, Composition, and moving Context to Files. Compaction we've covered; that's summarization and stripping the noise. Composition is where it gets interesting.
Instead of one agent doing everything, you build a hierarchy. A parent agent delegates to subagents, each handling a distinct task and reporting back with findings. The subagent might chew through 100k tokens reading files; your parent agent only sees the final summary. It's like a project manager who doesn't need to sit in on every meeting, just the stand-ups.
          ORCHESTRATOR
         sees ~2k tokens
                |
      +---------+---------+
      |                   |
   [CODER]           [SEARCHER]
 100k tokens         80k tokens
      |                   |
      v                   v
   summary             summary
    (~1k)               (~1k)
The key insight: the AI that runs your tests doesn't need to know about your git history. Different tasks, different context. You wouldn't load your entire codebase into a prompt about fixing a CSS bug.
Subagents do two things, and two things only:
- They do stuff
- They report back
That's it.
The "do stuff" part is where they go read files, write code, run commands, use up all their tokens doing the actual work. The "report back" part is where they distill everything they did into a summary the parent can actually use.
Why does this matter for context? Because the alternative is having one agent hold all the intermediate state from every tool call, every file read, every failed attempt. That's noise. The parent doesn't need to know you tried three different approaches before finding the one that worked; it just needs to know the result.
Report back is the filter and the compression step. The subagent spent 100k tokens to figure something out, and the parent sees 1k tokens that capture the answer. This compression is the intended behavior. You're using the subagent's thinking to compress value; storing a full transcript misses the point.
Think of it like a final exam. The student spends a semester learning, doing homework, reading textbooks. At the end, they sit down and write a 2-page summary that captures what they know. The 2 pages are the report. The semester of work is the "do stuff." The parent agent only ever sees the 2 pages.
subagent work:
100k tokens of reading, trying, failing, succeeding
subagent output:
"found the bug in auth/middleware.ts, line 47,
it's a missing null check, fixed it, tests pass"
parent context cost: ~50 tokens
That's the trade. The parent stays lean because the subagent does the compression.
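The "do stuff, report back" contract fits in a few lines. This is a sketch under stated assumptions: `Subagent`, `Delegation`, and `orchestrate` are hypothetical names, the subagents are plain synchronous functions standing in for real agent runs, and token counts are whatever the subagent reports.

```typescript
interface SubagentReport {
  summary: string; // the only thing the parent reads
  tokensUsed: number; // work that stays inside the subagent
}
type Subagent = (task: string) => SubagentReport;

interface Delegation {
  name: string;
  task: string;
  run: Subagent;
}

// Runs each delegation and keeps only the summaries in the parent's context.
function orchestrate(delegations: Delegation[]): { parentContext: string[]; subagentTokens: number } {
  const parentContext: string[] = [];
  let subagentTokens = 0;
  for (const d of delegations) {
    const report = d.run(d.task); // "do stuff"
    subagentTokens += report.tokensUsed;
    parentContext.push("[" + d.name + "] " + report.summary); // "report back"
  }
  return { parentContext, subagentTokens };
}
```

The return type is the point: the parent has no field for transcripts, so there's nowhere for the noise to go.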
Pillar #2: Human in the Loop
Here's where most vibe coding goes wrong: the human abdicates strategy. They say "build me something" and let the AI figure out everything, and then they wonder why the result is a mess of scope creep and unconnected pieces.
You are the Architect; the AI is the laborer. Architecture is hard. Implementation is easy. If you don't believe me, try describing a bridge to a builder versus actually building one. The description takes domain knowledge, tradeoff analysis, constraint reasoning. The building takes following instructions. Your job is the thinking; the AI's job is the typing.
HUMAN (YOU)           AI
     |                 |
  STRATEGY         EXECUTION
     |                 |
  "where?"          "how?"
     |                 |
YOU DO THIS      AI DOES THIS
When you hand over strategy, you're handing over the destination. When you hold strategy, you're holding the map.
Define Boundaries Early. Code is a liability. Every line is something that can break, something that needs maintaining, something that needs explaining to the next person who reads it. The tighter your scope, the less liability you create. When you prompt an agent, you're describing what to build and drawing a fence around what it can touch.
A tight scope looks like this: "Fix the login button styling on the settings page. Do not touch auth logic, do not touch other pages, do not add new components." A loose scope looks like: "Improve the settings page."
You can guess which one doesn't result in seventeen unexpected file changes.
The Ping Pong Technique. Take two different AI models and have them debate your plan. Model A proposes; Model B critiques. The output goes back to Model A for refinement. You do this three times. The goal is stress-testing your thinking before you commit code (you're not trying to reach consensus).
The prompt I use: "You are a senior developer reviewing a technical plan from a junior teammate. Aggressively identify flaws, gaps, and risks. Do not soften your feedback." Give that to your critic model and let it work.
MODEL A           MODEL B
propose   -->     critique
   ^                 |
   |                 v
refine             refine
   ^                 |
   |                 v
      (repeat 3x)
The result isn't perfect; nothing is. The result is robust. You've caught the obvious problems before they became runtime errors. The same caveat applies to evaluation (Pillar #3, below): it only matters if you actually run the tests. If you write tests and never execute them, you haven't built a mold for your code; you've drawn a picture of one.
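The Ping Pong loop can be sketched as a plain function. `Model` is a hypothetical stand-in for a call to whichever two LLMs you pick (kept synchronous for brevity), and the proposal and refinement prompt wording is mine; only the critic prompt comes from above.

```typescript
type Model = (prompt: string) => string;

const CRITIC_PROMPT =
  "You are a senior developer reviewing a technical plan from a junior teammate. " +
  "Aggressively identify flaws, gaps, and risks. Do not soften your feedback.";

// Model A proposes, Model B critiques, Model A refines; repeated for a fixed number of rounds.
function pingPong(goal: string, proposer: Model, critic: Model, rounds: number): string {
  let plan = proposer("Propose a technical plan for: " + goal);
  for (let i = 0; i < rounds; i++) {
    const critique = critic(CRITIC_PROMPT + "\n\nPLAN:\n" + plan);
    plan = proposer("Refine this plan.\n\nPLAN:\n" + plan + "\n\nCRITIQUE:\n" + critique);
  }
  return plan;
}
```

The round count is fixed on purpose. Stopping on "the critic has no more complaints" would be optimizing for consensus, which isn't the goal.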
The RPI Workflow: Research, Plan, Implement
The most effective human-in-the-loop system I've found is called RPI, and it comes from Dex Horthy's writeup on advanced context engineering. The core idea is that you split development into three distinct phases, each with a specific output and a human checkpoint before moving forward.
The premise is simple: a bad line of code is just a bad line of code. A bad line of a plan could lead to hundreds of bad lines of code. A bad line of research, a misunderstanding of how the codebase works or where certain functionality lives, could land you with thousands of bad lines of code. So you focus human effort and attention on the highest leverage parts of the pipeline.
The three phases work like this:
Research. You send an agent into the codebase to understand how things work. You want the full picture: the file you're looking at and the connections around it. How does data flow through the system? Where are the entry points? What are the existing patterns? The output is a research doc that captures what you learned.
Plan. You take the research and turn it into an implementation plan. Not "fix the bug" but "here's exactly what files we'll touch, in what order, and here's how we'll verify each step works." The plan is a contract between you and the agent about what success looks like.
Implement. You execute the plan, phase by phase. After each phase, you verify before moving to the next. This is where TDD kicks in; you write the test first, watch it fail, then write the code to make it pass.
This workflow designs your entire process around context management. You keep the context window in the 40 to 60 percent range (depending on problem complexity), and you build in high-leverage human review at exactly the right points. The human isn't reviewing every line of code; they're reviewing the research and the plan, which is where the real leverage is.
When you review a plan, you're checking the decision about what to build before construction starts. Reviewing code catches construction errors after the fact. If the plan is wrong, you catch it before a single line gets written. If the plan is right but the implementation is wrong, you catch it in the verification step before it gets merged.
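Here's one way to picture the checkpoints, as a minimal sketch and not the workflow's actual tooling. `Agent` and `Reviewer` are hypothetical function types: the agent produces an artifact for each phase, and the reviewer is you, approving or rejecting the research doc and the plan.

```typescript
type Phase = "research" | "plan" | "implement";
type Agent = (phase: Phase, input: string) => string;
type Reviewer = (phase: Phase, artifact: string) => boolean;

interface RpiResult {
  completed: Phase[];
  stoppedAt?: Phase;
  output?: string;
}

// Each phase feeds the next, and a human gate sits after research and after planning.
function runRpi(task: string, agent: Agent, humanApproves: Reviewer): RpiResult {
  const completed: Phase[] = [];

  const research = agent("research", task);
  if (!humanApproves("research", research)) return { completed, stoppedAt: "research" };
  completed.push("research");

  const plan = agent("plan", research);
  if (!humanApproves("plan", plan)) return { completed, stoppedAt: "plan" };
  completed.push("plan");

  const output = agent("implement", plan);
  completed.push("implement");
  return { completed, output };
}
```

Notice where the gates are: before the expensive phase, not after it. A rejected plan costs one document; a rejected implementation costs the whole run.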
Pillar #3: Evaluation
You would never pour concrete without the mold in place. You'd end up with a mess, and you'd have to break it all out and start over. Writing code without tests is the same thing; you're hoping the concrete lands correctly, and when it doesn't, you're doing twice the work.
TDD is not a suggestion. It is the mold for your concrete.
1. WRITE TEST     2. RUN (FAIL)     3. WRITE CODE     4. RUN (PASS)
   +-------+         +-------+         +-------+         +-------+
   | MOLD  |  -->    | CHECK |  -->    | POUR  |  -->    |  SET  |
   +-------+         +-------+         +-------+         +-------+
The sequence is simple. Write the test that defines what success looks like. Run it. Watch it fail because the code doesn't exist yet. Write the code. Run the test again. Watch it pass because your code now matches the mold you built. That's it. That's the whole process.
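Here are the four steps in miniature, without a test framework so it runs anywhere. The `slugify` example is hypothetical and picked only because it's small; in a real project the test would live in your test runner and step 2 would be a red run.

```typescript
// Step 1: the test (the mold). It takes the implementation as a parameter so the
// same test can run before and after the code exists.
type Slugify = (title: string) => string;

function slugifyTestPasses(slugify: Slugify): boolean {
  return (
    slugify("Hello World") === "hello-world" &&
    slugify("  Don't Vibe Code  ") === "dont-vibe-code"
  );
}

// Step 2: run against a stub. This fails, which proves the test can fail.
const stub: Slugify = () => "";
const failsFirst = slugifyTestPasses(stub) === false;

// Step 3: write just enough code to satisfy the test.
const slugify: Slugify = (title) =>
  title
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "") // drop punctuation
    .trim()
    .replace(/\s+/g, "-"); // collapse whitespace into hyphens

// Step 4: run again. The same test now passes.
const passesAfter = slugifyTestPasses(slugify);
```

Step 2 is the one agents like to skip. A test that has never failed hasn't shown it can catch anything.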
Now add the tools. Your agent should have access to LSP errors; Opencode provides this by default. The agent can read compilation errors, trace through the codebase from the frontend to the backend, and understand the connections. For browser interactions, give it Playwright. The agent can spin up a browser, navigate to your app, click things, check for console errors. It's like having a QA engineer that never gets tired and never forgets to test the happy path.
Playwright handles testing and verification. After the agent writes code, it can run the browser test and confirm the feature actually works in a real environment. Code that passes linting isn't necessarily code that works.
Review Agents: The Quality Gate You Actually Use
Tests only catch what you told them to catch. If you write a test for the happy path and nothing else, you'll never know about the edge case that crashes production. This is where review agents come in.
A review agent is a separate agent that looks at the code after it's written and before it gets merged. It checks for things the implementer might have missed: security vulnerabilities, performance problems, deviation from existing patterns, incomplete error handling. A second pair of eyes that never gets tired and never skips the boring checks.
The review agent operates on a different context than the implementer. It doesn't know what the code was supposed to do; it just knows what it does. That exposes the gap between intent and implementation, a gap the implementer can't see because they're too close to the code.
Combining TDD with review agents gives you a quality gate that actually works. The implementer writes tests for what they expect. The review agent checks for what they didn't expect. If the review agent finds something, it flags it. The implementer fixes it. The cycle repeats. Nothing gets merged until the review agent passes.
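That cycle can be sketched as a bounded loop. `Implementer` and `ReviewAgent` are hypothetical function types standing in for two separate agents, and the round cap is my assumption so the loop can't run forever.

```typescript
type Implementer = (task: string, findings: string[]) => string; // returns code
type ReviewAgent = (code: string) => string[]; // returns findings; empty means pass

interface GateResult {
  code: string;
  approved: boolean;
  reviewRounds: number;
}

// Nothing is approved until a review comes back with zero findings.
function reviewGate(task: string, implement: Implementer, review: ReviewAgent, maxRounds: number): GateResult {
  let code = implement(task, []);
  for (let round = 1; round <= maxRounds; round++) {
    const findings = review(code); // the reviewer sees the code, not the task
    if (findings.length === 0) return { code, approved: true, reviewRounds: round };
    code = implement(task, findings); // the implementer fixes what was flagged
  }
  return { code, approved: false, reviewRounds: maxRounds };
}
```

If the cap is hit, the result comes back unapproved. That's the signal for a human to step in, not for the code to merge anyway.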
The review agent catches the mechanical problems. The human catches the design problems. Together, they catch almost everything.
Bringing it Together: Pre-Built Workflows
There are infinite options for workflows, but the best advice is to choose one and stick with it for a month. Some popular ones are Research, Plan, Implement (RPI), micode (Opencode-native RPI), and Super RPI.
But the most fascinating one I've found is called "The Greek Gods" (from Oh My Openagent). This is "Ultrawork": Everything. All of it. Full horsepower. Epic. No human intervention. It operates on Agent Discovery Mode; it automatically hunts for context, infers your goal without a detailed prompt, and works through the codebase autonomously. It skips the formal planning phase and implements directly.
It does this through orchestration and parallelization of different subagents with specialties: the User sets the goal; Prometheus creates the strategy; Metis identifies gaps to refine the plan; Momus finalizes the PLAN.md; Atlas coordinates execution, delegating tasks; Junior does the actual work. Atlas then verifies the results against accumulated wisdom stored in .sisyphus/notepads/ before marking it DONE.
The Orchestration Alternative
Ultrawork is the "lazy mode" approach. Sometimes you want more control. That's where orchestration comes in.
The system uses three layers:
PLANNING            EXECUTION             WORKERS
(User +             (Atlas                (Specialized
 Prometheus)         the Conductor)        Subagents)
Goals               Reads plan            Junior (code)
Requirements        Delegates tasks       Oracle (architecture)
Boundaries          Verifies results      Explore (grep)
                    Reports back          Librarian (docs)
Prometheus does the strategic thinking. He interviews you, figures out what you actually want, and writes a plan. He's read-only with respect to your code; the only files he can write are in .sisyphus/plans/.
Metis catches what Prometheus misses. She looks for gaps in the plan, hidden assumptions, and the kind of ambiguity that derails implementations.
Momus is the gatekeeper. She checks the plan against four criteria: clarity, verification, context, and big picture. She won't let work start until the plan is solid.
Atlas is the foreman. He reads the approved plan, breaks it into tasks, delegates to the right specialists, and verifies everything before marking it done.
Junior is the worker. He writes code. He can't delegate to others (so no task sprawl), and he tracks his progress obsessively.
When to Use What
This is the decision tree:
Is it a quick fix or simple task?
├─ YES → Just prompt normally
└─ NO  → Is explaining context tedious?
         ├─ YES → ulw (let the agent figure it out)
         └─ NO  → Do you need precise, verifiable execution?
                  ├─ YES → @plan (Prometheus plans, Atlas executes)
                  └─ NO  → ulw
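The same decision tree, written as a function in case that's easier to read. The `TaskTraits` fields and `chooseMode` are hypothetical names for the three questions; the return values are the modes named in the tree.

```typescript
type Mode = "prompt" | "ulw" | "@plan";

interface TaskTraits {
  isQuickFix: boolean;
  contextIsTediousToExplain: boolean;
  needsVerifiableExecution: boolean;
}

// Mirrors the three questions in order; the first "yes" decides.
function chooseMode(t: TaskTraits): Mode {
  if (t.isQuickFix) return "prompt";
  if (t.contextIsTediousToExplain) return "ulw";
  return t.needsVerifiableExecution ? "@plan" : "ulw";
}
```

Written this way, the asymmetry is obvious: `@plan` is only reached when all three answers line up, so `ulw` is the default for anything non-trivial.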
Ultrawork (ulw) is for when you can't be bothered to explain everything. The agent explores, plans on the fly, and implements. It's fire and forget.
Orchestration (@plan → /start-work) is for when you need rigor. The plan gets reviewed, tasks get verified, and nothing ships until Momus says it's solid.
The Category System
When Atlas delegates, he doesn't pick a model. He picks a category.
Instead of this:
task({ agent: "gpt-5.4", prompt: "..." }); // Model knows its limitations
task({ agent: "claude-opus-4-7", prompt: "..." }); // Different self-perception
You get this:
task({ category: "ultrabrain", prompt: "..." }); // "Think strategically"
task({ category: "visual-engineering", prompt: "..." }); // "Design beautifully"
task({ category: "quick", prompt: "..." }); // "Just get it done"
The category describes the intent, not the implementation. Each category maps to a model at runtime based on what's available and configured.
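A rough sketch of what "maps to a model at runtime" could look like. This is my guess at the shape, not the tool's actual resolver: the preference lists are invented and the model names are deliberately generic placeholders.

```typescript
type Category = "ultrabrain" | "visual-engineering" | "quick";

// Hypothetical preference lists; the model names are placeholders, not real identifiers.
const CATEGORY_PREFERENCES: Record<Category, string[]> = {
  ultrabrain: ["large-reasoning-model", "general-model"],
  "visual-engineering": ["multimodal-model", "general-model"],
  quick: ["small-fast-model", "general-model"],
};

// Picks the first preferred model that is actually available at runtime.
function resolveModel(category: Category, availableModels: string[]): string {
  const preferences = CATEGORY_PREFERENCES[category];
  for (const model of preferences) {
    if (availableModels.indexOf(model) !== -1) return model;
  }
  throw new Error("No available model for category: " + category);
}
```

The payoff of the indirection is that when a provider is down or a new model ships, you change one table and no task call needs to know.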
Skills layer on top. We talked about skills earlier: they're bundles of specialized instructions that get prepended to the prompt. frontend-ui-ux for design work. playwright for browser automation. git-master for git operations. You can chain multiple skills onto one task.
task({
  category: "visual-engineering",
  load_skills: ["frontend-ui-ux", "playwright"],
  prompt: "...",
});
This gives you the full stack of expertise for that task without managing it manually.
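Here's a minimal sketch of what "prepended to the prompt" means in practice. The instruction text, the `## Skill:` heading format, and the `withSkills` helper are all hypothetical; real skills would be read from their SKILL.md files.

```typescript
// Hypothetical skill instructions keyed by skill name; real ones would live in SKILL.md files.
const SKILL_INSTRUCTIONS: { [name: string]: string } = {
  "frontend-ui-ux": "Follow the design system and check accessibility.",
  playwright: "Verify UI changes in a real browser before reporting success.",
};

// Prepends each requested skill's instructions to the task prompt, in order.
function withSkills(skillNames: string[], prompt: string): string {
  const blocks = skillNames.map((name) => {
    if (!Object.prototype.hasOwnProperty.call(SKILL_INSTRUCTIONS, name)) {
      throw new Error("Unknown skill: " + name);
    }
    return "## Skill: " + name + "\n" + SKILL_INSTRUCTIONS[name];
  });
  return blocks.concat([prompt]).join("\n\n");
}
```

Failing loudly on an unknown skill is deliberate. A silently skipped skill would look like the agent ignoring instructions it never received.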
The Wisdom System
Orchestration only works because of wisdom accumulation. After each task completes, Atlas extracts what was learned and passes it forward.
.sisyphus/notepads/{plan-name}/
├── learnings.md # Patterns, conventions, what worked
├── decisions.md # Architectural choices and why
├── issues.md # Problems encountered, how they were solved
├── verification.md # Test results, what passed
└── problems.md # Unresolved issues, technical debt
When Junior starts a task, he gets the accumulated wisdom from all previous tasks in that plan. This means the agent doesn't repeat mistakes. It doesn't try three approaches that already failed. It just picks up where it left off.
This is the difference between an agent that works and an agent that actually helps. Without the wisdom layer, you get a fresh start every time. With it, you get a team that gets smarter as work progresses.
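The accumulate-and-pass-forward loop is easy to sketch in memory. This is an illustration of the idea, not the real implementation: `Notepad` covers only three of the files listed above, `Worker` is a hypothetical stand-in for Junior, and nothing is written to disk.

```typescript
interface Notepad {
  learnings: string[];
  decisions: string[];
  issues: string[];
}

// A worker sees the wisdom accumulated so far and returns whatever it learned.
type Worker = (task: string, wisdom: Notepad) => Partial<Notepad>;

// Runs tasks in order, merging each task's notes before the next one starts.
function runPlan(tasks: string[], worker: Worker): Notepad {
  const notepad: Notepad = { learnings: [], decisions: [], issues: [] };
  for (const task of tasks) {
    const notes = worker(task, notepad);
    notepad.learnings = notepad.learnings.concat(notes.learnings || []);
    notepad.decisions = notepad.decisions.concat(notes.decisions || []);
    notepad.issues = notepad.issues.concat(notes.issues || []);
  }
  return notepad;
}
```

The ordering matters: notes are merged before the next task runs, so task two starts with everything task one learned.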
How to Actually Use This
The workflow in practice:
1. @plan "I want to refactor the auth system"
(Prometheus interviews you, writes a plan)
2. Answer questions about scope, constraints, what must not break
3. Momus reviews the plan; if she rejects it, Prometheus fixes it
4. /start-work
(Atlas kicks off; Junior starts executing tasks)
5. Each task completes, gets verified, wisdom accumulates
6. Done
(All tasks complete, verified, documented)
Or for the lazy path:
1. ulw add user authentication
(Agent explores codebase, figures out where it goes,
implements, verifies, done)
Both work. The first gives you control and verification. The second gives you speed and convenience. Pick based on what the task actually needs.
The Bottom Line: 5 High-Value Tips to Apply Now
To wrap up, if you take nothing else away, implement these 5 tips right now:
- Run messy prompts through an LLM before sending them to your code agent.
- Tell AI to use TDD (write tests so they fail, write code until they pass).
- (If you can) set your compaction threshold to 40-50% of your context window.
- Tell AI to research the codebase and write a plan before implementing in phases.
- Tell the LLM to follow SOLID principles (good code quality) and KISS (Keep it Simple, Stupid).
For numbers 2, 4, and 5, just use my AGENTS.md file, and you're set.
Let me know your thoughts; thanks for reading!
– Jeremy