Your AI Writes the Code. Who Owns the Decisions?

By Paul O'Brien · Apr 27, 2026 · 17 min read

Your agents are making architectural decisions. Is anyone recording them?


Key Takeaways

  • The problem with AI coding tools isn't producing code. It's that nobody knows why the code looks the way it does six months later. Kiro is the most direct attempt yet to solve that.
  • Spec-driven development adds overhead. The question isn't whether it's slower. It is. The question is whether your team can afford the alternative.
  • The failure modes are real: shallow specs, false precision from EARS notation, ADRs that look complete but aren't. The workflow only works if the review gates are treated as real gates.

AI coding tools have gotten good at producing code. Really good. Multi-agent frameworks, agentic assistants, autonomous pipelines: if the problem is "turn this idea into working code," the options are capable. But if you're building software on a team, you've probably hit the wall that shows up about two weeks after the impressive demo: the code works, nobody knows why it looks the way it does, and when someone asks why a particular approach was taken, the honest answer is "I let the agent decide, and I'm not entirely sure."

I've seen this play out across multiple teams now. The issue isn't that the agent made bad choices. It's that nearly every AI coding tool on the market optimizes for task execution, and few invest in the artifacts that keep team-based software development coherent over time.

Amazon's Kiro, launched in preview in July 2025, is the most direct attempt I've seen to address that gap. This isn't a "Kiro vs. Cursor" comparison; for raw speed on individual tasks, Cursor, Claude Code, and the multi-agent frameworks are formidable. The real question is whether Kiro's spec-driven development model, combined with an Architecture Decision Record (ADR) workflow you can configure through its Steering system, gives teams something those tools don't: a shared, human-reviewable record of how the software was designed and why.

The Multi-Agent Landscape: Powerful, but Structurally Incomplete

The current generation of multi-agent frameworks (CrewAI, LangGraph, AutoGen, MetaGPT) is mature. Companies like DocuSign and PwC run multi-agent pipelines in production at scale. The case for Kiro only matters if you accept that the alternative is already good at producing code.

MetaGPT is worth examining directly — it's less commonly run in production than LangGraph or CrewAI, but it overlaps most with what Kiro claims to provide. Give it a one-line requirement and it outputs structured artifacts: PRDs, design docs, interface specifications, implementation code. On paper, that covers the same ground as Kiro's spec-driven workflow.

The structural difference isn't about output quality. It's about who sees the decisions and when.

In MetaGPT, agents produce artifacts and hand off to the next agent. The output is comprehensive but the process is opaque: decisions made by the Architect agent aren't surfaced as reviewable records. There's no approval gate where your team reads the design document and flags a constraint before implementation begins. The artifacts are outputs to be consumed, not collaborative documents to be reviewed.

The same gap exists across multi-agent frameworks generally. They're excellent execution engines. They weren't designed to produce a shared team understanding of why the software was built a certain way. That's a design choice, not a shortcoming. These frameworks optimize for task completion. Kiro optimizes for team legibility across time.

What Kiro Actually Does

Kiro is a VS Code fork powered by Claude Sonnet on Amazon Bedrock. The familiar environment matters: no new IDE to learn, no context switching, and existing Open VSX plugins, themes, and settings migrate across cleanly. Because it's a general-purpose IDE, Kiro works across languages and frameworks, though the spec-driven workflow is most naturally demonstrated in backend and full-stack development where architectural decisions carry the most weight.

The differentiation is in the workflow it enforces.

Spec-Driven Development: Artifacts Before Execution

Every feature in Kiro begins with a spec, a set of structured Markdown files committed to the repository before any code is written. The three-phase process is deliberate:

Phase 1: Requirements (requirements.md)

The first gate is the one most teams skip entirely: written requirements reviewed and approved before any code exists. You describe what you want in natural language. Kiro doesn't start coding. Instead, it produces a formal requirements document using EARS notation (Easy Approach to Requirements Syntax), a format that forces every requirement into an unambiguous, testable form:

```
WHEN [trigger condition] THE SYSTEM SHALL [observable behavior]
```

EARS notation surfaces assumptions that would otherwise stay implicit. If Kiro generates an acceptance criterion you didn't intend, you catch it here, not during code review, not in production. The requirements get reviewed and approved by a human before anything else happens.
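
To make that concrete, here's the shape a generated requirements.md might take for a hypothetical product-review feature. The feature and the requirement wording are illustrative, not actual Kiro output:

```markdown
# Requirements: Product Reviews

## R1: Review submission
WHEN a signed-in customer submits a review with a rating between 1 and 5
THE SYSTEM SHALL persist the review and include it in subsequent listings

## R2: Input validation
WHEN a review submission is missing a rating or exceeds 2,000 characters
THE SYSTEM SHALL reject it with a validation error naming the failing field

## R3: Unauthenticated submission
WHEN an unauthenticated user attempts to submit a review
THE SYSTEM SHALL return an authentication error and persist nothing
```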

Phase 2: Technical Design (design.md)

This is the phase where the real architectural arguments should happen — and where most AI tools have already started writing code. With approved requirements, Kiro analyzes the existing codebase and generates a design document: component hierarchies, data flow diagrams, database schemas, API contracts, TypeScript interfaces. The design reflects your actual project architecture.

This is the document your team reviews before a single line of production code gets written. An architect can push back on a schema. A senior engineer can flag a scaling concern. An engineer on a parallel feature can check whether the new design creates integration conflicts with their work. All of this happens against a written artifact in a pull request.
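
To make that reviewable artifact concrete, here's a compressed sketch of what a design.md might contain for the same review feature. The schema, endpoints, and component names are all invented for illustration:

```markdown
# Design: Product Reviews

## Data Model (PostgreSQL)
reviews(id uuid PK, product_id uuid FK, author_id uuid FK,
        rating smallint CHECK (rating BETWEEN 1 AND 5),
        body text, created_at timestamptz)
Indexes: (product_id, rating), (product_id, created_at DESC)

## API Contract
GET  /products/{id}/reviews?rating=&sort=   -> 200 ReviewPage | 404
POST /products/{id}/reviews                 -> 201 Review | 400 | 401

## Components
ReviewList -> ReviewFilterBar, ReviewCard, Pagination
```

This is the artifact the schema argument or the scaling objection should attach to, as a pull request comment on a specific line.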

Phase 3: Implementation Tasks (tasks.md)

Only at this point does implementation begin. Kiro generates a dependency-ordered task list mapped back to the requirements, with each task including relevant unit tests and integration tests, and where applicable, loading states and accessibility considerations. Tasks can be executed one at a time with explicit human approval at each step, or autonomously in autopilot mode. Code diffs and agent execution history are visible at each stage.
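
Continuing the illustrative review feature, a generated tasks.md might look like this; the task wording is mine, not Kiro output, but note how each task maps back to requirement IDs:

```markdown
# Tasks: Product Reviews

- [ ] 1. Create reviews table migration and indexes (R1, R2)
      Tests: migration applies cleanly; constraint rejects out-of-range ratings
- [ ] 2. Implement POST /products/{id}/reviews with validation (R1, R2, R3)
      Tests: happy path; missing rating; oversized body; unauthenticated request
- [ ] 3. Implement GET listing with filters, pagination, loading states (R1)
      Tests: filter combinations; empty results; pagination boundaries
```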

The spec files live in .kiro/specs/<feature-name>/ inside the project repository. They're committed, versioned, reviewable in pull requests, and navigable through git history. The whole team sees them the moment they exist.
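
For the hypothetical review feature used in the examples above, the layout looks something like this (the decisions/ directory comes from the steering setup described below):

```
.kiro/
├── specs/
│   └── product-reviews/
│       ├── requirements.md
│       ├── design.md
│       ├── tasks.md
│       └── decisions/
│           └── ADR-001-postgres-over-dynamodb.md
└── steering/
    └── decision-records.md
```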

That visibility changes how senior engineers spend their time. Instead of getting pulled into the archaeology of figuring out what was decided and why, they review a written design, challenge what needs challenging, and approve before work starts. Their judgment gets applied where it has the most leverage.

The Gap Multi-Agent Tools Don't Fill: The Decision Record

This is where the comparison with multi-agent frameworks gets pointed, and where the choice of tool has long-term consequences you won't see in a short demo.

Kiro's spec files document what the system must do and how it was designed to do it. That's already more than most AI-assisted workflows retain. But design.md will tell you that PostgreSQL was chosen and describe the schema. It won't tell you that DynamoDB was evaluated and rejected because your access patterns require multi-column filtering that DynamoDB's query model can't handle without expensive denormalization and duplicated write paths. It won't tell you that event sourcing was considered and ruled out because the team has no operational experience with event stores and the latency budget wouldn't accommodate eventual consistency.

That context — what was evaluated, why it was rejected, what trade-offs were accepted — is what an Architecture Decision Record (ADR) captures. The concept was introduced by Michael Nygard in 2011 and has since become a staple of sustainable software architecture. The ADR GitHub organization maintains a good overview of the format and its variants if you want to dig deeper.

ADRs are a well-established practice. The GOV.UK Digital Service published 39 of them during their AWS migration, covering decisions from hosting platform selection through to database consolidation and DNS architecture. Each one is a short Markdown file: Status, Context, Decision, Consequences. The collection is the decision log for the entire migration, version-controlled alongside the infrastructure it describes.

The reason most teams don't maintain ADRs is friction. Producing them competes with the pressure of shipping, the discipline erodes after a sprint or two, and the decisions that actually get captured are the ones made upfront at design time, not the equally important ones that emerge mid-implementation when a constraint turns out to be different than expected.

Kiro removes that friction through two mechanisms:

Automated ADR generation via Steering

Kiro's Steering system lets you commit persistent instructions in .kiro/steering/ that apply to every agent interaction across the team. A single steering file makes ADR generation automatic:

.kiro/steering/decision-records.md:

```markdown
When generating or updating design.md for any spec, also create or update
a decisions/ subdirectory within the spec folder. For each significant
architectural choice — technology selection, data model, integration
pattern, security mechanism — create an ADR file following the naming
convention ADR-NNN-short-title.md.

Each ADR must include: Status, Context, Decision, and Consequences
(positive, negative, risks). Record options considered but not chosen
and the reasons they were rejected.
```

Once committed, this instruction applies across the whole team. ADRs get produced without anyone needing to remember to produce them. For the PostgreSQL decision from earlier, the generated record might look like this:

```markdown
# ADR-001: Use PostgreSQL over DynamoDB for review storage

## Status
Accepted

## Context
The review system requires complex multi-column filtering (rating, date, product).
DynamoDB's query model requires denormalized duplicates per access pattern,
increasing storage and write complexity by an estimated 3-4x.
The team has existing PostgreSQL operational experience and monitoring tooling.
DynamoDB was the initial default assumption given existing AWS infrastructure.

## Decision
Use PostgreSQL with a normalized schema. Index the columns used in the four
most common filter combinations. Accept the horizontal scaling trade-off.

## Consequences
- Positive: Query logic is significantly simpler; existing tooling applies
- Negative: Horizontal scaling requires more coordination than DynamoDB
- Risk: Revisit if read throughput exceeds 20k RPS at peak; document threshold
```

Hook-based capture of mid-implementation decisions

The decisions made at design time are the easy ones to capture. The dangerous ones happen during implementation: a library that turns out to have a breaking edge case, a performance measurement that changes the data access pattern. These get resolved in a conversation or a quick code change, and they're never recorded anywhere.

A Kiro Hook on design.md changes handles this. When a modification is made to the design document (by an engineer or by Kiro during task execution), the Hook triggers an agent to assess whether the change represents a new architectural decision and, if so, drafts an ADR for the team to review. The spec and the decision record stay synchronized throughout implementation, not just at the point of initial design.
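
A sketch of what that hook could look like on disk. Kiro stores hooks as JSON files under .kiro/hooks/; the field names below follow the shape of the hook files Kiro's UI generates, but treat the exact schema, the file name, and the prompt wording as assumptions to verify against your version's documentation:

```json
{
  "enabled": true,
  "name": "Design change ADR check",
  "description": "Draft an ADR when design.md changes during implementation",
  "version": "1",
  "when": {
    "type": "fileEdited",
    "patterns": [".kiro/specs/*/design.md"]
  },
  "then": {
    "type": "askAgent",
    "prompt": "design.md was just modified. Assess whether the change represents a new or altered architectural decision. If it does, draft an ADR in this spec's decisions/ subdirectory using the ADR-NNN-short-title.md convention and note that it needs team review. If the change is cosmetic, do nothing."
  }
}
```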

ADRs in Multi-Agent Pipelines: A Complementary Approach

Automated ADRs aren't limited to Kiro; you can add an ADR Writer agent to an existing CrewAI or LangGraph pipeline. But whether the resulting ADR reaches the team in a reviewable form depends on how you've wired the pipeline output into your development workflow. Kiro bakes the ADR into a workflow where the team sees it, reviews it, and challenges it before code is written.
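
As a minimal sketch of that add-an-agent approach, assuming CrewAI's Agent/Task/Crew API and a stubbed upstream design stage; all role text, goals, and task wording here are invented for illustration:

```python
from crewai import Agent, Task, Crew

# Requires an LLM configured via environment (e.g. an API key for your provider).
# All roles, goals, and task wording below are illustrative stand-ins.

# Stub of an existing pipeline stage, so the sketch is self-contained.
architect = Agent(
    role="Architect",
    goal="Produce a technical design for the requested feature",
    backstory="Owns system design decisions for this pipeline.",
)
design_task = Task(
    description="Design storage and API for a product review feature.",
    expected_output="A design document covering data model and endpoints.",
    agent=architect,
)

# The added stage: turn upstream design decisions into reviewable ADRs.
adr_writer = Agent(
    role="ADR Writer",
    goal="Record every significant architectural decision as an ADR",
    backstory="Documents context, rejected alternatives, and trade-offs.",
)
adr_task = Task(
    description=(
        "For each significant decision in the design, write an ADR with "
        "Status, Context, Decision, and Consequences, including the "
        "alternatives considered and why each was rejected."
    ),
    expected_output="One Markdown ADR per decision, named ADR-NNN-short-title.md.",
    agent=adr_writer,
    context=[design_task],  # feed the design output into the ADR stage
)

crew = Crew(agents=[architect, adr_writer], tasks=[design_task, adr_task])
print(crew.kickoff())
```

The gap this section describes remains even with that agent in place: whether the resulting ADR files ever reach a pull request for review depends on whatever glue writes the crew's output to disk.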

For teams already running multi-agent pipelines, the two coexist naturally: Kiro handles design, specs, and decision capture; your existing pipeline handles bounded execution tasks within the constraints the spec established. The spec becomes the contract between the design phase (where the team has oversight) and the execution phase (where agents operate with more autonomy).

Where the Model Breaks Down

I'd be dishonest if I didn't name the failure modes, because they're not obvious until you're a few sprints in.

Spec quality is bounded by prompt quality

Kiro's requirements and design documents are only as good as the initial description you give it. A vague prompt produces a vague spec, but now it's a vague spec with the appearance of rigor, because it's formatted in EARS notation with acceptance criteria and everything. That's worse than no spec at all, because the team reviews it, sees the structure, and assumes the content is solid.

The mitigation is cultural, not technical: treat the spec review as a real design review, not a rubber stamp. If the requirements feel generic or the design doesn't address your specific constraints, push back before approving. The whole point of the approval gate is that it's a gate, not a formality.

EARS notation can produce false precision

EARS is good at eliminating ambiguity in individual requirements. What it doesn't do is ensure completeness: you can have twenty perfectly unambiguous requirements that collectively miss the actual hard problem. I've seen Kiro generate EARS requirements that are technically correct but operationally useless because they describe the happy path in detail and ignore the failure modes that will actually determine whether the system works in production.

The fix: when reviewing requirements.md, don't just check whether each requirement is well-formed. Check whether the set of requirements covers the failure cases, the edge cases, and the operational concerns (monitoring, alerting, rollback) that matter for your system. If they don't, add them before approving.
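
Concretely, the requirements a happy-path spec tends to miss look like these: hypothetical failure-path additions in the same EARS form, continuing the review-feature example:

```
WHEN the reviews database is unreachable
THE SYSTEM SHALL serve cached listings and queue review writes for retry

WHEN review write latency exceeds 2 seconds for 5 consecutive minutes
THE SYSTEM SHALL emit an alert to the on-call channel

WHEN a deployment is rolled back
THE SYSTEM SHALL retain reviews submitted while the rolled-back version was live
```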

Auto-generated ADRs can be shallow

A steering file that says "generate ADRs for significant architectural choices" will produce ADRs. Whether those ADRs contain useful reasoning depends on how much context Kiro has about your constraints. An ADR that says "we chose PostgreSQL because it supports relational queries" is technically correct and completely useless; any senior engineer already knows that. The value is in the rejected alternatives and the specific reasons they were rejected.

Tune your steering file to be explicit about what you want captured. Instead of "record significant choices," try "for each technology selection, document at least two alternatives that were evaluated, the specific technical or operational reason each was rejected, and the conditions under which the decision should be revisited." The more specific the instruction, the more useful the output.

Spec drift is silent by default

Kiro doesn't automatically detect when implementation diverges from the design. The spec is a snapshot of intent at design time. If an engineer (or an agent in autopilot mode) makes a change that contradicts the design (a different database engine, an additional API endpoint, a changed data model), the spec files don't update themselves.

This is solvable with Hooks, as described above, but it's not the default behavior. If you're adopting Kiro, setting up drift detection Hooks should be part of your initial configuration, not something you add after you discover the problem.

Governance stops at the design boundary

In December 2025, a Kiro agent deleted a live AWS production environment, causing a 13-hour outage of AWS Cost Explorer in a China region (The Register, Feb 2026). Amazon attributed the incident to the engineer's role having broader permissions than intended. Independent reporting described Kiro autonomously deciding to delete and recreate the environment rather than apply a targeted fix — a distinction Amazon contested.

Both accounts lead to the same lesson: a team can have reviewed specs, approved designs, and documented ADRs, and still suffer a catastrophic failure because nobody reviewed what happened between the approved design and the production deployment.

Kiro's governance model covers the design phase thoroughly. What it doesn't cover is the execution boundary: the moment an agent acts on real infrastructure with real permissions. Autopilot mode is useful, but it requires the same discipline as any CI/CD pipeline: deployment review, scoped permissions, and a human in the loop before changes reach production. The spec workflow makes it easy to assume governance is handled. It isn't, unless your team extends that discipline past the point where Kiro's workflow ends.

What Changes for the Team

Design happens in public, not in private agent sessions

In a multi-agent pipeline, the design decision is made inside the pipeline. By the time a team member sees it, it's either already implemented or committed to an output file that may or may not make it into a pull request. With Kiro, the design document is the first output committed to the repository, before implementation begins. The team sees and reviews the design as it's created, not after.

This changes code review directly. Reviewing code without a spec means inferring intent from implementation, which is slow and biased toward style comments over substance. Reviewing code against a spec means checking whether the implementation satisfies stated requirements and whether the choices align with documented reasoning. Reviews get faster and more substantive.

Parallel work surfaces conflicts early

Multi-agent pipelines typically operate on a single task context. On a team where multiple engineers are building in parallel, each using their own agent sessions or running their own pipelines, there's no mechanism to detect that two features are making conflicting architectural assumptions. Kiro's specs, committed before implementation, provide the shared reference point that makes those conflicts visible at the right time, before the conflicting code is written.

Knowledge survives personnel change

When an engineer leaves a team that's been vibe coding or running unstructured agent pipelines, their context leaves with them. I've seen this happen enough times to know how it plays out: three months later, someone's reverse-engineering a service nobody fully understands. The ADRs and specs in the Kiro model are attached to the repository, not to people. The reasoning behind every significant decision is there for whoever comes next, whether that's an engineer joining in three months or a team inheriting the system in three years.

The Honest Challenge

Here's the honest challenge to teams running multi-agent pipelines: your pipeline produces code. Do you know why it was designed the way it was? Does your team know? Will they know in six months? If the answer to any of those is uncertain, you're accumulating knowledge debt that will compound.

That debt has a predictable arc: velocity degrades, onboarding slows, and eventually someone proposes the rewrite — because modifying a system nobody understands feels riskier than starting over.

Kiro doesn't replace your multi-agent tooling for bounded task execution. It provides the structural layer (specs, ADRs, Hooks, team-visible artifacts committed before implementation) that turns AI-generated output into software your team can actually own.

The Overhead Question

The spec workflow adds planning time before implementation. For a short-lived script or a bounded automation task, that overhead isn't justified. Some teams find that Kiro's structured approach over-specifies simple problems, and the approval gates add friction that feels unnecessary for low-stakes changes.

Martin Fowler's team examined spec-driven development tools (Kiro among them) and concluded that the opinionated, single-workflow approach is likely "not suitable for the majority of real life coding problems." They have a point. Most day-to-day engineering work is modification of existing systems, bug fixes, and incremental changes, work where a full spec cycle creates more friction than value. The case for spec-driven development rests on the subset of work where architectural decisions are being made and the cost of getting them wrong is high. That subset is smaller than this article might suggest, but it's where the most consequential mistakes happen.

Kiro offers both spec-driven and vibe coding modes in the same IDE, and the two coexist well: vibe coding for exploration and throwaway work, specs for anything that will be maintained. The harder question is whether teams are applying that judgment accurately. In my experience, the temptation is to use the fast path for everything and tell yourself the feature is simpler than it is. Most knowledge debt in codebases is the accumulated residue of that decision, made many times over.

A practical heuristic: use spec mode when any of these are true — the feature touches a data model, introduces a new external integration, crosses a security boundary, or will be worked on by more than one engineer. Use vibe coding for everything else. The cost of a wrong architectural decision compounds across every sprint that follows; the cost of over-specifying a throwaway script is an hour of your time.

For teams introducing spec mode for the first time, a phased approach reduces resistance. Start with one new feature that meets at least two of the heuristic conditions above — enough complexity to make the spec valuable, not so much that the first experience is overwhelming. Run the spec review as a real design meeting: assign someone to challenge the requirements document before approval, not just read it. After the first feature ships, do a short retrospective on whether the spec caught anything that would otherwise have surfaced later. That one data point — "we caught the schema problem in design review instead of in production" — is more persuasive to a sceptical engineering team than any process argument.

Conclusion

The current generation of AI development tooling has largely solved the problem of producing code. The tools can turn natural language requirements into working implementations at a speed that would have seemed implausible three years ago.

What most of them haven't solved is the team problem: how does a group of engineers build software together, with AI assistance, in a way that produces shared understanding rather than shared output that nobody fully owns? How do you answer "why was it designed this way?" when the design happened inside an agent session that nobody else saw?

Kiro's answer is structural: produce the artifacts that make those questions answerable, before the code exists, as a standard part of the workflow. Spec files give the team visibility into design intent. ADRs, generated through a steering file you configure and kept current through Hooks, give the team visibility into decision reasoning. Both live in the repository and travel through pull requests like any other code artifact.

For teams already using multi-agent frameworks for execution tasks, Kiro isn't a replacement. It's the layer those tools don't provide: a reviewable, team-visible workflow that captures decisions and turns AI-generated code into software an engineering team can maintain.

The execution problem is largely solved. The team problem isn't. That's the problem worth working on — and it's the one the rest of the tooling industry hasn't bet on yet.