AI Engineering Infrastructure

Production infrastructure for agentic development

AI coding agents are powerful. But running them without session tracking, conformance enforcement, and observability is like deploying microservices without monitoring — it works until it doesn't, and you can't debug why.

We build the infrastructure layer that makes agentic development reliable at scale: session telemetry, machine-readable specifications, automated conformance checks, and real-time observability. Your agents, governed by your standards, visible in your dashboards.

The problem

Agents are easy to start. Hard to operate.

Getting an AI agent to write code takes minutes. Getting it to write code that consistently matches your architecture, follows your conventions, and doesn't introduce drift across a multi-repo ecosystem — that's an infrastructure problem. And most teams are solving it with willpower instead of systems.

Agents without infrastructure

AI coding agents can generate code. But without session tracking, conformance enforcement, and observability, you have no idea whether they’re producing code that matches your architecture — or slowly introducing drift across your codebase. Most teams discover the problems in production.

No visibility into agent behaviour

Your agents are running sessions, reading files, making decisions. But you can’t see what they did, why they did it, or whether it worked. Without structured session data and telemetry, debugging an agent’s output means reading diffs and guessing at intent.

Standards that exist on paper only

You have coding standards, architecture decisions, and review guidelines. But AI agents can’t read your team’s wiki. Without machine-readable specifications and automated conformance checking, every agent session starts from zero context about how your team builds software.

Scaling agents without losing control

Running one AI agent is manageable. Running a fleet of them across multiple repositories — with budget controls, health monitoring, and coordinated workflows — is an infrastructure problem. Most teams hit this wall when they try to move past the single-developer, single-repo stage.

What we build

The infrastructure layer between your agents and your codebase

We've built and operate this infrastructure ourselves — across 15 repositories with automated agent fleets. These aren't theoretical patterns. They're systems we run in production, refined through thousands of agent sessions.

Context architecture & conformance

Machine-readable standards that agents check before they write code

Most teams give AI agents a README and hope for the best. We build a two-layer system: structured context files that tell agents how your team builds software, and executable specifications that verify agents actually followed the rules.

Every agent session starts with a conformance check. Before writing a single line of code, the agent knows whether the repository is READY or has VIOLATIONS — and gets machine-readable remediation steps it can act on directly.

  • CLAUDE.md hierarchy: Root and directory-level context files encoding your architecture, patterns, constraints, and conventions — loaded automatically with every session
  • Executable specifications: Typed spec definitions with automated checks — compiler strictness, import hygiene, test structure, dependency pinning — not wiki pages, but code that runs
  • Session-start hooks: Pre-flight conformance checks that report READY/VIOLATIONS before the agent begins work, with structured remediation guidance the agent can follow
  • Cross-repo conformance matrix: A single view of spec compliance across your entire ecosystem — which repos conform, which are drifting, and what needs attention

What this looks like in your codebase:

CLAUDE.md

# CLAUDE.md — Acme Portal

## Project Overview
Next.js 14 App Router application with TypeScript, Prisma ORM,
and Tailwind CSS. Deployed on Vercel with PostgreSQL on Supabase.

## Architecture
- /src/app — App Router pages and API routes
- /src/components — React components (co-located tests)
- /src/lib — Shared utilities, database client, auth helpers
- /prisma — Database schema and migrations

## Coding Standards
- All components are functional with TypeScript props interfaces
- Use `cn()` helper from /src/lib/utils for conditional classNames
- API routes return NextResponse with consistent error format
- Database queries go through /src/lib/db.ts, never direct Prisma
- Tests use Vitest + React Testing Library, co-located as .test.tsx

## Key Patterns
- Auth: NextAuth.js v5 with JWT strategy, session in middleware
- State: React Query for server state, Zustand for client state
- Forms: React Hook Form + Zod validation
- Styling: Tailwind + shadcn/ui components (DO NOT use raw HTML
  elements when a shadcn component exists)

## Common Commands
- `pnpm dev` — Start dev server
- `pnpm test` — Run Vitest
- `pnpm db:push` — Push Prisma schema changes
- `pnpm db:seed` — Seed development database

## Important Context
- Multi-tenant: all queries MUST filter by organisationId
- Australian timezone handling: use date-fns-tz, 'Australia/Sydney'
- All user-facing text must support i18n (use t() from /src/lib/i18n)
- Never store PII in logs — use sanitiseLog() from /src/lib/logging

We build this for your codebase during the engagement.
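The specification layer is the executable counterpart to a context file like this. As a minimal sketch (the `Spec` shape, `checkRepo` function, and spec IDs below are illustrative, not our actual API), a typed spec with a session-start conformance check might look like:

```typescript
// Illustrative executable specs: each spec is code that runs against a
// repository snapshot and emits machine-readable remediation on failure.

interface RepoSnapshot {
  files: Record<string, string>; // path -> file contents
}

interface Spec {
  id: string;
  description: string;
  check: (repo: RepoSnapshot) => boolean;
  remediation: string; // guidance the agent can act on directly
}

const specs: Spec[] = [
  {
    id: "TS-STRICT-001",
    description: "tsconfig must enable strict mode",
    check: (r) => /"strict"\s*:\s*true/.test(r.files["tsconfig.json"] ?? ""),
    remediation: 'Set "strict": true in tsconfig.json compilerOptions',
  },
  {
    id: "DEP-PIN-002",
    description: "package.json must not use ^ or ~ version ranges",
    check: (r) => !/[\^~]\d/.test(r.files["package.json"] ?? ""),
    remediation: "Pin dependencies to exact versions in package.json",
  },
];

interface ConformanceReport {
  status: "READY" | "VIOLATIONS";
  violations: { id: string; remediation: string }[];
}

// Session-start hook: run every spec, report READY or VIOLATIONS.
function checkRepo(repo: RepoSnapshot): ConformanceReport {
  const violations = specs
    .filter((s) => !s.check(repo))
    .map((s) => ({ id: s.id, remediation: s.remediation }));
  return {
    status: violations.length === 0 ? "READY" : "VIOLATIONS",
    violations,
  };
}
```

Because the report is structured data rather than prose, the agent can iterate over `violations` and apply each remediation before touching feature work.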

Session tracking & fleet orchestration

See what your agents are doing, and manage them at scale

Running one AI agent is manageable. Running a fleet of them across multiple repositories — knowing which sessions succeeded, which failed, which are idle, and how much they cost — requires real infrastructure. We build the data pipeline and orchestration layer that makes this visible and controllable.

Session data pipeline

Every agent session is captured: files read, tools used, tokens consumed, outcomes. Ingested into PostgreSQL via a filesystem watcher with content-hash deduplication and incremental sync.
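Ingestion has to be idempotent, because filesystem watchers re-deliver events. A minimal sketch of the content-hash deduplication idea, using an in-memory store in place of PostgreSQL (names and shapes are illustrative):

```typescript
import { createHash } from "node:crypto";

interface SessionEvent {
  sessionId: string;
  payload: string; // raw event content as captured from disk
}

const seen = new Set<string>();
const store: SessionEvent[] = [];

// Hash the content itself, not the file path or timestamp, so the same
// event delivered twice always maps to the same key.
function contentHash(e: SessionEvent): string {
  return createHash("sha256")
    .update(e.sessionId)
    .update("\0")
    .update(e.payload)
    .digest("hex");
}

// Idempotent ingest: duplicates are dropped by hash, so incremental
// sync can safely replay any range of events.
function ingest(e: SessionEvent): boolean {
  const h = contentHash(e);
  if (seen.has(h)) return false; // already stored, skip
  seen.add(h);
  store.push(e);
  return true;
}
```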

Terminal state parsing

Real-time parsing of agent terminal output into structured state — idle, generating, waiting for approval, error — using a state machine tested against hundreds of live-captured fixtures.
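A minimal sketch of the classification idea, with an invented pattern set standing in for the real fixture-tested rules:

```typescript
// Illustrative classifier: map a raw terminal capture to a structured
// agent state. The patterns below are guesses at the kind of markers
// such a parser keys on, not the production rule set.

type AgentState = "idle" | "generating" | "waiting-approval" | "error";

const transitions: { pattern: RegExp; state: AgentState }[] = [
  { pattern: /error:|exception/i, state: "error" },
  { pattern: /approve|allow this tool/i, state: "waiting-approval" },
  { pattern: /generating|thinking/i, state: "generating" },
];

function classify(capture: string): AgentState {
  // First matching rule wins; errors outrank everything else.
  for (const t of transitions) {
    if (t.pattern.test(capture)) return t.state;
  }
  return "idle"; // no activity markers in the capture
}
```

In production the rules run over periodic tmux captures, and ambiguous frames are resolved against fixtures recorded from live sessions.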

Fleet lifecycle management

Start, stop, monitor, and recycle agent sessions across repositories. Idle detection with two-strike thresholds, context-percent recycling, and thundering-herd jitter for safe restarts.
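The two-strike and recycling rules can be sketched as a small decision function (thresholds and shapes here are illustrative):

```typescript
// Illustrative fleet lifecycle decisions: two-strike idle handling,
// context-percent recycling, and jittered restart delays.

interface Agent {
  id: string;
  state: "active" | "idle";
  strikes: number;
  contextPercent: number;
}

type Action = "none" | "strike" | "assign-work" | "recycle";

function decide(agent: Agent): Action {
  // Context recycling first: a nearly-full context window wins over
  // everything else (90% is an assumed threshold).
  if (agent.contextPercent >= 90) return "recycle";
  if (agent.state !== "idle") {
    agent.strikes = 0; // any activity clears prior strikes
    return "none";
  }
  agent.strikes += 1;
  // Two-strike rule: first idle observation records a strike,
  // the second assigns new work.
  return agent.strikes >= 2 ? "assign-work" : "strike";
}

// Thundering-herd jitter: spread restarts over a window so a fleet-wide
// recycle doesn't hit the provider at the same instant.
function restartDelayMs(baseMs: number, jitterMs: number): number {
  return baseMs + Math.floor(Math.random() * jitterMs);
}
```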

Budget and provider health

Track AI API costs against budgets, monitor provider rate limits, and surface health status per provider — so you know before your agents hit a wall.
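A sketch of the health logic, with invented thresholds and shapes:

```typescript
// Illustrative provider health: combine budget burn and rate-limit
// headroom into a single status per provider. The 80% and headroom
// thresholds are assumptions, not production values.

interface ProviderStatus {
  name: string;
  spentUsd: number;
  budgetUsd: number;
  rateLimitRemaining: number;
}

type Health = "ok" | "warn" | "blocked";

function providerHealth(p: ProviderStatus): Health {
  // Hard stop: budget exhausted or no rate-limit headroom at all.
  if (p.spentUsd >= p.budgetUsd || p.rateLimitRemaining === 0) {
    return "blocked";
  }
  // Early warning at 80% of budget or when headroom is nearly gone,
  // so you know before your agents hit a wall.
  if (p.spentUsd >= 0.8 * p.budgetUsd || p.rateLimitRemaining < 10) {
    return "warn";
  }
  return "ok";
}
```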

Query and debug interface

A CLI for querying session history, searching across sessions, and analysing agent behaviour — so debugging an agent’s output starts with data, not guesswork.

Periodic self-validation

The session pipeline validates itself: a full sync runs periodically, and any newly inserted records are treated as evidence that the live pipeline missed events. Missed events trigger alerts.
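The core of the check is a set difference between what the live pipeline recorded and what a full re-scan of the source files finds, sketched here with invented shapes:

```typescript
// Illustrative self-validation: compare record IDs from the live store
// against IDs recomputed by a periodic full sync. Anything the full
// sync finds that the live store lacks is a missed event.

function findMissedEvents(
  liveIds: Set<string>,
  fullSyncIds: Set<string>,
): string[] {
  const missed: string[] = [];
  for (const id of fullSyncIds) {
    if (!liveIds.has(id)) missed.push(id); // present on disk, absent live
  }
  return missed;
}

// Any missed event means the watcher dropped something: alert.
function shouldAlert(missed: string[]): boolean {
  return missed.length > 0;
}
```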

Built on real infrastructure we operate across our own 15-repository ecosystem — session-ledger alone processes thousands of agent sessions with full telemetry.

Observability & efficiency measurement

Dashboards, alerts, and analysis that tell you whether your agents are actually working

Session tracking tells you what happened. Observability tells you whether it's working. We build a full telemetry stack — dashboards, alerts, and agent efficiency analysis — so you can quantify whether your AI investment is paying off and catch problems before they compound.

  • Dashboard-as-code: Grafana dashboards defined in TypeScript and compiled to JSON — covering agent health, session quality, cost tracking, conformance drift, and infrastructure status
  • Agent efficiency scoring: Classify agent tool calls into exploration vs production phases, score navigation against your dependency graph, and measure how efficiently agents traverse your codebase
  • Conformance drift detection: Track spec compliance over time across your ecosystem — which repos are converging toward standards, which are drifting, and alert on regressions
  • Self-monitoring observability: the stack monitors itself — detecting stale queries, orphaned log streams, and alerts that have never fired, and computing signal-to-noise ratios across your telemetry
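A sketch of what dashboard-as-code means in practice: a typed panel definition compiled to a Grafana-style JSON model. The shape below is a simplification of Grafana's dashboard JSON, and the panel names and queries are invented for illustration:

```typescript
// Illustrative dashboard-as-code: typed definitions in, JSON model out.
// Typos in panel wiring become compile errors instead of broken panels.

interface Panel {
  title: string;
  query: string; // e.g. a PromQL or SQL expression
  unit?: string;
}

interface Dashboard {
  title: string;
  panels: Panel[];
}

function compileDashboard(d: Dashboard): string {
  return JSON.stringify(
    {
      title: d.title,
      panels: d.panels.map((p, i) => ({
        id: i + 1,
        title: p.title,
        type: "timeseries",
        targets: [{ expr: p.query }],
        fieldConfig: { defaults: { unit: p.unit ?? "short" } },
      })),
    },
    null,
    2,
  );
}

const agentHealth: Dashboard = {
  title: "Agent Health",
  panels: [
    { title: "Sessions by outcome", query: "sum by (outcome) (sessions_total)" },
    { title: "Cost per day", query: "sum(ai_api_cost_usd)", unit: "currencyUSD" },
  ],
};
```

The compiled JSON is committed and deployed like any other build artefact, so dashboard changes go through review.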

The result: you can answer questions like “how much did our agents cost this week?”, “which repos are drifting from our standards?”, and “are agents getting more efficient at navigating our codebase over time?”

This is what separates teams running agents from teams operating them.

How it works

Four phases. Infrastructure you keep. No ongoing dependency.

We don't run workshops and leave. We build real infrastructure in your environment — session tracking, conformance checks, dashboards, specifications — tested against your actual codebase and running before we hand it over.

01

Assess: Audit your codebase and agent readiness

Week 1

  • Map your repository structure, architecture patterns, and existing CI/CD pipeline to identify where agentic development has the highest leverage
  • Evaluate your team’s current AI usage — tooling, workflows, prompt patterns, and where agents are producing inconsistent output
  • Baseline your codebase against machine-readable specifications: compiler strictness, import hygiene, test structure, and deployment conventions
  • Identify cross-repository dependencies and coordination risks that will matter when agents work across your ecosystem
  • Deliver a concrete assessment with prioritised recommendations your team can act on — with or without us

02

Instrument: Build the observability and conformance layer

Weeks 2–3

  • Deploy session tracking infrastructure that captures what your agents do — files read, tools used, tokens consumed, and session outcomes — ingested into a queryable database
  • Write machine-readable specifications for your architecture standards and wire them into automated conformance checks that run on every agent session
  • Build Grafana dashboards and alerts for agent health, session quality, cost tracking, and conformance drift across your repositories
  • Configure structured context files (CLAUDE.md hierarchy) so agents understand your architecture, patterns, and constraints before they write a line of code
  • Set up budget controls and provider health monitoring for your AI API usage

03

Embed: Run agents on real work with your team

Weeks 4–6

  • Pair with your developers on real tickets — running agents against your actual backlog with the conformance and observability layer in place
  • Train your team to read session telemetry, diagnose agent behaviour, and iterate on context configuration based on real data
  • Refine specifications based on what the agents actually produce — tightening constraints where output drifts, relaxing where they’re unnecessary
  • Build agent workflow patterns specific to your team: feature development, bug triage, migrations, and cross-repo coordination
  • Scale from single-agent sessions to multi-agent workflows with proper orchestration and monitoring

04

Measure: Quantify and hand over

Weeks 7–8

  • Before/after comparison using real data from the session tracking pipeline: cycle time, conformance rates, agent efficiency scores, and cost per session
  • Agent navigation efficiency analysis — how well your agents traverse your codebase, measured against the dependency graph
  • Executive summary with concrete metrics your leadership team can use to justify continued investment
  • Hand over all infrastructure: dashboards, specifications, session tracking, conformance checks, and context configuration — running in your environment
  • Sustainability guide: how to maintain specifications, onboard new repositories, and evolve the system as your codebase changes

Illustrative dashboard — your engagement tracks metrics specific to your team:

Dashboard: AI Engineering Impact — 6 Week Engagement (Stacktrace AI Engineering Enablement)

  • PR Cycle Time: 6.2 hrs → 1.4 hrs (−77%)
  • Deploy Frequency: 2.1/week → 8.4/week (+300%)
  • AI Tool Adoption: 23% → 94% (+71 pts)
  • Test Coverage: 34% → 71% (+37 pts)

[Chart: PR Cycle Time (hours), weeks 1–6]
[Chart: AI Tool Usage by Team Member, week 1 vs week 6: Sarah 96%, James 91%, Priya 98%, Tom 84%, Wei 95%, Alex 88%, Jordan 92%, Sam 86%]

In practice

What governed agent development looks like

Not slides. Not architecture diagrams. These are the artefacts that come out of a governed agent workflow — structured PRs from conformance-checked sessions, with full test coverage and traceable agent decisions.

Fleet monitor & conformance matrix

Real-time agent status and cross-repo spec compliance at a glance

The babysitter process monitors every agent session in your fleet via terminal state parsing — detecting active work, idle agents, errors, and high-context sessions that need recycling.

The conformance matrix gives you a single view of spec compliance across your entire ecosystem. Which repos are clean, which are drifting, and exactly which specs need attention.

Terminal state parsing: idle, generating, error, approval — from raw tmux capture
Two-strike idle detection: first idle records a strike, second assigns new work
Context recycling: sessions at 90%+ context get gracefully restarted
Ecosystem conformance: all specs across all repos in one matrix

Fleet Status — babysitter · 22:08 cycle
Found 6 agent sessions
agent-session-ledger    active: reading    ctx=45%
agent-typescript-cli    generating         ctx=67%
agent-git               tool-approval      ctx=23%
agent-observability     strike 1 (idle)    ctx=72%
agent-graph-analysis    assigning work     ctx=31%
agent-claude-code       recycling          ctx=92%

Ecosystem Conformance Matrix
session-ledger            READY       (46 specs)
standards-body-git        READY       (43 specs)
standards-body-workflow   READY       (38 specs)
graph-analysis            VIOLATIONS  TS-CLI-SPEC-015
ai-gateway                READY       (28 specs)
host-services             READY       (41 specs)
MATRIX: 5/6 READY, 1 VIOLATIONS [8s]

A governed agent session

Conformance check → claim work → implement → quality gates → advance phase

This is what an autonomous agent session looks like with the infrastructure layer in place. The agent starts with conformance checks, claims work from a queue, implements within the guardrails set by your specifications, passes quality gates, and advances the work item through a tracked phase pipeline.

The babysitter monitors each session via terminal state parsing — detecting idle agents, recycling high-context sessions, and assigning new work automatically. Every action is captured by the session tracking pipeline.

The governed agent lifecycle

1. Session-start hooks run conformance checks (READY / VIOLATIONS)
2. Agent queries bead queue for highest-priority work item
3. Agent claims item — phase transitions to executing
4. Implementation follows CLAUDE.md context and spec constraints
5. Quality gates run: typecheck, build, lint, format, test
6. Structured commit with Issue-ID trailer, phase advances
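Steps 5 and 6 of the lifecycle above can be sketched as a gate runner and a structured commit builder. The shapes here are illustrative; in practice each gate shells out to your toolchain (pnpm typecheck, build, lint, and so on):

```typescript
// Illustrative quality gates: run in order, stop at the first failure,
// then build a commit message whose trailer links back to the work item.

interface Gate {
  name: string;
  run: () => boolean; // in practice, invokes the real tool
}

function runGates(gates: Gate[]): { passed: boolean; failedAt?: string } {
  for (const g of gates) {
    if (!g.run()) return { passed: false, failedAt: g.name };
  }
  return { passed: true };
}

// Issue-ID trailer lets the session pipeline trace the commit to the
// work item that produced it.
function commitMessage(summary: string, issueId: string): string {
  return `${summary}\n\nIssue-ID: ${issueId}`;
}
```

A failing gate blocks the commit and reports which gate failed, so the agent (or its babysitter) knows exactly where to resume.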

Deliverables

Infrastructure your team keeps and operates

Every engagement produces running systems in your environment. Session tracking pipelines, conformance checks, dashboards, and specifications — not documents, but infrastructure that operates continuously after we leave.

Session tracking pipeline

A data pipeline that captures everything your AI agents do — files read, tools used, tokens consumed, session outcomes — ingested into a queryable database with deduplication and incremental sync.

Includes

  • Filesystem watcher with reliable event ingestion
  • PostgreSQL schema for session data with content-hash dedup
  • Query CLI for session analysis and debugging
  • Periodic self-validation that detects missed events

Machine-readable specifications

Your architecture standards, coding conventions, and constraints encoded as executable specs that agents check automatically — not documents that gather dust.

Includes

  • Typed specification definitions with automated conformance checks
  • Session-start hooks that report READY/VIOLATIONS before agents begin work
  • Cross-repository conformance matrix
  • Remediation guidance agents can act on directly

Observability stack

Grafana dashboards, alerts, and telemetry covering agent health, session quality, cost tracking, and conformance drift — including self-monitoring that detects when observability itself goes stale.

Includes

  • Dashboard-as-code definitions compiled to Grafana JSON
  • Alert rules for agent health, latency, cost, and conformance
  • OpenTelemetry instrumentation across the agent pipeline
  • Signal-noise analysis that identifies orphaned queries and silent alerts

Context architecture

Structured CLAUDE.md configuration hierarchy that gives agents deep understanding of your codebase — architecture, patterns, constraints, and team conventions.

Includes

  • Root and directory-level context files for each major module
  • Constraint rules (security boundaries, data handling, multi-tenancy)
  • Pattern documentation pulled from your actual codebase
  • Hooks and automation for context maintenance

Agent efficiency measurement

Tooling that analyses how well agents navigate your codebase — classifying tool calls into exploration vs production phases and scoring navigation against your dependency graph.

Includes

  • Codebase knowledge graphs built from your TypeScript imports
  • Session pathfinding analysis with multi-dimensional quality scores
  • Cross-repo dependency tracking and impact analysis
  • Before/after efficiency comparison across the engagement
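One way to score navigation against an import graph, sketched with an invented metric: the fraction of an agent's file reads that are adjacent in the dependency graph to something it already touched. High scores suggest the agent is following the codebase's structure rather than wandering:

```typescript
// Illustrative navigation scoring. The graph maps each file to the
// files it imports; the metric and its exact form are assumptions.

type ImportGraph = Map<string, string[]>; // file -> files it imports

function navigationScore(graph: ImportGraph, reads: string[]): number {
  if (reads.length <= 1) return 1; // nothing to score
  const visited = new Set<string>([reads[0]]);
  let onGraph = 0;
  for (const file of reads.slice(1)) {
    // A read "follows the graph" if it is an import neighbour (in either
    // direction) of any file the agent has already touched.
    const adjacent = [...visited].some(
      (v) =>
        (graph.get(v) ?? []).includes(file) ||
        (graph.get(file) ?? []).includes(v),
    );
    if (adjacent) onGraph++;
    visited.add(file);
  }
  return onGraph / (reads.length - 1);
}
```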

Reproducible environments

Nix-based development environments that ensure every agent and every developer runs against the same toolchain, dependencies, and configuration — eliminating works-on-my-machine failures.

Includes

  • Devenv configuration with shared base modules
  • Deterministic flake management CLI
  • Pre-commit hooks wired to conformance checks
  • Environment bootstrap that works for new repos in minutes

Work with Stacktrace

Ready to run your agents on real infrastructure?

We build session tracking, conformance enforcement, and observability for your agentic development workflow. Running in your environment within weeks.

Based in Brisbane. Building agentic infrastructure for engineering teams across Australia and New Zealand.

Contact us today