cubedesk
The Composer's Desk
Where Composed Intelligence happens. Define your agents. Observe their reasoning. Measure their coordination. Compose nine communication modalities into a coherent system. One application. Your laptop. No cloud required.
In development — showing the work, not offering a download.
The Observation Gap
Today, agent coordination is measured by proxy: did the task complete? How many tokens did it cost? These are outcome metrics. They tell you what happened but not why. They cannot distinguish an agent that reasoned well from one that reasoned poorly and got lucky.
The gap is observability into reasoning itself. LLMs produce thinking — extended chains of reasoning that precede every action. Today, this material is lost. Session logs strip it. Terminals render it ephemerally. No system captures, indexes, or analyzes agent reasoning at scale.
CUBEdesk closes that gap.
What It Does
CUBEdesk is a desktop application. Tauri (Rust) + Svelte + SQLite. It runs on your laptop. It does three things.
Nine modalities, one desk
CUBEdesk doesn't just compose agents — it composes communication modalities. Each of the nine cubes is a surface through which composed intelligences interact, among them mudCUBE for immersive text, matrixCUBE for real-time messaging, loreCUBE for knowledge, and biblioCUBE for research. Define agents along five axes: role, permissions, model, scope, budget. The desk keeps them all flowing.
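The five axes above lend themselves to a single definition record. This is a minimal sketch in Rust (the desk's own language); the struct name, field names, and types are illustrative assumptions, not CUBEdesk's actual schema.

```rust
/// Illustrative sketch of an agent defined along the five axes:
/// role, permissions, model, scope, budget. Not CUBEdesk's real types.
#[derive(Debug, Clone)]
pub struct AgentSpec {
    pub role: String,             // e.g. "builder", "tester"
    pub permissions: Vec<String>, // capabilities granted to this agent
    pub model: String,            // which LLM backs the agent
    pub scope: String,            // the domain slice this agent owns
    pub budget_tokens: u64,       // context budget for the shift
}

impl AgentSpec {
    /// True if the agent may perform the named action.
    pub fn allows(&self, action: &str) -> bool {
        self.permissions.iter().any(|p| p == action)
    }
}
```

Keeping the definition a plain value type means the desk can diff two compositions, or A/B test them across shifts, by comparing records rather than prose.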
Four-layer capture architecture
At Layer 3, CUBEdesk holds the master file descriptor for every agent's PTY. Every thinking block. Every tool call. Every coordination trace. Before the terminal renders it. The Six Rs sedimentation pipeline transforms raw observation into durable knowledge.
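The three kinds of capture named above — thinking blocks, tool calls, coordination traces — could be modeled as one event stream read off the PTY master before the terminal renders it. A hedged sketch, with invented type names; a real pipeline would persist to SQLite and feed the sedimentation stages rather than an in-memory log.

```rust
/// Illustrative event types for a Layer-3 capture stream.
/// Variant and field names are assumptions, not CUBEdesk's schema.
#[derive(Debug)]
pub enum CaptureEvent {
    ThinkingBlock { agent: String, text: String },
    ToolCall { agent: String, tool: String, args: String },
    CoordinationTrace { from: String, to: String, text: String },
}

/// Append an event to the capture log. In this sketch the log is a
/// Vec; in a real system it would be durable storage.
pub fn record(log: &mut Vec<CaptureEvent>, ev: CaptureEvent) {
    log.push(ev);
}
```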
Psi dashboards and coordination quality
Psi dashboards. Gate progressions. Coordination quality over time. The fleet that produced these measurements is the same fleet that produces the work. Claims about performance are empirical, not aspirational.
The Six Rs
Sedimentation. Raw observations deposit like geological strata. Each stage compresses, cross-references, and challenges until bedrock knowledge remains. The cycle is continuous — Rethink feeds back into the Gap Engine. New gaps re-enter at Record.
The Context Window Advantage
Every intelligence — human or LLM — has a cognitive budget. For an LLM, it's the context window. For a human, it's working memory and attention span. The question isn't whether this budget exists, but how it's spent.
A single agent with a 1M-token window burns context on everything: domain reasoning, coordination overhead, repeated context-loading, and recovery from confusion. A composed fleet of fourteen agents, each with a 1M-token window, doesn't just have 14x the budget — it has 14x the budget with near-zero cross-loading, because each intelligence holds only its domain.
The gate convention tracks this empirically. Every gate passage records context utilization at that moment. Over a shift, you can watch whether composition decisions are preserving or wasting the cognitive budget.
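The budget arithmetic above is simple enough to pin down in code. A sketch, assuming the 1M-token windows from the example; function names are illustrative, not the gate convention's actual API.

```rust
/// Total cognitive budget of a composed fleet: n agents, each with
/// its own context window (tokens).
pub fn fleet_budget(agents: u64, window: u64) -> u64 {
    agents * window
}

/// Context utilization recorded at a gate passage: fraction of the
/// agent's window consumed at that moment.
pub fn utilization(tokens_used: u64, window: u64) -> f64 {
    tokens_used as f64 / window as f64
}
```

A series of `utilization` readings across a shift's gate passages is what lets the operator see whether a composition is preserving or wasting the budget.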
The Four Layers
At Layer 3, CUBEdesk sees everything the agent produces before the terminal renders it. Thinking blocks that session logs strip. Tool calls that scroll past. Coordination traces that no existing system preserves. This is the raw material of intelligence observation.
How It Was Built
CUBEdesk was itself built by Composed Intelligence — proof that the paradigm works.
We deposited traces. The fleet built software. The commit messages cite our traces. We're still investigating what this means.
What surprised us: the test agent didn't need to be told when to start testing. It waited until builders deposited testable interfaces, then began. Dependency order emerged from the pheromone structure — the same way termite builders respond to cement deposits without knowing the blueprint.
What we didn't expect: three complete build cycles in a single overnight session. The trace → behavior → product → new trace loop closed and re-entered autonomously. Each cycle produced working code that the next cycle extended.
What failed: early attempts without manufactured traces produced churn — agents duplicating work, contradicting each other, rebuilding what existed. The architecture decisions and duty boundaries we deposited before the build weren't suggestions. They were the pheromone scaffold that made coordination possible.
The full methodology is documented in the ANTS 2026 paper.
The Stack
Apple Silicon. Unified memory means SQLite, the LLM sedimentation pipeline, and the visualization layer share memory without serialization. The Secure Enclave holds cryptographic keys. One chip, all paths.
Security Model
The one-way valve.
Secrets flow out to cloud proving grounds. Results flow back as bundles. The cloud never reaches in. If an agent goes rogue, it's in a container that gets destroyed — it never had access to your laptop.
What Becomes Measurable
With CUBEdesk capturing full reasoning streams, Composed Intelligence claims become empirically testable:
Composed agents coordinate more efficiently than general agents.
Measure: Compare Psi trajectories between fleets with tight compositions vs. fleets with generic specifications.
Composition reduces redundant work.
Measure: Analyze reasoning traces across fleet members for semantic overlap. Quantify redundant information directly from captured thinking blocks.
The operator's composition quality determines fleet performance.
Measure: Track output quality as a function of specification changes. A/B test composition variants across shifts.
Context budget management improves with composition.
Measure: Analyze gate progressions. Composed agents should show more predictable context burn rates.
Fleet capability exceeds the sum of individual agent capabilities.
Measure: Compare fleet output on complex tasks against single-agent output on the same tasks.
Composition efficiency is measurable per intelligence.
Measure: Context utilization ratio — what fraction of tokens are spent on domain reasoning vs. coordination overhead vs. repeated context-loading? Given two fleet compositions doing the same work, which burns less total context?
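The last measure — the per-intelligence split of spent tokens across domain reasoning, coordination overhead, and repeated context-loading — can be sketched directly. The category names come from the text above; the struct and methods are an illustrative assumption.

```rust
/// Where one intelligence's context budget went over a shift.
/// Categories follow the text; the type itself is illustrative.
pub struct ContextSpend {
    pub domain: u64,       // tokens spent on domain reasoning
    pub coordination: u64, // tokens spent on coordination overhead
    pub reload: u64,       // tokens spent on repeated context-loading
}

impl ContextSpend {
    pub fn total(&self) -> u64 {
        self.domain + self.coordination + self.reload
    }

    /// Fraction of total context spent on domain reasoning — higher
    /// means the composition is burning less on overhead.
    pub fn domain_ratio(&self) -> f64 {
        self.domain as f64 / self.total() as f64
    }
}
```

Comparing `total()` across two fleet compositions doing the same work answers "which burns less total context"; comparing `domain_ratio()` answers "which spends it better".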
The fleet built this. The desk measures how.