orchid

High velocity agent orchestration

by denoland


5 stars · 1 fork · 4 contributors · Active · 17h ago · Since 2026 · v0.1.1 · MIT

Meet the team


littledivy · 263 contributions

divybot · 29 contributions

bartlomieju · 12 contributions

Languages


Go 100%

Commit activity

Last 12 weeks · 305 commits


Community health

2 of 6 standards met


✓ README · ✓ License · ○ Contributing · ○ Code of Conduct · ○ Issue Template · ○ PR Template

Recent PRs & issues

Active · 1 in progress · Last activity 17h ago


feat: add detached local worker spawn command · Open PR

Summary

add so operators can spawn local tmux workers without auto-attaching

add as a detached alias for programmatic orchestration

validate JSON responses and document the notify-hook-driven detached flow

Test Plan

exercised and against a local mock server

littledivy · 1w ago

Anthropic 529 overloaded errors causes session to be orphaned for days · Open Issue

I had a worker session claude-425 that hit repeated 529 Overloaded errors and then went idle at a shell prompt. The tmux session claude-425 was still live for days, but the worker never got re-poked by Orchid. I had to manually run tmux send-keys -t claude-425 ... to get it to resume. orchid.service was up and busy (clawpatrol run claude --resume + codex loops), so the orchestrator itself was healthy.

Expected: Orchid should detect stalled sessions (e.g. statusline idle + no status updates for N minutes while the tmux session still exists) and either re-ping the pane or mark the lane as dead and respawn a new worker tied to the same PR.

This affected at least claude-425; there may be other similar sessions.

littledivy · 3w ago

Recent fixes


Stop wasting sessions: pre-spawn triage + poke-on-noise + flake re-diagnosis · Closed Issue

From reading ~240 real session transcripts on the live swarm.

The agents are smart and well-behaved: they close redundant PRs, answer questions, distinguish flakes from real failures, file follow-ups. The waste is process, not model: we spend full (opus) sessions on work that isn't there, and we wake sessions for noise. On a subscription the binding cost is weekly quota, so every wasted session/turn directly shrinks how much real work fits.

1. No pre-spawn triage → full sessions on non-tasks

Many "issues" aren't implementable tasks, but each gets a full coding session to discover that:

#280 (closed): agent started the patch, then "Confirmed: already fixed on current deno 2.7.14… the dep is now on 0.7.0" and closed its own PR. Correct call, but it burned a session to find the bug was stale/already-fixed.

#290 (closed): was a question, not a task: agent researched prior art and posted a reply. Never any code to write.

#208 (closed): shipped a clean fix; the red CI was "the unrelated WPT flake already analyzed." Declined by a maintainer decision, not a failure.

The "closed-not-merged" rate badly undercounts swarm quality; a big share is the agent correctly declining to ship. But it costs a full opus session each.

Direction: a cheap pre-spawn classifier (haiku/sonnet, or a lightweight pass) that flags an issue as vs / / before committing a full session: answer/label/skip the non-tasks instead of spawning opus on them.

2. Pokes fire on routine noise → wake → re-read → "nothing to do"

A merged session's tail (#242) is mostly idle narration triggered by pokes:

"Divy merged main into the branch - routine sync, nothing for me to do."

"Windows flakes are the only remaining red - waiting on merge."

"Waiting for the next signal."

Each poke is a full-context turn (the dominant token sink is re-reading context). We poke for routine main-merge syncs and known-flaky CI re-runs: the agent wakes, re-reads everything, concludes there's nothing to do, sleeps.

Direction: filter pokes: don't wake the session for a routine main-merge into the branch, or for a CI transition on a known-flaky job. Only poke on real new signal (new review comment, a non-flaky check going red, a push).

3. Known-flaky CI is re-diagnosed every session

#208 and #242 both spend end-of-session re-deriving the same flakes (/ Node crashes, WPT fetch flakiness). Every session re-learns which jobs are flaky.

Direction: a per-repo flake allowlist the swarm maintains (this is a natural fit for the memory store); known-flaky jobs don't count as "needs attention" and don't trigger pokes.

Why it matters

These are a different lever than tokens-per-turn: don't start (or wake) a session that has no real work. Combined they likely reclaim a meaningful slice of weekly quota: the runaway/non-task tail and poke-on-noise are pure overhead invisible in the merge metric.

(Companion data: ~98% of turns run on opus, including triage/grep; 22% of sessions hit API/auth/capacity errors mid-run. See also #2 for the sshd/infra side.)

littledivy · 3d ago
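The poke filter and per-repo flake allowlist proposed in directions 2 and 3 of the issue above might look something like the sketch below. Everything named here (`Event`, `EventKind`, `FlakeAllowlist`, `ShouldPoke`, the repo `example/repo`, and the job names `wpt` and `lint`) is hypothetical and chosen only for illustration; the issue does not describe Orchid's actual event model.

```go
package main

import "fmt"

// EventKind is a hypothetical classification of signals that could wake a session.
type EventKind int

const (
	ReviewComment EventKind = iota // new review comment on the PR
	Push                           // a push to the branch
	MainMergeSync                  // routine merge of main into the branch
	CheckFailed                    // a CI check went red
)

// Event is a hypothetical signal about a PR the session owns.
type Event struct {
	Kind EventKind
	Repo string
	Job  string // CI job name; only meaningful for CheckFailed
}

// FlakeAllowlist maps a repo to the set of CI jobs known to be flaky there.
type FlakeAllowlist map[string]map[string]bool

// IsFlaky reports whether job is allowlisted as flaky for repo.
// Lookups on a missing repo are safe: indexing a nil map yields false.
func (f FlakeAllowlist) IsFlaky(repo, job string) bool {
	return f[repo][job]
}

// ShouldPoke wakes the session only on real new signal: a review comment,
// a push, or a failing check that is not on the flake allowlist.
func ShouldPoke(e Event, flakes FlakeAllowlist) bool {
	switch e.Kind {
	case ReviewComment, Push:
		return true
	case CheckFailed:
		return !flakes.IsFlaky(e.Repo, e.Job)
	default: // MainMergeSync and anything unrecognised is treated as noise
		return false
	}
}

func main() {
	flakes := FlakeAllowlist{"example/repo": {"wpt": true}}
	events := []Event{
		{Kind: MainMergeSync, Repo: "example/repo"},
		{Kind: CheckFailed, Repo: "example/repo", Job: "wpt"},
		{Kind: CheckFailed, Repo: "example/repo", Job: "lint"},
		{Kind: ReviewComment, Repo: "example/repo"},
	}
	for _, e := range events {
		fmt.Println(e.Kind, e.Job, ShouldPoke(e, flakes))
	}
}
```

Defaulting unknown event kinds to "no poke" is one possible choice that favours saving quota; a deployment that worries more about missed signal could default the other way.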

sshd load: orchestrator's per-tick SSH load on workers · Closed Issue

Problem

orch drives every worker over plain SSH, at high frequency:

Health probe: SSHes each VM every ~15s.

Per-tick, per-session: every (~20s) the scheduler does , (), and on changes ( + ), each a separate invocation, multiplied across all live sessions on a VM.

With 38 worker slots across 3 hosts (and growing), a single busy VM can see dozens of SSH handshakes per tick. + pools connections, but:

if the control socket drops (idle, network blip, expiry), every queued call re-handshakes at once → bursts past / , which then refuses connections → orch reads VMs as unhealthy → flapping.

the central box () has open to the internet, so it also eats continuous scanner/brute-force traffic on top of orch's own load.

So both directions are a concern: orch can self-inflict an sshd DoS on a contended worker, and the public box's sshd is externally exposed.

Ideas to investigate

Reuse, don't reconnect. Verify every / actually shares one per VM; raise ; pre-warm the socket on health-probe.

Batch per-VM calls. Coalesce the per-session + into a single per VM per tick (one + batched ) instead of N invocations.

Harden worker sshd. Bump / , and back off (don't stampede) when a probe fails.

Shrink the public surface. Move worker SSH onto the tailnet only (the hosts already are); restrict the central box's to Tailscale / known IPs, add fail2ban or rate-limiting for any internet-facing sshd.

Add a circuit breaker. If a VM returns connection-refused, exponentially back off its SSH cadence instead of retrying every tick.

Why it matters

Under load this shows up as VMs flapping online/offline (the / health-flap churn) and wasted reconnect overhead; the bigger the swarm, the worse it scales.

littledivy · 3d ago

feat(herdr): complete tmux→herdr migration + spawn-failure cooldown · Merged PR

Summary

Finishes the tmux→herdr multiplexer migration (vultr + mac were on the herdr shim but it was broken; gcp stays on real tmux) and hardens the scheduler so one wedged VM can't starve the swarm. All deployed and verified live: all 7 VM-agents spawning.

herdr tmux-shim fixes

The shim was written against a flat JSON schema, but this herdr build nests CLI output under :

resolve_label / resolve_pane / ls / new-session: read / . The flat paths returned empty, so false-negatived live sessions (→ reconcile tore them down) and couldn't resolve the workspace id.

list-panes: (herdr rejects the positional form) + read .

new-session pane race: herdr spawns the root pane asynchronously; an immediate raced to "no root pane". Retry ~10×0.3s, fall back to the create output's .

Scheduler: per-VM spawn-failure cooldown (, )

A VM that fails (3) spawns consecutively is benched for (5m); skips it so admission falls through to a healthy VM instead of re-picking the same wedged box every tick. Any successful spawn resets the tally, so an issue-specific failure on a healthy VM never cools it. Covered by .

Ops notes ()

The localhost (root-run) herdr path needs orch to run with set (herdr socket at ) and the VM's to stay , since orch drives the localhost shim as root.

Known follow-ups (not in this PR)

gcp sshd caps concurrency on that host → under spawn storms; cooldown routes around it.

herdr leaks a login shell per workspace (pty leak); normal teardown reaps it now that the shim works, but failed-spawn storms re-leak.

littledivy · 1w ago

Structured data for AI agents

Repository: denoland/orchid. Description: High velocity agent orchestration. Stars: 5, Forks: 1. Primary language: Go. Languages: Go (100%). License: MIT. Latest release: v0.1.1 (1mo ago). Open PRs: 1, open issues: 1. Last activity: 17h ago. Community health: 37%. Top contributors: littledivy, divybot, bartlomieju.