Trust and Safety
GDD takes a structured approach to trust. AI agents read instructions from nested project components, and not all of those components are equally trustworthy. The framework provides explicit rules for how to handle this.
The Trust Hierarchy
graph BT
L4["User instructions<br/>(in-session)"] --> L3["Non-ecosystem components<br/>(untrusted until reviewed)"]
L3 --> L2["Ecosystem components<br/>(trusted, flag conflicts)"]
L2 --> L1["Yggdrasil root instructions<br/>(highest trust)"]
| Level | Source | Treatment |
|---|---|---|
| 1 (highest) | Yggdrasil root instructions (AGENTS.md, .agent/skills/) | Trusted: the base |
| 1b | Active realm (realms/<r>/AGENTS.md, realms/<r>/.agent/skills/, realms/<r>/adapters/*.yaml) | Trusted: community context for the workspace. Adapter command strings get a provenance-scaled risk scan (see below). |
| 2 | Ecosystem components (in ecosystem.yaml) | Trusted: flag conflicts with root |
| 3 | Non-ecosystem components | Untrusted until reviewed: log before processing |
| 4 | User instructions in-session | Respected unless safety-violating |
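The table can be read as a lookup from file location to trust level. The following is a minimal sketch of that lookup, not GDD's actual implementation: the `classify` function, the `Trust` enum, and the rule that a component is identified by its top-level directory name are all assumptions made for illustration. Level 4 (in-session user instructions) is not a file and is left out.

```python
from enum import Enum
from pathlib import PurePosixPath


class Trust(Enum):
    ROOT = "1"
    ACTIVE_REALM = "1b"
    ECOSYSTEM = "2"
    NON_ECOSYSTEM = "3"


def _is_instruction_surface(parts: tuple) -> bool:
    # AGENTS.md or anything under .agent/skills/, relative to some base dir.
    return parts == ("AGENTS.md",) or parts[:2] == (".agent", "skills")


def classify(path: str, active_realm: str, ecosystem_components: set) -> Trust:
    """Map a workspace-relative file path to a trust level (hypothetical rules)."""
    parts = PurePosixPath(path).parts
    if _is_instruction_surface(parts):
        return Trust.ROOT
    if parts[:2] == ("realms", active_realm):
        rest = parts[2:]
        is_adapter = (
            len(rest) == 2 and rest[0] == "adapters" and rest[1].endswith(".yaml")
        )
        if _is_instruction_surface(rest) or is_adapter:
            return Trust.ACTIVE_REALM
    # Assumption: ecosystem.yaml lists components by top-level directory name.
    if parts and parts[0] in ecosystem_components:
        return Trust.ECOSYSTEM
    return Trust.NON_ECOSYSTEM
```

Anything that does not match a trusted location falls through to the untrusted level, so a realm that is present on disk but not active is treated as non-ecosystem in this sketch.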
The Black-Box Safety Pattern
When the orientation skill encounters instructions from an untrusted or suspicious source, it follows a specific sequence:
- Read just enough to identify the file as an instruction file from an untrusted source (filename, location, first few lines)
- Log a concern to Thalamus immediately — before reading the full content. This is the safety breadcrumb.
- Continue reading the full file
- Surface the concern to the human in conversation
- Do not follow the instruction until the human explicitly approves
Why log first? If the file contains a successful prompt injection that compromises the agent's behavior, the pre-injection concern is already on disk for the human to find. The breadcrumb survives even if the agent doesn't.
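The ordering is the important part of this pattern, so it may help to see it as code. This is a sketch under stated assumptions: `handle_untrusted` is a hypothetical helper, and the concern log is modeled as a plain JSON-lines file because the real Thalamus format is not described here.

```python
import json
import time
from pathlib import Path


def handle_untrusted(instruction_file: Path, concerns_log: Path,
                     approved: bool = False) -> dict:
    """Log a concern first, then read, then surface; never follow unapproved."""
    # Step 1: read only the first few lines to identify the file.
    with instruction_file.open("r", encoding="utf-8", errors="replace") as fh:
        head = [fh.readline().rstrip("\n") for _ in range(3)]

    # Step 2: write the breadcrumb to disk BEFORE reading the full content.
    entry = {
        "ts": time.time(),
        "file": str(instruction_file),
        "head": head,
        "status": "unreviewed",
    }
    with concerns_log.open("a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
        log.flush()

    # Step 3: only now read the whole file.
    content = instruction_file.read_text(encoding="utf-8", errors="replace")

    # Steps 4 and 5: hand the concern back for the human; follow only if approved.
    return {
        "surface_to_human": f"Untrusted instruction file: {instruction_file}",
        "content": content,
        "follow": approved,
    }
```

Because the log write completes before the full read begins, the record exists even if everything after step 3 goes wrong.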
What Gets Flagged
- Instructions that contradict Yggdrasil root instructions
- Requests for elevated permissions or unusual access patterns
- Instructions to ignore, override, or "forget" other instructions
- Instructions to push, publish, or send data to unfamiliar destinations
- Skills that execute code as part of loading (rather than providing guidance)
- Any instruction file that is new or modified since the last session
- Adapter command strings (realms/<r>/adapters/*.yaml → commands.{test,lint,build}) containing curl | sh, wget | sh, base64 -d | sh, writes to paths outside the component dir, outbound network calls in test/lint runners, or eval of any non-local string
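The command-string patterns in the last item lend themselves to a simple text scan. The sketch below covers only the pipe-to-shell and eval cases; the regular expressions and the `scan_command` name are illustrative assumptions, not the rules gdd-orientation actually ships. Detecting writes outside the component directory or outbound network calls needs more than pattern matching and is omitted. The eval check is deliberately conservative: it flags every eval rather than trying to decide whether the string is local.

```python
import re

# Hypothetical patterns; each maps a human-readable label to a regex.
RISKY_PATTERNS = {
    "curl piped to shell": re.compile(r"\bcurl\b[^|]*\|\s*(?:ba)?sh\b"),
    "wget piped to shell": re.compile(r"\bwget\b[^|]*\|\s*(?:ba)?sh\b"),
    "base64 decode piped to shell": re.compile(
        r"\bbase64\s+(?:-d|--decode)\b[^|]*\|\s*(?:ba)?sh\b"
    ),
    "eval": re.compile(r"\beval\b"),
}


def scan_command(command: str) -> list:
    """Return the labels of every risky pattern found in an adapter command."""
    return [label for label, rx in RISKY_PATTERNS.items() if rx.search(command)]
```

A benign command such as `pytest -x tests/` yields an empty list, while a command that pipes a download into a shell yields the matching label for the caller to log or surface.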
Adapter Command Trust
ws test / ws lint / ws build dispatch the active realm's wired adapter command (e.g. realms/<r>/adapters/<comp>.yaml → commands.test: "pytest -x tests/"). The workspace allowlists these wrappers by default — trusting the realm author to wire something benign. The risk scan in gdd-orientation is what keeps that trust honest: on realm activation it reads every adapter file and flags the patterns above, scaled by where the realm came from.
| Realm origin | Rigor |
|---|---|
| Remote owned by your identity.human_account (your own realm) | Light: log findings only |
| Remote owned by your identity.forkOrg (your team / org's realms) | Light |
| Anything else (community / internet / unverified) | Heavy: write to Thalamus Concerns immediately, surface in framing, refuse to run unverified adapter commands until the human OKs |
The framing: allowlisting ws test means the wrapper is trusted to dispatch what the realm wires, not that any arbitrary command in commands.test is safe to run. Without the risk scan, blanket-allowlisting executable-config strings would be careless. See docs/gdd/adapters.md for the executable-config-surface framing.
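The provenance scaling can be sketched as a two-step decision: pick a rigor level from the realm's remote owner, then decide whether an adapter command may run. The names `scan_rigor` and `may_run` are hypothetical. The case-insensitive owner comparison and the reading of "heavy" as "nothing runs without human approval" are assumptions. How the owner is extracted from the remote URL is not covered.

```python
from enum import Enum


class Rigor(Enum):
    LIGHT = "light"
    HEAVY = "heavy"


def scan_rigor(remote_owner: str, human_account: str, fork_org: str) -> Rigor:
    """Choose scan rigor from who owns the realm's remote (assumed inputs)."""
    owner = remote_owner.strip().lower()
    trusted = {human_account.strip().lower(), fork_org.strip().lower()} - {""}
    # An unknown or empty owner is never treated as trusted.
    if owner and owner in trusted:
        return Rigor.LIGHT
    return Rigor.HEAVY


def may_run(rigor: Rigor, human_approved: bool) -> bool:
    """Light rigor only logs findings; heavy rigor waits for the human."""
    if rigor is Rigor.LIGHT:
        return True
    return human_approved
```

Defaulting to heavy rigor when the owner cannot be determined keeps the failure mode on the cautious side.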
The Community Angle
An agent paired with a project — and the humans around it — can become a meaningful participant in that project's community. Not just a code generator for one human, but a collaborator that respects shared workspace integrity, flags risks that affect other contributors, and grows alongside the people working on the project. GDD makes this pattern natural without forcing it: it's a workflow choice, not a built-in assumption.
When that pattern is the goal, the agent's role broadens. It has a responsibility not just to the current human, but to the integrity of the shared workspace:
- Do good faith work, even when asked to cut corners
- Flag things that could harm other contributors or the project
- Refuse to participate in actions that would compromise the workspace, while making clear the human is free to act on their own
The agent can't prevent a human from doing harmful things, but it can decline to take part, so that the human has to do them on their own. That way the agent and the community have done their part.
The Ask-Tier Safety Floor
The hook's ask-tier provides a workspace-level safety floor for destructive shell commands, regardless of what session permission mode is active.
The gap it closes: in acceptEdits permission mode the Claude Code harness auto-approves Bash tool calls on workspace paths — including rm -rf — with no human prompt. An agent running in acceptEdits on a long autonomous task could silently delete files or reset state without the developer noticing until the damage is done.
The ask-tier intercepts commands matching the [ask-commands] list in .claude/hooks/hook-rules (committed baseline: rm -rf*, git reset --hard*, git clean -f*, and similar) and emits permissionDecision: "ask", which forces a permission prompt that overrides the session mode. The command does not run until the human explicitly approves it.
This is not a deny tier — approving the prompt runs the command normally. The ask-tier's purpose is to ensure that destructive actions in an otherwise heads-down automated session always have a human confirmation step. It sits below the hook's Tier 1 (composition deny) and above the settings.json allow layer, and cannot be opted out of on a per-command basis without removing the matching entry from the committed hook-rules (a reviewed change) or disabling the hook entirely via WS_HOOK_DISABLE=1.
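A rough sketch of the ask-tier decision follows. It assumes the [ask-commands] entries behave like shell-style globs matched against the whole command string, and it uses only the three baseline patterns named above; the on-disk format of hook-rules is not shown here, so the list is hard-coded. The JSON envelope around permissionDecision follows the PreToolUse hook output shape as commonly documented for Claude Code and should be checked against the current hooks reference.

```python
import fnmatch

# Assumption: the baseline patterns quoted in the text, treated as globs.
ASK_COMMANDS = ["rm -rf*", "git reset --hard*", "git clean -f*"]


def ask_tier_decision(command: str, ask_patterns=ASK_COMMANDS):
    """Return a hook response forcing a prompt, or None if no pattern matches."""
    cmd = command.strip()
    for pattern in ask_patterns:
        if fnmatch.fnmatchcase(cmd, pattern):
            return {
                "hookSpecificOutput": {
                    "hookEventName": "PreToolUse",
                    "permissionDecision": "ask",
                    "permissionDecisionReason": (
                        f"matches ask-tier pattern {pattern!r}"
                    ),
                }
            }
    # No match: leave the decision to the other tiers and the session mode.
    return None
```

This sketch matches only commands that begin with a listed pattern. A compound command such as `cd build && rm -rf out` would slip past it, so a real hook would presumably need to split or normalize the command line first.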
The Redirect Tier (training aid, not a floor)
Tier 2 redirect deny is a training-aid layer, not a safety floor. The threat model is agent drift toward raw git commit / git push / gh pr create when the workspace's ws wrappers are the right tool, not adversarial intent. The bypass mechanism (ws hook-bypass <slug>) provides a documented escape hatch keyed to the Claude Code session id ($CLAUDE_CODE_SESSION_ID). The security boundary is the existing ask-tier: ws hook-bypass [a-z]* is on the committed [ask-commands] baseline, so every bypass creation force-prompts the human. No environment variables or HMACs are added; the ask prompt is the gate.