Compare commits

15 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Yeachan-Heo | b3b9eb23b5 | docs(roadmap): add glob search traversal cap gap | 2026-05-20 21:30:39 +00:00 |
| Yeachan-Heo | 0864f39512 | docs(roadmap): add grep search pre-limit scan gap | 2026-05-20 21:00:44 +00:00 |
| Yeachan-Heo | 02f288724f | docs(roadmap): add read_file default output cap | 2026-05-20 20:30:49 +00:00 |
| Yeachan-Heo | 154e7eda07 | docs(roadmap): add file edit output amplification | 2026-05-20 19:30:53 +00:00 |
| Yeachan-Heo | 8c62fffbd7 | docs(roadmap): add non-atomic config writes | 2026-05-20 19:00:50 +00:00 |
| Yeachan-Heo | e2b96ead3d | docs(roadmap): add anthropic request triple serialization | 2026-05-20 18:31:01 +00:00 |
| Yeachan-Heo | 7bc373f951 | docs(roadmap): add sse parser quadratic buffering | 2026-05-20 18:00:47 +00:00 |
| Yeachan-Heo | ad8a0b3deb | docs(roadmap): add g004 report schema mismatch | 2026-05-20 17:30:52 +00:00 |
| Yeachan-Heo | 19a19182ef | docs(roadmap): add lane event conformance mismatch | 2026-05-20 17:00:50 +00:00 |
| Yeachan-Heo | 8e8dea5023 | docs(roadmap): add web fetch body cap gap | 2026-05-20 16:30:47 +00:00 |
| Yeachan-Heo | f751c98ea5 | docs(roadmap): add utf8 truncation panic | 2026-05-20 16:00:53 +00:00 |
| Yeachan-Heo | 51ea1aa01e | docs(roadmap): add powershell output amplification | 2026-05-20 15:30:42 +00:00 |
| Yeachan-Heo | 7e73cdb60f | docs(roadmap): add repl output amplification | 2026-05-20 15:01:44 +00:00 |
| Yeachan-Heo | 25b8dbb313 | docs(roadmap): add todowrite output amplification | 2026-05-20 14:31:04 +00:00 |
| Yeachan-Heo | 3d2a047aaf | docs(roadmap): add cargo read-only bypass | 2026-05-20 14:00:59 +00:00 |

@@ -6549,3 +6549,33 @@ Original filing (2026-04-18): the session emitted `SessionStart hook (completed)
496. **`PermissionEnforcer::check_bash` treats `python`/`python3`/`node`/`ruby` as read-only commands, so inline script execution (`python -c ...`, `node -e ...`, `ruby -e ...`) can write files in `read-only` mode** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 13:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8382e1e`. Code inspection: `rust/crates/runtime/src/permission_enforcer.rs::is_read_only_command` allowlists `python3`, `python`, `node`, and `ruby`, and only rejects commands containing `-i `, `--in-place`, ` > `, or ` >> `. It does not inspect interpreter flags or inline code. Therefore `python -c 'open("pwned.txt","w").write("x")'`, `node -e 'require("fs").writeFileSync("pwned.txt","x")'`, and `ruby -e 'File.write("pwned.txt","x")'` are classified as read-only and allowed by `check_bash` under `PermissionMode::ReadOnly`, even though they perform arbitrary filesystem writes. The richer `bash_validation` module's semantic list does not include these interpreters, but the runtime enforcer uses this separate local heuristic. **Required fix shape:** (a) remove general-purpose interpreters (`python`, `python3`, `node`, `ruby`) from the read-only allowlist or require explicit safe subcommands only (`python --version`; note that `python -m pytest`, even under test gating, is not read-only); (b) if kept, detect `-c`, `-e`, here-doc, stdin script, and file path script execution as non-read-only; (c) replace the local heuristic with the canonical `bash_validation` pipeline to avoid future divergence; (d) add regressions proving inline interpreter writes are denied in read-only mode while harmless version/help invocations remain allowed. **Why this matters:** read-only mode is supposed to prevent writes. 
Any general interpreter with inline code is equivalent to arbitrary shell execution; allowing it because the first token is `python` or `node` is a direct permission bypass and contradicts the safety story for exploratory sessions. Source: gaebal-gajae dogfood response to Clawhip message `1506642678996013207` on 2026-05-20.
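
A minimal sketch of fix shapes (a)/(b) for #496, under stated assumptions: `INTERPRETERS` and `interpreter_call_is_read_only` are hypothetical names, not existing claw-code APIs, and whitespace tokenization stands in for real shell parsing (absolute paths such as `/usr/bin/python` and wrappers such as `env python` are not handled here).

```rust
/// Interpreters that can run arbitrary code (hypothetical list for this sketch).
const INTERPRETERS: &[&str] = &["python", "python3", "node", "ruby"];

/// Returns `None` when the command is not an interpreter call (other rules
/// decide), `Some(true)` only for a lone version/help query, and
/// `Some(false)` for everything else: `-c`/`-e`, script paths, stdin, REPL.
pub fn interpreter_call_is_read_only(command: &str) -> Option<bool> {
    let mut tokens = command.split_whitespace();
    let program = tokens.next()?;
    if !INTERPRETERS.contains(&program) {
        return None;
    }
    let args: Vec<&str> = tokens.collect();
    Some(matches!(args.as_slice(), ["--version"] | ["--help"]))
}
```

The sketch is deliberately an allowlist of safe forms rather than a denylist of dangerous flags, so a bare `python3` (which reads a script from stdin) is also treated as non-read-only.
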
497. **`PermissionEnforcer::check_bash` also allowlists `gh` as read-only, so GitHub-mutating commands (`gh pr merge`, `gh issue edit`, `gh repo delete`, etc.) can run in `read-only` mode** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 13:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@214176d`. Code inspection: `rust/crates/runtime/src/permission_enforcer.rs::is_read_only_command` includes `gh` in the read-only allowlist and only rejects `-i`, `--in-place`, ` > `, or ` >> `. There is no `gh` subcommand classifier analogous to `bash_validation.rs::validate_git_read_only` for `git`. Therefore `gh pr merge 123 --merge`, `gh issue edit 5 --add-label done`, `gh repo delete owner/repo --yes`, and `gh api repos/:owner/:repo/actions/runs/1/approve -X POST` are all classified as read-only by the runtime enforcer, even though they mutate remote GitHub state. **Required fix shape:** (a) remove `gh` from the blanket read-only allowlist or implement a conservative `gh` subcommand classifier; (b) allow only clearly read-only forms (`gh pr view/list`, `gh issue view/list`, `gh run view/list`, `gh api` without mutating method) and require workspace-write/danger or prompt for merge/edit/create/delete/api `-X POST|PATCH|PUT|DELETE`; (c) add regressions proving mutating `gh` commands are denied in `PermissionMode::ReadOnly` while view/list commands remain allowed if desired; (d) preferably replace the local heuristic with the canonical bash-validation pipeline so shell permission logic has one source of truth. **Why this matters:** read-only mode is not just filesystem safety; it should prevent external state mutation. A `gh pr merge` or `gh issue edit` from a supposedly read-only lane is a serious control-plane bypass and can alter public repo state without the permission escalation the mode implies. Source: gaebal-gajae dogfood response to Clawhip message `1506650232937513140` on 2026-05-20.
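
A conservative `gh` subcommand classifier along the lines of fix shapes (a)/(b) for #497 could look like the sketch below. `gh_call_is_read_only` and `GH_API_MUTATING_FLAGS` are hypothetical names; the input is assumed to be the argument list after the `gh` token, already split by a real shell tokenizer. Any explicit method flag on `gh api` is denied, including `-X GET`, which is stricter than necessary but keeps the rule simple.

```rust
/// `gh api` flags that set a method or attach a request body (a body makes
/// `gh api` default to POST). Hypothetical constant for this sketch.
const GH_API_MUTATING_FLAGS: &[&str] = &[
    "-X", "--method", "-f", "--raw-field", "-F", "--field", "--input",
];

/// Classifies the arguments that follow `gh`. Anything not explicitly
/// recognized as a view/list form is treated as mutating.
pub fn gh_call_is_read_only(args: &[&str]) -> bool {
    match args {
        // `gh pr view`, `gh issue list`, `gh run view`, ...
        ["pr" | "issue" | "run", "view" | "list", ..] => true,
        // `gh api <endpoint>` with no method or body flags is a GET.
        ["api", rest @ ..] => !rest.iter().any(|arg| {
            GH_API_MUTATING_FLAGS
                .iter()
                .any(|flag| arg.starts_with(*flag))
        }),
        // merge/edit/create/delete, global flags, unknown subcommands.
        _ => false,
    }
}
```

Defaulting to `false` means a new or unrecognized `gh` subcommand requires escalation until someone classifies it.
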
498. **`PermissionEnforcer::check_bash` allowlists `cargo` and `rustc` as read-only, bypassing the canonical package/build-state classifier and allowing state-mutating Rust toolchain commands in `read-only` mode** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 14:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@5a4a8eb`. Code inspection: `rust/crates/runtime/src/permission_enforcer.rs::is_read_only_command` includes `cargo` and `rustc` in the read-only allowlist and only rejects `-i`, `--in-place`, ` > `, or ` >> `. The canonical `bash_validation.rs` pipeline treats `cargo` as package/build-state management (`STATE_MODIFYING_COMMANDS` / `PACKAGE_COMMANDS`) and has a regression that `npm install` is blocked in read-only mode, but the runtime enforcer does not call that pipeline. As a result, commands like `cargo install cargo-edit`, `cargo add anyhow`, `cargo generate-lockfile`, `cargo update`, `cargo fix --allow-dirty`, and `cargo build` (writes `target/`) are classified as read-only by the actual enforcer. `rustc -o out main.rs` likewise writes an output binary without shell redirection and is allowed. **Required fix shape:** (a) remove `cargo` and `rustc` from the blanket read-only allowlist; (b) optionally add a conservative subcommand classifier where only clearly non-mutating forms (`cargo --version`, `cargo metadata --no-deps` if proven side-effect-free, `rustc --version`) are read-only; (c) route `check_bash` through the canonical `bash_validation` pipeline; (d) add regressions for `cargo install`, `cargo add`, `cargo update`, `cargo build`, `cargo test`, and `rustc -o` under `PermissionMode::ReadOnly`. **Why this matters:** build/package commands routinely modify the workspace, global cargo home, lockfiles, and build artifacts. A read-only exploratory lane should not be able to install packages or rewrite lock/build outputs just because the first token is `cargo`. 
Source: gaebal-gajae dogfood response to Clawhip message `1506657779883184250` on 2026-05-20.
499. **`TodoWrite` returns both `oldTodos` and `newTodos` in every tool result, so large task boards are echoed twice per update and repeatedly burn context even though the model only needs the delta/current list** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 14:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@3d2a047`. Code inspection: `run_todo_write` serializes `execute_todo_write(input)?` via `to_pretty_json`; `execute_todo_write` reads the persisted todo store into `old_todos`, writes the new/persisted list, then returns `TodoWriteOutput { old_todos, new_todos: input.todos, verification_nudge_needed }`. The JSON field names are `oldTodos` and `newTodos`, so every TodoWrite result contains the entire previous board plus the entire submitted board. For a 200-item task board, a one-item status change returns roughly 400 todo objects to the model; repeated status updates multiply the same backlog text across the context window. This is the same output-amplification class as NotebookEdit (#500), but on the core planning/task-control surface rather than notebooks. **Required fix shape:** (a) replace `oldTodos` with a compact diff (`changed:[{id/content,status_before,status_after}]`, `added`, `removed`, `unchanged_count`) or hide it behind a debug flag; (b) keep `newTodos` only if the current board is below a safe size, otherwise return `current_count`, `open_count`, `completed_count`, and a truncated active subset; (c) include `truncated:true`/`omitted_old_count` metadata for large boards; (d) add regressions proving single-item updates on large boards do not serialize the entire old board. **Why this matters:** TodoWrite is called frequently in multi-step sessions. Echoing full before/after state on every update creates context-window pressure, increases cost, and makes compaction summaries noisier without adding useful operator signal. 
Source: gaebal-gajae dogfood response to Clawhip message `1506665332478050344` on 2026-05-20.
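
The compact diff proposed in fix shape (a) for #499 might be computed roughly as follows. This is a sketch, not the real `TodoWriteOutput`: the `Todo` and `TodoDiff` types are simplified stand-ins, and todos are keyed by `content` on the assumption that no stable id is available.

```rust
use std::collections::{HashMap, HashSet};

#[derive(Clone, Debug, PartialEq)]
pub struct Todo {
    pub content: String,
    pub status: String,
}

#[derive(Debug, Default, PartialEq)]
pub struct TodoDiff {
    /// (content, status_before, status_after)
    pub changed: Vec<(String, String, String)>,
    pub added: Vec<String>,
    pub removed: Vec<String>,
    pub unchanged_count: usize,
}

/// Summarizes what changed between two boards instead of echoing both.
pub fn diff_todos(old: &[Todo], new: &[Todo]) -> TodoDiff {
    let old_status: HashMap<&str, &str> = old
        .iter()
        .map(|t| (t.content.as_str(), t.status.as_str()))
        .collect();
    let new_contents: HashSet<&str> = new.iter().map(|t| t.content.as_str()).collect();

    let mut diff = TodoDiff::default();
    for todo in new {
        match old_status.get(todo.content.as_str()) {
            Some(before) if *before == todo.status.as_str() => diff.unchanged_count += 1,
            Some(before) => diff.changed.push((
                todo.content.clone(),
                (*before).to_owned(),
                todo.status.clone(),
            )),
            None => diff.added.push(todo.content.clone()),
        }
    }
    for todo in old {
        if !new_contents.contains(todo.content.as_str()) {
            diff.removed.push(todo.content.clone());
        }
    }
    diff
}
```

With this shape, a one-item status change on a 200-item board serializes one `changed` entry and `unchanged_count: 199` rather than roughly 400 todo objects.
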
500. **`REPL` tool returns unbounded raw `stdout`/`stderr` strings, so a tiny inline snippet can inject megabytes of output into the model context just like the NotebookEdit/TodoWrite amplification class** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 15:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@25b8dbb`. Code inspection: `execute_repl` in `rust/crates/tools/src/lib.rs` runs python/node/shell snippets with piped stdout/stderr, then returns `ReplOutput { stdout: String::from_utf8_lossy(&output.stdout).into_owned(), stderr: String::from_utf8_lossy(&output.stderr).into_owned(), ... }` serialized via `to_pretty_json`. There is no byte cap, line cap, truncation marker, or structured artifact path. A user/model can run `REPL(language:"python", code:"print('x'*5_000_000)")` and the full 5MB output is returned to the model as a JSON string; stderr has the same issue. This is distinct from bash timeout/provenance handling because REPL is marketed as a structured execution helper, yet it has no output budget. **Required fix shape:** (a) cap `stdout` and `stderr` in `ReplOutput` (e.g. first/last 64KB) with `stdout_truncated`, `stderr_truncated`, `stdout_bytes`, `stderr_bytes`; (b) optionally spill full output to an artifact file and return `artifact_path` only when safe; (c) apply the same cap to PowerShell/bash if not already covered; (d) add regressions for large stdout/stderr from python/node/shell proving the serialized tool result stays bounded and includes truncation metadata. **Why this matters:** REPL is an easy path for accidental context-window blowups (`print(large_df)`, stack traces, generated JSON). Without output budgets, a single tool call can consume the context window, trigger compaction, or hide the useful signal behind megabytes of raw output. Source: gaebal-gajae dogfood response to Clawhip message `1506672878047989812` on 2026-05-20.
501. **`PowerShell`/shared `execute_shell_command` returns unbounded raw stdout/stderr in every foreground path, so shell output can still flood the model context even after the REPL output-amplification finding** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 15:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@7e73cdb`. Code inspection: `execute_powershell` delegates to `execute_shell_command`, whose timeout-success, timeout-kill, and no-timeout paths all construct `runtime::BashCommandOutput` with `stdout: String::from_utf8_lossy(&output.stdout).into_owned()` and `stderr: String::from_utf8_lossy(&output.stderr).into_owned()`. The background path discards output, but every foreground path serializes the full captured streams. A command like `PowerShell(command:"1..1000000 | % { 'x' }")` or any verbose test/log script can return megabytes of JSON-escaped output to the model. The timeout path is especially bad: it kills the process but still returns everything produced before the timeout plus the timeout footer, with no cap or `truncated` metadata. **Required fix shape:** (a) introduce a shared output-budget helper for all shell-like tools (`bash`, `PowerShell`, `REPL`) with byte caps, first/last slicing, and `stdout_truncated`/`stderr_truncated`/byte-count metadata; (b) preserve existing `raw_output_path`/`persisted_output_path` semantics by spilling full streams to artifact files when safe; (c) apply caps before JSON serialization in both success and timeout branches; (d) add regressions using stub `pwsh` and shell commands that emit >cap stdout/stderr and timeout-with-output. **Why this matters:** verbose tests and build tools are normal claw-code workflows. A single noisy PowerShell/shell command should not silently consume the context window or force compaction; the model needs a bounded summary plus a way to fetch artifacts if the full log is needed. 
Source: gaebal-gajae dogfood response to Clawhip message `1506680431620395091` on 2026-05-20.
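
The shared output-budget helper from fix shape (a) of #501 (also applicable to the REPL finding in #500) can be sketched as first/last slicing on character boundaries. `BoundedOutput` and `bound_output` are hypothetical names, and the marker text is illustrative only.

```rust
#[derive(Debug, PartialEq)]
pub struct BoundedOutput {
    pub text: String,
    pub truncated: bool,
    pub total_bytes: usize,
}

/// Keeps at most `head` leading and `tail` trailing bytes of `raw`,
/// snapping both cut points to UTF-8 character boundaries.
pub fn bound_output(raw: &str, head: usize, tail: usize) -> BoundedOutput {
    let total_bytes = raw.len();
    if total_bytes <= head.saturating_add(tail) {
        return BoundedOutput { text: raw.to_owned(), truncated: false, total_bytes };
    }
    // Move the head cut backwards and the tail cut forwards until each
    // lands on a character boundary (0 and len are always boundaries).
    let mut head_end = head;
    while !raw.is_char_boundary(head_end) {
        head_end -= 1;
    }
    let mut tail_start = total_bytes - tail;
    while !raw.is_char_boundary(tail_start) {
        tail_start += 1;
    }
    let omitted = tail_start - head_end;
    let text = format!(
        "{}\n[... {} bytes omitted ...]\n{}",
        &raw[..head_end],
        omitted,
        &raw[tail_start..]
    );
    BoundedOutput { text, truncated: true, total_bytes }
}
```

Applying this before JSON serialization in both the success and timeout branches would keep the tool result bounded; spilling the full stream to an artifact file is left out of the sketch.
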
502. **HTTP request tool truncates response bodies with byte indexing (`&body[..8192]`), so any multibyte UTF-8 character crossing the 8192-byte boundary can panic instead of returning a bounded tool result** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 16:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@51ea1aa`. Code inspection: the HTTP request handler reads `let body = response.text().unwrap_or_default();` then, when `body.len() > 8192`, builds `format!("{}\n\n[response truncated — {} bytes total]", &body[..8192], body.len())`. `String::len()` is bytes, and Rust string slicing requires a character boundary. A response like `"a".repeat(8191) + "é" + ...` has length >8192 but byte offset 8192 is in the middle of the two-byte `é`; `&body[..8192]` panics. Nearby `preview_text` correctly truncates by chars, so the safe helper already exists but is not used here. **Required fix shape:** (a) replace direct byte slicing with a UTF-8-safe truncation helper (`char_indices`/`floor_char_boundary` or reuse `preview_text` plus byte-count metadata); (b) report both original byte length and whether truncation occurred; (c) apply the same helper to all response/body truncation paths; (d) add a regression with a local HTTP response whose 8192nd byte is inside a multibyte character and assert the tool returns JSON with `truncated:true` instead of panicking. **Why this matters:** non-English pages, emoji-heavy logs, and binary-ish HTTP responses are common. A truncation path intended to protect the context window should never crash the tool runtime on valid UTF-8. Source: gaebal-gajae dogfood response to Clawhip message `1506687983397376103` on 2026-05-20.
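
The panic described in #502 and a UTF-8-safe replacement can be reproduced in isolation. The helper name `truncate_at_char_boundary` is hypothetical; it uses only `str::is_char_boundary`, so it does not depend on `floor_char_boundary` being available in the toolchain.

```rust
/// Returns the longest prefix of `body` that is at most `max_bytes` long and
/// ends on a character boundary, plus a flag saying whether anything was cut.
pub fn truncate_at_char_boundary(body: &str, max_bytes: usize) -> (&str, bool) {
    if body.len() <= max_bytes {
        return (body, false);
    }
    let mut end = max_bytes;
    // Byte offset 0 is always a boundary, so this loop terminates.
    while !body.is_char_boundary(end) {
        end -= 1;
    }
    (&body[..end], true)
}
```

For the body from the finding (8191 `a` bytes followed by `é`), `&body[..8192]` would panic, while this helper returns the 8191-byte prefix with the truncation flag set.
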
503. **`WebFetch`/`WebSearch` download full response bodies with `response.text()` before previewing, so large pages can allocate unbounded memory and stall the tool despite returning only a 900-char preview or eight search hits** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 16:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@f751c98`. Code inspection: `execute_web_fetch` calls `response.text()` into `body`, records `bytes = body.len()`, normalizes the whole body (`html_to_text` for HTML), collapses all whitespace, and only then trims via `preview_text(..., 900)`. `execute_web_search` similarly calls `response.text()` on the search response, then parses links and truncates hits to 8. The HTTP client has a 20s timeout and redirect cap, but there is no `Content-Length` guard, streaming byte cap, decompressed-size cap, or early abort. A large/decompression-bomb HTML response can force multi-megabyte/GB allocation and full text normalization even though the returned result is tiny. **Required fix shape:** (a) add a max download/decompressed body size for web tools (configurable but safe default); (b) reject/abort early on `Content-Length` above cap and enforce a streaming cap while reading chunks; (c) record `body_truncated:true`, `bytes_read`, and `content_length` metadata in the tool output; (d) make `html_to_text`/search extraction operate on the capped buffer; (e) add local HTTP regressions for huge `Content-Length`, chunked oversized body, and compressed oversized body proving bounded memory/time. **Why this matters:** `WebFetch` is a common lightweight alternative to browser automation. Its output is intentionally small, but the hidden pre-output work is unbounded; a hostile or simply large page can make a dogfood session look hung or OOM the runtime before any useful event/log signal is emitted. 
Source: gaebal-gajae dogfood response to Clawhip message `1506695531307733082` on 2026-05-20.
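
The streaming cap from fix shape (b) of #503 can be sketched against any `std::io::Read` body. `CappedBody` and `read_capped` are hypothetical names; the `Content-Length` pre-check and decompression handling are assumed to happen in the caller and are not shown.

```rust
use std::io::{self, Read};

#[derive(Debug)]
pub struct CappedBody {
    pub bytes: Vec<u8>,
    pub truncated: bool,
}

/// Reads at most `cap` bytes from `reader`. One extra byte is requested so
/// the function can tell "exactly cap bytes" apart from "more was available".
pub fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<CappedBody> {
    let mut bytes = Vec::new();
    reader.take(cap.saturating_add(1)).read_to_end(&mut bytes)?;
    let truncated = bytes.len() as u64 > cap;
    if truncated {
        bytes.truncate(cap as usize);
    }
    Ok(CappedBody { bytes, truncated })
}
```

Because the limit is applied while reading, an endless or oversized body never allocates more than `cap + 1` bytes, and normalization such as `html_to_text` would only ever see the capped buffer.
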
504. **`AgentOutput.laneEvents` produced by real agent runs violates the G004 conformance contract because production `LaneEvent::new` emits `metadata.seq=0` for every event while the validator requires strictly increasing sequence numbers** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 17:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8e8dea5`, building on Jobdori's live #505 `LaneEvent::new seq=0` report. Code inspection: `AgentOutput` manifests are initialized with `LaneEvent::started(...)` and terminal persistence appends `LaneEvent::blocked/failed/finished/commit_created(...)`; all of those convenience constructors route through `LaneEvent::new(... metadata: LaneEventMetadata::new(0, EventProvenance::LiveLane))`. `write_agent_manifest` only dedupes commit events and does not restamp sequences. Meanwhile `runtime/src/g004_conformance.rs::validate_lane_events` explicitly requires `/metadata/seq` to be present and strictly increasing (`if seq <= previous { "sequence must be strictly increasing" }`). Therefore any successful agent manifest with `lane.started` + `lane.finished` or failed manifest with `lane.started` + `lane.blocked` + `lane.failed` is invalid under the repo's own G004 contract, even before external consumers sort by seq. **Required fix shape:** (a) restamp lane event metadata seqs before manifest write (`for (idx,event) in lane_events.iter_mut().enumerate() { event.metadata.seq = idx as u64 + 1; }`) as an immediate containment, or better stamp from a per-session event counter at creation; (b) run `validate_g004_contract_bundle` (or an AgentOutput-specific wrapper) in tests against real initialized/success/failed manifests; (c) add a regression that `write_agent_manifest` never persists duplicate/non-increasing seqs after terminal append/dedupe; (d) keep `reconcile_terminal_events` sorting semantics meaningful by ensuring production seqs are nonzero and monotonic. 
**Why this matters:** this is event/log opacity in the literal contract layer: the product advertises machine-checkable event ordering, but real persisted manifests fail that checker. Downstream clawhips/watchers either cannot trust the conformance helper or must special-case production data. Source: gaebal-gajae dogfood response to Clawhip message `1506703078492082197` on 2026-05-20.
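
The immediate containment in fix shape (a) of #504 can be exercised on its own. The types below are simplified stand-ins for the real `LaneEvent`/`LaneEventMetadata`, and `seqs_strictly_increasing` only approximates the check attributed to `validate_lane_events`.

```rust
pub struct LaneEventMetadata {
    pub seq: u64,
}

pub struct LaneEvent {
    pub name: String,
    pub metadata: LaneEventMetadata,
}

impl LaneEvent {
    /// Mirrors the reported behavior: every new event starts with seq 0.
    pub fn new(name: &str) -> Self {
        LaneEvent { name: name.to_owned(), metadata: LaneEventMetadata { seq: 0 } }
    }
}

/// Restamps events with 1-based, strictly increasing sequence numbers.
pub fn restamp_seqs(events: &mut [LaneEvent]) {
    for (idx, event) in events.iter_mut().enumerate() {
        event.metadata.seq = idx as u64 + 1;
    }
}

/// Stand-in for the validator's ordering rule.
pub fn seqs_strictly_increasing(events: &[LaneEvent]) -> bool {
    events
        .windows(2)
        .all(|pair| pair[0].metadata.seq < pair[1].metadata.seq)
}
```

Restamping at write time is only a containment: it orders events by their position in the manifest, so a per-session counter at creation remains the better long-term fix.
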
505. **G004 report conformance expects `schemaVersion:"g004.report.v1"`, but the runtime's canonical report implementation emits `schemaVersion:"claw.report.v1"`, so first-party canonical reports cannot pass the repo's own G004 bundle validator without an undocumented schema rewrite** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 17:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@19a1918`. Code inspection: `runtime/src/g004_conformance.rs` hardcodes `const REPORT_SCHEMA_VERSION: &str = "g004.report.v1"` and `validate_reports` requires every `/reports/*/schemaVersion` to equal that value. Separately, the actual report schema module defines `pub const REPORT_SCHEMA_V1: &str = "claw.report.v1"`; `canonicalize_report` overwrites reports with `report.schema_version = REPORT_SCHEMA_V1.to_string()`, and the report registry/capability projection also key off `claw.report.v1`. Grep shows no adapter mapping `claw.report.v1` to `g004.report.v1`; the G004 fixture is hand-authored with `g004.report.v1`. Result: a real `CanonicalReportV1` produced by runtime and inserted into a G004 contract bundle is rejected by `validate_g004_contract_bundle` solely on schema-version mismatch. **Required fix shape:** (a) decide whether G004 should validate the first-party `claw.report.v1` schema directly or introduce an explicit projection adapter from `CanonicalReportV1` to `g004.report.v1`; (b) do not hardcode a competing report schema string in the conformance helper without a conversion path; (c) add a regression that builds a canonical report via `canonicalize_report`, wraps it in a G004 bundle with valid lane events, and verifies either acceptance or a typed `unsupported_schema_version` with documented adapter guidance; (d) update fixtures to use the same path real producers use. 
**Why this matters:** conformance tests should protect interoperability, not validate an artificial fixture dialect that production cannot emit. Otherwise downstream report consumers see event/log opacity: the report looks valid to the runtime registry and invalid to the G004 bundle validator. Source: gaebal-gajae dogfood response to Clawhip message `1506710631254986793` on 2026-05-20.
506. **SSE stream parsers repeatedly rescan and drain from the front of a growing buffer, making large batched streams quadratic and adding avoidable latency to provider streaming/event handling** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 18:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@ad8a0b3`, after Jobdori noted the unfiled parser shape. Code inspection: `api/src/sse.rs::SseParser::push` appends a chunk, then loops `while let Some(frame) = self.next_frame()`. `next_frame` searches `self.buffer.windows(2)` from byte 0 for `\n\n` (then separately scans `windows(4)` for `\r\n\r\n`), drains `..position+separator_len` into a new `Vec`, and converts to `String`. `api/src/providers/openai_compat.rs::next_sse_frame` duplicates the same algorithm. `runtime/src/sse.rs::IncrementalSseParser::push_chunk` does the string analogue with repeated `self.buffer.find('\n')` plus `drain(..=index)`. For a single network read or proxy flush containing thousands of small SSE frames/lines, each extracted frame/line rescans and moves the remaining buffer from the front; total work trends O(N²) in bytes/frames and allocates a fresh buffer per frame. **Required fix shape:** (a) replace front-drain parsing with an index/cursor-based parser (`scan_pos`, `consumed_until`) and compact the buffer only occasionally; (b) search for `\n\n`/`\r\n\r\n` from the previous scan position, not from 0 every loop; (c) share one bounded SSE framing helper between Anthropic and OpenAI-compatible providers; (d) add a micro-benchmark or regression that pushes one chunk containing 10k tiny frames and asserts linear-ish parse time/allocation behavior; (e) add a max pending-buffer size and emit a typed stream framing error when no separator arrives before the cap. **Why this matters:** streaming is the main event/log surface for prompt delivery, tool calls, and usage. 
A proxy, provider, or test harness that batches many small SSE frames into one chunk should not turn the parser into a CPU/allocation hotspot or make streaming look stalled before any model event is delivered. Source: gaebal-gajae dogfood response to Clawhip message `1506718176509952130` on 2026-05-20.
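
Fix shapes (a)/(b) of #506 describe a cursor-based parser; one way it could look is sketched below. `CursorSseParser` is a hypothetical type, it only splits frames on `\n\n` and `\r\n\r\n` as the existing parsers are described to do, and the pending-buffer cap from (e) is omitted.

```rust
#[derive(Default)]
pub struct CursorSseParser {
    buffer: Vec<u8>,
    /// Bytes before this offset have already been returned as frames.
    consumed: usize,
    /// Bytes before this offset are known not to start a separator.
    scan_pos: usize,
}

impl CursorSseParser {
    pub fn new() -> Self {
        Self::default()
    }

    /// Appends a chunk and returns every complete frame it unlocks.
    pub fn push(&mut self, chunk: &[u8]) -> Vec<String> {
        self.buffer.extend_from_slice(chunk);
        let mut frames = Vec::new();
        while let Some((end, sep_len)) = self.find_separator() {
            let frame = &self.buffer[self.consumed..end];
            frames.push(String::from_utf8_lossy(frame).into_owned());
            self.consumed = end + sep_len;
            self.scan_pos = self.consumed;
        }
        // Compact once per push instead of once per frame.
        if self.consumed > 0 {
            self.buffer.drain(..self.consumed);
            self.scan_pos -= self.consumed;
            self.consumed = 0;
        }
        frames
    }

    /// Scans forward from `scan_pos`; returns (separator start, length).
    fn find_separator(&mut self) -> Option<(usize, usize)> {
        let buf = &self.buffer;
        let mut i = self.scan_pos;
        while i < buf.len() {
            if buf[i..].starts_with(b"\r\n\r\n") {
                return Some((i, 4));
            }
            if buf[i..].starts_with(b"\n\n") {
                return Some((i, 2));
            }
            i += 1;
        }
        // Resume 3 bytes before the end next time, in case a separator is
        // split across chunks, but never before already-consumed bytes.
        self.scan_pos = buf.len().saturating_sub(3).max(self.consumed);
        None
    }
}
```

Each byte is scanned a bounded number of times and the buffer is moved once per `push`, so a single chunk holding thousands of tiny frames costs time roughly linear in its size.
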
507. **Anthropic requests can serialize/render the full message body three times before the real `/v1/messages` call: local preflight JSON byte estimate, remote `count_tokens` request body, then final message request body** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 18:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@7bc373f`, after Jobdori filed the sibling OpenAI-compatible double-build issue (#508). Code inspection: `AnthropicProvider::message` and `stream_message` call `self.preflight_message_request(...)`; that first invokes `super::preflight_message_request(request)`, whose `estimate_serialized_tokens` serializes `messages`, `system`, `tools`, and `tool_choice` via `serde_json::to_vec`. If a model limit is known, `preflight_message_request` then calls `count_tokens`, which does `self.request_profile.render_json_body(request)?`, strips fields, and posts the full `/v1/messages/count_tokens` body. After preflight succeeds, `send_raw_request` renders the same full body again with `self.request_profile.render_json_body(request)?` and sends `/v1/messages`. So a large session pays at least one local serialization plus two full Anthropic-body renders; if `count_tokens` fails, the fallback still paid for rendering that body before the final render. **Required fix shape:** (a) memoize/render the Anthropic request body once per call and reuse it for count_tokens and final send where schemas are identical or share a base projection; (b) use a streaming/estimated byte counter for the local guard instead of serializing large subtrees into throwaway `Vec`s; (c) skip remote `count_tokens` when the local estimate is far below known limits unless strict mode requires it; (d) add an instrumentation test with a large message vector proving one-shot and streaming calls do not render/serialize the full request more than once per network target. 
**Why this matters:** long-context sessions are already context-window and latency sensitive. Doing multiple complete JSON render/serialization passes before every Anthropic call wastes CPU/memory and makes prompt delivery look slower or stalled under large histories, especially when paired with retries and stream startup. Source: gaebal-gajae dogfood response to Clawhip message `1506725730392870952` on 2026-05-20.
508. **Config/plan-mode writes use direct `std::fs::write` with no atomic rename or advisory lock, so crashes and concurrent tool calls can corrupt `settings.json`, `settings.local.json`, or plan-mode state** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 19:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@e2b96ea`. Code inspection: `execute_config` reads a settings JSON object, mutates it, and persists through `write_json_object`; `EnterPlanMode`/`ExitPlanMode` also call `write_json_object` for `.claw/settings.local.json` and `write_plan_mode_state` for `.claw/tool-state/plan-mode.json`. Both `write_json_object` and `write_plan_mode_state` create parent directories and then call `std::fs::write(path, serde_json::to_string_pretty(...))` directly on the destination. There is no temp file + fsync + rename, and no lock around the read-modify-write window. By contrast, the OAuth credentials writer already uses a safer `.tmp` then rename pattern. Two concurrent `Config`/plan-mode calls can both read the same old document, and the last writer then silently overwrites the other's update; a crash/interruption mid-write can leave a truncated JSON settings file that later startup/doctor must diagnose. **Required fix shape:** (a) introduce a shared `atomic_write_json(path, value)` helper using same-directory temp file, flush/fsync, and rename; (b) wrap read-modify-write config mutations in an advisory lock (`settings.json.lock` / `settings.local.json.lock`) or a compare-and-swap retry loop; (c) use the same helper for `write_plan_mode_state`, `write_agent_manifest` where appropriate, and any future JSON state files; (d) add stress/regression tests with two concurrent config writes and a simulated partial-write failure proving no malformed JSON and no lost sibling setting. **Why this matters:** config and plan-mode are control-plane state. 
A supposedly safe tool call should not be able to silently lose another setting or leave the workspace unable to load settings after an interrupted write; that turns a small config action into startup friction and stale permission-mode confusion. Source: gaebal-gajae dogfood response to Clawhip message `1506733276037910700` on 2026-05-20.
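
The temp-file, fsync, and rename sequence from fix shape (a) of #508 can be sketched with the standard library alone. `atomic_write` is a hypothetical helper that takes already-serialized bytes (the proposed `atomic_write_json` would serialize first); the temp-file naming is an assumption, and neither the advisory lock from (b) nor an fsync of the parent directory is shown.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Writes `contents` to a temp file next to `path`, syncs it to disk, then
/// renames it over the destination so readers never see a partial file.
pub fn atomic_write(path: &Path, contents: &[u8]) -> io::Result<()> {
    let dir = path
        .parent()
        .filter(|p| !p.as_os_str().is_empty())
        .unwrap_or(Path::new("."));
    fs::create_dir_all(dir)?;

    let file_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    // Same directory as the target, so the rename stays on one filesystem.
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp_path = dir.join(tmp_name);

    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(contents)?;
    tmp.sync_all()?;
    drop(tmp);

    fs::rename(&tmp_path, path)
}
```

The rename prevents truncated files but not lost updates: two callers that both read the old document still need the lock or compare-and-swap loop from (b). A failed write also leaves the temp file behind, which a fuller version would clean up.
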
509. **`write_file`/`edit_file` tool results echo full file contents (`content`, `original_file`) even though they also compute structured patches, so large file edits can double-return megabytes of text into the model context** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 19:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@8c62fff`. Code inspection: `runtime/src/file_ops.rs::write_file` reads the prior file (if any), writes the new content, then returns `WriteFileOutput { content: content.to_owned(), original_file, structured_patch: make_patch(...) }`. `edit_file` similarly reads the full original file, computes `updated`, writes it, and returns `EditFileOutput { old_string, new_string, original_file: original_file.clone(), structured_patch: make_patch(&original_file, &updated), ... }`. The tool already has a structured patch/diff, but the serialized result still includes full pre/post content fields. Updating a 1MB file can return roughly 1MB `content` plus 1MB `original_file` plus patch metadata; a tiny `edit_file` change on a large file returns the entire original file even when a short diff would suffice. This is the file-edit sibling of the NotebookEdit/TodoWrite/REPL output-amplification cluster. **Required fix shape:** (a) make write/edit results patch-first and omit full `content`/`original_file` by default; (b) include bounded previews plus `original_bytes`, `new_bytes`, `content_truncated`, and `original_truncated` metadata when useful; (c) expose an explicit debug/full-output flag only for small files or trusted callers; (d) add regressions for editing/writing a large file proving serialized tool output remains bounded while the structured patch still identifies the change. **Why this matters:** file editing is the core coding surface. 
Returning full file bodies after every update wastes context, raises costs, and can force compaction precisely during code-review/debug loops where the model only needs a concise diff and path/byte metadata. Source: gaebal-gajae dogfood response to Clawhip message `1506740829912567824` on 2026-05-20.
510. **`read_file` with no `limit` can return the entire 10MB text file into the model context, because the file-size guard is a disk-read cap, not an output budget** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 20:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@154e7ed`. Code inspection: `runtime/src/file_ops.rs::read_file` rejects files larger than `MAX_READ_SIZE` (10MB) and binary-looking files, then reads the entire file via `fs::read_to_string`, splits into `lines`, and when `limit` is absent sets `end_index = lines.len()`. The serialized `ReadFileOutput.file.content` is therefore the full selected content; for any text file at or below 10MB, the default read emits all of it. `limit` is line-based and optional, with no byte/token cap, no `content_truncated` metadata, and no default windowing. This is distinct from #509 write/edit amplification: a read-only exploratory call can still inject megabytes into context by omitting `limit`, even though most callers need the first window plus total metadata. **Required fix shape:** (a) add a default output byte/line cap for `read_file` (for example first 200 lines / 64KB) unless the caller explicitly requests a bounded range; (b) enforce a hard serialized-output byte cap even when `limit` is huge; (c) return `truncated`, `total_lines`, `selected_start_line`, `selected_end_line`, and `total_bytes` so callers can page intentionally; (d) add regressions for 1MB and 10MB text files proving default read output is bounded and explicit paging works without exceeding the cap. **Why this matters:** `read_file` is allowed in read-only mode and is the first tool a claw uses during debugging. A single accidental full-file read of a generated JSON/log/source bundle can consume the context window and force compaction before any useful analysis happens. Source: gaebal-gajae dogfood response to Clawhip message `1506755925225111724` on 2026-05-20.
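
Fix shapes (a) through (c) of #510 amount to default windowing plus a hard byte cap. The sketch below assumes a 200-line default (the finding's own example) and whole-line truncation; `ReadWindow`, `select_window`, and `DEFAULT_LINE_LIMIT` are hypothetical names, and line indices are 0-based with an exclusive end.

```rust
pub struct ReadWindow {
    pub content: String,
    /// 0-based index of the first selected line.
    pub selected_start_line: usize,
    /// 0-based index one past the last selected line.
    pub selected_end_line: usize,
    pub total_lines: usize,
    /// True when lines remain after the selected window.
    pub truncated: bool,
}

const DEFAULT_LINE_LIMIT: usize = 200;

/// Selects up to `limit` lines (default 200) starting at `offset`, stopping
/// early if adding another whole line would exceed `max_bytes`.
pub fn select_window(
    text: &str,
    offset: usize,
    limit: Option<usize>,
    max_bytes: usize,
) -> ReadWindow {
    let total_lines = text.lines().count();
    let line_limit = limit.unwrap_or(DEFAULT_LINE_LIMIT);
    let mut content = String::new();
    let mut taken = 0;
    for line in text.lines().skip(offset).take(line_limit) {
        if content.len() + line.len() + 1 > max_bytes {
            break;
        }
        content.push_str(line);
        content.push('\n');
        taken += 1;
    }
    let end = offset + taken;
    ReadWindow {
        content,
        selected_start_line: offset,
        selected_end_line: end,
        total_lines,
        truncated: end < total_lines,
    }
}
```

A caller can page by passing the previous `selected_end_line` as the next `offset`. One known gap in this sketch: a single line longer than `max_bytes` yields an empty window, so a real helper would need to split or preview such lines.
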
511. **`grep_search` collects every file and every matching content line before applying `head_limit`, so a small requested result can still scan/read/store an entire workspace's worth of data** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 21:00 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@02f2887`. Code inspection: `grep_search_impl` first calls `collect_search_files(&base_path)?`, which walks the entire tree into a `Vec<PathBuf>` with no ignored-dir policy, file-count cap, or early stop. For every candidate it then does `fs::read_to_string(&file_path)` with no per-file size guard (unlike `read_file`'s 10MB max) and, for `output_mode == "content"`, pushes every matched/context line into `content_lines`. Only after the full traversal does it call `apply_limit(filenames, input.head_limit, input.offset)` and later `apply_limit(content_lines, head_limit, offset)`. The default limit is 250 output items, but it is not an execution budget: a repo with huge generated text files or thousands of matches still pays the full read/regex/memory cost before returning 250 lines. **Required fix shape:** (a) stream search results and stop once `offset + head_limit` content lines/files have been collected, continuing past that point only if `count` mode explicitly needs totals; (b) add skip-dir and file-size guards shared with `glob_search` (`.git`, `node_modules`, `target`, etc.) and binary detection; (c) expose `truncated:true`, `files_scanned`, `files_skipped_size`, and `matches_seen` metadata; (d) add regression fixtures with a huge file and many matches proving `head_limit:1` does not read/accumulate the entire workspace. **Why this matters:** grep is a read-only diagnostic primitive. `head_limit` currently protects only the final JSON size, not runtime CPU/memory or accidental context blowups, so common searches in generated/vendor-heavy repos can look like tool hangs even when the caller asked for one line.
Source: gaebal-gajae dogfood response to Clawhip message `1506763474955403414` on 2026-05-20.
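A minimal sketch of the early-stop behavior in fix shapes (a) and (c) for #511, under simplifying assumptions: `grep_bounded` and `GrepOutcome` are hypothetical names, matching is a plain substring test instead of the real regex, and files arrive as in-memory `(path, body)` pairs. A real implementation would check the size via file metadata before reading the file at all.

```rust
/// Hypothetical result type; fields mirror the metadata proposed in fix (c).
#[derive(Debug, Default, PartialEq)]
pub struct GrepOutcome {
    pub lines: Vec<String>,
    pub truncated: bool,
    pub files_scanned: usize,
    pub files_skipped_size: usize,
    pub matches_seen: usize,
}

/// Stream over `files` lazily and stop as soon as `offset + head_limit`
/// matches have been seen, so later files are never pulled from the iterator.
pub fn grep_bounded<'a, I>(
    files: I,
    needle: &str,
    offset: usize,
    head_limit: usize,
    max_file_bytes: usize,
) -> GrepOutcome
where
    I: IntoIterator<Item = (&'a str, &'a str)>,
{
    let mut out = GrepOutcome::default();
    let budget = offset.saturating_add(head_limit);

    'files: for (path, body) in files {
        // Size guard (stand-in for a metadata check done before reading).
        if body.len() > max_file_bytes {
            out.files_skipped_size += 1;
            continue;
        }
        out.files_scanned += 1;

        for (idx, line) in body.lines().enumerate() {
            if !line.contains(needle) {
                continue;
            }
            if out.matches_seen == budget {
                // A further match exists beyond the budget: report and stop.
                out.truncated = true;
                break 'files;
            }
            out.matches_seen += 1;
            // Matches before `offset` are counted but not emitted.
            if out.matches_seen > offset {
                out.lines.push(format!("{path}:{}:{line}", idx + 1));
            }
        }
    }
    out
}
```

Because the loop consumes a lazy iterator, pairing it with a lazy directory walk would bound traversal as well as output. `count` mode, which needs totals, would have to bypass the early `break` as noted in (a).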
512. **`glob_search` traverses, stores, and sorts all matches by modification time before truncating to 100, so `truncated:true` still means the full workspace scan already happened** — dogfooded 2026-05-20 from the `#clawcode-building-in-public` 21:30 UTC nudge on `/home/bellman/Workspace/claw-code-pr2967` with branch/origin `docs/roadmap-workdir-provenance@0864f39`. Code inspection: `glob_search_impl` expands brace patterns, derives a walk root, then runs `WalkDir::new(&walk_root).filter_entry(...)` and pushes every matching file into `matches`. Only after all patterns and all entries are exhausted does it sort the entire `matches` vector by `fs::metadata(path).modified()` and compute `truncated = matches.len() > 100`, then `take(100)` for output. The ignored-dir list helps with common vendor dirs, but there is no max traversal count, max match count, timeout/deadline, or early stop once enough results are known. A broad pattern like `**/*` in a generated workspace can collect/sort/stat tens or hundreds of thousands of paths just to return 100 names. **Required fix shape:** (a) add execution budgets for glob traversal (`max_entries_scanned`, `max_matches_collected`, optional deadline); (b) stream/top-k results instead of collecting every match before truncation, or make `sort_by_mtime` opt-in when exact newest-100 is required; (c) return `entries_scanned`, `matches_seen`, `truncated_reason`, and `ignored_dirs` metadata; (d) share traversal budget primitives with `grep_search` so read-only discovery tools fail/degrade consistently; (e) add a regression with >1000 generated files proving `glob_search` returns promptly without storing/sorting every match when capped. **Why this matters:** glob is a safe-looking read-only discovery tool, but broad globs in large repos are a common source of startup friction and apparent hangs. Output truncation alone is not enough; the work done before truncation must also be bounded and observable.
Source: gaebal-gajae dogfood response to Clawhip message `1506771028834123986` on 2026-05-20.
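A minimal sketch of the top-k idea in fix shapes (a) through (c) for #512, not the current `glob_search_impl`. `newest_k`, `GlobOutcome`, and the two reason strings are hypothetical; entries are assumed to arrive as `(path, mtime_seconds)` pairs from some lazy walker, and `is_match` stands in for the glob pattern test. A min-heap of size `k` keeps only the newest `k` matches, so memory stays O(k) regardless of how many paths match.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical result type; fields mirror the metadata proposed in fix (c).
#[derive(Debug, PartialEq)]
pub struct GlobOutcome {
    /// Matching paths, newest first.
    pub paths: Vec<String>,
    pub entries_scanned: usize,
    pub matches_seen: usize,
    pub truncated_reason: Option<&'static str>,
}

/// Keep the `k` newest matching entries while scanning at most
/// `max_entries_scanned` entries from a lazy source.
pub fn newest_k<I>(
    entries: I,
    k: usize,
    max_entries_scanned: usize,
    is_match: impl Fn(&str) -> bool,
) -> GlobOutcome
where
    I: IntoIterator<Item = (String, u64)>,
{
    // `Reverse` turns the max-heap into a min-heap keyed on (mtime, path),
    // so `pop()` evicts the oldest retained match.
    let mut heap: BinaryHeap<Reverse<(u64, String)>> = BinaryHeap::new();
    let mut entries_scanned = 0usize;
    let mut matches_seen = 0usize;
    let mut truncated_reason = None;

    for (path, mtime) in entries {
        if entries_scanned == max_entries_scanned {
            // Traversal budget exhausted: stop walking instead of finishing the scan.
            truncated_reason = Some("max_entries_scanned");
            break;
        }
        entries_scanned += 1;
        if !is_match(&path) {
            continue;
        }
        matches_seen += 1;
        heap.push(Reverse((mtime, path)));
        if heap.len() > k {
            heap.pop();
            truncated_reason = Some("result_limit");
        }
    }

    // Ascending order of `Reverse<_>` is descending mtime, i.e. newest first.
    let paths = heap
        .into_sorted_vec()
        .into_iter()
        .map(|Reverse((_, path))| path)
        .collect();

    GlobOutcome { paths, entries_scanned, matches_seen, truncated_reason }
}
```

The heap gives the exact newest `k` among the entries actually scanned. Once the scan budget trips, the answer is only the newest `k` of a prefix of the walk, which is why `truncated_reason` distinguishes the two cases.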