Episode #2 | December 18, 2025 @ 9:00 PM EST

Planning, Parsing, and the Problem of Partial Understanding

Guest

Simon Willison (Independent Researcher, Creator of Datasette)
Announcer The following program features simulated voices generated for educational and philosophical exploration.
Greg Evans Good evening. I'm Greg Evans.
Andrea Moore And I'm Andrea Moore. Welcome to Simulectics Radio.
Andrea Moore Last night we examined Claude Code's high-level capabilities with Dario Amodei. Tonight we're getting into the architectural details—how does a system that can edit multiple files simultaneously actually work? What's happening under the hood when you ask it to refactor a feature that touches a dozen different modules?
Greg Evans And more fundamentally, how does it reason about code structure? Traditional autocomplete predicts the next token. Claude Code needs to understand dependency graphs, anticipate side effects, and maintain consistency across a coordinated set of changes. That requires a different kind of model architecture.
Andrea Moore Joining us is Simon Willison, independent researcher and creator of Datasette, who has written extensively about LLM-assisted coding and experimented with these tools in his own development work. Simon, welcome.
Simon Willison Thanks for having me. Happy to dig into the details.
Greg Evans Let's start with the core architectural question. When Claude Code reads a project and proposes changes, what kind of internal representation is it building? Is it constructing something like an abstract syntax tree, or is it working at a more semantic level?
Simon Willison It's fascinating because we don't have complete transparency into the internal representations, but from observing behavior, it seems to be doing both. It clearly parses code—it understands syntax and can identify functions, classes, imports. But it also builds higher-level conceptual models. It knows that this controller talks to that database model, that this utility function is used in multiple places. The representation isn't purely structural or purely semantic—it's a hybrid that captures both the mechanical relationships in code and the conceptual relationships between components.
Andrea Moore How reliable is that understanding? I've seen Claude Code make impressively coherent edits, but I've also seen it confidently suggest changes that would break calling code elsewhere in the project. What are the failure modes?
Simon Willison The most common failure mode is incomplete dependency tracking. It might understand direct dependencies—if function A calls function B, it knows changing B's signature requires updating A. But transitive dependencies are harder. If B is called by A, and A is called by C, and C is used in a specific context that makes assumptions about behavior, Claude Code might miss that chain. It's especially problematic with implicit dependencies—configuration files, environment variables, runtime behavior that isn't explicit in the code itself.
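The caller chain Simon describes (C calls A, A calls B) can be sketched as a reverse walk over a call graph. This is a hypothetical illustration of transitive dependency tracking, not Claude Code's actual implementation:

```python
from collections import defaultdict, deque

def transitive_callers(call_graph, target):
    """Find every function that directly or transitively calls `target`.

    `call_graph` maps each function name to the set of functions it calls.
    """
    # Invert the graph: callee -> set of direct callers.
    callers = defaultdict(set)
    for fn, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(fn)

    # Breadth-first walk up the caller chain.
    affected, queue = set(), deque([target])
    while queue:
        fn = queue.popleft()
        for caller in callers[fn]:
            if caller not in affected:
                affected.add(caller)
                queue.append(caller)
    return affected

# The chain from the interview: C calls A, A calls B.
graph = {"C": {"A"}, "A": {"B"}, "B": set()}
print(transitive_callers(graph, "B"))  # {'A', 'C'}
```

Changing B's signature flags both A and C for review; missing the second hop is exactly the failure mode described above. Implicit dependencies (config files, environment variables) never appear in such a graph, which is why they are the hardest to track.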
Greg Evans That makes sense given the context window limitations we discussed yesterday. Even with a large window, you can't load every possible execution path. How does Claude Code decide which parts of the codebase to read when planning changes?
Simon Willison It uses a combination of strategies. First, it does static analysis—parsing imports and call graphs to identify directly related code. Second, it uses semantic search—looking for files that contain conceptually related code based on naming and documentation. Third, it can read test files to understand how components are actually used in practice. The key insight is that it doesn't try to be exhaustive. It samples the most relevant portions and reasons from there. Sometimes it guesses right, sometimes it misses important context.
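The first strategy, static analysis of imports, can be sketched with Python's standard `ast` module. A minimal version (illustrative only; any real tool would also resolve relative imports and follow call graphs):

```python
import ast

def direct_imports(source: str) -> set[str]:
    """Extract the module names a Python source file imports."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.add(node.module)
    return modules

code = "import os\nfrom utils import helper\n"
print(direct_imports(code))  # {'os', 'utils'}
```

Mapping each import back to a file in the repository gives the directly related code; the semantic-search and test-reading strategies then widen the net beyond what static analysis can see.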
Andrea Moore Sampling sounds efficient but risky. How do you validate that it hasn't missed something critical? You can't review code you don't know exists.
Simon Willison This is where testing becomes absolutely essential. If your test suite is comprehensive, many of the missed dependencies will be caught when tests fail. Claude Code can run tests itself and often does—it'll make changes, run the test suite, see failures, and attempt fixes. But you're right that inadequate test coverage is dangerous. I've started thinking of test coverage as a prerequisite for using autonomous coding tools safely. If you don't have good tests, you're flying blind.
Greg Evans Let's talk about the multi-file editing capability specifically. When Claude Code makes coordinated changes across several files, is it editing them sequentially or does it have some kind of transaction model where it plans all changes before executing any?
Simon Willison From what I've observed, it builds a plan that includes all intended edits before making any changes. You can see this in the output—it'll say something like 'I need to modify these three files: add a new function in utils.py, update the import in main.py, and add a test in test_utils.py.' Then it executes those edits. If something fails partway through, it can roll back. It's not a database-level transaction with ACID guarantees, but it has rollback capabilities. The planning phase is crucial because it allows the system to reason about consistency across changes before committing to them.
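The plan-then-execute-with-rollback behavior can be sketched as a batch of file edits with backups, restored if any write fails. This is an illustration of the pattern, not ACID semantics and not Claude Code's internal mechanism:

```python
import os, tempfile

def apply_plan(edits):
    """Apply [(path, new_text), ...]; if any write fails, restore originals."""
    backups = {}
    try:
        for path, new_text in edits:
            if os.path.exists(path):
                with open(path) as f:
                    backups[path] = f.read()   # snapshot the original
            else:
                backups[path] = None           # new file; rollback deletes it
            with open(path, "w") as f:
                f.write(new_text)
    except OSError:
        for path, old in backups.items():
            if old is None:
                os.remove(path)
            else:
                with open(path, "w") as f:
                    f.write(old)
        raise

# The three-file plan from the interview, scaled down to two files.
with tempfile.TemporaryDirectory() as d:
    plan = [(os.path.join(d, "utils.py"), "def helper(): ...\n"),
            (os.path.join(d, "main.py"), "from utils import helper\n")]
    apply_plan(plan)
    print(sorted(os.listdir(d)))  # ['main.py', 'utils.py']
```

The value of the upfront plan is that consistency across files can be reviewed before a single byte changes.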
Andrea Moore Does the planning phase ever get it wrong? Can it propose a seemingly coherent plan that turns out to be problematic in execution?
Simon Willison Absolutely. Plans can be logically consistent but practically flawed. For example, it might plan to refactor a function without realizing that function has performance characteristics that matter in a high-traffic code path. Or it might propose an abstraction that's technically correct but introduces unnecessary complexity. The plan looks good on paper, but the implementation has unintended consequences. This is where human review of the plan itself is valuable—you can spot conceptual problems before any code changes.
Greg Evans You mentioned rollback capability. How does that work? Is Claude Code maintaining something like a version control history internally, or is it relying on external tools like git?
Simon Willison It integrates with git. When you use Claude Code in a git repository, it can create commits, branches, and revert changes using standard version control. It's not maintaining its own parallel history—it's working within the same tools developers already use. This is actually quite elegant because it means all changes are visible in your git history, and you can use familiar commands to review, modify, or undo what the AI did. The tool respects the existing development infrastructure rather than trying to replace it.
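Because the tool works through ordinary git, the whole undo story reduces to standard commands. A thin wrapper plus the workflow sketched as comments (the branch name and commit message are hypothetical):

```python
import subprocess

def git(repo: str, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    return subprocess.run(["git", "-C", repo, *args],
                          capture_output=True, text=True, check=True).stdout

# The agent-driven workflow, using nothing beyond standard version control:
# git(repo, "checkout", "-b", "ai/refactor-auth")    # isolate the changes
# git(repo, "add", "-A")
# git(repo, "commit", "-m", "Refactor auth (AI-generated)")
# git(repo, "revert", "--no-edit", "HEAD")           # undo if review rejects it
```

Everything the AI does lands in the same history a human reviewer already knows how to read, bisect, and revert.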
Andrea Moore That raises workflow questions. In a team environment, are people having Claude Code commit directly to main branches, or are they using it to generate pull requests that go through normal review processes?
Simon Willison Best practice is definitely the latter. Use Claude Code to generate changes on a feature branch, then create a PR for human review just like you would with code written by a human developer. Some teams are experimenting with AI-generated PRs that are explicitly labeled as such, so reviewers know to scrutinize them differently. The key principle is that AI-generated code should go through the same quality gates as human-written code. Just because it came from an AI doesn't mean it should bypass code review.
Greg Evans Let's discuss reasoning about code quality. Can Claude Code evaluate its own suggestions? Does it have any notion of whether a proposed change is elegant versus hacky, maintainable versus fragile?
Simon Willison It has preferences encoded from training data—patterns that appear frequently in high-quality codebases are favored over patterns associated with problematic code. But whether that constitutes genuine aesthetic judgment is debatable. It will generally avoid obvious code smells like deeply nested conditionals or massive functions. It tends to follow common design patterns. But it doesn't have the kind of judgment an experienced developer has about whether a particular abstraction will age well or whether this complexity is worth the flexibility. Those judgments require experience with long-term maintenance, which the model doesn't have.
Andrea Moore So it optimizes for what looks like good code rather than what is good code in context.
Simon Willison That's a fair characterization. It recognizes surface-level quality markers. Whether that translates to actual long-term maintainability depends on whether those markers are reliable signals in your specific project. Sometimes they are, sometimes they're not.
Greg Evans I want to ask about the architecture of the model itself. Claude Code is built on Claude, which is a general-purpose language model. What specific adaptations or fine-tuning make it suitable for code generation at this level of autonomy?
Simon Willison I don't have insider knowledge of Anthropic's training process, but we can infer some things. The base model clearly has extensive exposure to code repositories, not just individual functions but entire projects with their structure and conventions. There's likely reinforcement learning from human feedback specifically around code quality—humans rating different implementations and guiding the model toward better practices. And there's probably specialized training on the agentic workflow itself—how to formulate plans, how to verify changes, when to ask for clarification. It's not just code completion scaled up; it's a model trained to behave as a coding agent.
Andrea Moore You mentioned RLHF for code quality. Whose standards of quality? Different organizations have radically different coding philosophies. Google's code looks different from a startup's code, which looks different from academic research code. Can one model serve all these different quality paradigms?
Simon Willison It's a valid concern. The model has general notions of quality derived from broad training data, but it should adapt to local standards by reading your specific codebase. That's the theory, anyway. In practice, it might sometimes impose external conventions that don't match your team's preferences. This is where explicit style guides and linting rules become important—they provide machine-readable specifications of local standards that the AI can follow. You can think of it as configuring the AI's quality model for your context.
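A machine-readable specification of local standards can be as simple as a linter configuration. A hypothetical `pyproject.toml` fragment, with Ruff shown as one example; the specific rules and limits are illustrative, not recommendations:

```toml
# Hypothetical pyproject.toml fragment -- values are examples only.
[tool.ruff]
line-length = 100          # the local convention the agent should respect

[tool.ruff.lint]
select = ["E", "F", "I"]   # pycodestyle errors, pyflakes, import order
```

An agent that runs the linter after each edit is, in effect, checking its output against the team's encoded preferences rather than its training-data defaults.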
Greg Evans Let's talk about error handling. When Claude Code encounters an error—syntax error, test failure, runtime exception—how does it diagnose and fix problems in its own code?
Simon Willison It uses error messages as feedback. If compilation fails, it reads the compiler error, identifies the problematic line, and attempts a fix. If tests fail, it reads the test output, understands what was expected versus what happened, and modifies code accordingly. The debugging loop is similar to human debugging—hypothesis formation based on error messages, targeted fixes, verification through re-running. The main difference is speed and patience. It can iterate much faster than a human and doesn't get frustrated by repeated failures. But it can also get stuck in loops, trying the same failed approach multiple times.
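The first step of that debugging loop, locating the fault from the error message, can be sketched as pulling the innermost frame out of a Python traceback (an illustrative heuristic, not the tool's actual parser):

```python
import re

def failure_location(traceback_text: str):
    """Extract the last file/line reference from a Python traceback --
    the starting point for a targeted fix."""
    frames = re.findall(r'File "([^"]+)", line (\d+)', traceback_text)
    if not frames:
        return None
    path, line = frames[-1]  # innermost frame is usually the culprit
    return path, int(line)

tb = '''Traceback (most recent call last):
  File "main.py", line 12, in <module>
    helper()
  File "utils.py", line 4, in helper
    return 1 / 0
ZeroDivisionError: division by zero'''
print(failure_location(tb))  # ('utils.py', 4)
```

From there the loop is hypothesis, fix, re-run: the same cycle a human follows, just faster and without the frustration.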
Andrea Moore Getting stuck in loops sounds like a significant limitation. Does it recognize when it's not making progress?
Simon Willison Sometimes, not always. In my experience, it will occasionally try the same fix three or four times before giving up or asking for help. Better implementations have circuit breakers—if the same error occurs repeatedly, stop and request human intervention. But this is still an area where the tools are evolving. Ideally, Claude Code would recognize futile iteration and escalate to a human rather than burning time on approaches that aren't working.
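The circuit breaker Simon describes can be sketched as a counter keyed on an error signature: once the same failure repeats past a threshold, stop iterating and escalate. A minimal illustration (the threshold and signature format are assumptions):

```python
from collections import Counter

class CircuitBreaker:
    """Escalate to a human once the same error signature repeats too often."""
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.seen = Counter()

    def record(self, error_signature: str) -> bool:
        """Return True when the agent should stop and ask for help."""
        self.seen[error_signature] += 1
        return self.seen[error_signature] >= self.max_repeats

breaker = CircuitBreaker(max_repeats=3)
for attempt in range(5):
    if breaker.record("TypeError in utils.helper"):
        print(f"Stopping after attempt {attempt + 1}: asking a human.")
        break
```

Keying on the error signature rather than the attempt count matters: a new error means progress, while the same error three times means the current approach is not working.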
Greg Evans We're running short on time, but I want to ask about the future architecture. If you could redesign Claude Code from scratch knowing what you know now, what would you change?
Simon Willison I'd invest more in formal verification of critical invariants. Right now, the system reasons about code somewhat loosely. If we could integrate formal methods—proving that certain properties hold before and after changes—we could catch whole classes of errors that slip through testing. It would be expensive computationally, but for safety-critical code, the tradeoff might be worth it. Imagine Claude Code that can not only edit your authentication system but also prove it maintains security properties.
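Real formal verification means machine-checked proofs, which is well beyond a snippet. But the underlying idea, stating an invariant and checking it mechanically after every change, can be shown with a crude stand-in: exhaustive checking over a small domain (a weak approximation of proof, named as such):

```python
def clamp(x: int, lo: int, hi: int) -> int:
    """Keep x within [lo, hi]."""
    return max(lo, min(hi, x))

def check_invariant(lo: int, hi: int, domain) -> bool:
    """Exhaustively confirm the postcondition lo <= clamp(x, lo, hi) <= hi.
    Not a proof -- only the inputs in `domain` are checked -- but the same
    shape: state a property, verify it mechanically after each edit."""
    return all(lo <= clamp(x, lo, hi) <= hi for x in domain)

print(check_invariant(0, 10, range(-50, 51)))  # True
```

A proper formal-methods integration would discharge the property for all inputs, which is what would let an agent claim the authentication change preserves its security invariants.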
Andrea Moore That sounds ambitious but useful. Verification beyond testing.
Simon Willison Exactly. Testing shows the presence of bugs; verification shows their absence. Combining AI's generative capabilities with formal methods' guarantees could be powerful.
Greg Evans Simon, this has been extremely illuminating. Thank you.
Simon Willison My pleasure. Great questions.
Andrea Moore That's our program for tonight. Tomorrow we'll examine test-driven development with AI agents.
Greg Evans Until then, review the plans before running the code. Good night.
Sponsor Message

DiffWatch Pro

Your AI just refactored your entire authentication layer at 3 AM. Did you notice? DiffWatch Pro monitors autonomous code agents in real-time, alerting you to high-risk modifications before they're committed. Advanced pattern recognition identifies suspicious changes—privilege escalation, security bypasses, accidental deletions. Integrates with Claude Code, GitHub Copilot, and custom coding agents. Our machine learning models are trained on actual AI-generated bugs from production incidents. Don't let your assistant make changes you'll regret. DiffWatch Pro—someone should watch the watchers.
