LLMs as Forensic Architects for Architecture Discovery
It’s New Year’s Day and I’m thinking about legacy code. Specifically, a conversation I had last week with a friend who just inherited a monolith. Two million lines of Java. The original architects left years ago. The documentation—such as it was—describes a system that no longer exists. The actual architecture is embedded in the code itself, visible only to someone willing to spend weeks reading it.
His question was simple: “Can I just paste chunks of this into ChatGPT and ask it what the architecture is?”
My answer was complicated. But it started with yes.
The architecture discovery problem #
Every engineering team that’s inherited a legacy system knows this pain. The system works (mostly). It makes money (presumably). But nobody alive can explain why the payment module talks directly to the notification service, or why there are three different caching layers, or what the HelperManagerFactory class was supposed to accomplish before it metastasized into 4,000 lines.
Traditional approaches to architecture recovery fall into two camps.
Manual inspection: An experienced architect reads the code, traces the dependencies, maps the module boundaries, and produces diagrams. This works, but it’s expensive and slow. For a large codebase, it can take months. The architect needs domain expertise on top of technical skill. And the resulting documentation starts decaying the moment it’s written.
Static analysis tools: Tools like SonarQube, Structure101, or CodeScene parse the codebase and generate dependency graphs, complexity metrics, and structural visualizations. These are faster than manual inspection but fundamentally limited. They analyze syntax—what calls what, what imports what. They miss the semantic layer entirely. A static analysis tool can tell you that PaymentProcessor depends on NotificationService. It can’t tell you why, or whether that dependency is intentional or accidental.
Neither approach is satisfying. Manual inspection is too slow and expensive for most teams. Static analysis is too shallow for most codebases. There’s a gap between the syntactic structure of code and the architectural intent behind it.
This is where LLMs get interesting.
The forensic architect metaphor #
Think about what a forensic investigator does. They arrive at a scene after the fact. They don’t know what happened. They have physical evidence—fingerprints, trajectories, material traces—and from that evidence, they reconstruct a narrative. They infer intent from artifacts.
An LLM analyzing a codebase does something analogous. It reads the code (evidence), identifies patterns (fingerprints), traces data flow (trajectories), and infers architectural decisions (the narrative). It’s not reading documentation about what the system was designed to do. It’s reconstructing what the system actually does from the code itself.
The “forensic” framing matters because it sets the right expectations. A forensic analysis produces hypotheses, not blueprints. It’s investigative, not authoritative. The LLM might identify that a codebase follows a layered architecture pattern with some hexagonal elements and a few unexplained shortcuts. It’s offering an interpretation based on evidence—the same way a human architect reading the same code would offer an interpretation.
What LLMs can actually do today #
I’ve been experimenting with feeding code into ChatGPT and asking architectural questions. The results range from surprisingly useful to confidently wrong, which is about what you’d expect from a technology that’s barely a month old (in its current publicly accessible form).
Pattern recognition works well. Show an LLM a set of classes and it can often identify design patterns—repository pattern, observer, strategy, factory. It picks up on naming conventions, inheritance hierarchies, and interface structures. This is pattern matching, which is what transformers excel at.
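To make the kind of surface signal involved concrete: even a crude script can flag pattern vocabulary in class names, which is exactly the sort of evidence the LLM latches onto. A minimal sketch (the suffix table is illustrative, not exhaustive, and naming is evidence of a pattern, not proof):

```python
import re

# Pattern vocabulary commonly signaled by class-name suffixes.
# Naming conventions are evidence, not proof, of a design pattern.
PATTERN_SUFFIXES = {
    "Factory": "factory",
    "Repository": "repository",
    "Observer": "observer",
    "Listener": "observer",
    "Strategy": "strategy",
    "Builder": "builder",
    "Adapter": "adapter",
}

def flag_pattern_candidates(java_source: str) -> dict[str, str]:
    """Map each declared class name to a suspected design pattern."""
    candidates = {}
    for match in re.finditer(r"\bclass\s+(\w+)", java_source):
        name = match.group(1)
        for suffix, pattern in PATTERN_SUFFIXES.items():
            if name.endswith(suffix):
                candidates[name] = pattern
    return candidates

snippet = """
public class OrderRepository { }
public class EmailNotificationListener { }
public class PricingStrategy { }
"""
print(flag_pattern_candidates(snippet))
# {'OrderRepository': 'repository', 'EmailNotificationListener': 'observer', 'PricingStrategy': 'strategy'}
```

The LLM goes well beyond this, of course: it also reads method bodies, inheritance, and interfaces. But the suffix scan is a useful reminder that a lot of “pattern recognition” starts from naming.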
Component boundary identification is promising. Feed in a package structure with representative files from each package, and the LLM can often articulate the responsibility of each component. “This appears to be a data access layer.” “This looks like an API gateway that routes requests to domain-specific handlers.” The accuracy correlates strongly with how well the code follows conventions.
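Which files count as “representative” is itself a judgment call. One crude but serviceable heuristic is to take the largest files in each package, on the theory that the bulk of a component’s responsibility lives in its biggest classes. A sketch (paths and sizes below are made up for illustration):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def representative_files(file_sizes: dict[str, int],
                         per_package: int = 2) -> dict[str, list[str]]:
    """Pick the largest files in each package as a first approximation
    of 'representative' code to show the LLM."""
    by_package = defaultdict(list)
    for path, size in file_sizes.items():
        package = str(PurePosixPath(path).parent)
        by_package[package].append((size, path))
    return {
        pkg: [p for _, p in sorted(files, reverse=True)[:per_package]]
        for pkg, files in by_package.items()
    }

sizes = {
    "payments/PaymentProcessor.java": 4800,
    "payments/RefundHandler.java": 900,
    "payments/Currency.java": 200,
    "notify/NotificationService.java": 2100,
}
print(representative_files(sizes))
# {'payments': ['payments/PaymentProcessor.java', 'payments/RefundHandler.java'], 'notify': ['notify/NotificationService.java']}
```

Biggest-file-first is obviously imperfect (sometimes the 4,000-line monster is the least representative thing in the package), but it beats sampling at random.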
Dependency analysis gets interesting. Ask the LLM why two components are coupled, and it can sometimes infer the reason from the code context. “PaymentProcessor imports NotificationService because it sends confirmation emails after successful transactions.” A static analysis tool would just draw the arrow; the LLM tries to explain the arrow.
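You can help the LLM explain the arrow by handing it the right evidence: not the whole file, but the lines where the dependency is actually used, plus a little surrounding context (nearby comments are often where the “why” hides). A minimal sketch:

```python
def usage_context(source: str, symbol: str, radius: int = 1) -> list[str]:
    """Collect the lines where `symbol` appears, plus `radius` lines of
    surrounding context -- the evidence an LLM needs to explain *why*
    a dependency exists, not just that it does."""
    lines = source.splitlines()
    keep = set()
    for i, line in enumerate(lines):
        if symbol in line:
            keep.update(range(max(0, i - radius), min(len(lines), i + radius + 1)))
    return [lines[i] for i in sorted(keep)]

java = """\
public void process(Order order) {
    charge(order);
    // let the customer know the payment went through
    notificationService.sendConfirmation(order.email());
}"""
for line in usage_context(java, "notificationService"):
    print(line)
```

Run on the example above, this surfaces the call site and the comment next to it, which is precisely the fragment from which an LLM can infer “it sends confirmation emails after successful transactions.”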
Intent inference is the frontier. Can the LLM tell you that the original architect chose event sourcing for the order management subsystem because they anticipated high audit requirements? Sometimes, if the code is expressive enough. Can it distinguish between deliberate architectural decisions and organic accretion? This is where it gets unreliable, but even partial answers are more than static analysis provides.
The limitations are real #
Before anyone gets too excited, the constraints are significant.
Context windows are the bottleneck. GPT-3.5 has a 4,096-token context window. That works out to a few hundred lines of code at most, and less once you leave room for the prompt and the response. You can’t feed in a two-million-line monolith; you feed in fragments and hope the LLM can synthesize across separate conversations. This fragmented analysis produces fragmented understanding. The LLM can describe what it sees in a single snippet; it can’t hold the whole system in memory the way a human architect eventually does after weeks of immersion.
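In the meantime, you cope by chunking. A minimal sketch, under the rough rule-of-thumb assumption of ~4 characters per token (not an exact count): split files on blank lines so each fragment fits the budget and stays roughly self-contained, with headroom reserved for the prompt and the response.

```python
def chunk_for_context(source: str, max_tokens: int = 2000,
                      chars_per_token: int = 4) -> list[str]:
    """Split a file into chunks that fit a token budget, breaking on
    blank lines so fragments stay roughly self-contained. The
    chars-per-token ratio is a rough rule of thumb, not a real count.
    A single oversized block still goes into its own (oversized) chunk."""
    budget = max_tokens * chars_per_token
    chunks, current, current_len = [], [], 0
    for block in source.split("\n\n"):
        if current and current_len + len(block) > budget:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(block)
        current_len += len(block) + 2  # account for the rejoined blank line
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The default budget is deliberately half of 4,096: the other half is for your question and the model’s answer. Breaking on blank lines is a cheap proxy for breaking on method or class boundaries; a real implementation would split on syntax instead.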
Hallucination is an architectural hazard. When the LLM doesn’t know something, it makes it up. In a code analysis context, this means it might identify a “microservice boundary” that doesn’t exist, or attribute a design decision to a pattern that wasn’t intentional. You need an engineer who knows the system to validate every inference. The LLM is a hypothesis generator, not an oracle.
Training data bias shapes the analysis. The LLM has seen a lot of Spring Boot applications, a lot of Rails apps, a lot of Express servers. When it analyzes your code, it maps what it sees onto patterns from its training data. If your architecture is unconventional—and plenty of legacy systems are deeply unconventional, not by choice—the LLM may force-fit standard patterns where they don’t apply.
No runtime understanding. Static analysis tools at least work with the actual codebase. An LLM working from pasted code snippets has no access to runtime behavior: no profiling data, no request traces, no database query patterns. Architecture recovery without runtime data is like a forensic investigation without the crime scene; you’re working from photographs.
A practical workflow #
Despite the limitations, I think there’s a viable workflow emerging for teams inheriting legacy systems.
Phase 1: Automated survey. Use static analysis tools to generate the syntactic dependency graph. This gives you the “what connects to what” baseline.
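If you don’t have one of those tools handy, even a first cut of the baseline is scriptable. A minimal sketch that builds a package-level graph from Java import statements (it sees declared imports only, so reflection and dependency injection are invisible to it; treat it as a floor, not a ceiling):

```python
import re

def java_import_graph(sources: dict[str, str]) -> dict[str, set[str]]:
    """Build a package-level dependency graph from Java import
    statements: maps each source package to the packages it imports."""
    graph: dict[str, set[str]] = {}
    for path, source in sources.items():
        package = path.rsplit("/", 1)[0]
        deps = graph.setdefault(package, set())
        for m in re.finditer(r"^import\s+([\w.]+)\s*;", source, re.MULTILINE):
            deps.add(m.group(1).rsplit(".", 1)[0])  # keep the package, drop the class
    return graph

sources = {
    "payments/PaymentProcessor.java":
        "import com.acme.notify.NotificationService;\n"
        "import java.util.List;\n",
}
print(java_import_graph(sources))
```

That gets you the arrows. Everything after this phase is about explaining them.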
Phase 2: LLM-assisted interpretation. Feed representative code samples from each component into ChatGPT, along with the dependency graph. Ask the LLM to hypothesize about component responsibilities, architectural patterns, and the rationale for key dependencies. Treat the output as a starting hypothesis.
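One way to make Phase 2 repeatable is a small template that pairs the code sample with the dependency edges and explicitly asks for hypotheses rather than verdicts. The wording below is just one plausible starting point, not a tuned prompt:

```python
def interpretation_prompt(component: str, code_sample: str,
                          dependencies: list[str]) -> str:
    """Assemble a Phase 2 prompt: pair a code sample with its statically
    derived dependency edges, and ask for hypotheses, not verdicts."""
    deps = "\n".join(f"- {d}" for d in dependencies) or "- (none found)"
    return (
        "You are analyzing one component of a large legacy Java codebase.\n\n"
        f"Component: {component}\n"
        f"Known dependencies (from static analysis):\n{deps}\n\n"
        f"Representative code:\n-----\n{code_sample}\n-----\n\n"
        "Hypothesize: (1) this component's responsibility, "
        "(2) any architectural patterns in evidence, and "
        "(3) a plausible reason for each listed dependency. "
        "Mark low-confidence guesses explicitly."
    )
```

Asking the model to flag its own low-confidence guesses doesn’t eliminate hallucination, but it gives the Phase 3 reviewer a place to start.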
Phase 3: Human validation. An engineer (ideally someone with domain context) reviews the LLM’s hypotheses against the actual codebase. Confirm, refine, or reject each interpretation. This is dramatically faster than starting from scratch because you’re evaluating hypotheses rather than generating them.
Phase 4: Documentation synthesis. Use the validated hypotheses to produce architectural documentation. The LLM can help with the writing too—feed it the confirmed analysis and ask it to produce architecture decision records, component descriptions, and system diagrams.
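One lightweight way to keep Phases 3 and 4 honest is to track each LLM claim as an explicit record with a review status, and let only confirmed claims flow into the generated documentation. A sketch (the `Hypothesis` shape and the ADR stub format here are illustrative, not a standard):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    component: str
    claim: str
    status: str = "unreviewed"   # unreviewed | confirmed | rejected
    note: str = ""

def adr_stubs(hypotheses: list[Hypothesis]) -> str:
    """Render only the confirmed hypotheses into lightweight ADR stubs;
    rejected and unreviewed claims never reach the documentation."""
    confirmed = [h for h in hypotheses if h.status == "confirmed"]
    sections = []
    for i, h in enumerate(confirmed, 1):
        sections.append(
            f"## ADR {i}: {h.component}\n\n"
            f"Decision (recovered): {h.claim}\n\n"
            f"Validation note: {h.note or 'confirmed by review'}"
        )
    return "\n\n".join(sections)
```

The point of the explicit `status` field is that the pipeline structurally cannot publish an unvalidated hypothesis, which is exactly the human-in-the-loop guarantee Phase 3 is supposed to provide.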
This isn’t fully automated architecture recovery. It’s LLM-augmented architecture recovery, with humans in the loop for validation. But it could compress a months-long discovery process into weeks.
The democratization angle #
Here’s what excites me most about this. Architecture discovery has traditionally required senior engineers or external consultants with deep experience. Not every team has access to someone who can look at a codebase and immediately recognize the architectural patterns, the intentional decisions, and the accidental complexity.
LLMs don’t replace that expertise, but they lower the bar for getting started. A mid-level engineer armed with ChatGPT and a decent understanding of architectural concepts can produce a first-pass analysis that, a year ago, would have required senior guidance to even attempt. The senior architect still needs to validate the work, but the initial heavy lifting—the reading, the pattern matching, the hypothesis generation—gets distributed.
For teams maintaining legacy systems without access to the original architects (which is most teams maintaining legacy systems), this is a meaningful shift. The code itself becomes more legible, not because it changed, but because we have better tools for reading it.
What’s next #
I’m going to experiment with this more systematically over the next few months. I want to develop a repeatable process: specific prompts, specific validation steps, specific output formats. The ad-hoc “paste code into ChatGPT and see what happens” approach works for exploration, but it doesn’t scale.
The context window limitation is the biggest practical barrier. If that expands significantly—and I expect it will—the forensic architect model becomes much more powerful. Imagine feeding an LLM an entire microservice (not just snippets) and asking it to produce an architecture decision record. We’re not there yet, but we’re closer than I would have guessed six months ago.
Happy New Year. Go read some legacy code. Or better yet, ask an LLM to read it for you and tell you what it found.