Is the software architecture model in the code?

The code is often cited as the "single point of truth", but why isn't a description of the architecture in the code? Let's look at this in the context of the C4 model.

1. System Context

A good starting point for describing a software system is to draw a system context diagram. This shows the system in question along with the key types of user (e.g. actors, roles, personas, etc) and system dependencies. Is it possible to get this information from the code? The answer is, "not really".

  • Users: In theory, we should be able to get a list of user roles from the code. For example, many software systems will have some security configuration that describes the various user roles, Active Directory groups, etc and the parts of the system that such users have access to. The implementation details will differ from codebase to codebase and technology to technology, but in theory this information is available somewhere in the absence of an explicit list of user types.
  • System dependencies: The list of system dependencies is a little harder to extract from a codebase. Again, we can scrape security configuration to identify links to systems such as LDAP and Active Directory. We could also search the codebase for links to known libraries, APIs and service endpoints (e.g. URLs), and make the assumption that these are system dependencies. But what about those system interactions that are done by copying a file to a network share? This sounds archaic, but it still happens. Understanding inbound dependencies is also tricky.
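To make the "search the codebase for service endpoints" idea concrete, here is a minimal sketch in Python. It assumes the source and configuration files have already been read into strings; the file names, URLs and the `candidate_dependencies` function are hypothetical and exist only for illustration. It treats every hostname found in a URL as a candidate system dependency, which is exactly the assumption described above.

```python
import re
from urllib.parse import urlparse

# Matches URLs for a handful of common schemes embedded in source or config text.
URL_PATTERN = re.compile(r"(?:https?|ldaps?|ftp)://[^\s\"'<>)]+")


def candidate_dependencies(sources):
    """Return the sorted hostnames referenced by URLs in the given texts.

    `sources` maps a file name to that file's contents.
    """
    hosts = set()
    for text in sources.values():
        for url in URL_PATTERN.findall(text):
            host = urlparse(url).hostname
            if host:
                hosts.add(host)
    return sorted(hosts)


# Hypothetical file contents, invented for this example.
sources = {
    "PaymentClient.java": 'String endpoint = "https://payments.example.com/api/v1/charge";',
    "application.properties": (
        "ldap.url=ldap://directory.example.com:389\n"
        "crm.url=http://crm.example.com/rest\n"
    ),
}

print(candidate_dependencies(sources))
```

A scan like this only finds dependencies that leave a URL behind. Files copied to a network share and inbound dependencies stay invisible, which is why the answer remains "not really".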

2. Containers

The next level in the C4 model is a container diagram that shows the various web applications, mobile apps, databases, file systems, standalone applications, etc and how they interact to form the overall software system. Again, some of this information will be present, in one form or another, in the codebase. For example, you could scrape this information from:

  • IDE project files: Information about executable artifacts (and therefore containers) could in theory be extracted from IntelliJ IDEA project files, Microsoft Visual Studio solution files, Eclipse workspaces, etc.
  • Build scripts: Automated build scripts (e.g. Ant, Maven, Gradle, MSBuild, etc) typically generate executable artifacts or have module definitions that can again be used to identify containers.
  • Infrastructure provisioning and deployment scripts: These scripts (e.g. Puppet, Chef, Vagrant, Docker, etc) will probably result in deployable units, which can again be identified and used to create the containers model.

Extracting information from such sources is useful if you have a microservices architecture with hundreds of separate containers but, if you simply have a web application talking to a database, it may be easier to explicitly define this rather than going to the effort of scraping it from the code.
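As one illustration of scraping a build script, the following sketch reads the `<modules>` section of a multi-module Maven `pom.xml` and lists the declared modules as candidate containers. The POM content and module names are made up for the example, and `maven_modules` is a hypothetical helper rather than part of any tool. Not every module is a container (a shared library module is not), so the result still needs a human to filter it.

```python
import xml.etree.ElementTree as ET

# Maven POM files declare their elements in this XML namespace.
MAVEN_NS = {"m": "http://maven.apache.org/POM/4.0.0"}


def maven_modules(pom_xml):
    """Return the module names declared in a parent POM, in document order."""
    root = ET.fromstring(pom_xml)
    return [
        module.text.strip()
        for module in root.findall("m:modules/m:module", MAVEN_NS)
        if module.text
    ]


# Hypothetical parent POM, invented for this example.
POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>parent</artifactId>
  <version>1.0</version>
  <packaging>pom</packaging>
  <modules>
    <module>web-app</module>
    <module>batch-importer</module>
    <module>common</module>
  </modules>
</project>"""

print(maven_modules(POM))
```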

3. Components

The third level of the C4 model is a component diagram. Since even a relatively small software system may consist of a large number of components, this is a level that we certainly want to automate. But it turns out that even this is tricky. Usually there's a lack of an architecturally-evident coding style, which means you get a conflict between the software architecture model and the code. This is particularly true in older systems where the codebase lacks modularity and looks like a sea of thousands of classes interacting with one another. Assuming that there *is* some structure to the code, "components" can be extracted using a number of different approaches, depending on the codebase and the degree to which an architecturally-evident coding style has been adopted:

  • Metadata: The simplest approach is to annotate the architecturally significant elements in the codebase and extract them automatically. Examples include finding Java classes with specific annotations or C# classes with specific attributes. These could be your own annotations or those provided by a framework such as Spring (e.g. @Controller, @Service, @Repository, etc), Java EE (e.g. @EJB, @MessageDriven, etc) and so on.
  • Naming conventions: If no metadata is present in the code, often a naming convention will have been consciously or unconsciously adopted that can assist with finding those architecturally significant code elements. For example, finding all classes where the name matches "xxxService" or "xxxRepository" may do the trick.
  • Packaging conventions: Alternatively, perhaps each sub-package or sub-namespace represents a component.
  • Module systems: If a module system is being used (e.g. OSGi), perhaps each of the module bundles represents a component.
  • Build scripts: Similarly, build scripts often create separate modules/JARs/DLLs from a single codebase and perhaps each of these represents a component.
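The first two approaches, metadata and naming conventions, can be sketched roughly as follows. This is a deliberately crude, text-based scan of Java source using regular expressions. A real tool would more likely use reflection or a proper parser. The class names and the `find_components` function are hypothetical, and the set of Spring stereotypes is limited to the three named above.

```python
import re

# A Spring stereotype annotation followed, before any '{' or ';', by a class declaration.
ANNOTATED_CLASS = re.compile(
    r"@(?P<stereotype>Controller|Service|Repository)\b[^{;]*?\bclass\s+(?P<name>\w+)"
)

# A class whose name ends in "Service" or "Repository".
NAMED_CLASS = re.compile(r"\bclass\s+(?P<name>\w+(?:Service|Repository))\b")


def find_components(java_sources):
    """Map each candidate component name to the rule that identified it."""
    components = {}
    for source in java_sources:
        # Metadata takes precedence over the naming convention.
        for match in ANNOTATED_CLASS.finditer(source):
            components[match.group("name")] = "annotation @" + match.group("stereotype")
        for match in NAMED_CLASS.finditer(source):
            components.setdefault(match.group("name"), "naming convention")
    return components


# Hypothetical Java sources, invented for this example.
java_sources = [
    "@Service\npublic class OrderProcessor {\n}\n",
    "public class CustomerRepository {\n}\n",
    "public final class StringUtils {\n}\n",
]

for name, rule in sorted(find_components(java_sources).items()):
    print(name, "<-", rule)
```

Here `OrderProcessor` is picked up through its annotation, `CustomerRepository` through its name, and `StringUtils` is ignored because it matches neither rule.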

Auto-generating the software architecture model

Ideally, we should auto-generate as much of the software architecture model as possible from the code, but this isn't currently realistic because most codebases don't include enough information about the software architecture to be able to do this effectively. Another approach is to extract and supplement.
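One way to picture "extract and supplement" is the sketch below. It assumes the component names have already been extracted from the code by some means, and that the parts the code cannot tell us (people, external systems, descriptions) are written down by hand. The data structure, names and the `build_model` function are all hypothetical. This is an illustration of the idea, not a description of any particular tool.

```python
def build_model(extracted_components, supplement):
    """Combine component names extracted from code with hand-written context."""
    descriptions = supplement.get("descriptions", {})
    return {
        "people": list(supplement.get("people", [])),
        "software_systems": list(supplement.get("software_systems", [])),
        "components": [
            {"name": name, "description": descriptions.get(name, "")}
            for name in sorted(extracted_components)
        ],
    }


# Extracted from the codebase (hypothetical names).
extracted = {"OrderProcessor", "CustomerRepository"}

# Supplied by a person, because the code does not contain it (hypothetical values).
supplement = {
    "people": ["Customer", "Back-office administrator"],
    "software_systems": ["Payment gateway"],
    "descriptions": {"CustomerRepository": "Stores customer records."},
}

model = build_model(extracted, supplement)
print(model["components"])
```

Components without a hand-written description still appear in the model with an empty description, so gaps in the supplementary information are easy to spot.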