The network structure of computer programs

Posted: 2019/01/28
Last edited: 2019/04/29

In an earlier post summarizing "Linked", we learned about scale-free networks and other mathematical models of real-world networks. A fun and interesting question is: Can we view computer programs as networks, and what are the properties of these networks? (Note: by computer program, I mean a program including all its dependencies, so it can in fact be a software system with millions of lines of code.)

For example, consider the call graph of a Java application. We can view this graph as a network whose nodes are the methods in the program, and there is an undirected edge between two methods A and B if A calls B, or if B calls A. (We follow the simplifying assumption of "Linked" and use undirected edges for now.) It would be useful to measure some basic facts about this network: the degree distribution, the average distance between nodes, the clustering coefficient, etc. Is this network random, scale free, or something else? Intuitively, I believe that the degree distribution is not a bell curve, so the network is not random. There are many methods, such as Collections methods, that are called much more often than others, so I believe that we would see hubs in the network.
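
To make this concrete, here is a minimal sketch of such a measurement, assuming Python with networkx and a hypothetical edge list calls.txt (one "caller callee" pair per line) produced by whatever call-graph extraction tool you have:

```python
# A minimal sketch: load a call graph from a hypothetical edge list
# "calls.txt" and look at its basic network properties.
from collections import Counter

import networkx as nx

G = nx.Graph()  # undirected, following the simplification above
with open("calls.txt") as f:
    for line in f:
        caller, callee = line.split()
        G.add_edge(caller, callee)

# Degree histogram: a power law is roughly a straight line on a
# log-log plot; a bell curve is not.
histogram = Counter(d for _, d in G.degree())
for degree, count in sorted(histogram.items()):
    print(degree, count)

print("clustering coefficient:", nx.average_clustering(G))
if nx.is_connected(G):
    print("average distance:", nx.average_shortest_path_length(G))
```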

Besides the network of method calls, there are other networks hidden in computer programs.

  • A network where classes are the nodes, and two classes A and B are linked if they are referenced in the same source file.
  • A network with modules as the nodes, and two modules are linked if one imports the other. (A sketch of building this network appears after the list.)
  • A network of objects at runtime, where two objects are linked if one calls methods of the other one. We can use code instrumentation to record such a network.
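
For the module network, here is a minimal sketch, assuming a Python codebase under a hypothetical src/ directory and treating each file as a module:

```python
# A sketch for the module network: parse the import statements of every
# Python file under a hypothetical src/ directory. Relative imports
# ("from . import x") are skipped for simplicity.
import ast
import pathlib

import networkx as nx

G = nx.Graph()
for path in pathlib.Path("src").rglob("*.py"):
    tree = ast.parse(path.read_text())
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                G.add_edge(path.stem, alias.name)
        elif isinstance(node, ast.ImportFrom) and node.module:
            G.add_edge(path.stem, node.module)

print(G.number_of_nodes(), "modules,", G.number_of_edges(), "import edges")
```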

Can we make informed guesses about the structure of these networks? It is likely that all of them contain hubs, so the degree distribution is not a bell curve. It is common to have a few classes (or modules) that are used in many places in a program, and others that are used very little. At runtime, most objects are short-lived, but a few objects are live for the duration of the program, and interact with many other objects.

After we figure out the structure of these networks, we could try to explain how the structure emerged. Suppose that the network of methods is scale free. By looking at the revision history of a program, we can see the growth of the network, with methods being added and removed. We may observe preferential attachment: methods that are called often in early revisions keep collecting callers as the program evolves, and eventually become hubs. Do the old methods always become the hubs, or can a fit newcomer become a hub as well?
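
We can get a feel for this growth process with a toy simulation of the Barabasi-Albert model (a sketch using networkx; the parameters are arbitrary):

```python
# Grow a 10,000-node network with the Barabasi-Albert model: each new
# node attaches to m=2 existing nodes, preferring high-degree ones.
import networkx as nx

G = nx.barabasi_albert_graph(n=10_000, m=2, seed=42)

# Nodes are numbered in insertion order, so low ids are "old methods".
top = sorted(G.degree(), key=lambda nd: nd[1], reverse=True)[:10]
print("ten highest-degree nodes (id, degree):", top)
```

In the pure model the oldest nodes dominate; it takes Barabasi's later fitness extension for a fit newcomer to overtake them.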

In order to study these networks of program elements, we need to create them first, and the creation itself presents challenges. In the network of methods, when a class method A calls an interface method B, do we draw an edge between A and B, or do we try to find which class methods could be the actual callees, using control-flow analysis (CFA)? In an untyped or optionally typed language, can CFA give us enough precision to create the network of functions? If we take directionality of the edges into account, does the structure of the network change drastically?
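
The directionality question is the easiest to probe; here is a sketch that reuses the hypothetical calls.txt edge list from the earlier sketch, this time keeping the edges directed:

```python
# Same hypothetical "calls.txt" as before, but kept directed, to see
# whether in-degree and out-degree behave differently.
import networkx as nx

D = nx.DiGraph()
with open("calls.txt") as f:
    for line in f:
        caller, callee = line.split()
        D.add_edge(caller, callee)

print("max in-degree: ", max(d for _, d in D.in_degree()))
print("max out-degree:", max(d for _, d in D.out_degree()))
# Expectation: in-degree hubs (widely used utilities) are far more
# extreme than out-degree hubs, since a single method body only has
# room for so many call sites.
```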

Suppose that we create such networks for many large programs, in many languages. What can we do with this information? Are there compelling applications of these networks? Here are some hand-wavy possibilities:

The network structure allows us to weigh the importance of each piece of an application differently. Bad code in methods and classes that are hubs has a greater negative impact than bad code in obscure parts of the application. We can prioritize bug fixes, performance fixes, refactorings, etc., to improve the well-connected parts of the application first. A static analysis can use more precise techniques (that run slower) for the hubs in the program, and use cheaper and less precise techniques for the unimportant nodes. Similarly, when we try to increase code coverage in our tests, we can focus on coverage of the hubs first.
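
As a toy illustration of this kind of prioritization, here is a sketch that ranks the methods of a made-up call graph by PageRank (one reasonable importance measure, not the only one):

```python
# Rank the nodes of a made-up call graph by PageRank; high-ranking
# methods are the hubs that deserve the most attention.
import networkx as nx

D = nx.DiGraph([
    ("main", "parse"), ("main", "eval"), ("parse", "log"),
    ("eval", "log"), ("eval", "lookup"), ("lookup", "log"),
])

for method, score in sorted(nx.pagerank(D).items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {method}")
```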

Maybe the networks can be used in program-understanding tools in IDEs, to give insights to developers. By browsing the network, a developer may see some unexpected connection and decide to investigate, or may see some overly complicated connectivity pattern, and decide to refactor the code.

Say we are writing a fuzzer that creates random programs for a language X, to discover bugs in a compiler for X. It is valuable if the fuzzer produces programs that somewhat resemble what a human could write. In that case, the bugs discovered by the fuzzer are more likely to be encountered in real code as well. Incorporating information about the network structure of X programs into the fuzzer can allow the fuzzer to create more realistic programs.
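
Here is one hedged way this could look: a toy fuzzer that chooses callees by preferential attachment, so that the emitted program (a fake Python-like listing) grows hubs the way real programs seem to:

```python
# A toy fuzzer that picks callees by preferential attachment, so the
# generated pseudo-program grows call-graph hubs.
import random

random.seed(0)
call_counts = {"f0": 1}  # +1 smoothing so every function can be picked
body = {"f0": []}

for i in range(1, 50):
    name = f"f{i}"
    # Callees are chosen with probability proportional to how often
    # each existing function is already called.
    callees = random.choices(list(call_counts),
                             weights=list(call_counts.values()),
                             k=random.randint(1, 3))
    for c in callees:
        call_counts[c] += 1
    call_counts[name] = 1
    body[name] = callees

for name, callees in body.items():
    print(f"def {name}():", "; ".join(f"{c}()" for c in callees) or "pass")
```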

Studying how the network structure of a program evolves over time (via commit history) can give us insights about software evolution: how pieces of functionality move from one place to another. For example, it is likely that pieces of code that are well connected in the network eventually get moved to the same file/class/module. A tool may be able to predict such refactorings and suggest them to the programmer automatically.
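
A sketch of how such a suggestion tool might work, using community detection on a made-up call graph where a hypothetical file.function naming convention records where each function currently lives:

```python
# Detect communities in a made-up call graph and flag functions whose
# file (the "file.function" prefix) disagrees with the rest of their
# community.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

G = nx.Graph([
    ("a.f", "a.g"), ("a.g", "a.h"), ("a.f", "a.h"),
    ("b.x", "b.y"), ("b.y", "b.z"), ("b.x", "b.z"),
    ("a.stray", "b.x"), ("a.stray", "b.y"), ("a.stray", "b.z"),
])

for community in greedy_modularity_communities(G):
    files = {name.split(".")[0] for name in community}
    if len(files) > 1:
        print("these belong together, consider moving:", sorted(community))
```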

Program networks may allow us to understand build dependencies better. In a monorepo such as Google's, how much code gets pulled in on average when we add a new dependency to our application? One can plot the amount of code that becomes reachable as a function of the number of direct dependencies of a build target. If program networks are random (a la Erdos and Renyi), the reachable code may increase linearly or sublinearly as we add dependencies. If program networks are scale free, the reachable code may increase superlinearly. Such information can influence maintainability decisions in large codebases.
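
A sketch of the measurement itself, with a random toy DAG standing in for a real build graph:

```python
# How much becomes reachable as a target gains direct dependencies?
# A random toy DAG stands in for a real build graph here.
import networkx as nx

rnd = nx.gnp_random_graph(500, 0.01, seed=1, directed=True)
dag = nx.DiGraph((u, v) for u, v in rnd.edges() if u < v)  # force a DAG
dag.add_nodes_from(rnd.nodes())

target = 0
direct = sorted(dag.successors(target))
for k in range(1, len(direct) + 1):
    reachable = set()
    for dep in direct[:k]:
        reachable |= {dep} | nx.descendants(dag, dep)
    print(f"{k} direct deps -> {len(reachable)} reachable targets")
```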

The network structure of a program may give us some clues about the program's quality. For example, in a well organized application we may see clearly identifiable clusters. In a poorly organized application with a lot of shared state, we may see a single giant cluster. It might be possible to devise a metric of software complexity (like cyclomatic complexity) based on the network structure of a program. As new code gets checked in, the metric gets recalculated, and developers are warned if code quality goes down significantly.
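
For instance, the modularity of the best community split is one candidate ingredient for such a metric; here is a sketch with a completely made-up threshold:

```python
# A toy CI gate: compute the modularity of the best community split of
# the call graph; a low score hints at one tangled giant cluster. The
# edge list "calls.txt" and the 0.3 threshold are both assumptions.
import networkx as nx
from networkx.algorithms.community import (greedy_modularity_communities,
                                           modularity)

G = nx.read_edgelist("calls.txt")
score = modularity(G, greedy_modularity_communities(G))
print("modularity:", round(score, 3))
if score < 0.3:
    print("warning: the code is becoming one giant tangled cluster")
```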

These are some of the applications of network science to software engineering that I can think of. I'm sure there are others, and I'd love to see PL researchers study the network structure of programs. Some folks outside the mainstream PL research community have noticed the connection:
  • Devon Estes gave a talk at Lambda Days last year, where he studied the connectivity of modules in Elixir, a functional language similar to Erlang. He used Watts and Strogatz's model, and did not mention scale-free networks; I'm guessing he is not aware of Barabasi's work. He found that the network of modules has low clustering. This is expected, as module dependencies are typically trees. He also found that the network has a small average distance. Because of the low clustering, he concluded that the network resembles a random network more than a small world. He did not calculate the degree distribution. My guess is that it is closer to a power law than a bell curve, so the network is actually scale free, not random.
  • This blog post by Kevin Gullikson studies the network of package dependencies in Python.
  • Subelj and Bajec wrote a paper on community detection, and they mention communities formed in a network of Java packages.
  • A few more networks related to software can be found on this website. (Select the software subdomain.)

To conclude, I believe that studying the network structure of computer programs would be a fruitful area of research. I have wanted to try this myself for several years, but life got in the way and I never found the time. Hopefully, this post will encourage others to do it. I vaguely remember a speech by Tony Hoare at POPL 2012 in Philadelphia, where he suggested that we should study programs as people write them; make these programs the subject of scientific inquiry, and study their properties. I really like this vision. The space of programs that people write is much more constrained than the space of all possible programs. We can think of programs written by people as biological objects, and try to understand their properties. Applying network science to programs can be seen as a small part of that vision.

EDIT(2019/04/29): Section 4.4 of this survey paper on software complexity mentions several papers studying the network structure of programs.
  • "Software systems as complex networks: Structure, function, and evolvability of software collaboration graphs" by C.R. Myers (2003). Studies class dependencies (compile time) and finds scale-free distributions.
  • "Scale-free geometry in OO programs" by Potanin et al (2005). Studies the object graph at runtime.
  • "Power laws in a large object-oriented software system" by Concas et al (2007). Studies several static properties of programs.
  • "Power laws in software" by Louridas et al (2008).

I have only skimmed these papers, so I can't comment in detail, but they all seem interesting, and they confirm the intuition that networks in computer programs are characterized by power laws. I hope to see more work in this area, both to understand software networks better, and to explore potential applications.


