Posted: 2019/01/28
Last edited: 2019/03/05
One of the books I listened to in 2018 was "Linked", by Laszlo Barabasi. I first read Linked when I was a graduate student at Northeastern. It was around 2008 or 2009, soon after Barabasi moved from Notre Dame to Northeastern. It is a wonderful book and I have revisited it several times since.
The book talks about networks in the real world, and studies their structure. Examples of such networks are:
Last edited: 2019/03/05
One of the books I listened to in 2018 was "Linked", by Laszlo Barabasi. I first read Linked when I was a graduate student at Northeastern. It was around 2008 or 2009, soon after Barabasi moved from Notre Dame to Northeastern. It is a wonderful book and I have revisited it several times since.
The book talks about networks in the real world, and studies their structure. Examples of such networks are:
- The world wide web, where webpages are the nodes and hyperlinks are the edges.
- The social network of people, where there is a link between two people that know each other.
- The Hollywood actor collaboration network, where two actors are linked if they have played in the same movie.
- The scientific collaboration network, where two scientists are linked if they have authored a paper together.
This post is intentionally light on math. I did this both to make it more approachable, but also because I’m a novice in this area myself, so I only understand the math at a high level and I’m fuzzy on the details.
The book starts with the network model of mathematicians Paul Erdos and Alfred Renyi, introduced in 1959. In their model, a network is a graph of N nodes that are connected to each other randomly, with undirected links. When each node gets its links randomly, all nodes tend to have a similar number of links. If you plot the degree distribution of the network (a histogram showing how many nodes have 0 links, 1 link, … , all the way up to N - 1 links), you see that it forms a bell curve with some mean M. Most nodes have M links. The curve decays exponentially as we move away from M, meaning that there aren’t really any nodes that are outliers, i.e., nodes that have a vastly different number of links. Erdos and Renyi also found that if each node has just one link on average, this is sufficient to connect most nodes into a single giant cluster. Fewer than one link on average leaves the network as a collection of disconnected islands.
The phenomenon now called "six degrees of separation" first appeared in the writings of Hungarian writer Frigyes Karinthy in 1929. He guessed that each pair of people can be connected by 5 links. In the 1960s, the psychologist Stanley Milgram conducted an experiment for the United States, which estimated that the average distance between two random people is 5.5, which was rounded up to 6. For people new to networks, it is counterintuitive that a network with hundreds of millions or billions of nodes has such a short average distance. Such networks are often called "small worlds". (See also the work of Watts and Strogatz later in this post.)
In his 1973 paper "The Strength of Weak Ties", sociologist Mark Granovetter proposed the following model for the social network: people are in small tight clusters with their close friends, and these clusters are linked by a person's acquaintances, called "weak ties". He found that most people find new jobs through weak ties, because your strong ties are usually exposed to exactly the same opportunities as you.
Another very important development was the work of Watts and Strogatz, published in Nature in 1998. They define the clustering coefficient (CC) of a node as the number of links between the node’s neighbors, divided by the total number of links the neighbors could have among them. The CC of a network is the average CC of its nodes. A random network, as in Erdos and Renyi’s model, has short average distance, and also a very small clustering coefficient; when nodes are connected randomly, clusters are hard to form. CC gives us a way to check whether a network was formed randomly: a network whose CC differs significantly from what is predicted by randomness was not formed randomly.
Watts and Strogatz came up with a network with two main characteristics: high degree of clustering, and short average distance. They proposed a network model that is ring-like: nodes live in small highly-connected neighborhoods, and these neighborhoods are linked together in a ring. Such a network has high CC. Also, they rewire a small number of links randomly around the network, and this makes the network have short average distance as well. They call networks with high CC and small average distance "small worlds". They showed that several real-world networks are small worlds. They also showed that only a small percentage of random connections in a network is sufficient to create a small world.
The network of Watts and Strogatz starts in a very different initial state than Erdos and Renyi’s network: a ring instead of a collection of disconnected nodes. However, after that, nodes in both networks acquire links in the same way, randomly. As a result, both models predict that the degree distribution is a bell curve. All nodes have a similar number of edges, and it is extremely rare to deviate a lot from the average.
In two papers in 1999, Barabasi and his students Reka Albert and Hawoong Jeong studied the connectivity of the world wide web, and found that the degree distribution is a power law, which is very different from a bell curve. In a power law distribution, there isn’t a typical node, with a typical number of edges. There are a few nodes that are hubs: they have an extremely high number of edges. Then, a few more nodes with a lot of edges, but not as many as hubs, then more nodes with even fewer edges, and so on. Two nodes can differ greatly in the number of edges they have. Barabasi et al call such networks "scale free". They also estimated that the average distance between nodes on the web is 19. Like the web, the network of actors in Hollywood and the network of scientific collaborations are scale free.
After Barabasi and his students observed the power-law distribution on the web, they wanted to understand how it happens. They knew that power laws appear in physics when systems undergo a phase transition (through Wilson's work on renormalization), and wondered if networks undergo a phase transition as well. They did not find one.
The model they came up with relies on two things to explain power laws: growth and preferential attachment. Growth means that the network starts with only a few nodes and we gradually add new nodes and connect them to existing ones. Preferential attachment means that when we add a new node, the probability that it connects to an existing node N is proportional to the number of N's links. It is an instance of the "rich get richer" phenomenon. With this model, as the network grows, hubs are formed. The older nodes have an advantage because they have more time to acquire links, and due to preferential attachment, their many links cause them to get even more links. Preferential attachment makes intuitive sense on the web and other networks. When new sites get created, they are more likely to link to popular sites instead of obscure ones.
An oversimplification of this model is that old nodes always win. But in the real world, a strong newcomer can get many links. The example Barabasi uses in the book is Google, and how it eventually beat previous search engines. To account for this possibility, Barabasi and Ginestra Bianconi modified the model to assign a "fitness" coefficient to each node: a number from 0 to 1 which describes the quality of the node. Then, they changed preferential attachment so that new nodes look at both fitness and the number of edges of existing nodes when deciding where to connect to. They used some math from Bose-Einstein condensates to show that when a node is much fitter than all other nodes, it may get all links in the network, resulting in a star topology instead of a graph.
What we have seen so far are the network models discussed in the first nine chapters of the book. The rest of the book expands on scale-free networks. The first topic discussed is network resilience: when you start removing nodes from a network, how many nodes do you need to remove before the network breaks down to isolated clusters?
Let’s say that the removal of nodes is random. In a random network, you can remove up to some percentage of the nodes, and then the network splits apart. But in a scale free network with a degree exponent between 2 and 3 (meaning that the degree distribution decays gradually, and there are several hubs) there is no such threshold. You can keep removing nodes randomly and the hubs keep the network together. Now, what happens when the removal of nodes is not random? It is very easy to break a scale-free network apart if you remove the nodes with the most links first. A random network, having no hubs, is not impacted by targeted removal of nodes.
There is also the question of diffusion in networks (spread of diseases, new ideas, products, etc). Barabasi mentions a study where the researchers looked at the rate of adoption of new crops. They found that the number of people who adopt a new crop each year follows a bell curve. In the first year, only a few people adopt the crop. In the following years, there is a fast ramp up as more and more people adopt it, and then there is a fast drop when most people have adopted it.
How does the structure of a network affect diffusion? The typical diffusion models assume a random network. Then, they define an epidemic threshold: how likely is it that a node with the disease spreads it to its neighbors? Then, they conclude that if the threshold is below some level, the spread dies down. Over that level, the disease quickly spreads across the network.
Researchers studying computer viruses found that viruses stick around long after they were supposed to be gone. This behavior does not conform to the diffusion models in random networks. When they realized that computer networks are scale free, they studied diffusion using the scale-free model, and found that there is no epidemic threshold; even a mildly infecting virus can spread quickly.
When talking about the spread of new ideas and products in a scale-free network, diffusion starts from nodes called innovators. Then, the ideas spread to the early adopters, which are often hubs, because hubs are well connected, so they hear about new ideas quickly. From the hubs, diffusion to the rest of the network is very rapid.
Researchers have also studied the spread of AIDS. One challenge there is creating the network of sexual partners, because you cannot ask people to tell you the names of people they have slept with. By only asking people about the number of partners they have had, researchers estimated that the sex network is scale free. If you are trying to reduce the spread of a sexually-transmitted disease in a scale-free network, and assuming that there is not enough money to treat everyone, what do you do? One possibility suggested is to treat hubs first, which of course raises serious ethical questions.
After diffusion, Barabasi spends some time on the history of the internet and the web. He mentions that Paul Baran at Rand Corporation was tasked with designing a computer network that could be resistant to attacks. He proposed a design where the routers are connected in a mesh pattern. In such a network, there are no hubs, so it is hard to take down the network by disabling a few nodes. In addition, he proposed that messages be broken down into small digital packets. AT&T, who was the megacorp of the time, and largely based on analog technology, opposed the idea. Later, the idea of digital packets was discovered independently by the creators of Arpanet, which became the internet. But the connectivity of routers on the internet is scale free, not a mesh.
A brief mention to the rest of the topics in the book.
- He discusses network models with directed edges instead of undirected, and mentions that directed edges limit the navigability of a network.
- He mentions the importance of communities in networks, and discusses some work on community-detection algorithms.
- He mentions that many biological networks are scale free. In the metabolic network, two molecules are connected if they participate in the same chemical reaction. The metabolic networks of many organisms are scale free, with ATP, ADP, and water being the hub molecules.
- Last, he discusses some networks related to economics and finance. If I remember correctly, the network of directors participating in boards of Fortune 500 companies is scale free.
- Sync, by Steven Strogatz.
- Six Degrees: The Science of a Connected Age, by Duncan Watts. I have not read this one yet. From the table of contents, there is some overlap with Linked, but several new topics as well.
- The Formula: The Universal Laws of Success, by Laszlo Barabasi. Using network science as a base, the book discusses how people succeed and fail. Highly entertaining.
Update (2019/03/05): The pervasiveness of scale-free networks has been challenged by other network scientists, most recently by Aaron Clauset and his collaborators. This comment in Nature nicely explains the current state of affairs.
Comments
Post a Comment