Statistical Analysis of Network Data with R

Bioengenharia e Bioinformática

•
UNESP

Aron Roger
29/11/2016
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Você viu 3, do total de 214 páginas
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Você viu 6, do total de 214 páginas
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
Você viu 9, do total de 214 páginas
Faça como milhares de estudantes: teste grátis o Passei Direto

Esse e outros conteúdos desbloqueados
16 milhões de materiais de várias disciplinas
Impressão de materiais
Agora você pode testar o

Passei Direto grátis

Você também pode ser Premium ajudando estudantes
E aí, curtiu este material?
Ajude a incentivar outros estudantes a melhorar o conteúdo
Gostou desse material? Compartilhe! 🧡
Bioengenharia e Bioinformática

40 Materiais compartilhados
Baixe o app para aproveitar ainda mais
Leia os materiais offline, sem usar a internet. Além de vários outros recursos!
Prévia do material em texto
Use R !
Eric D. Kolaczyk
Gábor Csárdi 
Statistical 
Analysis of 
Network Data 
with R
Use R!
Series Editors:
Robert Gentleman Kurt Hornik Giovanni Parmigiani
For further volumes:
http://www.springer.com/series/6991
Use R!
Albert: Bayesian Computation with R (2nd ed. 2009)
Bivand/Pebesma/Go´mez-Rubio: Applied Spatial Data Analysis with R (2nd ed.
2013)
Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis:
With R and GGobi
Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies
Paradis: Analysis of Phylogenetics and Evolution with R (2nd ed. 2012)
Pfaff: Analysis of Integrated and Cointegrated Time Series with R (2nd ed. 2008)
Sarkar: Lattice: Multivariate Data Visualization with R
Spector: Data Manipulation with R
Eric D. Kolaczyk • Ga´bor Csa´rdi
Statistical Analysis of
Network Data with R
123
Eric D. Kolaczyk
Department of Mathematics and Statistics
Boston University Professor
Boston, MA, USA
Ga´bor Csa´rdi
Department of Statistics
Harvard University Research Associate
Cambridge, MA, USA
ISSN 2197-5736 ISSN 2197-5744 (electronic)
ISBN 978-1-4939-0982-7 ISBN 978-1-4939-0983-4 (eBook)
DOI 10.1007/978-1-4939-0983-4
Springer New York Heidelberg Dordrecht London
Library of Congress Control Number: 2014936989
© Springer Science+Business Media New York 2014
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered
and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of
this publication or parts thereof is permitted only under the provisions of the Copyright Law of the
Publisher’s location, in its current version, and permission for use must always be obtained from Springer.
Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations
are liable to prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of pub-
lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Pour Jose´e, sans qui ce livre n’aurait pas vu
le jour—E.D.K.
Z-nak, a gye´ma´nte´rt e´s az aranye´rt—G.CS.
Preface
Networks and network analysis are arguably one of the largest recent growth areas
in the quantitative sciences. Despite roots in social network analysis going back
to the 1930s and roots in graph theory going back centuries, the phenomenal rise
and popularity of the modern field of ‘network science’, as it is sometimes called,
is something that likely could not have been predicted 10–15 years ago. Networks
have permeated everyday life, far beyond the realm of research and methodology,
through now-familiar realities like the Internet, social networks, viral marketing,
and more.
Measurement and data analysis are integral components of network research. As
a result, there is a critical need for all sorts of statistics for network analysis, both
common and sophisticated, ranging from applications, to methodology, to theory.
As with other areas of statistics, there are both descriptive and inferential statistical
techniques available, aimed at addressing a host of network-related tasks, including
basic visualization and characterization of network structure; sampling, modeling,
and inference of network topology; and modeling and prediction of network-indexed
processes, both static and dynamic.
Software for performing many such network-related analyses is now available
in various languages and environments, across different platforms. Not surprisingly,
the R community has been particularly active in the development of software for do-
ing statistical analysis of network data. As of this writing there are already dozens
of contributed R packages devoted to some aspect of network analysis. Together,
these packages address tasks ranging from standard manipulation, visualization, and
characterization of network data (e.g., igraph, network, and sna), to modeling of
networks (e.g., igraph, eigenmodel, ergm, and mixer), to network topology infer-
ence (e.g., glasso and huge). In addition, there is a great deal of analysis that can be
done using tools and functions from the R base package.
In this book we aim to provide an easily accessible introduction to the statistical
analysis of network data, by way of the R programming language. As a result, this
book is not, on the one hand, a detailed manual for using the various R packages en-
countered herein, nor, on the other hand, does it provide exhaustive coverage of the
conceptual and technical foundations of the topic area. Rather, we have attempted
vii
viii Preface
to strike a balance between the two and, in addition, to do so using a (hopefully!)
optimal level of brevity. Accordingly, we envision the book being used, for example,
by (i) statisticians looking to begin engaging in the statistical analysis of network
data, whether at a research level or in conjunction with a new collaboration, and
hoping to use R as a natural segue, (ii) researchers from other similarly quantitative
fields (e.g., computer science, statistical physics, economics, etc.) working in the
area of complex networks, who seek to get up to speed relatively quickly on how to
do statistical analyses (both familiar and unfamiliar) of network data in R, and (iii)
practitioners in applied areas wishing to get a foothold into how to do a specific type
of analysis relevant to a particular application of interest.
More generally, the book has been written at a level aimed at graduate students
and researchers in quantitative disciplines engaged in the statistical analysis of net-
work data, although advanced undergraduates already comfortable with R should
find much of the book fairly accessible as well. At present, therefore, we antici-
pate the book being of interest to readers not only in statistics, of course, but also
in areas like computational biology, computer science and machine learning, eco-
nomics, neuroscience, quantitative finance, signal processing, statistical physics,
and the quantitative social sciences.
There are a number of people we wish to thank, whose help at various stages
of development and writing is greatly appreciated. Thanks go to the editorial team
at Springer for their enthusiasm in encouraging us to take on this project and for
their feedback along the way and to the students in the course Statistical Analysis
of Network Data (MA703) at Boston University in Fall 2013, for their comments
on many of the earlier chapters. Special thanks are also due to Xinyu Kang, Heather
Shappell, and Yaonan Zhang, who went through the entire first complete draft, care-
fully reading every chapter and testing the code blocks throughout the book. We are
grateful as well to Christophe Ambroise, AlainBarrat, Mark Coates, Suchi Gopal,
Emmanuel Lazega, and Petra Staufer for kindly making available their data. More
broadly, we would like to express our appreciation in general for the countless hours
of effort invested by the developers of the many R packages that we have made use
of throughout the pages of this book. Without their work, the breadth and scope of
our own here would be significantly reduced. Finally, although surely still inade-
quate, we wish to express our deepest gratitude to our respective families for their
love, patience, and support throughout the writing of this book.
All code and data used in this book have been made available in the R package
sand, distributed through the CRAN archive.
Boston, MA, USA Eric D. Kolaczyk
Cambridge, MA, USA Ga´bor Csa´rdi
March, 2014
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Why Networks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Types of Network Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Visualizing and Characterizing Networks . . . . . . . . . . . . . . . . 3
1.2.2 Network Modeling and Inference . . . . . . . . . . . . . . . . . . . . . . . 5
1.2.3 Network Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Why Use R for Network Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.4 About This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.5 About the R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2 Manipulating Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Creating Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Undirected and Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Representations for Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2.3 Operations on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Decorating Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.3.1 Vertex, Edge, and Graph Attributes . . . . . . . . . . . . . . . . . . . . . 18
2.3.2 Using Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.4 Talking About Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.1 Basic Graph Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.4.2 Special Types of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3 Visualizing Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Elements of Graph Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3 Graph Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Decorating Graph Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.5 Visualizing Large Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
ix
x Contents
3.6 Using Visualization Tools Outside of R . . . . . . . . . . . . . . . . . . . . . . . . 39
3.7 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4 Descriptive Analysis of Network Graph Characteristics . . . . . . . . . . . . 43
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Vertex and Edge Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.1 Vertex Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.2 Vertex Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Characterizing Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.3 Characterizing Network Cohesion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.1 Subgraphs and Censuses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3.2 Density and Related Notions of Relative Frequency . . . . . . . 55
4.3.3 Connectivity, Cuts, and Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.4 Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.4.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.4.2 Spectral Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.4.3 Validation of Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 64
4.5 Assortativity and Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
5 Mathematical Models for Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . 69
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
5.2 Classical Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.3 Generalized Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Network Graph Models Based on Mechanisms . . . . . . . . . . . . . . . . . . 74
5.4.1 Small-World Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.2 Preferential Attachment Models . . . . . . . . . . . . . . . . . . . . . . . . 77
5.5 Assessing Significance of Network Graph Characteristics . . . . . . . . . 79
5.5.1 Assessing the Number of Communities in a Network . . . . . . 79
5.5.2 Assessing Small World Properties . . . . . . . . . . . . . . . . . . . . . . 81
5.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Statistical Models for Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Exponential Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.2.2 Specifying a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2.3 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
6.2.4 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Network Block Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
6.3.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.3.2 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.3 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Latent Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.4.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
Contents xi
6.4.2 Specifying the Latent Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.4.3 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 104
6.4.4 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
6.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7 Network Topology Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
7.3 Association Network Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.3.1 Correlation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
7.3.2 Partial Correlation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.3.3 Gaussian Graphical Model Networks . . . . . . . . . . . . . . . . . . . . 126
7.4 Tomographic Network Topology Inference . . . . . . . . . . . . . . . . . . . . . 129
7.4.1 Constraining the Problem: Tree Topologies . . . . . . . . . . . . . . 129
7.4.2 Tomographic Inference of Tree Topologies:
An Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
8 Modeling and Prediction for Processes on Network Graphs . . . . . . . . . 135
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
8.2 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
8.3 Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.3.1 General Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.3.2 Auto-Logistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
8.3.3 Inference and Prediction for Auto-logistic Models . . . . . . . . . 144
8.3.4 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
8.4 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.4.1 Designing Kernels on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.4.2 Kernel Regression on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.5 Modeling and Prediction for Dynamic Processes . . . . . . . . . . . . . . . . 154
8.5.1 Epidemic Processes: An Illustration . . . . . . . . . . . . . . . . . . . . . 154
8.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
9 Analysis of Network Flow Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
9.2 Modeling Network Flows: Gravity Models . . . . . . . . . . . . . . . . . . . . . 162
9.2.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
9.2.2 Inference for Gravity Models . . . . . . . . . . . . . . . . . . . . . . . . . . 166
9.3 Predicting Network Flows: Traffic Matrix Estimation . . . . . . . . . . . . 170
9.3.1 An Ill-Posed Inverse Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 171
9.3.2 The Tomogravity Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
9.4 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
xii Contents
10 Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
10.2 Representation and Manipulation of Dynamic Networks . . . . . . . . . . 180
10.3 Visualization of Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 186
10.4 Characterization of Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . 191
10.5 Modeling Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
Biographies
xiii
Eric D. Kolaczyk is a professor of statistics and Director of the Program in
Statistics, in the Department of Mathematics and Statistics at Boston University,
where he also is an affiliated faculty member in the Bioinformatics Program, the
Division of Systems Engineering, and the Program in Computational Neuroscience.
His publications on network-based topics, beyond the development of statistical
methodology and theory, include work on applications ranging from the detection
of anomalous traffic patterns in computer networks to the prediction of biological
function in networks of interacting proteins to the characterization of influence of
groups of actors in social networks. He is an elected fellow of the American Statis-
tical Association (ASA) and an elected senior member of the Institute of Electrical
and Electronics Engineers (IEEE).
Ga´bor Csa´rdi is a research associate at the Department of Statistics at Harvard
University, Cambridge, MA, USA. He holds a Ph.D. in computer science from
Eo¨tvo¨s University, Hungary. His research includes applications of network analy-
sis in biology and social sciences, bioinformatics and computational biology, and
graph algorithms. He created the igraph software package in 2005 and has been one
of the lead developers since then.
Chapter 1
Introduction
1.1 Why Networks?
The oft-repeated statement that “we live in a connected world” perhaps best
captures, in its simplicity, why networks have come to hold such interest in re-
cent years. From on-line social networks like Facebook to the World Wide Web and
the Internet itself, we are surrounded by examples of ways in which we interact
with each other. Similarly, we are connected as well at the level of various human
institutions (e.g., governments), processes (e.g., economies), and infrastructures
(e.g., the global airline network). And, of course, humans are surely not unique
in being members of various complex, inter-connected systems. Looking at the
natural world around us, we see a wealth of examples of such systems, from en-
tire eco-systems, to biological food webs, to collections of inter-acting genes or
communicating neurons.
The image of a network—that is, essentially, something resembling a net—is a
natural one to use to capture the notion of elements in a system and their inter-
connectedness. Note, however, that the term ‘network’ seems to be used in a variety
of ways, at various levels of formality. The Oxford English Dictionary, for example,
defines the word network in its most general form simply as “a collection of inter-
connected things.” On the other hand, frequently ‘network’ is used inter-changeably
with the term ‘graph’ since, for mathematical purposes, networks are most com-
monly represented in a formal manner using graphs of various kinds. In an effort to
emphasize the distinction between the general concept and its mathematical formal-
ization, in this book we will use the term ‘network’ in its most general sense above,
and—at the risk of the impression of a slight redundancy—we will sometimes for
emphasis refer to a graph representing such a network as a ‘network graph.’
The seeds of network-based analysis in the sciences, particularly its mathematical
foundation of graph theory, are often placed in the 1735 solution of Euler to the now
famous Ko¨nigsberg bridge problem, in which he proved that it was impossible to
walk the seven bridges of that cityin such a way as to traverse each only once.
Since then, particularly since the mid-1800s onward, these seeds have grown in a
E.D. Kolaczyk and G. Csa´rdi, Statistical Analysis of Network Data with R, Use R! 65,
DOI 10.1007/978-1-4939-0983-4 1, © Springer Science+Business Media New York 2014
1
2 1 Introduction
number of key areas. For example, in mathematics the formal underpinnings were
systematically laid, with Ko¨nig [92] cited as the first key architect. The theory of
electrical circuits has always had a substantial network component, going back to
work of Kirchoff, and similarly the study of molecular structure in chemistry, going
back to Cayley. As the fields of operations research and computer science grew
during the mid-1900s, networks were incorporated in a major fashion in problems
involving transportation, allocation, and the like. And similarly during that time
period, a small subset of sociologists, taking a particularly quantitative view towards
the topic of social structure, began developing the use of networks in characterizing
interactions within social groups.
More recently—starting, perhaps, in the early to mid-1990s—there has been an
explosion of interest in networks and network-based approaches to modeling and
analysis of complex systems. Much of the impetus for this growth derives from
work by researchers in two particular areas of science: statistical physics and com-
puter science. To the former can be attributed a seminal role in encouraging what
has now become a pervasive emphasis across the sciences on understanding how
the interacting behaviors of constituent parts of a whole system lead to collective
behavior and systems-level properties or outcomes. Indeed the term complex system
was coined by statistical physicists, and a network-based perspective has become
central to the analysis of complex systems. To the latter can be attributed much of
the theory and methodology for conceptualizing, storing, manipulating, and doing
computations with networks and related data, particularly in ways that enable effi-
cient handling of the often massive quantities of such data. Moreover, information
networks (e.g, the World Wide Web) and related social media applications (e.g.,
Twitter), the development of which computer scientists have played a key role, are
examples of some of the most studied of complex systems (arguably reflecting our
continued fascination with studying ourselves!).
More broadly, a network-based perspective recently has been found to be useful
in the study of complex systems across a diverse range of application areas. These
areas include computational biology (e.g., studying systems of interacting genes,
proteins, chemical compounds, or organisms), engineering (e.g., establishing how
best to design and deploy a network of sensing devices), finance (e.g., studying the
interplay among, say, the world’s banks as part of the global economy), marketing
(e.g., assessing the extent to which product adoption can be induced as a type of
‘contagion’), neuroscience (e.g., exploring patterns of voltage dynamics in the brain
associated with epileptic seizures), political science (e.g., studying how voting pref-
erences in a group evolve in the face of various internal and external forces), and
public health (e.g., studying the spread of infectious disease in a population, and
how best to control that spread).
In general, two important contributing factors to the phenomenal growth of
interest in networks are (i) an increasing tendency towards a systems-level perspec-
tive in the sciences, away from the reductionism that characterized much of the
previous century, and (ii) an accompanying facility for high-throughput data collec-
tion, storage, and management. The quintessential example is perhaps that of the
changes in biology over the past 10–20 years, during which the complete mapping
1.2 Types of Network Analysis 3
of the human genome, a triumph of computational biology in and of itself, has now
paved the way for fields like systems biology to be pursued aggressively, wherein a
detailed understanding is sought of how the components of the human body, at the
genetic level and higher, work together.
Ultimately, the study of systems such as those just described is accompanied by
measurement and, accordingly, the need for statistical analysis. The focus of this
book is on how to use tools in R to do statistical analysis of network data. More
specifically, we aim to present tools for performing what are arguably a core set of
analyses of measurements that are either of or from a system conceptualized as a
network.
1.2 Types of Network Analysis
Network data are collected daily in a host of different areas. Each area, naturally,
has its own unique questions and problems under study. Nevertheless, from a sta-
tistical perspective, there is a methodological foundation that has emerged, com-
posed of tasks and tools that are each common to some non-trivial subset of research
areas involved with network science. Furthermore, it is possible—and indeed quite
useful—to categorize many of the various tasks faced in the analysis of network
data across different domains according to a statistical taxonomy. It is along the
lines of such a taxonomy that this book is organized, progressing from descriptive
methods to methods of modeling and inference, with the latter conveniently sepa-
rated into two sub-areas, corresponding to the modeling and inference of networks
themselves versus processes on networks. We illustrate with some examples.
1.2.1 Visualizing and Characterizing Networks
Descriptive analysis of data (i.e., as opposed to statistical modeling and inference)
typically is one of the first topics encountered in a standard introductory course in
statistics. Similarly, the visualization and numerical characterization of a network
usually is one of the first steps in network analysis. Indeed, in practice, descriptive
analyses arguably constitute the majority of network analyses published.
Consider the network in Fig. 1.1. Shown is a visualization1 of part of a dataset
on collaborative working relationships among members of a New England law firm,
collected by Lazega [98]. These data were collected for the purpose of studying
cooperation among social actors in an organization, through the exchange of various
types of resources among them. The organization observed was a law firm, consist-
ing of over 70 lawyers (roughly half partners and the other half associates) in three
offices located in three different cities. Relational data reflecting resource exchange
1 The R code for generating this visualization is provided in Chap. 3.
4 1 Introduction
were collected, and additional attribute information was recorded for each lawyer,
including type of practice, gender, and seniority. See Lazega and Pattison [99] for
additional details.
ll
ll
ll
ll
ll
ll
ll
ll
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
2627
28
29
30
3132
33
34
35
36
Fig. 1.1 Visualization of Lazega’s network of collaborative working relationships among lawyers.
Vertices represent partners and are labeled according to their seniority (with 1 being most senior).
Vertex colors (i.e., red, blue, and yellow) indicate three different office locations, while vertex shape
corresponds to the type of practice [i.e., litigation (circle) and corporate (square)]. Vertex area is
proportional to number of years with the law firm. Edges indicate collaboration between partners.
There are three female partners (i.e., those with seniority labels 27, 29, and 34); the rest are male
The visual summary in Fig. 1.1 manages to combine a number of important
aspects of these data in one diagram. A graph is used to represent the network,
with vertices correspondingto lawyers, and edges, to collaboration between pairs
of lawyers. In addition, differences in vertex color, shape, size, and label are used
to indicate office location, type of practice, years with the law firm, and seniority.
However, this is far from the only manner in which to visualize these data. Decid-
ing how best to do so is itself both an art and science. Furthermore, visualization
is aided here by the fact that the network of lawyers is so small. Suppose instead
our interest lay in a large on-line social network, such as the Facebook network of
people that have ‘friended’ each other. With well over 1 billion active users reported
as of this writing, it is impossible to display this network in an analogous manner,
with every individual user and their friendships evident. A similar statement often
can be made in biology, for example, such as for networks of proteins and their
1.2 Types of Network Analysis 5
affinity for binding or of genes and their regulatory relationships. As a result, the
visualization of large networks is a separate challenge of its own. We will look at
tools for network visualization in Chap. 3.
Characterization of network data through numerical summaries is another
important aspect of descriptive analysis for networks. However, unlike in an int-
roductory statistics course, where the types of summaries encountered typically
are just simple measures of center (e.g., mean, median, etc.) and dispersion (e.g.,
standard deviation, range, etc.) for a set of real numbers, summary measures for
networks necessarily seek to capture characteristics of a graph. For example, for
the lawyer data in Fig. 1.1, it is natural to ask to what extent two lawyers that
both work with a third lawyer are likely to work with each other as well. This
notion corresponds to the social network concept of transitivity and can be captured
numerically through an enumeration of the proportion of vertex triples that form
triangles (i.e., all three vertex pairs are connected by edges), typically summarized
in a so-called clustering coefficient. In this case, the relevant characterization is an
explicit summary of network structure. On the other hand, given the two types of
lawyers represented in these data (i.e., corporate and litigation), it is also natural
to ask to what extent lawyers of each type collaborate with those of same versus
different types. This notion corresponds to another social network concept—that of
assortativity—and can be quantified by a type of correlation statistic (the so-called
assortativity coefficient), in which labels of connected pairs of vertices are com-
pared. In this case, the focus is on an attribute associated with network vertices (i.e.,
lawyer practice) and the network structure plays a comparatively more implicit role.
There are a host of network summaries available, with more being defined reg-
ularly. A variety of such measures are covered in Chap. 4. These range from char-
acterization of properties associated with individual vertices or edges to properties
of subgraphs to properties of the graph as a whole. The trick to employing net-
work characterizations in a useful fashion generally is in matching the question(s) of
interest in an underlying complex system with an appropriate summary measure(s)
of the corresponding network.
1.2.2 Network Modeling and Inference
Beyond asking what an observed network looks like and characterizing its structure,
at a more fundamental level we may be interested in understanding how it may have
arisen. That is, we can conceive of the network as having resulted from some under-
lying processes associated with the complex system of interest to us and ask what
are the essential aspects of these processes. In addition, the actual manner in which
the network was obtained, i.e., the corresponding measurement and construction
process, may well be important to take into consideration. Such concerns provide
the impetus for network modeling and associated tools of statistical inference.
Network modeling has received much attention. Broadly speaking, there are
two classes of network models: mathematical and statistical. By ‘mathematical
6 1 Introduction
models’ we mean models specified through (typically) simple probabilistic rules
for generating a network, where often the rules are defined in an attempt to capture
a particular mechanism or principle (e.g., ‘the rich get richer’). In contrast, by ‘sta-
tistical models’ we mean models (often probabilistic as well) specified at least in
part with the intention that they be fit to observed data—allowing, for example, the
evaluation of the explanatory power of certain variables on edge formation in the
network—and that the fit be effected and assessed using formal principles of statis-
tical inference. While certainly there is some overlap between these two classes of
models, the relevant literatures nevertheless are largely distinct.
The simplest example of a mathematical network model is one in which edges
are assigned randomly to pairs of vertices based on the result of a collection of
independent and identically distributed coin tosses—one toss for each vertex pair.
Corresponding to a variant of the famous Erdo˝s–Re´nyi formulation of a random
graph, this model has been studied extensively since the 1960s. Its strength is in
the fact not only that its properties are so well understood (e.g., in terms of how,
for example, cohesive structure emerges as a function of the probability of an edge)
but also in the role it plays as a standard against which to compare other, more
complicated models.
From the statistical perspective, however, such mathematical models generally
are too simple to be a good match to real network data. Nevertheless, they are not
only useful in allowing for formal insight to be gained into how specific mechanisms
of edge formation may affect network structure, but they also are used commonly
in defining null classes of networks against which to assess the ‘significance’ of
structural characteristics found in an observed network. For example, in Fig. 1.1,
the clustering coefficient for the network of lawyer collaboration turns out to be just
slightly less than 0.40. Is this value large or not? It is difficult to say without a frame
of reference. One way of imposing such a frame of reference is to define a simple
but comparable class of networks (e.g., all graphs with the same numbers of vertices
and edges) and look at the distribution of the clustering coefficient across all such
graphs. In the spirit of a permutation test and similar data re-use methods, examining
where our observed value of 0.40 falls within this distribution gives us a sense of
how unusual it is in comparison to values for networks across the specified class.
This approach has been used not only for examining clustering in social networks
but also, for example, in identifying recurring sub-graph patterns (aka ‘motifs’) in
gene regulatory networks and functional connectivity networks in neuroscience. We
will explore the use of mathematical network models for such purposes in Chap. 5.
There are a variety of statistical network models that have been offered in the
literature, many of which parallel well-known model classes in classical statis-
tics. For example, exponential random graph models are analogous to generalized
linear models, being based on an exponential family form. Similarly, latent net-
work models, in specifying that edges may arise at least in part from an unmeasured
(and possibly unknown) variable(s), directly parallel the use of latent variables in
hierarchical modeling. Finally, stochastic block models may be viewed as a form of
mixture model. Nevertheless, importantly, the specification of such models and their
fitting typically are decidedly less standard, given the usually high-dimensional and
1.2 Types of Network Analysis 7
dependent natureof the data. Such models have been used in a variety of settings,
from the modeling of collaborations like that in Lazega’s lawyer network to the pre-
diction of protein pairs likely to have an affinity for physically binding. We will look
at examples of the use of each of these classes of models in Chap. 6.
1.2.3 Network Processes
As objects for representing interactions among elements of a complex system, net-
work graphs are frequently the primary focus of network analysis. On the other
hand, in many contexts it is actually some quantity (or attribute) associated with
each of the elements in the system that ultimately is of most interest. Nevertheless,
in such settings it often is not unreasonable to expect that this quantity be influenced
in an important manner by the interactions among the elements, and hence the net-
work graph may still be relevant for modeling and analysis. More formally, we can
picture a stochastic process as ‘living’ on the network and indexed by the vertices
in the network. A variety of questions regarding such processes can be interpreted
as problems of prediction of either static or dynamic network processes.
Returning to Fig. 1.1, for example, suppose that we do not know the practice
of a particular lawyer. It seems plausible to suspect that lawyers collaborate more
frequently with other lawyers in the same legal practice. If so, knowledge of col-
laboration may be useful in predicting practice. That is, for our lawyer of unknown
practice, we may be able to predict that practice with some accuracy if we know
(i) the vertices that are neighbors of that lawyer in the network graph, and (ii) the
practice of those neighbors.
While in fact this information is known for all lawyers in our data set, in other
contexts we generally are not so fortunate. For example, in on-line social networks
like Facebook, where users can choose privacy settings of various levels of sever-
ity, it can be of interest (e.g., for marketing purposes) to predict user attributes
(e.g., perhaps indicative of consumer preferences) based on that of ‘friends’. Simi-
larly, in biology, traditionally the functional role of proteins (e.g., in the context of
communication within the cell or in controlling cell metabolism) has been estab-
lished through labor-intensive experimental techniques. Because proteins that work
together to effect certain functions often have a higher affinity for physically bind-
ing to each other, the clusters of highly connected proteins in protein–protein int-
eraction networks can be exploited to make computational predictions of protein
function, by propagating information on proteins of known function to their neigh-
bors of unknown function in the graph. We will encounter methods for making such
predictions of static network processes in Chap. 8.
Ultimately, many (most?) of the systems studied from a network-based perspec-
tive are intrinsically dynamic in nature. Not surprisingly, therefore, many processes
defined on networks are more accurately thought of as dynamic, rather than static,
processes. Consider, for example, the context of public health and disease control.
Understanding the spread of a disease (e.g., the H1N1 flu virus) through a population
8 1 Introduction
can be modeled as the diffusion of a binary dynamic process (indicating infected or
not infected) through a network graph, in which vertices represent individuals, and
edges, contact between individuals. Mathematical modeling (using both determin-
istic and stochastic models) arguably is still the primary tool for modeling such
processes, but network-based statistical models gradually are seeing increased use,
particularly as better and more extensive data on contact networks becomes avail-
able. Another setting in which analogous ideas and models are used—and wherein
it is possible to collected substantially more data—is in modeling the adoption of
new products among consumers (e.g., toward predicting the early adopters of, say,
the next iPhone released). We look briefly at techniques in this emerging area of
modeling and prediction of dynamic network processes in Chap. 8 as well.
In a related direction, we will look at statistical methods for analyzing network
flows in Chap. 9. Referring to the movement of something—materials, people, or
commodities, for example—from origin to destination, flows are a special type
of dynamic process fundamental to transportation networks (e.g., airlines moving
people) and communication networks (e.g., the Internet moving packets of informa-
tion), among others.
Finally, in Chap. 10, we will look briefly at the emerging area of dynamic network
analysis, wherein the network, the process(es) on the network, or indeed both, are
expected to be evolving in time.
1.3 Why Use R for Network Analysis?
Various tools are available for network analysis. Some of these are standalone pro-
grams, like the classic Pajek tool or the newer Gephi, while others are embedded
into a programming environment and are essentially used as a programming library.
Some examples of the latter are NetworkX in Python and igraph in R.
R is rapidly becoming the de facto standard of statistical research, if this has
not happened already. The majority of new statistical developments are immedi-
ately available in R and no other programming languages or software packages.
While network analysis is an interdisciplinary field, it is not an exception from
this trend. Several R extension packages implement one or more network analysis
algorithms or provide general tools for manipulating network data and implement
network algorithms. R supports high quality graphics and virtually any common
graphical file format. R is extensible with currently over 5,000 extension packages,
and this number is growing.
Being a complete programming language, R offers great flexibility for network
research. New network analysis algorithms can be prototyped rapidly by building on
the existing network science extension packages, the most commonly used one of
which is the igraph package. In addition to implementations of classic and recently
published methods of network analysis, igraph provides tools to import, manipu-
late and visualize graphs, and can be used as a platform for new algorithms. It is
currently used in over a hundred other R packages and will be featured often in this
book.
1.4 About This Book 9
1.4 About This Book
Our goal in writing this book is to provide an easily accessible introduction to
the statistical analysis of network data, by way of the R programming language.
The book has been written at a level aimed at graduate students and researchers in
quantitative disciplines engaged in the statistical analysis of network data, although
advanced undergraduates already comfortable with R should find the book fairly
accessible as well. At present, therefore, we anticipate the book being of interest to
readers in statistics, of course, but also in areas like computational biology, com-
puter science and machine learning, economics, neuroscience, quantitative finance,
signal processing, statistical physics, and the quantitative social sciences.
The material in this book is organized to flow from descriptive statistical methods
to topics centered on modeling and inference with networks, with the latter sepa-
rated into two sub-areas, corresponding first to the modeling and inference of net-
works themselves, and then, to processes on networks. More specifically, we begin
by covering tools for the manipulation of network data in Chap. 2. The visualization
and characterization of networks is then addressed in Chaps. 3 and 4, respectively.
Next, the topics of mathematical and statistical network modeling are investigated,
in Chaps. 5 and 6, respectively. In Chap. 7 the focus is on the special case of network
modeling wherein the network topology must itself be inferred. Network processes,both static and dynamic, are addressed in Chap. 8, while network flows are featured
in Chap. 9. Finally, in Chap. 10 a brief look is provided at how the topics of these
earlier chapters are beginning to be extended to the context of dynamic networks.
It is not our intent to provide a comprehensive coverage here of the conceptual
and technical foundations of the statistical analysis of network data. For something
more of that nature, we recommend the book by Kolaczyk [91]. There are also a
number of excellent treatments of network analysis more generally, written from
the perspective of various other fields. These include the book by Newman [118],
from the perspective of statistical physics, and the book by Jackson [81], from the
perspective of economics. The book by Easley and Kleinberg [49], written at an
introductory undergraduate level, provides a particularly accessible treatment of
networks and social behavior. Finally, the book by Wasserman and Faust [144],
although less recent than these others, is still an important resource, particularly
regarding basic tools for characterization and classical network modeling, from a
social network perspective.
On the other hand, neither is this book a complete manual for the various R
packages encountered herein. The reader will want to consult the manuals for these
packages themselves for details omitted from our coverage. In addition, it should
be noted that we assume a basic familiarity with R and the base package. For gen-
eral background on R, particularly for those not already familiar with the software,
there are any number of resources to be had, including the tutorial by Venebles and
Smith [143], the beginner’s guide by Zuur et al. [152], and the book by Crawley [36].
Ultimately, we have attempted to strike a reasonable balance between the con-
cepts and technical background, on the one hand, and software details, on the other.
Accordingly, we envision the book being used, for example, by (i) statisticians
10 1 Introduction
looking to begin engaging in the statistical analysis of network data, whether at
a research level or in conjunction with a new collaboration, and hoping to use R as
a natural segue; (ii) researchers from other similarly quantitative fields (e.g., com-
puter science, statistical physics, economics, etc.) working in the area of complex
networks, who seek to get up to speed relatively quickly on how to do statistical
analyses (both familiar and unfamiliar) of network data in R, and (iii) practitioners
in applied areas wishing to get a foothold into how to do a specific type of analysis
relevant to a particular application of interest.
1.5 About the R code
The R code in this book requires a number of data sets and R packages. We have
collected the data sets into a standalone R package called sand (i.e., for ‘statistical
analysis of network data’). If the reader aims to run the code while reading the book,
she needs to install this package from CRAN.2 Once installed, the package can then
be used to also install all other R packages required in the code chunks of the book:
#1.1 1 > install.packages("sand")
2 > library(sand)
3 > install_sand_packages()
To avoid the need for constant typing while reading the book, we have extracted
all R code and placed it online, at http://github.com/kolaczyk/sand. We suggest that
the reader open this web page to simply copy-and-paste the code chunks into an R
session. All code is also available in the sand package. In the package manual are
details on how to run the code:
#1.2 1 > ?sand
Note that code chunks are not independent, in that they often rely upon the results
of previous chunks, within the same chapter. On the other hand, code in separate
chapters is meant to be run separately—in fact, restarting R at the beginning of each
chapter is the best way to ensure a clean state to start from. For example, to run the
code in Chap. 3, it is not required (nor recommended) to first run all code in Chaps. 1
and 2, but it is prudent to run any given code block in Chap. 3 only after having run
all other relevant code blocks earlier in the chapter.
In writing this book, we have sought to strike an optimal balance between repro-
ducibility and exposition. It should be possible, for example, for the reader to exactly
reproduce all numerical results presented herein (i.e., summary statistics, estimates,
etc.). In order to ensure this capability, we have set random seeds where neces-
sary (e.g., prior to running methods relying on Markov chain Monte Carlo). The
reader adapting our code for his own purposes will likely, of course, want to use his
own choice of seeds. In contrast, however, we have compromised when it comes to
2 The Comprehensive R Archive Network (CRAN) is a worldwide network of ftp and web servers
from which R code and related materials may be downloaded. See http://cran.r-project.org.
1.5 About the R code 11
visualizations. In particular, while the necessary code has been supplied to repeat all
visualizations shown in this book, in most cases we have not attempted to wrap the
main code with all necessary supplemental code (e.g., random seeds, margin set-
tings, space between subplots, etc.) required for exact reproducibility, which would
have come at what we felt was an unacceptable cost of reduction in readability and
general aesthetics.
Given the fact that the R environment in general, and the R packages used in
this book more specifically, can be expected to continue to evolve over time, we
anticipate that we will need to update the online version of the code (and the version
in the sand package) in the future, to make it work with newer versions of R and
with updated packages. Therefore, if the reader finds that the code here in the book
appears to fail, the online version should be most current and working.
On a related note, as can be seen above, the code chunks in the book are num-
bered, by chapter and within each chapter. This convention was adopted mainly to
ensure ease in referencing code chunks from the website and vice versa. Even if we
update the code on the web site, our intention is that the chunk numbers will stay
the same.
In addition to having the most current version of the code chunks, the web page
also includes an issue tracker, for reporting mistakes in the text or the code, and for
general discussion.
Chapter 2
Manipulating Network Data
2.1 Introduction
We have seen that the term ‘network,’ broadly speaking, refers to a collection of
elements and their inter-relations. The mathematical concept of a graph lends preci-
sion to this notion. We will introduce the basic elements of graphs—both undirected
and directed—in Sect. 2.2 and discuss how to generate network graphs, both ‘by
hand’ and from network data of various forms.
As a representation of a complex system, a graph alone (i.e., as merely a col-
lection of vertices and edges) is often insufficient. Rather, there may be additional
information important to the application at hand, in the form of variables that can be
indexed by the vertices (e.g., gender of members of a social network) or the edges
(e.g., average time required to traverse a link in a transportation network). Alterna-
tively, at a coarser level of granularity, it may be convenient to associate vertices
or edges with groups (e.g., all proteins in a protein–protein interaction network that
are involved with a certain type of signaling event in a cell). Indeed, we can imagine
potentially equipping vertices and edges with several variables of interest. Doing
so corresponds to the notion of decorating a network graph, which is discussed in
Sect. 2.3.
Finally, in using graphs to represent network data, a certain level of familiarity
with basic graph theoretic concepts, as well as an ability to assess certain basic
properties of graphs, is essential. We therefore devote Sect. 2.4 to a brief overviewof such concepts and properties, including a quick look at a handful of important
special classes of graphs.
For creating, decorating, and assessing basic properties of network graphs,
igraph is particularly useful.1 A library and R package for network analysis, igraph
contains a set of data types and functions for (relatively!) straightforward implemen-
tation and rapid prototyping of graph algorithms, and allows for the fast handling of
1 Alternatively, there is within the network and sna packages, found in the statnet suite, a similarly
rich set of tools for the manipulation and characterization of network graphs. These packages share
nontrivial overlap with igraph.
E.D. Kolaczyk and G. Csa´rdi, Statistical Analysis of Network Data with R, Use R! 65,
DOI 10.1007/978-1-4939-0983-4 2, © Springer Science+Business Media New York 2014
13
14 2 Manipulating Network Data
large graphs (e.g., on the order of millions of vertices and edges). As such, its use
will figure heavily in this and the following two chapters (i.e., where the emphasis is
on descriptive methods). The fact that igraph was developed as a research tool and
that its focus originally was to be able to handle large graphs efficiently, means that
its learning curve used to be somewhat steep. Recent versions do not necessarily
flatten the learning curve, but are nevertheless friendlier to the user, once she has
mastered the basics.
2.2 Creating Network Graphs
2.2.1 Undirected and Directed Graphs
Formally, a graph G = (V,E) is a mathematical structure consisting of a set V of
vertices (also commonly called nodes) and a set E of edges (also commonly called
links), where elements of E are unordered pairs {u,v} of distinct vertices u,v ∈ V .
The number of vertices Nv = |V | and the number of edges Ne = |E| are sometimes
called the order and size of the graph G, respectively. Often, and without loss of
generality,2 we will label the vertices simply with the integers 1, . . . ,Nv, and the
edges, analogously.
In igraph there is an ‘igraph’ class for graphs.3 In this section, we will see a
number of ways to create an object of the igraph class in R, and various ways to
extract and summarize the information in that object.
For small, toy graphs, the function graph.formula can be used. For example,
#2.1 1 > library(igraph)
2 > g <- graph.formula(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6,
3 + 4-7, 5-6, 6-7)
creates a graph object g with Nv = 7 vertices
#2.2 1 > V(g)
2 Vertex sequence:
3 [1] "1" "2" "3" "4" "5" "6" "7"
and Ne = 10 edges
#2.3 1 > E(g)
2 Edge sequence:
3
4 [1] 2 -- 1
5 [2] 3 -- 1
6 [3] 3 -- 2
7 [4] 4 -- 2
2 Technically, a graph G is unique only up to relabellings of its vertices and edges that leave the
structure unchanged. Two graphs that are equivalent in this sense are called isomorphic.
3 The exact representation of ‘igraph’ objects is not visible for the user and is subject to change.
2.2 Creating Network Graphs 15
8 [5] 5 -- 3
9 [6] 5 -- 4
10 [7] 6 -- 4
11 [8] 7 -- 4
12 [9] 6 -- 5
13 [10] 7 -- 6
This same information, in a slightly more compressed format, is recovered easily
using the relevant structure command.
#2.4 1 > str(g)
2 IGRAPH UN-- 7 10 --
3 + attr: name (v/c)
4 + edges (vertex names):
5 1 -- 2, 3
6 2 -- 1, 3, 4
7 3 -- 1, 2, 5
8 4 -- 2, 5, 6, 7
9 5 -- 3, 4, 6
10 6 -- 4, 5, 7
11 7 -- 4, 6
A visual representation of this graph, generated simply through the command4
#2.5 1 > plot(g)
is shown in Fig. 2.1, on the left.
The character U seen accompanying the summary of g above indicates that our
graph is undirected, in that there is no ordering in the vertices defining an edge.
A graph G for which each edge in E has an ordering to its vertices (i.e., so that (u,v)
is distinct from (u,v), for u,v ∈V ) is called a directed graph or digraph. Such edges
are called directed edges or arcs, with the direction of an arc (u,v) read from left to
right, from the tail u to the head v. Note that digraphs may have two arcs between
a pair of vertices, with the vertices playing opposite roles of head and tail for the
respective arcs. In this case, the two arcs are said to be mutual.
Directed edges in graph.formula are indicated using a minus/plus conven-
tion. In Fig. 2.1, on the right, is shown an example of a digraph consisting of three
vertices, with two directed edges and one mutual edge.
#2.6 1 > dg <- graph.formula(1-+2, 1-+3, 2++3)
2 > plot(dg)
We note that in defining both of the graphs above we have used the standard con-
vention of labeling vertices with the numbers 1 through Nv, which is also the default
in igraph. In practice, however, we may already have natural labels, such as the
names of people in a social network, or of genes in a gene regulatory network. Such
labels can be used instead of the default choice by generating the graph with them
explicitly.
4 This is the most basic visualization. We will explore the topic of visualization on its own in more
depth in Chap. 3.
16 2 Manipulating Network Data
1
2
3
4
5
6
7
1
2
3
Fig. 2.1 Left: an undirected graph. Right: a directed graph
#2.7 1 > dg <- graph.formula(Sam-+Mary, Sam-+Tom, Mary++Tom)
2 > str(dg)
3 IGRAPH DN-- 3 4 --
4 + attr: name (v/c)
5 + edges (vertex names):
6 [1] Sam ->Mary Sam ->Tom Mary->Tom Tom ->Mary
Alternatively, vertex labels can be changed from the default after initially creating
the graph, by modifying the name vertex attribute of the graph object.
#2.8 1 > V(dg)$name <- c("Sam", "Mary", "Tom")
2.2.2 Representations for Graphs
Realistically, we do not usually expect to enter a graph by hand, since most net-
works encountered in practice have at least tens of vertices and edges, if not tens of
thousands (or even millions!). Rather, information for constructing a network graph
typically will be stored in a data file. At the most elementary level, there are three
basic formats: adjacency lists, edge lists, and adjacency matrices.
An adjacency list representation of a graph G is simply an array of size Nv,
ordered with respect to the ordering of the vertices in V , each element of which
is a list, where the ith list contains the set of all vertices j for which there is an edge
from i to j. This is the representation usually used by igraph, evident in printing the
output from the structure function str in the examples above.
An edge list is a simple two-column list of all vertex pairs that are joined by an
edge. In igraph, edge lists are used, for example, when printing the edge set E .
#2.9 1 > E(dg)
2 Edge sequence:
3
2.2 Creating Network Graphs 17
4 [1] Sam -> Mary
5 [2] Sam -> Tom
6 [3] Mary -> Tom
7 [4] Tom -> Mary
The function get.edgelist returns an edge list as a two-column R matrix.
Finally, graphs can also be stored in matrix form. The Nv ×Nv adjacency matrix
for a graph G = (V,E), say A, is defined so that
Ai j =
{
1, if {i, j} ∈ E ,
0, otherwise . (2.1)
In words, A is non-zero for entries whose row-column indices (i, j) correspond to
vertices in G joined by an edge, from i to j, and zero, for those that are not. The
matrix A will be symmetric for undirected graphs.
#2.10 1 > get.adjacency(g)
2 7 x 7 sparse Matrix of class "dgCMatrix"
3 1 2 3 4 5 6 7
4 1 . 1 1 . . . .
5 2 1 . 1 1 . . .
6 3 1 1 . . 1 . .
7 4 . 1 . . 1 1 1
8 5 . . 1 1 . 1 .
9 6 . . . 1 1 . 1
10 7 . . . 1 . 1 .
This last choice of representation is often a natural one, given that matrices are
fundamental data objects in most programming and software environments and that
network graphs frequently are encoded in statistical models through their adjacency
matrices. However, their use with the type of large, sparse networks commonly enc-
ountered in practice can be inefficient, unless coupled with the use of sparse matrix
tools.
In igraph, network data already loaded into R in these specific formats canbe used to generate graphs using the functions graph.adjlist, graph.
edgelist, and graph.adjacency, respectively. For data stored in a file,
the function read.graph can be used. In fact, this latter function not only sup-
ports the three formats discussed above, but also a number of other formats (e.g.,
such as GraphML, Pajek, etc.). Conversely, the function write.graph can be
used to save graphs in various formats.
2.2.3 Operations on Graphs
The graph(s) that we are able to load into R may not be the graph that we ultimately
want. Various operations on the graph(s) we have available may be necessary,
including extracting part of a graph, deleting vertices, adding edges, or even com-
bining multiple graphs.
18 2 Manipulating Network Data
The notion of a ‘part’ of a graph is captured through the concept of a subgraph.
A graph H = (VH ,EH) is a subgraph of another graph G = (VG,EG) if VH ⊆ VG
and EH ⊆ EG. Often we are interested in an induced subgraph of a graph G, i.e.,
a subgraph G′ = (V ′,E ′), where V ′ ⊆ V is a prespecified subset of vertices and
E ′ ⊆ E is the collection of edges to be found in G among that subset of vertices. For
example, consider the subgraph of g induced by the first five vertices.
#2.11 1 > h <- induced.subgraph(g, 1:5)
2 > str(h)
3 IGRAPH UN-- 5 6 --
4 + attr: name (v/c)
5 + edges (vertex names):
6 [1] 1--2 1--3 2--3 2--4 3--5 4--5
The inclusion or exclusion of vertices or edges in a graph G = (V,E) can be con-
ceived of as the application of addition or subtraction operators, respectively, to the
sets V and E . For example, the subgraph h generated just above could also have
been created from g by removing the vertices 6 and 7.
#2.12 1 > h <- g - vertices(c(6,7))
Similarly, g can be recovered from h by first adding these two vertices back in, and
then, adding the appropriate edges.
#2.13 1 > h <- h + vertices(c(6,7))
2 > g <- h + edges(c(4,6),c(4,7),c(5,6),c(6,7))
Finally, the basic set-theoretic concepts of union, disjoint union, intersection, dif-
ference, and complement all extend in a natural fashion to graphs. For example, the
union of two graphs, say H1 and H2, is a graph G in which vertices and edges are
included if and only if they are included in at least one of H1 or H2. For example, our
toy graph g may be created through the union of the (induced) subgraph h defined
above and a second appropriately defined subgraph.
#2.14 1 > h1 <- h
2 > h2 <- graph.formula(4-6, 4-7, 5-6, 6-7)
3 > g <- graph.union(h1,h2)
2.3 Decorating Network Graphs
2.3.1 Vertex, Edge, and Graph Attributes
At the heart of a network-based representation of data from a complex system will
be a graph. But frequently there are other relevant data to be had as well. From a
network-centric perspective, these other data can be thought of as attributes, i.e.,
values associated with the corresponding network graph. Equipping a graph with
such attributes is referred to as decorating the graph. Typically, the vertices or edges
of a graph (or both) are decorated with attributes, although the graph as a whole
2.3 Decorating Network Graphs 19
may be decorated as well. In igraph, the elements of graph objects (i.e., particularly
the vertex and edge sequences, and subsets thereof) may be equipped with attributes
simply by using the ‘$’ operator.
Vertex attributes are variables indexed by vertices, and may be of discrete or
continuous type. Instances of the former type include the gender of actors in a social
network, the infection status of computers in an Internet network in the midst of an
on-line virus (e.g., a worm), and a list of biological pathways in which a protein in
a protein–protein interaction network is known to participate, while an example of
the latter type is the voltage potential levels in the brain measured at electrodes in
an electrocorticogram (ECoG) grid. For example, recall that the names of the three
actors in our toy digraph are
#2.15 1 > V(dg)$name
2 [1] "Sam" "Mary" "Tom"
Their gender is added to dg as
#2.16 1 > V(dg)$gender <- c("M","F","M")
Note that the notion of vertex attributes also may be used advantageously to equip
vertices with properties during the course of an analysis, either as input to or output
from calculations within R. For example, this might mean associating the color red
with our vertices
#2.17 1 > V(g)$color <- "red"
to be used in plotting the graph (see Chap. 3). Or it might mean saving the values of
some vertex characteristic we have computed, such as the types of vertex centrality
measures to be introduced in Chap. 4.
Edge attributes similarly are values of variables indexed by adjacent vertex pairs
and, as with vertex attributes, they may be of both discrete or continuous type.
Examples of discrete edge attributes include whether one gene regulates another
in an inhibitory or excitatory fashion, or whether two countries have a friendly or
antagonistic political relationship. Continuous edge attributes, on the other hand,
often represent some measure of the strength of relationship between vertex pairs.
For example, we might equip each edge in a network of email exchanges (with ver-
tices representing email addresses) by the rate at which emails were exchanged over
a given period of time. Or we might define an attribute on edges between adjacent
stations in a subway network (e.g., the Paris metro) to represent the average time
necessary during a given hour of the day for trains to run from one to station to the
next.
Often edge attributes can be thought of usefully, for the purposes of various anal-
yses, as weights. Edge weights generally are non-negative, by convention, and often
are scaled to fall between zero and one. A graph for which the edges are equipped
with weights is referred to as a weighted graph.5
5 More generally, a weighted graph can be defined as a pair (V,E), where V is a set of vertices, as
before, but the elements in E are now non-negative numbers, with one such number for each vertex
pair. Analogously, the adjacency matrix A for a weighted graph is defined such that the entry Ai j
is equal to the corresponding weight for the vertex pair i and j.
20 2 Manipulating Network Data
#2.18 1 > is.weighted(g)
2 [1] FALSE
3 > wg <- g
4 > E(wg)$weight <- runif(ecount(wg))
5 > is.weighted(wg)
6 [1] TRUE
As with vertex attributes, edge attributes may also be used to equip edges with prop-
erties to be used in calls to other R functions, such as the plot function.
In principle, a graph itself may be decorated with an attribute, and indeed, it is
possible to equip graph objects with attributes in igraph. The most natural use of
this feature arguably is to equip a graph with relevant background information, such
as a name
#2.19 1 > g$name <- "Toy Graph"
or a seminal data source.
2.3.2 Using Data Frames
Just as network graphs typically are not entered by hand for graphs of any nontrivial
magnitude, but rather are encoded in data frames and files, so too attributes tend to
be similarly encoded. For example, in R, a network graph and all vertex and edge
attributes can be conveniently represented using two data frames, one with vertex
information, and the other, with edge information. Under this approach, the first
column of the vertex data frame contains the vertex names (i.e., either the default
numerical labels or symbolic), while each of the other columns contain the values
of a given vertex attribute. Similarly, the first two columns of the edge data frame
contain an edge list defining the graph, while each of the other columns contain the
values of a given edge attribute.
Consider, for example, the lawyer data set of Lazega [98], introduced in Chap. 1.
Collecting the information on collaborative working relationships, in the form of
an edge list, in the data frame elist.lazega, and the various vertex attribute
variables, in the data frame v.attr.lazega,they may be combined into a single
graph object in igraph as
#2.20 1 > library(sand)
2 > g.lazega <- graph.data.frame(elist.lazega,
3 + directed="FALSE",
4 + vertices=v.attr.lazega)
5 > g.lazega$name <- "Lazega Lawyers"
Our full set of network information on these
#2.21 1 > vcount(g.lazega)
2 [1] 36
2.4 Talking About Graphs 21
lawyers now consists of the
#2.22 1 > ecount(g.lazega)
2 [1] 115
pairs that declared they work together, along with the eight vertex attributes
#2.23 1 > list.vertex.attributes(g.lazega)
2 [1] "name" "Seniority" "Status" "Gender"
3 [5] "Office" "Years" "Age" "Practice"
4 [9] "School"
(in addition to the vertex name).
We will see a variety of ways in the chapters that follow to characterize and
model these network data and others like them.
2.4 Talking About Graphs
2.4.1 Basic Graph Concepts
With the adoption of a graph-based framework for representing relational data in
network analysis we inherit a rich vocabulary for discussing various important con-
cepts related to graphs. We briefly review and demonstrate some of these here, as
they are necessary for doing even the most basic of network analyses.
As defined at the start of this chapter, a graph has no edges for which both ends
connect to a single vertex (called loops) and no pairs of vertices with more than one
edge between them (called multi-edges). An object with either of these properties
is called a multi-graph.6 A graph that is not a multi-graph is called a simple graph,
and its edges are referred to as proper edges.
It is straightforward to determine whether or not a graph is simple. Our toy graph
g is simple.
#2.24 1 > is.simple(g)
2 [1] TRUE
But duplicating the edge between vertices 2 and 3, for instance, yields a multi-graph.
#2.25 1 > mg <- g + edge(2,3)
2 > str(mg)
3 IGRAPH UN-- 7 11 -- Toy Graph
4 + attr: name (g/c), name (v/c), color (v/c)
5 + edges (vertex names):
6 1 -- 2, 3
7 2 -- 1, 3, 3, 4
8 3 -- 1, 2, 2, 5
9 4 -- 2, 5, 6, 7
6 In fact, the igraph data model is more general than described above, and allows for multi-graphs,
with multiple edges between the same pair of vertices and edges from a vertex to itself.
22 2 Manipulating Network Data
10 5 -- 3, 4, 6
11 6 -- 4, 5, 7
12 7 -- 4, 6
13 > is.simple(mg)
14 [1] FALSE
Checking whether or not a network graph is simple is a somewhat trivial but nev-
ertheless important preliminary step in doing a typical network analysis, as many
models and methods assume the input graph to be simple or behave differently if it
is not.
Note that it is straightforward, and indeed not uncommon in practice, to trans-
form a multi-graph into a weighted graph, wherein each resulting proper edge is
equipped with a weight equal to the multiplicity of that edge in the original multi-
graph. For example, converting our toy multi-graph mg to a weighted graph results
in a simple graph,
#2.26 1 > E(mg)$weight <- 1
2 > wg2 <- simplify(mg)
3 > is.simple(wg2)
4 [1] TRUE
the edges which match our initial toy graph g,
#2.27 1 > str(wg2)
2 IGRAPH UNW- 7 10 -- Toy Graph
3 + attr: name (g/c), name (v/c), color (v/c),
4 weight (e/n)
5 + edges (vertex names):
6 1 -- 2, 3
7 2 -- 1, 3, 4
8 3 -- 1, 2, 5
9 4 -- 2, 5, 6, 7
10 5 -- 3, 4, 6
11 6 -- 4, 5, 7
12 7 -- 4, 6
but for which the third edge (i.e., connecting vertices 2 and 3) has a weight of 2.
#2.28 1 > E(wg2)$weight
2 [1] 1 1 2 1 1 1 1 1 1 1
Moving beyond such basic concerns regarding the nature of the edges in a graph,
it is necessary to have a language for discussing the connectivity of a graph. The
most basic notion of connectivity is that of adjacency. Two vertices u,v ∈ V are
said to be adjacent if joined by an edge in E . Such vertices are also referred to as
neighbors. For example, the three neighbors of vertex 5 in our toy graph g are
#2.29 1 > neighbors(g, 5)
2 [1] 3 4 6
Similarly, two edges e1,e2 ∈ E are adjacent if joined by a common endpoint in V .
A vertex v∈V is incident on an edge e∈E if v is an endpoint of e. From this follows
the notion of the degree of a vertex v, say dv, defined as the number of edges incident
on v.
2.4 Talking About Graphs 23
#2.30 1 > degree(g)
2 1 2 3 4 5 6 7
3 2 3 3 4 3 3 2
For digraphs, vertex degree is replaced by in-degree (i.e., dinv ) and out-degree
(i.e., doutv ), which count the number of edges pointing in towards and out from a
vertex, respectively.
#2.31 1 > degree(dg, mode="in")
2 Sam Mary Tom
3 0 2 2
4 > degree(dg, mode="out")
5 Sam Mary Tom
6 2 1 1
It is also useful to be able to discuss the concept of movement about a graph.
For example, a walk on a graph G, from v0 to vl , is an alternating sequence
{v0,e1,v1,e2, . . . ,vl−1,el ,vl}, where the endpoints of ei are {vi−1,vi}. The length
of this walk is said to be l. Refinements of a walk include trails, which are walks
without repeated edges, and paths, which are trails without repeated vertices. A trail
for which the beginning and ending vertices are the same is called a circuit. Sim-
ilarly, a walk of length at least three, for which the beginning and ending vertices
are the same, but for which all other vertices are distinct from each other, is called
a cycle. Graphs containing no cycles are called acyclic. In a digraph, these notions
generalize naturally. For example, a directed walk from v0 to vl proceeds from tail
to head along arcs between v0 and vl .
A vertex v in a graph G is said to be reachable from another vertex u if there exists
a walk from u to v. The graph G is said to be connected if every vertex is reachable
from every other. A component of a graph is a maximally connected subgraph. That
is, it is a connected subgraph of G for which the addition of any other remaining
vertex in V would ruin the property of connectivity. The toy graph g, for example,
is connected
#2.32 1 > is.connected(g)
2 [1] TRUE
and therefore consists of only a single component
#2.33 1 > clusters(g)
2 $membership
3 [1] 1 1 1 1 1 1 1
4
5 $csize
6 [1] 7
7
8 $no
9 [1] 1
For a digraph, there are two variations of the concept of connectedness. A digraph
G is weakly connected if its underlying graph (i.e., the result of stripping away the
24 2 Manipulating Network Data
labels ‘tail’ and ‘head’ from G) is connected. It is called strongly connected if every
vertex v is reachable from every u by a directed walk. The toy graphdg, for example,
is weakly connected but not strongly connected.
#2.34 1 > is.connected(dg, mode="weak")
2 [1] TRUE
3 > is.connected(dg, mode="strong")
4 [1] FALSE
A common notion of distance between vertices on a graph is defined as the length
of the shortest path(s) between the vertices (which we set equal to infinity if no such
path exists). This distance is often referred to as geodesic distance, with ‘geodesic’
being another name for shortest paths. The value of the longest distance in a graph
is called the diameter of the graph. Our toy graph g has diameter
#2.35 1 > diameter(g, weights=NA)
2 [1] 3
Ultimately, the concepts above are only the most basic of graph-theoretic quanti-
ties. There are a wide variety of queries one might make about graphs and quantities
to calculate as a part of doing descriptive network analysis. We cover more of these
in Chap. 4.
2.4.2 Special Types of Graphs
Graphs come in all ‘shapes and sizes,’ as it were, but there are a number of families
of graphs that are commonly encountered in practice. We illustrate this notion with
the examples of four such families shown in Fig. 2.2.
#2.36 1 > g.full <- graph.full(7)
2 > g.ring <- graph.ring(7)
3 > g.tree <- graph.tree(7, children=2, mode="undirected")
4 > g.star <- graph.star(7, mode="undirected")
5 > par(mfrow=c(2, 2))
6 > plot(g.full)
7 > plot(g.ring)
8 > plot(g.tree)
9 > plot(g.star)
A complete graph is a graph where every vertex is joined to every other vertex by
an edge.