Baixe o app para aproveitar ainda mais
Prévia do material em texto
Use R ! Eric D. Kolaczyk Gábor Csárdi Statistical Analysis of Network Data with R Use R! Series Editors: Robert Gentleman Kurt Hornik Giovanni Parmigiani For further volumes: http://www.springer.com/series/6991 Use R! Albert: Bayesian Computation with R (2nd ed. 2009) Bivand/Pebesma/Go´mez-Rubio: Applied Spatial Data Analysis with R (2nd ed. 2013) Cook/Swayne: Interactive and Dynamic Graphics for Data Analysis: With R and GGobi Hahne/Huber/Gentleman/Falcon: Bioconductor Case Studies Paradis: Analysis of Phylogenetics and Evolution with R (2nd ed. 2012) Pfaff: Analysis of Integrated and Cointegrated Time Series with R (2nd ed. 2008) Sarkar: Lattice: Multivariate Data Visualization with R Spector: Data Manipulation with R Eric D. Kolaczyk • Ga´bor Csa´rdi Statistical Analysis of Network Data with R 123 Eric D. Kolaczyk Department of Mathematics and Statistics Boston University Professor Boston, MA, USA Ga´bor Csa´rdi Department of Statistics Harvard University Research Associate Cambridge, MA, USA ISSN 2197-5736 ISSN 2197-5744 (electronic) ISBN 978-1-4939-0982-7 ISBN 978-1-4939-0983-4 (eBook) DOI 10.1007/978-1-4939-0983-4 Springer New York Heidelberg Dordrecht London Library of Congress Control Number: 2014936989 © Springer Science+Business Media New York 2014 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of pub- lication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com) Pour Jose´e, sans qui ce livre n’aurait pas vu le jour—E.D.K. Z-nak, a gye´ma´nte´rt e´s az aranye´rt—G.CS. Preface Networks and network analysis are arguably one of the largest recent growth areas in the quantitative sciences. Despite roots in social network analysis going back to the 1930s and roots in graph theory going back centuries, the phenomenal rise and popularity of the modern field of ‘network science’, as it is sometimes called, is something that likely could not have been predicted 10–15 years ago. Networks have permeated everyday life, far beyond the realm of research and methodology, through now-familiar realities like the Internet, social networks, viral marketing, and more. Measurement and data analysis are integral components of network research. As a result, there is a critical need for all sorts of statistics for network analysis, both common and sophisticated, ranging from applications, to methodology, to theory. As with other areas of statistics, there are both descriptive and inferential statistical techniques available, aimed at addressing a host of network-related tasks, including basic visualization and characterization of network structure; sampling, modeling, and inference of network topology; and modeling and prediction of network-indexed processes, both static and dynamic. Software for performing many such network-related analyses is now available in various languages and environments, across different platforms. Not surprisingly, the R community has been particularly active in the development of software for do- ing statistical analysis of network data. As of this writing there are already dozens of contributed R packages devoted to some aspect of network analysis. Together, these packages address tasks ranging from standard manipulation, visualization, and characterization of network data (e.g., igraph, network, and sna), to modeling of networks (e.g., igraph, eigenmodel, ergm, and mixer), to network topology infer- ence (e.g., glasso and huge). In addition, there is a great deal of analysis that can be done using tools and functions from the R base package. In this book we aim to provide an easily accessible introduction to the statistical analysis of network data, by way of the R programming language. As a result, this book is not, on the one hand, a detailed manual for using the various R packages en- countered herein, nor, on the other hand, does it provide exhaustive coverage of the conceptual and technical foundations of the topic area. Rather, we have attempted vii viii Preface to strike a balance between the two and, in addition, to do so using a (hopefully!) optimal level of brevity. Accordingly, we envision the book being used, for example, by (i) statisticians looking to begin engaging in the statistical analysis of network data, whether at a research level or in conjunction with a new collaboration, and hoping to use R as a natural segue, (ii) researchers from other similarly quantitative fields (e.g., computer science, statistical physics, economics, etc.) working in the area of complex networks, who seek to get up to speed relatively quickly on how to do statistical analyses (both familiar and unfamiliar) of network data in R, and (iii) practitioners in applied areas wishing to get a foothold into how to do a specific type of analysis relevant to a particular application of interest. More generally, the book has been written at a level aimed at graduate students and researchers in quantitative disciplines engaged in the statistical analysis of net- work data, although advanced undergraduates already comfortable with R should find much of the book fairly accessible as well. At present, therefore, we antici- pate the book being of interest to readers not only in statistics, of course, but also in areas like computational biology, computer science and machine learning, eco- nomics, neuroscience, quantitative finance, signal processing, statistical physics, and the quantitative social sciences. There are a number of people we wish to thank, whose help at various stages of development and writing is greatly appreciated. Thanks go to the editorial team at Springer for their enthusiasm in encouraging us to take on this project and for their feedback along the way and to the students in the course Statistical Analysis of Network Data (MA703) at Boston University in Fall 2013, for their comments on many of the earlier chapters. Special thanks are also due to Xinyu Kang, Heather Shappell, and Yaonan Zhang, who went through the entire first complete draft, care- fully reading every chapter and testing the code blocks throughout the book. We are grateful as well to Christophe Ambroise, AlainBarrat, Mark Coates, Suchi Gopal, Emmanuel Lazega, and Petra Staufer for kindly making available their data. More broadly, we would like to express our appreciation in general for the countless hours of effort invested by the developers of the many R packages that we have made use of throughout the pages of this book. Without their work, the breadth and scope of our own here would be significantly reduced. Finally, although surely still inade- quate, we wish to express our deepest gratitude to our respective families for their love, patience, and support throughout the writing of this book. All code and data used in this book have been made available in the R package sand, distributed through the CRAN archive. Boston, MA, USA Eric D. Kolaczyk Cambridge, MA, USA Ga´bor Csa´rdi March, 2014 Contents 1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Why Networks? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Types of Network Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2.1 Visualizing and Characterizing Networks . . . . . . . . . . . . . . . . 3 1.2.2 Network Modeling and Inference . . . . . . . . . . . . . . . . . . . . . . . 5 1.2.3 Network Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Why Use R for Network Analysis? . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 1.4 About This Book . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 1.5 About the R code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2 Manipulating Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2 Creating Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.1 Undirected and Directed Graphs . . . . . . . . . . . . . . . . . . . . . . . . 14 2.2.2 Representations for Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.2.3 Operations on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3 Decorating Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.1 Vertex, Edge, and Graph Attributes . . . . . . . . . . . . . . . . . . . . . 18 2.3.2 Using Data Frames . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.4 Talking About Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.1 Basic Graph Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.4.2 Special Types of Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 2.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 Visualizing Network Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2 Elements of Graph Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.3 Graph Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 3.4 Decorating Graph Layouts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5 Visualizing Large Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 ix x Contents 3.6 Using Visualization Tools Outside of R . . . . . . . . . . . . . . . . . . . . . . . . 39 3.7 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4 Descriptive Analysis of Network Graph Characteristics . . . . . . . . . . . . 43 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.2 Vertex and Edge Characteristics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.1 Vertex Degree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.2 Vertex Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 4.2.3 Characterizing Edges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.3 Characterizing Network Cohesion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.1 Subgraphs and Censuses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.3.2 Density and Related Notions of Relative Frequency . . . . . . . 55 4.3.3 Connectivity, Cuts, and Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.4 Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 4.4.1 Hierarchical Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.4.2 Spectral Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 4.4.3 Validation of Graph Partitioning . . . . . . . . . . . . . . . . . . . . . . . . 64 4.5 Assortativity and Mixing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 4.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 5 Mathematical Models for Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . 69 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69 5.2 Classical Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 5.3 Generalized Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 5.4 Network Graph Models Based on Mechanisms . . . . . . . . . . . . . . . . . . 74 5.4.1 Small-World Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74 5.4.2 Preferential Attachment Models . . . . . . . . . . . . . . . . . . . . . . . . 77 5.5 Assessing Significance of Network Graph Characteristics . . . . . . . . . 79 5.5.1 Assessing the Number of Communities in a Network . . . . . . 79 5.5.2 Assessing Small World Properties . . . . . . . . . . . . . . . . . . . . . . 81 5.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 6 Statistical Models for Network Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 6.2 Exponential Random Graph Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86 6.2.2 Specifying a Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 6.2.3 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 6.2.4 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 6.3 Network Block Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95 6.3.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96 6.3.2 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97 6.3.3 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 6.4 Latent Network Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 6.4.1 General Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102 Contents xi 6.4.2 Specifying the Latent Effects . . . . . . . . . . . . . . . . . . . . . . . . . . 103 6.4.3 Model Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. . . . . 104 6.4.4 Goodness-of-Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 6.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 7 Network Topology Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 7.2 Link Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 7.3 Association Network Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 7.3.1 Correlation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 7.3.2 Partial Correlation Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 7.3.3 Gaussian Graphical Model Networks . . . . . . . . . . . . . . . . . . . . 126 7.4 Tomographic Network Topology Inference . . . . . . . . . . . . . . . . . . . . . 129 7.4.1 Constraining the Problem: Tree Topologies . . . . . . . . . . . . . . 129 7.4.2 Tomographic Inference of Tree Topologies: An Illustration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 7.5 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 8 Modeling and Prediction for Processes on Network Graphs . . . . . . . . . 135 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135 8.2 Nearest Neighbor Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136 8.3 Markov Random Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.3.1 General Characterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 8.3.2 Auto-Logistic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142 8.3.3 Inference and Prediction for Auto-logistic Models . . . . . . . . . 144 8.3.4 Goodness of Fit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147 8.4 Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148 8.4.1 Designing Kernels on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 148 8.4.2 Kernel Regression on Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . 152 8.5 Modeling and Prediction for Dynamic Processes . . . . . . . . . . . . . . . . 154 8.5.1 Epidemic Processes: An Illustration . . . . . . . . . . . . . . . . . . . . . 154 8.6 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158 9 Analysis of Network Flow Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 9.2 Modeling Network Flows: Gravity Models . . . . . . . . . . . . . . . . . . . . . 162 9.2.1 Model Specification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162 9.2.2 Inference for Gravity Models . . . . . . . . . . . . . . . . . . . . . . . . . . 166 9.3 Predicting Network Flows: Traffic Matrix Estimation . . . . . . . . . . . . 170 9.3.1 An Ill-Posed Inverse Problem . . . . . . . . . . . . . . . . . . . . . . . . . . 171 9.3.2 The Tomogravity Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 9.4 Additional Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178 xii Contents 10 Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179 10.2 Representation and Manipulation of Dynamic Networks . . . . . . . . . . 180 10.3 Visualization of Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 186 10.4 Characterization of Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . 191 10.5 Modeling Dynamic Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205 Biographies xiii Eric D. Kolaczyk is a professor of statistics and Director of the Program in Statistics, in the Department of Mathematics and Statistics at Boston University, where he also is an affiliated faculty member in the Bioinformatics Program, the Division of Systems Engineering, and the Program in Computational Neuroscience. His publications on network-based topics, beyond the development of statistical methodology and theory, include work on applications ranging from the detection of anomalous traffic patterns in computer networks to the prediction of biological function in networks of interacting proteins to the characterization of influence of groups of actors in social networks. He is an elected fellow of the American Statis- tical Association (ASA) and an elected senior member of the Institute of Electrical and Electronics Engineers (IEEE). Ga´bor Csa´rdi is a research associate at the Department of Statistics at Harvard University, Cambridge, MA, USA. He holds a Ph.D. in computer science from Eo¨tvo¨s University, Hungary. His research includes applications of network analy- sis in biology and social sciences, bioinformatics and computational biology, and graph algorithms. He created the igraph software package in 2005 and has been one of the lead developers since then. Chapter 1 Introduction 1.1 Why Networks? The oft-repeated statement that “we live in a connected world” perhaps best captures, in its simplicity, why networks have come to hold such interest in re- cent years. From on-line social networks like Facebook to the World Wide Web and the Internet itself, we are surrounded by examples of ways in which we interact with each other. Similarly, we are connected as well at the level of various human institutions (e.g., governments), processes (e.g., economies), and infrastructures (e.g., the global airline network). And, of course, humans are surely not unique in being members of various complex, inter-connected systems. Looking at the natural world around us, we see a wealth of examples of such systems, from en- tire eco-systems, to biological food webs, to collections of inter-acting genes or communicating neurons. The image of a network—that is, essentially, something resembling a net—is a natural one to use to capture the notion of elements in a system and their inter- connectedness. Note, however, that the term ‘network’ seems to be used in a variety of ways, at various levels of formality. The Oxford English Dictionary, for example, defines the word network in its most general form simply as “a collection of inter- connected things.” On the other hand, frequently ‘network’ is used inter-changeably with the term ‘graph’ since, for mathematical purposes, networks are most com- monly represented in a formal manner using graphs of various kinds. In an effort to emphasize the distinction between the general concept and its mathematical formal- ization, in this book we will use the term ‘network’ in its most general sense above, and—at the risk of the impression of a slight redundancy—we will sometimes for emphasis refer to a graph representing such a network as a ‘network graph.’ The seeds of network-based analysis in the sciences, particularly its mathematical foundation of graph theory, are often placed in the 1735 solution of Euler to the now famous Ko¨nigsberg bridge problem, in which he proved that it was impossible to walk the seven bridges of that cityin such a way as to traverse each only once. Since then, particularly since the mid-1800s onward, these seeds have grown in a E.D. Kolaczyk and G. Csa´rdi, Statistical Analysis of Network Data with R, Use R! 65, DOI 10.1007/978-1-4939-0983-4 1, © Springer Science+Business Media New York 2014 1 2 1 Introduction number of key areas. For example, in mathematics the formal underpinnings were systematically laid, with Ko¨nig [92] cited as the first key architect. The theory of electrical circuits has always had a substantial network component, going back to work of Kirchoff, and similarly the study of molecular structure in chemistry, going back to Cayley. As the fields of operations research and computer science grew during the mid-1900s, networks were incorporated in a major fashion in problems involving transportation, allocation, and the like. And similarly during that time period, a small subset of sociologists, taking a particularly quantitative view towards the topic of social structure, began developing the use of networks in characterizing interactions within social groups. More recently—starting, perhaps, in the early to mid-1990s—there has been an explosion of interest in networks and network-based approaches to modeling and analysis of complex systems. Much of the impetus for this growth derives from work by researchers in two particular areas of science: statistical physics and com- puter science. To the former can be attributed a seminal role in encouraging what has now become a pervasive emphasis across the sciences on understanding how the interacting behaviors of constituent parts of a whole system lead to collective behavior and systems-level properties or outcomes. Indeed the term complex system was coined by statistical physicists, and a network-based perspective has become central to the analysis of complex systems. To the latter can be attributed much of the theory and methodology for conceptualizing, storing, manipulating, and doing computations with networks and related data, particularly in ways that enable effi- cient handling of the often massive quantities of such data. Moreover, information networks (e.g, the World Wide Web) and related social media applications (e.g., Twitter), the development of which computer scientists have played a key role, are examples of some of the most studied of complex systems (arguably reflecting our continued fascination with studying ourselves!). More broadly, a network-based perspective recently has been found to be useful in the study of complex systems across a diverse range of application areas. These areas include computational biology (e.g., studying systems of interacting genes, proteins, chemical compounds, or organisms), engineering (e.g., establishing how best to design and deploy a network of sensing devices), finance (e.g., studying the interplay among, say, the world’s banks as part of the global economy), marketing (e.g., assessing the extent to which product adoption can be induced as a type of ‘contagion’), neuroscience (e.g., exploring patterns of voltage dynamics in the brain associated with epileptic seizures), political science (e.g., studying how voting pref- erences in a group evolve in the face of various internal and external forces), and public health (e.g., studying the spread of infectious disease in a population, and how best to control that spread). In general, two important contributing factors to the phenomenal growth of interest in networks are (i) an increasing tendency towards a systems-level perspec- tive in the sciences, away from the reductionism that characterized much of the previous century, and (ii) an accompanying facility for high-throughput data collec- tion, storage, and management. The quintessential example is perhaps that of the changes in biology over the past 10–20 years, during which the complete mapping 1.2 Types of Network Analysis 3 of the human genome, a triumph of computational biology in and of itself, has now paved the way for fields like systems biology to be pursued aggressively, wherein a detailed understanding is sought of how the components of the human body, at the genetic level and higher, work together. Ultimately, the study of systems such as those just described is accompanied by measurement and, accordingly, the need for statistical analysis. The focus of this book is on how to use tools in R to do statistical analysis of network data. More specifically, we aim to present tools for performing what are arguably a core set of analyses of measurements that are either of or from a system conceptualized as a network. 1.2 Types of Network Analysis Network data are collected daily in a host of different areas. Each area, naturally, has its own unique questions and problems under study. Nevertheless, from a sta- tistical perspective, there is a methodological foundation that has emerged, com- posed of tasks and tools that are each common to some non-trivial subset of research areas involved with network science. Furthermore, it is possible—and indeed quite useful—to categorize many of the various tasks faced in the analysis of network data across different domains according to a statistical taxonomy. It is along the lines of such a taxonomy that this book is organized, progressing from descriptive methods to methods of modeling and inference, with the latter conveniently sepa- rated into two sub-areas, corresponding to the modeling and inference of networks themselves versus processes on networks. We illustrate with some examples. 1.2.1 Visualizing and Characterizing Networks Descriptive analysis of data (i.e., as opposed to statistical modeling and inference) typically is one of the first topics encountered in a standard introductory course in statistics. Similarly, the visualization and numerical characterization of a network usually is one of the first steps in network analysis. Indeed, in practice, descriptive analyses arguably constitute the majority of network analyses published. Consider the network in Fig. 1.1. Shown is a visualization1 of part of a dataset on collaborative working relationships among members of a New England law firm, collected by Lazega [98]. These data were collected for the purpose of studying cooperation among social actors in an organization, through the exchange of various types of resources among them. The organization observed was a law firm, consist- ing of over 70 lawyers (roughly half partners and the other half associates) in three offices located in three different cities. Relational data reflecting resource exchange 1 The R code for generating this visualization is provided in Chap. 3. 4 1 Introduction were collected, and additional attribute information was recorded for each lawyer, including type of practice, gender, and seniority. See Lazega and Pattison [99] for additional details. ll ll ll ll ll ll ll ll 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 2627 28 29 30 3132 33 34 35 36 Fig. 1.1 Visualization of Lazega’s network of collaborative working relationships among lawyers. Vertices represent partners and are labeled according to their seniority (with 1 being most senior). Vertex colors (i.e., red, blue, and yellow) indicate three different office locations, while vertex shape corresponds to the type of practice [i.e., litigation (circle) and corporate (square)]. Vertex area is proportional to number of years with the law firm. Edges indicate collaboration between partners. There are three female partners (i.e., those with seniority labels 27, 29, and 34); the rest are male The visual summary in Fig. 1.1 manages to combine a number of important aspects of these data in one diagram. A graph is used to represent the network, with vertices correspondingto lawyers, and edges, to collaboration between pairs of lawyers. In addition, differences in vertex color, shape, size, and label are used to indicate office location, type of practice, years with the law firm, and seniority. However, this is far from the only manner in which to visualize these data. Decid- ing how best to do so is itself both an art and science. Furthermore, visualization is aided here by the fact that the network of lawyers is so small. Suppose instead our interest lay in a large on-line social network, such as the Facebook network of people that have ‘friended’ each other. With well over 1 billion active users reported as of this writing, it is impossible to display this network in an analogous manner, with every individual user and their friendships evident. A similar statement often can be made in biology, for example, such as for networks of proteins and their 1.2 Types of Network Analysis 5 affinity for binding or of genes and their regulatory relationships. As a result, the visualization of large networks is a separate challenge of its own. We will look at tools for network visualization in Chap. 3. Characterization of network data through numerical summaries is another important aspect of descriptive analysis for networks. However, unlike in an int- roductory statistics course, where the types of summaries encountered typically are just simple measures of center (e.g., mean, median, etc.) and dispersion (e.g., standard deviation, range, etc.) for a set of real numbers, summary measures for networks necessarily seek to capture characteristics of a graph. For example, for the lawyer data in Fig. 1.1, it is natural to ask to what extent two lawyers that both work with a third lawyer are likely to work with each other as well. This notion corresponds to the social network concept of transitivity and can be captured numerically through an enumeration of the proportion of vertex triples that form triangles (i.e., all three vertex pairs are connected by edges), typically summarized in a so-called clustering coefficient. In this case, the relevant characterization is an explicit summary of network structure. On the other hand, given the two types of lawyers represented in these data (i.e., corporate and litigation), it is also natural to ask to what extent lawyers of each type collaborate with those of same versus different types. This notion corresponds to another social network concept—that of assortativity—and can be quantified by a type of correlation statistic (the so-called assortativity coefficient), in which labels of connected pairs of vertices are com- pared. In this case, the focus is on an attribute associated with network vertices (i.e., lawyer practice) and the network structure plays a comparatively more implicit role. There are a host of network summaries available, with more being defined reg- ularly. A variety of such measures are covered in Chap. 4. These range from char- acterization of properties associated with individual vertices or edges to properties of subgraphs to properties of the graph as a whole. The trick to employing net- work characterizations in a useful fashion generally is in matching the question(s) of interest in an underlying complex system with an appropriate summary measure(s) of the corresponding network. 1.2.2 Network Modeling and Inference Beyond asking what an observed network looks like and characterizing its structure, at a more fundamental level we may be interested in understanding how it may have arisen. That is, we can conceive of the network as having resulted from some under- lying processes associated with the complex system of interest to us and ask what are the essential aspects of these processes. In addition, the actual manner in which the network was obtained, i.e., the corresponding measurement and construction process, may well be important to take into consideration. Such concerns provide the impetus for network modeling and associated tools of statistical inference. Network modeling has received much attention. Broadly speaking, there are two classes of network models: mathematical and statistical. By ‘mathematical 6 1 Introduction models’ we mean models specified through (typically) simple probabilistic rules for generating a network, where often the rules are defined in an attempt to capture a particular mechanism or principle (e.g., ‘the rich get richer’). In contrast, by ‘sta- tistical models’ we mean models (often probabilistic as well) specified at least in part with the intention that they be fit to observed data—allowing, for example, the evaluation of the explanatory power of certain variables on edge formation in the network—and that the fit be effected and assessed using formal principles of statis- tical inference. While certainly there is some overlap between these two classes of models, the relevant literatures nevertheless are largely distinct. The simplest example of a mathematical network model is one in which edges are assigned randomly to pairs of vertices based on the result of a collection of independent and identically distributed coin tosses—one toss for each vertex pair. Corresponding to a variant of the famous Erdo˝s–Re´nyi formulation of a random graph, this model has been studied extensively since the 1960s. Its strength is in the fact not only that its properties are so well understood (e.g., in terms of how, for example, cohesive structure emerges as a function of the probability of an edge) but also in the role it plays as a standard against which to compare other, more complicated models. From the statistical perspective, however, such mathematical models generally are too simple to be a good match to real network data. Nevertheless, they are not only useful in allowing for formal insight to be gained into how specific mechanisms of edge formation may affect network structure, but they also are used commonly in defining null classes of networks against which to assess the ‘significance’ of structural characteristics found in an observed network. For example, in Fig. 1.1, the clustering coefficient for the network of lawyer collaboration turns out to be just slightly less than 0.40. Is this value large or not? It is difficult to say without a frame of reference. One way of imposing such a frame of reference is to define a simple but comparable class of networks (e.g., all graphs with the same numbers of vertices and edges) and look at the distribution of the clustering coefficient across all such graphs. In the spirit of a permutation test and similar data re-use methods, examining where our observed value of 0.40 falls within this distribution gives us a sense of how unusual it is in comparison to values for networks across the specified class. This approach has been used not only for examining clustering in social networks but also, for example, in identifying recurring sub-graph patterns (aka ‘motifs’) in gene regulatory networks and functional connectivity networks in neuroscience. We will explore the use of mathematical network models for such purposes in Chap. 5. There are a variety of statistical network models that have been offered in the literature, many of which parallel well-known model classes in classical statis- tics. For example, exponential random graph models are analogous to generalized linear models, being based on an exponential family form. Similarly, latent net- work models, in specifying that edges may arise at least in part from an unmeasured (and possibly unknown) variable(s), directly parallel the use of latent variables in hierarchical modeling. Finally, stochastic block models may be viewed as a form of mixture model. Nevertheless, importantly, the specification of such models and their fitting typically are decidedly less standard, given the usually high-dimensional and 1.2 Types of Network Analysis 7 dependent natureof the data. Such models have been used in a variety of settings, from the modeling of collaborations like that in Lazega’s lawyer network to the pre- diction of protein pairs likely to have an affinity for physically binding. We will look at examples of the use of each of these classes of models in Chap. 6. 1.2.3 Network Processes As objects for representing interactions among elements of a complex system, net- work graphs are frequently the primary focus of network analysis. On the other hand, in many contexts it is actually some quantity (or attribute) associated with each of the elements in the system that ultimately is of most interest. Nevertheless, in such settings it often is not unreasonable to expect that this quantity be influenced in an important manner by the interactions among the elements, and hence the net- work graph may still be relevant for modeling and analysis. More formally, we can picture a stochastic process as ‘living’ on the network and indexed by the vertices in the network. A variety of questions regarding such processes can be interpreted as problems of prediction of either static or dynamic network processes. Returning to Fig. 1.1, for example, suppose that we do not know the practice of a particular lawyer. It seems plausible to suspect that lawyers collaborate more frequently with other lawyers in the same legal practice. If so, knowledge of col- laboration may be useful in predicting practice. That is, for our lawyer of unknown practice, we may be able to predict that practice with some accuracy if we know (i) the vertices that are neighbors of that lawyer in the network graph, and (ii) the practice of those neighbors. While in fact this information is known for all lawyers in our data set, in other contexts we generally are not so fortunate. For example, in on-line social networks like Facebook, where users can choose privacy settings of various levels of sever- ity, it can be of interest (e.g., for marketing purposes) to predict user attributes (e.g., perhaps indicative of consumer preferences) based on that of ‘friends’. Simi- larly, in biology, traditionally the functional role of proteins (e.g., in the context of communication within the cell or in controlling cell metabolism) has been estab- lished through labor-intensive experimental techniques. Because proteins that work together to effect certain functions often have a higher affinity for physically bind- ing to each other, the clusters of highly connected proteins in protein–protein int- eraction networks can be exploited to make computational predictions of protein function, by propagating information on proteins of known function to their neigh- bors of unknown function in the graph. We will encounter methods for making such predictions of static network processes in Chap. 8. Ultimately, many (most?) of the systems studied from a network-based perspec- tive are intrinsically dynamic in nature. Not surprisingly, therefore, many processes defined on networks are more accurately thought of as dynamic, rather than static, processes. Consider, for example, the context of public health and disease control. Understanding the spread of a disease (e.g., the H1N1 flu virus) through a population 8 1 Introduction can be modeled as the diffusion of a binary dynamic process (indicating infected or not infected) through a network graph, in which vertices represent individuals, and edges, contact between individuals. Mathematical modeling (using both determin- istic and stochastic models) arguably is still the primary tool for modeling such processes, but network-based statistical models gradually are seeing increased use, particularly as better and more extensive data on contact networks becomes avail- able. Another setting in which analogous ideas and models are used—and wherein it is possible to collected substantially more data—is in modeling the adoption of new products among consumers (e.g., toward predicting the early adopters of, say, the next iPhone released). We look briefly at techniques in this emerging area of modeling and prediction of dynamic network processes in Chap. 8 as well. In a related direction, we will look at statistical methods for analyzing network flows in Chap. 9. Referring to the movement of something—materials, people, or commodities, for example—from origin to destination, flows are a special type of dynamic process fundamental to transportation networks (e.g., airlines moving people) and communication networks (e.g., the Internet moving packets of informa- tion), among others. Finally, in Chap. 10, we will look briefly at the emerging area of dynamic network analysis, wherein the network, the process(es) on the network, or indeed both, are expected to be evolving in time. 1.3 Why Use R for Network Analysis? Various tools are available for network analysis. Some of these are standalone pro- grams, like the classic Pajek tool or the newer Gephi, while others are embedded into a programming environment and are essentially used as a programming library. Some examples of the latter are NetworkX in Python and igraph in R. R is rapidly becoming the de facto standard of statistical research, if this has not happened already. The majority of new statistical developments are immedi- ately available in R and no other programming languages or software packages. While network analysis is an interdisciplinary field, it is not an exception from this trend. Several R extension packages implement one or more network analysis algorithms or provide general tools for manipulating network data and implement network algorithms. R supports high quality graphics and virtually any common graphical file format. R is extensible with currently over 5,000 extension packages, and this number is growing. Being a complete programming language, R offers great flexibility for network research. New network analysis algorithms can be prototyped rapidly by building on the existing network science extension packages, the most commonly used one of which is the igraph package. In addition to implementations of classic and recently published methods of network analysis, igraph provides tools to import, manipu- late and visualize graphs, and can be used as a platform for new algorithms. It is currently used in over a hundred other R packages and will be featured often in this book. 1.4 About This Book 9 1.4 About This Book Our goal in writing this book is to provide an easily accessible introduction to the statistical analysis of network data, by way of the R programming language. The book has been written at a level aimed at graduate students and researchers in quantitative disciplines engaged in the statistical analysis of network data, although advanced undergraduates already comfortable with R should find the book fairly accessible as well. At present, therefore, we anticipate the book being of interest to readers in statistics, of course, but also in areas like computational biology, com- puter science and machine learning, economics, neuroscience, quantitative finance, signal processing, statistical physics, and the quantitative social sciences. The material in this book is organized to flow from descriptive statistical methods to topics centered on modeling and inference with networks, with the latter sepa- rated into two sub-areas, corresponding first to the modeling and inference of net- works themselves, and then, to processes on networks. More specifically, we begin by covering tools for the manipulation of network data in Chap. 2. The visualization and characterization of networks is then addressed in Chaps. 3 and 4, respectively. Next, the topics of mathematical and statistical network modeling are investigated, in Chaps. 5 and 6, respectively. In Chap. 7 the focus is on the special case of network modeling wherein the network topology must itself be inferred. Network processes,both static and dynamic, are addressed in Chap. 8, while network flows are featured in Chap. 9. Finally, in Chap. 10 a brief look is provided at how the topics of these earlier chapters are beginning to be extended to the context of dynamic networks. It is not our intent to provide a comprehensive coverage here of the conceptual and technical foundations of the statistical analysis of network data. For something more of that nature, we recommend the book by Kolaczyk [91]. There are also a number of excellent treatments of network analysis more generally, written from the perspective of various other fields. These include the book by Newman [118], from the perspective of statistical physics, and the book by Jackson [81], from the perspective of economics. The book by Easley and Kleinberg [49], written at an introductory undergraduate level, provides a particularly accessible treatment of networks and social behavior. Finally, the book by Wasserman and Faust [144], although less recent than these others, is still an important resource, particularly regarding basic tools for characterization and classical network modeling, from a social network perspective. On the other hand, neither is this book a complete manual for the various R packages encountered herein. The reader will want to consult the manuals for these packages themselves for details omitted from our coverage. In addition, it should be noted that we assume a basic familiarity with R and the base package. For gen- eral background on R, particularly for those not already familiar with the software, there are any number of resources to be had, including the tutorial by Venebles and Smith [143], the beginner’s guide by Zuur et al. [152], and the book by Crawley [36]. Ultimately, we have attempted to strike a reasonable balance between the con- cepts and technical background, on the one hand, and software details, on the other. Accordingly, we envision the book being used, for example, by (i) statisticians 10 1 Introduction looking to begin engaging in the statistical analysis of network data, whether at a research level or in conjunction with a new collaboration, and hoping to use R as a natural segue; (ii) researchers from other similarly quantitative fields (e.g., com- puter science, statistical physics, economics, etc.) working in the area of complex networks, who seek to get up to speed relatively quickly on how to do statistical analyses (both familiar and unfamiliar) of network data in R, and (iii) practitioners in applied areas wishing to get a foothold into how to do a specific type of analysis relevant to a particular application of interest. 1.5 About the R code The R code in this book requires a number of data sets and R packages. We have collected the data sets into a standalone R package called sand (i.e., for ‘statistical analysis of network data’). If the reader aims to run the code while reading the book, she needs to install this package from CRAN.2 Once installed, the package can then be used to also install all other R packages required in the code chunks of the book: #1.1 1 > install.packages("sand") 2 > library(sand) 3 > install_sand_packages() To avoid the need for constant typing while reading the book, we have extracted all R code and placed it online, at http://github.com/kolaczyk/sand. We suggest that the reader open this web page to simply copy-and-paste the code chunks into an R session. All code is also available in the sand package. In the package manual are details on how to run the code: #1.2 1 > ?sand Note that code chunks are not independent, in that they often rely upon the results of previous chunks, within the same chapter. On the other hand, code in separate chapters is meant to be run separately—in fact, restarting R at the beginning of each chapter is the best way to ensure a clean state to start from. For example, to run the code in Chap. 3, it is not required (nor recommended) to first run all code in Chaps. 1 and 2, but it is prudent to run any given code block in Chap. 3 only after having run all other relevant code blocks earlier in the chapter. In writing this book, we have sought to strike an optimal balance between repro- ducibility and exposition. It should be possible, for example, for the reader to exactly reproduce all numerical results presented herein (i.e., summary statistics, estimates, etc.). In order to ensure this capability, we have set random seeds where neces- sary (e.g., prior to running methods relying on Markov chain Monte Carlo). The reader adapting our code for his own purposes will likely, of course, want to use his own choice of seeds. In contrast, however, we have compromised when it comes to 2 The Comprehensive R Archive Network (CRAN) is a worldwide network of ftp and web servers from which R code and related materials may be downloaded. See http://cran.r-project.org. 1.5 About the R code 11 visualizations. In particular, while the necessary code has been supplied to repeat all visualizations shown in this book, in most cases we have not attempted to wrap the main code with all necessary supplemental code (e.g., random seeds, margin set- tings, space between subplots, etc.) required for exact reproducibility, which would have come at what we felt was an unacceptable cost of reduction in readability and general aesthetics. Given the fact that the R environment in general, and the R packages used in this book more specifically, can be expected to continue to evolve over time, we anticipate that we will need to update the online version of the code (and the version in the sand package) in the future, to make it work with newer versions of R and with updated packages. Therefore, if the reader finds that the code here in the book appears to fail, the online version should be most current and working. On a related note, as can be seen above, the code chunks in the book are num- bered, by chapter and within each chapter. This convention was adopted mainly to ensure ease in referencing code chunks from the website and vice versa. Even if we update the code on the web site, our intention is that the chunk numbers will stay the same. In addition to having the most current version of the code chunks, the web page also includes an issue tracker, for reporting mistakes in the text or the code, and for general discussion. Chapter 2 Manipulating Network Data 2.1 Introduction We have seen that the term ‘network,’ broadly speaking, refers to a collection of elements and their inter-relations. The mathematical concept of a graph lends preci- sion to this notion. We will introduce the basic elements of graphs—both undirected and directed—in Sect. 2.2 and discuss how to generate network graphs, both ‘by hand’ and from network data of various forms. As a representation of a complex system, a graph alone (i.e., as merely a col- lection of vertices and edges) is often insufficient. Rather, there may be additional information important to the application at hand, in the form of variables that can be indexed by the vertices (e.g., gender of members of a social network) or the edges (e.g., average time required to traverse a link in a transportation network). Alterna- tively, at a coarser level of granularity, it may be convenient to associate vertices or edges with groups (e.g., all proteins in a protein–protein interaction network that are involved with a certain type of signaling event in a cell). Indeed, we can imagine potentially equipping vertices and edges with several variables of interest. Doing so corresponds to the notion of decorating a network graph, which is discussed in Sect. 2.3. Finally, in using graphs to represent network data, a certain level of familiarity with basic graph theoretic concepts, as well as an ability to assess certain basic properties of graphs, is essential. We therefore devote Sect. 2.4 to a brief overviewof such concepts and properties, including a quick look at a handful of important special classes of graphs. For creating, decorating, and assessing basic properties of network graphs, igraph is particularly useful.1 A library and R package for network analysis, igraph contains a set of data types and functions for (relatively!) straightforward implemen- tation and rapid prototyping of graph algorithms, and allows for the fast handling of 1 Alternatively, there is within the network and sna packages, found in the statnet suite, a similarly rich set of tools for the manipulation and characterization of network graphs. These packages share nontrivial overlap with igraph. E.D. Kolaczyk and G. Csa´rdi, Statistical Analysis of Network Data with R, Use R! 65, DOI 10.1007/978-1-4939-0983-4 2, © Springer Science+Business Media New York 2014 13 14 2 Manipulating Network Data large graphs (e.g., on the order of millions of vertices and edges). As such, its use will figure heavily in this and the following two chapters (i.e., where the emphasis is on descriptive methods). The fact that igraph was developed as a research tool and that its focus originally was to be able to handle large graphs efficiently, means that its learning curve used to be somewhat steep. Recent versions do not necessarily flatten the learning curve, but are nevertheless friendlier to the user, once she has mastered the basics. 2.2 Creating Network Graphs 2.2.1 Undirected and Directed Graphs Formally, a graph G = (V,E) is a mathematical structure consisting of a set V of vertices (also commonly called nodes) and a set E of edges (also commonly called links), where elements of E are unordered pairs {u,v} of distinct vertices u,v ∈ V . The number of vertices Nv = |V | and the number of edges Ne = |E| are sometimes called the order and size of the graph G, respectively. Often, and without loss of generality,2 we will label the vertices simply with the integers 1, . . . ,Nv, and the edges, analogously. In igraph there is an ‘igraph’ class for graphs.3 In this section, we will see a number of ways to create an object of the igraph class in R, and various ways to extract and summarize the information in that object. For small, toy graphs, the function graph.formula can be used. For example, #2.1 1 > library(igraph) 2 > g <- graph.formula(1-2, 1-3, 2-3, 2-4, 3-5, 4-5, 4-6, 3 + 4-7, 5-6, 6-7) creates a graph object g with Nv = 7 vertices #2.2 1 > V(g) 2 Vertex sequence: 3 [1] "1" "2" "3" "4" "5" "6" "7" and Ne = 10 edges #2.3 1 > E(g) 2 Edge sequence: 3 4 [1] 2 -- 1 5 [2] 3 -- 1 6 [3] 3 -- 2 7 [4] 4 -- 2 2 Technically, a graph G is unique only up to relabellings of its vertices and edges that leave the structure unchanged. Two graphs that are equivalent in this sense are called isomorphic. 3 The exact representation of ‘igraph’ objects is not visible for the user and is subject to change. 2.2 Creating Network Graphs 15 8 [5] 5 -- 3 9 [6] 5 -- 4 10 [7] 6 -- 4 11 [8] 7 -- 4 12 [9] 6 -- 5 13 [10] 7 -- 6 This same information, in a slightly more compressed format, is recovered easily using the relevant structure command. #2.4 1 > str(g) 2 IGRAPH UN-- 7 10 -- 3 + attr: name (v/c) 4 + edges (vertex names): 5 1 -- 2, 3 6 2 -- 1, 3, 4 7 3 -- 1, 2, 5 8 4 -- 2, 5, 6, 7 9 5 -- 3, 4, 6 10 6 -- 4, 5, 7 11 7 -- 4, 6 A visual representation of this graph, generated simply through the command4 #2.5 1 > plot(g) is shown in Fig. 2.1, on the left. The character U seen accompanying the summary of g above indicates that our graph is undirected, in that there is no ordering in the vertices defining an edge. A graph G for which each edge in E has an ordering to its vertices (i.e., so that (u,v) is distinct from (u,v), for u,v ∈V ) is called a directed graph or digraph. Such edges are called directed edges or arcs, with the direction of an arc (u,v) read from left to right, from the tail u to the head v. Note that digraphs may have two arcs between a pair of vertices, with the vertices playing opposite roles of head and tail for the respective arcs. In this case, the two arcs are said to be mutual. Directed edges in graph.formula are indicated using a minus/plus conven- tion. In Fig. 2.1, on the right, is shown an example of a digraph consisting of three vertices, with two directed edges and one mutual edge. #2.6 1 > dg <- graph.formula(1-+2, 1-+3, 2++3) 2 > plot(dg) We note that in defining both of the graphs above we have used the standard con- vention of labeling vertices with the numbers 1 through Nv, which is also the default in igraph. In practice, however, we may already have natural labels, such as the names of people in a social network, or of genes in a gene regulatory network. Such labels can be used instead of the default choice by generating the graph with them explicitly. 4 This is the most basic visualization. We will explore the topic of visualization on its own in more depth in Chap. 3. 16 2 Manipulating Network Data 1 2 3 4 5 6 7 1 2 3 Fig. 2.1 Left: an undirected graph. Right: a directed graph #2.7 1 > dg <- graph.formula(Sam-+Mary, Sam-+Tom, Mary++Tom) 2 > str(dg) 3 IGRAPH DN-- 3 4 -- 4 + attr: name (v/c) 5 + edges (vertex names): 6 [1] Sam ->Mary Sam ->Tom Mary->Tom Tom ->Mary Alternatively, vertex labels can be changed from the default after initially creating the graph, by modifying the name vertex attribute of the graph object. #2.8 1 > V(dg)$name <- c("Sam", "Mary", "Tom") 2.2.2 Representations for Graphs Realistically, we do not usually expect to enter a graph by hand, since most net- works encountered in practice have at least tens of vertices and edges, if not tens of thousands (or even millions!). Rather, information for constructing a network graph typically will be stored in a data file. At the most elementary level, there are three basic formats: adjacency lists, edge lists, and adjacency matrices. An adjacency list representation of a graph G is simply an array of size Nv, ordered with respect to the ordering of the vertices in V , each element of which is a list, where the ith list contains the set of all vertices j for which there is an edge from i to j. This is the representation usually used by igraph, evident in printing the output from the structure function str in the examples above. An edge list is a simple two-column list of all vertex pairs that are joined by an edge. In igraph, edge lists are used, for example, when printing the edge set E . #2.9 1 > E(dg) 2 Edge sequence: 3 2.2 Creating Network Graphs 17 4 [1] Sam -> Mary 5 [2] Sam -> Tom 6 [3] Mary -> Tom 7 [4] Tom -> Mary The function get.edgelist returns an edge list as a two-column R matrix. Finally, graphs can also be stored in matrix form. The Nv ×Nv adjacency matrix for a graph G = (V,E), say A, is defined so that Ai j = { 1, if {i, j} ∈ E , 0, otherwise . (2.1) In words, A is non-zero for entries whose row-column indices (i, j) correspond to vertices in G joined by an edge, from i to j, and zero, for those that are not. The matrix A will be symmetric for undirected graphs. #2.10 1 > get.adjacency(g) 2 7 x 7 sparse Matrix of class "dgCMatrix" 3 1 2 3 4 5 6 7 4 1 . 1 1 . . . . 5 2 1 . 1 1 . . . 6 3 1 1 . . 1 . . 7 4 . 1 . . 1 1 1 8 5 . . 1 1 . 1 . 9 6 . . . 1 1 . 1 10 7 . . . 1 . 1 . This last choice of representation is often a natural one, given that matrices are fundamental data objects in most programming and software environments and that network graphs frequently are encoded in statistical models through their adjacency matrices. However, their use with the type of large, sparse networks commonly enc- ountered in practice can be inefficient, unless coupled with the use of sparse matrix tools. In igraph, network data already loaded into R in these specific formats canbe used to generate graphs using the functions graph.adjlist, graph. edgelist, and graph.adjacency, respectively. For data stored in a file, the function read.graph can be used. In fact, this latter function not only sup- ports the three formats discussed above, but also a number of other formats (e.g., such as GraphML, Pajek, etc.). Conversely, the function write.graph can be used to save graphs in various formats. 2.2.3 Operations on Graphs The graph(s) that we are able to load into R may not be the graph that we ultimately want. Various operations on the graph(s) we have available may be necessary, including extracting part of a graph, deleting vertices, adding edges, or even com- bining multiple graphs. 18 2 Manipulating Network Data The notion of a ‘part’ of a graph is captured through the concept of a subgraph. A graph H = (VH ,EH) is a subgraph of another graph G = (VG,EG) if VH ⊆ VG and EH ⊆ EG. Often we are interested in an induced subgraph of a graph G, i.e., a subgraph G′ = (V ′,E ′), where V ′ ⊆ V is a prespecified subset of vertices and E ′ ⊆ E is the collection of edges to be found in G among that subset of vertices. For example, consider the subgraph of g induced by the first five vertices. #2.11 1 > h <- induced.subgraph(g, 1:5) 2 > str(h) 3 IGRAPH UN-- 5 6 -- 4 + attr: name (v/c) 5 + edges (vertex names): 6 [1] 1--2 1--3 2--3 2--4 3--5 4--5 The inclusion or exclusion of vertices or edges in a graph G = (V,E) can be con- ceived of as the application of addition or subtraction operators, respectively, to the sets V and E . For example, the subgraph h generated just above could also have been created from g by removing the vertices 6 and 7. #2.12 1 > h <- g - vertices(c(6,7)) Similarly, g can be recovered from h by first adding these two vertices back in, and then, adding the appropriate edges. #2.13 1 > h <- h + vertices(c(6,7)) 2 > g <- h + edges(c(4,6),c(4,7),c(5,6),c(6,7)) Finally, the basic set-theoretic concepts of union, disjoint union, intersection, dif- ference, and complement all extend in a natural fashion to graphs. For example, the union of two graphs, say H1 and H2, is a graph G in which vertices and edges are included if and only if they are included in at least one of H1 or H2. For example, our toy graph g may be created through the union of the (induced) subgraph h defined above and a second appropriately defined subgraph. #2.14 1 > h1 <- h 2 > h2 <- graph.formula(4-6, 4-7, 5-6, 6-7) 3 > g <- graph.union(h1,h2) 2.3 Decorating Network Graphs 2.3.1 Vertex, Edge, and Graph Attributes At the heart of a network-based representation of data from a complex system will be a graph. But frequently there are other relevant data to be had as well. From a network-centric perspective, these other data can be thought of as attributes, i.e., values associated with the corresponding network graph. Equipping a graph with such attributes is referred to as decorating the graph. Typically, the vertices or edges of a graph (or both) are decorated with attributes, although the graph as a whole 2.3 Decorating Network Graphs 19 may be decorated as well. In igraph, the elements of graph objects (i.e., particularly the vertex and edge sequences, and subsets thereof) may be equipped with attributes simply by using the ‘$’ operator. Vertex attributes are variables indexed by vertices, and may be of discrete or continuous type. Instances of the former type include the gender of actors in a social network, the infection status of computers in an Internet network in the midst of an on-line virus (e.g., a worm), and a list of biological pathways in which a protein in a protein–protein interaction network is known to participate, while an example of the latter type is the voltage potential levels in the brain measured at electrodes in an electrocorticogram (ECoG) grid. For example, recall that the names of the three actors in our toy digraph are #2.15 1 > V(dg)$name 2 [1] "Sam" "Mary" "Tom" Their gender is added to dg as #2.16 1 > V(dg)$gender <- c("M","F","M") Note that the notion of vertex attributes also may be used advantageously to equip vertices with properties during the course of an analysis, either as input to or output from calculations within R. For example, this might mean associating the color red with our vertices #2.17 1 > V(g)$color <- "red" to be used in plotting the graph (see Chap. 3). Or it might mean saving the values of some vertex characteristic we have computed, such as the types of vertex centrality measures to be introduced in Chap. 4. Edge attributes similarly are values of variables indexed by adjacent vertex pairs and, as with vertex attributes, they may be of both discrete or continuous type. Examples of discrete edge attributes include whether one gene regulates another in an inhibitory or excitatory fashion, or whether two countries have a friendly or antagonistic political relationship. Continuous edge attributes, on the other hand, often represent some measure of the strength of relationship between vertex pairs. For example, we might equip each edge in a network of email exchanges (with ver- tices representing email addresses) by the rate at which emails were exchanged over a given period of time. Or we might define an attribute on edges between adjacent stations in a subway network (e.g., the Paris metro) to represent the average time necessary during a given hour of the day for trains to run from one to station to the next. Often edge attributes can be thought of usefully, for the purposes of various anal- yses, as weights. Edge weights generally are non-negative, by convention, and often are scaled to fall between zero and one. A graph for which the edges are equipped with weights is referred to as a weighted graph.5 5 More generally, a weighted graph can be defined as a pair (V,E), where V is a set of vertices, as before, but the elements in E are now non-negative numbers, with one such number for each vertex pair. Analogously, the adjacency matrix A for a weighted graph is defined such that the entry Ai j is equal to the corresponding weight for the vertex pair i and j. 20 2 Manipulating Network Data #2.18 1 > is.weighted(g) 2 [1] FALSE 3 > wg <- g 4 > E(wg)$weight <- runif(ecount(wg)) 5 > is.weighted(wg) 6 [1] TRUE As with vertex attributes, edge attributes may also be used to equip edges with prop- erties to be used in calls to other R functions, such as the plot function. In principle, a graph itself may be decorated with an attribute, and indeed, it is possible to equip graph objects with attributes in igraph. The most natural use of this feature arguably is to equip a graph with relevant background information, such as a name #2.19 1 > g$name <- "Toy Graph" or a seminal data source. 2.3.2 Using Data Frames Just as network graphs typically are not entered by hand for graphs of any nontrivial magnitude, but rather are encoded in data frames and files, so too attributes tend to be similarly encoded. For example, in R, a network graph and all vertex and edge attributes can be conveniently represented using two data frames, one with vertex information, and the other, with edge information. Under this approach, the first column of the vertex data frame contains the vertex names (i.e., either the default numerical labels or symbolic), while each of the other columns contain the values of a given vertex attribute. Similarly, the first two columns of the edge data frame contain an edge list defining the graph, while each of the other columns contain the values of a given edge attribute. Consider, for example, the lawyer data set of Lazega [98], introduced in Chap. 1. Collecting the information on collaborative working relationships, in the form of an edge list, in the data frame elist.lazega, and the various vertex attribute variables, in the data frame v.attr.lazega,they may be combined into a single graph object in igraph as #2.20 1 > library(sand) 2 > g.lazega <- graph.data.frame(elist.lazega, 3 + directed="FALSE", 4 + vertices=v.attr.lazega) 5 > g.lazega$name <- "Lazega Lawyers" Our full set of network information on these #2.21 1 > vcount(g.lazega) 2 [1] 36 2.4 Talking About Graphs 21 lawyers now consists of the #2.22 1 > ecount(g.lazega) 2 [1] 115 pairs that declared they work together, along with the eight vertex attributes #2.23 1 > list.vertex.attributes(g.lazega) 2 [1] "name" "Seniority" "Status" "Gender" 3 [5] "Office" "Years" "Age" "Practice" 4 [9] "School" (in addition to the vertex name). We will see a variety of ways in the chapters that follow to characterize and model these network data and others like them. 2.4 Talking About Graphs 2.4.1 Basic Graph Concepts With the adoption of a graph-based framework for representing relational data in network analysis we inherit a rich vocabulary for discussing various important con- cepts related to graphs. We briefly review and demonstrate some of these here, as they are necessary for doing even the most basic of network analyses. As defined at the start of this chapter, a graph has no edges for which both ends connect to a single vertex (called loops) and no pairs of vertices with more than one edge between them (called multi-edges). An object with either of these properties is called a multi-graph.6 A graph that is not a multi-graph is called a simple graph, and its edges are referred to as proper edges. It is straightforward to determine whether or not a graph is simple. Our toy graph g is simple. #2.24 1 > is.simple(g) 2 [1] TRUE But duplicating the edge between vertices 2 and 3, for instance, yields a multi-graph. #2.25 1 > mg <- g + edge(2,3) 2 > str(mg) 3 IGRAPH UN-- 7 11 -- Toy Graph 4 + attr: name (g/c), name (v/c), color (v/c) 5 + edges (vertex names): 6 1 -- 2, 3 7 2 -- 1, 3, 3, 4 8 3 -- 1, 2, 2, 5 9 4 -- 2, 5, 6, 7 6 In fact, the igraph data model is more general than described above, and allows for multi-graphs, with multiple edges between the same pair of vertices and edges from a vertex to itself. 22 2 Manipulating Network Data 10 5 -- 3, 4, 6 11 6 -- 4, 5, 7 12 7 -- 4, 6 13 > is.simple(mg) 14 [1] FALSE Checking whether or not a network graph is simple is a somewhat trivial but nev- ertheless important preliminary step in doing a typical network analysis, as many models and methods assume the input graph to be simple or behave differently if it is not. Note that it is straightforward, and indeed not uncommon in practice, to trans- form a multi-graph into a weighted graph, wherein each resulting proper edge is equipped with a weight equal to the multiplicity of that edge in the original multi- graph. For example, converting our toy multi-graph mg to a weighted graph results in a simple graph, #2.26 1 > E(mg)$weight <- 1 2 > wg2 <- simplify(mg) 3 > is.simple(wg2) 4 [1] TRUE the edges which match our initial toy graph g, #2.27 1 > str(wg2) 2 IGRAPH UNW- 7 10 -- Toy Graph 3 + attr: name (g/c), name (v/c), color (v/c), 4 weight (e/n) 5 + edges (vertex names): 6 1 -- 2, 3 7 2 -- 1, 3, 4 8 3 -- 1, 2, 5 9 4 -- 2, 5, 6, 7 10 5 -- 3, 4, 6 11 6 -- 4, 5, 7 12 7 -- 4, 6 but for which the third edge (i.e., connecting vertices 2 and 3) has a weight of 2. #2.28 1 > E(wg2)$weight 2 [1] 1 1 2 1 1 1 1 1 1 1 Moving beyond such basic concerns regarding the nature of the edges in a graph, it is necessary to have a language for discussing the connectivity of a graph. The most basic notion of connectivity is that of adjacency. Two vertices u,v ∈ V are said to be adjacent if joined by an edge in E . Such vertices are also referred to as neighbors. For example, the three neighbors of vertex 5 in our toy graph g are #2.29 1 > neighbors(g, 5) 2 [1] 3 4 6 Similarly, two edges e1,e2 ∈ E are adjacent if joined by a common endpoint in V . A vertex v∈V is incident on an edge e∈E if v is an endpoint of e. From this follows the notion of the degree of a vertex v, say dv, defined as the number of edges incident on v. 2.4 Talking About Graphs 23 #2.30 1 > degree(g) 2 1 2 3 4 5 6 7 3 2 3 3 4 3 3 2 For digraphs, vertex degree is replaced by in-degree (i.e., dinv ) and out-degree (i.e., doutv ), which count the number of edges pointing in towards and out from a vertex, respectively. #2.31 1 > degree(dg, mode="in") 2 Sam Mary Tom 3 0 2 2 4 > degree(dg, mode="out") 5 Sam Mary Tom 6 2 1 1 It is also useful to be able to discuss the concept of movement about a graph. For example, a walk on a graph G, from v0 to vl , is an alternating sequence {v0,e1,v1,e2, . . . ,vl−1,el ,vl}, where the endpoints of ei are {vi−1,vi}. The length of this walk is said to be l. Refinements of a walk include trails, which are walks without repeated edges, and paths, which are trails without repeated vertices. A trail for which the beginning and ending vertices are the same is called a circuit. Sim- ilarly, a walk of length at least three, for which the beginning and ending vertices are the same, but for which all other vertices are distinct from each other, is called a cycle. Graphs containing no cycles are called acyclic. In a digraph, these notions generalize naturally. For example, a directed walk from v0 to vl proceeds from tail to head along arcs between v0 and vl . A vertex v in a graph G is said to be reachable from another vertex u if there exists a walk from u to v. The graph G is said to be connected if every vertex is reachable from every other. A component of a graph is a maximally connected subgraph. That is, it is a connected subgraph of G for which the addition of any other remaining vertex in V would ruin the property of connectivity. The toy graph g, for example, is connected #2.32 1 > is.connected(g) 2 [1] TRUE and therefore consists of only a single component #2.33 1 > clusters(g) 2 $membership 3 [1] 1 1 1 1 1 1 1 4 5 $csize 6 [1] 7 7 8 $no 9 [1] 1 For a digraph, there are two variations of the concept of connectedness. A digraph G is weakly connected if its underlying graph (i.e., the result of stripping away the 24 2 Manipulating Network Data labels ‘tail’ and ‘head’ from G) is connected. It is called strongly connected if every vertex v is reachable from every u by a directed walk. The toy graphdg, for example, is weakly connected but not strongly connected. #2.34 1 > is.connected(dg, mode="weak") 2 [1] TRUE 3 > is.connected(dg, mode="strong") 4 [1] FALSE A common notion of distance between vertices on a graph is defined as the length of the shortest path(s) between the vertices (which we set equal to infinity if no such path exists). This distance is often referred to as geodesic distance, with ‘geodesic’ being another name for shortest paths. The value of the longest distance in a graph is called the diameter of the graph. Our toy graph g has diameter #2.35 1 > diameter(g, weights=NA) 2 [1] 3 Ultimately, the concepts above are only the most basic of graph-theoretic quanti- ties. There are a wide variety of queries one might make about graphs and quantities to calculate as a part of doing descriptive network analysis. We cover more of these in Chap. 4. 2.4.2 Special Types of Graphs Graphs come in all ‘shapes and sizes,’ as it were, but there are a number of families of graphs that are commonly encountered in practice. We illustrate this notion with the examples of four such families shown in Fig. 2.2. #2.36 1 > g.full <- graph.full(7) 2 > g.ring <- graph.ring(7) 3 > g.tree <- graph.tree(7, children=2, mode="undirected") 4 > g.star <- graph.star(7, mode="undirected") 5 > par(mfrow=c(2, 2)) 6 > plot(g.full) 7 > plot(g.ring) 8 > plot(g.tree) 9 > plot(g.star) A complete graph is a graph where every vertex is joined to every other vertex by an edge.
Compartilhar