M. Ataharul Islam · Rafiqul I. Chowdhury

Analysis of Repeated Measures Data

M. Ataharul Islam
Institute of Statistical Research and Training (ISRT)
University of Dhaka
Dhaka, Bangladesh

Rafiqul I. Chowdhury
Institute of Statistical Research and Training (ISRT)
University of Dhaka
Dhaka, Bangladesh

ISBN 978-981-10-3793-1        ISBN 978-981-10-3794-8 (eBook)
DOI 10.1007/978-981-10-3794-8
Library of Congress Control Number: 2017939538

© Springer Nature Singapore Pte Ltd. 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

During the past four decades, we have observed a steady increase in the use of repeated measures data. As the type of data in repeated measures can be discrete or continuous, quantitative or qualitative, there has been an increasing demand for models not only for normally distributed variables observed repeatedly over time but also for non-normal variables, where classical regression models are clearly inadequate or fail to address the objectives of studies conducted in various fields. There are well-documented developments in the analysis of repeated measures data under the normality assumption; however, the literature and textbooks are grossly inadequate for analyzing repeated measures data for non-normal variables. Since the introduction of the generalized linear model, the scope for generalizing regression models for non-normal data, in addition to data approximately based on the normality assumption, has widened to a great extent. This book presents a broad range of statistical techniques to address the emerging needs in the field of repeated measures.

The demand for statistical models for correlated outcomes grew rapidly during the recent past, mainly attributable to two types of underlying associations: (i) association between outcomes and (ii) association between explanatory variables and outcomes. In real-life situations, repeated measures data are currently available from various sources. This book provides a systematic treatment of the problems in modeling repeated measures data for estimating the underlying relationships between covariates and outcome variables for correlated data. In other words, this book is prepared to fulfill a long-standing demand for addressing repeated measures data analysis in real-life situations with models applicable to a wide range of correlated outcome variables.
This book starts with background chapters on the linear model, the exponential family of distributions, and generalized linear models. Throughout the book, except for Chap. 15, the concepts of generalized linear models have been used, with extensions wherever necessary. The developments in repeated measures data analysis can be categorized under three broad types: marginal models, conditional models, and joint models. In this book, we have included models belonging to all these types, and examples are given to illustrate the estimation and test procedures. In Chap. 5, covariate-dependent Markov models are introduced for first or higher orders. This book provides developments on modeling bivariate binary data in Chap. 6. On many occasions, researchers need conditional or joint models for analyzing correlated binary outcomes. Tests for dependence are also necessary to develop a modeling strategy for analyzing these data. These problems are discussed with applications in Chap. 6.

In modeling repeated measures data, the use of geometric models is very scanty. The problems associated with characterization are available in the literature, but bivariate geometric models with covariate dependence are scarce. It is noteworthy, however, that bivariate geometric models can be very useful in various fields where the incidence or first-time occurrence of two events, such as two diseases, is of interest. For understanding the risk factors associated with the incidence of two diseases or two complications, bivariate geometric models can provide deeper insight into the underlying mechanism.

The bivariate count models are useful in various disciplines such as economics, public health, epidemiology, environmental studies, reliability, and actuarial science. The count models are introduced in Chaps. 8 and 9, which include bivariate Poisson, bivariate double Poisson, bivariate negative binomial, and bivariate multinomial models.
The bivariate Poisson models are introduced for truncated data too. The under- or overdispersion problems are discussed, and test procedures are shown with examples. In reliability and other lifetime data analysis, the bivariate exponential models are very useful. In Chap. 10, an extended GLM is employed and a test for dependence is illustrated.

In repeated measures, the extended GLM approaches such as generalized estimating equations and generalized linear mixed models play very important roles. It is noteworthy that the use of quasi-likelihood methods created opportunities for exploring models when distributional assumptions are difficult to attain but the variance can be expressed as a function of the mean. In Chaps. 11–13, quasi-likelihood, generalized estimating equations, and generalized linear mixed models are discussed. Generalized multivariate models obtained by extending the concepts of GLM are shown in Chap. 14. This chapter includes simple ways to generalize the models for repeated measures data for two or more correlated outcome variables with covariate dependence.

In this book, the semi-parametric hazards models are also highlighted, which are being used extensively for analyzing failure time data arising from longitudinal studies that produce repeated measures. Multistate and multistage models, effective for analyzing repeated measures data, are illustrated for both graduate students and researchers. The problem of analyzing repeated measures data for failure time in the competing risk framework is included, which appears to have an increasingly important role in the fields of survival analysis, reliability, and actuarial science. For analyzing lifetime data, extended proportional hazards models such as multistate and multistage models with transitions, reverse transitions, and repeated transitions over time are introduced with applications in Chap. 15.
In many instances, the techniques for repeated measures data cannot be explored conveniently due to lack of appropriate software support. In Chap. 16, newly developed R packages and functions, along with the use of existing R packages, SAS codes, and macro/IML, are shown.

This book aims to provide important guidelines for both researchers and graduate students in the fields of statistics and applied statistics, biomedical sciences, epidemiology, reliability, survival analysis, econometrics, environment, social science, actuarial science, etc. Both theory and applications are presented in detail to make the book user-friendly. This book includes necessary illustrations and software usage outlines. In addition to researchers, graduate students and other users of statistical techniques for analyzing repeated measures data will benefit from this book. The potential users will find it a comprehensive reference book, essential for addressing challenges in analyzing repeated measures data with a deeper understanding of the nature of underlying relationships among outcome and explanatory variables in the presence of dependence among outcome variables.

We are grateful to our colleagues and students at the University of Dhaka, Universiti Sains Malaysia, King Saud University, and East West University. The idea of writing this book stemmed from teaching and supervising research students on repeated measures data analysis for many years. We want to thank Shahariar Huda for his continued support of our work. We extend our deepest gratitude to Amiya Atahar for her unconditional help during the final stage of writing this book. Further, we gratefully acknowledge the continued support from Tahmina Khatun, Farida Yeasmeen, and Jayati Atahar. We extend our deep gratitude to the University Grants Commission, Bangladesh and the World Bank for supporting the Higher Education Quality Enhancement Sub-project 3293 on repeated measures.
We are grateful to Rosihan M. Ali, Adam Baharum, V. Ravichandran, A.A. Kamil, Jahida Gulshan, O.I. Idais, and A.E. Tabl for their support. We are also indebted to Farzana Jahan, M. Aminul Islam and Mahfuza Begum for their support at different stages of writing this book.

Dhaka, Bangladesh
M. Ataharul Islam
Rafiqul I. Chowdhury

Contents

1 Introduction
2 Linear Models
  2.1 Simple Linear Regression Model
  2.2 Multiple Regression Model
  2.3 Estimation of Parameters
    2.3.1 Method of Least Squares
    2.3.2 Maximum Likelihood Estimation
  2.4 Tests
  2.5 Example
3 Exponential Family of Distributions
  3.1 Exponential Family and Sufficiency
  3.2 Some Important Properties
4 Generalized Linear Models
  4.1 Introduction
  4.2 Exponential Family and GLM
  4.3 Expected Value and Variance
  4.4 Components of a GLM
  4.5 Multinomial Response Model
  4.6 Estimating Equations
  4.7 Deviance
  4.8 Examples
5 Covariate-Dependent Markov Models
  5.1 Introduction
  5.2 First Order Markov Model
  5.3 Conditional Model for Second Order Markov Chain with Covariate Dependence
  5.4 Covariate Dependent Model for Markov Chain of Order r
  5.5 Tests for the Model
  5.6 Examples
6 Modeling Bivariate Binary Data
  6.1 Introduction
  6.2 Bivariate Bernoulli Distribution
  6.3 Bivariate Binary Model with Covariate Dependence
    6.3.1 Covariate-Dependent Model
    6.3.2 Likelihood Function and Estimating Equations
  6.4 Test for Dependence in Bivariate Binary Outcomes
    6.4.1 Measure of Dependence
    6.4.2 Test for the Model
    6.4.3 Test for Dependence
  6.5 Generalized Bivariate Bernoulli Model
    6.5.1 The Bivariate Bernoulli Model
    6.5.2 Estimating Equations
    6.5.3 Tests
  6.6 Some Alternative Binary Repeated Measures Models
  6.7 Examples
7 Bivariate Geometric Model
  7.1 Introduction
  7.2 Univariate Geometric Distribution
  7.3 Bivariate Geometric Distribution: Marginal and Conditional Models
  7.4 Bivariate Geometric Distribution: Joint Model
  7.5 Examples
8 Models for Bivariate Count Data: Bivariate Poisson Distribution
  8.1 Introduction
  8.2 The Poisson–Poisson Distribution
  8.3 Bivariate GLM for Poisson–Poisson
    8.3.1 Model and Estimation
    8.3.2 Overdispersion in Count Data
    8.3.3 Tests for Goodness of Fit
    8.3.4 Simple Tests for Overdispersion With or Without Covariate Dependence
  8.4 Zero-Truncated Bivariate Poisson
    8.4.1 Zero-Truncated Poisson Distribution
    8.4.2 A Generalized Zero-Truncated BVP Linear Model
    8.4.3 Test for the Model
    8.4.4 Deviance and Goodness of Fit
  8.5 Right-Truncated Bivariate Poisson Model
    8.5.1 Bivariate Right-Truncated Poisson–Poisson Model
    8.5.2 Predicted Probabilities
    8.5.3 Test for Goodness of Fit
  8.6 Double Poisson Distribution
    8.6.1 Double Poisson Model
    8.6.2 Bivariate Double Poisson Model
  8.7 Applications
9 Bivariate Negative Binomial and Multinomial Models
  9.1 Introduction
  9.2 Review of GLM for Multinomial
  9.3 Bivariate Multinomial
  9.4 Tests for Comparison of Models
  9.5 Negative Multinomial Distribution and Bivariate GLM
    9.5.1 GLM for Negative Multinomial
  9.6 Application of Negative Multinomial Model
10 Bivariate Exponential Model
  10.1 Introduction
  10.2 Bivariate Exponential Distributions
  10.3 Bivariate Exponential Generalized Linear Model
  10.4 Bivariate Exponential GLM Proposed by Iwasaki and Tsubaki
  10.5 Example
11 Quasi-Likelihood Methods
  11.1 Introduction
  11.2 Likelihood Function and GLM
  11.3 Quasi-likelihood Functions
  11.4 Estimation of Parameters
  11.5 Examples
12 Generalized Estimating Equation
  12.1 Introduction
  12.2 Background
  12.3 Estimation of Parameters
  12.4 Steps in a GEE: Estimation and Test
  12.5 Examples
13 Generalized Linear Mixed Models
  13.1 Introduction
  13.2 Generalized Linear Mixed Model
  13.3 Identity Link Function
  13.4 Logit Link Function
  13.5 Log Link Function
  13.6 Multinomial Data
  13.7 Examples
14 Generalized Multivariate Models
  14.1 Introduction
  14.2 Multivariate Poisson Distribution
  14.3 Multivariate Negative Binomial Distribution
  14.4 Multivariate Geometric Distribution
  14.5 Multivariate Normal Distribution
  14.6 Examples
15 Multistate and Multistage Models
  15.1 Introduction
  15.2 Some Basic Concepts
  15.3 Censoring: Construction of Likelihood Function
  15.4 Proportional Hazards Model
  15.5 Competing Risk Proportional Hazards Model
  15.6 Multistate Hazards Model
  15.7 Multistage Hazards Model
  15.8 Examples
16 Analysing Data Using R and SAS
  16.1 Description
References
Subject Index
About the Authors

M. Ataharul Islam is currently QM Husain Professor at the Institute of Statistical Research and Training (ISRT), University of Dhaka, Bangladesh. He was a Professor of Statistics at the Universiti Sains Malaysia, King Saud University, East West University, and the University of Dhaka. He served as a visiting faculty member at the University of Hawaii and the University of Pennsylvania. He is a recipient of the Pauline Stitt Award, the Western North American Region (WNAR) Biometric Society Award for content and writing, the University Grants Commission Award for book and research, and the Ibrahim Memorial Gold Medal for research. He has published more than 100 papers in international journals on various topics, mainly on longitudinal and repeated measures data, including multistate and multistage hazards models, statistical modeling, Markov models with covariate dependence, generalized linear models, and conditional and joint models for correlated outcomes. He authored a book on Markov models, jointly edited another book, and contributed chapters to several books.

Rafiqul I. Chowdhury, a former senior lecturer at the Department of Health Information Administration, Kuwait University, Kuwait, has been widely involved in various research projects as a research collaborator and consultant. He has extensive experience in statistical computing with large data sets, especially with repeated measures data. He has published more than 60 papers in international journals on statistical computing, repeated measures data, and utilization of healthcare services, among others, and has presented papers at various conferences. He co-authored a book on Markov models and wrote programs and developed packages for marginal, conditional and joint models, including multistate Markov and hazards models and bivariate generalized linear models on Poisson, geometric, and Bernoulli distributions, using SAS and R.

List of Figures

Fig. 2.1 Population Regression Model
Fig. 2.2 Simple Linear Regression
Fig. 15.1 States and transition for a simple proportional hazards model
Fig. 15.2 Example of a multistate model
Fig. 15.3 Example of a multistage model for maternal morbidity
Fig. 15.4 States and Transitions in a Simplified Multistage Model

List of Tables

Table 1.1 Status of disease at different follow-up times (Yij)
Table 1.2 Occurrence of diabetes and heart problem by subjects and waves
Table 2.1 Estimates and tests of parameters of a simple regression model
Table 2.2 Estimates and tests of parameters of a multiple linear regression model
Table 4.1 Estimation of parameters of GLM using identity link function
Table 4.2 Estimates of parameters of GLM for binary outcomes on depression
Table 4.3 Distribution of number of conditions
Table 4.4 Estimates of parameters of GLM using log link function for number of conditions
Table 4.5 Negative binomial GLM of number of conditions
Table 5.1 Frequency of depression in four waves
Table 5.2 Transition counts and transition probabilities for first-order Markov model
Table 5.3 Estimates for first-order Markov model
Table 5.4 Transition counts and transition probabilities for second-order Markov model
Table 5.5 Estimates for second-order Markov model
Table 5.6 Transition counts and transition probabilities for third-order Markov model
Table 5.7 Estimates for third-order Markov model
Table 5.8 Test for the order of Markov model
Table 6.1 Bivariate probabilities for two outcome variables, Y1 and Y2
Table 6.2 Transition count and probability for Y1 and Y2
Table 6.3 Estimates for two conditionals and one marginal model
Table 6.4 Observed and predicted counts from the bivariate distribution
Table 7.1 Frequency of incidence of diabetes followed by stroke
Table 7.2 Estimates of the parameters of Model 1
Table 7.3 Estimates of parameters of Model 2
Table 8.1 Bivariate distribution of outcome variables
Table 8.2 Fit of bivariate Poisson model (marginal/conditional), both unadjusted and adjusted for over- or underdispersion
Table 8.3 Right-truncated bivariate Poisson model (marginal/conditional)
Table 8.4 Zero-truncated bivariate Poisson model (marginal/conditional)
Table 8.5 Estimates of parameters of bivariate double Poisson model (Model 2)
Table 9.1 Estimates of parameters of bivariate negative binomial model using marginal–conditional approach
Table 9.2 Estimates of the parameters of bivariate negative binomial model (joint model)
Table 10.1 Distribution of diabetes and heart problems in different waves
Table 10.2 Estimates of bivariate exponential full model
Table 10.3 Likelihood ratio tests for overall model and association parameters
Table 11.1 Estimated parameters and tests for number of conditions using quasi-likelihood method
Table 11.2 Estimated parameters and tests for counts of healthcare services utilizations using quasi-likelihood method
Table 12.1 GEE for various correlation structures
Table 12.2 ALR with different correlation structures
Table 13.1 Generalized linear mixed model with random intercept for binary responses on depression status from the HRS data
Table 13.2 Random effect estimates for selected subjects
Table 13.3 Predicted probabilities for selected subjects
Table 13.4 Healthcare services utilization by waves
Table 13.5 Generalized linear mixed model for log link function for healthcare services utilization with random intercepts
Table 14.1 Estimates of the parameters of multivariate Poisson model
Table 15.1 Number of different types of transitions
Table 15.2 Estimates from multistate hazards model for depression data
Table 15.3 Test for proportionality for different transitions
Table 15.4 Estimates from multistage hazards model for complications in three stages
Table 15.5 Test for proportionality for different transitions during antenatal, delivery, and postnatal stages
Table 15.6 Estimates from multistage hazards model for Model II

Chapter 1
Introduction

The field of repeated measures has been growing very rapidly, mainly due to the increasing demand for statistical techniques for analyzing repeated measures data in various disciplines such as biomedical sciences, epidemiology, reliability, econometrics, environment, social science, etc. Repeated measures data may comprise either responses from each subject/experimental unit longitudinally at multiple occasions or responses under multiple conditions. The responses may be qualitative (categorical) or quantitative (discrete or continuous). The analysis of repeated measures data becomes complex due to the presence of two types of associations: one is the association between response and explanatory variables, and the other is the association between outcome variables.

Repeated measures data from longitudinal studies are collected over time on each study participant or experimental unit. The changes in both outcome variables and factors associated with changes in outcome variables within individuals may provide useful insights. In addition, relationships between outcome variables, as well as between outcome variables observed at different times and covariates, can be studied thoroughly if we have repeated data on the same individuals or experimental units.
The study of change in the observed outcome status of participants provides important in-depth insights into the dynamics of the underlying relationships between the outcome status of participants and their characteristics represented by covariates, in the presence of dependence in outcomes. For analyzing multivariate data from repeated measures, the type of association between outcome variables due to repeated occurrence of events from the same participants is of great concern. In other words, the nature of correlation within subjects needs to be taken into account.

Two data layout designs are displayed in Tables 1.1 and 1.2. In the first layout design, each of the 5 subjects is followed up for 4 time points, and the status of a disease, such as whether diabetes is controlled or uncontrolled at each time point, is recorded. Let us denote Yij = 1 if diabetes is uncontrolled for the ith individual at the jth follow-up, and Yij = 0 otherwise; i = 1, …, 5; j = 1, …, 4. The number of follow-ups for subjects can be equal (balanced) or unequal (unbalanced).

Table 1.2 shows a dummy table for the occurrence of diabetes and heart problem observed repeatedly over 11 time points (waves) specified by equal intervals.
Let us denote $Y_{1ij} = 1$ if diabetes is reported for the ith individual at the jth follow-up, and $Y_{1ij} = 0$ otherwise; $Y_{2ij} = 1$ if heart problem is reported for the ith individual at the jth follow-up, and $Y_{2ij} = 0$ otherwise; a 9 denotes a missing value; k = 1, 2; i = 1,…,5; j = 1,…,11.

Table 1.1 Status of disease at different follow-up times ($Y_{ij}$)

Subject (i)   T1   T2   T3   T4
1              0    0    1    1
2              1    1    0    1
3              0    1    1    0
4              0    0    0    0
5              1    1    1    1

Table 1.2 Occurrence of diabetes and heart problem by subjects and waves; each cell shows ($Y_{1ij}$, $Y_{2ij}$)

              Wave 1   Wave 2   Wave 3   Wave 4
Subject 1     0, 1     0, 0     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 0     0, 0
Subject 3     0, 0     0, 0     0, 0     0, 0
Subject 4     0, 0     0, 0     0, 0     0, 0
Subject 5     1, 0     0, 0     1, 0     1, 0

              Wave 5   Wave 6   Wave 7   Wave 8
Subject 1     0, 1     0, 1     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 0     0, 0
Subject 3     0, 0     0, 0     0, 0     0, 0
Subject 4     1, 0     1, 0     1, 0     1, 0
Subject 5     1, 0     1, 1     1, 1     1, 1

              Wave 9   Wave 10  Wave 11
Subject 1     0, 1     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 1
Subject 3     0, 0     0, 0     0, 0
Subject 4     1, 0     1, 0     1, 0
Subject 5     1, 9     9, 9     9, 9

Dependence among outcomes is a common feature of repeated measures data. Hence, a systematic approach to dealing with correlated outcomes, along with their relationship with covariates, is the foremost challenge in analyzing repeated measures data. If the outcome variables were independent, the modeling of the relationship between explanatory and outcome variables would reduce to marginal models, but this may not reflect the reality of repeated measures because the data are obtained from each subject/experimental unit at multiple occasions or under multiple conditions. In that case, dependence among outcome variables may hardly satisfy the underlying conditions for a marginal model. In other words, marginal models may provide misleading results in analyzing repeated measures data due to the exclusion of the correlation among outcome variables from the models.
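The layout of Table 1.1 can be generated in code. The following minimal sketch (ours, not the book's; the transition probabilities `p01`, `p11` and initial probability `p_init` are hypothetical) simulates a balanced binary layout in which each subject's current status depends on the previous one, i.e., with first-order within-subject dependence:

```python
import numpy as np

# Illustrative sketch (not from the book): simulate a balanced 5 x 4
# binary layout like Table 1.1 in which each subject's current status
# depends on the previous one (first-order Markov dependence).
# p01, p11, p_init are hypothetical transition/initial probabilities.
rng = np.random.default_rng(42)

def simulate_layout(n_subjects=5, n_times=4, p01=0.3, p11=0.7, p_init=0.4):
    """Return an n_subjects x n_times array of 0/1 outcomes Y[i, j]."""
    Y = np.zeros((n_subjects, n_times), dtype=int)
    Y[:, 0] = rng.random(n_subjects) < p_init
    for j in range(1, n_times):
        # P(Y_ij = 1 | Y_i,j-1) is p11 if the previous status was 1, else p01
        p = np.where(Y[:, j - 1] == 1, p11, p01)
        Y[:, j] = rng.random(n_subjects) < p
    return Y

Y = simulate_layout()
print(Y)
```

An unbalanced design could be simulated the same way by drawing a different number of follow-ups for each subject.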
An alternative to the marginal models is to employ conditional models, such as models based on Markovian assumptions, where a model is constructed for an outcome variable at the current time given the value of the outcome observed previously. The order of the Markov chain may vary depending on the underlying nature of transitions over time. Since the development of the generalized linear model (GLM), there is scope to generalize linear models for different types of outcome or response variables (normal or nonnormal, discrete or continuous, qualitative) that belong to the exponential family of distributions, using different link functions. The exponential family form

$f(y; \theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}$

provides the minimal sufficient statistic. The following alternative expression for the exponential family of distributions

$f(y; \theta) = e^{\left\{\frac{y\theta - b(\theta)}{a(\phi)}\right\} + c(y, \phi)}$

can be used to identify the canonical parameter, and the link between the random and systematic components can be specified. There has been extensive work on the univariate GLM, but only some isolated efforts have been made to extend the usefulness of generalized linear models to dependent outcomes generated from repeated measures data. Some generalizations are available for bivariate binary and count data, and it is noteworthy that both bivariate Bernoulli and count models have a wide range of applications in various fields. An example of a bivariate model for the binary outcome variables $Y_1$ and $Y_2$ can be expressed in the following form:

$P(y_1, y_2) = P_{00}^{(1-y_1)(1-y_2)} \, P_{01}^{(1-y_1)y_2} \, P_{10}^{y_1(1-y_2)} \, P_{11}^{y_1 y_2}.$

Using the first-order Markov chain, transition probabilities can be defined as
$P(Y_{ij} \mid Y_{i,j-r}, \ldots, Y_{i,j-1}) = P(Y_{ij} \mid Y_{i,j-1}).$

This relationship provides a conditional–marginal route to obtain the joint form

$P(Y_1 = j, Y_2 = k) = P(Y_2 = k \mid Y_1 = j) \times P(Y_1 = j), \quad j = 0, 1; \; k = 0, 1.$

A similar approach can be shown for some other bivariate distributions such as the Poisson, geometric, negative binomial, multinomial, exponential, etc. These distributions can be expressed in the bivariate exponential family by generalizing the univariate form as shown below:

$f(y; \theta) = e^{\left\{\frac{y_1\theta_1 + y_2\theta_2 - b(\theta_1, \theta_2)}{a(\phi)}\right\} + c(y_1, y_2, \phi)}$

where $\theta_1$ and $\theta_2$ are canonical parameters such that $\theta_1 = g(\mu_1) = \eta_1 = X\beta_1$ and $\theta_2 = g(\mu_2) = \eta_2 = X\beta_2$. Here, $\mu_1 = E(Y_1 \mid X)$ and $\mu_2 = E(Y_2 \mid X)$. For generalized linear models, it is essential to know the random component of the model, which represents the underlying distributional form of the outcome variable. If the form of the distribution is known, then the likelihood estimation procedure can be applied to estimate the parameters of the linear model. However, in many cases the form of the underlying distribution may not be known. In that case, the quasi-likelihood approach can be used. For analyzing repeated measures data, the quasi-likelihood estimation procedure has become widely popular among researchers. In the quasi-likelihood method, we need to know the expected values of the outcome variables, and the variance functions need to be expressed as functions of the mean. The variance of an outcome variable can be written as $\mathrm{Var}(Y) = a(\phi)v(\mu)$, where $a(\phi)$ is the dispersion parameter and $v(\mu)$ is the variance function. The quasi-likelihood function, or more specifically the quasi-log-likelihood (Nelder and Lee 1992), is defined for a single observation as

$Q(\mu; y) = \int_y^{\mu} \frac{y - t}{a(\phi)v(t)} \, dt.$

The quasi-score function can be obtained by differentiating Q with respect to $\mu$, as shown below:

$\frac{\partial Q}{\partial \mu} = \frac{y - \mu}{a(\phi)v(\mu)}.$

For the quasi-log-likelihood of independent observations
$y_1, \ldots, y_n$, it can be shown that

$Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu_i} \frac{y_i - t_i}{a(\phi)v(t_i)} \, dt_i.$

The estimating equations for the parameters of the linear model are

$U(\beta) = \frac{\partial Q}{\partial \beta} = \sum_{i=1}^{n} \left(\frac{\partial \mu_i}{\partial \beta}\right)' \frac{(y_i - \mu_i)}{v(\mu_i)} = 0,$

which are known as quasi-score equations. This can be rewritten in the following form for repeated measures data:

$U(\beta) = \frac{\partial Q}{\partial \beta} = \left(\frac{\partial \mu}{\partial \beta}\right)' V^{-1}(y - \mu) = D'V^{-1}(y - \mu) = 0.$

The generalized estimating equation (GEE) provides a marginal model which depends on the choice of a correlation structure. The estimating equations using quasi-likelihood scores can be shown as

$U(\beta) = \sum_{i=1}^{n} D_i' V_i(\mu_i, \alpha)^{-1}(y_i - \mu_i) = 0$

where $V_i(\mu_i, \alpha) = A_i^{1/2} R(\alpha) A_i^{1/2} a(\phi)$ and $R(\alpha)$ is a working correlation matrix expressed as a function of $\alpha$. The generalized estimating equation is an extension of the generalized linear model for repeated observations; more specifically, GEE is a quasi-likelihood approach based on knowledge of the first two moments, where the second moment is a function of the first. However, due to its marginal or population averaged modeling, the utility of the generalized estimating equations remains restricted. Although a correlation structure is considered in the marginal model framework, the within-subject association incorporated in the estimation of the parameters remains largely beyond explanation. An alternative way to incorporate the within-subject variation in the linear model is to use a generalized linear mixed model, where random effects attributable to within-subject variation are incorporated. The generalized linear model is

$g(\mu_i) = X_i\beta, \quad i = 1, \ldots, n$

with $E(Y_i \mid X_i) = \mu_i(\beta)$ and $\mathrm{Var}(Y_i) = a(\phi)V(\mu_i)$. This model can be extended to the jth repeated observation on the ith subject as

$g(\mu_{ij}) = X_{ij}\beta, \quad i = 1, \ldots, n; \; j = 1, \ldots, J_i$

with $E(Y_{ij} \mid X_{ij}) = \mu_{ij}(\beta)$ and $\mathrm{Var}(Y_{ij}) = a(\phi)V(\mu_{ij})$. Then, considering a random effect $u_i$ for the repeated observations of the ith subject or cluster, we can introduce an extended model

$g(\mu_{ij}) = X_{ij}\beta + Z_i u_i, \quad i = 1, \ldots, n; \; j = 1, \ldots, J_i$
where $u_i \sim \mathrm{MVN}(0, \Sigma)$. Instead of the normality assumption, other assumptions may be considered depending on the type of data. Another alternative to the marginal model is the conditional model, which can provide a useful analysis by introducing a model for an outcome variable given the values of other outcome variables. One popular technique is based on the Markovian assumption, where the transition probabilities are considered as functions of covariates and previous outcomes. The models can take first or higher order into account, and a test for order may make the model more specific. Markov models are suitable for longitudinal data observed over fixed intervals of time. A more efficient modeling of repeated measures requires multivariate models, which can be obtained from the marginal–conditional approach or from the joint distribution of the outcome variables. The conditional models for binary outcome variables $Y_1$ and $Y_2$, using a first-order Markov model, can be expressed as follows:

$P(Y_{2i} = 1 \mid Y_{1i} = 0, X_i) = \frac{e^{X_i\beta_{01}}}{1 + e^{X_i\beta_{01}}}$ and $P(Y_{2i} = 1 \mid Y_{1i} = 1, X_i) = \frac{e^{X_i\beta_{11}}}{1 + e^{X_i\beta_{11}}}$

where $\beta_{01}' = [\beta_{010}, \beta_{011}, \ldots, \beta_{01p}]$, $\beta_{11}' = [\beta_{110}, \beta_{111}, \ldots, \beta_{11p}]$, and $X_i = [1, X_{1i}, \ldots, X_{pi}]$. The marginal models for $Y_1$ and $Y_2$ are

$P(Y_{1i} = 1 \mid X_i) = \frac{e^{X_i\beta_1}}{1 + e^{X_i\beta_1}}$ and $P(Y_{2i} = 1 \mid X_i) = \frac{e^{X_i\beta_2}}{1 + e^{X_i\beta_2}}.$

Here $\beta_1' = [\beta_{10}, \beta_{11}, \ldots, \beta_{1p}]$, $\beta_2' = [\beta_{20}, \beta_{21}, \ldots, \beta_{2p}]$, and $x_i = [1, x_{1i}, \ldots, x_{pi}]$. The semi-parametric hazards models provide models for analyzing lifetime data arising from longitudinal studies that produce repeated measures. The multistate and multistage models can be effective for analyzing data on transitions, reverse transitions, and repeated transitions that take place over time in the status of events. It is useful to study the transitions over time as functions of covariates or risk factors. In survival or reliability analysis, we have to deal with censored data, which is the most common source of incomplete data in longitudinal studies.
The proportional hazards models for one or more transient states can be obtained for partially censored data. The problem of analyzing repeated measures data on failure times in the competing risks framework has been of interest in various fields including survival analysis, reliability, and actuarial science. The hazard function for failure type J = j, where J = 1,…,k, with covariate dependence, can be shown as

$h_j(t; x) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t, J = j \mid T \ge t, x)}{\Delta t}.$

Then the cause-specific proportional hazards model is

$h_{ij}(t_i; x_i) = h_{0ij}(t)e^{x_i\beta_j}$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and the parameter vector is $\beta_j = (\beta_{j1}, \ldots, \beta_{jp})'$, j = 1,…,k. Extending the cause-specific hazard function to transitions among several transient states, we can define the multistate hazard function for a transition from state j to state k during $(t, t + \Delta t)$ as

$h(t; k \mid j, x_{jk}) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t, S = k \mid T \ge t, S = j, x_{jk})}{\Delta t}$

and the proportional hazards model for multistate transitions is

$h(t; k \mid j, x_{jk}) = h_{0jk}(t)e^{x_{jk}\beta_{jk}}$

where $\beta_{jk}$ is the vector of parameters for the transition from j to k and $x_{jk}$ is the vector of covariate values. In this book, the inferential techniques for modeling repeated measures data are illustrated to provide a detailed background with applications. The estimation procedures for various models for analyzing repeated measures data are of prime concern and remain a challenge to users. For testing the dependence among outcomes, some test procedures are illustrated in this book for binary, count, and continuous outcome variables. The goodness of fit tests are provided with applications. For correlated Poisson outcomes, the problem of under- or overdispersion is addressed, and tests for under- or overdispersion are highlighted with examples. In many instances truncation is one of the major problems in analyzing correlated outcomes, such as zero or right truncation, particularly in count regression models, which are also discussed in this book.
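The first-order Markov conditional logistic models and the marginal–conditional factorization described above can be sketched numerically. In this illustrative example (ours, not the book's), the coefficient vectors and the marginal probability are hypothetical values:

```python
import numpy as np

# Illustrative sketch (not from the book): first-order Markov conditional
# logistic models with hypothetical coefficient vectors beta01 (given
# Y1 = 0) and beta11 (given Y1 = 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

beta01 = np.array([-1.0, 0.5])   # [intercept, slope] for P(Y2=1 | Y1=0, x)
beta11 = np.array([0.2, 0.8])    # [intercept, slope] for P(Y2=1 | Y1=1, x)

def p_transition(x, y1):
    """P(Y2 = 1 | Y1 = y1, x) under the conditional logistic model."""
    row = np.array([1.0, x])             # design row [1, x]
    beta = beta11 if y1 == 1 else beta01
    return float(logistic(row @ beta))

# Marginal-conditional route to a joint probability:
# P(Y1 = 1, Y2 = 1 | x) = P(Y2 = 1 | Y1 = 1, x) * P(Y1 = 1 | x),
# with an assumed marginal P(Y1 = 1 | x) = 0.3.
joint_11 = p_transition(0.5, 1) * 0.3
print(round(joint_11, 4))
```

In practice the $\beta$ vectors would be estimated from transition data rather than fixed, but the factorization of the joint probability proceeds exactly as shown.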
Chapter 2 Linear Models

In this chapter, a brief introduction to linear models is presented. Linearity can be interpreted in terms of either linearity in parameters or linearity in variables. In this book, we consider linearity in the parameters of a model. Linear models generally include regression models, analysis of variance models, and analysis of covariance models. As the focus of this book is to address various generalized linear models for repeated measures data using GLM and Markov chains/processes, regression models are reviewed in this chapter only very briefly.

2.1 Simple Linear Regression Model

Let us consider a random sample of n pairs of observations $(Y_1, X_1), \ldots, (Y_n, X_n)$. Here, let Y be the dependent variable or outcome and X be the independent variable or predictor. Then the simple regression model, or the regression model with a single predictor, is denoted by

$E(Y \mid X) = \beta_0 + \beta_1 X. \quad (2.1)$

It is clear from (2.1) that the simple regression model is a population averaged model. Here $E(Y \mid X) = \mu_{Y|X}$ represents the conditional expectation of Y for given X. In other words,

$\mu_{Y|X} = \beta_0 + \beta_1 X \quad (2.2)$

which can be visualized from Fig. 2.1. An alternative way to represent model (2.1) or (2.2) is

$Y = \beta_0 + \beta_1 X + \epsilon \quad (2.3)$

where $\epsilon$ denotes the distance of Y from the conditional expectation or conditional mean $\mu_{Y|X}$, as evident from the expression

$Y = \mu_{Y|X} + \epsilon \quad (2.4)$

where $\epsilon$ denotes the error in the dependent or outcome variable Y attributable to the deviation from the population averaged model; $\epsilon$ is a random variable as well, with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.

2.2 Multiple Regression Model

We can extend the simple regression model shown in Sect. 2.1 to a multiple regression model with p predictors $X_1, \ldots, X_p$.
The population averaged model can be shown as

$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p. \quad (2.5)$

Here $E(Y \mid X) = \mu_{Y|X}$ as shown in Sect. 2.1. Alternatively,

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \quad (2.6)$

which can be expressed as

$Y = \mu_{Y|X} + \epsilon. \quad (2.7)$

(Fig. 2.1 Population regression model)

In vector and matrix notation, the model in Eq. (2.6) for a sample of size n is

$Y = X\beta + \epsilon \quad (2.8)$

where $Y = (Y_1, Y_2, \ldots, Y_n)'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)'$, and

$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}.$

It is clear from the formulation of the regression model that it provides a theoretical framework for explaining the underlying linear relationships between the explanatory and outcome variables of interest. A perfect model can be obtained only if all the values of the outcome variable are equal to the conditional expectation for given values of the predictors, which is not feasible in explaining real life problems. However, it can still provide very important insight when the specified model keeps the error minimal. Hence, it is important to specify a model that produces estimates of the outcome variable as close to the observed values as possible. In other words, the postulated models in Sects. 2.1 and 2.2 are hypothetical idealized versions of the underlying linear relationships, which may be attributed merely to association or, in some instances, to causation as well. The population regression model is proposed under a set of assumptions: (i) $E(\epsilon_i) = 0$, (ii) $\mathrm{Var}(\epsilon_i) = \sigma^2$, (iii) $E(\epsilon_i\epsilon_j) = 0$ for $i \ne j$, and (iv) independence of X and $\epsilon$. In addition, the assumption of normality is necessary for likelihood estimation as well as for testing of hypotheses. Based on these assumptions, we can show the mean and variance of $Y_i$ as follows: $E(Y_i \mid X_i) = X_i\beta$ and $\mathrm{Var}(Y_i \mid X_i) = \sigma^2$, where $X_i$ is the ith row vector of the matrix X.
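As a quick numerical illustration of the model $Y = X\beta + \epsilon$ and its assumptions, one can simulate data and recover $\beta$ with the least squares estimator $(X'X)^{-1}X'Y$ derived in Sect. 2.3. This is a sketch with simulated data (not the book's example); the sample size, coefficients, and error scale are arbitrary:

```python
import numpy as np

# Sketch (simulated data, not the book's): generate Y = X beta + e under
# the stated assumptions and recover beta with (X'X)^{-1} X'Y.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column of 1s + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)           # errors: mean 0, sigma = 0.1

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # solves the normal equations (X'X) b = X'Y
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - p - 1)              # unbiased estimator of sigma^2
print(beta_hat.round(2), round(float(s2), 4))
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming $(X'X)^{-1}$ explicitly.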
Using (2.8), we can rewrite the assumptions as follows: (i) $E(\epsilon) = 0$, and (ii) $\mathrm{Cov}(\epsilon) = \sigma^2 I$. Similarly, $E(Y \mid X) = X\beta$ and $\mathrm{Cov}(Y \mid X) = \sigma^2 I$.

2.3 Estimation of Parameters

For estimating the regression parameters, we can use both the method of least squares and the method of maximum likelihood. It may be noted here that for extending the concept of linear models to generalized linear models or covariate dependent Markov models, the maximum likelihood method will be used more extensively; hence, both are discussed here, although the method of least squares is the more convenient method of estimation for the linear regression model, with desirable properties.

2.3.1 Method of Least Squares

The method of least squares estimates the regression parameters by minimizing the error sum of squares, or residual sum of squares. The regression model is

$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad i = 1, 2, \ldots, n \quad (2.9)$

and we can define the deviation between the outcome variable and its corresponding conditional mean for given values of X as follows:

$\epsilon_i = Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}). \quad (2.10)$

Then the error sum of squares is defined as a quadratic form

$Q = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left[Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})\right]^2. \quad (2.11)$

The sum of squares of error is minimized if the estimates are obtained from the following equations:

$\left.\frac{\partial Q}{\partial \beta_0}\right|_{\beta = \hat{\beta}} = -2\sum_{i=1}^{n} \left[Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip})\right] = 0 \quad (2.12)$

$\left.\frac{\partial Q}{\partial \beta_j}\right|_{\beta = \hat{\beta}} = -2\sum_{i=1}^{n} \left[Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip})\right]X_{ij} = 0, \quad (2.13)$

j = 1,…,p. We can consider (2.12) as a special case of Eq. (2.13) for j = 0 and $X_0 = 1$. Using model (2.8), Q can be expressed as

$Q = \epsilon'\epsilon = (Y - X\beta)'(Y - X\beta). \quad (2.14)$

The right-hand side of (2.14) is

$Q = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta$

where $Y'X\beta = \beta'X'Y$. Hence the estimating equations are

$\left.\frac{\partial Q}{\partial \beta}\right|_{\beta = \hat{\beta}} = -2X'Y + 2X'X\hat{\beta} = 0. \quad (2.15)$

Solving Eq.
(2.15), we obtain the least squares estimators of the regression parameters as shown below:

$\hat{\beta} = (X'X)^{-1}(X'Y). \quad (2.16)$

The estimated regression model can be shown as

$\hat{Y} = X\hat{\beta} \quad (2.17)$

and alternatively

$Y = X\hat{\beta} + e \quad (2.18)$

where $\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)'$, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)'$, $e = (e_1, e_2, \ldots, e_n)'$, and X is the $n \times (p+1)$ design matrix shown earlier. It may be noted here that e is the vector of estimated errors from the fitted model. Hence, we can show that

$e = Y - \hat{Y} \quad (2.19)$

and the error sum of squares is

$e'e = (Y - \hat{Y})'(Y - \hat{Y}). \quad (2.20)$

2.3.1.1 Some Important Properties of the Least Squares Estimators

The least squares estimators have some desirable properties of good estimators, which are shown below.

(i) Unbiasedness: $E(\hat{\beta}) = \beta$.

Proof: We know that $\hat{\beta} = (X'X)^{-1}(X'Y)$ and $Y = X\beta + \epsilon$. Hence,

$E(\hat{\beta}) = E[(X'X)^{-1}(X'Y)] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'E(X\beta + \epsilon) = (X'X)^{-1}X'X\beta = \beta.$

(ii) $\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}\sigma^2$.

Proof: $\mathrm{Cov}(\hat{\beta}) = \mathrm{Cov}[(X'X)^{-1}X'Y] = (X'X)^{-1}X'\mathrm{Cov}(Y)X(X'X)^{-1}$, where $\mathrm{Cov}(Y) = \sigma^2 I$. Hence,

$\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}X'IX(X'X)^{-1}\sigma^2 = (X'X)^{-1}\sigma^2. \quad (2.21)$

(iii) The least squares estimator $\hat{\beta}$ is the best linear unbiased estimator of $\beta$.

(iv) The mean squared error is an unbiased estimator of $\sigma^2$. In other words,

$E\left(\frac{e'e}{n - p - 1}\right) = \sigma^2. \quad (2.22)$

Proof: Let us denote $SSE = e'e = (Y - X\hat{\beta})'(Y - X\hat{\beta})$ and $s^2 = \frac{SSE}{n - p - 1}$, where p is the number of predictors. The total sum of squares of Y is $Y'Y$. The sum of squares of errors can be rewritten as

$SSE = Y'Y - Y'X\hat{\beta} - \hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta}$

where $Y'X\hat{\beta} = \hat{\beta}'X'Y$.
Then, replacing $\hat{\beta}$ by $(X'X)^{-1}(X'Y)$, it can be shown that

$SSE = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - \hat{\beta}'X'Y = Y'Y - [(X'X)^{-1}X'Y]'X'Y = Y'Y - Y'X(X'X)^{-1}X'Y = Y'[I - X(X'X)^{-1}X']Y.$

It can be shown that the matrix in the middle of the above expression is a symmetric idempotent matrix, and $SSE/\sigma^2$ is chi-square with degrees of freedom equal to the rank of the matrix $[I - X(X'X)^{-1}X']$. The rank of this idempotent matrix is equal to $\mathrm{trace}[I - X(X'X)^{-1}X']$, which is $n - p - 1$. Hence,

$E[(n - p - 1)s^2/\sigma^2] = E(SSE/\sigma^2) = \mathrm{trace}[I - X(X'X)^{-1}X'] = n - p - 1.$

This implies $E(SSE) = (n - p - 1)\sigma^2$ and

$E\left(\frac{SSE}{n - p - 1}\right) = \sigma^2.$

In other words, the mean square error is an unbiased estimator of $\sigma^2$, i.e., $E(s^2) = \sigma^2$.

2.3.2 Maximum Likelihood Estimation

It is noteworthy that estimation by the least squares method does not require the normality assumption. However, the estimates of the regression parameters can also be obtained assuming that $Y \sim N_n(X\beta, \sigma^2 I)$, where $E(Y \mid X) = X\beta$ and $\mathrm{Var}(Y \mid X) = \sigma^2 I$.
The likelihood function is

$L(\beta, \sigma^2) = \frac{1}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}} e^{-(Y - X\beta)'(\sigma^2 I)^{-1}(Y - X\beta)/2} = \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-(Y - X\beta)'(Y - X\beta)/2\sigma^2}.$

The log-likelihood function can be shown as follows:

$\ln L(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta). \quad (2.23)$

Differentiating (2.23) with respect to the parameters and equating to zero, we obtain the following equations:

$\left.\frac{\partial \ln L}{\partial \beta}\right|_{\beta = \hat{\beta},\, \sigma^2 = \hat{\sigma}^2} = -\frac{1}{2\hat{\sigma}^2}(-2X'Y + 2X'X\hat{\beta}) = 0 \quad (2.24)$

$\left.\frac{\partial \ln L}{\partial \sigma^2}\right|_{\beta = \hat{\beta},\, \sigma^2 = \hat{\sigma}^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2}(Y - X\hat{\beta})'(Y - X\hat{\beta}) = 0. \quad (2.25)$

Solving (2.24) and (2.25), we obtain the following maximum likelihood estimators:

$\hat{\beta} = (X'X)^{-1}(X'Y), \quad \hat{\sigma}^2 = \frac{1}{n}(Y - X\hat{\beta})'(Y - X\hat{\beta}).$

2.3.2.1 Some Important Properties of Maximum Likelihood Estimators

Some important properties of the maximum likelihood estimators are listed below:

(i) $\hat{\beta} \sim N_{p+1}[\beta, \sigma^2(X'X)^{-1}]$;
(ii) $\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - p - 1)$;
(iii) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent;
(iv) if Y is $N_n(X\beta, \sigma^2 I)$, then $\hat{\beta}$ and $\hat{\sigma}^2$ are jointly sufficient for $\beta$ and $\sigma^2$; and
(v) if Y is $N_n(X\beta, \sigma^2 I)$, then $\hat{\beta}$ has minimum variance among all unbiased estimators.

2.4 Tests

In a regression model, we need to perform several tests, such as: (i) significance of the overall fit of the model involving p predictors, (ii) significance of each parameter, to test for significant association between each predictor and the outcome variable, and (iii) significance of a subset of parameters.

(i) Test for significance of the model

In the regression model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$, it is important to examine the hypothesis that none of the predictors $X_1, \ldots, X_p$ is linearly associated with the outcome variable Y, against the hypothesis that at least one of the predictors is linearly associated with the outcome variable. As the postulated model represents a hypothetical relationship between the population mean and the predictors,
$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, the contribution of the model can be tested from the regression sum of squares, which indicates the fit of the model for the conditional mean, compared to the error sum of squares, which measures the deviation of the observed values of the outcome variable from the postulated linear relationship of the predictors with the conditional mean. It may be noted here that the total sum of squares of the outcome variable can be partitioned into two components, for regression and error, as shown below:

$Y'Y = \hat{\beta}'X'Y + (Y - X\hat{\beta})'(Y - X\hat{\beta})$

where $\hat{\beta}'X'Y$ is the sum of squares of regression (SSR) and $(Y - X\hat{\beta})'(Y - X\hat{\beta})$ is the sum of squares of error (SSE). The coefficient of multiple determination, $R^2$, measures the extent or proportion of the linear relationship explained by the multiple linear regression model. This is the squared multiple correlation. The coefficient of multiple determination can be defined as

$R^2 = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \frac{\hat{\beta}'X'Y - n\bar{Y}^2}{Y'Y - n\bar{Y}^2} \quad (2.26)$

and the range of $R^2$ is $0 \le R^2 \le 1$, with 0 indicating that the model does not explain the variation at all and 1 indicating a perfect fit, i.e., 100% of the variation is explained by the model. The null and alternative hypotheses for the overall test of the model are

$H_0: \beta_1 = \cdots = \beta_p = 0$ and $H_1: \beta_j \ne 0$ for at least one j, j = 1,…,p.

Under the null hypothesis, the sum of squares of regression is $\chi^2_p\sigma^2$ and similarly the sum of squares of error is $\chi^2_{n-p-1}\sigma^2$. The test statistic is

$F = \frac{SSR/p}{SSE/(n - p - 1)} \sim F_{p,\,(n - p - 1)}. \quad (2.27)$

Rejection of the null hypothesis indicates that at least one of the variables in the postulated model contributes significantly in the overall or global test.

(ii) Test for the significance of parameters

Once we have determined that at least one of the predictors is significant, the next step is to identify the variables that have a statistically significant linear relationship with the outcome variable.
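The sum-of-squares partition, $R^2$, and the overall F statistic can be sketched numerically. This example uses simulated data with hypothetical coefficients (not the book's data) and also checks that the mean-corrected $R^2$ agrees with the form in (2.26):

```python
import numpy as np

# Sketch (simulated data, hypothetical coefficients): partition the
# mean-corrected total sum of squares into SSR and SSE and form
# F = (SSR/p) / (SSE/(n-p-1)) for the overall test.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([2.0, 1.0, 0.0, -1.0])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
ybar = Y.mean()
SSR = float(((Y_hat - ybar) ** 2).sum())   # mean-corrected regression SS
SSE = float(((Y - Y_hat) ** 2).sum())      # error SS
R2 = SSR / (SSR + SSE)
# The same R^2 via the book's form (2.26): (b'X'Y - n ybar^2) / (Y'Y - n ybar^2)
R2_alt = float((beta_hat @ X.T @ Y - n * ybar**2) / (Y @ Y - n * ybar**2))
F = (SSR / p) / (SSE / (n - p - 1))
print(round(R2, 3), round(F, 1))
```

With an intercept in the model, the two $R^2$ expressions are algebraically identical, since the fitted values then have the same mean as Y.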
Statistically, it is obvious that the inclusion of one or more variables in a regression model may increase the regression sum of squares and thus decrease the error sum of squares. However, it needs to be tested whether such inclusion is statistically significant or not. These tests will be elaborated in the next section in more detail. The first task is to examine each individual parameter separately to identify predictors with a statistically significant linear relationship with the outcome variable of interest. The null and alternative hypotheses for testing the significance of an individual parameter are

$H_0: \beta_j = 0$ and $H_1: \beta_j \ne 0.$

The test statistic is

$t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \quad (2.28)$

which follows a t distribution with $(n - p - 1)$ degrees of freedom. We know that $\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}\sigma^2$ and the estimate of the covariance matrix is $\widehat{\mathrm{Cov}}(\hat{\beta}) = (X'X)^{-1}s^2$, where $s^2$ is the unbiased estimator of $\sigma^2$. The standard error of $\hat{\beta}_j$ can be obtained from the corresponding diagonal element of $(X'X)^{-1}s^2$. Here, rejection of the null hypothesis implies a statistically significant linear relationship with the outcome variable.

(iii) Extra Sum of Squares Method

As mentioned in the previous section, the inclusion of a variable may increase SSR and correspondingly decrease SSE; it needs to be tested whether the increase in SSR is statistically significant or not. In addition, it is also possible to test whether inclusion or deletion of a subset of potential predictors results in any statistically significant change in the fit of the model. For this purpose, the extra sum of squares principle may be a very useful procedure. Let us consider a regression model $Y = X\beta + \epsilon$ where Y is $n \times 1$, X is $n \times k$, $\beta$ is $k \times 1$, and $k = p + 1$. If we partition $\beta$ as

$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_{r-1}, \beta_r, \ldots, \beta_p)'$, $\beta_1 = (\beta_0, \beta_1, \ldots, \beta_{r-1})'$, and $\beta_2 = (\beta_r, \ldots, \beta_p)'$.
We can express the partitioned regression model as

$Y = X_1\beta_1 + X_2\beta_2 + \epsilon \quad (2.29)$

where $X_1$ contains the columns of the design matrix corresponding to $\beta_1$ (the column of 1s and $X_1, \ldots, X_{r-1}$) and $X_2$ contains the columns corresponding to $\beta_2$ ($X_r, \ldots, X_p$). Let us consider this model as the full model; in other words, the full model comprises all the variables under consideration. We want to test whether some of the variables, or a subset of the variables, included in the full model contribute significantly or not. This subset may include one or more variables, and the corresponding coefficients or regression parameters are represented by the vector $\beta_2$. Hence, a test of whether $\beta_2 = 0$ is an appropriate null hypothesis here. This can be employed for a single parameter as a special case. The regression and error sums of squares from the full and reduced models are shown below.

Full model:
SSR (full model) = $\hat{\beta}'X'Y$
SSE (full model) = $Y'Y - \hat{\beta}'X'Y$

Reduced model (under the null hypothesis):
SSR (reduced model) = $\hat{\beta}_1'X_1'Y$
SSE (reduced model) = $Y'Y - \hat{\beta}_1'X_1'Y$

The difference between SSR (full model) and SSR (reduced model) shows the contribution of the variables $X_r, \ldots, X_p$, which can be expressed as

$SSR(\beta_2 \mid \beta_1) = \hat{\beta}'X'Y - \hat{\beta}_1'X_1'Y.$

This is the extra sum of squares attributable to the variables under the null hypothesis. The test statistic for $H_0: \beta_2 = 0$ is

$F = \frac{SSR(\beta_2 \mid \beta_1)/(k - r)}{s^2} \sim F_{(k - r),\,(n - k)} \quad (2.30)$

where $k - r$ is the number of parameters in $\beta_2$. Acceptance of the null hypothesis implies that there may not be any statistically significant contribution of the variables $X_r, \ldots, X_p$, and the reduced model under the null hypothesis is as good as the full model.

2.5 Example

A data set on a standardized fertility measure and socioeconomic indicators from Switzerland is used for application in this chapter. This data set is freely available from the 'datasets' package in R.
The full dataset and description are available for download from the Office of Population Research website (https://opr.princeton.edu/archive/pefp/switz.aspx). The following variables are available in the 'swiss' dataset from the datasets package. This data set includes indicators for each of 47 French-speaking provinces of Switzerland in 1888. The variables are:

Fertility: common standardized fertility measure
Agriculture: % of males involved in agriculture as occupation
Examination: % draftees receiving highest mark on army examination
Education: % education beyond primary school for draftees
Catholic: % 'Catholic' (as opposed to 'Protestant')
Infant Mortality: live births who live less than one year

The first example shows the fit of a simple regression model where the outcome variable is Y = common standardized fertility measure and X = percent education beyond primary school for draftees. The estimated model is

$\hat{Y} = 79.6101 - 0.8624X.$

Education appears to be negatively associated with the fertility measure in the French-speaking provinces (p-value < 0.001). Figure 2.2 displays the negative relationship (Fig. 2.2 Simple linear regression). Table 2.1 summarizes the results.

Table 2.1 Estimates and tests of parameters of a simple regression model

Variable    Estimate   Std. error   t-value   Pr(>|t|)
Constant    79.6101    2.1041       37.836    0.000
Education   -0.8624    0.1448       -5.954    0.000

Table 2.2 Estimates and tests of parameters of a multiple linear regression model

Variable           Estimate    Std. error   t-value   Pr(>|t|)
Constant           62.10131    9.60489      6.466     0.000
Agriculture        -0.15462    0.06819      -2.267    0.029
Education          -0.98026    0.14814      -6.617    0.000
Catholic           0.12467     0.02889      4.315     0.000
Infant Mortality   1.07844     0.38187      2.824     0.007

Using the same data source, an example of the fit of a multiple regression model is shown, and the results are summarized in Table 2.2.
For the same outcome variable, four explanatory variables are considered: percent males involved in agriculture as occupation $(X_1)$, education $(X_2)$, percent Catholic $(X_3)$, and infant mortality $(X_4)$. The estimated model for the outcome variable, fertility, is

$\hat{Y} = 62.10131 - 0.15462X_1 - 0.98026X_2 + 0.12467X_3 + 1.07844X_4.$

All the explanatory variables show a statistically significant linear relationship with fertility; agriculture and education are negatively related, while percent Catholic and infant mortality are positively related to the outcome variable. The fit of the overall model is statistically significant (F = 24.42, D.F. = 4 and 42, p-value < 0.001). About 70% ($R^2$ = 0.699) of the total variation is explained by the fitted model.

Chapter 3 Exponential Family of Distributions

The exponential family of distributions has an increasingly important role in statistics. The immediate purpose of a family or class of families is to examine the existence of sufficient statistics, and it is possible to link the families to the existence of minimum variance unbiased estimates. In addition to these important uses, exponential families of distributions are extensively employed in developing generalized linear models. Let Y be a random variable with probability density or mass function $f(y; \theta)$, where $\theta$ is a single parameter; then Y can be classified as belonging to the exponential family of distributions if the probability density or mass function can be expressed as follows:

$f(y; \theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]} \quad (3.1)$

where $a(y)$ and $d(y)$ are functions of y, and $b(\theta)$ and $c(\theta)$ are functions of the parameter $\theta$ only. We may express this function in the following form as well:

$f(y; \theta) = d_0(y)e^{[a(y)b(\theta) + c(\theta)]} \quad (3.2)$

where $d_0(y) = e^{d(y)}$. The joint pdf or pmf from (3.2) can be shown as follows for independently and identically distributed $Y_1, \ldots, Y_n$:

$f(y; \theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} e^{[a(y_i)b(\theta) + c(\theta) + d(y_i)]} = \prod_{i=1}^{n} d_0(y_i)e^{[a(y_i)b(\theta) + c(\theta)]} \quad (3.3)$

where $y' = (y_1, \ldots, y_n)$.
3.1 Exponential Family and Sufficiency

One of the major advantages of the exponential family is that we can find sufficient statistics readily from the expression. Let $f(y; \theta)$, where $y' = (y_1, \ldots, y_n)$, be the joint pdf or pmf of the sample; then $\sum_{i=1}^{n} a(y_i)$ is a sufficient statistic for $\theta$ if and only if there exist functions $g\left[\sum_{i=1}^{n} a(y_i) \mid \theta\right]$ and $h(y)$ such that for all sample and parameter points,

$f(y; \theta) = h(y) \cdot g\left[\sum_{i=1}^{n} a(y_i) \mid \theta\right]. \quad (3.4)$

It can be shown from (3.1) to (3.3) that

$L(\theta; y) = \prod_{i=1}^{n} d_0(y_i) \prod_{i=1}^{n} e^{[a(y_i)b(\theta) + c(\theta)]} = h(y) \cdot e^{b(\theta)\sum_{i=1}^{n} a(y_i) + nc(\theta)} \quad (3.5)$

where (3.5) is expressed in the factorized form of a sufficient statistic as displayed in (3.4). In other words, $\sum_{i=1}^{n} a(y_i)$ is a sufficient statistic for $\theta$. If we assume that y and x belong to the same class of the partition of the sample space for $Y_1, \ldots, Y_n$, which is satisfied if the ratio of likelihood functions, $L(\theta; y)/L(\theta; x)$, does not depend on $\theta$, then any statistic corresponding to the parameter is minimal sufficient. If $Y_1, \ldots, Y_n$ are independently and identically distributed, then the ratio of likelihood functions is

$\frac{L(\theta; y)}{L(\theta; x)} = \frac{h(y) \cdot e^{b(\theta)\sum_{i=1}^{n} a(y_i) + nc(\theta)}}{h(x) \cdot e^{b(\theta)\sum_{i=1}^{n} a(x_i) + nc(\theta)}}. \quad (3.6)$

It is clearly evident from (3.6) that the ratio is independent of $\theta$ only if $\sum_{i=1}^{n} a(y_i) = \sum_{i=1}^{n} a(x_i)$; then $\sum_{i=1}^{n} a(y_i)$ is a minimal sufficient statistic for $\theta$. It is noteworthy that if a minimum variance unbiased estimator exists, then there must be a function of the minimal sufficient statistic for the parameter which is a minimum variance unbiased estimator. If $Y \sim f(y; \theta)$, where $\theta = (\theta_1, \ldots$
.; hkÞ is a vector of k parameters belonging to the exponential family of distributions; then the probability distribution can be expressed as f ðy; hÞ ¼ e Pk j¼1 ajðyÞbjðhÞþ cðhÞþ dðyÞ � � ð3:7Þ 24 3 Exponential Family of Distributions where a1ðyÞ; . . .; akðyÞ and dðyÞ are functions of Y alone and b1ðhÞ; . . .; bkðhÞ and cðhÞ are functions of h alone. Then it can be shown thatPn i¼1 a1ðyiÞ; . . .; Pn i¼1 akðyiÞ are sufficient statistics for h1; . . .; hk respectively. Example 3.1 f y; n; pð Þ ¼ n y � � py 1� pð Þn�y ¼ e ln n y � � þ y ln p þ n�yð Þ ln 1�pð Þ � ¼ e y ln p1�p þ ln n y � � þ n ln 1�pð Þ � Here aðyÞ ¼ y bðhÞ ¼ ln p 1� p cðhÞ ¼ n ln 1� pð Þ dðyÞ ¼ ln n y � � : and it can be shown that Pn i¼1 yi is a sufficient statistic for h ¼ p. Example 3.2 Poisson Distribution f y; hð Þ ¼ e � hhy y! ¼ e y ln h � ln y!�hf g where aðyÞ ¼ y; bðhÞ ¼ ln h; cðhÞ ¼ � h; dðyÞ ¼ � ln y!: It can be shown that Pn i¼1 yi is a sufficient statistic for h. Example 3.3 Exponential f y; hð Þ ¼ he�hy ¼ e �h yþ ln hf g where aðyÞ ¼ y; bðhÞ ¼ �h; cðhÞ ¼ ln h; dðyÞ ¼ 0: For exponential distribution parameter, h, it can be shown that Pn i¼1 yi is a sufficient statistic. 3.1 Exponential Family and Sufficiency 25 Example 3.4 Normal Distribution with mean zero and variance r2 f y; 0; r2 � ¼ 1ffiffiffiffiffiffiffiffiffiffi 2pr2 p e�y2=2r2 ¼ e �y2=2r 2�12 ln 2pr2ð Þ � where aðyÞ ¼ y2 bðhÞ ¼ � 1 2r2 cðhÞ ¼ � 1 2 ln 2pr2 � dðyÞ ¼ 0: For h ¼ r2, the sufficient statistic is Pni¼1 y2i . Example 3.5 Normal Distribution with Mean l and Variance 1. f y; l; 1ð Þ ¼ 1ffiffiffiffiffiffi 2p p e�12 y�lð Þ2 ¼ e �12 y2�2lyþ l2ð Þ�12 lnð2pÞf g ¼ e yl�12y2�12 ln 2pð Þ�12l2f g where aðyÞ ¼ y; bðhÞ ¼ l; cðhÞ ¼ � 1 2 l2; dðyÞ ¼ � 1 2 y2 � 1 2 ln 2p: In this example, for h ¼ l, the sufficient statistic is Pni¼1 yi. 
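The minimal sufficiency argument around (3.6) can be illustrated numerically: for two i.i.d. Poisson samples with the same value of $\sum y_i$, the likelihood ratio is free of $\theta$. A short Python sketch (the sample values are arbitrary choices of ours, not from the text):

```python
import math

def poisson_loglik(sample, theta):
    # log-likelihood of an i.i.d. Poisson sample
    return sum(y * math.log(theta) - theta - math.lgamma(y + 1) for y in sample)

# two different samples with the same sufficient statistic: sum(y_i) = 12
y = [1, 4, 7]
x = [4, 4, 4]

# log of the likelihood ratio L(theta; y) / L(theta; x) at several theta values
log_ratios = [poisson_loglik(y, t) - poisson_loglik(x, t) for t in (0.5, 2.0, 9.0)]

# the ratio does not depend on theta, exactly as (3.6) requires
assert max(log_ratios) - min(log_ratios) < 1e-9
```

Because the samples share $\sum y_i$, the $\theta$-dependent terms $b(\theta)\sum y_i + nc(\theta)$ cancel in the ratio, leaving only the constant $h(y)/h(x)$.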
Example 3.6 Gamma Distribution

$$f(y; \theta) = \frac{\theta^r}{\Gamma(r)}\, y^{r-1} e^{-\theta y} = e^{-\theta y + (r-1)\ln y - \ln\Gamma(r) + r\ln\theta}$$

where

$$a(y) = y, \quad b(\theta) = -\theta, \quad c(\theta) = r\ln\theta, \quad d(y) = (r-1)\ln y - \ln\Gamma(r).$$

In this example, the sufficient statistic for $\theta$ is $\sum_{i=1}^n y_i$.

Example 3.7 Normal Distribution with mean $\mu$ and variance $\sigma^2$

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2} = e^{-\frac{1}{2\sigma^2}(y^2 - 2y\mu + \mu^2) - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{-\frac{1}{2\sigma^2}y^2 + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)}$$

where

$$a_1(y) = y, \quad a_2(y) = y^2, \quad b_1(\theta) = \frac{\mu}{\sigma^2}, \quad b_2(\theta) = -\frac{1}{2\sigma^2},$$
$$c(\theta) = -\frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(\sigma^2), \quad d(y) = -\frac{1}{2}\ln(2\pi).$$

In this example, the joint sufficient statistics for $\theta_1 = \mu$ and $\theta_2 = \sigma^2$ are $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n y_i^2$, respectively.

Example 3.8 Gamma Distribution (two parameter)

$$f(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\beta y} = e^{-\beta y + (\alpha-1)\ln y + \alpha\ln\beta - \ln\Gamma(\alpha)}$$

where

$$a_1(y) = \ln y, \quad a_2(y) = y, \quad b_1(\theta) = \alpha, \quad b_2(\theta) = -\beta,$$
$$c(\theta) = \alpha\ln\beta - \ln\Gamma(\alpha), \quad d(y) = -\ln y.$$

In this example, the joint sufficient statistics for $\theta_1 = \alpha$ and $\theta_2 = \beta$ are $\sum_{i=1}^n \ln y_i$ and $\sum_{i=1}^n y_i$, respectively.

3.2 Some Important Properties

The expected value and variance of $a(Y)$ can be obtained for the exponential family, assuming that the order of integration and differentiation can be interchanged.
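Before deriving these moments, the joint sufficiency in Example 3.7 can be illustrated numerically: two samples that agree on both $\sum y_i$ and $\sum y_i^2$ have identical normal likelihoods at every $(\mu, \sigma^2)$. A short Python sketch (the sample values are arbitrary choices of ours, constructed to share both statistics):

```python
import math

def normal_loglik(sample, mu, var):
    # log-likelihood of an i.i.d. N(mu, var) sample
    return sum(-(y - mu) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)
               for y in sample)

# two different samples sharing BOTH joint sufficient statistics:
# sum(y_i) = 3 and sum(y_i^2) = 5
y = [0.0, 1.0, 2.0]
x = [1.5, (1.5 + math.sqrt(3.25)) / 2, (1.5 - math.sqrt(3.25)) / 2]

assert abs(sum(y) - sum(x)) < 1e-12
assert abs(sum(v * v for v in y) - sum(v * v for v in x)) < 1e-12

# identical likelihoods for every (mu, sigma^2), as joint sufficiency implies
for mu, var in [(0.0, 1.0), (1.0, 0.5), (-2.0, 4.0)]:
    assert abs(normal_loglik(y, mu, var) - normal_loglik(x, mu, var)) < 1e-9
```

The normal log-likelihood depends on the data only through $n$, $\sum y_i$, and $\sum y_i^2$, so matching those quantities forces the likelihood functions to coincide.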
We know that the exponential family is represented by

$$f(y; \theta) = e^{a(y)b(\theta) + c(\theta) + d(y)}$$

and after differentiating with respect to the parameter we obtain

$$\frac{df(y; \theta)}{d\theta} = [a(y)b'(\theta) + c'(\theta)]f(y; \theta),$$

and interchanging differentiation and integration in the following expression, it can be shown that

$$\int \frac{df(y; \theta)}{d\theta}\, dy = \int [a(y)b'(\theta) + c'(\theta)]f(y; \theta)\, dy = 0. \qquad (3.8)$$

It follows directly from (3.8) that

$$b'(\theta)E[a(Y)] + c'(\theta) = 0. \qquad (3.9)$$

Hence, the expected value can be obtained from the following equation:

$$E[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}.$$

It can be shown, using the same regularity assumptions, that the variance is

$$\mathrm{Var}[a(Y)] = \frac{b''(\theta)c'(\theta)/b'(\theta) - c''(\theta)}{[b'(\theta)]^2} = \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3}.$$

The log likelihood function for an exponential family of distributions is

$$l(\theta; y) = a(y)b(\theta) + c(\theta) + d(y)$$

and the score statistic is

$$U(\theta; y) = \frac{dl(\theta; y)}{d\theta} = a(y)b'(\theta) + c'(\theta).$$

It can be shown that

$$E(U) = b'(\theta)\left[-\frac{c'(\theta)}{b'(\theta)}\right] + c'(\theta) = 0$$

and

$$I = \mathrm{Var}(U) = [b'(\theta)]^2\,\mathrm{Var}[a(Y)] = \frac{b''(\theta)c'(\theta)}{b'(\theta)} - c''(\theta).$$

Another important property of $U$ is

$$\mathrm{Var}(U) = E(U^2) = -E(U').$$

Example 3.9 Binomial Distribution

It has been shown from the exponential family form that

$$a(y) = y, \quad b(\theta) = \ln\frac{p}{1-p}, \quad c(\theta) = n\ln(1-p), \quad d(y) = \ln\binom{n}{y}.$$

Hence,

$$E(Y) = -\frac{c'(\theta)}{b'(\theta)} = -\frac{-\frac{n}{1-p}}{\frac{1}{p} + \frac{1}{1-p}} = np$$

$$\mathrm{Var}(Y) = \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3} = np(1-p).$$

Example 3.10 Poisson Distribution

$$P(y; \theta) = \frac{e^{-\theta}\theta^y}{y!} = e^{y\ln\theta - \theta - \ln y!}$$

Hence, in the exponential family notation,

$$a(y) = y, \quad b(\theta) = \ln\theta, \quad c(\theta) = -\theta, \quad d(y) = -\ln y!.$$
The expected value and variance of $Y$ are

$$E(Y) = -\frac{-1}{1/\theta} = \theta$$

$$\mathrm{Var}(Y) = \frac{(-1/\theta^2)(-1) - (0)(1/\theta)}{[1/\theta]^3} = \theta.$$

Example 3.11 Exponential Distribution

$$f(y; \theta) = \theta e^{-\theta y} = e^{-\theta y + \ln\theta}.$$

In the exponential family notation, $a(y) = y$, $b(\theta) = -\theta$, $c(\theta) = \ln\theta$, $d(y) = 0$. For the exponential distribution, the expected value and variance are

$$E(Y) = -\frac{1/\theta}{-1} = \frac{1}{\theta}$$

$$\mathrm{Var}(Y) = \frac{(0)(1/\theta) - (-1/\theta^2)(-1)}{[-1]^3} = \frac{1}{\theta^2}.$$

Example 3.12 Normal Distribution with mean $\mu$ and variance 1

$$f(y; \mu, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-(y-\mu)^2/2} = e^{y\mu - \frac{1}{2}\mu^2 - \frac{1}{2}\ln(2\pi) - \frac{1}{2}y^2}$$

Using the exponential family form, it is shown that $a(y) = y$, $b(\theta) = \mu$, $c(\theta) = -\frac{1}{2}\mu^2$, $d(y) = -\frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi)$. The expected value and variance can be obtained from the exponential form as follows:

$$E(Y) = -\frac{-\mu}{1} = \mu$$

$$\mathrm{Var}(Y) = \frac{(0)(-\mu) - (-1)(1)}{[1]^3} = 1.$$

Chapter 4
Generalized Linear Models

4.1 Introduction

Since the seminal work of Nelder and Wedderburn (1972) and the publication of the book by McCullagh and Nelder (1983), the concept of Generalized Linear Models (GLMs) has played an increasingly important role in statistical theory and applications. We have presented linear regression models in Chap. 2 and the exponential family of distributions in Chap. 3. A class of models that generalizes the linear model to both normal and nonnormal outcomes, or to both discrete and continuous outcomes, when the probability distribution of the outcome variable belongs to the exponential family of distributions, is classified under the broad name of generalized linear models. The linear regression models presented in Chap. 2 can be shown to be a special case of the GLM. In regression modeling, linear or nonlinear, the assumption on the outcome variable is essentially one of normality, but in a wide range of situations this assumption is quite unrealistic.
An obvious example is a binary response expressing the presence or absence of a disease, where the outcome variable follows a Bernoulli distribution. Another example is the number of accidents during a specified interval of time, which provides count data that follow a Poisson distribution. If we are interested in an event such as the first success in a series of trials after successive failures, the distribution is geometric; this can be applied to analyze the incidence of a disease from follow-up data. Similarly, if the event is defined as attaining a fixed number of successes in a series of trials, such as securing a certain number of wins in a football league competition to qualify for the next round, then the outcome variable may follow a negative binomial distribution. In the case of continuous outcome variables, it is not very common in practice to find outcomes that follow a normal distribution. In lifetime data for analyzing reliability or survival, the distributions are highly skewed and normality assumptions cannot be used. Hence, for nonnormal distributions such as the exponential or gamma, the linear regression models are not directly applicable. To address this wide variety of situations where the normality assumption cannot be adopted for linear modeling, the GLM provides a general framework to link the underlying random and systematic components.

4.2 Exponential Family and GLM

For generalized linear models, it is assumed that the distribution of the outcome variable can be represented in the form of the exponential family of distributions.
Let $Y$ be a random variable with probability density or mass function $f(y; \theta)$, where $\theta$ is a single parameter. Then $Y$ belongs to the exponential family of distributions if the probability density or mass function can be expressed as shown in (3.1):

$$f(y; \theta) = e^{a(y)b(\theta) + c(\theta) + d(y)}$$

where $a(y)$ and $d(y)$ are functions of $y$, and $b(\theta)$ and $c(\theta)$ are functions of the parameter $\theta$ only. If $a(y) = y$ and $b(\theta) = \theta$, then $\theta$ is called a natural parameter. Then (3.1) can be expressed in a different form convenient for the GLM:

$$f(y; \theta) = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)} \qquad (4.1)$$

where $b(\theta)$ is a new function of $\theta$, $a(\phi)$ is a function of the dispersion parameter $\phi$, and $c(y, \phi)$ is a function of $y$ and $\phi$.

Some Examples

Example 4.1 Binomial

$$f(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y} = e^{\ln\binom{n}{y} + y\ln p + (n-y)\ln(1-p)} = e^{\frac{y\ln\frac{p}{1-p} - (-n\ln(1-p))}{1} + \ln\binom{n}{y}}$$

Here

$$\theta = \ln\frac{p}{1-p}, \quad b(\theta) = -n\ln(1-p), \quad c(y, \phi) = \ln\binom{n}{y}, \quad a(\phi) = 1.$$

Example 4.2 Poisson

$$f(y; \lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = e^{y\ln\lambda - \lambda - \ln y!} = e^{\frac{y\ln\lambda - \lambda}{1} - \ln y!}$$

where $\theta = \ln\lambda$, $b(\theta) = \lambda$, $a(\phi) = 1$, $c(y, \phi) = -\ln y!$.

Example 4.3 Exponential

$$f(y; \lambda) = \lambda e^{-\lambda y} = e^{-\lambda y + \ln\lambda} = e^{\frac{y(-\lambda) - (-\ln\lambda)}{1}}$$

where $\theta = -\lambda$, $b(\theta) = -\ln\lambda$, $a(\phi) = 1$, $c(y, \phi) = 0$.

Example 4.4 Normal Distribution with mean zero and variance $\sigma^2$

$$f(y; 0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-y^2/2\sigma^2} = e^{-y^2/2\sigma^2 - \frac{1}{2}\ln(2\pi\sigma^2)}$$

There is no natural parameter in this case.

Example 4.5 Normal Distribution with mean $\mu$ and variance 1
$Y \sim N(\mu, 1)$:

$$f(y; \mu, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-\mu)^2} = e^{-\frac{1}{2}(y^2 - 2\mu y + \mu^2) - \frac{1}{2}\ln(2\pi)} = e^{y\mu - \frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi) - \frac{1}{2}\mu^2} = e^{\frac{y\mu - \frac{1}{2}\mu^2}{1} + \left(-\frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi)\right)}$$

where

$$\theta = \mu, \quad b(\theta) = \mu^2/2, \quad a(\phi) = 1, \quad c(y, \phi) = -\frac{1}{2}\left[y^2 + \ln(2\pi)\right].$$

Example 4.6 Gamma

$$f(y; \lambda) = \frac{\lambda^r}{\Gamma(r)}\, y^{r-1} e^{-\lambda y} = e^{-\lambda y + (r-1)\ln y - \ln\Gamma(r) + r\ln\lambda} = e^{\frac{-\lambda y + r\ln\lambda}{1} + (r-1)\ln y - \ln\Gamma(r)}$$

where

$$\theta = -\lambda, \quad b(\theta) = -r\ln\lambda, \quad a(\phi) = 1, \quad c(y, \phi) = (r-1)\ln y - \ln\Gamma(r).$$

Example 4.7 $Y \sim N(\mu, \sigma^2)$

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2} = e^{-\frac{1}{2\sigma^2}(y^2 - 2y\mu + \mu^2) - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{1}{2\sigma^2}y^2 - \frac{1}{2}\ln(2\pi\sigma^2)}$$

where

$$\theta = \mu, \quad b(\theta) = \mu^2/2, \quad a(\phi) = \sigma^2, \quad c(y, \phi) = -\frac{1}{2\sigma^2}y^2 - \frac{1}{2}\ln(2\pi\sigma^2).$$

4.3 Expected Value and Variance

The expected value and variance of $Y$ can be obtained from (4.1), assuming that the order of integration (summation in the case of a discrete variable) and differentiation can be interchanged. Differentiating $f(y; \theta)$ with respect to $\theta$, we obtain

$$\frac{df(y; \theta)}{d\theta} = \frac{1}{a(\phi)}[y - b'(\theta)]f(y; \theta)$$

and interchanging differentiation and integration in the following