M. Ataharul Islam · Rafiqul I. Chowdhury

Analysis of Repeated Measures Data

M. Ataharul Islam
Institute of Statistical Research and Training (ISRT)
University of Dhaka
Dhaka, Bangladesh

Rafiqul I. Chowdhury
Institute of Statistical Research and Training (ISRT)
University of Dhaka
Dhaka, Bangladesh

ISBN 978-981-10-3793-1        ISBN 978-981-10-3794-8 (eBook)
DOI 10.1007/978-981-10-3794-8
Library of Congress Control Number: 2017939538

© Springer Nature Singapore Pte Ltd. 2017

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Printed on acid-free paper

This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore

Preface

During the past four decades, we have observed a steady increase in the use of repeated measures data. As the type of data in repeated measures can be discrete or continuous, quantitative or qualitative, there has been an increasing demand for models not only for normally distributed variables observed repeatedly over time but also for non-normal variables, where classical regression models are clearly inadequate or fail to address the objectives of studies conducted in various fields. There are well-documented developments in the analysis of repeated measures data under the normality assumption; however, the literature and textbooks are grossly inadequate for analyzing repeated measures data for non-normal variables. Since the introduction of the generalized linear model, the scope for generalizing regression models for non-normal data, in addition to data approximately based on the normality assumption, has widened to a great extent. This book presents a broad range of statistical techniques to address the emerging needs in the field of repeated measures.

The demand for statistical models for correlated outcomes grew rapidly during the recent past, mainly attributable to two types of underlying associations: (i) association between outcomes and (ii) association between explanatory variables and outcomes. In real-life situations, repeated measures data are currently available from various sources. This book provides a systematic treatment of the problems in modeling repeated measures data for estimating the underlying relationships between covariates and outcome variables for correlated data. In other words, this book is prepared to fulfill a long-standing demand for addressing repeated measures data analysis in real-life situations with models applicable to a wide range of correlated outcome variables.
This book starts with background chapters on the linear model, the exponential family of distributions, and generalized linear models. Throughout the book, except for Chap. 15, the concepts of generalized linear models have been used, with extensions wherever necessary. The developments in repeated measures data analysis can be categorized under three broad types: marginal models, conditional models, and joint models. In this book, we have included models belonging to all these types, and examples are given to illustrate the estimation and test procedures. In Chap. 5, covariate-dependent Markov models are introduced for first or higher orders. This book provides developments on modeling bivariate binary data in Chap. 6. On many occasions, researchers need conditional or joint models for analyzing correlated binary outcomes. Tests for dependence are also necessary to develop a modeling strategy for analyzing these data. These problems are discussed with applications in Chap. 6.

In modeling repeated measures data, the use of geometric models is very scanty. The problems associated with characterization are available in the literature, but bivariate geometric models with covariate dependence are scarce. It is noteworthy, however, that bivariate geometric models can be very useful in various fields where the incidence or first-time occurrence of two events, such as two diseases, is of interest. For understanding the risk factors associated with the incidence of two diseases or two complications, bivariate geometric models can provide deeper insight into the underlying mechanism.

The bivariate count models are useful in various disciplines such as economics, public health, epidemiology, environmental studies, reliability, and actuarial science. The count models are introduced in Chaps. 8 and 9, which include bivariate Poisson, bivariate double Poisson, bivariate negative binomial, and bivariate multinomial models.
The bivariate Poisson models are introduced for truncated data too. The under- or overdispersion problems are discussed, and test procedures are shown with examples. In reliability and other lifetime data analysis, the bivariate exponential models are very useful. In Chap. 10, an extended GLM is employed and a test for dependence is illustrated.

In repeated measures, the extended GLM approaches such as generalized estimating equations and generalized linear mixed models play very important roles. It is noteworthy that the use of quasi-likelihood methods created opportunities for exploring models when distributional assumptions are difficult to attain but the variance can be expressed as a function of the mean. In Chaps. 11–13, quasi-likelihood, generalized estimating equations, and generalized linear mixed models are discussed. Generalized multivariate models obtained by extending the concepts of GLM are shown in Chap. 14. This chapter includes simple ways to generalize the models for repeated measures data for two or more correlated outcome variables with covariate dependence.

In this book, the semi-parametric hazards models are also highlighted, which are being used extensively for analyzing failure time data arising from longitudinal studies that produce repeated measures. Multistate and multistage models, effective for analyzing repeated measures data, are illustrated for both graduate students and researchers. The problem of analyzing repeated measures data for failure time in the competing risk framework is included, which appears to have an increasingly important role in the fields of survival analysis, reliability, and actuarial science. For analyzing lifetime data, extended proportional hazards models such as multistate and multistage models with transitions, reverse transitions, and repeated transitions over time are introduced with applications in Chap. 15.
In many instances, the techniques for repeated measures data cannot be explored conveniently due to lack of appropriate software support. In Chap. 16, newly developed R packages and functions, along with the use of existing R packages, SAS codes, and macro/IML, are shown.

This book aims to provide important guidelines for both researchers and graduate students in the fields of statistics and applied statistics, biomedical sciences, epidemiology, reliability, survival analysis, econometrics, environment, social science, actuarial science, etc. Both theory and applications are presented in detail to make the book user-friendly. This book includes necessary illustrations and software usage outlines. In addition to researchers, graduate students and other users of statistical techniques for analyzing repeated measures data will benefit from this book. The potential users will find it a comprehensive reference book, essential for addressing challenges in analyzing repeated measures data with a deeper understanding of the nature of underlying relationships among outcome and explanatory variables in the presence of dependence among outcome variables.

We are grateful to our colleagues and students at the University of Dhaka, Universiti Sains Malaysia, King Saud University, and East West University. The idea of writing this book stemmed from teaching and supervising research students on repeated measures data analysis for many years. We want to thank Shahariar Huda for his continued support of our work. We extend our deepest gratitude to Amiya Atahar for her unconditional help during the final stage of writing this book. Further, we gratefully acknowledge the continued support from Tahmina Khatun, Farida Yeasmeen, and Jayati Atahar. We extend our deep gratitude to the University Grants Commission, Bangladesh and the World Bank for supporting the Higher Education Quality Enhancement Sub-project 3293 on repeated measures.
We are grateful to Rosihan M. Ali, Adam Baharum, V. Ravichandran, A.A. Kamil, Jahida Gulshan, O.I. Idais, and A.E. Tabl for their support. We are also indebted to Farzana Jahan, M. Aminul Islam and Mahfuza Begum for their support at different stages of writing this book.

Dhaka, Bangladesh
M. Ataharul Islam
Rafiqul I. Chowdhury

Contents

1 Introduction
2 Linear Models
  2.1 Simple Linear Regression Model
  2.2 Multiple Regression Model
  2.3 Estimation of Parameters
    2.3.1 Method of Least Squares
    2.3.2 Maximum Likelihood Estimation
  2.4 Tests
  2.5 Example
3 Exponential Family of Distributions
  3.1 Exponential Family and Sufficiency
  3.2 Some Important Properties
4 Generalized Linear Models
  4.1 Introduction
  4.2 Exponential Family and GLM
  4.3 Expected Value and Variance
  4.4 Components of a GLM
  4.5 Multinomial Response Model
  4.6 Estimating Equations
  4.7 Deviance
  4.8 Examples
5 Covariate-Dependent Markov Models
  5.1 Introduction
  5.2 First Order Markov Model
  5.3 Conditional Model for Second Order Markov Chain with Covariate Dependence
  5.4 Covariate Dependent Model for Markov Chain of Order r
  5.5 Tests for the Model
  5.6 Examples
6 Modeling Bivariate Binary Data
  6.1 Introduction
  6.2 Bivariate Bernoulli Distribution
  6.3 Bivariate Binary Model with Covariate Dependence
    6.3.1 Covariate-Dependent Model
    6.3.2 Likelihood Function and Estimating Equations
  6.4 Test for Dependence in Bivariate Binary Outcomes
    6.4.1 Measure of Dependence
    6.4.2 Test for the Model
    6.4.3 Test for Dependence
  6.5 Generalized Bivariate Bernoulli Model
    6.5.1 The Bivariate Bernoulli Model
    6.5.2 Estimating Equations
    6.5.3 Tests
  6.6 Some Alternative Binary Repeated Measures Models
  6.7 Examples
7 Bivariate Geometric Model
  7.1 Introduction
  7.2 Univariate Geometric Distribution
  7.3 Bivariate Geometric Distribution: Marginal and Conditional Models
  7.4 Bivariate Geometric Distribution: Joint Model
  7.5 Examples
8 Models for Bivariate Count Data: Bivariate Poisson Distribution
  8.1 Introduction
  8.2 The Poisson–Poisson Distribution
  8.3 Bivariate GLM for Poisson–Poisson
    8.3.1 Model and Estimation
    8.3.2 Overdispersion in Count Data
    8.3.3 Tests for Goodness of Fit
    8.3.4 Simple Tests for Overdispersion With or Without Covariate Dependence
  8.4 Zero-Truncated Bivariate Poisson
    8.4.1 Zero-Truncated Poisson Distribution
    8.4.2 A Generalized Zero-Truncated BVP Linear Model
    8.4.3 Test for the Model
    8.4.4 Deviance and Goodness of Fit
  8.5 Right-Truncated Bivariate Poisson Model
    8.5.1 Bivariate Right-Truncated Poisson–Poisson Model
    8.5.2 Predicted Probabilities
    8.5.3 Test for Goodness of Fit
  8.6 Double Poisson Distribution
    8.6.1 Double Poisson Model
    8.6.2 Bivariate Double Poisson Model
  8.7 Applications
9 Bivariate Negative Binomial and Multinomial Models
  9.1 Introduction
  9.2 Review of GLM for Multinomial
  9.3 Bivariate Multinomial
  9.4 Tests for Comparison of Models
  9.5 Negative Multinomial Distribution and Bivariate GLM
    9.5.1 GLM for Negative Multinomial
  9.6 Application of Negative Multinomial Model
10 Bivariate Exponential Model
  10.1 Introduction
  10.2 Bivariate Exponential Distributions
  10.3 Bivariate Exponential Generalized Linear Model
  10.4 Bivariate Exponential GLM Proposed by Iwasaki and Tsubaki
  10.5 Example
11 Quasi-Likelihood Methods
  11.1 Introduction
  11.2 Likelihood Function and GLM
  11.3 Quasi-likelihood Functions
  11.4 Estimation of Parameters
  11.5 Examples
12 Generalized Estimating Equation
  12.1 Introduction
  12.2 Background
  12.3 Estimation of Parameters
  12.4 Steps in a GEE: Estimation and Test
  12.5 Examples
13 Generalized Linear Mixed Models
  13.1 Introduction
  13.2 Generalized Linear Mixed Model
  13.3 Identity Link Function
  13.4 Logit Link Function
  13.5 Log Link Function
  13.6 Multinomial Data
  13.7 Examples
14 Generalized Multivariate Models
  14.1 Introduction
  14.2 Multivariate Poisson Distribution
  14.3 Multivariate Negative Binomial Distribution
  14.4 Multivariate Geometric Distribution
  14.5 Multivariate Normal Distribution
  14.6 Examples
15 Multistate and Multistage Models
  15.1 Introduction
  15.2 Some Basic Concepts
  15.3 Censoring: Construction of Likelihood Function
  15.4 Proportional Hazards Model
  15.5 Competing Risk Proportional Hazards Model
  15.6 Multistate Hazards Model
  15.7 Multistage Hazards Model
  15.8 Examples
16 Analysing Data Using R and SAS
  16.1 Description
References
Subject Index
About the Authors

M. Ataharul Islam is currently QM Husain Professor at the Institute of Statistical Research and Training (ISRT), University of Dhaka, Bangladesh. He was a Professor of Statistics at the Universiti Sains Malaysia, King Saud University, East West University, and the University of Dhaka. He served as a visiting faculty member at the University of Hawaii and the University of Pennsylvania. He is a recipient of the Pauline Stitt Award, the Western North American Region (WNAR) Biometric Society Award for content and writing, the University Grants Commission Award for book and research, and the Ibrahim Memorial Gold Medal for research. He has published more than 100 papers in international journals on various topics, mainly on longitudinal and repeated measures data, including multistate and multistage hazards models, statistical modeling, Markov models with covariate dependence, generalized linear models, and conditional and joint models for correlated outcomes. He authored a book on Markov models, jointly edited another book, and contributed chapters to several books.

Rafiqul I. Chowdhury, a former senior lecturer at the Department of Health Information Administration, Kuwait University, Kuwait, has been widely involved in various research projects as a research collaborator and consultant. He has extensive experience in statistical computing with large data sets, especially with repeated measures data. He has published more than 60 papers in international journals on statistical computing, repeated measures data, and utilization of healthcare services, among others, and has presented papers at various conferences. He co-authored a book on Markov models and wrote programs and developed packages for marginal, conditional and joint models, including multistate Markov and hazards models and bivariate generalized linear models on Poisson, geometric, and Bernoulli distributions, using SAS and R.

List of Figures

Fig. 2.1 Population Regression Model
Fig. 2.2 Simple Linear Regression
Fig. 15.1 States and transition for a simple proportional hazards model
Fig. 15.2 Example of a multistate model
Fig. 15.3 Example of a multistage model for maternal morbidity
Fig. 15.4 States and Transitions in a Simplified Multistage Model

List of Tables

Table 1.1 Status of disease at different follow-up times (Yij)
Table 1.2 Occurrence of diabetes and heart problem by subjects and waves
Table 2.1 Estimates and tests of parameters of a simple regression model
Table 2.2 Estimates and tests of parameters of a multiple linear regression model
Table 4.1 Estimation of parameters of GLM using identity link function
Table 4.2 Estimates of parameters of GLM for binary outcomes on depression
Table 4.3 Distribution of number of conditions
Table 4.4 Estimates of parameters of GLM using log link function for number of conditions
Table 4.5 Negative binomial GLM of number of conditions
Table 5.1 Frequency of depression in four waves
Table 5.2 Transition counts and transition probabilities for first-order Markov model
Table 5.3 Estimates for first-order Markov model
Table 5.4 Transition counts and transition probabilities for second-order Markov model
Table 5.5 Estimates for second-order Markov model
Table 5.6 Transition counts and transition probabilities for third-order Markov model
Table 5.7 Estimates for third-order Markov model
Table 5.8 Test for the order of Markov model
Table 6.1 Bivariate probabilities for two outcome variables, Y1 and Y2
Table 6.2 Transition count and probability for Y1 and Y2
Table 6.3 Estimates for two conditionals and one marginal model
Table 6.4 Observed and predicted counts from the bivariate distribution
Table 7.1 Frequency of incidence of diabetes followed by stroke
Table 7.2 Estimates of the parameters of Model 1
Table 7.3 Estimates of parameters of Model 2
Table 8.1 Bivariate distribution of outcome variables
Table 8.2 Fit of bivariate Poisson model (marginal/conditional), both unadjusted and adjusted for over- or underdispersion
Table 8.3 Right-truncated bivariate Poisson model (marginal/conditional)
Table 8.4 Zero-truncated bivariate Poisson model (marginal/conditional)
Table 8.5 Estimates of parameters of bivariate double Poisson model (Model 2)
Table 9.1 Estimates of parameters of bivariate negative binomial model using marginal–conditional approach
Table 9.2 Estimates of the parameters of bivariate negative binomial model (joint model)
Table 10.1 Distribution of diabetes and heart problems in different waves
Table 10.2 Estimates of bivariate exponential full model
Table 10.3 Likelihood ratio tests for overall model and association parameters
Table 11.1 Estimated parameters and tests for number of conditions using quasi-likelihood method
Table 11.2 Estimated parameters and tests for counts of healthcare services utilizations using quasi-likelihood method
Table 12.1 GEE for various correlation structures
Table 12.2 ALR with different correlation structures
Table 13.1 Generalized linear mixed model with random intercept for binary responses on depression status from the HRS data
Table 13.2 Random effect estimates for selected subjects
Table 13.3 Predicted probabilities for selected subjects
Table 13.4 Healthcare services utilization by waves
Table 13.5 Generalized linear mixed model for log link function for healthcare services utilization with random intercepts
Table 14.1 Estimates of the parameters of multivariate Poisson model
Table 15.1 Number of different types of transitions
Table 15.2 Estimates from multistate hazards model for depression data
Table 15.3 Test for proportionality for different transitions
Table 15.4 Estimates from multistage hazards model for complications in three stages
Table 15.5 Test for proportionality for different transitions during antenatal, delivery, and postnatal stages
Table 15.6 Estimates from multistage hazards model for Model II

Chapter 1
Introduction

The field of repeated measures has been growing very rapidly, mainly due to the increasing demand for statistical techniques for analyzing repeated measures data in various disciplines such as biomedical sciences, epidemiology, reliability, econometrics, environment, social science, etc. Repeated measures data may comprise either responses from each subject/experimental unit longitudinally at multiple occasions or responses under multiple conditions. The responses may be qualitative (categorical) or quantitative (discrete or continuous). The analysis of repeated measures data becomes complex due to the presence of two types of associations: one is the association between response and explanatory variables, and the other is the association between outcome variables.

Repeated measures data from longitudinal studies are collected over time on each study participant or experimental unit. The changes in both outcome variables and factors associated with changes in outcome variables within individuals may provide useful insights. In addition, relationships between outcome variables, as well as between outcome variables observed at different times and covariates, can be studied thoroughly if we have repeated data on the same individuals or experimental units.
The study of change in the observed outcome status of participants provides important in-depth insights into the dynamics of the underlying relationships between the outcome status of participants and their characteristics represented by covariates, in the presence of dependence in outcomes. For analyzing multivariate data from repeated measures, the type of association between outcome variables due to repeated occurrence of events from the same participants is of great concern. In other words, the nature of correlation within subjects needs to be taken into account.

Two data layout designs are displayed in Tables 1.1 and 1.2. In the first layout design, each of the 5 subjects is followed up for 4 time points, and the status of a disease, such as whether diabetes is controlled or uncontrolled at each time point, is recorded. Let us denote Yij = 1 if diabetes is uncontrolled for the ith individual at the jth follow-up, and Yij = 0 otherwise; i = 1, …, 5; j = 1, …, 4. The number of follow-ups for subjects can be equal (balanced) or unequal (unbalanced).

Table 1.2 shows a dummy table for the occurrence of diabetes and heart problem observed repeatedly over 11 time points (waves) specified by equal intervals.
Let us denote $Y_{1ij} = 1$ if diabetes is reported for the ith individual at the jth follow-up, and $Y_{1ij} = 0$ otherwise; $Y_{2ij} = 1$ if heart problem is reported for the ith individual at the jth follow-up, and $Y_{2ij} = 0$ otherwise; a 9 denotes a missing value; k = 1, 2; i = 1,…,5; j = 1,…,11.

Table 1.1 Status of disease at different follow-up times ($Y_{ij}$)

Subject (i)   T1   T2   T3   T4
1              0    0    1    1
2              1    1    0    1
3              0    1    1    0
4              0    0    0    0
5              1    1    1    1

Table 1.2 Occurrence of diabetes and heart problem by subjects and waves; each cell shows ($Y_{1ij}$, $Y_{2ij}$)

              Wave 1   Wave 2   Wave 3   Wave 4
Subject 1     0, 1     0, 0     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 0     0, 0
Subject 3     0, 0     0, 0     0, 0     0, 0
Subject 4     0, 0     0, 0     0, 0     0, 0
Subject 5     1, 0     0, 0     1, 0     1, 0

              Wave 5   Wave 6   Wave 7   Wave 8
Subject 1     0, 1     0, 1     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 0     0, 0
Subject 3     0, 0     0, 0     0, 0     0, 0
Subject 4     1, 0     1, 0     1, 0     1, 0
Subject 5     1, 0     1, 1     1, 1     1, 1

              Wave 9   Wave 10  Wave 11
Subject 1     0, 1     0, 1     0, 1
Subject 2     0, 0     0, 0     0, 1
Subject 3     0, 0     0, 0     0, 0
Subject 4     1, 0     1, 0     1, 0
Subject 5     1, 9     9, 9     9, 9

Dependence among outcomes is a common feature of repeated measures data. Hence, a systematic approach to dealing with correlated outcomes, along with their relationship with covariates, is the foremost challenge in analyzing repeated measures data. If the outcome variables were independent, the modeling of the relationship between explanatory and outcome variables would reduce to marginal models, but this may not reflect the reality of repeated measures because the data are obtained from each subject/experimental unit at multiple occasions or under multiple conditions. In that case, dependence among outcome variables may hardly satisfy the underlying conditions for a marginal model. In other words, marginal models may provide misleading results in analyzing repeated measures data due to the exclusion of the correlation among outcome variables from the models.
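The layout of Table 1.1 can be generated in code. The following minimal sketch (ours, not the book's; the transition probabilities `p01`, `p11` and initial probability `p_init` are hypothetical) simulates a balanced binary layout in which each subject's current status depends on the previous one, i.e., with first-order within-subject dependence:

```python
import numpy as np

# Illustrative sketch (not from the book): simulate a balanced 5 x 4
# binary layout like Table 1.1 in which each subject's current status
# depends on the previous one (first-order Markov dependence).
# p01, p11, p_init are hypothetical transition/initial probabilities.
rng = np.random.default_rng(42)

def simulate_layout(n_subjects=5, n_times=4, p01=0.3, p11=0.7, p_init=0.4):
    """Return an n_subjects x n_times array of 0/1 outcomes Y[i, j]."""
    Y = np.zeros((n_subjects, n_times), dtype=int)
    Y[:, 0] = rng.random(n_subjects) < p_init
    for j in range(1, n_times):
        # P(Y_ij = 1 | Y_i,j-1) is p11 if the previous status was 1, else p01
        p = np.where(Y[:, j - 1] == 1, p11, p01)
        Y[:, j] = rng.random(n_subjects) < p
    return Y

Y = simulate_layout()
print(Y)
```

An unbalanced design could be simulated the same way by drawing a different number of follow-ups for each subject.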
An alternative to the marginal models is to employ conditional models, such as models based on Markovian assumptions, where a model is constructed for an outcome variable at the current time given the value of the outcome observed previously. The order of the Markov chain may vary depending on the underlying nature of transitions over time. Since the development of the generalized linear model (GLM), there is scope to generalize linear models for different types of outcome or response variables (normal or nonnormal, discrete or continuous, qualitative) that belong to the exponential family of distributions, using different link functions. The exponential family form

$f(y; \theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]}$

provides the minimal sufficient statistic. The following alternative expression for the exponential family of distributions

$f(y; \theta) = e^{\left\{\frac{y\theta - b(\theta)}{a(\phi)}\right\} + c(y, \phi)}$

can be used to identify the canonical parameter, and the link between the random and systematic components can be specified. There has been extensive work on the univariate GLM, but only some isolated efforts have been made to extend the usefulness of generalized linear models to dependent outcomes generated from repeated measures data. Some generalizations are available for bivariate binary and count data, and it is noteworthy that both bivariate Bernoulli and count models have a wide range of applications in various fields. An example of a bivariate model for the binary outcome variables $Y_1$ and $Y_2$ can be expressed in the following form:

$P(y_1, y_2) = P_{00}^{(1-y_1)(1-y_2)} \, P_{01}^{(1-y_1)y_2} \, P_{10}^{y_1(1-y_2)} \, P_{11}^{y_1 y_2}.$

Using the first-order Markov chain, transition probabilities can be defined as
$P(Y_{ij} \mid Y_{i,j-r}, \ldots, Y_{i,j-1}) = P(Y_{ij} \mid Y_{i,j-1}).$

This relationship provides a conditional–marginal route to obtain the joint form

$P(Y_1 = j, Y_2 = k) = P(Y_2 = k \mid Y_1 = j) \times P(Y_1 = j), \quad j = 0, 1; \; k = 0, 1.$

A similar approach can be shown for some other bivariate distributions such as the Poisson, geometric, negative binomial, multinomial, exponential, etc. These distributions can be expressed in the bivariate exponential family by generalizing the univariate form as shown below:

$f(y; \theta) = e^{\left\{\frac{y_1\theta_1 + y_2\theta_2 - b(\theta_1, \theta_2)}{a(\phi)}\right\} + c(y_1, y_2, \phi)}$

where $\theta_1$ and $\theta_2$ are canonical parameters such that $\theta_1 = g(\mu_1) = \eta_1 = X\beta_1$ and $\theta_2 = g(\mu_2) = \eta_2 = X\beta_2$. Here, $\mu_1 = E(Y_1 \mid X)$ and $\mu_2 = E(Y_2 \mid X)$. For generalized linear models, it is essential to know the random component of the model, which represents the underlying distributional form of the outcome variable. If the form of the distribution is known, then the likelihood estimation procedure can be applied to estimate the parameters of the linear model. However, in many cases the form of the underlying distribution may not be known. In that case, the quasi-likelihood approach can be used. For analyzing repeated measures data, the quasi-likelihood estimation procedure has become widely popular among researchers. In the quasi-likelihood method, we need to know the expected values of the outcome variables, and the variance functions need to be expressed as functions of the mean. The variance of an outcome variable can be written as $\mathrm{Var}(Y) = a(\phi)v(\mu)$, where $a(\phi)$ is the dispersion parameter and $v(\mu)$ is the variance function. The quasi-likelihood function, or more specifically the quasi-log-likelihood (Nelder and Lee 1992), is defined for a single observation as

$Q(\mu; y) = \int_y^{\mu} \frac{y - t}{a(\phi)v(t)} \, dt.$

The quasi-score function can be obtained by differentiating Q with respect to $\mu$, as shown below:

$\frac{\partial Q}{\partial \mu} = \frac{y - \mu}{a(\phi)v(\mu)}.$

For the quasi-log-likelihood of independent observations
$y_1, \ldots, y_n$, it can be shown that

$Q(\mu; y) = \sum_{i=1}^{n} \int_{y_i}^{\mu_i} \frac{y_i - t_i}{a(\phi)v(t_i)} \, dt_i.$

The estimating equations for the parameters of the linear model are

$U(\beta) = \frac{\partial Q}{\partial \beta} = \sum_{i=1}^{n} \left(\frac{\partial \mu_i}{\partial \beta}\right)' \frac{(y_i - \mu_i)}{v(\mu_i)} = 0,$

which are known as quasi-score equations. This can be rewritten in the following form for repeated measures data:

$U(\beta) = \frac{\partial Q}{\partial \beta} = \left(\frac{\partial \mu}{\partial \beta}\right)' V^{-1}(y - \mu) = D'V^{-1}(y - \mu) = 0.$

The generalized estimating equation (GEE) provides a marginal model which depends on the choice of a correlation structure. The estimating equations using quasi-likelihood scores can be shown as

$U(\beta) = \sum_{i=1}^{n} D_i' V_i(\mu_i, \alpha)^{-1}(y_i - \mu_i) = 0$

where $V_i(\mu_i, \alpha) = A_i^{1/2} R(\alpha) A_i^{1/2} a(\phi)$ and $R(\alpha)$ is a working correlation matrix expressed as a function of $\alpha$. The generalized estimating equation is an extension of the generalized linear model for repeated observations; more specifically, GEE is a quasi-likelihood approach based on knowledge of the first two moments, where the second moment is a function of the first. However, due to its marginal or population averaged modeling, the utility of the generalized estimating equations remains restricted. Although a correlation structure is considered in the marginal model framework, the within-subject association incorporated in the estimation of the parameters remains largely beyond explanation. An alternative way to incorporate the within-subject variation in the linear model is to use a generalized linear mixed model, where random effects attributable to within-subject variation are incorporated. The generalized linear model is

$g(\mu_i) = X_i\beta, \quad i = 1, \ldots, n$

with $E(Y_i \mid X_i) = \mu_i(\beta)$ and $\mathrm{Var}(Y_i) = a(\phi)V(\mu_i)$. This model can be extended to the jth repeated observation on the ith subject as

$g(\mu_{ij}) = X_{ij}\beta, \quad i = 1, \ldots, n; \; j = 1, \ldots, J_i$

with $E(Y_{ij} \mid X_{ij}) = \mu_{ij}(\beta)$ and $\mathrm{Var}(Y_{ij}) = a(\phi)V(\mu_{ij})$. Then, considering a random effect $u_i$ for the repeated observations of the ith subject or cluster, we can introduce an extended model

$g(\mu_{ij}) = X_{ij}\beta + Z_i u_i, \quad i = 1, \ldots, n; \; j = 1, \ldots, J_i$
where $u_i \sim \mathrm{MVN}(0, \Sigma)$. Instead of the normality assumption, other assumptions may be considered depending on the type of data. Another alternative to the marginal model is the conditional model, which can provide a useful analysis by introducing a model for an outcome variable given the values of other outcome variables. One popular technique is based on the Markovian assumption, where the transition probabilities are considered as functions of covariates and previous outcomes. The models can take first or higher order into account, and a test for order may make the model more specific. Markov models are suitable for longitudinal data observed over fixed intervals of time. A more efficient modeling of repeated measures requires multivariate models, which can be obtained from the marginal–conditional approach or from the joint distribution of the outcome variables. The conditional models for binary outcome variables $Y_1$ and $Y_2$, using a first-order Markov model, can be expressed as follows:

$P(Y_{2i} = 1 \mid Y_{1i} = 0, X_i) = \frac{e^{X_i\beta_{01}}}{1 + e^{X_i\beta_{01}}}$ and $P(Y_{2i} = 1 \mid Y_{1i} = 1, X_i) = \frac{e^{X_i\beta_{11}}}{1 + e^{X_i\beta_{11}}}$

where $\beta_{01}' = [\beta_{010}, \beta_{011}, \ldots, \beta_{01p}]$, $\beta_{11}' = [\beta_{110}, \beta_{111}, \ldots, \beta_{11p}]$, and $X_i = [1, X_{1i}, \ldots, X_{pi}]$. The marginal models for $Y_1$ and $Y_2$ are

$P(Y_{1i} = 1 \mid X_i) = \frac{e^{X_i\beta_1}}{1 + e^{X_i\beta_1}}$ and $P(Y_{2i} = 1 \mid X_i) = \frac{e^{X_i\beta_2}}{1 + e^{X_i\beta_2}}.$

Here $\beta_1' = [\beta_{10}, \beta_{11}, \ldots, \beta_{1p}]$, $\beta_2' = [\beta_{20}, \beta_{21}, \ldots, \beta_{2p}]$, and $x_i = [1, x_{1i}, \ldots, x_{pi}]$. The semi-parametric hazards models provide models for analyzing lifetime data arising from longitudinal studies that produce repeated measures. The multistate and multistage models can be effective for analyzing data on transitions, reverse transitions, and repeated transitions that take place over time in the status of events. It is useful to study the transitions over time as functions of covariates or risk factors. In survival or reliability analysis, we have to deal with censored data, which is the most common source of incomplete data in longitudinal studies.
The proportional hazards models for one or more transient states can be obtained for partially censored data. The problem of analyzing repeated measures data on failure times in the competing risks framework has been of interest in various fields including survival analysis, reliability, and actuarial science. The hazard function for failure type J = j, where J = 1,…,k, with covariate dependence, can be shown as

$h_j(t; x) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t, J = j \mid T \ge t, x)}{\Delta t}.$

Then the cause-specific proportional hazards model is

$h_{ij}(t_i; x_i) = h_{0ij}(t)e^{x_i\beta_j}$

where $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ and the parameter vector is $\beta_j = (\beta_{j1}, \ldots, \beta_{jp})'$, j = 1,…,k. Extending the cause-specific hazard function to transitions among several transient states, we can define the multistate hazard function for a transition from state j to state k during $(t, t + \Delta t)$ as

$h(t; k \mid j, x_{jk}) = \lim_{\Delta t \to 0} \frac{P(t \le T \le t + \Delta t, S = k \mid T \ge t, S = j, x_{jk})}{\Delta t}$

and the proportional hazards model for multistate transitions is

$h(t; k \mid j, x_{jk}) = h_{0jk}(t)e^{x_{jk}\beta_{jk}}$

where $\beta_{jk}$ is the vector of parameters for the transition from j to k and $x_{jk}$ is the vector of covariate values. In this book, the inferential techniques for modeling repeated measures data are illustrated to provide a detailed background with applications. The estimation procedures for various models for analyzing repeated measures data are of prime concern and remain a challenge to users. For testing the dependence among outcomes, some test procedures are illustrated in this book for binary, count, and continuous outcome variables. The goodness of fit tests are provided with applications. For correlated Poisson outcomes, the problem of under- or overdispersion is addressed, and tests for under- or overdispersion are highlighted with examples. In many instances truncation is one of the major problems in analyzing correlated outcomes, such as zero or right truncation, particularly in count regression models, which are also discussed in this book.
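The first-order Markov conditional logistic models and the marginal–conditional factorization described above can be sketched numerically. In this illustrative example (ours, not the book's), the coefficient vectors and the marginal probability are hypothetical values:

```python
import numpy as np

# Illustrative sketch (not from the book): first-order Markov conditional
# logistic models with hypothetical coefficient vectors beta01 (given
# Y1 = 0) and beta11 (given Y1 = 1).
def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

beta01 = np.array([-1.0, 0.5])   # [intercept, slope] for P(Y2=1 | Y1=0, x)
beta11 = np.array([0.2, 0.8])    # [intercept, slope] for P(Y2=1 | Y1=1, x)

def p_transition(x, y1):
    """P(Y2 = 1 | Y1 = y1, x) under the conditional logistic model."""
    row = np.array([1.0, x])             # design row [1, x]
    beta = beta11 if y1 == 1 else beta01
    return float(logistic(row @ beta))

# Marginal-conditional route to a joint probability:
# P(Y1 = 1, Y2 = 1 | x) = P(Y2 = 1 | Y1 = 1, x) * P(Y1 = 1 | x),
# with an assumed marginal P(Y1 = 1 | x) = 0.3.
joint_11 = p_transition(0.5, 1) * 0.3
print(round(joint_11, 4))
```

In practice the $\beta$ vectors would be estimated from transition data rather than fixed, but the factorization of the joint probability proceeds exactly as shown.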
Chapter 2 Linear Models

In this chapter, a brief introduction to linear models is presented. Linearity can be interpreted in terms of either linearity in parameters or linearity in variables. In this book, we consider linearity in the parameters of a model. Linear models generally include regression models, analysis of variance models, and analysis of covariance models. As the focus of this book is to address various generalized linear models for repeated measures data using GLM and Markov chains/processes, regression models are reviewed in this chapter only very briefly.

2.1 Simple Linear Regression Model

Let us consider a random sample of n pairs of observations $(Y_1, X_1), \ldots, (Y_n, X_n)$. Here, let Y be the dependent variable or outcome and X be the independent variable or predictor. Then the simple regression model, or the regression model with a single predictor, is denoted by

$E(Y \mid X) = \beta_0 + \beta_1 X. \quad (2.1)$

It is clear from (2.1) that the simple regression model is a population averaged model. Here $E(Y \mid X) = \mu_{Y|X}$ represents the conditional expectation of Y for given X. In other words,

$\mu_{Y|X} = \beta_0 + \beta_1 X \quad (2.2)$

which can be visualized from Fig. 2.1. An alternative way to represent model (2.1) or (2.2) is

$Y = \beta_0 + \beta_1 X + \epsilon \quad (2.3)$

where $\epsilon$ denotes the distance of Y from the conditional expectation or conditional mean $\mu_{Y|X}$, as evident from the expression

$Y = \mu_{Y|X} + \epsilon \quad (2.4)$

where $\epsilon$ denotes the error in the dependent or outcome variable Y attributable to the deviation from the population averaged model; $\epsilon$ is a random variable as well, with $E(\epsilon) = 0$ and $\mathrm{Var}(\epsilon) = \sigma^2$.

2.2 Multiple Regression Model

We can extend the simple regression model shown in Sect. 2.1 to a multiple regression model with p predictors $X_1, \ldots, X_p$.
The population averaged model can be shown as

$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p. \quad (2.5)$

Here $E(Y \mid X) = \mu_{Y|X}$ as shown in Sect. 2.1. Alternatively,

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon \quad (2.6)$

which can be expressed as

$Y = \mu_{Y|X} + \epsilon. \quad (2.7)$

(Fig. 2.1 Population regression model)

In vector and matrix notation, the model in Eq. (2.6) for a sample of size n is

$Y = X\beta + \epsilon \quad (2.8)$

where $Y = (Y_1, Y_2, \ldots, Y_n)'$, $\beta = (\beta_0, \beta_1, \ldots, \beta_p)'$, $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_n)'$, and

$X = \begin{pmatrix} 1 & X_{11} & \cdots & X_{1p} \\ 1 & X_{21} & \cdots & X_{2p} \\ \vdots & & & \vdots \\ 1 & X_{n1} & \cdots & X_{np} \end{pmatrix}.$

It is clear from the formulation of the regression model that it provides a theoretical framework for explaining the underlying linear relationships between the explanatory and outcome variables of interest. A perfect model can be obtained only if all the values of the outcome variable are equal to the conditional expectation for given values of the predictors, which is not feasible in explaining real life problems. However, it can still provide very important insight when the specified model keeps the error minimal. Hence, it is important to specify a model that produces estimates of the outcome variable as close to the observed values as possible. In other words, the postulated models in Sects. 2.1 and 2.2 are hypothetical idealized versions of the underlying linear relationships, which may be attributed merely to association or, in some instances, to causation as well. The population regression model is proposed under a set of assumptions: (i) $E(\epsilon_i) = 0$, (ii) $\mathrm{Var}(\epsilon_i) = \sigma^2$, (iii) $E(\epsilon_i\epsilon_j) = 0$ for $i \ne j$, and (iv) independence of X and $\epsilon$. In addition, the assumption of normality is necessary for likelihood estimation as well as for testing of hypotheses. Based on these assumptions, we can show the mean and variance of $Y_i$ as follows: $E(Y_i \mid X_i) = X_i\beta$ and $\mathrm{Var}(Y_i \mid X_i) = \sigma^2$, where $X_i$ is the ith row vector of the matrix X.
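As a quick numerical illustration of the model $Y = X\beta + \epsilon$ and its assumptions, one can simulate data and recover $\beta$ with the least squares estimator $(X'X)^{-1}X'Y$ derived in Sect. 2.3. This is a sketch with simulated data (not the book's example); the sample size, coefficients, and error scale are arbitrary:

```python
import numpy as np

# Sketch (simulated data, not the book's): generate Y = X beta + e under
# the stated assumptions and recover beta with (X'X)^{-1} X'Y.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # column of 1s + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)           # errors: mean 0, sigma = 0.1

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # solves the normal equations (X'X) b = X'Y
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - p - 1)              # unbiased estimator of sigma^2
print(beta_hat.round(2), round(float(s2), 4))
```

Solving the normal equations with `np.linalg.solve` is numerically preferable to forming $(X'X)^{-1}$ explicitly.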
Using (2.8), we can rewrite the assumptions as follows: (i) $E(\epsilon) = 0$, and (ii) $\mathrm{Cov}(\epsilon) = \sigma^2 I$. Similarly, $E(Y \mid X) = X\beta$ and $\mathrm{Cov}(Y \mid X) = \sigma^2 I$.

2.3 Estimation of Parameters

For estimating the regression parameters, we can use both the method of least squares and the method of maximum likelihood. It may be noted here that for extending the concept of linear models to generalized linear models or covariate dependent Markov models, the maximum likelihood method will be used more extensively; hence, both are discussed here, although the method of least squares is the more convenient method of estimation for the linear regression model, with desirable properties.

2.3.1 Method of Least Squares

The method of least squares estimates the regression parameters by minimizing the error sum of squares, or residual sum of squares. The regression model is

$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip} + \epsilon_i, \quad i = 1, 2, \ldots, n \quad (2.9)$

and we can define the deviation between the outcome variable and its corresponding conditional mean for given values of X as follows:

$\epsilon_i = Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip}). \quad (2.10)$

Then the error sum of squares is defined as a quadratic form

$Q = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left[Y_i - (\beta_0 + \beta_1 X_{i1} + \cdots + \beta_p X_{ip})\right]^2. \quad (2.11)$

The sum of squares of error is minimized if the estimates are obtained from the following equations:

$\left.\frac{\partial Q}{\partial \beta_0}\right|_{\beta = \hat{\beta}} = -2\sum_{i=1}^{n} \left[Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip})\right] = 0 \quad (2.12)$

$\left.\frac{\partial Q}{\partial \beta_j}\right|_{\beta = \hat{\beta}} = -2\sum_{i=1}^{n} \left[Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip})\right]X_{ij} = 0, \quad (2.13)$

j = 1,…,p. We can consider (2.12) as a special case of Eq. (2.13) for j = 0 and $X_0 = 1$. Using model (2.8), Q can be expressed as

$Q = \epsilon'\epsilon = (Y - X\beta)'(Y - X\beta). \quad (2.14)$

The right-hand side of (2.14) is

$Q = Y'Y - Y'X\beta - \beta'X'Y + \beta'X'X\beta$

where $Y'X\beta = \beta'X'Y$. Hence the estimating equations are

$\left.\frac{\partial Q}{\partial \beta}\right|_{\beta = \hat{\beta}} = -2X'Y + 2X'X\hat{\beta} = 0. \quad (2.15)$

Solving Eq.
(2.15), we obtain the least squares estimators of the regression parameters as shown below:

$\hat{\beta} = (X'X)^{-1}(X'Y). \quad (2.16)$

The estimated regression model can be shown as

$\hat{Y} = X\hat{\beta} \quad (2.17)$

and alternatively

$Y = X\hat{\beta} + e \quad (2.18)$

where $\hat{Y} = (\hat{Y}_1, \hat{Y}_2, \ldots, \hat{Y}_n)'$, $\hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p)'$, $e = (e_1, e_2, \ldots, e_n)'$, and X is the $n \times (p+1)$ design matrix shown earlier. It may be noted here that e is the vector of estimated errors from the fitted model. Hence, we can show that

$e = Y - \hat{Y} \quad (2.19)$

and the error sum of squares is

$e'e = (Y - \hat{Y})'(Y - \hat{Y}). \quad (2.20)$

2.3.1.1 Some Important Properties of the Least Squares Estimators

The least squares estimators have some desirable properties of good estimators, which are shown below.

(i) Unbiasedness: $E(\hat{\beta}) = \beta$.

Proof: We know that $\hat{\beta} = (X'X)^{-1}(X'Y)$ and $Y = X\beta + \epsilon$. Hence,

$E(\hat{\beta}) = E[(X'X)^{-1}(X'Y)] = (X'X)^{-1}X'E(Y) = (X'X)^{-1}X'E(X\beta + \epsilon) = (X'X)^{-1}X'X\beta = \beta.$

(ii) $\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}\sigma^2$.

Proof: $\mathrm{Cov}(\hat{\beta}) = \mathrm{Cov}[(X'X)^{-1}X'Y] = (X'X)^{-1}X'\mathrm{Cov}(Y)X(X'X)^{-1}$, where $\mathrm{Cov}(Y) = \sigma^2 I$. Hence,

$\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}X'IX(X'X)^{-1}\sigma^2 = (X'X)^{-1}\sigma^2. \quad (2.21)$

(iii) The least squares estimator $\hat{\beta}$ is the best linear unbiased estimator of $\beta$.

(iv) The mean squared error is an unbiased estimator of $\sigma^2$. In other words,

$E\left(\frac{e'e}{n - p - 1}\right) = \sigma^2. \quad (2.22)$

Proof: Let us denote $SSE = e'e = (Y - X\hat{\beta})'(Y - X\hat{\beta})$ and $s^2 = \frac{SSE}{n - p - 1}$, where p is the number of predictors. The total sum of squares of Y is $Y'Y$. The sum of squares of errors can be rewritten as

$SSE = Y'Y - Y'X\hat{\beta} - \hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta}$

where $Y'X\hat{\beta} = \hat{\beta}'X'Y$.
Then, replacing $\hat{\beta}$ by $(X'X)^{-1}(X'Y)$, it can be shown that

$SSE = Y'Y - 2\hat{\beta}'X'Y + \hat{\beta}'X'X\hat{\beta} = Y'Y - \hat{\beta}'X'Y = Y'Y - [(X'X)^{-1}X'Y]'X'Y = Y'Y - Y'X(X'X)^{-1}X'Y = Y'[I - X(X'X)^{-1}X']Y.$

It can be shown that the matrix in the middle of the above expression is a symmetric idempotent matrix, and $SSE/\sigma^2$ is chi-square with degrees of freedom equal to the rank of the matrix $[I - X(X'X)^{-1}X']$. The rank of this idempotent matrix is equal to $\mathrm{trace}[I - X(X'X)^{-1}X']$, which is $n - p - 1$. Hence,

$E[(n - p - 1)s^2/\sigma^2] = E(SSE/\sigma^2) = \mathrm{trace}[I - X(X'X)^{-1}X'] = n - p - 1.$

This implies $E(SSE) = (n - p - 1)\sigma^2$ and

$E\left(\frac{SSE}{n - p - 1}\right) = \sigma^2.$

In other words, the mean square error is an unbiased estimator of $\sigma^2$, i.e., $E(s^2) = \sigma^2$.

2.3.2 Maximum Likelihood Estimation

It is noteworthy that estimation by the least squares method does not require the normality assumption. However, the estimates of the regression parameters can also be obtained assuming that $Y \sim N_n(X\beta, \sigma^2 I)$, where $E(Y \mid X) = X\beta$ and $\mathrm{Var}(Y \mid X) = \sigma^2 I$.
The likelihood function is

$L(\beta, \sigma^2) = \frac{1}{(2\pi)^{n/2}|\sigma^2 I|^{1/2}} e^{-(Y - X\beta)'(\sigma^2 I)^{-1}(Y - X\beta)/2} = \frac{1}{(2\pi\sigma^2)^{n/2}} e^{-(Y - X\beta)'(Y - X\beta)/2\sigma^2}.$

The log-likelihood function can be shown as follows:

$\ln L(\beta, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(Y - X\beta)'(Y - X\beta). \quad (2.23)$

Differentiating (2.23) with respect to the parameters and equating to zero, we obtain the following equations:

$\left.\frac{\partial \ln L}{\partial \beta}\right|_{\beta = \hat{\beta},\, \sigma^2 = \hat{\sigma}^2} = -\frac{1}{2\hat{\sigma}^2}(-2X'Y + 2X'X\hat{\beta}) = 0 \quad (2.24)$

$\left.\frac{\partial \ln L}{\partial \sigma^2}\right|_{\beta = \hat{\beta},\, \sigma^2 = \hat{\sigma}^2} = -\frac{n}{2\hat{\sigma}^2} + \frac{1}{2(\hat{\sigma}^2)^2}(Y - X\hat{\beta})'(Y - X\hat{\beta}) = 0. \quad (2.25)$

Solving (2.24) and (2.25), we obtain the following maximum likelihood estimators:

$\hat{\beta} = (X'X)^{-1}(X'Y), \quad \hat{\sigma}^2 = \frac{1}{n}(Y - X\hat{\beta})'(Y - X\hat{\beta}).$

2.3.2.1 Some Important Properties of Maximum Likelihood Estimators

Some important properties of the maximum likelihood estimators are listed below:

(i) $\hat{\beta} \sim N_{p+1}[\beta, \sigma^2(X'X)^{-1}]$;
(ii) $\frac{n\hat{\sigma}^2}{\sigma^2} \sim \chi^2(n - p - 1)$;
(iii) $\hat{\beta}$ and $\hat{\sigma}^2$ are independent;
(iv) if Y is $N_n(X\beta, \sigma^2 I)$, then $\hat{\beta}$ and $\hat{\sigma}^2$ are jointly sufficient for $\beta$ and $\sigma^2$; and
(v) if Y is $N_n(X\beta, \sigma^2 I)$, then $\hat{\beta}$ has minimum variance among all unbiased estimators.

2.4 Tests

In a regression model, we need to perform several tests, such as: (i) significance of the overall fit of the model involving p predictors, (ii) significance of each parameter, to test for significant association between each predictor and the outcome variable, and (iii) significance of a subset of parameters.

(i) Test for significance of the model

In the regression model $Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$, it is important to examine the hypothesis that none of the predictors $X_1, \ldots, X_p$ is linearly associated with the outcome variable Y, against the hypothesis that at least one of the predictors is linearly associated with the outcome variable. As the postulated model represents a hypothetical relationship between the population mean and the predictors,
$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p$, the contribution of the model can be tested from the regression sum of squares, which indicates the fit of the model for the conditional mean, compared to the error sum of squares, which measures the deviation of the observed values of the outcome variable from the postulated linear relationship of the predictors with the conditional mean. It may be noted here that the total sum of squares of the outcome variable can be partitioned into two components, for regression and error, as shown below:

$Y'Y = \hat{\beta}'X'Y + (Y - X\hat{\beta})'(Y - X\hat{\beta})$

where $\hat{\beta}'X'Y$ is the sum of squares of regression (SSR) and $(Y - X\hat{\beta})'(Y - X\hat{\beta})$ is the sum of squares of error (SSE). The coefficient of multiple determination, $R^2$, measures the extent or proportion of the linear relationship explained by the multiple linear regression model. This is the squared multiple correlation. The coefficient of multiple determination can be defined as

$R^2 = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \frac{\hat{\beta}'X'Y - n\bar{Y}^2}{Y'Y - n\bar{Y}^2} \quad (2.26)$

and the range of $R^2$ is $0 \le R^2 \le 1$, with 0 indicating that the model does not explain the variation at all and 1 indicating a perfect fit, i.e., 100% of the variation is explained by the model. The null and alternative hypotheses for the overall test of the model are

$H_0: \beta_1 = \cdots = \beta_p = 0$ and $H_1: \beta_j \ne 0$ for at least one j, j = 1,…,p.

Under the null hypothesis, the sum of squares of regression is $\chi^2_p\sigma^2$ and similarly the sum of squares of error is $\chi^2_{n-p-1}\sigma^2$. The test statistic is

$F = \frac{SSR/p}{SSE/(n - p - 1)} \sim F_{p,\,(n - p - 1)}. \quad (2.27)$

Rejection of the null hypothesis indicates that at least one of the variables in the postulated model contributes significantly in the overall or global test.

(ii) Test for the significance of parameters

Once we have determined that at least one of the predictors is significant, the next step is to identify the variables that have a statistically significant linear relationship with the outcome variable.
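The sum-of-squares partition, $R^2$, and the overall F statistic can be sketched numerically. This example uses simulated data with hypothetical coefficients (not the book's data) and also checks that the mean-corrected $R^2$ agrees with the form in (2.26):

```python
import numpy as np

# Sketch (simulated data, hypothetical coefficients): partition the
# mean-corrected total sum of squares into SSR and SSE and form
# F = (SSR/p) / (SSE/(n-p-1)) for the overall test.
rng = np.random.default_rng(1)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([2.0, 1.0, 0.0, -1.0])
Y = X @ beta + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ beta_hat
ybar = Y.mean()
SSR = float(((Y_hat - ybar) ** 2).sum())   # mean-corrected regression SS
SSE = float(((Y - Y_hat) ** 2).sum())      # error SS
R2 = SSR / (SSR + SSE)
# The same R^2 via the book's form (2.26): (b'X'Y - n ybar^2) / (Y'Y - n ybar^2)
R2_alt = float((beta_hat @ X.T @ Y - n * ybar**2) / (Y @ Y - n * ybar**2))
F = (SSR / p) / (SSE / (n - p - 1))
print(round(R2, 3), round(F, 1))
```

With an intercept in the model, the two $R^2$ expressions are algebraically identical, since the fitted values then have the same mean as Y.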
Statistically, it is obvious that the inclusion of one or more variables in a regression model may increase the regression sum of squares and thus decrease the error sum of squares. However, it needs to be tested whether such inclusion is statistically significant or not. These tests will be elaborated in the next section in more detail. The first task is to examine each individual parameter separately to identify predictors with a statistically significant linear relationship with the outcome variable of interest. The null and alternative hypotheses for testing the significance of an individual parameter are

$H_0: \beta_j = 0$ and $H_1: \beta_j \ne 0.$

The test statistic is

$t = \frac{\hat{\beta}_j}{se(\hat{\beta}_j)} \quad (2.28)$

which follows a t distribution with $(n - p - 1)$ degrees of freedom. We know that $\mathrm{Cov}(\hat{\beta}) = (X'X)^{-1}\sigma^2$ and the estimate of the covariance matrix is $\widehat{\mathrm{Cov}}(\hat{\beta}) = (X'X)^{-1}s^2$, where $s^2$ is the unbiased estimator of $\sigma^2$. The standard error of $\hat{\beta}_j$ can be obtained from the corresponding diagonal element of $(X'X)^{-1}s^2$. Here, rejection of the null hypothesis implies a statistically significant linear relationship with the outcome variable.

(iii) Extra Sum of Squares Method

As mentioned in the previous section, the inclusion of a variable may increase SSR and correspondingly decrease SSE; it needs to be tested whether the increase in SSR is statistically significant or not. In addition, it is also possible to test whether inclusion or deletion of a subset of potential predictors results in any statistically significant change in the fit of the model. For this purpose, the extra sum of squares principle may be a very useful procedure. Let us consider a regression model $Y = X\beta + \epsilon$ where Y is $n \times 1$, X is $n \times k$, $\beta$ is $k \times 1$, and $k = p + 1$. If we partition $\beta$ as

$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix}$

where $\beta = (\beta_0, \beta_1, \ldots, \beta_{r-1}, \beta_r, \ldots, \beta_p)'$, $\beta_1 = (\beta_0, \beta_1, \ldots, \beta_{r-1})'$, and $\beta_2 = (\beta_r, \ldots, \beta_p)'$.
We can express the partitioned regression model as

$Y = X_1\beta_1 + X_2\beta_2 + \epsilon \quad (2.29)$

where $X_1$ contains the columns of the design matrix corresponding to $\beta_1$ (the column of 1s and $X_1, \ldots, X_{r-1}$) and $X_2$ contains the columns corresponding to $\beta_2$ ($X_r, \ldots, X_p$). Let us consider this model as the full model; in other words, the full model comprises all the variables under consideration. We want to test whether some of the variables, or a subset of the variables, included in the full model contribute significantly or not. This subset may include one or more variables, and the corresponding coefficients or regression parameters are represented by the vector $\beta_2$. Hence, a test of whether $\beta_2 = 0$ is an appropriate null hypothesis here. This can be employed for a single parameter as a special case. The regression and error sums of squares from the full and reduced models are shown below.

Full model:
SSR (full model) = $\hat{\beta}'X'Y$
SSE (full model) = $Y'Y - \hat{\beta}'X'Y$

Reduced model (under the null hypothesis):
SSR (reduced model) = $\hat{\beta}_1'X_1'Y$
SSE (reduced model) = $Y'Y - \hat{\beta}_1'X_1'Y$

The difference between SSR (full model) and SSR (reduced model) shows the contribution of the variables $X_r, \ldots, X_p$, which can be expressed as

$SSR(\beta_2 \mid \beta_1) = \hat{\beta}'X'Y - \hat{\beta}_1'X_1'Y.$

This is the extra sum of squares attributable to the variables under the null hypothesis. The test statistic for $H_0: \beta_2 = 0$ is

$F = \frac{SSR(\beta_2 \mid \beta_1)/(k - r)}{s^2} \sim F_{(k - r),\,(n - k)} \quad (2.30)$

where $k - r$ is the number of parameters in $\beta_2$. Acceptance of the null hypothesis implies that there may not be any statistically significant contribution of the variables $X_r, \ldots, X_p$, and the reduced model under the null hypothesis is as good as the full model.

2.5 Example

A data set on a standardized fertility measure and socioeconomic indicators from Switzerland is used for application in this chapter. This data set is freely available from the 'datasets' package in R.
The full dataset and description are available for download from the Office of Population Research website (https://opr.princeton.edu/archive/pefp/switz.aspx). The following variables are available in the 'swiss' dataset from the datasets package. This data set includes indicators for each of 47 French-speaking provinces of Switzerland in 1888. The variables are:

Fertility: common standardized fertility measure
Agriculture: % of males involved in agriculture as occupation
Examination: % draftees receiving highest mark on army examination
Education: % education beyond primary school for draftees
Catholic: % 'Catholic' (as opposed to 'Protestant')
Infant Mortality: live births who live less than one year

The first example shows the fit of a simple regression model where the outcome variable is Y = common standardized fertility measure and X = percent education beyond primary school for draftees. The estimated model is

$\hat{Y} = 79.6101 - 0.8624X.$

Education appears to be negatively associated with the fertility measure in the French-speaking provinces (p-value < 0.001). Figure 2.2 displays the negative relationship (Fig. 2.2 Simple linear regression). Table 2.1 summarizes the results.

Table 2.1 Estimates and tests of parameters of a simple regression model

Variable    Estimate   Std. error   t-value   Pr(>|t|)
Constant    79.6101    2.1041       37.836    0.000
Education   -0.8624    0.1448       -5.954    0.000

Table 2.2 Estimates and tests of parameters of a multiple linear regression model

Variable           Estimate    Std. error   t-value   Pr(>|t|)
Constant           62.10131    9.60489      6.466     0.000
Agriculture        -0.15462    0.06819      -2.267    0.029
Education          -0.98026    0.14814      -6.617    0.000
Catholic           0.12467     0.02889      4.315     0.000
Infant Mortality   1.07844     0.38187      2.824     0.007

Using the same data source, an example of the fit of a multiple regression model is shown, and the results are summarized in Table 2.2.
For the same outcome variable, four explanatory variables are considered: percent males involved in agriculture as occupation $(X_1)$, education $(X_2)$, percent Catholic $(X_3)$, and infant mortality $(X_4)$. The estimated model for the outcome variable, fertility, is

$\hat{Y} = 62.10131 - 0.15462X_1 - 0.98026X_2 + 0.12467X_3 + 1.07844X_4.$

All the explanatory variables show a statistically significant linear relationship with fertility; agriculture and education are negatively related, while percent Catholic and infant mortality are positively related to the outcome variable. The fit of the overall model is statistically significant (F = 24.42, D.F. = 4 and 42, p-value < 0.001). About 70% ($R^2$ = 0.699) of the total variation is explained by the fitted model.

Chapter 3 Exponential Family of Distributions

The exponential family of distributions has an increasingly important role in statistics. The immediate purpose of a family or class of families is to examine the existence of sufficient statistics, and it is possible to link the families to the existence of minimum variance unbiased estimates. In addition to these important uses, exponential families of distributions are extensively employed in developing generalized linear models. Let Y be a random variable with probability density or mass function $f(y; \theta)$, where $\theta$ is a single parameter; then Y can be classified as belonging to the exponential family of distributions if the probability density or mass function can be expressed as follows:

$f(y; \theta) = e^{[a(y)b(\theta) + c(\theta) + d(y)]} \quad (3.1)$

where $a(y)$ and $d(y)$ are functions of y, and $b(\theta)$ and $c(\theta)$ are functions of the parameter $\theta$ only. We may express this function in the following form as well:

$f(y; \theta) = d_0(y)e^{[a(y)b(\theta) + c(\theta)]} \quad (3.2)$

where $d_0(y) = e^{d(y)}$. The joint pdf or pmf from (3.2) can be shown as follows for independently and identically distributed $Y_1, \ldots, Y_n$:

$f(y; \theta) = \prod_{i=1}^{n} f(y_i; \theta) = \prod_{i=1}^{n} e^{[a(y_i)b(\theta) + c(\theta) + d(y_i)]} = \prod_{i=1}^{n} d_0(y_i)e^{[a(y_i)b(\theta) + c(\theta)]} \quad (3.3)$

where $y' = (y_1, \ldots, y_n)$.
3.1 Exponential Family and Sufficiency

One of the major advantages of the exponential family is that we can find sufficient statistics readily from the expression. Let $f(y; \theta)$, where $y' = (y_1, \ldots, y_n)$, be the joint pdf or pmf of the sample; then $\sum_{i=1}^{n} a(y_i)$ is a sufficient statistic for $\theta$ if and only if there exist functions $g\left[\sum_{i=1}^{n} a(y_i) \mid \theta\right]$ and $h(y)$ such that for all sample and parameter points,

$f(y; \theta) = h(y) \cdot g\left[\sum_{i=1}^{n} a(y_i) \mid \theta\right]. \quad (3.4)$

It can be shown from (3.1) to (3.3) that

$L(\theta; y) = \prod_{i=1}^{n} d_0(y_i) \prod_{i=1}^{n} e^{[a(y_i)b(\theta) + c(\theta)]} = h(y) \cdot e^{b(\theta)\sum_{i=1}^{n} a(y_i) + nc(\theta)} \quad (3.5)$

where (3.5) is expressed in the factorized form of a sufficient statistic as displayed in (3.4). In other words, $\sum_{i=1}^{n} a(y_i)$ is a sufficient statistic for $\theta$. If we assume that y and x belong to the same class of the partition of the sample space for $Y_1, \ldots, Y_n$, which is satisfied if the ratio of likelihood functions, $L(\theta; y)/L(\theta; x)$, does not depend on $\theta$, then any statistic corresponding to the parameter is minimal sufficient. If $Y_1, \ldots, Y_n$ are independently and identically distributed, then the ratio of likelihood functions is

$\frac{L(\theta; y)}{L(\theta; x)} = \frac{h(y) \cdot e^{b(\theta)\sum_{i=1}^{n} a(y_i) + nc(\theta)}}{h(x) \cdot e^{b(\theta)\sum_{i=1}^{n} a(x_i) + nc(\theta)}}. \quad (3.6)$

It is clearly evident from (3.6) that the ratio is independent of $\theta$ only if $\sum_{i=1}^{n} a(y_i) = \sum_{i=1}^{n} a(x_i)$; then $\sum_{i=1}^{n} a(y_i)$ is a minimal sufficient statistic for $\theta$. It is noteworthy that if a minimum variance unbiased estimator exists, then there must be a function of the minimal sufficient statistic for the parameter which is a minimum variance unbiased estimator. If $Y \sim f(y; \theta)$, where $\theta = (\theta_1, \ldots$
.; hkÞ is a vector of k parameters belonging to the exponential family of distributions; then the probability distribution can be expressed as f ðy; hÞ ¼ e Pk j¼1 ajðyÞbjðhÞþ cðhÞþ dðyÞ � � ð3:7Þ 24 3 Exponential Family of Distributions where a1ðyÞ; . . .; akðyÞ and dðyÞ are functions of Y alone and b1ðhÞ; . . .; bkðhÞ and cðhÞ are functions of h alone. Then it can be shown thatPn i¼1 a1ðyiÞ; . . .; Pn i¼1 akðyiÞ are sufficient statistics for h1; . . .; hk respectively. Example 3.1 f y; n; pð Þ ¼ n y � � py 1� pð Þn�y ¼ e ln n y � � þ y ln p þ n�yð Þ ln 1�pð Þ � ¼ e y ln p1�p þ ln n y � � þ n ln 1�pð Þ � Here aðyÞ ¼ y bðhÞ ¼ ln p 1� p cðhÞ ¼ n ln 1� pð Þ dðyÞ ¼ ln n y � � : and it can be shown that Pn i¼1 yi is a sufficient statistic for h ¼ p. Example 3.2 Poisson Distribution f y; hð Þ ¼ e � hhy y! ¼ e y ln h � ln y!�hf g where aðyÞ ¼ y; bðhÞ ¼ ln h; cðhÞ ¼ � h; dðyÞ ¼ � ln y!: It can be shown that Pn i¼1 yi is a sufficient statistic for h. Example 3.3 Exponential f y; hð Þ ¼ he�hy ¼ e �h yþ ln hf g where aðyÞ ¼ y; bðhÞ ¼ �h; cðhÞ ¼ ln h; dðyÞ ¼ 0: For exponential distribution parameter, h, it can be shown that Pn i¼1 yi is a sufficient statistic. 3.1 Exponential Family and Sufficiency 25 Example 3.4 Normal Distribution with mean zero and variance r2 f y; 0; r2 � ¼ 1ffiffiffiffiffiffiffiffiffiffi 2pr2 p e�y2=2r2 ¼ e �y2=2r 2�12 ln 2pr2ð Þ � where aðyÞ ¼ y2 bðhÞ ¼ � 1 2r2 cðhÞ ¼ � 1 2 ln 2pr2 � dðyÞ ¼ 0: For h ¼ r2, the sufficient statistic is Pni¼1 y2i . Example 3.5 Normal Distribution with Mean l and Variance 1. f y; l; 1ð Þ ¼ 1ffiffiffiffiffiffi 2p p e�12 y�lð Þ2 ¼ e �12 y2�2lyþ l2ð Þ�12 lnð2pÞf g ¼ e yl�12y2�12 ln 2pð Þ�12l2f g where aðyÞ ¼ y; bðhÞ ¼ l; cðhÞ ¼ � 1 2 l2; dðyÞ ¼ � 1 2 y2 � 1 2 ln 2p: In this example, for h ¼ l, the sufficient statistic is Pni¼1 yi. 
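The minimal sufficiency argument around (3.6) can be illustrated numerically: for two i.i.d. Poisson samples with the same value of $\sum y_i$, the likelihood ratio is free of $\theta$. A short Python sketch (the sample values are arbitrary choices of ours, not from the text):

```python
import math

def poisson_loglik(sample, theta):
    # log-likelihood of an i.i.d. Poisson sample
    return sum(y * math.log(theta) - theta - math.lgamma(y + 1) for y in sample)

# two different samples with the same sufficient statistic: sum(y_i) = 12
y = [1, 4, 7]
x = [4, 4, 4]

# log of the likelihood ratio L(theta; y) / L(theta; x) at several theta values
log_ratios = [poisson_loglik(y, t) - poisson_loglik(x, t) for t in (0.5, 2.0, 9.0)]

# the ratio does not depend on theta, exactly as (3.6) requires
assert max(log_ratios) - min(log_ratios) < 1e-9
```

Because the samples share $\sum y_i$, the $\theta$-dependent terms $b(\theta)\sum y_i + nc(\theta)$ cancel in the ratio, leaving only the constant $h(y)/h(x)$.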
Example 3.6 Gamma Distribution

$$f(y; \theta) = \frac{\theta^r}{\Gamma(r)}\, y^{r-1} e^{-\theta y} = e^{-\theta y + (r-1)\ln y - \ln\Gamma(r) + r\ln\theta}$$

where

$$a(y) = y, \quad b(\theta) = -\theta, \quad c(\theta) = r\ln\theta, \quad d(y) = (r-1)\ln y - \ln\Gamma(r).$$

In this example, the sufficient statistic for $\theta$ is $\sum_{i=1}^n y_i$.

Example 3.7 Normal Distribution with mean $\mu$ and variance $\sigma^2$

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2} = e^{-\frac{1}{2\sigma^2}(y^2 - 2y\mu + \mu^2) - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{-\frac{1}{2\sigma^2}y^2 + \frac{y\mu}{\sigma^2} - \frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(2\pi\sigma^2)}$$

where

$$a_1(y) = y, \quad a_2(y) = y^2, \quad b_1(\theta) = \frac{\mu}{\sigma^2}, \quad b_2(\theta) = -\frac{1}{2\sigma^2},$$
$$c(\theta) = -\frac{\mu^2}{2\sigma^2} - \frac{1}{2}\ln(\sigma^2), \quad d(y) = -\frac{1}{2}\ln(2\pi).$$

In this example, the joint sufficient statistics for $\theta_1 = \mu$ and $\theta_2 = \sigma^2$ are $\sum_{i=1}^n y_i$ and $\sum_{i=1}^n y_i^2$, respectively.

Example 3.8 Gamma Distribution (two parameter)

$$f(y; \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, y^{\alpha-1} e^{-\beta y} = e^{-\beta y + (\alpha-1)\ln y + \alpha\ln\beta - \ln\Gamma(\alpha)}$$

where

$$a_1(y) = \ln y, \quad a_2(y) = y, \quad b_1(\theta) = \alpha, \quad b_2(\theta) = -\beta,$$
$$c(\theta) = \alpha\ln\beta - \ln\Gamma(\alpha), \quad d(y) = -\ln y.$$

In this example, the joint sufficient statistics for $\theta_1 = \alpha$ and $\theta_2 = \beta$ are $\sum_{i=1}^n \ln y_i$ and $\sum_{i=1}^n y_i$, respectively.

3.2 Some Important Properties

The expected value and variance of $a(Y)$ can be obtained for the exponential family, assuming that the order of integration and differentiation can be interchanged.
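Before deriving these moments, the joint sufficiency in Example 3.7 can be illustrated numerically: two samples that agree on both $\sum y_i$ and $\sum y_i^2$ have identical normal likelihoods at every $(\mu, \sigma^2)$. A short Python sketch (the sample values are arbitrary choices of ours, constructed to share both statistics):

```python
import math

def normal_loglik(sample, mu, var):
    # log-likelihood of an i.i.d. N(mu, var) sample
    return sum(-(y - mu) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)
               for y in sample)

# two different samples sharing BOTH joint sufficient statistics:
# sum(y_i) = 3 and sum(y_i^2) = 5
y = [0.0, 1.0, 2.0]
x = [1.5, (1.5 + math.sqrt(3.25)) / 2, (1.5 - math.sqrt(3.25)) / 2]

assert abs(sum(y) - sum(x)) < 1e-12
assert abs(sum(v * v for v in y) - sum(v * v for v in x)) < 1e-12

# identical likelihoods for every (mu, sigma^2), as joint sufficiency implies
for mu, var in [(0.0, 1.0), (1.0, 0.5), (-2.0, 4.0)]:
    assert abs(normal_loglik(y, mu, var) - normal_loglik(x, mu, var)) < 1e-9
```

The normal log-likelihood depends on the data only through $n$, $\sum y_i$, and $\sum y_i^2$, so matching those quantities forces the likelihood functions to coincide.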
We know that the exponential family is represented by

$$f(y; \theta) = e^{a(y)b(\theta) + c(\theta) + d(y)}$$

and after differentiating with respect to the parameter we obtain

$$\frac{df(y; \theta)}{d\theta} = [a(y)b'(\theta) + c'(\theta)]f(y; \theta),$$

and interchanging differentiation and integration in the following expression, it can be shown that

$$\int \frac{df(y; \theta)}{d\theta}\, dy = \int [a(y)b'(\theta) + c'(\theta)]f(y; \theta)\, dy = 0. \qquad (3.8)$$

It follows directly from (3.8) that

$$b'(\theta)E[a(Y)] + c'(\theta) = 0. \qquad (3.9)$$

Hence, the expected value can be obtained from the following equation:

$$E[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}.$$

It can be shown, using the same regularity assumptions, that the variance is

$$\mathrm{Var}[a(Y)] = \frac{b''(\theta)c'(\theta)/b'(\theta) - c''(\theta)}{[b'(\theta)]^2} = \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3}.$$

The log likelihood function for an exponential family of distributions is

$$l(\theta; y) = a(y)b(\theta) + c(\theta) + d(y)$$

and the score statistic is

$$U(\theta; y) = \frac{dl(\theta; y)}{d\theta} = a(y)b'(\theta) + c'(\theta).$$

It can be shown that

$$E(U) = b'(\theta)\left[-\frac{c'(\theta)}{b'(\theta)}\right] + c'(\theta) = 0$$

and

$$I = \mathrm{Var}(U) = [b'(\theta)]^2\,\mathrm{Var}[a(Y)] = \frac{b''(\theta)c'(\theta)}{b'(\theta)} - c''(\theta).$$

Another important property of $U$ is

$$\mathrm{Var}(U) = E(U^2) = -E(U').$$

Example 3.9 Binomial Distribution

It has been shown from the exponential family form that

$$a(y) = y, \quad b(\theta) = \ln\frac{p}{1-p}, \quad c(\theta) = n\ln(1-p), \quad d(y) = \ln\binom{n}{y}.$$

Hence,

$$E(Y) = -\frac{c'(\theta)}{b'(\theta)} = -\frac{-\frac{n}{1-p}}{\frac{1}{p} + \frac{1}{1-p}} = np$$

$$\mathrm{Var}(Y) = \frac{b''(\theta)c'(\theta) - c''(\theta)b'(\theta)}{[b'(\theta)]^3} = np(1-p).$$

Example 3.10 Poisson Distribution

$$P(y; \theta) = \frac{e^{-\theta}\theta^y}{y!} = e^{y\ln\theta - \theta - \ln y!}$$

Hence, in the exponential family notation,

$$a(y) = y, \quad b(\theta) = \ln\theta, \quad c(\theta) = -\theta, \quad d(y) = -\ln y!.$$
The expected value and variance of $Y$ are

$$E(Y) = -\frac{-1}{1/\theta} = \theta$$

$$\mathrm{Var}(Y) = \frac{(-1/\theta^2)(-1) - (0)(1/\theta)}{[1/\theta]^3} = \theta.$$

Example 3.11 Exponential Distribution

$$f(y; \theta) = \theta e^{-\theta y} = e^{-\theta y + \ln\theta}.$$

In the exponential family notation, $a(y) = y$, $b(\theta) = -\theta$, $c(\theta) = \ln\theta$, $d(y) = 0$. For the exponential distribution, the expected value and variance are

$$E(Y) = -\frac{1/\theta}{-1} = \frac{1}{\theta}$$

$$\mathrm{Var}(Y) = \frac{(0)(1/\theta) - (-1/\theta^2)(-1)}{[-1]^3} = \frac{1}{\theta^2}.$$

Example 3.12 Normal Distribution with mean $\mu$ and variance 1

$$f(y; \mu, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-(y-\mu)^2/2} = e^{y\mu - \frac{1}{2}\mu^2 - \frac{1}{2}\ln(2\pi) - \frac{1}{2}y^2}$$

Using the exponential family form, it is shown that $a(y) = y$, $b(\theta) = \mu$, $c(\theta) = -\frac{1}{2}\mu^2$, $d(y) = -\frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi)$. The expected value and variance can be obtained from the exponential form as follows:

$$E(Y) = -\frac{-\mu}{1} = \mu$$

$$\mathrm{Var}(Y) = \frac{(0)(-\mu) - (-1)(1)}{[1]^3} = 1.$$

Chapter 4
Generalized Linear Models

4.1 Introduction

Since the seminal work of Nelder and Wedderburn (1972) and the publication of the book by McCullagh and Nelder (1983), the concept of Generalized Linear Models (GLMs) has played an increasingly important role in statistical theory and applications. We have presented linear regression models in Chap. 2 and the exponential family of distributions in Chap. 3. A class of models that generalizes the linear model to both normal and nonnormal outcomes, or to both discrete and continuous outcomes, when the probability distribution of the outcome variable belongs to the exponential family of distributions, is classified under the broad name of generalized linear models. The linear regression models presented in Chap. 2 can be shown to be a special case of the GLM. In regression modeling, linear or nonlinear, the assumption on the outcome variable is essentially one of normality, but in a wide range of situations this assumption is quite unrealistic.
An obvious example is a binary response expressing the presence or absence of a disease, where the outcome variable follows a Bernoulli distribution. Another example is the number of accidents during a specified interval of time, which provides count data that follow a Poisson distribution. If we are interested in an event such as the first success in a series of trials after successive failures, the distribution is geometric; this can be applied to analyze the incidence of a disease from follow-up data. Similarly, if the event is defined as attaining a fixed number of successes in a series of trials, such as securing a certain number of wins in a football league competition to qualify for the next round, then the outcome variable may follow a negative binomial distribution. In the case of continuous outcome variables, it is not very common in practice to find outcomes that follow a normal distribution. In lifetime data for analyzing reliability or survival, the distributions are highly skewed and normality assumptions cannot be used. Hence, for nonnormal distributions such as the exponential or gamma, the linear regression models are not directly applicable. To address this wide variety of situations where the normality assumption cannot be adopted for linear modeling, the GLM provides a general framework to link the underlying random and systematic components.

4.2 Exponential Family and GLM

For generalized linear models, it is assumed that the distribution of the outcome variable can be represented in the form of the exponential family of distributions.
Let $Y$ be a random variable with probability density or mass function $f(y; \theta)$, where $\theta$ is a single parameter. Then $Y$ belongs to the exponential family of distributions if the probability density or mass function can be expressed as shown in (3.1):

$$f(y; \theta) = e^{a(y)b(\theta) + c(\theta) + d(y)}$$

where $a(y)$ and $d(y)$ are functions of $y$, and $b(\theta)$ and $c(\theta)$ are functions of the parameter $\theta$ only. If $a(y) = y$ and $b(\theta) = \theta$, then $\theta$ is called a natural parameter. Then (3.1) can be expressed in a different form convenient for the GLM:

$$f(y; \theta) = e^{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)} \qquad (4.1)$$

where $b(\theta)$ is a new function of $\theta$, $a(\phi)$ is a function of the dispersion parameter $\phi$, and $c(y, \phi)$ is a function of $y$ and $\phi$.

Some Examples

Example 4.1 Binomial

$$f(y; n, p) = \binom{n}{y} p^y (1-p)^{n-y} = e^{\ln\binom{n}{y} + y\ln p + (n-y)\ln(1-p)} = e^{\frac{y\ln\frac{p}{1-p} - (-n\ln(1-p))}{1} + \ln\binom{n}{y}}$$

Here

$$\theta = \ln\frac{p}{1-p}, \quad b(\theta) = -n\ln(1-p), \quad c(y, \phi) = \ln\binom{n}{y}, \quad a(\phi) = 1.$$

Example 4.2 Poisson

$$f(y; \lambda) = \frac{e^{-\lambda}\lambda^y}{y!} = e^{y\ln\lambda - \lambda - \ln y!} = e^{\frac{y\ln\lambda - \lambda}{1} - \ln y!}$$

where $\theta = \ln\lambda$, $b(\theta) = \lambda$, $a(\phi) = 1$, $c(y, \phi) = -\ln y!$.

Example 4.3 Exponential

$$f(y; \lambda) = \lambda e^{-\lambda y} = e^{-\lambda y + \ln\lambda} = e^{\frac{y(-\lambda) - (-\ln\lambda)}{1}}$$

where $\theta = -\lambda$, $b(\theta) = -\ln\lambda$, $a(\phi) = 1$, $c(y, \phi) = 0$.

Example 4.4 Normal Distribution with mean zero and variance $\sigma^2$

$$f(y; 0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-y^2/2\sigma^2} = e^{-y^2/2\sigma^2 - \frac{1}{2}\ln(2\pi\sigma^2)}$$

There is no natural parameter in this case.

Example 4.5 Normal Distribution with mean $\mu$ and variance 1
$Y \sim N(\mu, 1)$:

$$f(y; \mu, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(y-\mu)^2} = e^{-\frac{1}{2}(y^2 - 2\mu y + \mu^2) - \frac{1}{2}\ln(2\pi)} = e^{y\mu - \frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi) - \frac{1}{2}\mu^2} = e^{\frac{y\mu - \frac{1}{2}\mu^2}{1} + \left(-\frac{1}{2}y^2 - \frac{1}{2}\ln(2\pi)\right)}$$

where

$$\theta = \mu, \quad b(\theta) = \mu^2/2, \quad a(\phi) = 1, \quad c(y, \phi) = -\frac{1}{2}\left[y^2 + \ln(2\pi)\right].$$

Example 4.6 Gamma

$$f(y; \lambda) = \frac{\lambda^r}{\Gamma(r)}\, y^{r-1} e^{-\lambda y} = e^{-\lambda y + (r-1)\ln y - \ln\Gamma(r) + r\ln\lambda} = e^{\frac{-\lambda y + r\ln\lambda}{1} + (r-1)\ln y - \ln\Gamma(r)}$$

where

$$\theta = -\lambda, \quad b(\theta) = -r\ln\lambda, \quad a(\phi) = 1, \quad c(y, \phi) = (r-1)\ln y - \ln\Gamma(r).$$

Example 4.7 $Y \sim N(\mu, \sigma^2)$

$$f(y; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(y-\mu)^2} = e^{-\frac{1}{2\sigma^2}(y^2 - 2y\mu + \mu^2) - \frac{1}{2}\ln(2\pi\sigma^2)} = e^{\frac{y\mu - \frac{\mu^2}{2}}{\sigma^2} - \frac{1}{2\sigma^2}y^2 - \frac{1}{2}\ln(2\pi\sigma^2)}$$

where

$$\theta = \mu, \quad b(\theta) = \mu^2/2, \quad a(\phi) = \sigma^2, \quad c(y, \phi) = -\frac{1}{2\sigma^2}y^2 - \frac{1}{2}\ln(2\pi\sigma^2).$$

4.3 Expected Value and Variance

The expected value and variance of $Y$ can be obtained from (4.1), assuming that the order of integration (summation in the case of a discrete variable) and differentiation can be interchanged. Differentiating $f(y; \theta)$ with respect to $\theta$, we obtain

$$\frac{df(y; \theta)}{d\theta} = \frac{1}{a(\phi)}[y - b'(\theta)]f(y; \theta)$$

and interchanging differentiation and integration in the following