Econometric Modelling with Time Series Specification, Estimation and Testing V. L. Martin, A. S. Hurn and D. Harris iv Preface This book provides a general framework for specifying, estimating and test- ing time series econometric models. Special emphasis is given to estima- tion by maximum likelihood, but other methods are also discussed includ- ing quasi-maximum likelihood estimation, generalized method of moments, nonparametrics and estimation by simulation. An important advantage of adopting the principle of maximum likelihood as the unifying framework for the book is that many of the estimators and test statistics proposed in econo- metrics can be derived within a likelihood framework, thereby providing a coherent vehicle for understanding their properties and interrelationships. In contrast to many existing econometric textbooks, which deal mainly with the theoretical properties of estimators and test statistics through a theorem-proof presentation, this book is very concerned with implemen- tation issues in order to provide a fast-track between the theory and ap- plied work. Consequently many of the econometric methods discussed in the book are illustrated by means of a suite of programs written in GAUSS and MATLABR©.1 The computer code emphasizes the computational side of econometrics and follows the notation in the book as closely as possible, thereby reinforcing the principles presented in the text. More generally, the computer code also helps to bridge the gap between theory and practice by enabling the reproduction of both theoretical and empirical results pub- lished in recent journal articles. The reader, as a result, may build on the code and tailor it to more involved applications. Organization of the Book Part ONE of the book is an exposition of the basic maximum likelihood framework. To implement this approach, three conditions are required: the probability distribution of the stochastic process must be known and spec- ified correctly, the parametric specifications of the moments of the distri- bution must be known and specified correctly, and the likelihood must be tractable. The properties of maximum likelihood estimators are presented and three fundamental testing procedures – namely, the Likelihood Ratio test, the Wald test and the Lagrange Multiplier test – are discussed in detail. There is also a comprehensive treatment of iterative algorithms to compute maximum likelihood estimators when no analytical expressions are available. Part TWO is the usual regression framework taught in standard econo- metric courses but presented within the maximum likelihood framework. 1 GAUSS is a registered trademark of Aptech Systems, Inc. http://www.aptech.com/ and MATLABR© is a registered trademark of The MathWorks, Inc. http://www.mathworks.com/. v Both nonlinear regression models and non-spherical models exhibiting ei- ther autocorrelation or heteroskedasticity, or both, are presented. A further advantage of the maximum likelihood strategy is that it provides a mecha- nism for deriving new estimators and new test statistics, which are designed specifically for non-standard problems. Part THREE provides a coherent treatment of a number of alternative es- timation procedures which are applicable when the conditions to implement maximum likelihood estimation are not satisfied. For the case where the probability distribution is incorrectly specified, quasi-maximum likelihood is appropriate. 
If the joint probability distribution of the data is treated as unknown, then a generalized method of moments estimator is adopted. This estimator has the advantage of circumventing the need to specify the dis- tribution and hence avoids any potential misspecification from an incorrect choice of the distribution. An even less restrictive approach is not to specify either the distribution or the parametric form of the moments of the distri- bution and use nonparametric procedures to model either the distribution of variables or the relationships between variables. Simulation estimation methods are used for models where the likelihood is intractable arising, for example, from the presence of latent variables. Indirect inference, efficient methods of moments and simulated methods of moments are presented and compared. Part FOUR examines stationary time series models with a special empha- sis on using maximum likelihood methods to estimate and test these models. Both single equation models, including the autoregressive moving average class of models, and multiple equation models, including vector autoregres- sions and structural vector autoregressions, are dealt with in detail. Also discussed are linear factor models where the factors are treated as latent. The presence of the latent factor means that the full likelihood is generally not tractable. However, if the models are specified in terms of the normal distribution with moments based on linear parametric representations, a Kalman filter is used to rewrite the likelihood in terms of the observable variables thereby making estimation and testing by maximum likelihood feasible. Part FIVE focusses on nonstationary time series models and in particular tests for unit roots and cointegration. Some important asymptotic results for nonstationary time series are presented followed by a comprehensive dis- cussion of testing for unit roots. Cointegration is tackled from the perspec- tive that the well-known Johansen estimator may be usefully interpreted as a maximum likelihood estimator based on the assumption of a normal distribution applied to a system of equations that is subject to a set of vi cross-equation restrictions arising from the assumption of common long-run relationships. Further, the trace and maximum eigenvalue tests of cointegra- tion are shown to be likelihood ratio tests. Part SIX is concerned with nonlinear time series models. Models that are nonlinear in mean include the threshold class of model, bilinear models and also artificial neural network modelling, which, contrary to many existing treatments, is again addressed from the econometric perspective of estima- tion and testing based on maximum likelihood methods. Nonlinearities in variance are dealt with in terms of the GARCH class of models. The final chapter focusses on models that deal with discrete or truncated time series data. Even in a project of this size and scope, sacrifices have had to be made to keep the length of the book manageable. Accordingly, there are a number of important topics that have had to be omitted. (i) Although Bayesian methods are increasingly being used in many areas of statistics and econometrics, no material on Bayesian econometrics is included. This is an important field in its own right and the interested reader is referred to recent books by Koop (2003), Geweke (2005), Koop, Poirier and Tobias (2007) and Greenberg (2008), inter alia. Where ap- propriate, references to Bayesian methods are provided in the body of the text. 
(ii) With great reluctance a chapter on bootstrapping was not included be- cause of space issues. A good place to start reading is the introductory text by Efron and Tibshirani (1993) and the useful surveys by Horowitz (1997) and Li and Maddala (1996b,1996a). (iii) In Part SIX, in the chapter dealing with modelling the variance of time series, there are important recent developments in stochastic volatility and realized volatility that would be worthy of inclusion. For stochastic volatility, there is an excellent volume of readings edited by Shephard (2005), while the seminal articles in the area of realized volatility are Anderson et al. (2001, 2003). The fact that these areas have not been covered should not be regarded as a value judgement about their relative importance. Instead the subject matter chosen for inclusion reflects a balance between the interests of the authors and purely operational decisions aimedat preserving the flow and continuity of the book. vii Computer Code Specifically, computer code is available from a companion website to repro- duce relevant examples in the text, to reproduce figures in the text that are not part of an example, to reproduce the applications presented in the final section of each chapter, and to complete the exercises. Where applicable, the time series data used in these examples, applications and exercises are also available in a number of different formats. Presenting numerical results in the examples immediately gives rise to two important issues concerning numerical precision. (1) In all of the examples listed in the front of the book where computer code has been used, the numbers appearing in the text are rounded versions of those generated by the code. Accordingly, the rounded numbers should be interpreted as such and not be used independently of the computer code to try and reproduce the numbers reported in the text. (2) In many of the examples, simulation has been used to demonstrate a concept. Since GAUSS and MATLAB have different random number gen- erators, the results generated by the different sets of code will not be identical to one another. For consistency we have always used the GAUSS output for reporting purposes. Although GAUSS and MATLAB are very similar high-level programming languages, there are some important differences that require explanation. Probably the most important difference is one of programming style. GAUSS programs are script files that allow calls to both inbuilt GAUSS and user- defined procedures. MATLAB, on the other hand, does not support the use of user-defined functions in script files. Furthermore, MATLAB programming style favours writing user-defined functions in separate files and then calling them as if they were in-built functions. This style of programming does not suit the learning-by-doing environment that the book tries to create. Con- sequently, the MATLAB programs are written mainly as function files with a main function and all the required user-defined functions required to im- plement the procedure in the same file. The only exception to this rule is that a few MATLAB utility files, which greatly facilitate the conversion and interpretation of code from GAUSS to MATLAB, which are provided as sep- arate stand-alone MATLAB function files. Finally, all the figures in the text were created using MATLAB together with a utility file laprint.m written by Arno Linnemann of the University of Kessel.2 2 A user guide is available at http://www.uni-kassel.de/fb16/rat/matlab/laprint/laprintdoc.ps. 
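To illustrate the single-file MATLAB programming style described above, a minimal hypothetical sketch of such a function file is given below. The file name, function names and the exponential log-likelihood it evaluates are purely illustrative and do not correspond to any particular program on the companion website.

% demo_style.m -- a main function followed by the user-defined
% functions it calls, all contained in the same file.
function demo_style()
    y    = [2.1 2.2 3.1 1.6 2.5 0.5];   % illustrative data
    that = 1/mean(y);                    % MLE of an exponential parameter
    fprintf('theta hat = %.3f, lnL = %.3f\n', that, loglik(that,y));
end

function lnl = loglik(theta,y)
    % average log-likelihood of the exponential distribution
    lnl = log(theta) - theta*mean(y);
end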
viii Acknowledgements Creating a manuscript of this scope and magnitude is a daunting task and there are many people to whom we are indebted. In particular, we would like to thank Kenneth Lindsay, Adrian Pagan and Andy Tremayne for their careful reading of various chapters of the manuscript and for many helpful comments and suggestions. Gael Martin helped with compiling a suitable list of references to Bayesian econometric methods. Ayesha Scott compiled the index, a painstaking task for a manuscript of this size. Many others have commented on earlier drafts of chapters and we are grateful to the following individuals: our colleagues, Gunnar B̊ardsen, Ralf Becker, Adam Clements, Vlad Pavlov and Joseph Jeisman; and our graduate students, Tim Christensen, Christopher Coleman-Fenn, Andrew McClelland, Jessie Wang and Vivianne Vilar. We also wish to express our deep appreciation to the team at Cambridge University Press, particularly Peter C. B. Phillips for his encouragement and support throughout the long gestation period of the book as well as for reading and commenting on earlier drafts. Scott Parris, with his energy and enthusiasm for the project, was a great help in sustaining the authors during the long slog of completing the manuscript. Our thanks are also due to our CUP readers who provided detailed and constructive feedback at various stages in the compilation of the final document. Michael Erkelenz of Fine Line Writers edited the entire manuscript, helped to smooth out the prose and provided particular assistance with the correct use of adjectival constructions in the passive voice. It is fair to say that writing this book was an immense task that involved the consumption of copious quantities of chillies, champagne and port over a protracted period of time. The biggest debt of gratitude we owe, therefore, is to our respective families. To Gael, Sarah and David; Cath, Iain, Robert and Tim; and Fiona and Caitlin: thank you for your patience, your good humour in putting up with and cleaning up after many a pizza night, your stoicism in enduring yet another vacant stare during an important conversation and, ultimately, for making it all worthwhile. 
Vance Martin, Stan Hurn & David Harris November 2011 Contents List of illustrations page 1 Computer Code used in the Examples 4 PART ONE MAXIMUM LIKELIHOOD 1 1 The Maximum Likelihood Principle 3 1.1 Introduction 3 1.2 Motivating Examples 3 1.3 Joint Probability Distributions 9 1.4 Maximum Likelihood Framework 12 1.4.1 The Log-Likelihood Function 12 1.4.2 Gradient 18 1.4.3 Hessian 20 1.5 Applications 23 1.5.1 Stationary Distribution of the Vasicek Model 23 1.5.2 Transitional Distribution of the Vasicek Model 25 1.6 Exercises 28 2 Properties of Maximum Likelihood Estimators 35 2.1 Introduction 35 2.2 Preliminaries 35 2.2.1 Stochastic Time Series Models and Their Prop- erties 36 2.2.2 Weak Law of Large Numbers 41 2.2.3 Rates of Convergence 45 2.2.4 Central Limit Theorems 47 2.3 Regularity Conditions 55 2.4 Properties of the Likelihood Function 57 x Contents 2.4.1 The Population Likelihood Function 57 2.4.2 Moments of the Gradient 58 2.4.3 The Information Matrix 61 2.5 Asymptotic Properties 63 2.5.1 Consistency 63 2.5.2 Normality 67 2.5.3 Efficiency 68 2.6 Finite-Sample Properties 72 2.6.1 Unbiasedness 73 2.6.2 Sufficiency 74 2.6.3 Invariance 75 2.6.4 Non-Uniqueness 76 2.7 Applications 76 2.7.1 Portfolio Diversification 78 2.7.2 Bimodal Likelihood 80 2.8 Exercises 82 3 Numerical Estimation Methods 91 3.1 Introduction 91 3.2 Newton Methods 92 3.2.1 Newton-Raphson 93 3.2.2 Method of Scoring 94 3.2.3 BHHH Algorithm 95 3.2.4 Comparative Examples 98 3.3 Quasi-Newton Methods 101 3.4 Line Searching 102 3.5 Optimisation Based on Function Evaluation 104 3.6 Computing Standard Errors 106 3.7 Hints for Practical Optimization 109 3.7.1 Concentrating the Likelihood 109 3.7.2 Parameter Constraints 110 3.7.3 Choice of Algorithm 111 3.7.4 Numerical Derivatives 112 3.7.5 Starting Values 113 3.7.6 Convergence Criteria 113 3.8 Applications 114 3.8.1 Stationary Distribution of the CIR Model 114 3.8.2 Transitional Distribution of the CIR Model 116 3.9 Exercises 118 Contents xi 4 Hypothesis Testing 124 4.1 Introduction 124 4.2 Overview 124 4.3 Types of Hypotheses 126 4.3.1 Simple and Composite Hypotheses 126 4.3.2 Linear Hypotheses 127 4.3.3 Nonlinear Hypotheses 128 4.4 Likelihood Ratio Test 129 4.5 Wald Test 133 4.5.1 Linear Hypotheses 134 4.5.2 Nonlinear Hypotheses 136 4.6 Lagrange Multiplier Test 137 4.7 Distribution Theory 139 4.7.1 Asymptotic Distribution of the Wald Statistic 139 4.7.2 Asymptotic Relationships Among the Tests 142 4.7.3 Finite Sample Relationships 143 4.8 Size and Power Properties 145 4.8.1 Size of a Test 145 4.8.2 Power of a Test 146 4.9 Applications 148 4.9.1 Exponential Regression Model 148 4.9.2 Gamma Regression Model 151 4.10 Exercises 153 PART TWO REGRESSION MODELS 159 5 Linear Regression Models 161 5.1 Introduction 161 5.2 Specification 162 5.2.1 Model Classification 162 5.2.2 Structural and Reduced Forms 163 5.3 Estimation 166 5.3.1 Single Equation: Ordinary Least Squares 166 5.3.2 Multiple Equations: FIML 170 5.3.3 Identification 175 5.3.4 Instrumental Variables 177 5.3.5 Seemingly Unrelated Regression181 5.4 Testing 182 5.5 Applications 187 xii Contents 5.5.1 Linear Taylor Rule 187 5.5.2 The Klein Model of the U.S. 
Economy 189 5.6 Exercises 191 6 Nonlinear Regression Models 199 6.1 Introduction 199 6.2 Specification 199 6.3 Maximum Likelihood Estimation 201 6.4 Gauss-Newton 208 6.4.1 Relationship to Nonlinear Least Squares 212 6.4.2 Relationship to Ordinary Least Squares 213 6.4.3 Asymptotic Distributions 213 6.5 Testing 214 6.5.1 LR, Wald and LM Tests 214 6.5.2 Nonnested Tests 218 6.6 Applications 221 6.6.1 Robust Estimation of the CAPM 221 6.6.2 Stochastic Frontier Models 224 6.7 Exercises 228 7 Autocorrelated Regression Models 234 7.1 Introduction 234 7.2 Specification 234 7.3 Maximum Likelihood Estimation 236 7.3.1 Exact Maximum Likelihood 237 7.3.2 Conditional Maximum Likelihood 238 7.4 Alternative Estimators 240 7.4.1 Gauss-Newton 241 7.4.2 Zig-zag Algorithms 244 7.4.3 Cochrane-Orcutt 247 7.5 Distribution Theory 248 7.5.1 Maximum Likelihood Estimator 249 7.5.2 Least Squares Estimator 253 7.6 Lagged Dependent Variables 258 7.7 Testing 260 7.7.1 Alternative LM Test I 262 7.7.2 Alternative LM Test II 263 7.7.3 Alternative LM Test III 264 7.8 Systems of Equations 265 7.8.1 Estimation 266 7.8.2 Testing 268 Contents xiii 7.9 Applications 268 7.9.1 Illiquidity and Hedge Funds 268 7.9.2 Beach-Mackinnon Simulation Study 269 7.10 Exercises 271 8 Heteroskedastic Regression Models 280 8.1 Introduction 280 8.2 Specification 280 8.3 Estimation 283 8.3.1 Maximum Likelihood 283 8.3.2 Relationship with Weighted Least Squares 286 8.4 Distribution Theory 289 8.5 Testing 289 8.6 Heteroskedasticity in Systems of Equations 295 8.6.1 Specification 295 8.6.2 Estimation 297 8.6.3 Testing 299 8.6.4 Heteroskedastic and Autocorrelated Disturbances 300 8.7 Applications 302 8.7.1 The Great Moderation 302 8.7.2 Finite Sample Properties of the Wald Test 304 8.8 Exercises 306 PART THREE OTHER ESTIMATION METHODS 313 9 Quasi-Maximum Likelihood Estimation 315 9.1 Introduction 315 9.2 Misspecification 316 9.3 The Quasi-Maximum Likelihood Estimator 320 9.4 Asymptotic Distribution 323 9.4.1 Misspecification and the Information Equality 325 9.4.2 Independent and Identically Distributed Data 328 9.4.3 Dependent Data: Martingale Difference Score 329 9.4.4 Dependent Data and Score 330 9.4.5 Variance Estimation 331 9.5 Quasi-Maximum Likelihood and Linear Regression 333 9.5.1 Nonnormality 336 9.5.2 Heteroskedasticity 337 9.5.3 Autocorrelation 338 9.5.4 Variance Estimation 342 xiv Contents 9.6 Testing 346 9.7 Applications 348 9.7.1 Autoregressive Models for Count Data 348 9.7.2 Estimating the Parameters of the CKLS Model 351 9.8 Exercises 354 10 Generalized Method of Moments 361 10.1 Introduction 361 10.2 Motivating Examples 362 10.2.1 Population Moments 362 10.2.2 Empirical Moments 363 10.2.3 GMM Models from Conditional Expectations 368 10.2.4 GMM and Maximum Likelihood 371 10.3 Estimation 372 10.3.1 The GMM Objective Function 372 10.3.2 Asymptotic Properties 373 10.3.3 Estimation Strategies 378 10.4 Over-Identification Testing 382 10.5 Applications 387 10.5.1 Monte Carlo Evidence 387 10.5.2 Level Effect in Interest Rates 393 10.6 Exercises 396 11 Nonparametric Estimation 404 11.1 Introduction 404 11.2 The Kernel Density Estimator 405 11.3 Properties of the Kernel Density Estimator 409 11.3.1 Finite Sample Properties 410 11.3.2 Optimal Bandwidth Selection 410 11.3.3 Asymptotic Properties 414 11.3.4 Dependent Data 416 11.4 Semi-Parametric Density Estimation 417 11.5 The Nadaraya-Watson Kernel Regression Estimator 419 11.6 Properties of Kernel Regression Estimators 423 11.7 Bandwidth Selection for Kernel Regression 427 11.8 Multivariate 
Kernel Regression 430 11.9 Semi-parametric Regression of the Partial Linear Model 432 11.10 Applications 433 11.10.1Derivatives of a Nonlinear Production Function 434 11.10.2Drift and Diffusion Functions of SDEs 436 11.11 Exercises 439 Contents xv 12 Estimation by Simulation 447 12.1 Introduction 447 12.2 Motivating Example 448 12.3 Indirect Inference 450 12.3.1 Estimation 451 12.3.2 Relationship with Indirect Least Squares 455 12.4 Efficient Method of Moments (EMM) 456 12.4.1 Estimation 456 12.4.2 Relationship with Instrumental Variables 458 12.5 Simulated Generalized Method of Moments (SMM) 459 12.6 Estimating Continuous-Time Models 461 12.6.1 Brownian Motion 464 12.6.2 Geometric Brownian Motion 467 12.6.3 Stochastic Volatility 470 12.7 Applications 472 12.7.1 Simulation Properties 473 12.7.2 Empirical Properties 475 12.8 Exercises 477 PART FOUR STATIONARY TIME SERIES 483 13 Linear Time Series Models 485 13.1 Introduction 485 13.2 Time Series Properties of Data 486 13.3 Specification 488 13.3.1 Univariate Model Classification 489 13.3.2 Multivariate Model Classification 491 13.3.3 Likelihood 493 13.4 Stationarity 493 13.4.1 Univariate Examples 494 13.4.2 Multivariate Examples 495 13.4.3 The Stationarity Condition 496 13.4.4 Wold’s Representation Theorem 497 13.4.5 Transforming a VAR to a VMA 498 13.5 Invertibility 501 13.5.1 The Invertibility Condition 501 13.5.2 Transforming a VMA to a VAR 502 13.6 Estimation 502 13.7 Optimal Choice of Lag Order 506 xvi Contents 13.8 Distribution Theory 508 13.9 Testing 511 13.10 Analyzing Vector Autoregressions 513 13.10.1Granger Causality Testing 515 13.10.2Impulse Response Functions 517 13.10.3Variance Decompositions 523 13.11 Applications 525 13.11.1Barro’s Rational Expectations Model 525 13.11.2The Campbell-Shiller Present Value Model 526 13.12 Exercises 528 14 Structural Vector Autoregressions 537 14.1 Introduction 537 14.2 Specification 538 14.2.1 Short-Run Restrictions 542 14.2.2 Long-Run Restrictions 544 14.2.3 Short-Run and Long-Run Restrictions 548 14.2.4 Sign Restrictions 550 14.3 Estimation 553 14.4 Identification 558 14.5 Testing 559 14.6 Applications 561 14.6.1 Peersman’s Model of Oil Price Shocks 561 14.6.2 A Portfolio SVAR Model of Australia 563 14.7 Exercises 566 15 Latent Factor Models 571 15.1 Introduction 571 15.2 Motivating Examples 572 15.2.1 Empirical 572 15.2.2 Theoretical 574 15.3 The Recursions of the Kalman Filter 575 15.3.1 Univariate 576 15.3.2 Multivariate 581 15.4 Extensions 585 15.4.1 Intercepts 585 15.4.2 Dynamics 585 15.4.3 Nonstationary Factors 587 15.4.4 Exogenous and Predetermined Variables 589 15.5 Factor Extraction 589 15.6 Estimation 591 Contents xvii 15.6.1 Identification 591 15.6.2 Maximum Likelihood 591 15.6.3 Principal Components Estimator 593 15.7 Relationship to VARMA Models 596 15.8 Applications 597 15.8.1 The Hodrick-Prescott Filter 597 15.8.2 A Factor Model of Spreads with Money Shocks 601 15.9 Exercises 603 PART FIVE NON-STATIONARY TIME SERIES 613 16 Nonstationary Distribution Theory 615 16.1 Introduction 615 16.2 Specification 616 16.2.1 Models of Trends 616 16.2.2 Integration 618 16.3 Estimation 620 16.3.1 Stationary Case 621 16.3.2 Nonstationary Case: Stochastic Trends 624 16.3.3 Nonstationary Case: Deterministic Trends 626 16.4 Asymptotics for Integrated Processes 629 16.4.1 Brownian Motion 630 16.4.2 Functional Central Limit Theorem 631 16.4.3 Continuous Mapping Theorem 635 16.4.4 Stochastic Integrals 637 16.5 Multivariate Analysis 638 16.6 Applications 640 16.6.1 Least Squares Estimator of the 
AR(1) Model 641 16.6.2 Trend Misspecification 643 16.7 Exercises 644 17 Unit Root Testing 651 17.1 Introduction 651 17.2 Specification 651 17.3 Detrending 653 17.3.1 Ordinary Least Squares: Dickey and Fuller 655 17.3.2 First Differences: Schmidt and Phillips 656 17.3.3 Generalized Least Squares: Elliott, Rothenberg and Stock 657 17.4 Testing 658 xviii Contents 17.4.1 Dickey-Fuller Tests 659 17.4.2 M Tests 660 17.5 Distribution Theory 662 17.5.1 Ordinary Least Squares Detrending 664 17.5.2 Generalized Least Squares Detrending 665 17.5.3 Simulating Critical Values 66717.6 Power 668 17.6.1 Near Integration and the Ornstein-Uhlenbeck Processes 669 17.6.2 Asymptotic Local Power 671 17.6.3 Point Optimal Tests 671 17.6.4 Asymptotic Power Envelope 673 17.7 Autocorrelation 675 17.7.1 Dickey-Fuller Test with Autocorrelation 675 17.7.2 M Tests with Autocorrelation 676 17.8 Structural Breaks 678 17.8.1 Known Break Point 681 17.8.2 Unknown Break Point 684 17.9 Applications 685 17.9.1 Power and the Initial Value 685 17.9.2 Nelson-Plosser Data Revisited 687 17.10 Exercises 687 18 Cointegration 695 18.1 Introduction 695 18.2 Long-Run Economic Models 696 18.3 Specification: VECM 698 18.3.1 Bivariate Models 698 18.3.2 Multivariate Models 700 18.3.3 Cointegration 701 18.3.4 Deterministic Components 703 18.4 Estimation 705 18.4.1 Full-Rank Case 706 18.4.2 Reduced-Rank Case: Iterative Estimator 707 18.4.3 Reduced Rank Case: Johansen Estimator 709 18.4.4 Zero-Rank Case 715 18.5 Identification 716 18.5.1 Triangular Restrictions 716 18.5.2 Structural Restrictions 717 18.6 Distribution Theory 718 Contents xix 18.6.1 Asymptotic Distribution of the Eigenvalues 718 18.6.2 Asymptotic Distribution of the Parameters 720 18.7 Testing 724 18.7.1 Cointegrating Rank 724 18.7.2 Cointegrating Vector 727 18.7.3 Exogeneity 730 18.8 Dynamics 731 18.8.1 Impulse responses 731 18.8.2 Cointegrating Vector Interpretation 732 18.9 Applications 732 18.9.1 Rank Selection Based on Information Criteria 733 18.9.2 Effects of Heteroskedasticity on the Trace Test 735 18.10 Exercises 737 PART SIX NONLINEAR TIME SERIES 747 19 Nonlinearities in Mean 749 19.1 Introduction 749 19.2 Motivating Examples 749 19.3 Threshold Models 755 19.3.1 Specification 755 19.3.2 Estimation 756 19.3.3 Testing 758 19.4 Artificial Neural Networks 761 19.4.1 Specification 761 19.4.2 Estimation 764 19.4.3 Testing 766 19.5 Bilinear Time Series Models 767 19.5.1 Specification 767 19.5.2 Estimation 768 19.5.3 Testing 769 19.6 Markov Switching Model 770 19.7 Nonparametric Autoregression 774 19.8 Nonlinear Impulse Responses 775 19.9 Applications 779 19.9.1 A Multiple Equilibrium Model of Unemployment 779 19.9.2 Bivariate Threshold Models of G7 Countries 781 19.10 Exercises 784 xx Contents 20 Nonlinearities in Variance 795 20.1 Introduction 795 20.2 Statistical Properties of Asset Returns 795 20.3 The ARCH Model 799 20.3.1 Specification 799 20.3.2 Estimation 801 20.3.3 Testing 804 20.4 Univariate Extensions 807 20.4.1 GARCH 807 20.4.2 Integrated GARCH 812 20.4.3 Additional Variables 813 20.4.4 Asymmetries 814 20.4.5 Garch-in-Mean 815 20.4.6 Diagnostics 817 20.5 Conditional Nonnormality 818 20.5.1 Parametric 819 20.5.2 Semi-Parametric 821 20.5.3 Nonparametric 821 20.6 Multivariate GARCH 825 20.6.1 VECH 826 20.6.2 BEKK 827 20.6.3 DCC 830 20.6.4 DECO 836 20.7 Applications 837 20.7.1 DCC and DECO Models of U.S. 
Zero Coupon Yields 837 20.7.2 A Time-Varying Volatility SVAR Model 838 20.8 Exercises 841 21 Discrete Time Series Models 850 21.1 Introduction 850 21.2 Motivating Examples 850 21.3 Qualitative Data 853 21.3.1 Specification 853 21.3.2 Estimation 857 21.3.3 Testing 861 21.3.4 Binary Autoregressive Models 863 21.4 Ordered Data 865 21.5 Count Data 867 21.5.1 The Poisson Regression Model 869 Contents xxi 21.5.2 Integer Autoregressive Models 871 21.6 Duration Data 874 21.7 Applications 876 21.7.1 An ACH Model of U.S. Airline Trades 876 21.7.2 EMM Estimator of Integer Models 879 21.8 Exercises 881 Appendix A Change of Variable in Probability Density Func- tions 887 Appendix B The Lag Operator 888 B.1 Basics 888 B.2 Polynomial Convolution 889 B.3 Polynomial Inversion 890 B.4 Polynomial Decomposition 891 Appendix C FIML Estimation of a Structural Model 892 C.1 Log-likelihood Function 892 C.2 First-order Conditions 892 C.3 Solution 893 Appendix D Additional Nonparametric Results 897 D.1 Mean 897 D.2 Variance 899 D.3 Mean Square Error 901 D.4 Roughness 902 D.4.1 Roughness Results for the Gaussian Distribution 902 D.4.2 Roughness Results for the Gaussian Kernel 903 References 905 Author index 915 Subject index 918 Illustrations 1.1 Probability distributions of y for various models 5 1.2 Probability distributions of y for various models 7 1.3 Log-likelihood function for Poisson distribution 15 1.4 Log-likelihood function for exponential distribution 15 1.5 Log-likelihood function for the normal distribution 17 1.6 Eurodollar interest rates 24 1.7 Stationary density of Eurodollar interest rates 25 1.8 Transitional density of Eurodollar interest rates 27 2.1 Demonstration of the weak law of large numbers 42 2.2 Demonstration of the Lindeberg-Levy central limit theorem 49 2.3 Convergence of log-likelihood function 65 2.4 Consistency of sample mean for normal distribution 65 2.5 Consistency of median for Cauchy distribution 66 2.6 Illustrating asymptotic normality 69 2.7 Bivariate normal distribution 77 2.8 Scatter plot of returns on Apple and Ford stocks 78 2.9 Gradient of the bivariate normal model 81 3.1 Stationary density of Eurodollar interest rates: CIR model 115 3.2 Estimated variance function of CIR model 117 4.1 Illustrating the LR and Wald tests 125 4.2 Illustrating the LM test 126 4.3 Simulated and asymptotic distributions of the Wald test 142 5.1 Simulating a bivariate regression model 166 5.2 Sampling distribution of a weak instrument 180 5.3 U.S. data on the Taylor Rule 188 6.1 Simulated exponential models 201 6.2 Scatter of plot Martin Marietta returns data 222 6.3 Stochastic frontier disturbance distribution 225 7.1 Simulated models with autocorrelated disturbances 236 2 Illustrations 7.2 Distribution of maximum likelihood estimator in an autocorre- lated regression model 252 8.1 Simulated data from heteroskedastic models 282 8.2 The Great Moderation 303 8.3 Sampling distribution of Wald test 305 8.4 Power of Wald test 305 9.1 Comparison of true and misspecified log-likelihood functions 317 9.2 U.S. 
Dollar/British Pound exchange rates 345 9.3 Estimated variance function of CKLS model 353 11.1 Bias and variance of the kernel estimate of density 411 11.2 Kernel estimate of distribution of stock index returns 413 11.3 Bivariate normal density 414 11.4 Semiparametric density estimator 419 11.5 Parametric conditional mean estimates 420 11.6 Nadaraya-Watson nonparametric kernel regression 424 11.7 Effect of bandwidth on kernel regression 425 11.8 Cross validation bandwidth selection 429 11.9 Two-dimensional product kernel 431 11.10 Semiparametric regression 433 11.11 Nonparametric production function 435 11.12 Nonparametric estimates of drift and diffusion functions 438 12.1 Simulated AR(1) model 450 12.2 Illustrating Brownian motion 462 13.1 U.S. macroeconomic data 487 13.2 Plots of simulated stationary time series 490 13.3 Choice of optimal lag order 508 14.1 Bivariate SVAR model 541 14.2 Bivariate SVAR with short-run restrictions 545 14.3 Bivariate SVAR with long-run restrictions 547 14.4 Bivariate SVAR with short- and long-run restrictions 549 14.5 Bivariate SVAR with sign restrictions 552 14.6 Impuse responses of Peerman’s model 564 15.1 Daily U.S. zero coupon rates 573 15.2 Alternative priors for latent factors in the Kalman filter 588 15.3 Factor loadings of a term structure model 595 15.4 Hodrick-Prescott filter of real U.S. GPD 601 16.1 Nelson-Plosser data 618 16.2 Simulated distribution of AR1 parameter 624 16.3 Continuous-time processes 633 16.4 Functional Central Limit Theorem 635 16.5 Distribution of a stochastic integral 638 16.6 Mixed normal distribution 640 17.1 Real U.S. GDP 652 Illustrations 3 17.2 Detrending 658 17.3 Near unit root process 669 17.4 Aymptotic power curve of ADF tests 672 17.5 Asymptotic power envelope of ADF tests 674 17.6 Structural breaks in U.S. GDP 679 17.7 Union of rejections approach 686 18.1 Permanent income hypothesis 696 18.2 Long run money demand 697 18.3 Term structure of U.S. yields 698 18.4 Error correction phase diagram 699 19.1 Propertiesof an AR(2) model 750 19.2 Limit cycle 751 19.3 Strange attractor 752 19.4 Nonlinear error correction model 753 19.5 U.S. unemployment 754 19.6 Threshold functions 757 19.7 Decomposition of an ANN 762 19.8 Simulated bilinear time series models 768 19.9 Markov switching model of U.S. output 773 19.10 Nonparametric estimate of a TAR(1) model 775 19.11 Simulated TAR models for G7 countries 783 20.1 Statistical properties of FTSE returns 796 20.2 Distribution of FTSE returns 799 20.3 News impact curve 801 20.4 ACF of GARCH(1,1) models 810 20.5 Conditional variance of FTSE returns 812 20.6 Risk-return preferences 816 20.7 BEKK model of U.S. zero coupon bonds 829 20.8 DECO model of interest rates 838 20.9 SVAR model of U.K. Libor spread 840 21.1 U.S. 
Federal funds target rate from 1984 to 2009 852 21.2 Money demand equation with a floor interest rate 853 21.3 Duration descriptive statistics for AMR 877 Computer Code used in the Examples (Code is written in GAUSS in which case the extension is .g and in MATLAB in which case the extension is .m) 1.1 basic sample.* 4 1.2 basic sample.* 6 1.3 basic sample.* 6 1.4 basic sample.* 6 1.5 basic sample.* 7 1.6 basic sample.* 8 1.7 basic sample.* 8 1.8 basic sample.* 9 1.10 basic poisson.* 13 1.11 basic exp.* 14 1.12 basic normal like.* 16 1.14 basic poisson.* 18 1.15 basic exp.* 19 1.16 basic normal like.* 19 1.18 basic exp.* 22 1.19 basic normal.* 22 2.5 prop wlln1.* 41 2.6 prop wlln2.* 42 2.8 prop moment.* 45 2.10 prop lindlevy.* 48 2.21 prop consistency.* 64 2.22 prop normal.* 64 2.23 prop cauchy.* 65 2.25 prop asymnorm.* 68 2.28 prop edgeworth.* 72 2.29 prop bias.* 73 3.2 max exp.* 93 3.3 max exp.* 95 3.4 max exp.* 97 3.6 max weibull.* 99 Computer Code used in the Examples 5 3.7 max exp.* 102 3.8 max exp.* 103 4.3 test weibull.* 133 4.5 test weibull.* 135 4.7 test weibull.* 139 4.10 test asymptotic.* 141 4.11 text size.* 145 4.12 test power.* 147 4.13 test power.* 147 5.5 linear simulation.* 165 5.6 linear estimate.* 169 5.7 linear fiml.* 171 5.8 linear fiml.* 173 5.10 linear weak.* 179 5.14 linear lr.*, linear wd.*, linear lm.* 182 5.15 linear fiml lr.*, linear fiml wd.*, linear fiml lm.* 185 6.3 nls simulate.* 200 6.5 nls exponential.* 206 6.7 nls consumption estimate.* 210 6.8 nls contest.* 215 6.11 nls money.* 219 7.1 auto simulate.* 235 7.5 auto invest.* 240 7.8 auto distribution.* 251 7.11 auto test.* 260 7.12 auto system.* 267 8.1 hetero simulate.* 281 8.3 hetero estimate.* 284 8.7 hetero test.* 293 8.9 hetero system.* 298 8.10 hetero system.* 299 8.11 hetero general.* 301 10.2 gmm table.* 366 10.3 gmm table.* 367 10.11 gmm ccapm.* 382 11.1 npd kernel.* 407 11.2 npd property.* 410 11.3 npd ftse.* 412 11.4 npd bivariate.* 414 11.5 npd seminonlin.* 418 11.6 npr parametric.* 419 11.7 npr nadwatson.* 422 11.8 npr property.* 424 6 Computer Code used in the Examples 11.10 npr bivariate.* 430 11.11 npr semi.* 432 12.1 sim mom.* 450 12.3 sim accuracy.* 453 12.4 sim ma1indirect.* 454 12.5 sim ma1emm.* 457 12.6 sim ma1overid.* 460 12.7 sim brownind.*,sim brownemm.* 466 13.1 stsm simulate.* 489 13.8 stsm root.* 496 13.9 stsm root.* 497 13.17 stsm varma.* 504 13.21 stsm anderson.* 511 13.24 stsm recursive.* 513 13.25 stsm recursive.* 516 13.26 stsm recursive.* 522 13.27 stsm recursive.* 523 14.2 svar bivariate.* 540 14.5 svar bivariate.* 544 14.9 svar bivariate.* 547 14.10 svar bivariate.* 548 14.12 svar bivariate.* 552 14.13 svar shortrun.* 554 14.14 svar longrun.* 556 14.15 svar recursive.* 557 14.17 svar test.* 560 14.18 svar test.* 561 15.1 kalman termfig.* 572 15.5 kalman uni.* 580 15.6 kalman multi.* 583 15.8 kalman smooth.* 590 15.9 kalman uni.* 592 15.10 kalman term.* 592 15.11 kalman fvar.* 594 15.12 kalman panic.* 594 16.1 nts nelplos.* 616 16.2 nts nelplos.* 616 16.3 nts nelplos.* 617 16.4 nts moment.* 622 16.5 nts moment.* 624 16.6 nts moment.* 628 16.7 nts yts.* 632 16.8 nts fclt.* 635 Computer Code used in the Examples 7 16.10 nts stochint.* 637 16.11 nts mixednormal.* 639 17.1 unit qusgdp.* 657 17.2 unit qusgdp.* 661 17.3 unit asypower1.* 671 17.4 unit asypowerenv.* 674 17.5 unit maicsim.* 677 17.6 unit qusgdp.* 679 17.8 unit qusgdp.* 683 17.9 unit qusgdp.* 685 18.1 coint lrgraphs.* 696 18.2 coint lrgraphs.* 696 18.3 coint lrgraphs.* 697 18.4 coint lrgraphs.* 702 18.6 coint 
bivterm.* 707 18.7 coint bivterm.* 708 18.8 coint bivterm.* 712 18.9 coint permincome.* 714 18.10 coint bivterm.* 715 18.11 coint triterm.* 716 18.13 coint simevals.* 719 18.16 coint bivterm.* 728 19.1 nlm features.* 750 19.2 nlm features.* 750 19.3 nlm features.* 751 19.4 nlm features.* 752 19.6 nlm tarsim.* 760 19.7 nlm annfig.* 762 19.8 nlm bilinear.* 767 19.9 nlm hamilton.* 772 19.10 nlm tar.* 774 19.11 nlm girf.* 778 20.1 garch nic.* 800 20.2 garch estimate.* 804 20.3 garch test.* 806 20.4 garch simulate.* 809 20.5 garch estimate.* 810 20.6 garch seasonality.* 813 20.7 garch mean.* 816 20.9 mgarch bekk.* 828 21.2 discrete mpol.* 852 21.3 discrete floor.* 852 21.4 discrete simulation.* 857 8 Computer Code used in the Examples 21.7 discrete probit.* 859 21.8 discrete probit.* 862 21.9 discrete ordered.* 866 21.11 discrete thinning.* 871 21.12 discrete poissonauto.* 873 Code Disclaimer Information Note that the computer code is provided for illustrative purposes only and although care has been taken to ensure that it works properly, it has not been thoroughly tested under all conditions and on all platforms. The authors and Cambridge University Press cannot guarantee or imply reliability, service- ability, or function of this computer code. All code is therefore provided ‘as is’ without any warranties of any kind. PART ONE MAXIMUM LIKELIHOOD 1 The Maximum Likelihood Principle 1.1 Introduction Maximum likelihood estimation is a general method for estimating the pa- rameters of econometric models from observed data. The principle of max- imum likelihood plays a central role in the exposition of this book, since a number of estimators used in econometrics can be derived within this frame- work. Examples include ordinary least squares, generalized least squares and full-information maximum likelihood. In deriving the maximum likelihood estimator, a key concept is the joint probability density function (pdf) of the observed random variables, yt. Maximum likelihood estimation requires that the following conditions are satisfied. (1) The form of the joint pdf of yt is known. (2) The specification of the moments of the joint pdf are known. (3) The joint pdf can be evaluated for all values of the parameters, θ. Parts ONE and TWO of this book deal with models in which all these conditions are satisfied. Part THREE investigates models in which these conditions are not satisfied and considers four important cases. First, if the distribution of yt is misspecified, resulting in both conditions 1 and 2 being violated, estimation is by quasi-maximum likelihood (Chapter 9). Second, if condition 1 is not satisfied, a generalized method of moments estimator (Chapter 10) is required. Third, if condition 2 is not satisfied, estimation relies on nonparametric methods (Chapter 11). Fourth, if condition 3 is violated, simulation-based estimation methods are used (Chapter 12). 1.2 Motivating Examples To highlight the role of probability distributions in maximum likelihood esti- mation, this section emphasizes the link between observed sample data and 4 The Maximum Likelihood Principle the probability distribution from which they are drawn. This relationship is illustrated with a number of simulation examples where samples of size T = 5 are drawn from a range of alternative models. The realizations of these draws for each model are listed in Table 1.1. Table 1.1 Realisations of yt from alternative models: t = 1, 2, · · · , 5. 
Model t=1 t=2 t=3 t=4 t=5 Time Invariant -2.720 2.470 0.495 0.597 -0.960 Count 2.000 4.000 3.000 4.000 0.000 Linear Regression 2.850 3.105 5.693 8.101 10.387 Exponential Regression 0.874 8.284 0.507 3.7225.865 Autoregressive 0.000 -1.031 -0.283 -1.323 -2.195 Bilinear 0.000 -2.721 0.531 1.350 -2.451 ARCH 0.000 3.558 6.989 7.925 8.118 Poisson 3.000 10.000 17.000 20.000 23.000 Example 1.1 Time Invariant Model Consider the model yt = σzt , where zt is a disturbance term and σ is a parameter. Let zt be a standardized normal distribution, N(0, 1), defined by f(z) = 1√ 2π exp [ −z 2 2 ] . The distribution of yt is obtained from the distribution of zt using the change of variable technique (see Appendix A for details) f(y ; θ) = f(z) ∣∣∣∣ ∂z ∂y ∣∣∣∣ , where θ = {σ2}. Applying this rule, and recognising that z = y/σ, yields f(y ; θ) = 1√ 2π exp [ −(y/σ) 2 2 ] ∣∣∣∣ 1 σ ∣∣∣∣ = 1√ 2πσ2 exp [ − y 2 2σ2 ] , or yt ∼ N(0, σ 2). In this model, the distribution of yt is time invariant because neither the mean nor the variance depend on time. This property is highlighted in panel (a) of Figure 1.1 where the parameter is σ = 2. For comparative purposes the distributions of both yt and zt are given. As yt = 2zt, the distribution of yt is flatter than the distribution of zt. 1.2 Motivating Examples 5 (a) Time Invariant Model f (y ) y z y (b) Count Model f (y ) y (c) Linear Regression Model f (y ) y (d) Exponential Regression Model f (y ) y -10 0 10 20-10 0 10 20 0 1 2 3 4 5 6 7 8 9-10 0 10 0 0.2 0.4 0.6 0.8 1 0 0.05 0.1 0.15 0.2 0 0.1 0.2 0.3 0 0.1 0.2 0.3 0.4 Figure 1.1 Probability distributions of y generated from the time invariant, count, linear regression and exponential regression models. Except for the time invariant and count models, the solid line represents the density at t = 1, the dashed line represents the density at t = 3 and the dotted line represents the density at t = 5. As the distribution of yt in Example 1.1 does not depend on lagged values yt−i, yt is independently distributed. In addition, since the distribution of yt is the same at each t, yt is identically distributed. These two properties are abbreviated as iid. Conversely, the distribution is dependent if yt depends on its own lagged values and non-identical if it changes over time. 6 The Maximum Likelihood Principle Example 1.2 Count Model Consider a time series of counts modelled as a series of draws from a Poisson distribution f (y; θ) = θy exp[−θ] y! , y = 0, 1, 2, · · · , where θ > 0 is an unknown parameter. A sample of T = 5 realizations of yt, given in Table 1.1, is drawn from the Poisson probability distribution in panel (b) of Figure 1.1 for θ = 2. By assumption, this distribution is the same at each point in time. In contrast to the data in the previous example where the random variable is continuous, the data here are discrete as they are positive integers that measure counts. Example 1.3 Linear Regression Model Consider the regression model yt = βxt + σzt , zt ∼ iidN(0, 1) , where xt is an explanatory variable that is independent of zt and θ = {β, σ2}. The distribution of y conditional on xt is f(y |xt; θ) = 1√ 2πσ2 exp [ −(y − βxt) 2 2σ2 ] , which is a normal distribution with conditional mean βxt and variance σ 2, or yt ∼ N(βxt, σ 2). This distribution is illustrated in panel (c) of Figure 1.1 with β = 3, σ = 2 and explanatory variable xt = {0, 1, 2, 3, 4}. The effect of xt is to shift the distribution of yt over time into the positive region, resulting in the draws of yt given in Table 1.1 becoming increasingly positive. 
As the variance at each point in time is constant, the spread of the distributions of yt is the same for all t. Example 1.4 Exponential Regression Model Consider the exponential regression model f(y |xt; θ) = 1 µt exp [ − y µt ] , where µt = β0+β1xt is the time-varying conditional mean, xt is an explana- tory variable and θ = {β0, β1}. This distribution is highlighted in panel (d) of Figure 1.1 with β0 = 1, β1 = 1 and xt = {0, 1, 2, 3, 4}. As β1 > 0, the ef- fect of xt is to cause the distribution of yt to become more positively skewed over time. 1.2 Motivating Examples 7 (a) Autoregressive Model f (y ) y (b) Bilinear Model f (y ) y (c) Autoregressive Heteroskedastic Model f (y ) y (d) ARCH Model f (y ) y -10 0 10 20-10 0 10 -10 0 10 20-10 0 10 0 0.1 0.2 0.3 0.4 0.5 0 0.1 0.2 0.3 0.4 0.5 0 0.05 0.1 0.15 0.2 0 0.05 0.1 0.15 0.2 Figure 1.2 Probability distributions of y generated from the autoregressive, bilinear, autoregressive with heteroskedasticity and ARCH models. The solid line represents the density at t = 1, the dashed line represents the density at t = 3 and the dotted line represents the density at t = 5. Example 1.5 Autoregressive Model An example of a first-order autoregressive model, denoted AR(1), is yt = ρyt−1 + ut , ut ∼ iidN(0, σ 2) , 8 The Maximum Likelihood Principle with |ρ| < 1 and θ = {ρ, σ2}. The distribution of y, conditional on yt−1, is f(y | yt−1; θ) = 1√ 2πσ2 exp [ −(y − ρyt−1) 2 2σ2 ] , which is a normal distribution with conditional mean ρyt−1 and variance σ2, or yt ∼ N(ρyt−1, σ2). If 0 < ρ < 1, then a large positive (negative) value of yt−1 shifts the distribution into the positive (negative) region for yt, raising the probability that the next draw from this distribution is also positive (negative). This property of the autoregressive model is highlighted in panel (a) of Figure 1.2 with ρ = 0.8, σ = 2 and initial value y1 = 0. Example 1.6 Bilinear Time Series Model The autoregressive model discussed above specifies a linear relationship between yt and yt−1. The following bilinear model is an example of a non- linear time series model yt = ρyt−1 + γyt−1ut−1 + ut , ut ∼ iidN(0, σ 2) , where yt−1ut−1 represents the bilinear term and θ = {ρ, γ, σ2}. The distri- bution of yt conditional on yt−1 is f(y | yt−1; θ) = 1√ 2πσ2 exp [ −(y − µt) 2 2σ2 ] , which is a normal distribution with conditional mean µt = ρyt−1+γyt−1ut−1 and variance σ2. To highlight the nonlinear property of the model, substitute out ut−1 in the equation for the mean µt = ρyt−1 + γyt−1(yt−1 − ρyt−2 − γyt−2ut−2) = ρyt−1 + γy 2 t−1 − γρyt−1yt−2 − γ2yt−1yt−2ut−2 , which shows that the mean is a nonlinear function of yt−1. Setting γ = 0 yields the linear AR(1) model of Example 1.5. The distribution of the bilinear model is illustrated in panel (b) of Figure 1.2 with ρ = 0.8, γ = 0.4, σ = 2 and initial value y1 = 0. Example 1.7 Autoregressive Model with Heteroskedasticity An example of an AR(1) model with heteroskedasticity is yt = ρyt−1 + σtzt σ2t = α0 + α1wt zt ∼ iidN(0, 1) , where θ = {ρ, α0, α1} and wt is an explanatory variable. The distribution 1.3 Joint Probability Distributions 9 of yt conditional on yt−1 and wt is f(y | yt−1, wt; θ) = 1√ 2πσ2t exp [ −(y − ρyt−1) 2 2σ2t ] , which is a normal distribution with conditional mean ρyt−1 and conditional variance α0 + α1wt. For this model, the distribution shifts because of the dependence on yt−1 and the spread of the distribution changes because of wt. 
These features are highlighted in panel (c) of Figure 1.2 with ρ = 0.8, α0 = 0.8, α1 = 0.8, wt is defined as a uniform random number on the unit interval and the initial value is y1 = 0. Example 1.8 Autoregressive Conditional Heteroskedasticity The autoregressive conditional heteroskedasticity (ARCH) class of models is a special case of the heteroskedastic regression model where wt in Example 1.7 is expressed in terms of lagged values of the disturbance term squared. An example of a regression model as in Example 1.3 with ARCH is yt = βxt + ut ut = σtzt σ2t = α0 + α1u 2 t−1 zt ∼ iidN(0, 1), where xt is an explanatory variable and θ = {β, α0, α1}. The distribution of y conditional on yt−1, xt and xt−1 is f (y | yt−1, xt, xt−1; θ) = 1√ 2π ( α0 + α1 (yt−1 − βxt−1)2 ) × exp − (y − βxt) 2 2 ( α0 + α1 (yt−1 − βxt−1)2 ) . For this model, a large shock, represented by a large value of ut, results in an increased variance in the next period if α1 > 0. The distribution from which yt is drawn in the nextperiod will therefore have a larger variance. The distribution of this model is shown in panel (d) of Figure 1.2 with β = 3, α0 = 0.8, α1 = 0.8 and xt = {0, 1, 2, 3, 4}. 1.3 Joint Probability Distributions The motivating examples of the previous section focus on the distribution of yt at time t which is generally a function of its own lags and the current 10 The Maximum Likelihood Principle and lagged values of explanatory variables xt. The derivation of the maxi- mum likelihood estimator of the model parameters requires using all of the information t = 1, 2, · · · , T by defining the joint probability density function (pdf). In the case where both yt and xt are stochastic, the joint probability pdf for a sample of T observations is f(y1, y2, · · · , yT , x1, x2, · · · , xT ;ψ) , (1.1) where ψ is a vector of parameters. An important feature of the previous examples is that yt depends on the explanatory variable xt. To capture this conditioning, the joint distribution in (1.1) is expressed as f(y1, y2, · · · , yT , x1, x2, · · · , xT ;ψ) = f(y1, y2, · · · , yT |x1, x2, · · · , xT ;ψ) × f(x1, x2, · · · , xT ;ψ) , (1.2) where the first term on the right hand side of (1.2) represents the conditional distribution of {y1, y2, · · · , yT } on {x1, x2, · · · , xT } and the second term is the marginal distribution of {x1, x2, · · · , xT }. Assuming that the parameter vector ψ can be decomposed into {θ, θx} such that expression (1.2) becomes f(y1, y2, · · · , yT , x1, x2, · · · , xT ;ψ) = f(y1, y2, · · · , yT |x1, x2, · · · , xT ; θ) × f(x1, x2, · · · , xT ; θx) . (1.3) In these circumstances, the maximum likelihood estimation of the parame- ters θ is based on the conditional distribution without loss of information from the exclusion of the marginal distribution f(x1, x2, · · · , xT ; θx). The conditional distribution on the right hand side of expression (1.3) simplifies further in the presence of additional restrictions. Independent and identically distributed (iid) In the simplest case, {y1, y2, · · · , yT } is independent of {x1, x2, · · · , xT } and yt is iid with density function f(y; θ). The conditional pdf in equation (1.3) is then f(y1, y2, · · · , yT |x1, x2, · · · , xT ; θ) = T∏ t=1 f(yt; θ) . (1.4) Examples of this case are the time invariant model (Example 1.1) and the count model (Example 1.2). If both yt and xt are iid and yt is dependent on xt then the decomposition in equation (1.3) implies that inference can be based on f(y1, y2, · · · , yT |x1, x2, · · · , xT ; θ) = T∏ t=1 f(yt |xt; θ) . 
(1.5) 1.3 Joint Probability Distributions 11 Examples include the regression models in Examples 1.3 and 1.4 if sampling is iid. Dependent Now assume that {y1, y2, · · · , yT } depends on its own lags but is independent of the explanatory variable {x1, x2, · · · , xT }. The joint pdf is expressed as a sequence of conditional distributions where conditioning is based on lags of yt. By using standard rules of probability the distributions for the first three observations are, respectively, f(y1; θ) = f(y1; θ) f(y1, y2 ; θ) = f(y2|y1; θ)f(y1; θ) f(y1, y2, y3; θ) = f(y3|y2, y1; θ)f(y2|y1; θ)f(y1; θ) , where y1 is the initial value with marginal probability density Extending this sequence to a sample of T observations, yields the joint pdf f(y1, y2, · · · , yT ; θ) = f(y1 ; θ) T∏ t=2 f(yt|yt−1, yt−2, · · · , y1; θ) . (1.6) Examples of this general case are the AR model (Example 1.5), the bilinear model (Example 1.6) and the ARCH model (Example 1.8). Extending the model to allow for dependence on explanatory variables, xt, gives f(y1, y2, · · · , yT |x1, x2, · · · , xT ; θ) = f(y1 |x1; θ) T∏ t=2 f(yt|yt−1, yt−2, · · · , y1, xt, xt−1, · · · x1; θ) . (1.7) An example is the autoregressive model with heteroskedasticity (Example 1.7). Example 1.9 Autoregressive Model The joint pdf for the AR(1) model in Example 1.5 is f(y1, y2, · · · , yT ; θ) = f(y1; θ) T∏ t=2 f(yt|yt−1; θ) , where the conditional distribution is f (yt|yt−1; θ) = 1√ 2πσ2 exp [ −(yt − ρyt−1) 2 2σ2 ] , 12 The Maximum Likelihood Principle and the marginal distribution is f (y1; θ) = 1√ 2πσ2/ (1− ρ2) exp [ − y 2 1 2σ2/ (1− ρ2) ] . Non-stochastic explanatory variables In the case of non-stochastic explanatory variables, because xt is determin- istic its probability mass is degenerate. Explanatory variables of this form are also referred to as fixed in repeated samples. The joint probability in expression (1.3) simplifies to f(y1, y2, · · · , yT , x1, x2, · · · , xT ;ψ) = f(y1, y2, · · · , yT |x1, x2, · · · , xT ; θ) . Now ψ = θ and there is no potential loss of information from using the conditional distribution to estimate θ. 1.4 Maximum Likelihood Framework As emphasized previously, a time series of data represents the observed realization of draws from a joint pdf. The maximum likelihood principle makes use of this result by providing a general framework for estimating the unknown parameters, θ, from the observed time series data, {y1, y2, · · · , yT }. 1.4.1 The Log-Likelihood Function The standard interpretation of the joint pdf in (1.7) is that f is a function of yt for given parameters, θ. In defining the maximum likelihood estimator this interpretation is reversed, so that f is taken as a function of θ for given yt. The motivation behind this change in the interpretation of the arguments of the pdf is to regard {y1, y2, · · · , yT } as a realized data set which is no longer random. The maximum likelihood estimator is then obtained by finding the value of θ which is “most likely” to have generated the observed data. Here the phrase “most likely” is loosely interpreted in a probability sense. It is important to remember that the likelihood function is simply a re- definition of the joint pdf in equation (1.7). For many problems it is simpler to work with the logarithm of this joint density function. 
The log-likelihood 1.4 Maximum Likelihood Framework 13 function is defined as lnLT (θ) = 1 T ln f(y1 |x1; θ) + 1 T T∑ t=2 ln f(yt|yt−1, yt−2, · · · , y1, xt, xt−1, · · · x1; θ) , (1.8) where the change of status of the arguments in the joint pdf is highlighted by making θ the sole argument of this function and the T subscript indicates that the log-likelihood is an average over the sample of the logarithm of the density evaluated at yt. It is worth emphasizing that the term log-likelihood function, used here without any qualification, is also known as the average log-likelihood function. This convention is also used by, among others, Newey and McFadden (1994) and White (1994). This definition of the log-likelihood function is consistent with the theoretical development of the properties of maximum likelihood estimators discussed in Chapter 2, particularly Sections 2.3 and 2.5.1. For the special case where yt is iid, the log-likelihood function is based on the joint pdf in (1.4) and is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) . In all cases, the log-likelihood function, lnLT (θ), is a scalar that represents a summary measure of the data for given θ. The maximum likelihood estimator of θ is defined as that value of θ, de- noted θ̂, that maximizes the log-likelihood function. In a large number of cases, this may be achieved using standard calculus. Chapter 3 discusses nu- merical approaches to the problem of finding maximum likelihood estimates when no analytical solutions exist, or are difficult to derive. Example 1.10 Poisson Distribution Let {y1, y2, · · · , yT } be iid observations from a Poisson distribution f(y; θ) = θy exp[−θ] y! , where θ > 0. The log-likelihood function for the sample is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) = 1 T T∑ t=1 yt ln θ − θ − ln(y1!y2! · · · yT !) T . Consider the following T = 3 observations, yt = {8, 3, 4}. The log-likelihood 14 The Maximum Likelihood Principle function is lnLT (θ) = 15 3 ln θ − θ − ln(8!3!4!) 3 = 5 ln θ − θ − 5.191 . A plot of the log-likelihoodfunction is given in panel (a) of Figure 1.3 for values of θ ranging from 0 to 10. Even though the Poisson distribution is a discrete distribution in terms of the random variable y, the log-likelihood function is continuous in the unknown parameter θ. Inspection shows that a maximum occurs at θ̂ = 5 with a log-likelihood value of lnLT (5) = 5× ln 5− 5− 5.191 = −2.144 . The contribution to the log-likelihood function at the first observation y1 = 8, evaluated at θ̂ = 5 is ln f(y1; 5) = y1 ln 5− 5− ln(y1!) = 8× ln 5− 5− ln(8!) = −2.729 . For the other two observations, the contributions are ln f(y2; 5) = −1.963, ln f(y3; 5) = −1.740. The probabilities f(yt; θ) are between 0 and 1 by def- inition and therefore all of the contributions are negative because they are computed as the logarithm of f(yt; θ). The average of these T = 3 contri- butions is lnLT (5) = −2.144, which corresponds to the value already given above. A plot of ln f(yt; 5) in panel (b) of Figure 1.3 shows that observations closer to θ̂ = 5 have a relatively greater contribution to the log-likelihood function than observations further away in the sense that they are smaller negative numbers. Example 1.11 Exponential Distribution Let {y1, y2, · · · , yT } be iid drawings from an exponential distribution f(y; θ) = θ exp[−θy] , where θ > 0. The log-likelihood function for the sample is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) = 1 T T∑ t=1 (ln θ − θyt) = ln θ − θ 1 T T∑ t=1 yt . Consider the following T = 6 observations, yt = {2.1, 2.2, 3.1, 1.6, 2.5, 0.5}. 
The log-likelihood function is lnLT (θ) = ln θ − θ 1 T T∑ t=1 yt = ln θ − 2 θ . Plots of the log-likelihood function, lnLT (θ), and the likelihood LT (θ) functions are given in Figure 1.4, which show that a maximum occurs at 1.4 Maximum Likelihood Framework 15 (a) Log-likelihood function ln L T (θ ) θ (b) Log-density function ln f (y t; 5 ) yt 1 2 3 4 5 6 7 8 9 100 5 10 15 -3 -2.5 -2 -1.5 -1 -0.5 0-30 -25 -20 -15 -10 -5 0 Figure 1.3 Plot of lnLT (θ) and and ln f(yt; θ̂ = 5) for the Poisson distri- bution example with a sample size of T = 3. (a) Log-likelihood function ln L T (θ ) θ (b) Likelihood function L T (θ ) × 1 0 5 θ 0 1 2 30 1 2 3 0.5 1 1.5 2 2.5 3 3.5 4 -40 -35 -30 -25 -20 -15 -10 Figure 1.4 Plot of lnLT (θ) for the exponential distribution example. θ̂ = 0.5. Table 1.2 provides details of the calculations. Let the log-likelihood function at each observation evaluated at the maximum likelihood estimate be denoted ln lt(θ) = ln f(yt; θ). The second column shows ln lt(θ) evaluated at θ̂ = 0.5 ln lt(0.5) = ln(0.5) − 0.5yt , resulting in a maximum value of the log-likelihood function of lnLT (0.5) = 1 6 6∑ t=1 ln lt(0.5) = −10.159 6 = −1.693 . 16 The Maximum Likelihood Principle Table 1.2 Maximum likelihood calculations for the exponential distribution example. The maximum likelihood estimate is θ̂T = 0.5. yt ln lt(0.5) gt(0.5) ht(0.5) 2.1 -1.743 -0.100 -4.000 2.2 -1.793 -0.200 -4.000 3.1 -2.243 -1.100 -4.000 1.6 -1.493 0.400 -4.000 2.5 -1.943 -0.500 -4.000 0.5 -0.943 1.500 -4.000 lnLT (0.5) = −1.693 GT (0.5) = 0.000 HT (0.5) = −4.000 Example 1.12 Normal Distribution Let {y1, y2, · · · , yT } be iid observations drawn from a normal distribution f(y; θ) = 1√ 2πσ2 exp [ −(y − µ) 2 2σ2 ] , with unknown parameters θ = { µ, σ2 } . The log-likelihood function is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) = 1 T T∑ t=1 ( − 1 2 ln 2π − 1 2 lnσ2 − (yt − µ) 2 2σ2 ) = −1 2 ln 2π − 1 2 lnσ2 − 1 2σ2T T∑ t=1 (yt − µ)2. Consider the following T = 6 observations, yt = {5,−1, 3, 0, 2, 3}. The log-likelihood function is lnLT (θ) = − 1 2 ln 2π − 1 2 lnσ2 − 1 12σ2 6∑ t=1 (yt − µ)2 . A plot of this function in Figure 1.5 shows that a maximum occurs at µ̂ = 2 and σ̂2 = 4. Example 1.13 Autoregressive Model 1.4 Maximum Likelihood Framework 17 PSfrag µσ 2 ln L T (µ , σ 2 ) 1 1.5 2 2.5 3 3 3.5 4 4.5 5 Figure 1.5 Plot of lnLT (θ) for the normal distribution example. From Example 1.9, the log-likelihood function for the AR(1) model is lnLT (θ) = 1 T ( 1 2 ln ( 1− ρ2 ) − 1 2σ2 ( 1− ρ2 ) y21 ) −1 2 ln 2π − 1 2 lnσ2 − 1 2σ2T T∑ t=2 (yt − ρyt−1)2 . The first term is commonly excluded from lnLT (θ) as its contribution dis- appears asymptotically since lim T−→∞ 1 T ( 1 2 ln ( 1− ρ2 ) − 1 2σ2 ( 1− ρ2 ) y21 ) = 0 . As the aim of maximum likelihood estimation is to find the value of θ that maximizes the log-likelihood function, a natural way to do this is to use the rules of calculus. This involves computing the first derivatives and second derivatives of the log-likelihood function with respect to the parameter vec- tor θ. 18 The Maximum Likelihood Principle 1.4.2 Gradient Differentiating lnLT (θ), with respect to a (K×1) parameter vector, θ, yields a (K × 1) gradient vector, also known as the score, given by GT (θ) = ∂ lnLT (θ) ∂θ = ∂ lnLT (θ) ∂θ1 ∂ lnLT (θ) ∂θ2 ... ∂ lnLT (θ) ∂θK = 1 T T∑ t=1 gt(θ) , (1.9) where the subscript T emphasizes that the gradient is the sample average of the individual gradients gt(θ) = ∂ ln lt(θ) ∂θ . 
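To illustrate the sample-average structure of the gradient in equation (1.9), the following minimal sketch (not part of the companion code) evaluates the individual gradients of the exponential log-likelihood of Example 1.11, gt(θ) = 1/θ − yt, at θ = 0.5 and averages them, reproducing the gt(0.5) column and the GT(0.5) entry of Table 1.2.

% Individual gradients and their sample average: exponential example.
y     = [2.1 2.2 3.1 1.6 2.5 0.5];   % data from Example 1.11
theta = 0.5;                          % maximum likelihood estimate
gt    = 1/theta - y;                  % g_t(theta) = 1/theta - y_t
GT    = mean(gt);                     % G_T(theta) = (1/T) sum_t g_t(theta)
fprintf('G_T(0.5) = %.3f\n', GT);     % prints 0.000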
The maximum likelihood estimator of θ, denoted θ̂, is obtained by setting the gradients equal to zero and solving the resultantK first-order conditions. The maximum likelihood estimator, θ̂, therefore satisfies the condition GT (θ̂) = ∂ lnLT (θ) ∂θ ∣∣∣∣ θ=θ̂ = 0 . (1.10) Example 1.14 Poisson Distribution From Example 1.10, the first derivative of lnLT (θ) with respect to θ is GT (θ) = 1 Tθ T∑ t=1 yt − 1 . The maximum likelihood estimator is the solution of the first-order condition 1 T θ̂ T∑ t=1 yt − 1 = 0 , which yields the sample mean as the maximum likelihood estimator θ̂ = 1 T T∑ t=1 yt = y . Using the data for yt in Example 1.10, the maximum likelihood estimate is θ̂ = 15/3 = 5. Evaluating the gradient at θ̂ = 5 verifies that it is zero at the 1.4 Maximum Likelihood Framework 19 maximum likelihood estimate GT (θ̂) = 1 T θ̂ T∑ t=1 yt − 1 = 15 3× 5 − 1 = 0 . Example 1.15 Exponential Distribution From Example 1.11, the first derivative of lnLT (θ) with respect to θ is GT (θ) = 1 θ − 1 T T∑ t=1 yt . Setting GT (θ̂) = 0 and solving the resultant first-order condition yields θ̂ = T∑T t=1 yt = 1 y , which is the reciprocal of the sample mean. Using the same observed data for yt as in Example 1.11, the maximum likelihood estimate is θ̂ = 6/12 = 0.5. The third column of Table 1.2 gives the gradients at each observation evaluated at θ̂ = 0.5 gt(0.5) = 1 0.5 − yt . The gradient is GT (0.5) = 1 6 6∑ t=1 gt(0.5) = 0 , which follows from the properties of the maximum likelihood estimator. Example 1.16 Normal Distribution From Example 1.12, the first derivatives of the log-likelihood function are ∂ lnLT (θ) ∂µ = 1 σ2T T∑ t=1 (yt−µ) , ∂ lnLT (θ) ∂(σ2) = − 1 2σ2 + 1 2σ4T T∑ t=1 (yt−µ)2 , yielding the gradient vector GT (θ) = 1 σ2T T∑ t=1 (yt − µ) − 1 2σ2 + 1 2σ4T T∑ t=1 (yt − µ)2 . 20 The Maximum Likelihood Principle Evaluating the gradient at θ̂ and setting GT (θ̂) = 0, gives GT (θ̂) = 1 σ̂2T T∑ t=1 (yt − µ̂) − 1 2σ̂2 + 1 2σ̂4T T∑ t=1 (yt − µ̂)2 = 0 0 . Solving for θ̂ = {µ̂, σ̂2}, the maximum likelihood estimators are µ̂ = 1 T T∑ t=1 yt = y , σ̂ 2 = 1 T T∑ t=1 (yt − y)2 . Using the data from Example 1.12, the maximum likelihood estimates are µ̂ = 5− 1 + 3 + 0 + 2 + 3 6 = 2 σ̂2 = (5− 2)2 + (−1− 2)2 + (3− 2)2 + (0− 2)2 + (2− 2)2 + (3− 2)2 6 = 4 , which agree with the values given in Example 1.12. 1.4.3 Hessian To establish that θ̂ maximizes the log-likelihood function, it is necessary to determine that the Hessian HT (θ) = ∂2 lnLT (θ) ∂θ∂θ′ , (1.11) associated with the log-likelihood function is negative definite. As θ is a (K × 1) vector, the Hessian is the (K ×K) symmetric matrix HT (θ) = ∂2 lnLT (θ) ∂θ1∂θ1 ∂2 lnLT (θ) ∂θ1∂θ2 . . . ∂2 lnLT (θ) ∂θ1∂θK ∂2 lnLT (θ) ∂θ2∂θ1 ∂2 lnLT (θ) ∂θ2∂θ2 . . . ∂2 lnLT (θ) ∂θ2∂θK ... ... ... ... ∂2 lnLT (θ) ∂θK∂θ1 ∂2 lnLT (θ) ∂θK∂θ2 . . . ∂2 lnLT (θ) ∂θK∂θK = 1 T T∑ t=1 ht(θ) , 1.4 Maximum Likelihood Framework 21 where the subscript T emphasizes that the Hessian is the sample average of the individual elements ht(θ) = ∂2 ln lt(θ) ∂θ∂θ′ . The second-order condition for a maximum requires that the Hessian matrix evaluated at θ̂, HT (θ̂) = ∂2 lnLT (θ) ∂θ∂θ′ ∣∣∣∣ θ=θ̂ , (1.12) is negative definite. The conditions for negative definiteness are |H11| < 0, ∣∣∣∣ H11 H12 H21 H22 ∣∣∣∣ > 0, ∣∣∣∣∣∣∣∣ H11 H12 H13 H21 H22 H23 H31 H32 H33 ∣∣∣∣∣∣∣∣ < 0, · · · where Hij is the ij th element of HT (θ̂). In the case of K = 1, the condition is H11 < 0 . (1.13) For the case of K = 2, the condition is H11 < 0, H11H22 −H12H21 > 0 . 
(1.14) Example 1.17 Poisson Distribution From Examples 1.10 and 1.14, the second derivative of lnLT (θ) with re- spect to θ is HT (θ) = − 1 θ2T T∑ t=1 yt . Evaluating the Hessian at the maximum likelihood estimator, θ̂ = ȳ, yields HT (θ̂) = − 1 θ̂2T T∑ t=1 yt = − 1 ȳ2T T∑ t=1 yt = − 1 ȳ < 0 . As ȳ is always positive because it is the mean of a sample of positive integers, the Hessian is negative and a maximum is achieved. Using the data for yt in Example 1.10, verifies that the Hessian at θ̂ = 5 is negative HT (θ̂) = − 1 θ̂2T T∑ t=1 yt = − 15 52 × 3 = −0.200 . 22 The Maximum Likelihood Principle Example 1.18 Exponential Distribution From Examples 1.11 and 1.15, the second derivative of lnLT (θ) with re- spect to θ is HT (θ) = − 1 θ2 . Evaluating the Hessian at the maximum likelihood estimator yields HT (θ̂) = − 1 θ̂2 < 0 . As this term is negative for any θ̂, the condition in equation (1.13) is satisfied and a maximum is achieved. The last column of Table 1.2 shows that the Hessian at each observation evaluated at the maximum likelihood estimate is constant. The value of the Hessian is HT (0.5) = 1 6 6∑ t=1 ht(0.5) = −24.000 6 = −4 , which is negative confirming that a maximum has been reached. Example 1.19 Normal Distribution From Examples 1.12 and 1.16, the second derivatives of lnLT (θ) with respect to θ are ∂2 lnLT (θ) ∂µ2 = − 1 σ2 ∂2 lnLT (θ) ∂µ∂σ2 = − 1 σ4T T∑ t=1 (yt − µ) ∂2 lnLT (θ) ∂(σ2)2 = 1 2σ4 − 1 σ6T T∑ t=1 (yt − µ)2 , so that the Hessian is HT (θ) = − 1 σ2 − 1 σ4T T∑ t=1 (yt − µ) − 1 σ4T T∑ t=1 (yt − µ) 1 2σ4 − 1 σ6T T∑ t=1 (yt − µ)2 . Given that GT (θ̂) = 0, from Example 1.16 it follows that ∑T t=1(yt − µ̂) = 0 1.5 Applications 23 and therefore HT (θ̂) = − 1 σ̂2 0 0 − 1 2σ̂4 . From equation (1.14) H11 = − T σ̂2 < 0, H11H22 −H12H21 = − ( T σ̂2 )( − T 2σ̂4 ) − 02 > 0 , establishing that the second-order condition for a maximum is satisfied. Using the maximum likelihood estimates from Example 1.16, the Hessian is HT (µ̂, σ̂ 2) = −1 4 0 0 − 1 2× 42 = −0.250 0.000 0.000 −0.031 . 1.5 Applications To highlight the features of maximum likelihood estimation discussed thus far, two applications are presented that focus on estimating the discrete time version of the Vasicek (1977) model of interest rates, rt. The first application is based on the marginal (stationary) distribution while the second focuses on the conditional (transitional) distribution that gives the distribution of rt conditional on rt−1. The interest rate data used are from Aı̈t-Sahalia (1996). The data, plotted in Figure 1.6, consists of daily 7-day Eurodollar rates (expressed as percentages) for the period 1 June 1973 to the 25 February 1995, a total of T = 5505 observations. The Vasicek model expresses the change in the interest rate, rt, as a function of a constant and the lagged interest rate rt − rt−1 = α+ βrt−1 + ut ut ∼ iidN ( 0, σ2 ) , (1.15) where θ = {α, β, σ2} are unknown parameters, with the restriction β < 0. 1.5.1 Stationary Distribution of the Vasicek Model As a preliminary step to estimating the parameters of the Vasicek model in equation (1.15), consider the alternative model where the level of the interest 24 The Maximum Likelihood Principle % t 1975 1980 1985 1990 1995 4 8 12 16 20 24 Figure 1.6 Daily 7-day Eurodollar interest rates from the 1 June 1973 to 25 February 1995 expressed as a percentage. rate is independent of previous interest rates rt = µs + vt , vt ∼ iidN(0, σ 2 s ) . The stationary distribution of rt for this model is f(r;µs, σ 2 s) = 1√ 2πσ2s exp [ −(r − µs) 2 2σ2s ] . 
(1.16) The relationship between the parameters of the stationary distribution and the parameters of the model in equation (1.15) is µs = − α β , σ2s = − σ2 β (2 + β) . (1.17) which are obtained as the unconditional mean and variance of (1.15). The log-likelihood function based on the stationary distribution in equa- tion (1.16) for a sample of T observations is lnLT (θ) = − 1 2 ln 2π − 1 2 lnσ2s − 1 2σ2sT T∑ t=1 (rt − µs)2 , where θ = {µs, σ2s}. Maximizing lnLT (θ) with respect to θ gives µ̂s = 1 T T∑ t=1 rt , σ̂ 2 s = 1 T T∑ t=1 (rt − µ̂s)2 . (1.18) Using the Eurodollar interest rates, the maximum likelihood estimates are µ̂s = 8.362, σ̂ 2 s = 12.893. (1.19) 1.5 Applications 25 f( r) Interest Rate -5 0 5 10 15 20 25 Figure 1.7 Estimated stationary distribution of the Vasicek model based on evaluating (1.16) at the maximum likelihood estimates (1.19), using daily Eurodollar rates from the 1 June 1973 to 25 February 1995. The stationary distribution is estimated by evaluating equation (1.16) at the maximum likelihood estimates in (1.19) and is given by f ( r; µ̂s, σ̂ 2 s ) = 1√ 2πσ̂2s exp [ −(r − µ̂s) 2 2σ̂2s ] = 1√ 2π × 12.893 exp [ −(r − 8.362) 2 2× 12.893 ] , (1.20) which is presented in Figure 1.7. Inspection of the estimated distribution shows a potential problem with the Vasicek stationary distribution, namely that the support of the distri- bution is not restricted to being positive. The probability of negative values for the interest rate is Pr (r < 0) = 0∫ −∞ 1√ 2π × 12.893 exp [ −(r − 8.362) 2 2× 12.893 ] dr = 0.01 . To avoid this problem, alternative models of interest rates are specified where the stationary distribution is just defined over the positive region. A well known example is the CIR interest rate model (Cox, Ingersoll and Ross, 1985) which is discussed in Chapters 2, 3 and 12. 1.5.2 Transitional Distribution of the Vasicek Model In contrast to the stationary model specification of the previous section, the full dynamics of the Vasicek model in equation (1.15) are now used by 26 The Maximum Likelihood Principle specifying the transitional distribution f ( r | rt−1;α, ρ, σ2 ) = 1√ 2πσ2 exp [ −(r − α− ρrt−1) 2 2σ2 ] , (1.21) where θ = { α, ρ, σ2 } and the substitution ρ = 1+β is made for convenience. This distribution is now of the same form as the conditional distribution of the AR(1) model in Examples 1.5, 1.9 and 1.13. The log-likelihood function based on the transitional distribution in equa- tion (1.21) is lnLT (θ) = − 1 2 ln 2π − 1 2 lnσ2 − 1 2σ2(T − 1) T∑ t=2 (rt − α− ρrt−1)2 , where the sample size is reduced by one observation as a result of the lagged term rt−1. This form of the log-likelihood function does not contain the marginal distribution f(r1; θ), a point that is made in Example 1.13. The first derivatives of the log-likelihood function are ∂ lnL(θ) ∂α = 1 σ2(T − 1) T∑ t=2 (rt − α− ρrt−1) ∂ lnL(θ) ∂ρ = 1 σ2(T − 1) T∑ t=2 (rt − α− ρrt−1)rt−1 ∂ lnL(θ) ∂(σ2) = − 1 2σ2 + 1 2σ4(T − 1) T∑ t=2 (rt − α− ρrt−1)2 . Setting these derivatives to zero yields the maximum likelihood estimators α̂ = r̄t − ρ̂ r̄t−1 ρ̂ = T∑ t=2 (rt − r̄t)(rt−1 − r̄t−1) T∑ t=2 (rt−1 − r̄t−1)2 σ̂2 = 1 T − 1 T∑ t=2 (rt − α̂− ρ̂rt−1)2 , where r̄t = 1 T − 1 T∑ t=2rt , r̄t−1 = 1 T − 1 T∑ t=2 rt−1 . 1.5 Applications 27 The maximum likelihood estimates for the Eurodollar interest rates are α̂ = 0.053, ρ̂ = 0.994, σ̂2 = 0.165. (1.22) An estimate of β is obtained by using the relationship ρ = 1+β. Rearranging for β and evaluating at ρ̂ gives β̂ = ρ̂− 1 = −0.006. 
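The closed-form estimators just derived are simple to compute. The MATLAB fragment below is an illustrative sketch rather than the book's basic_transitional program: the vector r is assumed to hold the interest rate series (for the results in (1.22) it would be the Eurodollar data), and a simulated series is generated here only so that the fragment runs on its own.

% Maximum likelihood estimation of the Vasicek transitional distribution.
% Placeholder data: simulate an AR(1) in levels; replace r with the
% Eurodollar series to reproduce the estimates in (1.22).
T = 5505;
r = zeros(T,1);  r(1) = 8;
for t = 2:T
    r(t) = 0.05 + 0.994*r(t-1) + sqrt(0.165)*randn;
end

r0 = r(2:end);  r1 = r(1:end-1);      % r_t and r_{t-1}
rbar0 = mean(r0);  rbar1 = mean(r1);

rho_hat   = sum((r0 - rbar0).*(r1 - rbar1))/sum((r1 - rbar1).^2);
alpha_hat = rbar0 - rho_hat*rbar1;
sig2_hat  = mean((r0 - alpha_hat - rho_hat*r1).^2);
beta_hat  = rho_hat - 1;
fprintf('alpha = %6.3f, rho = %6.3f, sigma2 = %6.3f, beta = %7.4f\n', ...
        alpha_hat, rho_hat, sig2_hat, beta_hat);

Applied to the Eurodollar data, these expressions should reproduce the estimates reported in (1.22).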
The estimated transitional distribution is obtained by evaluating (1.21) at the maximum likelihood estimates in (1.22) f ( r | rt−1; α̂, ρ̂, σ̂2 ) = 1√ 2πσ̂2 exp [ −(r − α̂− ρ̂rt−1) 2 2σ̂2 ] . (1.23) Plots of this distribution are given in Figure 1.8 for three values of the conditioning variable rt−1, corresponding to the minimum (2.9%), median (8.1%) and maximum (24.3%) interest rates in the sample. f( r) r 0 5 10 15 20 25 30 . Figure 1.8 Estimated transitional distribution of the Vasicek model, based on evaluating (1.23) at the maximum likelihood estimates in (1.22) using Eurodollar rates from 1 June 1973 to 25 February 1995. The dashed line is the transitional density for the minimum (2.9%), the solid line is the transi- tional density for the median (8.1%) and the dotted line is the transitional density for the maximum (24.3%) Eurodollar rate. The location of the three transitional distributions changes over time, while the spread of each distribution remains constant at σ̂2 = 0.165. A comparison of the estimates of the variances of the stationary and transi- tional distributions, in equations (1.19) and (1.22), respectively, shows that σ̂2 < σ̂2s . This result is a reflection of the property that by conditioning on information, in this case rt−1, the transitional distribution is better at tracking the time series behaviour of the interest rate, rt, than the stationary distribution where there is no conditioning on lagged dependent variables. 28 The Maximum Likelihood Principle Having obtained the estimated transitional distribution using the maxi- mum likelihood estimates in (1.22), it is also possible to use these estimates to reestimate the stationary interest rate distribution in (1.20) by using the expressions in (1.17). The alternative estimates of the mean and variance of the stationary distribution are µ̃s = − α̂ β̂ = 0.053 0.006 = 8.308, σ̃2s = − σ̂2 β̂ ( 2 + β̂ ) = 0.165 0.006 (2− 0.006) = 12.967 . As these estimates are based on the transitional distribution, which incorpo- rates the full dynamic specification of the Vasicek model, they represent the maximum likelihood estimates of the parameters of the stationary distribu- tion. This relationship between the maximum likelihood estimators of the transitional and stationary distributions is based on the invariance property of maximum likelihood estimators which is discussed in Chapter 2. While the parameter estimates of the stationary distribution using the estimates of the transitional distribution are numerically close to estimates obtained in the previous section, the latter estimates are obtained from a misspecified model as the stationary model excludes the dynamic structure in equation (1.15). Issues relating to misspecified models are discussed in Chapter 9. 1.6 Exercises (1) Sampling Data Gauss file(s) basic_sample.g Matlab file(s) basic_sample.m This exercise reproduces the simulation results in Figures 1.1 and 1.2. For each model, simulate T = 5 draws of yt and plot the corresponding distribution at each point in time. Where applicable the explanatory variable in these exercises is xt = {0, 1, 2, 3, 4} and wt are draws from a uniform distribution on the unit circle. (a) Time invariant model yt = 2zt , zt ∼ iidN(0, 1) . (b) Count model f (y; 2) = 2y exp[−2] y! , y = 1, 2, · · · . 1.6 Exercises 29 (c) Linear regression model yt = 3xt + 2zt , zt ∼ iidN(0, 1) . (d) Exponential regression model f(y; θ) = 1 µt exp [ − y µt ] , µt = 1 + 2xt . (e) Autoregressive model yt = 0.8yt−1 + 2zt , zt ∼ iidN(0, 1) . 
(f) Bilinear time series model yt = 0.8yt−1 + 0.4yt−1ut−1 + 2zt , zt ∼ iidN(0, 1) . (g) Autoregressive model with heteroskedasticity yt = 0.8yt−1 + σtzt , zt ∼ iidN(0, 1) σ2t = 0.8 + 0.8wt . (h) The ARCH regression model yt = 3xt + ut ut = σtzt σ2t = 4 + 0.9u 2 t−1 zt ∼ iidN(0, 1) . (2) Poisson Distribution Gauss file(s) basic_poisson.g Matlab file(s) basic_poisson.m A sample of T = 4 observations, yt = {6, 2, 3, 1}, is drawn from the Poisson distribution f(y; θ) = θy exp[−θ] y! . (a) Write the log-likelihood function, lnLT (θ). (b) Derive and interpret the maximum likelihood estimator, θ̂. (c) Compute the maximum likelihood estimate, θ̂. (d) Compute the log-likelihood function at θ̂ for each observation. (e) Compute the value of the log-likelihood function at θ̂. 30 The Maximum Likelihood Principle (f) Compute gt(θ̂) = d ln lt(θ) dθ ∣∣∣∣ θ=θ̂ and ht(θ̂) = d2 ln lt(θ) dθ2 ∣∣∣∣ θ=θ̂ , for each observation. (g) Compute GT (θ̂) = 1 T T∑ t=1 gt(θ̂) and HT (θ̂) = 1 T T∑ t=1 ht(θ̂) . (3) Exponential Distribution Gauss file(s) basic_exp.g Matlab file(s) basic_exp.m A sample of T = 4 observations, yt = {5.5, 2.0, 3.5, 5.0}, is drawn from the exponential distribution f(y; θ) = θ exp[−θy] . (a) Write the log-likelihood function, lnLT (θ). (b) Derive and interpret the maximum likelihood estimator, θ̂. (c) Compute the maximum likelihood estimate, θ̂. (d) Compute the log-likelihood function at θ̂ for each observation. (e) Compute the value of the log-likelihood function at θ̂. (f) Compute gt(θ̂) = d ln lt(θ) dθ ∣∣∣∣ θ=θ̂ and ht(θ̂) = d2 ln lt(θ) dθ2 ∣∣∣∣ θ=θ̂ , for each observation. (g) Compute GT (θ̂) = 1 T T∑ t=1 gt(θ̂) and HT (θ̂) = 1 T T∑ t=1 ht(θ̂) . (4) Alternative Form of Exponential Distribution Consider a random sample of size T , {y1, y2, · · · , yT }, of iid random variables from the exponential distribution with parameter θ f(y; θ) = 1 θ exp [ −y θ ] . (a) Derive the log-likelihood function, lnLT (θ). (b) Derive the first derivative of the log-likelihood function, GT (θ). 1.6 Exercises 31 (c) Derive the second derivative of the log-likelihood function, HT (θ). (d) Derive the maximum likelihood estimator of θ. Compare the result with that obtained in Exercise 3. (5) Normal Distribution Gauss file(s) basic_normal.g, basic_normal_like.g Matlab file(s) basic_normal.m, basic_normal_like.m A sample of T = 5 observations consisting of the values {1, 2, 5, 1, 2} is drawn from the normal distribution f(y; θ) = 1√ 2πσ2 exp [ −(y − µ) 2 2σ2 ] , where θ = {µ, σ2}. (a) Assume that σ2 = 1. (i) Derive the log-likelihood function, lnLT (θ). (ii) Derive and interpret the maximum likelihood estimator, θ̂. (iii) Compute the maximum likelihood estimate, θ̂. (iv) Compute ln lt(θ̂), gt(θ̂) and ht(θ̂). (v) Compute lnLT (θ̂), GT (θ̂) and HT (θ̂). (b) Repeat part (a) for the case where both the mean and the variance are unknown, θ = {µ, σ2}. (6) A Model of the Number of Strikes Gauss file(s) basic_count.g, strike.dat Matlab file(s) basic_count.m, strike.mat The data are the number of strikes per annum, yt, in the U.S. from 1968 to 1976, taken from Kennan (1985). The number of strikes is specified as a Poisson-distributed random variable with unknown parameter θ f (y; θ) = θy exp[−θ] y! . (a) Write the log-likelihood function for a sample of T observations. (b) Derive and interpret the maximum likelihood estimator of θ. (c) Estimate θ and interpret the result. (d) Use the estimate from part (c), to plot the distribution of the number of strikes and interpret this plot. 
32 The Maximum Likelihood Principle (e) Compute a histogram of yt and comment on its consistency with the distribution of strike numbers estimated in part (d). (7) A Model of the Duration of Strikes Gauss file(s) basic_strike.g, strike.dat Matlab file(s) basic_strike.m, strike.mat The data are 62 observations, taken from the same source as Exercise 6, of the duration of strikes in the U.S. per annum expressed in days, yt. Durations are assumed to be drawn from an exponential distributionwith unknown parameter θ f (y; θ) = 1 θ exp [ −y θ ] . (a) Write the log-likelihood function for a sample of T observations. (b) Derive and interpret the maximum likelihood estimator of θ. (c) Use the data on strike durations to estimate θ. Interpret the result. (d) Use the estimates from part (c) to plot the distribution of strike durations and interpret this plot. (e) Compute a histogram of yt and comment on its consistency with the distribution of duration times estimated in part (d). (8) Asset Prices Gauss file(s) basic_assetprices.g, assetprices.xls Matlab file(s) basic_assetprices.m, assetprices.mat The data consist of the Australian, Singapore and NASDAQ stock mar- ket indexes for the period 3 January 1989 to 31 December 2009, a total of T = 5478 observations. Consider the following model of asset prices, pt, that is commonly adopted in the financial econometrics literature ln pt − ln pt−1 = α+ ut , ut ∼ iidN(0, σ2) , where θ = {α, σ2} are unknown parameters. (a) Use the transformation of variable technique to show that the con- ditional distribution of p is the log-normal distribution f (p | pt−1; θ) = 1√ 2πσ2p exp [ − ln p− ln pt−1 − α 2σ2 ] . (b) For a sample of size T , construct the log-likelihood function and de- rive the maximum likelihood estimator of θ based on the conditional distribution of p. 1.6 Exercises 33 (c) Use the results in part (b) to compute θ̂ for the three stock indexes. (d) Estimate the asset price distribution for each index using the max- imum likelihood parameter estimates obtained in part (c). (e) Letting rt = ln pt − ln pt−1 represent the return on an asset, derive the maximum likelihood estimator of θ based on the distribution of rt. Compute θ̂ for the three stock market indexes and compare the estimates to those obtained in part (c). (9) Stationary Distribution of the Vasicek Model Gauss file(s) basic_stationary.g, eurodata.dat Matlab file(s) basic_stationary.m, eurodata.mat The data are daily 7-day Eurodollar rates, expressed as percentages, from 1 June 1973 to the 25 February 1995, a total of T = 5505 observa- tions. The Vasicek discrete time model of interest rates, rt, is rt − rt−1 = α+ βrt−1 + ut , ut ∼ iidN(0, σ2) , where θ = { α, β, σ2 } are unknown parameters and β < 0. (a) Show that the mean and variance of the stationary distribution are, respectively, µs = − α β , σ2s = − σ2 β (2 + β) . (b) Derive the maximum likelihood estimators of the parameters of the stationary distribution. (c) Compute the maximum likelihood estimates of the parameters of the stationary distribution using the Eurodollar interest rates. (d) Use the estimates from part (c) to plot the stationary distribution and interpret its properties. (10) Transitional Distribution of the Vasicek Model Gauss file(s) basic_transitional.g, eurodata.dat Matlab file(s) basic_transitional.m, eurodata.mat The data are the same daily 7-day Eurodollar rates, expressed in per- centages, as used in Exercise 9. 
The Vasicek discrete time model of interest rates, rt, is rt − rt−1 = α+ βrt−1 + ut , ut ∼ iidN(0, σ2) , where θ = { α, β, σ2 } are unknown parameters and β < 0. 34 The Maximum Likelihood Principle (a) Derive the maximum likelihood estimators of the parameters of the transitional distribution. (b) Compute the maximum likelihood estimates of the parameters of the transitional distribution using Eurodollar interest rates. (c) Use the estimates from part (b) to plot the transitional distribution where conditioning is based on the minimum, median and maximum interest rates in the sample. Interpret the properties of the three transitional distributions. (d) Use the results in part (b) to estimate the mean and the variance of the stationary distribution and compare them to the estimates obtained in part (c) of Exercise 9. 2 Properties of Maximum Likelihood Estimators 2.1 Introduction Under certain conditions known as regularity conditions, the maximum like- lihood estimator introduced in Chapter 1 possesses a number of important statistical properties and the aim of this chapter is to derive these prop- erties. In large samples, this estimator is consistent, efficient and normally distributed. In small samples, it satisfies an invariance property, is a func- tion of sufficient statistics and in some, but not all, cases, is unbiased and unique. As the derivation of analytical expressions for the finite-sample dis- tributions of the maximum likelihood estimator is generally complicated, computationally intensive methods based on Monte Carlo simulations or series expansions are used to examine many of these properties. The maximum likelihood estimator encompasses many other estimators often used in econometrics, including ordinary least squares and instrumen- tal variables (Chapter 5), nonlinear least squares (Chapter 6), the Cochrane- Orcutt method for the autocorrelated regression model (Chapter 7), weighted least squares estimation of heteroskedastic regression models (Chapter 8) and the Johansen procedure for cointegrated nonstationary time series mod- els (Chapter 18). 2.2 Preliminaries Before deriving the formal properties of the maximum likelihood estimator, four important preliminary concepts are reviewed. The first presents some stochastic models of time series and briefly discusses their properties. The second is concerned with the convergence of a sample average to its popu- lation mean as T → ∞, known as the weak law of large numbers. The third identifies the scaling factor ensuring convergence of scaled random variables 36 Properties of Maximum Likelihood Estimators to non-degenerate distributions. The fourth focuses on the form of the distri- bution of the sample average around its population mean as T → ∞, known as the central limit theorem. Four central limit theorems are discussed: the Lindeberg-Levy central limit theorem, the Lindeberg-Feller central limit the- orem, the martingale difference sequence central limit theorem and a mixing central limit theorem. These central limit theorems are extended to allow for nonstationary dependence using the functional central limit theorem in Chapter 16. 2.2.1 Stochastic Time Series Models and Their Properties In this section various classes of time series models and their properties are introduced. 
These stochastic processes and the behaviour of the moments of their probability distribution functions are particularly important in the es- tablishment of a range of convergence results and central limit theorems that enable the derivation of the properties of maximum likelihood estimators. Stationarity A variable yt is stationary if its distribution, or some important aspect of its distribution, is constant over time. There are two commonly used defi- nitions of stationarity known as weak (or covariance) and strong (or strict) stationarity. A variable that is not stationary is said to be nonstationary, a class of model that is discussed in detail in Part FIVE. Weak Stationarity The variable yt is weakly stationary if the first two unconditional moments of the joint distribution function F (y1, y2, · · · , yj) do not depend on t for all finite j. This definition is summarized by the following three properties Property 1 : E[yt] = µ <∞ Property 2 : var(yt) = E[(yt − µ)2] = σ2 <∞ Property 3 : cov(ytyt−k) = E[(yt − µ)(yt−k − µ)] = γk, k > 0. These properties require that the mean, µ, is constant and finite, that the variance, σ2, is constant and finite and that the covariance between yt and yt−k, γk, is a function of the time between the two points, k, and is not a function of time, t. Consider two snapshots of a time series which are s 2.2 Preliminaries 37 periods apart, a situation which can be represented schematically as follows y1, y2, · · · ys, ys+1, · · · yj, yj+1, · · · yj+s yj+s+1 · · · ︸ ︷︷ ︸ Period 1 (Y1) ︸ ︷︷ ︸ Period 2 (Y2) Here Y1 and Y2 represent the time series of the two sub-periods. An im-plication of weak stationarity is that Y1 and Y2 are governed by the same parameters µ, σ2 and γk. Example 2.1 Stationary AR(1) Model Consider the AR(1) process yt = α+ ρyt−1 + ut, ut ∼ iid (0, σ 2), with |ρ| < 1. This process is stationary since µ = E[yt] = α 1− ρ σ2 = E[(yt − µ)2] = σ2 1− ρ2 γk = E[(yt − µ)(yt−k − µ)] = σ2ρk 1− ρ2 . Strict Stationarity The variable yt is strictly stationary if the joint distribution function F (y1, y2, · · · , yj) do not depend on t for all finite j. Strict stationarity requires that the joint distribution function of two time series s periods apart is invariant with respect to an arbitrary time shift. That is F (y1, y2, · · · , yj) = F (y1+s, y2+s, · · · , yj+s) . As strict stationarity requires that all the moments of yt, if they exist, are independent of t, it follows that higher-order moments such as E[(yt − µ)(yt−k − µ)] = E[(yt+s − µ)(yt+s−k − µ)] E[(yt − µ)(yt−k − µ)2] = E[(yt+s − µ)(yt+s−k − µ)2] E[(yt − µ)2(yt−k − µ)2] = E[(yt+s − µ)2(yt+s−k − µ)2] , must be functions of k only. Strict stationarity does not require the existence of the first two moments 38 Properties of Maximum Likelihood Estimators of the joint distribution of yt. For the special case in which the first two mo- ments do exist and are finite, µ, σ2 <∞, and the joint distribution function is a normal distribution, weak and strict stationarity are equivalent. In the case where the first two moments of the joint distribution do not exist, yt can be strictly stationary, but not weakly stationary. An example is where yt is iid with a Cauchy distribution, which is strictly stationary but has no fi- nite moments and is therefore not weakly stationary. Another example is an IGARCH model model discussed in Chapter 20, which is strictly stationary but not weakly stationary because the unconditional variance does not exist. 
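Returning to Example 2.1, the three moment expressions can be verified by simulation. The MATLAB sketch below is illustrative only; the parameter values and sample size are arbitrary assumptions.

% Sample versus population moments of a stationary AR(1) process.
T = 100000;  a = 1.0;  rho = 0.8;  sig2 = 2.0;
y = zeros(T,1);  y(1) = a/(1 - rho);      % start at the unconditional mean
u = sqrt(sig2)*randn(T,1);
for t = 2:T
    y(t) = a + rho*y(t-1) + u(t);
end

yd = y - mean(y);
gamma1 = mean(yd(2:end).*yd(1:end-1));    % sample first-order autocovariance
fprintf('mean:   %8.4f  (population %8.4f)\n', mean(y),     a/(1 - rho));
fprintf('var:    %8.4f  (population %8.4f)\n', mean(yd.^2), sig2/(1 - rho^2));
fprintf('gamma1: %8.4f  (population %8.4f)\n', gamma1,      sig2*rho/(1 - rho^2));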
An implication of the definition of stationarity is that if yt is stationary then any function of a stationary process is also stationary, such as higher order terms y2t , y 3 t , y 4 t . Martingale Difference Sequence A martingale difference sequence (mds) is defined in terms of its first conditional moment having the property Et−1[yt] = E[yt|yt−1, yt−2, · · · ] = 0 . (2.1) This condition shows that information at time t−1 cannot be used to forecast yt. Two important properties of a mds arising from (2.1) are Property 1 : E[yt] = E[Et−1[yt]] = E[0] = 0 Property 2 : E[Et−1[ytyt−k]] = E[yt−kEt−1[yt]] = E[yt−k × 0] = 0. The first property is that the unconditional mean of a mds is zero which follows by using the law of iterated expectations. The second property shows that a mds is uncorrelated with past values of yt. The condition in (2.1) does not, however, rule out higher-order moment dependence. Example 2.2 Nonlinear Time Series Consider the nonlinear time series model yt = utut−1 , ut ∼ iid (0, σ 2) . The process yt is a mds because Et−1[yt] = Et−1[utut−1] = Et−1[ut]ut−1 = 0 , since Et−1[ut] = E[ut] = 0. The process yt nonetheless exhibits dependence 2.2 Preliminaries 39 in the higher order moments. For example cov[y2t , y 2 t−1] = E[y 2 t y 2 t−1]− E[y2t ]E[y2t−1] = E[u2tu 4 t−1u 2 t−2]− E[u2tu2t−1]E[u2t−1u2t−2] = E[u2t ]E[u 4 t−1]E[u 2 t−2]− E[u2t ]E[u2t−1]2E[u2t−2] = σ4(E[u4t−1]− σ4) 6= 0 . Example 2.3 Autoregressive Conditional Heteroskedasticity Consider the ARCH model from Example 1.8 in Chaper 1 given by yt = zt √ α0 + α1y 2 t−1 , zt ∼ iidN(0, 1) . Now yt is a mds because Et−1 [yt] = Et−1 [ zt √ α0 + α1y2t−1 ] = Et−1 [zt] √ α0 + α1y2t−1 = 0 , since Et−1 [zt] = 0. The process yt nonetheless exhibits dependence in the second moment because Et−1[y 2 t ] = Et−1[z 2 t (α0 + α1y 2 t−1)] = Et−1 [ z2t ] (α0 + α1y 2 t−1) = α0 + α1y 2 t−1 , by using the property Et−1[z2t ] = E[z 2 t ] = 1. In contrast to the properties of stationary time series, a function of a mds is not necessarily a mds. White Noise For a process to be white noise its first and second unconditional moments must satisfy the following three properties Property 1 : E[yt] = 0 Property 2 : E[y2t ] = σ 2 <∞ Property 3 : E[ytyt−k] = 0, k > 0. White noise is a special case of a weakly stationary process with mean zero, constant and finite variance, σ2, and zero covariance between yt and yt−k. A mds with finite and constant variance is also a white noise process since the first two unconditional moments exist and the process is not correlated. If a mds has infinite variance, then it is not white noise. Similarly, a white noise process is not necessarily a mds, as demonstrated by the following example. Example 2.4 Bilinear Time Series 40 Properties of Maximum Likelihood Estimators Consider the bilinear time series model yt = ut + δut−1ut−2 , ut ∼ iid (0, σ 2) , where δ is a parameter. The process yt is white noise since E[yt] = E[ut + δut−1ut−2] = E[ut] + δE[ut−1]E[ut−2] = 0 , E[y2t ] = E[(ut + δut−1ut−2) 2] = E[u2t + δ 2u2t−1u 2 t−2 + 2δutut−1ut−2] = σ 2(1 + δ2σ2) <∞ E[ytyt−k] = E[(ut + δut−1ut−2)(ut−k + δut−1−kut−2−k)] = E[utut−k + δut−1ut−2ut−k + δutut−1ut−2−k + δ 2ut−1ut−2ut−1−kut−2−k] = 0 , where the last step follows from the property that every term contains at least two disturbances occurring at different points in time. However, yt is not a mds because Et−1 [yt] = Et−1 [ut + δut−1ut−2] = Et−1 [ut] + Et−1 [δut−1ut−2] = δut−1ut−2 6= 0 . 
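Example 2.4 can also be checked numerically: simulated data from the bilinear model have sample autocorrelations close to zero, yet y_t remains predictable from u_{t-1}u_{t-2}. The following sketch is illustrative; the value of δ and the sample size are arbitrary assumptions.

% White noise but not a mds: the bilinear model of Example 2.4.
T = 200000;  delta = 0.8;
u = randn(T,1);
y = zeros(T,1);
y(3:T) = u(3:T) + delta*u(2:T-1).*u(1:T-2);

yd = y - mean(y);
rho1 = sum(yd(2:end).*yd(1:end-1))/sum(yd.^2);   % close to zero
rho2 = sum(yd(3:end).*yd(1:end-2))/sum(yd.^2);   % close to zero
fprintf('sample autocorrelations: %6.3f  %6.3f\n', rho1, rho2);

% y_t is nonetheless predictable: regressing y_t on u_{t-1}u_{t-2}
% recovers delta, so the conditional mean E_{t-1}[y_t] is not zero.
x = u(2:T-1).*u(1:T-2);
b = (x'*y(3:T))/(x'*x);
fprintf('coefficient on u(t-1)u(t-2): %6.3f  (delta = %4.2f)\n', b, delta);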
Mixing As martingale difference sequences are uncorrelated, it is important also to consider alternative processes that exhibit autocorrelation. Consider two sub-periods of a time series s periods apart First sub-period Second sub-period ..., yt−2, yt−1, yt︸ ︷︷ ︸ yt+1, yt+2, ..., yt+s−1 yt+s, yt+s+1, yt+s+2, ...︸ ︷︷ ︸ Y t−∞ Y ∞ t+s where Y st = (yt, yt+1, · · · , ys). If cov ( g ( Y t−∞ ) , h ( Y∞t+s )) → 0 as s→ ∞, (2.2) where g(·) and h(·) are arbitrary functions, then as Y t−∞ and Y∞t+s become more widely separated in time, they behave like independent sets of random variables. A process satisfying (2.2) is known as mixing (technically α-mixing or strong mixing). The concepts of strong stationarity and mixing have the convenient property that if they apply to yt then they also apply to functions of yt. A more formal treatment of mixing is provided by White (1984) An iid process is mixing because all the covariances are zero and the mixing condition (2.2) is satisfied trivially. As will become apparent from the 2.2 Preliminaries 41 results for stationary time series models presented in Chapter 13, a MA(q) process with iid disturbances is mixing because it has finite dependence so that condition (2.2) is satisfied for k > q. Provided that the additional assumption is made that ut in Example 2.1 is normally distributed, the AR(1) process is mixing since the covariance between yt and yt−k decays at an exponential rate as k increases, which implies that (2.2) is satisfied. If ut does not have a continuous distribution then yt may no longer be mixing (Andrews, 1984). 2.2.2 Weak Law of Large Numbers The stochastic time series models discussed in the previous section are de- fined in terms of probability distributions with moments defined in terms of the parameters of these distributions. As maximum likelihood estimators are sample statistics of the data in samples of size T , it is of interest to identify the relationship between the population parameters and the sample statistics as T → ∞. Let {y1, y2, · · · , yT } represent a set of T iid random variables from a distribution with a finite mean µ. Consider the statistic based on the sample mean y = 1 T T∑ t=1 yt . (2.3) The weak law of large numbers is about determining what happens to y as the sample size T increases without limit, T → ∞. Example 2.5 Exponential Distribution Figure 2.1 gives the results of a simulation experiment from computing sample means of progressively larger samples of size T = 1, 2, · · · , 500, com- prising iid draws from the exponential distributionf(y;µ) = 1 µ exp [ −y µ ] , y > 0, with population mean µ = 5. For relatively small sample sizes, y is quite volatile, but settles down as T increases. The distance between y and µ eventually lies within a ‘small’ band of length r = 0.2, that is |y − µ| < r, as represented by the dotted lines. An important feature of Example 2.5 is that y is a random variable, whose value in any single sample need not necessarily equal µ in any deterministic 42 Properties of Maximum Likelihood Estimators y T 0 100 200 300 400 500 3 4 5 6 7 Figure 2.1 The Weak Law of Large Numbers for sample means based on progressively increasing sample sizes drawn from the exponential distribu- tion with mean µ = 5. The dotted lines represent µ± r with r = 0.2. sense, but, rather, y is simply ‘close enough’ to the value of µ with probability approaching 1 as T → ∞. 
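A figure like Figure 2.1 can be generated with a few lines of code. The MATLAB sketch below is an illustrative fragment (the mean, sample size and band width follow the example; everything else is an arbitrary assumption) in which the running sample mean of exponential draws with µ = 5 eventually settles inside the band µ ± 0.2.

% Running sample means of iid exponential draws with population mean 5.
mu = 5;  Tmax = 500;
y = -mu*log(rand(Tmax,1));            % exponential draws by inverse transform
ybar = cumsum(y)./(1:Tmax)';          % sample mean for T = 1,...,Tmax

for T = [50 100 200 500]
    fprintf('T = %4d   ybar = %6.3f\n', T, ybar(T));
end
frac = mean(abs(ybar(200:end) - mu) < 0.2);
fprintf('share of T >= 200 with |ybar - mu| < 0.2: %5.2f\n', frac);
% plotting ybar against T reproduces the pattern in Figure 2.1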
This property is written formally as lim T→∞ Pr(|y − µ| < r) = 1 , for any r > 0, or, more compactly, as plim(y) = µ, where the notation plim represents the limit in a probability sense. This is the Weak Law of Large Numbers (WLLN), which states that the sample mean converges in probability to the population mean 1 T T∑ t=1 yt p→ E[yt] = µ , (2.4) where p denotes the convergence in probability or plim. This result also extends to higher order moments 1 T T∑ t=1 yit p→ E[yit] , i > 0 . (2.5) A necessary condition needed for the weak law of large numbers to be sat- isfied is that µ is finite (Stuart and Ord, 1994, p.310). A sufficient condition is that E[y] → µ and var(y) → 0 as T → ∞, so that the sampling distribu- tion of y converges to a degenerate distribution with all its probability mass concentrated at the population mean µ. Example 2.6 Uniform Distribution 2.2 Preliminaries 43 Assume that y has a uniform distribution f(y) = 1 , −0.5 < y < 0.5 . The first four population moments are 0.5∫ −0.5 yf(y)dy = 0, 0.5∫ −0.5 y2f(y)dy = 1 12 , 0.5∫ −0.5 y3f(y)dy = 0, 0.5∫ −0.5 y4f(y)dy = 1 80 . Properties of the moments of y simulated from samples of size T drawn from the uniform distribution (−0.5, 0.5). The number of replications is 50000 and the moments have been scaled by 10000. T 1 T T∑ t=1 yt 1 T T∑ t=1 y2t 1 T T∑ t=1 y3t 1 T T∑ t=1 y4t Mean Var. Mean Var. Mean Var. Mean Var. 50 -1.380 16.828 833.960 1.115 -0.250 0.450 125.170 0.056 100 0.000 8.384 833.605 0.555 -0.078 0.224 125.091 0.028 200 0.297 4.207 833.499 0.276 0.000 0.112 125.049 0.014 400 -0.167 2.079 833.460 0.139 -0.037 0.056 125.026 0.007 800 0.106 1.045 833.347 0.070 0.000 0.028 125.004 0.003 Table 2.6 gives the mean and the variance of simulated samples of size T = {50, 100, 200, 400, 800} for the first four moments given in equation (2.5). The results demonstrate the two key properties of the weak law of large numbers: the means of the sample moments converge to their population means and their variances all converge to zero, with the variance roughly halving as T is doubled. Some important properties of plims are as follows. Let y1 and y2 be the means of two samples of size T, from distributions with respective population means, µ1 and µ2, and let c(·) be a continuous function that is not dependent on T , then Property 1 : plim(y1 ± y2) = plim(y1)± plim(y2) = µ1 ± µ2 Property 2 : plim(y1y2) = plim(y1)plim(y2) = µ1µ2 Property 3 : plim (y1 y2 ) = plim(y1) plim(y2) = µ1 µ2 (µ2 6= 0) Property 4 : plim c(y) = c(plim(y)) . Property 4 is known as Slutsky’s theorem (see also Exercise 3). These results 44 Properties of Maximum Likelihood Estimators generalize to the vector case, where the plim is taken with respect to each element separately. The WLLN holds under weaker conditions than the assumption of an iid process. Assuming only that var(yt) < ∞ for all t, the variance of y can always be written as var(y) = 1 T 2 T∑ t=1 T∑ s=1 cov(yt, ys) = 1 T 2 T∑ t=1 var(yt)+2 1 T 2 T−1∑ s=1 T∑ t=s+1 cov(yt, yt−s). If yt is weakly stationary then this simplifies to var (y) = 1 T γ0 + 2 1 T T∑ s=1 ( 1− s T ) γs, (2.6) where γs = cov (yt, yt−s) are the autocovariances of yt for s = 0, 1, 2, · · · . If yt is either iid or a martingale difference sequence or white noise, then γs = 0 for all s ≥ 1. In that case (2.6) simplifies to var (y) = 1 T γ0 → 0 as T → ∞ and the WLLN holds. If yt is autocorrelated then a sufficient condition for the WLLN is that |γs| → 0 as s → ∞. 
To show why this works, consider the second term on the right hand side of (2.6). If follows from the triangle inequality that ∣∣∣∣∣ 1 T T∑ s=1 ( 1− s T ) γs ∣∣∣∣∣ ≤ 1 T T∑ s=1 ( 1− s T ) |γs| ≤ 1 T T∑ s=1 |γs| since 1− s/T < 1 → 0 as T → ∞, where the last step uses Cesaro summation.1 This implies that var(y) given in (2.6) disappears as T → ∞. Thus, any weakly stationary time series whose autocovariances satisfy |γs| → 0 as s → ∞ will obey the WLLN (2.4). If yt is weakly stationary and strong mixing, then |γs| → 0 as s → ∞ follows by definition, so the WLLN applies to this general class of processes as well. Example 2.7 WLLN for an AR(1) Model In the stationary AR(1) model from Example 2.1, since |ρ| < 1 it follows 1 If at → a as t → ∞ then T−1 ∑T t=1 at → a as T → ∞. 2.2 Preliminaries 45 that γs = σ2ρs 1− ρ2 , so that the condition |γs| → 0 as s→ ∞ is clearly satisfied. This shows the WLLN applies to a stationary AR(1) process. 2.2.3 Rates of Convergence The weak law of large numbers in (2.4) involves computing statistics based on averaging random variables over a sample of size T . Establishing many of the results of the maximum likelihood estimator requires choosing the cor- rect scaling factor to ensure that the relevant statistics have non-degenerate distributions. Example 2.8 Linear Regression with Stochastic Regressors Consider the linear regression model yt = βxt + ut , ut ∼ iidN(0, σ 2 u) where xt is a iid drawing from the uniform distribution on the interval (−0.5, 0.5) with variance σ2x and xt and ut are independent. It follows that E[xtut] = 0. The maximum likelihood estimator of β is β̂ = [ T∑ t=1 x2t ]−1 T∑ t=1 xtyt = β + [ T∑ t=1 x2t ]−1 T∑ t=1 xtut , where the last term is obtained by substituting for yt. This expression shows that the relevant moments to consider are ∑T t=1 xtut and ∑T t=1 x 2 t . The appropriate scaling of the first moment to ensure that it has a non-degenerate distribution follows from E[T−k T∑ t=1 xtut] = 0 var ( T−k T∑ t=1 xtut ) = T−2kvar ( T∑ t=1 xtut ) = T 1−2kσ2uσ 2 x , which hold for any k. Consequently the appropriate choice of scaling fac- tor is k = 1/2 because T−1/2 stabilizes the variance and thus prevents it approaching 0 (k > 1/2) or ∞ (k < 1/2). This property is demonstrated in Table 2.2.3, which gives simulated moments for alternative scale factors, where β = 1, σ2u = 2 and σ 2 x = 1/12. The variances show that only with the 46 Properties of Maximum Likelihood Estimators scale factor T−1/2 does ∑T t=1 xtut have a non-degenerate distribution with mean converging to 0 and variance converging to var(xtut) = var(ut)× var(xt) = 2× 1 12 = 0.167, . Since 1 T T∑ t=1 x2t p→ σ2x , by the WLLN, it follows that the distribution of √ T (β̂−β) is non-degenerate because the variance of both terms on the right hand side of √ T (β̂ − β) = [ 1 T T∑ t=1 x2t ]−1[ 1√ T T∑ t=1 xtut ] , converge to finite non-zero values. Simulation properties of the moments of the linear regression model using alternative scale factors. The parameters are θ = {β = 1.0, σ2u = 2.0}, the number of replications is 50000, ut is drawn from N(0, 2) and the stochastic regressor xt is drawn from a uniform distribution with support (−0.5, 0.5). T 1 T 1/4 T∑ t=1 xtut 1 T 1/2 T∑ t=1 xtut 1 T 3/4 T∑ t=1 xtut 1 T T∑ t=1 xtut Mean Var. Mean Var. Mean Var. Mean Var. 
50 -0.001 1.177 0.000 0.166 0.000 0.024 0.000 0.003 100 -0.007 1.670 -0.002 0.167 -0.001 0.017 0.000 0.002 200 -0.014 2.378 -0.004 0.168-0.001 0.012 0.000 0.001 400 -0.001 3.373 0.000 0.169 0.000 0.008 0.000 0.000 800 0.007 4.753 0.001 0.168 0.000 0.006 0.000 0.000 Determining the correct scaling factors for derivatives of the log-likelihood function is important to establishing the asymptotic distribution of the max- imum likelihood estimator in Section 2.5.2. The following example highlights this point. Example 2.9 Higher-Order Derivatives The log-likelihood function associated with an iid sample {y1,y2, · · · , yT } 2.2 Preliminaries 47 from the exponential distribution is lnLT (θ) = ln θ − θ T T∑ t=1 yt . The first four derivatives are d lnLT (θ) dθ = θ−1 − 1 T T∑ t=1 yt d2 lnLT (θ) dθ2 = −θ−2 d3 lnLT (θ) dθ3 = 2θ−3 d4 lnLT (θ) dθ4 = −6θ−4 . The first derivative GT (θ) = θ −1 − 1T ∑T t=1 yt is an average of iid random variables, gt(θ) = θ −1 − yt. The scaled first derivative √ TGT (θ) = 1√ T T∑ t=1 gt(θ) , has zero mean and finite variance because var (√ TGT (θ) ) = 1 T T∑ t=1 var(θ−1 − yt) = 1 T T∑ t=1 θ−2 = θ−2 , by using the iid assumption and the fact that E[(yt − θ−1)2] = θ−2 for the exponential distribution. All the other derivatives already have finite limits as they are independent of T . 2.2.4 Central Limit Theorems The previous section established the appropriate scaling factor needed to ensure that a statistic has a non-degenerate distribution. The aim of this section is to identify the form of this distribution as T → ∞, referred to as the asymptotic distribution. The results are established in a series of four central limit theorems. Lindeberg-Levy Central Limit Theorem Let {y1, y2, · · · , yT } represent a set of T iid random variables from a distribution with finite mean µ and finite variance σ2 > 0. The Lindeberg- Levy central limit theorem for the scalar case states that √ T (y − µ) d→ N(0, σ2), (2.7) where d→ represents convergence of the distribution as T → ∞. In terms of 48 Properties of Maximum Likelihood Estimators standardized random variables, the central limit theorem is z = √ T (y − µ) σ d→ N(0, 1) . (2.8) Alternatively, the asymptotic distribution is given by rearranging (2.7) as y a ∼ N(µ, σ2 T ), (2.9) where a ∼ signifies convergence to the asymptotic distribution. The fundamen- tal difference between (2.7) and (2.9) is that the former represents a normal distribution with zero mean and constant variance in the limit, whereas the latter represents a normal distribution with mean µ, but with a variance that approaches zero as T grows, resulting in all of its mass concentrated at µ in the limit. Example 2.10 Uniform Distribution Let {y1, y2, · · · , yT } represent a set of T iid random variables from the uniform distribution f(y) = 1, 0 < y < 1 . The conditions of the Lindeberg-Levy central limit theorem are satisfied, because the random variables are iid with finite mean and variance given by µ = 1/2 and σ2 = 1/12, respectively. Based on 5, 000 draws, the sampling distribution of z = √ T (y − µ) σ = √ T (y − 1/2)√ 12 , for samples of size T = 2 and T = 10, are shown in panels (a) and (b) of Fig- ure 2.2 respectively. Despite the population distribution being non-normal, the sampling distributions approach the standardized normal distribution very quickly. Also shown are the corresponding asymptotic distributions of y in panels (c) and (d), which become more compact around µ = 1/2 as T increases. 
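The behaviour in Example 2.10 is easy to reproduce. The following illustrative MATLAB fragment (the number of replications matches the example; nothing else is taken from the book's code) simulates the standardized statistic z for T = 2 and T = 10 and reports its first three moments, which are already close to those of a standard normal random variable.

% Lindeberg-Levy CLT for uniform draws: moments of z for T = 2 and T = 10.
reps = 5000;  mu = 0.5;  sig = sqrt(1/12);
for T = [2 10]
    ybar = mean(rand(T, reps), 1);            % sample means, one per replication
    z = sqrt(T)*(ybar - mu)/sig;              % standardized sample means
    skew = mean((z - mean(z)).^3)/var(z)^1.5;
    fprintf('T = %2d: mean = %6.3f, var = %5.3f, skewness = %6.3f\n', ...
            T, mean(z), var(z), skew);
end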
Example 2.11 Linear Regression with iid Regressors
Assume that the joint distribution of yt and xt is iid and

y_t = \beta x_t + u_t , \qquad u_t \sim iid\,(0, \sigma_u^2) ,

where E[u_t | x_t] = 0 and E[u_t^2 | x_t] = \sigma_u^2. From Example 2.8, the least squares estimator of \beta is expressed as

\sqrt{T}(\widehat{\beta} - \beta) = \Big[ \frac{1}{T}\sum_{t=1}^{T} x_t^2 \Big]^{-1} \Big[ \frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_t u_t \Big] .

[Figure 2.2 Demonstration of the Lindeberg-Levy central limit theorem with a uniform population distribution: panels (a) and (b) show the distribution of z for T = 2 and T = 10, and panels (c) and (d) show the distribution of \bar{y} for T = 2 and T = 10.]

To establish the asymptotic distribution of \widehat{\beta}, the following results are required

\frac{1}{T}\sum_{t=1}^{T} x_t^2 \xrightarrow{p} \sigma_x^2 , \qquad \frac{1}{\sqrt{T}}\sum_{t=1}^{T} x_t u_t \xrightarrow{d} N(0, \sigma_u^2 \sigma_x^2) ,

where the first result follows from the WLLN and the second result is an application of the Lindeberg-Levy central limit theorem. Combining these results yields

\sqrt{T}(\widehat{\beta} - \beta) \xrightarrow{d} N\Big( 0, \frac{\sigma_u^2}{\sigma_x^2} \Big) .

This is the usual expression for the asymptotic distribution of the maximum likelihood (least squares) estimator.

The Lindeberg-Levy central limit theorem generalizes to the case where y_t is a vector with mean \mu and covariance matrix V

\sqrt{T}(\bar{y} - \mu) \xrightarrow{d} N(0, V) .   (2.10)

Lindeberg-Feller Central Limit Theorem
The Lindeberg-Feller central limit theorem is applicable to models based on independent and non-identically distributed random variables, in which y_t has time-varying mean \mu_t and time-varying covariance matrix V_t. For the scalar case, let \{y_1, y_2, \cdots, y_T\} represent a set of T independent and non-identically distributed random variables from a distribution with finite time-varying means E[y_t] = \mu_t < \infty, finite time-varying variances var(y_t) = \sigma_t^2 < \infty and finite higher-order moments. The Lindeberg-Feller central limit theorem gives necessary and sufficient conditions for

\sqrt{T}\Big( \frac{\bar{y} - \bar{\mu}}{\bar{\sigma}} \Big) \xrightarrow{d} N(0, 1) ,   (2.11)

where

\bar{\mu} = \frac{1}{T}\sum_{t=1}^{T} \mu_t , \qquad \bar{\sigma}^2 = \frac{1}{T}\sum_{t=1}^{T} \sigma_t^2 .   (2.12)

A sufficient condition for the Lindeberg-Feller central limit theorem is given by

E[|y_t - \mu_t|^{2+\delta}] < \infty , \quad \delta > 0 ,   (2.13)

uniformly in t. This is known as the Lyapunov condition, which operates on moments higher than the second moment. This requirement is in fact a stricter condition than is needed to satisfy this theorem, but it is more intuitive and tends to be an easier condition to demonstrate than the conditions initially proposed by Lindeberg and Feller. Although this condition is applicable to all moments marginally higher than the second, namely 2 + \delta, considering the first integer moment to which the condition applies, namely the third moment obtained by setting \delta = 1 in (2.13), is of practical interest. The condition now becomes

E[|y_t - \mu_t|^3] < \infty ,   (2.14)

which represents a restriction on the standardized third moment, or skewness, of y_t.

Example 2.12 Bernoulli Distribution
Let \{y_1, y_2, \cdots, y_T\} represent a set of T independent random variables with time-varying probabilities \theta_t from a Bernoulli distribution

f(y; \theta_t) = \theta_t^{y} (1 - \theta_t)^{1-y} , \quad 0 < \theta_t < 1 .

From the properties of the Bernoulli distribution, the mean and the variance are time-varying since \mu_t = \theta_t and \sigma_t^2 = \theta_t(1 - \theta_t). As 0 < \theta_t < 1, then

E[|y_t - \mu_t|^3] = \theta_t(1 - \theta_t)^3 + (1 - \theta_t)\theta_t^3 = \sigma_t^2 \big( (1 - \theta_t)^2 + \theta_t^2 \big) \le \sigma_t^2 ,

since (1 - \theta_t)^2 + \theta_t^2 \le 1, so the third moment is bounded.
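Example 2.12 can be pushed a step further by simulation: even though the Bernoulli draws are not identically distributed, the standardized mean in (2.11) is approximately standard normal. The sketch below is illustrative only; the particular sequence of probabilities θ_t is an arbitrary assumption.

% Lindeberg-Feller CLT with independent, heterogeneous Bernoulli draws.
T = 200;  reps = 20000;
theta = 0.2 + 0.6*(1:T)'/T;                   % time-varying probabilities
mu_bar   = mean(theta);                       % average mean, equation (2.12)
sig2_bar = mean(theta.*(1 - theta));          % average variance

y    = rand(T, reps) < repmat(theta, 1, reps);    % y(t,r) ~ Bernoulli(theta_t)
ybar = mean(y, 1);
z    = sqrt(T)*(ybar - mu_bar)/sqrt(sig2_bar);    % statistic in (2.11)
fprintf('mean(z) = %6.3f,  var(z) = %5.3f\n', mean(z), var(z));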
2.2 Preliminaries 51 Example 2.13 Linear Regression with Bounded Fixed Regressors Consider the linear regression model yt = βxt + ut , ut ∼ iid (0, σ 2 u) , where ut has finite third moment E[u 3 t ] = κ3 and xt is a uniformly bounded fixed regressor, such as a constant, a level shift dummy variable or seasonal dummy variables.2 From Example 2.8 the least squares estimator of β̂ is √ T (β̂ − β) = [ 1 T T∑ t=1 x2t ]−1 1√ T T∑ t=1 xtut. The Lindeberg-Feller central limit theorem based on the Lyapunov condition applies to the product xtut, because the terms are independent for all t, with mean, variance and uniformly bounded third moment, respectively, µ = 0 , σ2 = 1 T T∑ t=1 var (xtut) = σ 2 u 1 T T∑ t=1 x2t , E[x 3 tu 3 t ] = x 3 tκ3 <∞ . Substituting into (2.11) gives ( ∑T t=1 x 2 t ) 1/2 σu (β̂ − β) = √ T T−1 ∑T t=1 xtut σ d→ N (0, 1) . As in the case of the Lindeberg-Levy central limit theorem, the Lindeberg- Feller central limit theorem generalizes to independent and non-identically distributed vector randomvariables with time-varying vector mean µt and time-varying positive definite covariance matrix Vt. The theorem states that √ T V −1/2 (y − µ) d→ N(0, I), (2.15) where µ = 1 T T∑ t=1 µt , V = 1 T T∑ t=1 Vt , (2.16) and V −1/2 represents the square root of the matrix V . Martingale Difference Central Limit Theorem The martingale difference central limit theorem is essentially the Lindberg- Levy central limit theorem, but with the assumption that yt = {y1, y2, · · · , yT } represents a set of T iid random variables being replaced with the more 2 An example of a fixed regressor that is not uniformly bounded in t is a time trend xt = t. 52 Properties of Maximum Likelihood Estimators general assumption that yt is a martingale difference sequence. If yt is a martingale difference sequence with mean and variance y = 1 T T∑ t=1 yt , σ 2 = 1 T T∑ t=1 σ2t , and provided that higher order moments are bounded, E[|yt|2+δ] <∞ , δ > 0 , (2.17) and 1 T T∑ t=1 y2t − σ2T p→ 0 , (2.18) then the martingale difference central limit theorem states √ T ( y σ ) d→ N(0, 1) . (2.19) The martingale difference property weakens the iid assumption, but the assumptions that the sample variance must consistently estimate the average variance and the boundedness of higher moments in (2.17) are stronger than those required for the Lindeberg-Levy central limit theorem. Example 2.14 Linear AR(1) Model Consider the autoregressive model from Example 1.5 in Chapter 1, where for convenience the sample contains T +1 observations, yt = {y0, y1, · · · yT}. yt = ρyt−1 + ut , ut ∼ iid (0, σ 2) , with finite fourth moment E[u4t ] = κ4 < ∞ and |ρ| < 1. The least squares estimator of ρ̂ is ρ̂ = ∑T t=1 ytyt−1∑T t=1 y 2 t−1 . Rearranging and introducing the scale factor √ T gives √ T (ρ̂− ρ) = [ 1 T T∑ t=1 y2t−1 ]−1[ 1√ T T∑ t=1 utyt−1 ] . To use the mds central limit theorem to find the asymptotic distribution of ρ̂, it is necessary to establish that xtut satisfies the conditions of the theorem and also that T−1 ∑T t=2 y 2 t−1 satisfies the WLLN. The product utyt−1 is a mds because Et−1[utyt−1] = Et−1[ut] yt−1 = 0 , 2.2 Preliminaries 53 since Et−1[ut] = 0. To establish that the conditions of the mds central limit theorem are satisfied, define µ = 1 T T∑ t=1 utyt−1 σ2 = 1 T T∑ t=1 σ2t = 1 T T∑ t=1 var(utyt−1) = 1 T T∑ t=1 σ4 1− ρ2 = σ4 1− ρ2 . To establish the boundedness condition in (2.17), choose δ = 2, so that E[|utyt−1|4] = E[u4t ]E[y4t−1] <∞ , because κ4 <∞ and it can be shown that E[y4t−1] <∞ provided that |ρ| < 1. 
To establish (2.18), write

\frac{1}{T}\sum_{t=1}^{T} u_t^2 y_{t-1}^2 = \frac{1}{T}\sum_{t=1}^{T} (u_t^2 - \sigma^2) y_{t-1}^2 + \sigma^2 \frac{1}{T}\sum_{t=1}^{T} y_{t-1}^2 .

The first term is the sample mean of a mds, which has mean zero, so the weak law of large numbers gives

\frac{1}{T}\sum_{t=1}^{T} (u_t^2 - \sigma^2) y_{t-1}^2 \xrightarrow{p} 0 .

The second term is the sample mean of a stationary process and the weak law of large numbers gives

\frac{1}{T}\sum_{t=1}^{T} y_{t-1}^2 \xrightarrow{p} E[y_{t-1}^2] = \frac{\sigma^2}{1 - \rho^2} .

Thus, as required by (2.18),

\frac{1}{T}\sum_{t=1}^{T} u_t^2 y_{t-1}^2 \xrightarrow{p} \frac{\sigma^4}{1 - \rho^2} .

Therefore, from the statement of the mds central limit theorem in (2.19) it follows that

\sqrt{T}\,\bar{y} = \frac{1}{\sqrt{T}}\sum_{t=1}^{T} u_t y_{t-1} \xrightarrow{d} N\Big( 0, \frac{\sigma^4}{1 - \rho^2} \Big) .

The asymptotic distribution of \widehat{\rho} is therefore

\sqrt{T}(\widehat{\rho} - \rho) \xrightarrow{d} \Big[ \frac{\sigma^2}{1 - \rho^2} \Big]^{-1} \times N\Big( 0, \frac{\sigma^4}{1 - \rho^2} \Big) = N(0, 1 - \rho^2) .

The martingale difference sequence central limit theorem also applies to vector processes with covariance matrix V_t

\sqrt{T}\,\bar{\mu} \xrightarrow{d} N(0, V) ,   (2.20)

where

V = \frac{1}{T}\sum_{t=1}^{T} V_t .

Mixing Central Limit Theorem
As will become apparent in Chapter 9, in some situations it is necessary to have a central limit theorem that applies to autocorrelated processes. This is particularly pertinent to situations in which models do not completely specify the dynamics of the dependent variable. If y_t has zero mean, E[|y_t|^r] < \infty uniformly in t for some r > 2, and y_t is mixing at a sufficiently fast rate, then the following central limit theorem applies

\frac{1}{\sqrt{T}}\sum_{t=1}^{T} y_t \xrightarrow{d} N(0, J) ,   (2.21)

where

J = \lim_{T \to \infty} \frac{1}{T} E\Big[ \Big( \sum_{t=1}^{T} y_t \Big)^2 \Big] ,

assuming this limit exists. If y_t is also weakly stationary, the expression for J simplifies to

J = E[y_t^2] + 2\sum_{j=1}^{\infty} E[y_t y_{t-j}] = var(y_t) + 2\sum_{j=1}^{\infty} cov(y_t, y_{t-j}) ,   (2.22)

which shows that the asymptotic variance of the sample mean depends on the variance and all autocovariances of y_t. See Theorem 5.19 of White (1984) for further details of the mixing central limit theorem.

Example 2.15 Sample Moments of an AR(1) Model
Consider the AR(1) model

y_t = \rho y_{t-1} + u_t , \qquad u_t \sim iid\, N(0, \sigma^2) ,

where |\rho| < 1. The asymptotic distributions of the sample mean and variance of y_t are obtained as follows. Since y_t is stationary, mixing, has mean zero and all moments finite (by normality), the mixing central limit theorem in (2.21) applies to the scaled sample mean \sqrt{T}\,\bar{y} = T^{-1/2}\sum_{t=1}^{T} y_t with variance given in (2.22). In the case of the sample variance, since y_t has zero mean, an estimator of its variance \sigma^2/(1 - \rho^2) is T^{-1}\sum_{t=1}^{T} y_t^2. The function

z_t = y_t^2 - \frac{\sigma^2}{1 - \rho^2} ,

has mean zero and inherits stationarity and mixing from y_t, so that

\frac{1}{\sqrt{T}}\sum_{t=1}^{T} \Big( y_t^2 - \frac{\sigma^2}{1 - \rho^2} \Big) \xrightarrow{d} N(0, J_2) ,

where

J_2 = var(z_t) + 2\sum_{j=1}^{\infty} cov(z_t, z_{t-j}) ,

demonstrating that the sample variance is also asymptotically normal.

2.3 Regularity Conditions
This section sets out a number of assumptions, known as regularity conditions, that are used in the derivation of the properties of the maximum likelihood estimator. Let the true population parameter value be represented by \theta_0 and assume that the distribution f(y; \theta) is specified correctly. The following regularity conditions apply to iid, stationary, mds and white noise processes as discussed in Section 2.2.1. For simplicity, many of the regularity conditions are presented for the iid case.

R1: Existence
The expectation

E[\ln f(y_t; \theta)] = \int_{-\infty}^{\infty} \ln f(y_t; \theta)\, f(y_t; \theta_0)\, dy_t ,   (2.23)

exists.

R2: Convergence
The log-likelihood function converges in probability to its expectation

\ln L_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \ln f(y_t; \theta) \xrightarrow{p} E[\ln f(y_t; \theta)] ,   (2.24)

uniformly in \theta.
R3: Continuity The log-likelihood function, lnLT (θ), is continuous in θ. R4: Differentiability The log-likelihood function, lnLT (θ), is at least twice continuously differentiable in an open interval around θ0. R5: Interchangeability The order of differentiation and integration of lnLT (θ) is interchange- able. Condition R1 is a statement of the existence of the population log-likelihood function. Condition R2 is a statement of how the sample log-likelihood func- tion converges to the population value by virtue of the WLLN, provided that this expectation exists in the first place, as given by the existence condition (R1). The continuity condition (R3) is a necessary condition for the differen- tiability condition (R4). The requirement that the log-likelihood function is at least twice differentiable naturally arises from the discussion in Chapter 1 where the first two derivatives are used to derive the maximum likelihood estimator and establish that a maximum is reached. Even when the like- lihood is not differentiable everywhere, the maximum likelihood estimator can, in some instances, still be obtained. An example is given by the Laplace distribution in which the median is the maximum likelihood estimator (see Section 6.6.1 of Chapter 6). Finally, the interchangeability condition (R5) is used in the derivation of many of the properties of the maximum likelihood estimator. Example 2.16 Likelihood Function of the Normal Distribution Assume that y has a normal distribution with unknown mean θ = {µ} and known variance σ20 f (y; θ) = 1√ 2πσ20 exp [ −(y − µ) 2 2σ20 ] . If the population parameter is defined as θ0 = {µ0}, the existence regularity 2.4 Properties of the Likelihood Function57 condition (R1) becomes E[ln f (yt; θ)] = − 1 2 ln ( 2πσ20 ) − 1 2σ20 E[(yt − µ)2] = −1 2 ln ( 2πσ20 ) − 1 2σ20 E[(yt − µ0)2 + (µ0 − µ)2 + 2(yt − µ0)(µ0 − µ)] = −1 2 ln ( 2πσ20 ) − 1 2σ20 ( σ20 + (µ0 − µ)2 ) = −1 2 ln ( 2πσ20 ) − 1 2 − (µ0 − µ) 2 2σ20 , which exists because 0 < σ20 <∞. 2.4 Properties of the Likelihood Function This section establishes various features of the log-likelihood function used in the derivation of the properties of the maximum likelihood estimator. 2.4.1 The Population Likelihood Function Given that the existence condition (R1) is satisfied, an important property of this expectation is θ0 = argmax θ E[ln f (yt; θ)] . (2.25) The principle of maximum likelihood requires that the maximum likelihood estimator, θ̂, maximizes the sample log-likelihood function by replacing the expectation in equation (2.25) by the sample average. This property repre- sents the population analogue of the maximum likelihood principle in which θ0 maximizes E[ln f(yt; θ)]. For this reason E[ln f(yt; θ)] is referred to as the population log-likelihood function. Proof Consider E[ln f(yt; θ)]− E[ln f(yt; θ0)] = E [ ln f(yt; θ) f(yt; θ0) ] < ln E [ f(yt; θ) f(yt; θ0) ] , where θ 6= θ0 and the inequality follows from Jensen’s inequality.3 Working 3 If g (y) is a concave function in the random variable y, Jensen’s inequality states that E[g(y)] < g(E[y]). This condition is satisfied here since g(y) = ln(y) is concave. 58 Properties of Maximum Likelihood Estimators with the term on the right-hand side yields ln E [ f(yt; θ) f(yt; θ0) ] = ln ∞∫ −∞ f(yt; θ) f(yt; θ0) f(yt; θ0)dyt = ln ∞∫ −∞ f(yt; θ)dyt = ln 1 = 0 . It follows immediately that E [ln f(yt; θ)] < E [ln f(yt; θ0)] , for arbitrary θ, which establishes that the maximum occurs just for θ0. 
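Example 2.17 below establishes this result analytically for a normal population with known variance. As a complementary numerical check, the illustrative MATLAB fragment that follows (all parameter values are arbitrary assumptions) approximates E[ln f(y; µ)] by a Monte Carlo average over a grid of µ and locates its maximum at the true mean.

% Monte Carlo approximation of the population log-likelihood E[ln f(y; mu)]
% for a N(mu0, sig20) population with known variance.
mu0 = 2;  sig20 = 1.5;  n = 100000;
y = mu0 + sqrt(sig20)*randn(n,1);

mu_grid = linspace(0, 4, 81);
Elnf = zeros(size(mu_grid));
for i = 1:length(mu_grid)
    Elnf(i) = mean(-0.5*log(2*pi*sig20) - (y - mu_grid(i)).^2/(2*sig20));
end
[~, imax] = max(Elnf);
fprintf('grid maximizer of E[ln f]: mu = %5.2f  (true mu0 = %4.2f)\n', ...
        mu_grid(imax), mu0);

For large n the simulated curve is maximized at the grid point closest to µ0, consistent with (2.25).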
Example 2.17 Population Likelihood of the Normal Distribution From Example 2.16, the population log-likelihood function based on a normal distribution with unknown mean, µ, and known variance, σ20, is E [ln f (yt; θ)] = − 1 2 ln ( 2πσ20 ) − 1 2 − (µ0 − µ) 2 2σ20 , which clearly has its maximum at µ = µ0. 2.4.2 Moments of the Gradient The gradient function at observation t, introduced in Chapter 1, is defined as gt(θ) = ∂ ln f(yt; θ) ∂θ . (2.26) This function has two important properties that are fundamental to maxi- mum likelihood estimation. These properties are also used in Chapter 3 to devise numerical algorithms for computing maximum likelihood estimators, in Chapter 4 to construct test statistics, and in Chapter 9 to derive the quasi-maximum likelihood estimator. Mean of the Gradient The first property is E[gt(θ0)] = 0 . (2.27) Proof As f(yt; θ) is a probability distribution, it has the property ∫ ∞ −∞ f(yt; θ)dyt = 1 . Now differentiating both sides with respect to θ gives ∂ ∂θ (∫ ∞ −∞ f(yt; θ)dyt ) = 0 . 2.4 Properties of the Likelihood Function 59 Using the interchangeability regularity condition (R5) and the property of natural logarithms ∂f(yt; θ) ∂θ = ∂ ln f(yt; θ) ∂θ f(yt; θ) = gt(θ)f(yt; θ) , the left-hand side expression is rewritten as ∫ ∞ −∞ gt(θ)f(yt; θ) dyt . Evaluating this expression at θ = θ0 means the the relevant integral is evaluated using the population density function, f(yt; θ0), thereby enabling it to be interpreted as an expectation. This yields E[gt(θ0)] = 0 , which proves the result. Variance of the Gradient The second property is cov[gt(θ0)] = E[gt(θ0)gt(θ0) ′] = −E[ht(θ0)] , (2.28) where the first equality uses the result from expression (2.27) that gt(θ0) has zero mean. This expression links the first and second derivatives of the likelihood function and establishes that the expectation of the square of the gradient is equal to the negative of the expectation of the Hessian. Proof Differentiating ∫ ∞ −∞ f(yt; θ)dyt = 1 , twice with respect to θ and using the same regularity conditions to establish the first property of the gradient, gives ∫ ∞ −∞ [ ∂ ln f(yt; θ) ∂θ ∂f(yt; θ) ∂θ′ + ∂2 ln f(yt; θ) ∂θ∂θ′ f(yt; θ) ] dyt = 0 ∫ ∞ −∞ [ ∂ ln f(yt; θ) ∂θ ∂ ln f(yt; θ) ∂θ′ f(yt; θ) + ∂2 ln f(yt; θ) ∂θ∂θ′ f(yt; θ) ] dyt = 0 ∫ ∞ −∞ [gt(θ)gt(θ) ′ + ht(θ)]f(yt; θ)dyt = 0 . Once again, evaluating this expression at θ = θ0 gives E[gt(θ0)gt(θ0) ′] + E[ht(θ0)] = 0 , which proves the result. 60 Properties of Maximum Likelihood Estimators The properties of the gradient function in equations (2.27) and (2.28) are completely general, because they hold for any arbitrary distribution. Example 2.18 Gradient Properties and the Poisson Distribution The first and second derivatives of the log-likelihood function of the Pois- son distribution, given in Examples 1.14 and 1.17 in Chapter 1, are, respec- tively, gt(θ) = yt θ − 1 , ht(θ) = − yt θ2 . To establish the first property of the gradient, take expectations and evalu- ated at θ = θ0 E [gt(θ0)] = E [ yt θ0 − 1 ] = E [yt] θ0 − 1 = θ0 θ0 − 1 = 0 , because E[yt] = θ0 for the Poisson distribution. To establish the second property of the gradient, consider E [ gt(θ0)gt(θ0) ′] = E [( yt θ0 − 1 )2] = 1 θ20 E[(yt − θ0)2] = θ0 θ20 = 1 θ0 , since E [ (yt − θ0)2 ] = θ0 for the Poisson distribution. Alternatively E[ht(θ0)] = E [ − yt θ20 ] = −E [yt] θ20 = −θ0 θ20 = − 1 θ0 , and hence E[gt(θ0)gt(θ0) ′] = −E[ht(θ0)] = 1 θ0 . 
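Both gradient properties in the Poisson example can be verified numerically by summing over the (truncated) Poisson support, since each expectation has the form of a sum of (·)f(y; θ0) over y. The following sketch (the value θ0 = 3 and the truncation point are arbitrary choices) computes E[gt(θ0)], E[gt(θ0)²] and −E[ht(θ0)].

```matlab
% Numerical check of E[g]=0 and E[g^2]=-E[h] for the Poisson distribution
theta0 = 3;
y      = (0:200)';                               % truncated Poisson support
logf   = -theta0 + y*log(theta0) - gammaln(y+1); % log of the Poisson pmf
f      = exp(logf);

g = y/theta0 - 1;                                % gradient at each y
h = -y/theta0^2;                                 % Hessian at each y

Eg  = sum(g .* f);                               % should be approximately 0
Egg = sum(g.^2 .* f);                            % should be approximately 1/theta0
Eh  = sum(h .* f);                               % should be approximately -1/theta0

fprintf('E[g] = %8.2e,  E[g^2] = %6.4f,  -E[h] = %6.4f,  1/theta0 = %6.4f\n', ...
        Eg, Egg, -Eh, 1/theta0);
```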
The relationship between the gradient and the Hessian is presented more compactly by defining J(θ0) = E[gt(θ0)gt(θ0) ′] H(θ0) = E[ht(θ0)] , in which case J(θ0) = −H(θ0) . (2.29) The term J(θ0) is referred to as the outer product of the gradients. In the more general case where yt is dependent and gt is a mds, J(θ0) and H(θ0) 2.4 Properties of the Likelihood Function 61 in equation (2.29) become respectively J(θ0) = limT→∞ 1 T T∑ t=1 E[gt(θ0)gt(θ0) ′] (2.30) H(θ0) = limT→∞ 1 T T∑ t=1 E[ht(θ0)] . (2.31) 2.4.3 The Information Matrix The definition of the outer product of the gradients in equation (2.29) is commonly referred to as the information matrix I(θ0) = J(θ0) . (2.32) Given the relationship between J(θ0) and H(θ0) in equation (2.29) it imme- diately follows that I(θ0) = J(θ0) = −H(θ0) . (2.33) Equation (2.33) represents the well-known information equality. An impor- tant assumption underlying this result is that the distribution used to con- struct the log-likelihood function is correctly specified. This assumption is relaxed in Chapter 9 on quasi-maximum likelihood estimation. The information matrix represents a measure of the quality of the informa- tion in the sample to locate the population parameter θ0. For log-likelihood functions that are relatively flat the information in the sample is dispersed thereby providing imprecise information on the location of θ0. For samples that are less diffuse the log-likelihood function is more concentrated provid- ing more precise information on the location of θ0. Interpreting information this way follows from the expression of the information matrix in equation (2.33) where the quantity of information in the sample is measured by the curvature of the log-likelihood function, as given by −H(θ). For relatively flat log-likelihood functions the curvature of lnL(θ) means that −H(θ) is relatively small around θ0. For log-likelihood functions exhibiting stronger curvature, the second derivative is correspondingly larger. If ht(θ) represents the information available from the data at time t, if follows from (2.31) that the total information available from a sample of size T is TI(θ0) = − T∑ t=1 E [ht] . (2.34) 62 Properties of Maximum Likelihood Estimators Example 2.19 Information Matrix of the Bernoulli Distribution Let {y1, y2, · · · , yT } be iid observations from a Bernoulli distribution f(y; θ) = θ y(1 − θ)1−y , where 0 < θ < 1. The log-likelihood function at observation t is ln lt(θ) = yt ln θ + (1− yt) ln(1− θ) . The first and second derivatives are, respectively, gt(θ) = yt θ − 1− yt 1− θ , ht(θ) = − yt θ2 − 1−yt (1− θ)2 . The information matrix is I(θ0) = −E[ht(θ0)] = E[yt] θ20 − E[1− yt] (1− θ0)2 = θ0 θ20 + (1− θ0) (1− θ0)2 = 1 θ0(1− θ0) , because E[yt] = θ0 for the Bernoulli distribution. The total amount of infor- mation in the sample is TI(θ0) = T θ0(1− θ0) . Example 2.20 Information Matrix of the Normal Distribution Let {y1, y2, . . . , yT } be iid observations drawn from the normal distribu- tion f(y; θ) = 1√ 2πσ2 exp [ −(y − µ) 2 2σ2 ] , where the unknown parameters are θ = { µ, σ2 } . From Example 1.12 in Chapter 1, the log-likelihood function at observation t is ln lt(θ) = − 1 2 ln 2π − 1 2 lnσ2 − 1 2σ2 (yt − µ)2 , and the gradient and Hessian are, respectively gt(θ) = yt − µ σ2 − 1 2σ2 + (yt − µ)2 2σ4 , ht(θ) = − 1 σ2 −yt − µ σ4 −yt − µ σ4 1 2σ4 − (yt − µ) 2 σ6 . 
Taking expectations of the negative Hessian, evaluating at θ = θ0 and scaling 2.5 Asymptotic Properties 63 the result by T gives the total information matrix TI(θ0) = −T E[ht(θ0)] = T σ20 0 0 T 2σ40 . 2.5 Asymptotic Properties Assuming that the regularity conditions (R1) to (R5) in Section 2.3 are satisfied, the results in Section 2.4 are now used to study the relationship between the maximum likelihood estimator, θ̂, and the population parame- ter, θ0, as T → ∞. Three properties are investigated, namely, consistency, asymptotic normality and asymptotic efficiency. The first property focuses on the distance θ̂ − θ0; the second looks at the distribution of θ̂ − θ0; and the third examines the variance of this distribution. 2.5.1 Consistency A desirable property of an estimator θ̂ is that additional information ob- tained by increasing the sample size, T , yields more reliable estimates of the population parameter, θ0. Formally this result is stated as plim(θ̂) = θ0 . (2.35) An estimator satisfying this property is a consistent estimator. Given the regularity conditions in Section 2.3, all maximum likelihood estimators are consistent. To derive this result, consider a sample of T observations, {y1, y2, · · · , yT }. By definition the maximum likelihood estimator satisfies the condition θ̂ = argmax θ 1 T T∑ t=1 ln f (yt; θ) . From the convergence regularity condition (R2) 1 T T∑ t=1 ln f (yt; θ) p→ E [ln f (yt; θ)] , which implies that the two functions are converging asymptotically. But, 64 Properties of Maximum Likelihood Estimators given the result in equation (2.25), it is possible to write argmax θ 1 T T∑ t=1 ln f(yt; θ) p→ argmax θ E [ln f(yt; θ)] . So the maxima of these two functions, θ̂ and θ0, respectively, must also be converging as T → ∞, in which case (2.35) holds. This is a heuristic proof of the consistency property of the maximum likelihood estimator initially given by Wald (1949); see also Newey and Mc- Fadden (1994, Theorems 2.1 and 2.5, pp 2111 - 2245). The proof highlights that consistency requires: (i) convergence of the sample log-likelihood function to the population log- likelihood function; and (ii) convergence of the maximum of the sample log-likelihood function to the maximum of the population log-likelihood function. These two features of the consistency proof are demonstrated in the fol- lowing simulation experiment. Example 2.21 Demonstration of Consistency Figure 2.3 gives plots of the log-likelihood functions for samples of size T = {5, 20, 500} simulated from the population distribution N(10, 16). Also plot- ted is the population log-likelihood function, E[ln f(yt; θ)], given in Example 2.16. The consistency of the maximum likelihood estimator is first demon- strated with the sample log-likelihood functions approaching the population log-likelihood function E[ln f(yt; θ)] as T increases. The second demonstra- tion of the consistency property is given by the maximum likelihood esti- mates, in this case the sample means, of the three samples y(T = 5) = 7.417, y (T = 20) = 10.258, y (T = 500) = 9.816, which approach the population mean µ0 = 10 as T → ∞. A further implication of consistency is that an estimator should exhibit decreasing variability around the population parameter θ0 as T increases. Example 2.22 Normal Distribution Consider the normal distribution f(y; θ) = 1√ 2πσ2 exp [ −(y − µ) 2 2σ2 ] . From Example 1.16 in Chapter 1, the sample mean, y, is the maximum likelihood estimator of µ0. 
Figure 2.4 shows that this estimator converges 2.5 Asymptotic Properties 65 ln L T (θ ) µ 2 4 6 8 10 12 14 -3.5 -3.4 -3.3 -3.2 -3.1 -3 -2.9 -2.8 -2.7 -2.6 -2.5 Figure 2.3 Log-likelihood functions for samples of size T = 5 (dotted line), T = 20 (dot-dashed line) and T = 500 (dashed line), simulated from the population distribution N(10, 16). The bold line is the population log- likelihood E[ln f(y; θ)] given by Example 2.16. to µ0 = 1 for increasing samples of size T while simultaneously exhibiting decreasing variability. ȳ T 50 100 150 200 250 300 350 400 450 500 0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 Figure 2.4 Demonstration of the consistency properties of the sample mean when samples of increasing size T = 1, 2, · · · , 500 are drawn from a N(1, 2) distribution. Example 2.23 Cauchy Distribution The sample mean, y, and the sample median, m, are computed from in- 66 Properties of Maximum Likelihood Estimators (a) Mean θ̂ T (b) Median θ̂ T 100 200 300 400 500100 200 300 400 500 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 -200 -150 -100 -50 0 50 100 Figure 2.5 Demonstration of the inconsistency of the sample mean and the consistency of the sample median as estimators of the location parameter of a Cauchy distribution with θ0 = 1, for samples of increasing size T = 1, 2, · · · , 500. creasing samples of size T = 1, 2, · · · , 500, drawn from a Cauchy distribution f(y; θ) = 1 π 1 1 + (y − θ)2 , with location parameter θ0 = 1. A comparison of panels (a) and (b) in Fig- ure 2.5 suggests that y is an inconsistent estimator of θ because its sampling variability does not decrease as T increases. By contrast, the sampling vari- ability of m does decrease suggesting that it is a consistent estimator. The failure of y to be a consistent estimator stems from the property that the mean of a Cauchy distribution does not exist and therefore represents a vi- olation of the conditions needed for the weak law of large numbers to hold. In this example, neither y nor m are the maximum likelihood estimators. The maximum likelihood estimator of the location parameter of the Cauchy distribution is investigated further in Chapter 3. 2.5 Asymptotic Properties 67 2.5.2 Normality To establish the asymptotic distribution of the maximum likelihood estima- tor, θ̂, consider the first-order condition GT (θ̂) = 1 T T∑ t=1 gt(θ̂) = 0 . (2.36) A mean value expansion of this condition around the true value θ0, gives 0 = 1 T T∑ t=1 gt(θ̂) = 1 T T∑ t=1 gt(θ0) + [ 1 T T∑ t=1 ht(θ ∗) ] (θ̂ − θ0) , (2.37) where θ∗ lies between θ̂ and θ0, and hence θ∗ p→ θ0 if θ̂ p→ θ0. Rearranging and multiplying both sides by √ T shows that √ T (θ̂ − θ0) = [ − 1 T T∑ t=1 ht(θ ∗) ]−1 [ 1√ T T∑ t=1 gt(θ0) ] . (2.38) Now 1 T T∑ t=1 ht(θ ∗) p→ H(θ0) 1√ T T∑ t=1 gt(θ0) d→ N(0, J(θ0)) , (2.39) where H(θ0) = lim T→∞ 1 T T∑ t=1 E[ht(θ0)] J(θ0) = lim T→∞ E [( 1√ T T∑ t=1 gt(θ0) )( 1√ T T∑ t=1 g ′ t(θ0) )] . The first condition in (2.39) follows from the uniform WLLN and the second condition is based on applying the appropriate central limit theorem based on the time series properties of gt(θ). Combining equations (2.38) and (2.39) yields the asymptotic distribution √ T (θ̂ − θ0) d→ N ( 0,H−1(θ0)J(θ0)H −1(θ0) ) . Using the information matrix equality in equation (2.33) simplifies the asymp- totic distribution to √ T (θ̂ − θ0) d→ N ( 0, I−1(θ0) ) . (2.40) 68 Properties of Maximum Likelihood Estimators or θ̂ a ∼ N(θ0, 1 T Ω), 1 T Ω = 1 T I−1(θ0) . 
(2.41) This establishes that themaximum likelihood estimator has an asymptotic normal distribution with mean equal to the population parameter, θ0, and covariance matrix, T−1Ω, equal to the inverse of the information matrix appropriately scaled to account for the total information in the sample, T−1I−1(θ0). Example 2.24 Asymptotic Normality of the Poisson Parameter From Example 2.18, equation (2.40) becomes √ T (θ̂ − θ0) d→ N(0, θ0) , because H(θ0) = −1/θ0 = −I(θ0), then I−1(θ0) = θ0. Example 2.25 Simulating Asymptotic Normality Figure 2.6 gives the results of sampling iid random variables from an exponential distribution with θ0 = 1 for samples of size T = 5 and T = 100, using 5000 replications. The sample means are standardized using the population mean (θ0 = 1) and the population variance (θ 2 0/T = 1 2/T ) as zi = yi − 1√ 12/T , i = 1, 2, · · · , 5000 . The sampling distribution of z is skewed to the right for samples of size T = 5 thus mimicking the positive skewness characteristic of the population distribution. Increasing the sample size to T = 100, reduces the skewness in the sampling distribution, which is now approximately normally distributed. 2.5.3 Efficiency Asymptotic efficiency concerns the limiting value of the variance of any estimator, say θ̃, around θ0 as the sample size increases. The Cramér-Rao lower bound provides a bound on the efficiency of this estimator. Cramér-Rao Lower Bound: Single Parameter Case Suppose θ0 is a single parameter and θ̃ is any consistent estimator of θ0 with asymptotic distribution of the form √ T (θ̃ − θ0) d→ N(0,Ω) . 2.5 Asymptotic Properties 69 (a) Exponential distribution f (y ) y (b) T = 5 f (z ) z (c) T = 100 f (z ) z -4 -3 -2 -1 0 1 2 3 4-4 -3 -2 -1 0 1 2 3 4 0 2 4 6 0 200 400 600 800 0 200 400 600 800 1000 0 0.5 1 1.5 Figure 2.6 Demonstration of asymptotic normality of the maximum like- lihood estimator based on samples of size T = 5 and T = 100 from an exponential distribution, f(y; θ0), with mean θ0 = 1, for 5000 replications. The Cramér-Rao inequality states that Ω ≥ 1 I(θ0) . (2.42) Proof An outline of the proof is as follows. A consistent estimator is asymp- totically unbiased, so E[θ̃ − θ0] → 0 as T → 0, which can be expressed ∫ · · · ∫ (θ̃ − θ0)f(y1, . . . , yT ; θ0)dy1 · · · dyT → 0 . Differentiating both sides with respect to θ0 and using the interchangeability 70 Properties of Maximum Likelihood Estimators regularity condition (R4) gives − ∫ · · · ∫ f(y1, . . . , yT ; θ0)dy1 · · · dyT + ∫ · · · ∫ (θ̃ − θ0) ∂f(y1, . . . , yT ; θ0) ∂θ0 dy1 · · · dyT → 0 . The first term on the right hand side integrates to 1, since f is a probability density function. Thus ∫ · · · ∫ (θ̃ − θ0) ∂ ln f(y1, . . . , yT ; θ0) ∂θ0 f(y1, . . . , yT ; θ0)dy1 · · · dyT → 1 . (2.43) Using ∂ ln f(y1, . . . , yT ; θ0) ∂θ0 = TGT (θ0) , equation (2.43) can be expressed cov( √ T (θ̃ − θ0), √ TGT (θ0)) → 1 , since the score GT (θ0) has mean zero. The squared correlation between √ T (θ̃ − θ0) and GT (θ0) satisfies cor( √ T (θ̃ − θ0), √ TGT (θ0)) 2 = cov( √ T (θ̃ − θ0), √ TGT (θ0)) 2 var( √ T (θ̃ − θ0))var( √ TGT (θ0)) ≤ 1 and rearranging gives var( √ T (θ̃ − θ0)) ≥ cov( √ T (θ̃ − θ0), √ TGT (θ0)) 2 var( √ TGT (θ0)) . Taking limits on both sides of this inequality gives Ω on the left hand side, 1 in the numerator on the right hand side and I(θ0) in the denominator, which gives the Cramér-Rao inequality in (2.42) as required. 
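The bound is easily illustrated by simulation. For the exponential distribution with mean θ0, the information matrix is I(θ0) = 1/θ0², so the variance of √T(θ̂ − θ0), where θ̂ = ȳ is the maximum likelihood estimator, should be close to θ0². The following sketch (sample size and number of replications are arbitrary choices) compares the simulated variance with this lower bound.

```matlab
% Simulated variance of sqrt(T)(thetahat - theta0) versus the Cramer-Rao
% bound for the exponential distribution (illustrative design choices)
rng(5);
theta0 = 1;  T = 100;  R = 20000;

z = zeros(R,1);
for i = 1:R
    y    = -theta0*log(rand(T,1));   % exponential(theta0) draws by inversion
    z(i) = sqrt(T)*(mean(y) - theta0);
end

fprintf('Simulated variance = %6.4f,  CR bound theta0^2 = %6.4f\n', ...
        var(z), theta0^2);
```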
Cramér-Rao Lower Bound: Multiple Parameter Case For a vector parameter the Cramér-Rao inequality (2.42) becomes Ω ≥ I−1(θ0) , (2.44) where this matrix inequality is understood to mean that Ω−I−1(θ0) is a pos- itive semi-definite matrix. Since equation (2.41) shows that the maximum likelihood estimator, θ̂, has asymptotic variance I−1(θ0), the maximum like- lihood estimator achieves the Cramér-Rao lower bound and is, therefore, asymptotically efficient. Moreover, since TI(θ0) represents the total infor- mation available in a sample of size T , the inverse of this quantity provides 2.5 Asymptotic Properties 71 a measure of the precision of the information in the sample, as given by the variance of θ̂. Example 2.26 Lower Bound for the Normal Distribution From Example 2.20, the log-likelihood function is lnLT (θ) = − 1 2 ln 2π − 1 2 lnσ2 − 1 2σ2T T∑ t=1 (yt − µ)2 , with information matrix I(θ) = −E[HT (θ)] = 1 σ2 0 0 1 2σ4 . Evaluating this expression at θ = θ0 gives the covariance matrix of the maximum likelihood estimator 1 T Ω = 1 T I−1(θ0) = σ20 T 0 0 2σ40 T , so se(µ̂) ≈ √ σ20/T and se(σ̂ 2) ≈ √ 2σ40/T . Example 2.27 Relative Efficiency of the Mean and Median The sample mean, y, and sample median, m, are both consistent estima- tors of the population mean, µ, in samples drawn from a normal distribution, with y being the maximum likelihood estimator of µ. From Example 2.26 the variance of y is var(y) = σ20/T . The variance of m is approximately (Stuart and Ord, 1994, p. 358) var(m) = 1 4Tf2 , where f = f(m) is the value of the pdf evaluated at the population median (m). In the case of normality with known variance σ20 , f(m) is f(m) = 1√ 2πσ20 exp [ −(m− µ) 2 2σ20 ] = 1√ 2πσ20 , since m = µ because of symmetry. The variance of m is then var(m) = πσ20 2T > var(y) , because π/2 > 1, establishing that the maximum likelihood estimator has a smaller variance than another consistent estimator, m. 72 Properties of Maximum Likelihood Estimators 2.6 Finite-Sample Properties The properties of the maximum likelihood estimator established in the pre- vious section are asymptotic properties. An important application of the asymptotic distribution is to approximate the finite sample distribution of the maximum likelihood estimator, θ̂. There are a number of methods avail- able to approximate the finite sample distribution including simulating the sampling distribution by Monte Carlo methods or using an Edgeworth ex- pansion approach as shown in the following example. Example 2.28 Edgeworth Expansion Approximations As illustrated in Example 2.25, the asymptotic distribution of the max- imum likelihood estimator of the parameter of an exponential population distribution is z = √ T (θ̂ − θ0) θ0 d→ N(0, 1) , which has asymptotic distribution function Fa(s) = Φ(s) = 1√ 2π ∫ s −∞ e−v 2/2dv . The Edgeworth expansion of the distribution function is Fe(s) = Φ(s)− φ(s) [( 1 + 2 3 H2 (s) ) 1√ T + (5 2 + 11 12 H3(s) + 9 2 H5(s) ) 1 T ] , where H2(s) = s 2 − 1, H3(s) = s3 − 3s and H5(s) = s5 − 10s3 + 15s are the probabilists’ Hermite polynomials and φ(s) is the standard normal probability density (Severini, 2005, p.144). The finite sample distribution function is available in this case and is given by the complement of the gamma distribution function F (s) = 1− 1 Γ (s) ∫ w 0 e−vvs−1dv , w = T/ ( 1 + s/ √ T ) . Table 2.6 shows that the Edgeworth approximation, Fe (s), improves upon the asymptotic approximation, Fa (s), although the former can yield negative probabilities in the tails of the distribution. 
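The Monte Carlo approach mentioned at the start of this section provides a check on these approximations that does not require the expansion terms. The following sketch (the number of replications is an arbitrary choice) simulates z = √T(θ̂ − θ0)/θ0 for exponential samples of size T = 5 and compares the simulated distribution function with the asymptotic approximation Φ(s) at s = −2, −1, 0, 1, 2.

```matlab
% Monte Carlo approximation of the finite-sample distribution of
% z = sqrt(T)(thetahat - theta0)/theta0 for exponential data with T = 5,
% compared with the asymptotic N(0,1) approximation (illustrative sketch)
rng(7);
theta0 = 1;  T = 5;  R = 100000;

y = -theta0*log(rand(T,R));                   % each column is one sample
z = sqrt(T)*(mean(y,1) - theta0)/theta0;

s = -2:1:2;
for i = 1:numel(s)
    Fsim  = mean(z <= s(i));                  % simulated distribution function
    Fasym = 0.5*(1 + erf(s(i)/sqrt(2)));      % Phi(s)
    fprintf('s = %2d:  simulated = %5.3f,  asymptotic = %5.3f\n', ...
            s(i), Fsim, Fasym);
end
```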
As the previous example demonstrates, even for simple situations the finite sample distribution approximation of the maximum likelihood estimator is complicated. For this reason asymptotic approximations are commonly em- ployed. However, some other important finite sample properties will now be discussed, namely, unbiasedness, sufficiency, invariance and non-uniqueness. 2.6 Finite-Sample Properties 73 Comparison of the finite sample, Edgeworth expansion and asymptotic distribution functions of the statistic √ Tθ−10 (θ̂ − θ0), for a sample of size T = 5 draws from the exponential distribution. s Finite Edgeworth Asymptotic -2 0.000 -0.019 0.023 -1 0.053 0.147 0.159 0 0.440 0.441 0.500 1 0.734 0.636 0.841 2 0.872 0.874 0.977 2.6.1 Unbiasedness Not all maximum likelihood estimators are unbiased. Examples of unbiased maximum likelihood estimators are the samplemean in the normal and Poisson examples. Even in samples known to be normally distributed but with unknown mean, the sample standard deviation is an example of a biased estimator since E[σ̂] 6= σ0. This result follows from the fact that Slutsky’s theorem (see Section 2.2.2) does not hold for the expectations operator. Consequently E[τ(θ̂)] 6= τ(E[ θ̂ ]) , where τ(·) is a monotonic function. This result contrasts with the property of consistency that uses probability limits, because Slutsky’s theorem does apply to plims. Example 2.29 Sample Variance of a Normal Distribution The maximum likelihood estimator, σ̂2, and an unbiased estimator, σ̃2, of the variance of a normal distribution with unknown mean, µ, are, respec- tively, σ̂2 = 1 T T∑ t=1 (yt − y)2 , σ̃2 = 1 T − 1 T∑ t=1 (yt − y)2 . As E[σ̃2] = σ20 , the maximum likelihood estimator underestimates σ 2 0 in finite samples. To highlight the size of this bias, 20000 samples of size T = 5 are drawn from a N(1, 2) distribution. The simulated expectations are, respectively, E[σ̂2] ≃ 1 20000 20000∑ i=1 σ̂2i = 1.593, E[σ̃ 2] ≃ 1 20000 20000∑ i=1 σ̃2i = 1.991, 74 Properties of Maximum Likelihood Estimators showing a 20.35% underestimation of σ20 = 2. 2.6.2 Sufficiency Let {y1, y2, · · · , yT } be iid drawings from the joint pdf f(y1, y2, · · · , yT ; θ). Any statistic computed using the observed sample, such as the sample mean or variance, is a way of summarizing the data. Preferably, the statistics should summarize the data in such a way as not to lose any of the informa- tion contained by the entire sample. A sufficient statistic for the population parameter, θ0, is a statistic that uses all of the information in the sample. Formally, this means that the joint pdf can be factorized into two compo- nents f(y1, y2, · · · , yT ; θ) = c(θ̃; θ)d(y1, · · · , yT ) , (2.45) where θ̃ represents a sufficient statistic for θ. If a sufficient statistic exists, the maximum likelihood estimator is a func- tion of it. To demonstrate this result, use equation (2.45) to rewrite the log-likelihood function as lnLT (θ) = 1 T ln c(θ̃; θ) + 1 T ln d(y1, · · · , yT ) . (2.46) Differentiating with respect to θ gives ∂ lnLT (θ) ∂θ = 1 T ∂ ln c(θ̃; θ) ∂θ . (2.47) The maximum likelihood estimator, θ̂, is given as the solution of ∂ ln c(θ̃; θ̂) ∂θ = 0 . (2.48) Rearranging shows that θ̂ is a function of the sufficient statistic θ̃. 
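The factorization can be illustrated numerically. For the exponential distribution the joint density is θ^{-T} exp(−θ̃/θ) with θ̃ = Σ yt and d(y1, ..., yT) = 1, so any two samples sharing the same sum must generate identical log-likelihood functions. The following sketch (the two samples are arbitrary choices with equal sums) confirms this.

```matlab
% Two exponential samples with the same sufficient statistic sum(y)
% produce identical log-likelihood functions (illustrative sketch)
yA    = [0.8  2.4  1.3  1.5];            % sum = 6.0
yB    = [1.5  1.5  1.5  1.5];            % sum = 6.0, a different sample
theta = 0.5:0.1:5;

lnL  = @(y,th) arrayfun(@(t) mean(-log(t) - y/t), th);  % average log-likelihood
lnLA = lnL(yA, theta);
lnLB = lnL(yB, theta);

fprintf('Maximum absolute difference in lnL(theta): %8.2e\n', ...
        max(abs(lnLA - lnLB)));
```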
Example 2.30 Sufficient Statistic of the Geometric Distribution If {y1, y2, · · · , yT } are iid observations from a geometric distribution f(y; θ) = (1− θ)yθ , 0 < θ < 1 , the joint pdf is T∏ t=1 f(yt; θ) = (1− θ)θ̃θT , 2.6 Finite-Sample Properties 75 where θ̃ is the sufficient statistic θ̃ = T∑ t=1 yt . Defining c(θ̃; θ) = (1− θ)θ̃θT , d(y1, · · · , yT ) = 1 , equation (2.48) becomes d ln c(θ̂; θ̂) dθ = − θ̃ 1− θ̂ + T θ̂ = 0 , showing that θ̂ = T/(T + θ̃) is a function of the sufficient statistic θ̃. 2.6.3 Invariance If θ̂ is the maximum likelihood estimator of θ0, then for any arbitrary non- linear function, τ(·), the maximum likelihood estimator of τ(θ0) is given by τ(θ̂). The invariance property is particularly useful in situations when an analytical expression for the maximum likelihood estimator is not available. Example 2.31 Invariance Property and the Normal Distribution Consider the following normal distribution with known mean µ0 f(y;σ2) = 1√ 2πσ2 exp [ −(y − µ0) 2 2σ2 ] . As shown in Example 1.16, for a sample of size T the maximum likelihood estimator of the variance is σ̂2 = T−1 ∑T t=1(yt − µ0)2. Using the invariance property, the maximum likelihood estimator of σ is σ̂ = √√√√ 1 T T∑ t=1 (yt − µ0)2 , which immediately follows by defining τ(θ) = √ θ. Example 2.32 Vasicek Interest Rate Model From the Vasicek model of interest rates in Section 1.5 of Chapter 1, the parameters of the transitional distribution are θ = {α, β, σ2}. The re- lationship between the parameters of the transitional distribution and the stationary distribution is µs = − α β , σ2s = − σ2 β (2 + β) . 76 Properties of Maximum Likelihood Estimators Given the maximum likelihood estimator of the model parameters θ̂ = {α̂, β̂, σ̂2}, the maximum likelihood estimators of the parameters of the sta- tionary distribution are µ̂s = − α̂ β̂ , σ̂2s = − σ̂2 β̂(2 + β̂) . 2.6.4 Non-Uniqueness The maximum likelihood estimator of θ is obtained by solving GT (θ̂) = 0 . (2.49) The problems considered so far have a unique and, in most cases, closed- form solution. However, there are examples where there are several solutions to equation (2.49). An example is the bivariate normal distribution, which is explored in Section 2.7.2. 2.7 Applications Some of the key results from this chapter are now applied to the bivariate normal distribution. The first application is motivated by the portfolio di- versification problem in finance. The second application is more theoretical and illustrates the non-uniqueness problem sometimes encountered in the context of maximum likelihood estimation. Let y1 and y2 be jointly iid random variables with means µi = E[yi], variances σ2i = E[(yi − µi)2], covariance σ1,2 = E[(y1 − µ1)(y2 − µ2)] and correlation ρ = σ1,2/σ1σ2. The bivariate normal distribution is f(y1, y2; θ) = 1 2π √ σ21σ 2 2 (1− ρ2) exp [ − 1 2 (1− ρ2) (( y1 − µ1 σ1 )2 −2ρ ( y1 − µ1 σ1 )( y2 − µ2 σ2 ) + ( y2 − µ2 σ2 )2)] , (2.50) where θ = {µ1, µ2,σ21 , σ22 , ρ} are the unknown parameters. The shape of the bivariate normal distribution is shown in Figure 2.7 for the case of positive correlation ρ = 0.6 (left hand column) and zero correlation ρ = 0 (right hand column), with µ1 = µ2 = 0 and σ 2 1 = σ 2 2 = 1. 
The contour plots show that the effect of ρ > 0 is to make the contours ellipsoidal, which stretch the mass of the distribution over the quadrants 2.7 Applications 77 ρ = 0.6 y1y2 f (y 1 ,y 2 ) y1 y 2 ρ = 0.0 y1y2 f (y 1 ,y 2 ) y1 y 2 -4 -2 0 2 4 -5 0 5 -4 -2 0 2 4 -5 0 5 -4 -2 0 2 4 -5 0 5 -4 -2 0 2 4 -5 0 5 0 0.2 0.4 0 0.2 0.4 Figure 2.7 Bivariate normal distribution, based on µ1 = µ2 = 0, σ 2 1 = σ 2 2 = 1 and ρ = 0.6 (left hand column) and ρ = 0 (right hand column). with y1 and y2 having the same signs. The contours are circular for ρ = 0, showing that the distribution is evenly spread across all quadrants. In this special case there is no contemporaneous relationship between y1 and y2 and the joint distribution reduces to the product of the two marginal distributions f ( y1, y2;µ1, µ2, σ 2 1 , σ 2 2 , ρ = 0 ) = f1 ( y1;µ1, σ 2 1 ) f2 ( y2;µ2, σ 2 2 ) , (2.51) where fi(·) is a univariate normal distribution. 78 Properties of Maximum Likelihood Estimators 2.7.1 Portfolio Diversification A fundamental result in finance is that the risk of a portfolio can be reduced by diversification when the correlation, ρ, between the returns on the assets in the portfolio is not perfect. In the extreme case of ρ = 1, all assets move in exactly the same way and there are no gains to diversification. Figure 2.8 gives a scatter plot of the daily percentage returns on Apple and Ford stocks from 2 January 2001 to 6 August 2010. The cluster of returns exhibits positive, but less than perfect, correlation, suggesting gains to diversification. Ford A p p le -30 -20 -10 0 10 20 30 -15 -10 -5 0 5 10 15 20 Figure 2.8 Scatter plot of daily percentage returns on Apple and Ford stocks from 2 January 2001 to 6 August 2010. A common assumption underlying portfolio diversification models is that returns are normally distributed. In the case of two assets, the returns y1 (Apple) and y2 (Ford) are assumed to be iid with the bivariate normal distribution in (2.50). For t = 1, 2, · · · , T pairs of observations, the log- likelihood function is lnLT (θ) = − ln 2π − 12 ( lnσ21 + lnσ 2 2 + ln(1− ρ2) ) − 1 2 (1− ρ2)T T∑ t=1 ((y1,t − µ1 σ1 )2 − 2ρ (y1,t − µ1 σ1 )(y2,t − µ2 σ2 ) + (y2,t− µ2 σ2 )2) . (2.52) To find the maximum likelihood estimator, θ̂, the first-order derivatives 2.7 Applications 79 of the log-likelihood function in equation (2.52) are ∂ lnLT (θ) ∂µi = 1 σi (1− ρ2) 1 T T∑ t=1 ((yi,t − µi σi ) − ρ (yj,t − µj σj )) ∂ lnLT (θ) ∂σ2i = − 1 2σ2i (1− ρ2) ( ( 1− ρ2 ) − 1 T T∑ t=1 ( yi,t − µi σi )2 + ρ T T∑ t=1 ( yi,t − µi σi )( yj,t − µj σj )) ∂ lnLT (θ) ∂ρ = ρ 1− ρ2 − 1 (1− ρ2)2 1 T T∑ t=1 ( ρ ( y1,t − µ1 σ1 )2 + ρ ( y2,t − µ2 σ2 ) + 1 + ρ2 (1− ρ2)2 ( y1,t − µ1 σ1 )( y2,t − µ2 σ2 )) , where i 6= j. Setting these derivatives to zero and rearranging yields the maximum likelihood estimators µ̂i = 1 T T∑ t=1 yi,t , σ̂ 2 i = 1 T T∑ t=1 (yi,t − µ̂i)2 , i = 1, 2 , ρ̂ = 1 T σ̂1σ̂2 T∑ t=1 (y1,t − µ̂1) (y2,t − µ̂2) . Evaluating these expressions using the data in Figure 2.8 gives µ̂1 = −0.147, µ̂2 = 0.017, σ̂21 = 7.764, σ̂22 = 10.546, ρ̂ = 0.301 , (2.53) while the estimate of the covariance is σ̂1,2 = ρ̂1,2σ̂1σ̂1 = 0.301 × √ 7.764 × √ 10.546 = 2.724 . The estimate of the correlation ρ̂ = 0.301 confirms the positive ellipsoidal shape of the scatter plot in Figure 2.8. 
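These estimators are simple sample moments and are easily computed. The following sketch assumes two return series are available in memory; since the Apple and Ford data are not reproduced here, simulated stand-in returns with comparable moments are used in their place.

```matlab
% Maximum likelihood estimates of the bivariate normal parameters from two
% return series (simulated stand-in data; replace with the actual returns)
rng(11);
T     = 2413;
mu    = [-0.147  0.017];
Sigma = [ 7.764   2.724 ;
          2.724  10.546 ];
y  = repmat(mu,T,1) + randn(T,2)*chol(Sigma);   % stand-in bivariate returns
y1 = y(:,1);  y2 = y(:,2);

mu1 = mean(y1);            mu2 = mean(y2);
s21 = mean((y1-mu1).^2);   s22 = mean((y2-mu2).^2);   % MLEs divide by T
s12 = mean((y1-mu1).*(y2-mu2));
rho = s12/sqrt(s21*s22);

fprintf('mu1 = %6.3f  mu2 = %6.3f  sig1^2 = %6.3f  sig2^2 = %6.3f  rho = %6.3f\n', ...
        mu1, mu2, s21, s22, rho);
```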
To demonstrate the potential advantages of portfolio diversification, define the return on the portfolio of the two assets, Apple and Ford, as rt = w1y1,t + w2y2,t , where w1 and w2 are the respective weights on Apple and Ford in the port- folio, with the property that w1 + w2 = 1. The risk of this portfolio is σ2 = E[(rt − E[rt])2] = w21σ21 + w22σ22 + 2w1w2σ1,2 . 80 Properties of Maximum Likelihood Estimators For the minimum variance portfolio, w1 and w2 are the solutions of argmin ω1,w2 σ2 s.t. w1 + w2 = 1 . The optimal weight on Apple is w1 = σ22 − σ1,2 σ21 + σ 2 2 − 2σ1,2 . Using the sample estimates in (2.53), the estimate of this weight is ŵ1 = σ̂22 − σ̂1,2 σ̂21 + σ̂ 2 2 − 2σ̂1,2 = 10.546 − 2.724 7.764 + 10.546 − 2× 2.724 = 0.608 . On Ford it is ŵ2 = 1 − ŵ1 = 0.392. An estimate of the risk of the optimal portfolio is σ̂2 = 0.6082 × 7.764 + 0.3922 × 10.546 + 2× 0.608 × 0.392 × 2.724 = 5.789 . From the invariance property ŵ1, ŵ2 and σ̂ 2 are maximum likelihood es- timates of the population parameters. The risk on the optimal portfolio is less than the individual risks on Apple (σ̂21 = 7.764) and Ford (σ̂ 2 2 = 10.546) stocks, which highlights the advantages of portfolio diversification. 2.7.2 Bimodal Likelihood Consider the case in (2.50) where µ1 = µ2 = 0 and σ 2 1 = σ 2 2 = 1 and where ρ is the only unknown parameter. The log-likelihood function in (2.52) reduces to lnLT (ρ) = − ln 2π− 1 2 ln(1−ρ2)− 1 2(1 − ρ2)T ( T∑ t=1 y21,t−2ρ T∑ t=1 y1,ty2,t+ T∑ t=1 y22,t ) . The gradient is ∂ lnLT (ρ) ∂ρ = ρ 1− ρ2 + 1 (1− ρ2)T T∑ t=1 y1,ty2,t − ρ (1− ρ2)2T ( T∑ t=1 y21,t − 2ρ T∑ t=1 y1,ty2,t + T∑ t=1 y22,t ) . Setting the gradient to zero with ρ = ρ̂ and simplifying the resulting ex- pression by multiplying both sides by (1 − ρ2)2, shows that the maximum 2.7 Applications 81 likelihood estimator is the solution of the cubic equation ρ̂(1− ρ̂ 2) + (1 + ρ̂ 2) 1 T T∑ t=1 y1,ty2,t − ρ̂ ( 1 T T∑ t=1 y21,t + 1 T T∑ t=1 y22,t ) = 0 . (2.54) This equation can have at most three real roots and so the maximum like- lihood estimator may not be uniquely defined by the first order conditions in this case. (a) Gradient G (ρ ) ρ (b) Average log-likelihood A (ρ ) ρ -1 -0.5 0 0.5 1-1 -0.5 0 0.5 1 -3 -2.5 -2 -1.5 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 Figure 2.9 Gradient of the bivariate normal model with respect to the parameter ρ for sample size T = 4. An example of multiple roots is given in Figure 2.9. The data are T = 4 simulated bivariate normal draws y1,t = {−0.6030,−0.0983,−0.1590,−0.6534} and y2,t = {0.1537,−0.2297, 0.6682,−0.4433}. The population parameters are µ1 = µ2 = 0, σ 2 1 = σ 2 2 = 1 and ρ = 0.5. Computing the sample moments yields 1 T T∑ t=1 y1,ty2,t = 0.0283 , 1 T T∑ t=1 y21,t = 0.2064 , 1 T T∑ t=1 y22,t = 0.1798 . From (2.54) define the scaled gradient function as GT (ρ) = ρ(1− ρ2) + (1 + ρ2)(0.0283) − ρ(0.2064 + 0.1798) , which is plotted in panel (a) of Figure 2.9 together with the corresponding 82 Properties of Maximum Likelihood Estimators log-likelihood function in panel (b). The function GT (ρ) has three real roots located at −0.77, −0.05 and 0.79, with the middle root corresponding to a minimum. The global maximum occurs at ρ = 0.79, so this is the maximum likelihood estimator. It also happens to be the closest root to the true value of ρ = 0.5. The solution to the non-uniqueness problem is to evaluate the log- likelihood function at all possible solution values and choose the parameter estimate corresponding to the global maximum. 
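The following sketch implements this strategy for equation (2.54): all roots of the cubic are obtained with MATLAB's roots function, the log-likelihood is evaluated at each real root and the root delivering the global maximum is selected. Because the sample moments are quoted to four decimal places, the computed roots differ slightly from those reported above.

```matlab
% Locating all roots of the first-order condition (2.54) and selecting the
% global maximum of the log-likelihood (sample moments as quoted in the text)
m12 = 0.0283;  m11 = 0.2064;  m22 = 0.1798;

% cubic in rho: -rho^3 + m12*rho^2 + (1 - m11 - m22)*rho + m12 = 0
r = roots([-1, m12, 1 - m11 - m22, m12]);
r = real(r(abs(imag(r)) < 1e-10 & abs(real(r)) < 1));   % real roots in (-1,1)

lnL = @(rho) -log(2*pi) - 0.5*log(1 - rho.^2) ...
      - (m11 - 2*rho*m12 + m22)./(2*(1 - rho.^2));

[~, i] = max(lnL(r));
fprintf('Stationary points:%s\n', sprintf(' %7.3f', r));
fprintf('Global maximum at rho = %6.3f\n', r(i));
```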
2.8 Exercises (1) WLLN (Necessary Condition) Gauss file(s) prop_wlln1.g Matlab file(s) prop_wlln1.m (a) Compute the sample mean of progressively larger samples of size T = 1, 2, · · · , 500, comprising iid draws from the exponential distri- bution f(y;µ) = 1 µ exp [ − y µ ] , y > 0 , with population mean µ = 5. Show that the WLLN holds and hence compare the results with Figure 2.1. (b) Repeat part (a) where f(y;µ) is the Student t distribution with µ = 5 and degrees of freedom parameter ν = {4, 3, 2, 1}. Show that the WLLN holds for all cases except ν = 1. Discuss. (2) WLLN (Sufficient Condition) Gauss file(s) prop_wlln2.g Matlab file(s) prop_wlln2.m (a) A sufficient condition for the WLLN to hold is that E[y] → µ and var(y) → 0 as T → ∞. Compute the sample moments mi = T−1 ∑T t=1 y i t, i = 1, 2, 3, 4, for T = {50, 100, 200, 400, 800} iid draws from the uniform distribution f(y) = 1, −0.5 < y < 0.5 . Ilustrate by simulation that the WLLN holds and compare the re- sults with Table 2.6. 2.8 Exercises 83 (b) Repeat part (a) where f is the Student t distribution, with µ0 = 2, degrees of freedom parameter ν0 = 3 and where the first two population moments are E[ y ] = µ0 , E[ y 2] = ν0 ν0 − 2 + µ20 . Confirm that the WLLN holds only for the sample moments m1 and m2, but not m3 and m4. (c) Repeat part (b) for ν0 = 4 and show that the WLLN now holds for m3 but not for m4. (d) Repeat part (b) for ν0 = 5 and show that the WLLN now holds for m1, m2, m3 and m4. (3) Slutsky’s Theorem Gauss file(s) prop_slutsky.g Matlab file(s) prop_slutsky.m (a) Consider the sample moment given by the square of the standardized mean m = ( y s )2 , where y = T−1 ∑T t=1 yt and s 2 = T−1 ∑T t=1 (yt − y) 2 . Simulate this statistic for samples of size T = {10, 100, 1000} comprising iid draws from the exponential distribution f(y;µ) = 1 µ exp [ − y µ ] , y > 0 , with mean µ = 2 and variance µ2 = 4. Given that plim ( y s )2 = (plim y)2 plim s2 = µ2 µ2 = 1 , demonstrate Slutsky’s theorem where g (·) is the square function. (b) Show that Slutsky’s theorem does not hold for the statistic m = (√ Ty )2 by repeating the simulation experiment in part (a). Discuss why the theorem fails in this case? 84 Properties of Maximum Likelihood Estimators (4) Normal Distribution Consider a random sample of size T , {y1, y2, · · · , yT }, of iid random variables from the normal distribution with unknown mean θ and known variance σ20 = 1 f(y; θ) = 1√ 2π exp [ −(y − θ) 2 2 ] . (a) Derive expressions for the gradient, Hessian and information matrix. (b) Derive the Cramér-Rao lower bound. (c) Find the maximum likelihood estimator θ̂ and show that it is unbi- ased. [Hint: what is ∫∞ −∞ yf(y)dy?] (d) Derive the asymptotic distribution of θ̂. (e) Prove that for the normal density E [ d ln lt dθ ] = 0 , E [(d ln lt dθ )2] = −E [ d2 ln lt dθ2 ] . (f) Repeat parts (a) to (e) where the random variables are from the exponential distribution f(y; θ) = θ exp[−θy] . (5) Graphical Demonstration of Consistency Gauss file(s) prop_consistency.g Matlab file(s) prop_consistency.m (a) Simulate samples of size T = {5, 20, 500} from the normal distribu- tion withmean µ0 = 10 and variance σ 2 0 = 16. For each sample plot the log-likelihood function lnLT (µ, σ 2 0) = 1 T T∑ t=1 f(yt;µ, σ 2) , for a range of values of µ and compare lnLT (µ, σ 2 0) with the popula- tion log-likelihood function E[ln f(yt;µ, σ 2 0)]. Discuss the consistency property of the maximum likelihood estimator of µ. 
(b) Repeat part (a), except now plot the sample log-likelihood function lnLT (µ, σ 2) for different values of σ2 and compare the result with the population log-likelihood function E[ln f(yt;µ0, σ 2)]. Discuss the consistency property of the maximum likelihood estimator of σ2. 2.8 Exercises 85 (6) Consistency of the Sample Mean Assuming Normality Gauss file(s) prop_normal.g Matlab file(s) prop_normal.m This exercise demonstrates the consistency property of the maximum likelihood estimator of the population mean of a normal distribution. (a) Generate the sample means for samples of size T = {1, 2, · · · , 500}, from a N(1, 2) distribution. Plot the sample means for each T and compare the result with Figure 2.4. Interpret the results. (b) Repeat part (a) where the distribution is N(1, 20). (c) Repeat parts (a) and (b) where the largest sample is now T = 5000. (7) Inconsistency of the Sample Mean of a Cauchy Distribution Gauss file(s) prop_cauchy.g Matlab file(s) prop_cauchy.m This exercise shows that the sample mean is an inconsistent estimator of the population mean of a Cauchy distribution, while the median is a consistent estimator. (a) Generate the sample mean and median of the Cauchy distribution with parameter µ0 = 1 for samples of size T = {1, 2, · · · , 500}. Plot the sample statistics for each T and compare the result with Figure 2.5. Interpret the results. (b) Repeat part (a) where the distribution is now Student t with mean µ0 = 1 and ν0 = 2 degrees of freedom. Compare the two results. (8) Efficiency Property of Maximum Likelihood Estimators Gauss file(s) prop_efficiency.g Matlab file(s) prop_efficiency.m This exercise demonstrates the efficiency property of the maximum like- lihood estimator of the population mean of a normal distribution. (a) Generate 10000 samples of size T = 100 from a normal distribution with mean µ0 = 1 and variance σ 2 0 = 2. (b) For each of the 10000 replications compute the sample mean yi. (c) For each of the 10000 replications compute the sample median mi. 86 Properties of Maximum Likelihood Estimators (d) Compute the variance of the sample means around µ0 = 1 as var(y) = 1 10000 10000∑ i=1 (yi − µ0)2 , and compare the result with the theoretical solution var(y) = σ20/T. (e) Compute the variance of the sample medians around µ0 = 1 as var(m) = 1 10000 10000∑ i=1 (mi − µ0)2 , and compare the result with the theoretical solution var(m) = πσ20/2T . (f) Use the results in parts (d) and (e) to show that vary < varm. (9) Asymptotic Normality- Exponential Distribution Gauss file(s) prop_asymnorm.g Matlab file(s) prop_asymnorm.m This exercise demonstrates the asymptotic normality of the maximum likelihood estimator of the parameter (sample mean) of the exponential distribution. (a) Generate 5000 samples of size T = 5 from the exponential distribu- tion f(y; θ) = 1 θ exp [ −y θ ] , θ0 = 1 . (b) For each replication compute the maximum likelihood estimates θ̂i = yi, i = 1, 2, · · · , 5000. (c) Compute the standardized random variables for the sample means using the population mean, θ0, and population variance, θ 2 0/T zi = √ T (yi − 1)√ 12 , i = 1, 2, · · · , 5000. (d) Plot the histogram and interpret its shape. (e) Repeat parts (a) to (d) for T = {50, 100, 500}, and interpret the results. (10) Asymptotic Normality - Chi Square Gauss file(s) prop_chisq.g Matlab file(s) prop_chisq.m 2.8 Exercises 87 This exercise demonstrates the asymptotic normality of the sample mean where the population distribution is a chi-square distribution with one degree of freedom. 
(a) Generate 10000 samples of size T = 5 from the chi-square distribu- tion with ν0 = 1 degrees of freedom. (b) For each replication compute the sample mean. (c) Compute the standardized random variables for the sample means using ν0 = 1 and 2ν0 = 2 zi = √ T (yi − 1)√ 2 , i = 1, 2, · · · , 10000. (d) Plot the histogram and interpret its shape. (e) Repeat parts (a) to (d) for T = {50, 100, 500}, and interpret the results. (11) Regression Model with Gamma Disturbances Gauss file(s) prop_gamma.g Matlab file(s) prop_gamma.m Consider the linear regression model yt = β0 + β1xt + (ut − ρα), where yt is the dependent variable, xt is the explanatory variable and the disturbance term ut is an iid drawing from the gamma distribution f(u; ρ, α) = 1 Γ(ρ) ( 1 α )ρ uρ−1 exp [ − u α ] , with Γ(ρ) representing the gamma function. The term −ρα in the re- gression model is included to ensure that E[ut − ρα] = 0. For samples of size T = {50, 100, 250, 500}, compute the standardized sampling dis- tributions of the least squares estimators z β̂0 = β̂0 − β0 se(β̂0) , z β̂1 = β̂1 − β1 se(β̂1) , based on 5000 draws, parameter values β0 = 1, β1 = 2, ρ = 0.25, α = 0.1 and xt is drawn from a standard normal distribution. Discuss the limiting properties of the sampling distributions. (12) Edgeworth Expansions 88 Properties of Maximum Likelihood Estimators Gauss file(s) prop_edgeworth.g Matlab file(s) prop_edgeworth.m Assume that y is iid exponential with mean θ0 and that the maximum likelihood estimator is θ̂ = y. Define the standardized statistic z = √ T (θ̂ − θ0) θ0 . (a) For a sample of size T = 5 compute the Edgeworth, asymptotic and finite sample distribution functions of z at s = {−3,−2, · · · , 3}. (b) Repeat part (a) for T = {10, 100}. (c) Discuss the ability of the Edgeworth expansion and the asymptotic distribution to approximate the finite sample distribution. (13) Bias of the Sample Variance Gauss file(s) prop_bias.g Matlab file(s) prop_bias.m This exercise demonstrates by simulation that the maximum likelihood estimator of the population variance of a normal distribution with un- known mean is biased. (a) Generate 20000 samples of size T = 5 from a normal distribution with mean µ0 = 1 and variance σ 2 0 = 2. For each replication compute the maximum likelihood estimator of σ20 and the unbiased estimator, respectively, as σ̂2i = 1 T T∑ t=1 (yt − yi)2, σ̃2i = 1 T − 1 T∑ t=1 (yt − yi)2 . (b) Compute the average of the maximum likelihood estimates and the unbiased estimates, respectively, as E [ σ̂2T ] ≃ 1 20000 20000∑ i=1 σ̂2i , E [ σ̃2T ] ≃ 1 20000 20000∑ i=1 σ̃2i . Compare the computed simulated expectations with the population value σ20 = 2. (c) Repeat parts (a) and (b) for T = {10, 50, 100, 500}. Hence show that the maximum likelihood estimator is asymptotically unbiased. (d) Repeat parts (a) and (b) for the case where µ0 is known. Hence show that the maximum likelihood estimator of the population variance is now unbiased even in finite samples. 2.8 Exercises 89 (14) Portfolio Diversification Gauss file(s) prop_diversify.g, apple.csv, ford.csv Matlab file(s) prop_diversify.m, diversify.mat The data files contain daily share prices of Apple and Ford from 2 Jan- uary 2001 to 6 August 2010, a total of T = 2413 observations. (a) Compute the daily percentage returns on Apple, y1,t, and Ford, y2,t. Draw a scatter plot of the returns and interpret the graph. (b) Assume that the returns are iid from a bivariate normal distribution with means µ1 and µ2, variances σ 2 1 and σ 2 2 , and correlation ρ. 
Plot the bivariate normal distribution for ρ = {−0.8,−0.6,−0.4,−0.2, 0.0, 0.2, 0.4, 0.6, 0.8}. (c) Derive the maximum likelihood estimators. (d) Use the data on returns to compute the maximum likelihood esti- mates. (e) Let the return on a portfolio containing Apple and Ford be pt = w1y1,t + w2y2,t, where w1 and w2 are the respective weights. (i) Derive an expression of the risk of the portfolio var(pt). (ii) Derive expressions ofthe weights, w1 and w2, that minimize var(pt). (iii) Use the sample moments in part (d) to estimate the optimal weights and the risk of the portfolio. Compare the estimate of var(pt) with the individual sample variances. (15) Bimodal Likelihood Gauss file(s) prop_binormal.g Matlab file(s) prop_binormal.m (a) Simulate a sample of size T = 4 from a bivariate normal distribution with zero means, unit variances and correlation ρ0 = 0.6. Plot the log-likelihood function lnLT (ρ) = − ln 2π − 1 2 ln(1− ρ2) − 1 2(1 − ρ2) ( 1 T T∑ t=1 y21,t − 2ρ 1 T T∑ t=1 y1,ty2,t + 1 T T∑ t=1 y22,t ) , 90 Properties of Maximum Likelihood Estimators and the scaled gradient function GT (ρ) = ρ(1−ρ2)+(1+ρ2) 1 T T∑ t=1 y1,ty2,t−ρ ( 1 T T∑ t=1 y21,t+ 1 T T∑ t=1 y22,t ) , for values of ρ = {−0.99,−0.98, · · · , 0.99}. Interpret the result and compare the graphs of lnLT (ρ) and GT (ρ) with Figure 2.9. (b) Repeat part (a) for T = {10, 50, 100}, and compare the results with part (a) for the case of T = 4. Hence demonstrate that for the case of multiple roots, the likelihood converges to a global maximum result- ing in the maximum likelihood estimator being unique (see Stuart, Ord and Arnold, 1999, pp. 50-52, for a more formal treatment of this property). 3 Numerical Estimation Methods 3.1 Introduction The maximum likelihood estimator is the solution of a set of equations ob- tained by evaluating the gradient of the log-likelihood function at zero. For many of the examples considered in the previous chapters, a closed-form solution is available. Typical examples consist of the sample mean, or some function of it, the sample variance and the least squares estimator. There are, however, many cases in which the specified model yields a likelihood function that does not admit closed-form solutions for the maximum likeli- hood estimators. Example 3.1 Cauchy Distribution Let {y1, y2, · · · , yT } be T iid realized values from the Cauchy distribution f(y; θ) = 1 π 1 1 + (y − θ)2 , where θ is the unknown parameter. The log-likelihood function is lnLT (θ) = − lnπ − 1 T T∑ t=1 ln [ 1 + (yt − θ)2 ] , resulting in the gradient d lnLT (θ) dθ = 2 T T∑ t=1 yt − θ 1 + (yt − θ)2 . The maximum likelihood estimator, θ̂, is the solution of 2 T T∑ t=1 yt − θ̂ 1 + (yt − θ̂)2 = 0 . 92 Numerical Estimation Methods This is a nonlinear function of θ̂ for which no analytical solution exists. To obtain the maximum likelihood estimator where no analytical solution is available, numerical optimization algorithms must be used. These algo- rithms begin by assuming starting values for the unknown parameters and then proceed iteratively until a convergence criterion is satisfied. A general form for the kth iteration is θ(k) = F (θ(k−1)) , where the form of the function F (·) is governed by the choice of the numerical algorithm. Convergence of the algorithm is achieved when the log-likelihood function cannot be further improved, a situation in which θ(k) ≃ θ(k−1), resulting in θ(k) being the maximum likelihood estimator of θ. 
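Before turning to the Newton-type algorithms developed in the next section, the point can be illustrated with a built-in optimizer: the following sketch uses MATLAB's derivative-free fminsearch routine (for illustration only; it is not one of the algorithms discussed in this chapter) to maximize the Cauchy log-likelihood for a small artificial sample.

```matlab
% Numerical maximisation of the Cauchy log-likelihood: no closed form exists,
% so an iterative routine is required (here fminsearch, for illustration only)
y = [2.1  -0.3  0.8  1.6  0.2  5.3  0.9  1.4];              % artificial sample

negLnL = @(theta) log(pi) + mean(log(1 + (y - theta).^2));  % -lnL_T(theta)

theta0   = median(y);                      % starting value
thetahat = fminsearch(negLnL, theta0);

fprintf('Maximum likelihood estimate: theta = %6.4f\n', thetahat);
```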
3.2 Newton Methods From Chapter 1, the gradient and Hessian are defined respectively as GT (θ) = ∂ lnLT (θ) ∂θ = 1 T T∑ t=1 gt , HT (θ) = ∂2 lnLT (θ) ∂θ∂θ′ = 1 T T∑ t=1 ht . A first-order Taylor series expansion of the gradient function around the true parameter vector θ0 is GT (θ) ≃ GT (θ0) +HT (θ0)(θ − θ0) , (3.1) where higher-order terms are excluded in the expansion and GT (θ0) and HT (θ0) are, respectively, the gradient and Hessian evaluated at the true parameter value, θ0. As the maximum likelihood estimator, θ̂, is the solution to the equation GT (θ̂) = 0, the maximum likelihood estimator satisfies GT (θ̂) = 0 = GT (θ0) +HT (θ0)(θ̂ − θ0) , (3.2) where, for convenience, the equation is now written as an equality. This is a linear equation in θ̂ with solution θ̂ = θ0 −H−1T (θ0)GT (θ0) . (3.3) As it stands, this equation is of little practical use because it expresses the maximum likelihood estimator as a function of the unknown parameter that it seeks to estimate, namely θ0. It suggests, however, that a natural way to proceed is to replace θ0 with a starting value and use (3.3) as an updating scheme. This is indeed the basis of Newton methods. Three algorithms are discussed, differing only in the way that the Hessian, HT (θ), is evaluated. 3.2 Newton Methods 93 3.2.1 Newton-Raphson Let θ(k) be the value of the unknown parameters at the k th iteration. The Newton-Raphson algorithm is given by replacing θ0 in (3.3) by θ(k−1) to yield the updated parameter θ(k) θ(k) = θ(k−1) −H−1(k−1)G(k−1) , (3.4) where G(k) = ∂ lnLT (θ) ∂θ ∣∣∣∣ θ=θ(k) , H(k) = ∂2 lnLT (θ) ∂θ∂θ′ ∣∣∣∣ θ=θ(k) . The algorithm proceeds until θ(k) ≃ θ(k−1), subject to some tolerance level, which is discussed in more detail later. From (3.4), convergence occurs when θ(k) − θ(k−1) = −H−1(k−1)G(k−1) ≃ 0 , which can only be satisfied if G(k) ≃ G(k−1) ≃ 0 , because both H−1(k−1) and H −1 (k) are negative definite. But this is exactly the condition that defines the maximum likelihood estimator, θ̂ so that θ(k) ≃ θ̂ at the final iteration. To implement the Newton-Raphson algorithm, both the first and second derivatives of the log-likelihood function, G(·) and H(·), are needed at each iteration. Applying the Newton-Raphson algorithm to estimating the param- eter of an exponential distribution numerically highlights the computations required to implement this algorithm. As an analytical solution is available for this example, the accuracy and convergence properties of the numerical procedure can be assessed. Example 3.2 Exponential Distribution: Newton-Raphson Let yt = {3.5, 1.0, 1.5} be iid drawings from the exponential distribution f(y; θ) = 1 θ exp [ −y θ ] , where θ > 0. The log-likelihood function is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) = − ln(θ)− 1 θT T∑ t=1 yt = − ln(θ)− 2 θ . The first and second derivatives are respectively GT (θ) = − 1 θ + 1 θ2T T∑ t=1 yt = − 1 θ + 2 θ2 , HT (θ) = 1 θ2 − 2 θ3T T∑ t=1 yt = 1 θ2 − 4 θ3 . 94 Numerical Estimation Methods Setting GT (θ̂) = 0 gives the analytical solution θ̂ = 1 T T∑ t=1 yt = 6 3 = 2 . Let the starting value for the Newton-Raphson algorithm be θ(0) = 1. Then the corresponding starting values for the gradient and Hessian are G(0) = − 1 1 + 2 12 = 1 , H(0) = 1 12 − 4 13 = −3 . The updated parameter value is computed using (3.4) and is given by θ(1) = θ(0) −H−1(0)G(0) = 1− ( − 1 3 ) × 1 = 1.333 . As θ(1) 6= θ(0), the iterations continue. 
For the next iteration the gradient and Hessian are re-evaluated at θ(1) = 1.333 to give, respectively, G(1) = − 1 1.333 + 2 1.3332 = 0.375, H(1) = 1 1.3332 − 4 1.3333 = −1.126 , yielding the updated value θ(2) = θ(1) −H−1(1)G(1) = 1.333 − ( − 1 1.126 ) × 0.375 = 1.667 . As G(1) = 0.375 < G(0) = 1, the algorithm is converging to the maxi- mum likelihood estimator where G(k) ≃ 0. The calculations for successive iterations are reported in the first block of results in Table 3.1. Using a con- vergence tolerance of 0.00001, the Newton-Raphson algorithm converges in k = 7 iterations to θ̂ = 2.0, which is also the analytical solution. 3.2.2 Method of Scoring The method of scoring uses the information matrix equality in equation (2.33) of Chapter 2 from which it follows that I(θ0) = −E[ht(θ0)] . By replacing the expectation by the sample average an estimate of I(θ0) is the negative of the Hessian −HT (θ0) = − 1 T T∑ t=1 ht(θ0) , 3.2 Newton Methods 95 which is used in the Newton-Raphson algorithm. This suggests that another variation of (3.3) is to replace −HT (θ0) by the information matrix evaluated at θ(k−1). The iterative scheme of the method of scoring is θ(k) = θ(k−1) + I −1 (k−1)G(k−1) , (3.5) where I(k) = E[ht(θ(k))]. Example 3.3 Exponential Distribution: Method of Scoring From Example 3.2 the Hessian at time t is ht(θ) = 1 θ2 − 2 θ3 yt . The informationmatrix is then I(θ0) = −E [ht] = −E [ 1 θ20 − 2 θ30 yt ] = − 1 θ20 + 2 θ30 E [yt] = − 1 θ20 + 2θ0 θ30 = 1 θ20 , where the result E[yt] = θ0 for the exponential distribution is used. Evalu- ating the gradient and the information matrix at the starting value θ(0) = 1 gives, respectively, G(0) = − 1 1 + 2 12 = 1 , I(0) = 1 12 = 1 . The updated parameter value, computed using equation (3.5), is θ(1) = θ(0) + I −1 (0)G(0) = 1 + (1 1 ) × 1 = 2 . The sequence of iterations is in the second block of results in Table 3.1. For this algorithm, convergence is achieved in k = 1 iterations since G(1) = 0 and θ(1) = 2, which is also the analytical solution. As demonstrated by Example 3.3, the method of scoring requires po- tentially fewer iterations than the Newton-Raphson algorithm to achieve convergence. This is because the scoring algorithm, by replacing the Hes- sian with the information matrix, uses more information about the structure of the model than does Newton-Raphson . However, for many econometric models the calculation of the information matrix can be difficult, making this algorithm problematic to implement in practice. 3.2.3 BHHH Algorithm The BHHH algorithm (Berndt, Hall, Hall and Hausman, 1974) uses the information matrix equality in equation (2.33) to express the information 96 Numerical Estimation Methods Table 3.1 Demonstration of alternative algorithms to compute the maximum likelihood estimate of the parameter of the exponential distribution. 
Iteration    θ(k−1)      G(k−1)      M(k−1)      lnL(k−1)     θ(k)

Newton-Raphson: M(k−1) = H(k−1)
k = 1        1.0000      1.0000     -3.0000     -2.0000      1.3333
k = 2        1.3333      0.3750     -1.1250     -1.7877      1.6667
k = 3        1.6667      0.1200     -0.5040     -1.7108      1.9048
k = 4        1.9048      0.0262     -0.3032     -1.6944      1.9913
k = 5        1.9913      0.0022     -0.2544     -1.6932      1.9999
k = 6        1.9999      0.0000     -0.2500     -1.6931      2.0000
k = 7        2.0000      0.0000     -0.2500     -1.6931      2.0000

Scoring: M(k−1) = I(k−1)
k = 1        1.0000      1.0000      1.0000     -2.0000      2.0000
k = 2        2.0000      0.0000      0.2500     -1.6931      2.0000

BHHH: M(k−1) = J(k−1)
k = 1        1.0000      1.0000      2.1667     -2.0000      1.4615
k = 2        1.4615      0.2521      0.3192     -1.7479      2.2512
k = 3        2.2512     -0.0496      0.0479     -1.6999      1.2161
k = 4        1.2161      0.5301      0.8145     -1.8403      1.8669
k = 5        1.8669      0.0382      0.0975     -1.6956      2.2586
k = 6        2.2586     -0.0507      0.0474     -1.7002      1.1892
k = 7        1.1892      0.5734      0.9121     -1.8551      1.8178

matrix as
I(\theta_0) = J(\theta_0) = \mathrm{E}\left[ g_t(\theta_0) g_t'(\theta_0) \right] .    (3.6)
Replacing the expectation by the sample average yields an alternative estimate of I(θ0) given by
J_T(\theta_0) = \frac{1}{T} \sum_{t=1}^{T} g_t(\theta_0) g_t'(\theta_0) ,    (3.7)
which is the sample analogue of the outer product of gradients matrix. The BHHH algorithm is obtained by replacing −HT (θ0) in equation (3.3) by JT (θ0) evaluated at θ(k−1), giving the updating scheme
\theta_{(k)} = \theta_{(k-1)} + J_{(k-1)}^{-1} G_{(k-1)} ,    (3.8)
where
J_{(k)} = \frac{1}{T} \sum_{t=1}^{T} g_t(\theta_{(k)}) g_t'(\theta_{(k)}) .
Example 3.4 Exponential Distribution: BHHH
To estimate the parameter of the exponential distribution using the BHHH algorithm, the gradient must be evaluated at each observation. From Example 3.2 the gradient at time t is
g_t(\theta) = \frac{\partial \ln l_t}{\partial \theta} = -\frac{1}{\theta} + \frac{y_t}{\theta^2} .
The outer product of gradients matrix in equation (3.7) is
J_T(\theta) = \frac{1}{3} \sum_{t=1}^{3} g_t g_t' = \frac{1}{3} \sum_{t=1}^{3} g_t^2
            = \frac{1}{3}\left(-\frac{1}{\theta} + \frac{3.5}{\theta^2}\right)^2 + \frac{1}{3}\left(-\frac{1}{\theta} + \frac{1.0}{\theta^2}\right)^2 + \frac{1}{3}\left(-\frac{1}{\theta} + \frac{1.5}{\theta^2}\right)^2 .
Using θ(0) = 1 as the starting value gives
J_{(0)} = \frac{1}{3}\left(-\frac{1}{1} + \frac{3.5}{1^2}\right)^2 + \frac{1}{3}\left(-\frac{1}{1} + \frac{1.0}{1^2}\right)^2 + \frac{1}{3}\left(-\frac{1}{1} + \frac{1.5}{1^2}\right)^2 = \frac{2.5^2 + 0.0^2 + 0.5^2}{3} = 2.1667 .
The gradient vector evaluated at θ(0) = 1 immediately follows as
G_{(0)} = \frac{1}{3} \sum_{t=1}^{3} g_t = \frac{2.5 + 0.0 + 0.5}{3} = 1.0 .
The updated parameter value, computed using equation (3.8), is
\theta_{(1)} = \theta_{(0)} + J_{(0)}^{-1} G_{(0)} = 1 + (2.1667)^{-1} \times 1 = 1.4615 .
The remaining iterations of the BHHH algorithm are contained in the third block of results in Table 3.1. Inspection of these results reveals that the algorithm has still not converged after k = 7 iterations, with the estimate at this iteration being θ(7) = 1.8178. It is also apparent that successive values of the log-likelihood function do not increase monotonically across iterations. For iteration k = 2, the log-likelihood is lnL(2) = −1.6999 but, for k = 3, it decreases to lnL(3) = −1.8403. This problem is addressed in Section 3.4 by using a line-search procedure during the iterations of the algorithm.
The BHHH algorithm only requires the computation of the gradient of the log-likelihood function and is therefore relatively easy to implement. A potential advantage of this algorithm is that the outer product of the gradients matrix is always guaranteed to be positive semi-definite. The cost of using this algorithm, however, is that it may require more iterations than either the Newton-Raphson or the scoring algorithms do, because information is lost due to the approximation of the information matrix by the outer product of the gradients matrix.
A useful way to think about the structure of the BHHH algorithm is as follows.
Let the (T ×K) matrix, X, and the (T × 1) vector, Y , be given by X = ∂ ln l1(θ) ∂θ1 ∂ ln l1(θ) ∂θ2 · · · ∂ ln l1(θ) ∂θK ∂ ln l2(θ) ∂θ1 ∂ ln l2(θ) ∂θ2 · · · ∂ ln l2(θ) ∂θK ... ... . . . ... ∂ ln lT (θ) ∂θ1 ∂ ln lT (θ) ∂θ2 · · · ∂ ln lT (θ) ∂θK , Y = 1 1 ... 1 . An iteration of the BHHH algorithm is now written as θ(k) = θ(k−1) + (X ′ (k−1)X(k−1)) −1X ′(k−1)Y , (3.9) where J(k−1) = 1 T X ′(k−1)X(k−1) , G(k−1) = 1 T X ′(k−1)Y . The second term on the right-hand side of equation (3.9) represents an ordi- nary least squares regression, where the dependent variable Y is regressed on the explanatory variables given by the matrix of gradients, X(k−1), evaluated at θ(k−1). 3.2.4 Comparative Examples To highlight the distinguishing features of the Newton-Raphson, scoring and BHHH algorithms, some additional examples are now presented. Example 3.5 Cauchy Distribution Let {y1, y2, · · · , yT } be T iid realized values from the Cauchy distribution. 3.2 Newton Methods 99 From Example 3.1, the log-likelihood function is lnLT (θ) = −1 lnπ − 1 T T∑ t=1 ln [ 1 + (yt − θ)2 ] . Define GT (θ) = 2 T T∑ t=1 [ yt − θ 1 + (yt − θ)2 ] HT (θ) = 2 T T∑ t=1 (yt − θ)2 − 1 (1 + (yt − θ)2)2 JT (θ) = 4 T T∑ t=1 (yt − θ)2 (1 + (yt − θ)2)2 I(θ) = − ∫ ∞ −∞ 2 T T∑ t=1 (y − θ)2 − 1 (1 + (y − θ)2)2 f(y)dy = 1 2 , where the information matrix is as given by Kendall and Stuart (1973, Vol 2). Given the starting value, θ(0), the first iteration of the Newton-Raphson, scoring and BHHH algorithms are, respectively, θ(1) = θ(0) − [ 2 T T∑ t=1 (yt − θ(0))2 − 1 (1 + (yt − θ(0))2)2 ]−1 [ 2 T T∑ t=1 yt − θ(0) 1 + (yt − θ(0))2 ] θ(1) = θ(0) + 4 T T∑ t=1 yt − θ(0) (1 + (yt − θ(0))2) θ(1) = θ(0) + 1 2 [ 1 T T∑ t=1 (yt − θ(0))2 (1 + (yt − θ(0))2)2 ]−1 [ 1 T T∑ t=1 yt − θ(0) (1 + (yt − θ(0))2) ] . Example 3.6 Weibull Distribution Consider T = 20 independent realizations yt = {0.293, 0.589, 1.374, 0.954, 0.608, 1.199, 1.464, 0.383, 1.743, 0.022 0.719, 0.949, 1.888, 0.754, 0.873, 0.515, 1.049, 1.506, 1.090, 1.644} , drawn from the Weibull distribution f(y; θ) = αβyβ−1 exp [ −αyβ ] , 100 Numerical Estimation Methods with unknown parameters θ = {α, β}. The log-likelihood function is lnLT (α, β) = lnα+ ln β + (β − 1) 1 T T∑ t=1 ln yt − α 1 T T∑ t=1 (yt) β . Define GT (θ) = 1 α − 1 T T∑ t=1 yβt 1 β + 1 T T∑ t=1 ln yt − α 1 T T∑ t=1 (ln yt) y β t HT (θ) = − 1 α2 − 1 T T∑ t=1 (ln yt) y β t − 1 T T∑ t=1 (ln yt) y β t − 1 β2 − α 1 T T∑ t=1 (ln yt) 2 yβt JT (θ) = 1 T T∑ t=1 ( 1 α − yβt )2 1 T T∑ t=1 ( 1 α − yβt ) g2,t 1 T T∑ t=1 g2,t ( 1 α − yβt ) 1 T T∑ t=1 g22,t , where g2,t = β −1 + ln yt − α (ln yt) yβt . Only the iterations of the Newton- Raphson and BHHH algorithms are presented because in this case the infor- mation matrix is intractable. Choosing thestarting values θ(0) = {0.5, 1.5} yields a log-likelihood function value of lnL(0) = −0.959 and G(0) = [ 0.931 0.280 ] , H(0) = [ −4.000 −0.228 −0.228 −0.547 ] , J(0) = [ 1.403 −0.068 −0.068 0.800 ] . The Newton-Raphson and the BHHH updates are, respectively, [ α(1) β(1) ] = [ 0.5 1.5 ] − [ −4.000 −0.228 −0.228 −0.547 ]−1 [ 0.931 0.280 ] = [ 0.708 1.925 ] [ α(1) β(1) ] = [ 0.5 1.5 ] + [ 1.403 −0.068 −0.068 0.800 ]−1 [ 0.931 0.280 ] = [ 1.183 1.908 ] . Evaluating the log-likelihood function at the updated parameter estimates gives lnL(1) = −0.782 for Newton-Raphson and lnL(1) = −0.829 for BHHH. Both algorithms, therefore, show an improvement in the value of the log- likelihood function after one iteration. 
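Before turning to quasi-Newton methods, it is worth seeing how little code the three updating schemes require. The following MATLAB fragment is a minimal sketch that reproduces the three blocks of Table 3.1 for the exponential example; it assumes the sample yt = {3.5, 1.0, 1.5} of Example 3.4 and hard-codes seven iterations rather than testing a convergence criterion.

% Newton-Raphson, scoring and BHHH iterations for the exponential example
% (sketch only; y is the sample of Example 3.4 and seven iterations are used)
y = [3.5; 1.0; 1.5];
theta = [1.0; 1.0; 1.0];                        % [Newton-Raphson; scoring; BHHH]
for k = 1:7
    % Newton-Raphson: theta(k) = theta(k-1) - H^(-1) G
    G = -1/theta(1) + mean(y)/theta(1)^2;       % average gradient G_T
    H =  1/theta(1)^2 - 2*mean(y)/theta(1)^3;   % average Hessian H_T
    theta(1) = theta(1) - G/H;
    % Scoring: replace -H_T by the information matrix I(theta) = 1/theta^2
    G = -1/theta(2) + mean(y)/theta(2)^2;
    theta(2) = theta(2) + theta(2)^2*G;
    % BHHH: replace -H_T by the outer product of the gradients J_T
    g = -1/theta(3) + y/theta(3)^2;             % per-observation gradients
    theta(3) = theta(3) + mean(g)/mean(g.^2);
end
disp(theta')                                    % approx [2.0000 2.0000 1.8178]

The first two elements converge to the analytical solution of 2, while the third reproduces the oscillating BHHH sequence whose value after seven iterations is 1.8178, as reported in Table 3.1.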
3.3 Quasi-Newton Methods 101 3.3 Quasi-Newton Methods The distinguishing feature of the Newton-Raphson algorithm is that it com- putes the Hessian directly. An alternative approach is to build up an estimate of the Hessian at each iteration, starting from an initial estimate known to be negative definite, usually the negative of the identity matrix. This type of algorithm is known as quasi-Newton. The general form for the updating sequence of the Hessian is H(k) = H(k−1) + U(k−1) , (3.10) where H(k) is the estimate of the Hessian at the k th iteration and U(k) is an update matrix. Quasi-Newton algorithms differ only in their choice of this update matrix. One of the more important variants is the BFGS algorithm (Broyden, 1970; Fletcher, 1970; Goldfarb, 1970; Shanno, 1970) where the updating matrix U(k−1) in equation (3.10) is U(k−1) = − H(k−1)∆θ∆ ′ G +∆G∆ ′ θH(k−1) ∆′G∆θ + ( 1 + ∆′θH(k−1)∆θ ∆′G∆θ )∆G∆′G ∆′G∆θ , where ∆θ = θ(k) − θ(k−1) , ∆G = G(k) −G(k−1) , represent the changes in the parameter values and the gradients between iterations, respectively. To highlight the properties of the BFGS scheme for updating the Hessian, consider the one parameter case where all terms are scalars. In this situation, the update matrix reduces to U(k−1) = −2H(k−1) + ( 1 + ∆θH(k−1) ∆G )∆G ∆θ , so that the approximation to the Hessian in equation (3.10) is H(k) = ∆G ∆θ = G(k) −G(k−1) θ(k) − θ(k−1) . (3.11) This equation is a numerical approximation to the first derivative of the gradient based on a step length equal to the change in θ across iterations (see Section 3.7.4). For the early iterations of the BFGS algorithm, the numerical approximation is expected to be crude because the size of the step, ∆θ, is potentially large. As the iterations progress, this step interval diminishes resulting in an improvement in the accuracy of the numerical derivatives as the algorithm approaches the maximum likelihood estimate. 102 Numerical Estimation Methods Example 3.7 Exponential Distribution Using BFGS Continuing the example of the exponential distribution, let the initial value of the Hessian be H(0) = −1, and the starting value of the parameter be θ(0) = 1.5. The gradient at θ(0) is G(0) = − 1 1.5 + 2 1.52 = 0.2222 , and the updated parameter value is θ(1) = θ(0) −H−1(0)G(0) = 1.5− (−1)× 0.2222 = 1.7222 . The gradient evaluated at θ(1) is G(1) = − 1 1.7222 + 2 1.72222 = 0.0937 , and ∆θ = θ(1) − θ(0) = 1.7222 − 1.5 = 0.2222 ∆G = G(1) −G(0) = 0.0937 − 0.2222 = −0.1285 . The updated value of the Hessian from equation (3.11) is H(1) = G(1) −G(0) θ(1) − θ(0) = −0.1285 0.2222 = −0.5786 , so that for iteration k = 2 θ(2) = θ(1) −H−1(1)G(1) = 1.7222 − (−0.5786) −1 × 0.0937 = 1.8841 . The remaining iterations are given in Table 3.2. By iteration k = 6, the algorithm has converged to the analytical solution θ̂ = 2. Moreover, the computed value of the Hessian using the BFGS updating algorithm is equal to its analytical solution of −0.75. 3.4 Line Searching One problem with the simple updating scheme in equation (3.3) is that the updated parameter estimates are not guaranteed to improve the log- likelihood, as in Example 3.4. To ensure that the log-likelihood function increases at each iteration, the algorithm is now augmented by a parameter, λ, that controls the size of updating at each step according to θ(k) = θ(k−1) − λH−1(k−1)G(k−1) , 0 ≤ λ ≤ 1 . 
(3.12) 3.4 Line Searching 103 Table 3.2 Demonstration of the use of the BFGS algorithm to compute the maximum likelihood estimate of the parameter of the exponential distribution. Iteration θ(k−1) G(k−1) H(k−1) lnL(k−1) θ(k) k = 1 1.5000 0.2222 -1.0000 -1.7388 1.7222 k = 2 1.7222 0.0937 -0.5786 -1.7049 1.8841 k = 3 1.8841 0.0327 -0.3768 -1.6950 1.9707 k = 4 1.9707 0.0075 -0.2899 -1.6933 1.9967 k = 5 1.9967 0.0008 -0.2583 -1.6931 1.9999 k = 6 1.9999 0.0000 -0.2508 -1.6931 2.0000 k = 7 2.0000 0.0000 -0.2500 -1.6931 2.0000 For λ = 1, the full step is taken so updating is as before; for smaller values of λ, updating is not based on the full step. Determining the optimal value of λ at each iteration is a one-dimensional optimization problem known as line searching. The simplest way to choose λ is to perform a coarse grid search over possible values for λ known as squeezing. Potential choices of λ follow the order λ = 1, λ = 1 2 , λ = 1 3 , λ = 1 4 , · · · The strategy is to calculate θ(k) for λ = 1 and check to see if lnL(k) > lnL(k−1). If this condition is not satisfied, choose λ = 1/2 and test to see if the log-likelihood function improves. If it does not, then choose λ = 1/3 and repeat the function evaluation. Once a value of λ is chosen and an updated parameter value is computed, the procedure begins again at the next step with λ = 1. Example 3.8 BHHH with Squeezing In this example, the convergence problems experienced by the BHHH algorithm in Example 3.4 and shown in Table 3.1 are solved by allowing for squeezing. Inspection of Table 3.1 shows that for the simple BHHH al- gorithm, at iteration k = 3, the value of θ changes from θ(2) = 2.2512 to θ(3) = 1.2161 with the value of the log-likelihood function falling from lnL(2) = −1.6999 to lnL(3) = −1.8403. Now squeeze the step interval by λ = 1/2 so that the updated value of θ at the third iteration is θ(3) = θ(2) + 1 2 J−1(2) G(2) = 2.2512 + 1 2 × (0.0479)−1(−0.0496) = 1.7335 . 104 Numerical Estimation Methods Evaluating the log-likelihood function at the new value for θ(3) gives lnL(3)(λ = 1/2) = − ln(1.7335) − 2 1.7335 = −1.7039 , which represents an improvement on−1.8403, but is still lower than lnL(2) = −1.6999. Table 3.3 Demonstration of the use of the BHHH algorithm with squeezing to compute the maximum likelihood estimate of the parameter of the exponential distribution. Iteration θ(k−1) G(k−1) J(k−1) lnL(k−1) θ(k) k=1 1.0000 1.0000 2.1667 -1.7479 1.4615 k=2 1.4615 0.2521 0.3192 -1.6999 2.2512 k=3 2.2512 -0.0496 0.0479 -1.6943 1.9061 k=4 1.9061 0.0258 0.0890 -1.6935 2.0512 k=5 2.0512 -0.0122 0.0661 -1.6934 1.9591 k=6 1.9591 0.0107 0.0793 -1.6932 2.0263 k=7 2.0263 -0.0064 0.0692 -1.6932 1.9801 k=8 1.9801 0.0051 0.0759 -1.6932 2.0136 k=9 2.0136 -0.0033 0.0710 -1.6932 1.9900 k=10 1.9900 0.0025 0.0744 -1.6932 2.0070 By again squeezing the step interval λ = 1/3, the updated value of θ at the second iteration is now θ(3) = θ(2) + 1 3 J−1(2) G(2) = 2.2512 + 1 3 × (0.0479)−1(−0.0496) = 1.9061 . Evaluating the log-likelihood function at this value gives lnL(3)(λ = 1/3) = − ln(1.9061) − 2 1.9061 = −1.6943 . As this value is an improvement on lnL(2) = −1.6999, the value of θ at the second iteration is taken to be θ(3) = 1.9061. Inspection of the log-likelihood function at each iteration in Table 3.3 shows that the improvement in the log-likelihood function is now monotonic. 3.5 Optimisation Based on Function Evaluation Practical optimisation problems frequently generate log-likelihood functions with irregular surfaces. 
In particular, if the gradient is nearly flat in several dimensions, numerical errors can cause a gradient algorithm to misbehave. 3.5 Optimisation Based on Function Evaluation 105 Consequently, many iterative algorithms are based solely on functionevalu- ation, including the simplex method of Nelder and Mead (1965) and other more sophisticated schemes such as simulated annealing and genetic search algorithms. These procedures are all fairly robust, but they are more inef- ficient than gradient-based methods and normally require many more func- tion evaluations to locate the optimum. Because of its popularity in practical work and its simplicity, the simplex algorithm is only briefly described here. For a more detailed account, see Gill, Murray and Wright (1981). This al- gorithm is usually presented in terms of function minimization rather than the maximising framework adopted in this chapter. This situation is easily accommodated by recognizing that maximizing the log-likelihood function with respect to θ is identical to minimizing the negative log-likelihood func- tion with respect to θ. The simplex algorithm employs a simple sequence of moves based solely on function evaluations. Consider the negative log-likelihood function− lnLT (θ), which is to be minimized with respect to the parameter vector θ. The al- gorithm is initialized by evaluating the function for n+ 1 different starting choices, where n = dim(θ), and the function values are ordered so that − lnL(θn+1) is the current worst estimate and − lnL(θ1) is the best current estimate, that is − lnL(θn+1) ≥ − lnL(θn) ≥ · · · ≥ − lnL(θ1). Define θ̄ = 1 n n∑ i=1 θi , as the mean (centroid) of the best n vertices. In a two-dimensional problem, θ̄ is the midpoint of the line joining the two best vertices of the current simplex. The basic iteration of the simplex algorithm consists of the following sequence of steps. Reflect: Reflect the worst vertex through the opposite face of the simplex θr = θ̄ + α(θ̄ − θn+1) , α > 0 . If the reflection is successful, − lnL(θr) < − lnL(θn), start the next iteration by replacing θn+1 with θr. Expand: If θr is also better than θ1, − lnL(θr) < − lnL(θ1), compute θe = θ̄ + β(θr − θ̄) , β > 1 . If − lnL(θe) < − lnL(θr), start the next iteration by replacing θn+1 with θe. Contract: If θr is not successful, − lnL(θr) > − lnL(θn), contract the sim- 106 Numerical Estimation Methods plex as follows: θc = θ̄ + γ(θr − θ̄) if − lnL(θr) < − lnL(θn+1) θ̄ + γ(θn+1 − θ̄) if − lnL(θr) ≥ − lnL(θn+1) , for 0 < γ < 1. Shrink: If the contraction is not successful, shrink the vertices of the simplex half-way toward the current best point and start the next iteration. To make the simplex algorithm operational, values for the reflection, α, expansion, β, and contraction, γ, parameters are required. Common choices of these parameters are α = 1, β = 2 and γ = 0.5 (see Gill, Murray and Wright, 1981; Press, Teukolsky, Vetterling and Flannery, 1992). 3.6 Computing Standard Errors From Chapter 2, the asymptotic distribution of the maximum likelihood estimator is √ T (θ̂ − θ0) d→ N(0, I−1(θ0)) . The covariance matrix of the maximum likelihood estimator is estimated by replacing θ0 by θ̂ and inverting the information matrix Ω̂ = I−1(θ̂) . (3.13) The standard error of each element of θ̂ is given by the square root of the main-diagonal entries of this matrix. In most practical situations, the infor- mation matrix is not easily evaluated. 
A more common approach, therefore, is simply to use the negative of the inverse Hessian evaluated at θ̂: Ω̂ = −H−1T (θ̂) . (3.14) If the Hessian is not negative-definite at the maximum likelihood estimator, computation of the standard errors from equation (3.14) is not possible. A popular alternative is to use the outer product of gradients matrix, JT (θ̂) from equation (3.7), instead of the negative of the Hessian Ω̂ = J−1T (θ̂) . (3.15) Example 3.9 Exponential Distribution Standard Errors The values of the Hessian and the information matrix, taken from Table 3.1, and the outer product of gradients matrix, taken from Table 3.3, are, respectively, HT (θ̂) = −0.250, I(θ̂) = 0.250, JT (θ̂) = 0.074 . 3.6 Computing Standard Errors 107 The standard errors are Hessian : se(θ̂) = √ − 1 T H−1T (θ̂) = √ −13(−0.250)−1 = 1.547 Information : se(θ̂) = √ 1 T I−1(θ̂) = √ 1 3(0.250) −1 = 1.547 Outer Product : se(θ̂) = √ 1 T J−1T (θ̂) = √ 1 3(0.074) −1 = 2.122 . The standard errors based on the Hessian and information matrices yield the same values, while the estimate based on the outer product of gradients matrix is nearly 40% larger. One reason for this difference is that the outer product of the gradients matrix may not always provide a good approxi- mation to the information matrix. Another reason is that the information and outer product of the gradients matrices may not converge to the same value as T increases. This occurs when the distribution used to construct the log-likelihood function is misspecified (see Chapter 9). Estimating the covariance matrix of a nonlinear function of the maximum likelihood estimators, say C(θ), is a situation that often arises in practice. There are two approaches to dealing with this problem. The first approach, known as the substitution method, simply imposes the nonlinearity and then uses the constrained log-likelihood function to compute standard errors. The second approach, called the delta method, uses a mean value expansion of C(θ̂) around the true parameter θ0 C(θ̂) = C(θ0) +D(θ ∗)(θ̂ − θ0) , where D(θ) = ∂C(θ) ∂θ′ , and θ∗ is an intermediate value between θ̂ and θ0. As T → ∞ the mean value expansion gives √ T (C(θ̂)− C(θ0)) = D(θ∗) √ T (θ̂ − θ0) d→ D(θ0)×N(0, I(θ0)−1) = N(0,D(θ0)I(θ0) −1D(θ0) ′) , or C(θ̂) a ∼ N(C(θ0), 1 T D(θ0)I −1(θ0)D(θ0) ′) . 108 Numerical Estimation Methods Thus cov(C(θ̂)) = 1 T D(θ0)I −1(θ0)D(θ0) ′ , and this can be estimated by replacing D(θ0) with D(θ̂) and I −1(θ0) with Ω̂ from any of equations (3.13), (3.14) or (3.15). Example 3.10 Standard Errors of Nonlinear Functions Consider the problem of finding the standard error for y2 where observa- tions are drawn from a normal distribution with known variance σ20 . (1) Substitution Method Consider the log-likelihood function for the unconstrained problem lnLT (θ) = − 1 2 ln(2π) − 1 2 ln(σ20)− 1 2σ20T T∑ t=1 (yt − θ)2 . Now define ψ = θ2 so that the constrained log-likelihood function is lnLT (ψ) = − 1 2 ln(2π) − 1 2 ln(σ20)− 1 2σ20T T∑ t=1 (yt − ψ1/2)2 . The first and second derivatives are d lnLT (ψ) dψ = 1 2σ20T T∑ t=1 (yt − ψ1/2)ψ−1/2 d2 lnLT (ψ) dψ2 = − 1 2σ20T T∑ t=1 ( 1 2ψ + (yt − ψ1/2) 1 2 ψ−3/2 ) . Recognizing that E[yt − ψ1/20 ] = 0, the information matrix is I(ψ0) = −E [ d2 ln lt dψ2 ] = 1 2σ20 1 2ψ0 = 1 4σ40ψ0 = 1 4σ20θ 2 0 . The standard error is then se(ψ̂) = √ 1 T I−1(θ̂) = √ 4σ20θ 2 T . (2) Delta Method For a normal distribution, the variance of the maximum likelihood esti- mator θ̂ = y is σ20/T . 
Define C(θ) = θ 2 so that se(ψ̂) = √ D(θ0)2var(θ0) = √ (2θ)2σ20 T = √ 4σ20θ 2 T , 3.7 Hints for Practical Optimization 109 which agrees with the variance obtained using the substitution method. 3.7 Hints for Practical Optimization This section provides an eclectic collection of ideas that may be drawn on to help in many practical situations. 3.7.1 Concentrating the Likelihood For certain problems, the dimension of the parameter vector to be estimated may be reduced. Such a reduction is known as concentrating the likelihood function and it arises when the gradient can be rearranged to express an unknown parameter as a function of another unknown parameter. Consider a log-likelihood function that is a function of two unknown pa- rameter vectors θ = {θ1, θ2}, with dimensions dim(θ1) = K1 and dim(θ2) = K2, respectively. The first-order conditions to find the maximum likelihood estimators are ∂ lnLT (θ) ∂θ1 ∣∣∣∣ θ=θ̂ = 0 , ∂ lnLT (θ) ∂θ2 ∣∣∣∣ θ=θ̂ = 0 , which is a nonlinear system of K1 +K2 equations in K1 +K2 unknowns. If it is possible to write θ̂2 = g(θ̂1), (3.16) then the problem is reduced to aK1 dimensional problem. The log-likelihoodfunction is now maximized with respect to θ1 to yield θ̂1. Once the algorithm has converged, θ̂1 is substituted into (3.16) to yield θ̂2. The estimator of θ2 is a maximum likelihood estimator because of the invariance property of maximum likelihood estimators discussed in Chapter 2. Standard errors are obtained from evaluating the full log-likelihood function containing all parameters. An alternative way of reducing the dimension of the problem is to compute the profile log-likelihood function (see Exercise 8). Example 3.11 Weibull Distribution Let yt = {y1, y2, . . . , yT } be iid observations drawn from the Weibull dis- tribution given by f(y;α, β) = βαxβ−1 exp(−αyβ) . 110 Numerical Estimation Methods The log-likelihood function is lnLT (θ) = lnα+ ln β + (β − 1) 1 T T∑ t=1 ln yt − α 1 T T∑ t=1 yβt , and the unknown parameters are θ = {α, β}. The first-order conditions are 0 = 1 α̂ − 1 T T∑ t=1 yβ̂t 0 = 1 β̂ + 1 T T∑ t=1 ln yt − α̂ 1 T T∑ t=1 (ln yt)y β̂ t , which are two nonlinear equations in θ̂ = {α̂, β̂}. The first equation gives α̂ = T ∑T t=1 y β̂ t , which is used to substitute for α̂ in the equation for β̂. The maximum likeli- hood estimate for β̂ is then found using numerical methods with α̂ evaluated at the last step. 3.7.2 Parameter Constraints In some econometric applications, the values of the parameters need to be constrained to lie within certain intervals. Some examples are as follows: an estimate of variance is required to be positive (θ > 0); the marginal propensity to consume is constrained to be positive but less than unity (0 < θ < 1); for an MA(1) process to be invertible, the moving average parameter must lie within the unit interval (−1 < θ < 1); and the degrees of freedom parameter in the Student t distribution must be greater than 2, to ensure that the variance of the distribution exists. Consider the case of estimating a single parameter θ, where θ ∈ (a, b). The approach is to transform the parameter θ by means of a nonlinear bijective (one-to-one) mapping, φ = c(θ), between the constrained interval (a, b) and the real line. Thus each and every value of φ corresponds to a unique value of θ, satisfying the desired constraint, and is obtained by applying the inverse transform θ = c−1(φ). When the numerical algorithm returns φ̂ from the invariance property, the associated estimate of θ is given by θ̂ = c−1(φ̂). 
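As a simple illustration of this device, the following MATLAB sketch imposes the positivity constraint θ > 0 in the exponential example by searching over φ = ln θ; it assumes the sample {3.5, 1.0, 1.5} used earlier and relies on the built-in simplex routine fminsearch applied to the negative log-likelihood.

% Estimating the exponential parameter subject to theta > 0 by optimising
% over the unconstrained parameter phi = ln(theta) (sketch only)
y = [3.5; 1.0; 1.5];
negloglike = @(phi) length(y)*phi + sum(y)*exp(-phi);   % -sum of log-densities with theta = exp(phi)
phi_hat    = fminsearch(negloglike, 0.0);               % unconstrained search over phi
theta_hat  = exp(phi_hat);                              % invariance: theta_hat = c^(-1)(phi_hat)
disp(theta_hat)                                         % approx 2.0, the sample mean

The same approach applies to interval constraints such as (0, 1) or (-1, 1); only the transformation and its inverse change.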
Some useful one-dimensional transformations, their associated inverse functions and the gradients of the transformations are presented in Table 3.4. 3.7 Hints for Practical Optimization 111 Table 3.4 Some useful transformations for imposing constraints on θ. Constraint Transform Inverse Transform Jacobian φ = c(θ) θ = c−1(φ) dc(θ)/dθ (0,∞) φ = ln θ θ = eφ 1 θ (−∞, 0) φ = ln(−θ) θ = −eφ 1 θ (0, 1) φ = ln ( θ 1− θ ) θ = 1 1 + e−φ 1 θ(1− θ) (0, b) φ = ln ( θ b− θ ) θ = b 1 + e−φ b θ(b− θ) (a, b) φ = ln (θ − a b− θ ) θ = b+ ae−φ 1 + e−φ b− a (θ − a)(b− θ) (−1, 1) φ = atanh(θ) θ = tanh(φ) 1 1− θ2 (−1, 1) φ = θ 1− |θ| θ = φ 1 + |φ| 1 (1− |θ|)2 (−1, 1) φ = tan (πθ 2 ) θ = 2 π tan−1 φ π 2 sec2 (πθ 2 ) The convenience of using an unconstrained algorithm on what is essen- tially a constrained problem has a price: the standard errors of the model parameters cannot be obtained simply by taking the square roots of the di- agonal elements of the inverse Hessian matrix of the transformed problem. A straightforward way to compute standard errors is the method of substi- tution discussed in Section 3.6 where the objective function is expressed in terms of the original parameters, θ. The gradient vector and Hessian matrix can then be computed numerically at the maximum of the log-likelihood function using the estimated values of the parameters. Alternatively, the delta method can be used. 3.7.3 Choice of Algorithm In theory, there is little to choose between the algorithms discussed in this chapter, because in the vicinity of a minimum each should enjoy quadratic 112 Numerical Estimation Methods convergence, which means that ‖θ(k+1) − θ‖ < κ‖θ(k) − θ‖2 , κ > 0 . If θ(k) is accurate to 2 decimal places, then it is anticipated that θ(k+1) will be accurate to 4 decimal places and that θ(k+2) will be accurate to 8 decimal places and so on. In choosing an algorithm, however, there are a few practical considerations to bear in mind. (1) The Newton-Raphson and the method of scoring require the first two derivatives of the log-likelihood function. Because the information ma- trix is the expected value of the negative Hessian matrix, it is problem specific and typically is not easy to compute. Consequently, the method of scoring is largely of theoretical interest. (2) Close to the maximum, Newton-Raphson converges quadratically, but, further away from the maximum, the Hessian matrix may not be nega- tive definite and this may cause the algorithm to become unstable. (3) BHHH ensures that the outer product of the gradients matrix is positive semi-definite making it a popular choice of algorithm for econometric problems. (4) The current consensus seems to be that quasi-Newton algorithms are the preferred choice. The Hessian update of the BFGS algorithm is par- ticularly robust and is, therefore, the default choice in many practical settings. (5) A popular practical strategy is to use the simplex method to start the numerical optimization process. After a few iterations, the BFGS algo- rithm is employed to speed up convergence. 3.7.4 Numerical Derivatives For problems where deriving analytical derivatives is difficult, numerical derivatives can be used instead. A first-order numerical derivative is com- puted simply as ∂ lnLT (θ) ∂θ ∣∣∣∣ θ=θ(k) ≃ lnL(θ(k) + s)− lnL(θ(k)) s , where s is a suitably small step size. A second-order derivative is computed as ∂2 lnLT (θ) ∂θ2 ∣∣∣∣ θ=θ(k) ≃ lnL(θ(k) + s)− 2 lnL(θ(k)) + lnL(θ(k) − s) s2 . 
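The following MATLAB fragment illustrates these formulas for the exponential example, again assuming the sample {3.5, 1.0, 1.5}; the step size is fixed at s = 0.00001 purely for illustration, whereas good optimization routines choose it automatically.

% Numerical first and second derivatives of the average log-likelihood
% for the exponential example, and one Newton-Raphson step based on them
y     = [3.5; 1.0; 1.5];
lnL   = @(theta) -log(theta) - mean(y)/theta;   % average log-likelihood lnL_T
s     = 1e-5;                                   % step size (illustrative choice)
theta = 1.0;
G_num = (lnL(theta + s) - lnL(theta))/s;                       % first derivative
H_num = (lnL(theta + s) - 2*lnL(theta) + lnL(theta - s))/s^2;  % second derivative
theta_new = theta - G_num/H_num;                % approx 1.333, matching the first Newton-Raphson step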
3.7 Hints for Practical Optimization 113 In general, the numerical derivatives are accurate enough to enable the maxi- mum likelihood estimators to be computed with sufficient precision and most good optimization routines will automatically select an appropriate value for the step size, s. One computational/programming advantage of using numerical deriva- tives is that it is then necessary to program only the log-likelihood function. A cost of using numerical derivatives is computational time, since the algo- rithm is slower than if analytical derivatives are used, although the absolute time difference is nonetheless very small given current computer hardware. Gradient algorithms based on numerical derivatives can also be thought of as a form of algorithm based solely on function evaluation, which differs from the simplex algorithm only in the way in which this information is used to update the parameter estimate. 3.7.5 Starting Values All numerical algorithms require starting values, θ(0), for the parameter vec- tor. There are a number of strategies to choose starting values. (1) Arbitrary choice: This method only works well if the log-likelihood function is globally concave. As a word of caution, in some cases θ(0) = {0} is a bad choice of starting value because it can lead to multicollinear- ity problems causing the algorithm to break down. (2) Consistent estimator: This approach is only feasible if a consistent estimator of the parameter vector is available. An advantage of this approach is that one iteration of a Newton algorithm yields an asymp- totically efficient estimator (Harvey, 1990, pp 142 - 142). An example of a consistent estimator of the location parameter of the Cauchy distri- bution is given by the median (see Example 2.23 in Chapter 2). (3) Restricted model: A restricted model is specified in which closed-form expressions are available for the remaining parameters. (4) Historical precedent: Previous empirical work of a similar nature may provide guidance on the choiceof reasonable starting values. 3.7.6 Convergence Criteria A number of convergence criteria are employed in identifying when the max- imum likelihood estimates are reached. Given a convergence tolerance of ε, say equal to 0.00001, some of the more commonly adopted convergence cri- teria are as follows: 114 Numerical Estimation Methods (1) Objective function : lnL(θ(k))− lnL(θ(k−1)) < ε. (2) Gradient function : G(θ(k)) ′G(θ(k)) < ε . (3) Parameter values: (θ(k)) ′(θ(k)) < ε. (4) Updating function : G(θ(k))H(θ(k)) −1G(θ(k)) < ε. In specifying the termination rule, there is a tradeoff between the precision of the estimates, which requires a stringent convergence criterion, and the precision with which the objective function and gradients can be computed. Too slack a termination criterion is almost sure to produce convergence, but the maximum likelihood estimator is likely to be imprecisely estimated in these situations. 3.8 Applications In this section, two applications are presented which focus on estimating the continuous-time model of interest rates, rt, known as the CIR model (Cox, Ingersoll and Ross, 1985) by maximum likelihood. Estimation of continuous- time models using simulation based estimation are discussed in more detail in Chapter 12. The CIR model is one in which the interest rate evolves over time in steps of dt in accordance with dr = α(µ − r)dt+ σ √ r dB , (3.17) where dB ∼ N(0, dt) is the disturbance term over dt and θ = {α, µ, σ} are model parameters. 
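A feel for the dynamics implied by equation (3.17) can be obtained by simulating a discretized version of the process. The following MATLAB sketch uses an Euler approximation with illustrative parameter values chosen purely for this purpose; the same discretization reappears in equation (3.24) below as a device for generating starting values.

% Euler approximation of the CIR process in equation (3.17)
% (sketch only; alpha, mu and sigma are illustrative values, dt = 1/252)
alpha = 1.0;  mu = 0.08;  sigma = 0.2;
dt = 1/252;   T = 5000;
r  = zeros(T,1);   r(1) = mu;
for t = 2:T
    dB   = sqrt(dt)*randn;                      % dB ~ N(0,dt)
    drft = alpha*(mu - r(t-1))*dt;              % mean-reverting drift
    difn = sigma*sqrt(max(r(t-1),0))*dB;        % diffusion, guarded against negative rates
    r(t) = r(t-1) + drft + difn;
end
plot(r)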
This model requires the interest rate to revert to its mean, µ, at a speed given by α, with variance σ2r. As long as the condition 2αµ ≥ σ2 is satisfied, interest rates are never zero. As in Section 1.5 of Chaper 1, the data for these applications are the daily 7-day Eurodollar interest rates used by Aı̈t-Sahalia (1996) for the period 1 June 1973 to 25 February 1995, T = 5505 observations, except that now the data are expressed in raw units rather than percentages. The first application is based on the stationary (unconditional) distribution while the second focuses on the transitional (conditional) distribution. 3.8.1 Stationary Distribution of the CIR Model The stationary distribution of the interest rate, rt whose evolution is gov- erned by equation (3.17), is shown by Cox, Ingersoll and Ross (1985) to be a gamma distribution f(r; ν, ω) = ων Γ(ν) rν−1 e−ωr , (3.18) 3.8 Applications 115 where Γ(·) is the Gamma function with parameters ν and ω. The log- likelihood function is lnLT (ν, ω) = (ν − 1) 1 T T∑ t=1 ln(rt) + ν lnω − ln Γ(ν)− ω 1 T T∑ t=1 rt , (3.19) where θ = {ν, ω}. The relationship between the parameters of the stationary gamma distribution and the model parameters of the CIR equation (3.17) is ω = 2α σ2 , ν = 2αµ σ2 . (3.20) As there is no closed-form solution for the maximum likelihood estima- tor, θ̂, an iterative algorithm is needed. The maximum likelihood estimates obtained by using the BFGS algorithm are ω̂ = 67.634 (1.310) , ν̂ = 5.656 (0.105) , (3.21) with standard errors based on the inverse Hessian shown in parentheses. An estimate of the mean from equation (3.20) is µ̂ = ν̂ ω̂ = 5.656 67.634 = 0.084 , or 8.4% per annum. f (r ) r 0.05 0.10 0.15 0.20 Figure 3.1 Estimated stationary gamma distribution of Eurodollar interest rates from the 1 June 1973 to 25 February 1995. Figure 3.1 plots the gamma distribution in equation (3.18) evaluated at the maximum likelihood estimates ν̂ and ω̂ given in equation (3.21). The results cast some doubt on the appropriateness of the CIR model for these data, because the gamma density does not capture the bunching effect at 116 Numerical Estimation Methods very low interest rates and also underestimates the peak of the distribu- tion. The upper tail of the gamma distribution, however, does provide a reasonable fit to the observed Eurodollar interest rates. The three parameters of the CIR model cannot all be uniquely identified from the two parameters of the stationary distribution. This distribution can identify only the ratio α/σ2 and the parameter µ using equation (3.20). Identifying all three parameters of the CIR model requires using the transi- tional distribution of the process. 3.8.2 Transitional Distribution of the CIR Model To estimate the parameters of the CIR model in equation (3.17), the tran- sitional distribution must be used to construct the log-likelihood function. The transitional distribution of rt given rt−1 is f(rt | rt−1; θ) = ce−u−v (v u ) q 2 Iq(2 √ uv) , (3.22) where Iq(x) is the modified Bessel function of the first kind of order q (see, for example, Abramovitz and Stegun, 1965) and c = 2α σ2(1− e−α∆) , u = crt−1e −α∆ , v = crt , q = 2αµ σ2 − 1 , where the parameter ∆ is a time step defined to be 1/252 because the data are daily. Cox, Ingersoll and Ross (1985) show that the transformed variable 2crt is distributed as a non-central chi-square random variable with 2q + 2 degrees of freedom and non-centrality parameter 2u. In constructing the log-likelihood function there are two equivalent ap- proaches. 
The first is to construct the log-likelihood function for rt directly from (3.22). In this instance care must be exercised in the computation of the modified Bessel function, Iq(x), because it can be numerically unstable (Hurn, Jeisman and Lindsay, 2007). It is advisable to work with a scaled version of this function Isq (2 √ uv) = e−2 √ uvIq(2 √ uv) so that the log-likelihood function at observation t is ln lt(θ) = log c− u− v + q 2 log (v u ) + log(Isq (2 √ uv)) + 2 √ uv , (3.23) where θ = {α, µ, σ}. The second approach is to use the non-central chi- square distribution for the variable 2crt and then use the transformation of variable technique to obtain the density for rt. These methods are equivalent 3.8 Applications 117 and produce identical results. As with the stationary distribution of the CIR model, no closed-form solution for the maximum likelihood estimator, θ̂, exists and an iterative algorithm must be used. To obtain starting values, a discrete version of equation (3.17) rt − rt−1 = α(µ − rt−1)∆ + σ √ rt−1et , et ∼ N(0,∆) , (3.24) is used. Transforming equation (3.24) into rt − rt−1√ rt−1 = αµ∆ √ rt−1 − α√rt−1∆+ σet , allows estimates of αµ and α to be obtained by an ordinary least squares regression of (rt − rt−1)/ √ rt−1 on ∆/ √ rt−1 and √ rt−1∆. A starting value for σ is obtained as the standard deviation of the ordinary least squares residuals. r2 t rt−1 0.05 0.1 0.15 0.2 0.25 Figure 3.2 Scatter plot of r2t on rt−1 together with the model predicted value, σ̂2rt−1 (solid line). Maximum likelihood estimates, obtained using the BFGS algorithm, are α̂ = 1.267 (0.340) , µ̂ = 0.083 (0.009) , σ̂ = 0.191 (0.002) , (3.25) with standard errors based on the inverse Hessian shown in parentheses. The mean interest rate is 0.083, or 8.3% per annum, and the estimate of variance is 0.1922r. While the estimates of µ and σ appear to be plausible, the estimate of α appears to be somewhat higher than usually found in models of this kind. The solution to this conundrum is to be found in the specification of the variance in this model. Figure 3.2 shows a scatter plot of r2t on rt−1 and superimposes on it the predicted value in terms of the 118 Numerical Estimation Methods CIR model, σ̂2rt−1. It appears that the variance specification of the CIR model is not dynamic enough to capture the dramatic increases in r2t as rt−1 increases. This problem is explored further in Chapter 9 in the context of quasi-maximum likelihood estimation and in Chapter 12 dealing with estimation by simulation. 3.9 Exercises (1) Maximum Likelihood Estimation using Graphical Methods Gauss file(s) max_graph.g Matlab file(s) max_graph.m Consider the regression model yt = βxt + ut , ut ∼ iidN(0, σ 2) , where xt is an explanatory variable given by xt = {1, 2, 4, 5, 8}. (a) Simulate the model for T = 5 observations using the parametervalues θ = {β = 1, σ2 = 4}. (b) Compute the log-likelihood function, lnLT (θ), for: (i) β = {0.0, 0.1, · · · , 1.9, 2.0} and σ2 = 4; (ii) β = {0.0, 0.1, · · · , 1.9, 2.0} and σ2 = 3.5; (iii) plot lnLT (θ) against β for parts (i) and (ii). (c) Compute the log-likelihood function, lnLT (θ), for: (i) β = {1.0} and σ2 = {1.0, 1.5, · · · , 10.5, 11}; (ii) β = {0.9} and σ2 = {1.0, 1.5, · · · , 10.5, 11}; (iii) plot lnLT (θ) against σ 2 for parts (i) and (ii). (2) Maximum Likelihood Estimation using Grid Searching Gauss file(s) max_grid.g Matlab file(s) max_grid.m Consider the regression model set out in Exercise 1. 
(a) Simulate the model for T = 5 observations using the parameter values θ = {β = 1, σ2 = 4}. (b) Derive an expression for the gradient with respect to β, GT (β). (c) Choosing σ2 = 4 perform a grid search of β over GT (β) with β = {0.5, 0.6, · · · , 1.5} and thus find the maximum likelihood estimator of β conditional on σ2 = 4. 3.9 Exercises 119 (d) Repeat part (c) except set σ2 = 3.5. Find the maximum likelihood estimator of β conditional on σ2 = 3.5. (3) Maximum Likelihood Estimation using Newton-Raphson Gauss file(s) max_nr.g, max_iter.g Matlab file(s) max_nr.m, max_iter.m Consider the regression model set out in Example 1. (a) Simulate the model for T = 5 observations using the parameter values θ = {β = 1, σ2 = 4}. (b) Find the log-likelihood function, lnLT (θ), the gradient, GT (θ), and the Hessian, HT (θ). (c) Evaluate lnLT (θ), GT (θ) and HT (θ) at θ(0) = {1, 4}. (d) Update the value of the parameter vector using the Newton-Raphson update scheme θ(1) = θ(0) −H−1(0)G(0) , and recompute lnLT (θ) at θ(1). Compare this value with that ob- tained in part (c). (e) Continue the iterations in (d) until convergence and compare these values to those obtained from the maximum likelihood estimators β̂ = ∑T t=1 xtyt∑T t=1 x 2 t , σ̂2 = 1 T T∑ t=1 (yt − β̂xt)2 . (4) Exponential Distribution Gauss file(s) max_exp.g Matlab file(s) max_exp.m The aim of this exercise is to reproduce the convergence properties of the different algorithms in Table 3.1. Suppose that the following obser- vations {3.5, 1.0, 1.5} are taken from the exponential distribution f(y; θ) = 1 θ exp [ −y θ ] , θ > 0 . (a) Derive the log-likelihood function lnLT (θ) and also analytical ex- pressions for the gradient, GT (θ), the Hessian, HT (θ), and the outer product of gradients matrix, JT (θ). (b) Using θ(0) = 1 as the starting value, compute the first seven itera- tions of the Newton-Raphson, scoring and BHHH algorithms. 120 Numerical Estimation Methods (c) Redo (b) with GT (θ) and HT (θ) computed using numerical deriva- tives. (d) Estimate var(θ̂) based on HT (θ), JT (θ) and I(θ). (5) Cauchy Distribution Gauss file(s) max_cauchy.g Matlab file(s) max_cauchy.m An iid random sample of size T = 5, yt = {2, 5,−2, 3, 3}, is drawn from a Cauchy distribution f(y; θ) = 1 π 1 1 + (y − θ)2 . (a) Write the log-likelihood function at the tth observation as well as the log-likelihood function for the sample. (b) Choosing the median, m, as a starting value for the parameter θ, update the value of θ with one iteration of the Newton-Raphson, scoring and BHHH algorithms. (c) Show that the maximum likelihood estimator converges to θ̂ = 2.841 by computing GT (θ̂). Also show that lnLT (θ̂) > lnLT (m). (d) Compute an estimate of the standard error of θ̂ based on HT (θ), JT (θ) and I(θ). (6) Weibull Distribution Gauss file(s) max_weibull.g Matlab file(s) max_weibull.m (a) Simulate T = 20 observations with θ = {α = 1, β = 2} from the Weibull distribution f (y; θ) = αβyβ−1 exp [ −αyβ ] . (b) Derive lnLT (θ), GT (θ), HT (θ), JT (θ) and I(θ). (c) Choose as starting values θ(0) = {α(0) = 0.5, β(0) = 1.5} and evalu- ate G(θ(0)), H(θ(0)) and J(θ(0)) for the data generated in part (a). Check the analytical results using numerical derivatives. (d) Compute the update θ(1) using the Newton-Raphson and BHHH algorithms. (e) Continue the iterations in part (d) until convergence. Discuss the numerical performances of the two algorithms. 3.9 Exercises 121 (f) Compute the covariance matrix, Ω̂, using the Hessian and also the outer product of the gradients matrix. 
(g) Repeat parts (d) and (e) where the log-likelihood function is con- centrated with respect to β̂. Compare the parameter estimates of α and β with the estimates obtained using the full log-likelihood function. (h) Suppose that the Weibull distribution is re-expressed as f(y; θ) = β λ (y λ )β−1 exp [ − (y λ )β] , where λ = α−1/β . Compute λ̂ and se(λ̂) for T = 20 observations by the substitution method and also by the delta method using the maximum likelihood estimates obtained previously. (7) Simplex Algorithm Gauss file(s) max_simplex.g Matlab file(s) max_simplex.m Suppose that the observations yt = {3.5, 1.0, 1.5} are iid drawings from the exponential distribution f(y; θ) = 1 θ exp [ −y θ ] , θ > 0 . (a) Based on the negative of the log-likelihood function for this expo- nential distribution, compute the maximum likelihood estimator, θ̂, using the starting vertices θ1 = 1 and θ2 = 3. (b) Which move would the first iteration of the simplex algorithm choose? (8) Profile Log-likelihood Function Gauss file(s) max_profile.g, apple.csv, ford.csv Matlab file(s) max_profile.m, diversify.mat The data files contain daily share prices of Apple and Ford from 2 Jan- uary 2001 to 6 August 2010, a total of T = 2413 observations (see also Section 2.7.1 and Exercise 14 in Chapter 2). Let θ = {θ1, θ2} where θ1 contains the parameters of interest. The profile log-likelihood function is defined as lnLT (θ1, θ̂2) = argmax θ2 lnLT (θ) , 122 Numerical Estimation Methods where θ̂2 is the maximum likelihood solution of θ2. A plot of lnLT (θ1, θ̂2) over θ1 provides information on θ1. Assume that the returns on the two assets are iid drawings from a bivariate normal distribution with means µ1 and µ2, variances σ 2 1 and σ22 , and correlation ρ. Define θ1 = {ρ} and θ2 = { µ1, µ2, σ 2 1 , σ 2 2 } . (a) Plot lnLT (θ1, θ̂2) over (−1, 1), where θ̂2 is the maximum likelihood estimate obtained from the returns data. (b) Interpret the plot obtained in part (a). (9) Stationary Distribution of the CIR Model Gauss file(s) max_stationary.g, eurodollar.dat Matlab file(s) max_stationary.m, eurodollar.mat The data are daily 7-day Eurodollar rates from 1 June 1973 to 25 Febru- ary 1995, a total of T = 5505 observations. The CIR model of interest rates, rt, for time steps dt is dr = α(µ− r)dt+ σ √ r dW , where dW ∼ N(0, dt). The stationary distribution of the CIR interest rate is the gamma distribution f(r; ν, ω) = ων Γ(ν) rν−1 e−ωr , where Γ(·) is the Gamma function and θ = {ν, ω} are unknown param- eters. (a) Compute the maximum likelihood estimates of ν and ω and their standard errors based on the Hessian. (b) Use the results in part (a) to compute the maximum likelihood estimate of µ and its standard error. (c) Use the estimates from part (a) to plot the stationary distribution and interpret its properties. (d) Suppose that it is known that ν = 1. Using the property of the gamma function that Γ(1) = 1, estimate ω and recompute the mean interest rate. (10) Transitional Distribution of the CIR Model Gauss file(s) max_transitional.g, eurodollar.dat Matlab file(s) max_transitional.m, eurodollar.mat The data are the same daily 7-day Eurodollar rates used in Exercise 9. 3.9 Exercises 123 (a) The transitional distribution of rt given rt−1 for the CIR model in Exercise 9 is f(rt | rt−1; θ) = ce−u−v (v u ) q 2 Iq(2 √ uv) , where Iq(x) is the modified Bessel function of the first kind of order q, ∆ = 1/250 is the time step and c = 2α σ2(1− e−α∆) , u = crt−1e −α∆ , v = crt , q = 2αµ σ2 − 1 . 
Estimate the CIR model parameters, θ = {α, µ, σ}, by maximum likelihood. Compute the standard errors based on the Hessian. (b) Use the result that the transformed variable 2crt is distributedas a non-central chi-square random variable with 2q + 2 degrees of freedom and non-centrality parameter 2u to obtain the maximum likelihood estimates of θ based on the non-central chi-square prob- ability density function. Compute the standard errors based on the Hessian. Compare the results with those obtained in part (a). 4 Hypothesis Testing 4.1 Introduction The discussion of maximum likelihood estimation has focussed on deriving estimators that maximize the likelihood function. In all of these cases, the potential values that the maximum likelihood estimator, θ̂, can take are unrestricted. Now the discussion is extended to asking if the population pa- rameter has a certain hypothesized value, θ0. If this value differs from θ̂, then by definition, it must correspond to a lower value of the log-likelihood function and the crucial question is then how significant this decrease is. Determining the significance of this reduction of the log-likelihood function represents the basis of hypothesis testing. That is, hypothesis testing is con- cerned about determining if the reduction in the value of the log-likelihood function brought about by imposing the restriction θ = θ0 is severe enough to warrant rejecting it. If, however, it is concluded that the decrease in the log-likelihood function is not too severe, the restriction is interpreted as be- ing consistent with the data and it is not rejected. The likelihood ratio test (LR), the Wald test and the Lagrange multiplier test (LM) are three gen- eral procedures used in developing statistics to test hypotheses. These tests encompass many of the test statistics used in econometrics, an important feature highlighted in Part TWO of the book. They also offer the advantage of providing a general framework to develop new classes of test statistics that are designed for specific models. 4.2 Overview Suppose θ is a single parameter and consider the hypotheses H0 : θ = θ0, H1 : θ 6= θ0. 4.2 Overview 125 A natural test based on a comparison of the log-likelihood function evaluated at the maximum likelihood estimator θ̂ and at the null value θ0. at both the unrestricted and restricted estimators. A statistic of the form lnLT (θ̂)− lnLT (θ0) = 1 T T∑ t=1 ln f(yt; θ̂)− 1 T T∑ t=1 ln f(yt; θ0) , measures the distance between the maximized log-likelihood lnLT (θ̂) and the log-likelihood lnLT (θ0) restricted by the null hypothesis. This distance is measured on the vertical axis of Figure 4.1 and the test which uses this measure in its construction is known as the likelihood ratio (LR) test. θ̂0 θ̂1 T lnLT (θ̂0) T lnLT (θ̂1) Figure 4.1 Comparison of the value of the log-likelihood function under the null hypothesis, θ̂0, and under the alternative hypothesis, θ̂1. The distance (θ̂−θ0), illustrated on the horizontal axis of Figure 4.1, is an alternative measure of the difference between θ̂ and θ0. A test based on this measure is known as a Wald test. The Lagrange multiplier (LM) test is the hypothesis test based on the gradient of the log-likelihood function at the null value θ0, GT (θ0). The gradient at the maximum likelihood estimator, GT (θ̂), is zero by definition (see Chapter 1). The LM statistic is therefore as the distance on the vertical axis in Figure 4.2 between GT (θ0) and GT (θ̂) = 0. 
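Before developing the statistics formally, it is useful to see these three ingredients side by side. The following MATLAB sketch evaluates them for the exponential example of Chapter 3 under the hypothesized value θ0 = 1, again assuming the sample {3.5, 1.0, 1.5}.

% The three quantities underlying the LR, Wald and LM tests for the
% exponential example under H0: theta = 1 (illustrative sketch only)
y      = [3.5; 1.0; 1.5];
lnL    = @(theta) mean(-log(theta) - y/theta);   % average log-likelihood lnL_T
theta1 = mean(y);                                % unrestricted maximum likelihood estimate
theta0 = 1.0;                                    % value under the null hypothesis
vert   = lnL(theta1) - lnL(theta0);              % vertical distance: basis of the LR test
horiz  = theta1 - theta0;                        % horizontal distance: basis of the Wald test
grad0  = -1/theta0 + mean(y)/theta0^2;           % gradient G_T(theta0): basis of the LM test

Each of the three tests converts one of these distances into a statistic by scaling it with an appropriate measure of curvature or variance.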
The intuition behind the construction of these tests for a single parameter can be carried over to provide likelihood-based testing of general hypotheses, which are discussed next. 126 Hypothesis Testing θ̂0 θ̂1 GT (θ̂1) = 0 GT (θ̂0) Figure 4.2 Comparison of the value of the gradient of the log-likelihood function under the null hypothesis, θ̂0, and under the alternative hypothe- sis, θ̂1. 4.3 Types of Hypotheses This section presents detailed examples of types of hypotheses encountered in econometrics, beginning with simple and composite hypotheses and pro- gressing to linear and nonlinear hypotheses. 4.3.1 Simple and Composite Hypotheses Consider a model based on the distribution f(y; θ) where θ is an unknown scalar parameter. The simplest form of hypothesis test is based on testing whether or not a parameter takes one of two specific values, θ0 or θ1. The null and alternative hypotheses are, respectively, H0 : θ = θ0 , H1 : θ = θ1 , where θ0 represents the value of the parameter under the null hypothesis and θ1 is the value under the alternative. In Chapter 2, θ0 represents the true parameter value. In hypothesis testing, since the null and alternative hypotheses are distinct, θ0 still represents the true value, but now inter- preted to be under the null hypothesis. Both these hypotheses are simple hypotheses because the parameter value in each case is given and there- fore the distribution of the parameter under both the null and alternative hypothesis is fully specified. If the hypothesis is constructed in such a way that the distribution of the parameter cannot be inferred fully, the hypothesis is referred to as being 4.3 Types of Hypotheses 127 composite. An example is H0 : θ = θ0 , H1 : θ 6= θ0 , where the alternative hypothesis is a composite hypothesis because the dis- tribution of the θ under the alternative is not fully specified, whereas the null hypothesis is still a simple hypothesis. Under the alternative hypothesis, the parameter θ can take any value on either side of θ0. This form of hypothesis test is referred to as a two-sided test. Restricting the range under the alternative to be just one side, θ > θ0 or θ < θ0, would change the test to a one-sided test. The alternative hypothesis would still be a composite hypothesis. 4.3.2 Linear Hypotheses Suppose that there are K unknown parameters, θ = {β1, β2, · · · , βK}, so θ is a (K×1) vector, andM linear hypotheses are to be tested simultaneously. The full set of M hypotheses is expressed as H0 : Rθ = Q , H1 : Rθ 6= Q , where R and Q are (M×K) and (M×1) matrices, respectively. To highlight the form of R and Q, consider the following cases. (1) K = 1, M = 1, θ = {β1}: The null and alternative hypotheses are H0 : β1 = 0 H1 : β1 6= 0 , with R = [ 1 ], Q = [ 0 ] . (2) K = 2, M = 1, θ = {β1, β2}: The null and alternative hypotheses are H0 : β2 = 0 , H1 : β2 6= 0 , with R = [ 0 1 ], Q = [ 0 ] . This corresponds to the usual example of performing a t-test on the importance of an explanatory variable by testing to see if the pertinent parameter is zero. 128 Hypothesis Testing (3) K = 3, M = 1, θ = {β1, β2, β3}: The null and alternative hypotheses are H0 : β1 + β2 + β3 = 0 , H1 : β1 + β2 + β3 6= 0 , with R = [ 1 1 1 ], Q = [0] . (4) K = 4, M = 3, θ = {β1, β2, β3, β4}: The null and alternative hypotheses are H0 : β1 = β2, β2 = β3, β3 = β4 H1 : at least one restriction does not hold , with R = 1 −1 0 0 0 1 −1 0 0 0 1 −1 , Q = 0 0 0 . These restrictions arise in models of the term structure of interest rates. 
(5) K = 4, M = 3, θ = {β1, β2, β3, β4}: The hypotheses are H0 : β1 = β2, β3 = β4, β1 = 1 + β3 − β4 H1 : at least one restriction does not hold , with R = 1 −1 0 0 0 0 1 −1 1 0 −1 1 , Q = 0 0 1 . 4.3.3 Nonlinear Hypotheses The set of hypotheses entertained is now further extended to allow for non- linearities. The full set of M nonlinear hypotheses is expressed as H0 : C(θ) = Q , H1 : C(θ) 6= Q , where C(θ) is a (M × 1) matrix of nonlinear restrictions and Q is a (M × 1) matrix of constants. In the special case where the hypotheses are linear, C(θ) = Rθ. To highlight the construction of these matrices, consider the following cases. 4.4 Likelihood Ratio Test 129 (1) K = 2, M = 1, θ = {β1, β2}: The null and alternative hypotheses are H0 : β1β2 = 1 , H1 : β1β2 6= 1 , with C(θ) = [ β1β2 ] , Q = [ 1 ] . (2) K = 2, M = 1, θ = {β1, β2}: The null and alternative hypotheses are H0 : β1 1− β2 = 1 , H1 : β1 1− β2 6= 1 , with C(θ) = [ β1 1− β2 ] , Q = [ 1 ] . This form of restriction often arises in dynamic time seriesmodels where restrictions on the value of the long-run multiplier are often imposed. (3) K = 3, M = 2, θ = {β1, β2, β3}: The null and alternative hypotheses are H0 : β1β2 = β3, β1 1− β2 = 1 H1 : at least one restriction does not hold , and C(θ) = [ β1β2 − β3 β1(1− β2)−1 ] , Q = [ 0 1 ] . 4.4 Likelihood Ratio Test The LR test requires estimating the model under both the null and alterna- tive hypotheses. The resulting estimators are denoted θ̂0 = restricted maximum likelihood estimator, θ̂1 = unrestricted maximum likelihood estimator. The unrestricted estimator θ̂1 is the usual maximum likelihood estimator. The restricted estimator θ̂0 is obtained by first imposing the null hypothesis on the model and then estimating any remaining unknown parameters. If the null hypothesis completely specifies the parameter, that is H0 : θ = θ0, then the restricted estimator is simply θ̂0 = θ0. In most cases, however, a null hypothesis will specify only some of the parameters of the model, leaving 130 Hypothesis Testing the remaining parameters to be estimated in order to find θ̂0. Examples are given below. Let T lnLT (θ̂0) = T∑ t=1 ln f(yt; θ̂0) , T lnLT (θ̂1) = T∑ t=1 ln f(yt; θ̂1) , be the maximized log-likelihood functions under the null and alternative hypotheses respectively. The general form of the LR statistic is LR = −2 ( T lnLT (θ̂0)− T lnLT (θ̂1) ) . (4.1) As the maximum likelihood estimator maximizes the log-likelihood function, the term in brackets is non-positive as the restrictions under the null hy- pothesis in general correspond to a region of lower probability. This loss of probability is illustrated on the vertical axis of Figure 4.1 which gives the term in brackets. The range of LR is 0 ≤ LR <∞. For values of the statistic near LR = 0, the restrictions under the null hypothesis are consistent with the data since there is no serious loss of information from imposing these re- strictions. For larger values of LR the restrictions under the null hypothesis are not consistent with the data since a serious loss of information caused by imposing these restrictions now results. In the former case, there is a failure to reject the null, whereas in the latter case the null is rejected in favour of the alternative hypothesis. It is shown in Section 4.7, that LR in equation (4.1) is asymptotically distributed as χ2M under the null hypothesis where M is the number of restrictions. 
Example 4.1 Univariate Normal Distribution The log-likelihood function of a normal distribution with unknown mean and variance, θ = {µ, σ2}, is lnLT (θ) = − 1 2 ln 2π − 1 2 lnσ2 − 1 2σ2T T∑ t=1 (yt − µ) 2 . A test of the mean is based on the null and alternative hypotheses H0 : µ = µ0 , H1 : µ 6= µ0 . The unrestricted maximum likelihood estimators are µ̂1 = 1 T T∑ t=1 yt = y , σ̂ 2 1 = 1 T T∑ t=1 (yt − y)2 , 4.4 Likelihood Ratio Test 131 and the log-likelihood function evaluated at θ̂1 = {µ̂1, σ̂21} is lnLT (θ̂1) = − 1 2 ln 2π− 1 2 ln σ̂21− 1 2σ̂21T T∑ t=1 (yt−µ̂1)2 = − 1 2 ln 2π− 1 2 ln σ̂21− 1 2 . The restricted maximum likelihood estimators are µ̂0 = µ0 , σ̂ 2 0 = 1 T T∑ t=1 (yt − µ0)2 , and the log-likelihood function evaluated at θ̂0 = {µ̂0, σ̂20} is lnLT (θ̂0) = − 1 2 ln 2π− 1 2 ln σ̂20− 1 2σ̂20T T∑ t=1 (yt−µ̂0)2 = − 1 2 ln 2π− 1 2 ln σ̂20− 1 2 . Using equation (4.1), the LR statistic is LR = −2 ( T lnLT (θ̂0)− T lnLT (θ̂1) ) = −2 [( − T 2 ln 2π − T 2 ln σ̂20 − T 2 ) − ( − T 2 ln 2π − T 2 ln σ̂21 − T 2 )] = T ln ( σ̂20 σ̂21 ) . Under the null hypothesis, the LR statistic is distributed as χ21. This ex- pression shows that the LR test is equivalent to comparing the variances of the data under the null and alternative hypotheses. If σ̂20 is close to σ̂ 2 1, the restriction is consistent with the data, resulting in a small value of LR. In the extreme case where no loss of information from imposing the restrictions occurs, σ̂20 = σ̂ 2 1 and LR = 0. For values of σ̂ 2 0 that are not statistically close to σ̂21 , LR is a large positive value. Example 4.2 Multivariate Normal Distribution The multivariate normal distribution of dimension N at time t is f(yt; θ) = ( 1 2π )N/2 |V |−1/2 exp [ −1 2 u′tV −1ut ] , where yt = {y1,t, y2,t, · · · , yN,t} is a (N × 1) vector of dependent variables at time t, ut = yt − βxt is a (N × 1) vector of disturbances with covariance matrix V , and xt is a (K × 1) vector of explanatory variables and β is a (N ×K) parameter matrix . The log-likelihood function is lnLT (θ) = 1 T T∑ t=1 ln f(yt; θ) = − N 2 ln 2π − 1 2 ln |V | − 1 2T T∑ t=1 u′tV −1ut , 132 Hypothesis Testing where θ = {β, V }. Consider testing M restrictions on β. The unrestricted maximum likelihood estimator of V is V̂1 = 1 T T∑ t=1 ete ′ t , where et = yt − β̂1xt and β̂1 is the unrestricted estimator of β. The log- likelihood function evaluated at the unrestricted estimator is lnLT (θ̂1) = − N 2 ln 2π − 1 2 ln ∣∣∣V̂1 ∣∣∣− 1 2T T∑ t=1 e′tV̂ −1 1 et = −N 2 ln 2π − 1 2 ln ∣∣∣V̂1 ∣∣∣− N 2 = −N 2 (1 + ln 2π) − 1 2 ln ∣∣∣V̂1 ∣∣∣ , which uses the result T∑ t=1 e′tV̂ −1 1 et = trace ( T∑ t=1 e′tV̂ −1 1 et ) = trace ( V̂ −11 T∑ t=1 ete ′ t ) = trace(V̂ −11 T V̂1) = trace(TIN ) = TN. Now consider estimating the model subject to a set of restrictions on β. The restricted maximum likelihood estimator of V is V̂0 = 1 T T∑ t=1 vtv ′ t , where vt = yt−β̂0xt and β̂0 is the restricted estimator of β. The log-likelihood function evaluated at the restricted estimator is lnLT (θ̂0) = − N 2 ln 2π − 1 2 ln |V̂0| − 1 2T T∑ t=1 v′tV̂ −1 0 vt = −N 2 ln 2π − 1 2 ln |V̂0| − N 2 = −N 2 (1 + ln 2π)− 1 2 ln |V̂0| . The LR statistic is LR = −2[T lnLT (θ̂0)− T lnLT (θ̂1)] = T ln ( |V̂0| |V̂1| ) , which is distributed asymptotically under the null hypothesis as χ2M . This is the multivariate analogue of Example 4.1 that is commonly adopted when 4.5 Wald Test 133 testing hypotheses within multivariate normal models. 
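In practice the statistic of Example 4.2 is computed directly from the residuals of the restricted and unrestricted systems. The following MATLAB sketch assumes that the (T × N) residual matrices E1 and E0 and the number of restrictions M are already available; all three names are placeholders rather than the output of any particular program.

% Likelihood ratio statistic for a system estimated under multivariate normality
% (sketch; E1 and E0 are T x N matrices of unrestricted and restricted residuals)
T    = size(E1,1);
V1   = (E1'*E1)/T;                  % unrestricted residual covariance matrix
V0   = (E0'*E0)/T;                  % restricted residual covariance matrix
LR   = T*(log(det(V0)) - log(det(V1)));
pval = 1 - chi2cdf(LR, M);          % asymptotic chi-squared distribution with M degrees of freedom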
It should be stressed that this form of the likelihood ratio test is appropriate only for models based on the assumption of normality. Example 4.3 Weibull Distribution Consider the T = 20 independent realizations, given in Example 3.6 in Chapter 3, drawn from the Weibull distribution f(y; θ) = αβyβ−1 exp [ −αyβ ] , with unknown parameters θ = {α, β}. A special case of the Weibull distribu- tion is the exponential distribution that occurs when β = 1. To test that the data are drawn from the exponential distribution, the null and alternative hypotheses are, respectively, H0 : β = 1 , H1 : β 6= 1 . The unrestricted and restricted log-likelihood functions are lnLT (θ̂1) = −β̂1 ln α̂1 + ln β̂1 + (β̂1 − 1) 1 T T∑ t=1 ln yt − 1 T T∑ t=1 ( yt α̂1 )β̂1 lnLT (θ̂0) = − ln α̂0 − 1 T T∑ t=1 yt α̂0 , respectively. Maximizing the two log-likelihood functions yields Unrestricted : α̂1 = 0.856 β̂1 = 1.868 T lnLT (θ̂1) = −15.333 , Restricted : α̂0 = 1.020 β̂0 = 1.000 T lnLT (θ̂0) = −19.611 . The likelihood ratio statistic is computed using equation (4.1) LR = −2(T lnLT (θ̂0)− T lnLT (θ̂1)) = −2(−19.611 + 15.333) = 8.555 . Using the χ21 distribution, the p-value is 0.003 resulting in a rejection of the null hypothesis at the 5% significance level that the data are drawn from an exponential distribution. 4.5 Wald Test The LR test requires estimating both the restricted and unrestricted models, whereas the Wald test requires estimation of just the unrestricted model. This property of the Wald test can be very important from a practical point of view, especially in those cases where estimating the model under the null hypothesis is more difficult than under the alternative hypothesis. 134 Hypothesis Testing The Wald test statistic for the null hypothesis H0 : θ = θ0, a hypothesis which completely specifies the parameter, is W = (θ̂1 − θ0)′[cov(θ̂1 − θ0)]−1(θ̂1 − θ0) , which is distributed asymptotically as χ21, whereM = 1 is the number of restrictions under the null hypothesis. The variance of θ̂1 is given by cov(θ̂1 − θ0) = cov(θ̂1) = 1 T I−1(θ0) . This expression is then evaluated at θ = θ̂, so that the Wald test is W = T (θ̂1 − θ0)′I(θ̂1)(θ̂1 − θ0) . (4.2) The aim of the Wald test is to compare the unrestricted value (θ̂1) with the value under the null hypothesis (θ0). If the two values are considered to be close, then W is small. To determine the significance of this difference, the deviation (θ̂1 − θ0) is scaled by the pertinent standard deviation. 4.5.1 Linear Hypotheses For M linear hypotheses of the form Rθ = Q, the Wald statistic is W = [R θ̂1 −Q]′[cov(Rθ̂1 −Q)]−1[R θ̂1 −Q] . The covariance matrix is cov(R θ̂1 −Q) = cov(R θ̂1) = R 1 T Ω̂R′ (4.3) where Ω̂/T is the covariance matrix of θ̂1. The general form of the Wald test of linear restrictions is therefore W = T [R θ̂1 −Q]′[R Ω̂R′]−1[R θ̂1 −Q] . (4.4) Under the null hypothesis, the Wald statistic is asymptotically distributed as χ2M where M is the number of restrictions. In practice, the Wald statistic is usually expressed in terms of the relevant method used to compute the covariance matrix Ω̂/T . Given that the maxi- mum likelihood estimator, θ̂1, satisfies the Information equality in equation (2.33) of Chapter 2, it follows that R 1 T Ω̂R′ = R 1 T I−1(θ̂1)R ′ , where I(θ̂1) is the information matrix evaluated at θ̂1. 
The information 4.5 Wald Test 135 equality means that the Wald statistic may be written in the following asymptotically equivalent forms WI = T [Rθ̂1 −Q]′[R I−1(θ̂1) R′]−1[Rθ̂1 −Q] , (4.5) WH = T [Rθ̂1 −Q]′[R (−H−1T (θ̂1)) R′]−1[Rθ̂1 −Q] , (4.6) WJ = T [Rθ̂1 −Q]′[R J−1T (θ̂1) R′]−1[Rθ̂1 −Q] . (4.7) All these test statistics have the same asymptotic distribution. Example 4.4 Normal Distribution Consider the normal distribution example again where the null and alter- native hypotheses are, respectively, H0 : µ = µ0 H1 : µ 6= µ0 , with R = [ 1 0 ] and Q = [µ0 ]. The unrestricted maximum likelihood esti- mators are θ̂1 = [ µ̂1 σ̂ 2 1 ]′ = [ y 1 T T∑ t=1 (yt − y)2 ]′ . When evaluated at θ̂1 the information matrix is I(θ̂1) = 1 σ̂21 0 0 1 2σ̂41 . Now [R θ̂1 −Q ] = [ y − µ0 ] so that [RI−1(θ̂1)R ′ ] = 1 0 ′ 1 σ̂21 0 0 1 2σ̂41 −1 1 0 = σ̂21 . The Wald statistic in equation (4.5) then becomes W = T (y − µ0)2 σ̂21 , (4.8) which is distributed asymptotically as χ21. This form of the Wald statistic is equivalent to the square of the standard t-test applied to the mean of a normal distribution. Example 4.5 Weibull Distribution Recompute the test of the Weibull distribution in Example 4.3 using a Wald test of the restriction β = 1 with the covariance matrix computed 136 Hypothesis Testing using the Hessian. The unrestricted maximum likelihood estimates are θ̂1 = {α̂1 = 0.865, β̂1 = 1.868} and the Hessian evaluated at θ̂1 using numerical derivatives is HT (θ̂1) = 1 20 [ −27.266 −6.136 −6.136 −9.573 ] = [ −1.363 −0.307 −0.307 −0.479 ] . Define R = [ 0 1 ] and Q = [ 1 ] so that R (−H−1T (θ̂1))R′ = [ 0 1 ]′ [ −1.363 −0.307 −0.307 −0.479 ]−1 [ 0 1 ] = [ 2.441 ] . The Wald statistic, given in equation (4.6), is W = 20(1.868 − 1)(2.441)−1(1.868 − 1) = 20(1.868 − 1.000) 2 2.441 = 6.174 . Using the χ21 distribution, the p-value of the Wald statistic is 0.013, resulting in the rejection of the null hypothesis at the 5% significance level that the data come from an exponential distribution. 4.5.2 Nonlinear Hypotheses For M nonlinear hypotheses of the form H0 : C(θ) = Q , H1 : C(θ) 6= Q , the Wald statistic is W = [C(θ̂1)−Q]′cov(C(θ̂1)−1[C(θ̂1)−Q] . (4.9) To compute the covariance matrix, cov(C(θ̂1)) the delta method discussed in Chapter 3 is used. There it is shown that cov(C(θ̂1) = 1 T D(θ)Ω(θ)D(θ)′ , where D(θ) = ∂C(θ) ∂θ′ . This expression for the covariance matrix depends on θ, which is estimated by the unrestricted maximum likelihood estimator θ̂1. The general form of the Wald statistic in the case of nonlinear restrictions is then W = T [C(θ̂1)−Q]′[D(θ̂1) Ω̂D(θ̂1)′]−1[C(θ̂1)−Q] , 4.6 Lagrange Multiplier Test 137 which takes the asymptotically equivalent forms W = T [C(θ̂1)−Q]′[D(θ̂1) I−1(θ̂1)D(θ̂1)′]−1[C(θ̂1)−Q] (4.10) W = T [C(θ̂1)−Q]′[D(θ̂1) (−H−1T (θ̂1))D(θ̂1)′]−1[C(θ̂1)−Q] (4.11) W = T [C(θ̂1)−Q]′[D(θ̂1) J−1T (θ̂1)D(θ̂1)′]−1[C(θ̂1)−Q] . (4.12) Under the null hypothesis, the Wald statistic is asymptotically distributed as χ2M where M is the number of restrictions. If the restrictions are linear, that is C(θ) = Rθ, then ∂C(θ) ∂θ′ = R , and equations (4.10), (4.11) and (4.12) reduce to the forms given in equations (4.5), (4.6) and (4.7), respectively. 4.6 Lagrange Multiplier Test The LM test is based on the property that the gradient, evaluated at the unrestricted maximum likelihood estimator, satisfies GT (θ̂1) = 0. Assum- ing that the log-likelihood function has a unique maximum, evaluating the gradient under the null means that GT (θ̂0) 6= 0. 
This suggests that if the null hypothesis is inconsistent with the data, the value of GT (θ̂0) represents a significant deviation from the unrestricted value of the gradient vector, GT (θ̂1) = 0. The basis of the LM test statistic derives from the properties of the gra- dient discussed in Chapter 2. The key result is √ T ( GT (θ̂0)− 0 ) d→ N(0, I(θ0)) . (4.13) This result suggests that a natural test statistic is to compute the squared difference between the sample quantity under the null hypothesis, GT (θ̂0), and the theoretical value under the alternative, GT (θ̂1) = 0 and scale the result by the variance, I(θ0)/T . The test statistic is therefore LM = T [G′T (θ̂0)−0]′I−1(θ̂0)[G′T (θ̂0)−0] = TG′T (θ̂0)I−1(θ̂0)GT (θ̂0) . (4.14) It follows immediately from expression (4.13) that this statistic is distributed asymptotically as χ2M where M is the number of restrictions under the null hypothesis. This general form of the LM test is similar to that of the Wald test, where the test statistic is compared to a population value under the null hypothesis and standardized by the appropriate variance. Example 4.6 Normal Distribution 138 Hypothesis Testing Consider again the normal distribution in Example 4.1 where the null and alternative hypotheses are, respectively, H0 : µ = µ0 , H1 : µ 6= µ0 . The restricted maximum likelihood estimators are θ̂0 = [ µ̂0 σ̂ 2 0 ]′ = [ µ0 1 T T∑ t=1 (yt − µ0)2 ]′ . The gradient and information matrix evaluated at θ̂0 are, respectively, GT (θ̂0) = 1 σ̂20T T∑ t=1 (yt − µ0) − 1 2σ̂20 + 1 2σ̂40T T∑ t=1 (yt − µ0)2 = 1 σ̂20 (y − µ0) 0 , and I(θ̂0) = 1 σ̂20 0 0 1 2σ̂40 . From equation (4.14 ), the LM statistic is LM = T 1 σ̂20 (y − µ0) 0 ′ 1 σ̂20 0 0 1 2σ̂40 −1 1 σ̂20 (y − µ0) 0 = T (y − µ0) 2 σ̂20 , which is distributed asymptotically as χ21. This statistic is of a similar form to the Wald statistic in Example 4.4, except that now the variance in the denominator is based on the restricted estimator, σ̂20 , whereas in the Wald statistic it is based on the unrestricted estimator, σ̂21 . As in the computation of the Wald statistic, the information matrix equal- ity, in equation (2.33) of Chapter 2, may be used to replace the information matrix, I(θ), with the asymptotically equivalent negative Hessian matrix, −HT (θ), or the outer product of gradients matrix, JT (θ). The asymptoti- cally equivalent versions of the LM statistic are therefore LMI = TG ′ T (θ̂0)I −1(θ̂0)GT (θ̂0) , (4.15) LMH = TG ′ T (θ̂0)(−H−1T (θ̂0))GT (θ̂0) , (4.16) LMJ = TG ′ T (θ̂0)J −1 T (θ̂0)GT (θ̂0) . (4.17) 4.7 Distribution Theory 139 Example 4.7 Weibull Distribution Reconsider the example of the Weibull distribution testing problem in Examples 4.3 and 4.5. The null hypothesis is β = 1, which is to be tested using a LM test based on the outer product of gradients matrix. The gradient vectorevaluated at θ̂0 using numerical derivatives is GT (θ̂0) = [0.000, 0.599] ′ . The outer product of gradients matrix using numerical derivatives and eval- uated at θ̂0 is JT (θ̂0) = [ 0.248 −0.176 −0.176 1.002 ] . From equation (4.17), the LM statistic is LMJ = 20 [ 0.000 0.599 ]′ [ 0.248 −0.176 −0.176 1.002 ]−1 [ 0.000 0.599 ] = 8.175 . Using the χ21 distribution, the p-value is 0.004, which leads to rejection of the null hypothesis at the 5% significance level that the data are drawn from an exponential distribution. This result is consistent with those obtained using the LR and Wald tests in Examples 4.3 and 4.5, respectively. 
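The Weibull testing problem of Examples 4.3, 4.5 and 4.7 is easily coded. The following MATLAB sketch, written in the same spirit as the book's companion programs but not taken from them, computes the LR statistic and a Hessian-based Wald statistic for the density given at the start of Example 4.3 on freshly simulated data; because the draws are new, the statistics will not reproduce the figures quoted in the examples.

```matlab
% LR and Wald tests of H0: beta = 1 (exponential) against the Weibull
% alternative f(y) = a*b*y^(b-1)*exp(-a*y^b). A sketch on simulated data.
rng(1); T = 20;
a_true = 1; b_true = 2;                        % illustrative true values
y = (-log(rand(T,1))/a_true).^(1/b_true);      % inverse-CDF Weibull draws

% average log-likelihood, theta = [a; b]
avgll = @(th) log(th(1)) + log(th(2)) + (th(2)-1)*mean(log(y)) ...
        - th(1)*mean(y.^th(2));

p1  = fminsearch(@(p) -avgll(exp(p)), [0; 0]); % optimise in logs so a, b > 0
th1 = exp(p1);                                 % unrestricted MLE [a1; b1]
th0 = [1/mean(y); 1];                          % restricted MLE under beta = 1

LR = -2*T*(avgll(th0) - avgll(th1));           % likelihood ratio statistic

% Wald statistic based on a central-difference Hessian of T*lnL at th1
h = 1e-4; H = zeros(2);
for i = 1:2
    for j = 1:2
        ei = zeros(2,1); ei(i) = h; ej = zeros(2,1); ej(j) = h;
        H(i,j) = T*(avgll(th1+ei+ej) - avgll(th1+ei-ej) ...
                  - avgll(th1-ei+ej) + avgll(th1-ei-ej))/(4*h^2);
    end
end
V = -inv(H);                                   % covariance matrix of th1
W = (th1(2) - 1)^2/V(2,2);
fprintf('LR = %6.3f   Wald = %6.3f\n', LR, W);
```

Optimising over the logarithms of the parameters is a simple way of keeping both Weibull parameters positive during the simplex iterations.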
4.7 Distribution Theory The asymptotic distributions of the LR, Wald and LM tests under the null hypothesis have all been stated to be simply χ2M , where M is the number of restrictions being tested. To show this result formally, the asymptotic dis- tribution of the Wald statistic is derived initially and then used to establish the asymptotic relationships between the three test statistics. 4.7.1 Asymptotic Distribution of the Wald Statistic To derive the asymptotic distribution of the Wald statistic, the crucial link to be drawn is that between the normal distribution and the chi-square distribution. The chi-square distribution withM degrees of freedom is given by f (y) = 1 Γ (M/2) 2M/2 yM/2−1 exp [−y/2] . (4.18) Consider the simple case of the distribution of y = z2, where z ∼ N (0, 1). Note that the standard normal variable z has as its domain the entire real 140 Hypothesis Testing line, while the transformed variable y is constrained to be positive. This change of domain means that the inverse function is given by z = ±√y. To express the probability distribution of y in terms of the given probability distribution of z, use the change of variable technique (see Appendix A) f (y) = f (z) ∣∣∣∣ dz dy ∣∣∣∣ , where dz/dy = ±y−1/2/2 is the Jacobian of the transformation. The proba- bility of every y therefore has contributions from both f(−z) and f(z) f (y) = f (z) ∣∣∣∣ dz dy ∣∣∣∣ z=−√y + f (z) ∣∣∣∣ dz dy ∣∣∣∣ z= √ y . (4.19) Simple substituting of standard normal distribution in equation (4.19) yields f (y) = 1√ 2π exp [ −z 2 2 ] ∣∣∣∣ 1 2z ∣∣∣∣ z=−√y + 1√ 2π exp [ −z 2 2 ] ∣∣∣∣ 1 2z ∣∣∣∣ z=+ √ y = y−1/2√ 2π exp [ −z 2 2 ] = y−1/2 Γ (1/2) √ 2 exp [ −y 2 ] , (4.20) where the last step follows from the property of the Gamma function that Γ (1/2) = √ π. This is the chi-square distribution in (4.18) with M = 1 degrees of freedom. Example 4.8 Single Restriction Case Consider the hypotheses H0 : µ = µ0 , H1 : µ 6= µ0 , to be tested by means of the simple t statistic z = √ T µ̂− µ0 σ̂ , where µ̂ is the sample mean and σ̂2 is the sample variance. From the Lindberg-Levy central limit theorem in Chapter 2, z a ∼ N (0, 1) under H0, so that from equation (4.20) it follows that z2 is distributed as χ21. But from equation (4.8), the statistic z2 = T (µ̂− µ0)2 /σ̂2 is the Wald test of the restriction. The Wald statistic is, therefore, asymptotically distributed as a χ21 random variable. The relationship between the normal distribution and the chi-square dis- tribution may be generalized to the case of multiple random variables. If 4.7 Distribution Theory 141 z1, z2, · · · , zM are M independent standard normal random variables, the transformed random variable, y = z21 + z 2 2 + · · · z2M , (4.21) is χ2M , which follows from the additivity property of chi-square random variables. Example 4.9 Multiple Restriction Case Consider the Wald statistic given in equation (4.5) W = T [Rθ̂1 −Q]′[RI−1(θ̂1)R′]−1[Rθ̂1 −Q] . Using the Choleski decomposition, it is possible to write RI−1(θ̂1)R ′ = SS′ , where S is a lower triangular matrix. In the special case of a scalar, M = 1, S is a standard deviation but in general for M > 1, S is interpreted as the standard deviation matrix. It has the property that [RI−1(θ̂1)R ′]−1 = (SS′)−1 = S−1′S−1 . It is now possible to write the Wald statistic as W = T [Rθ̂1 −Q]′S−1′S−1[Rθ̂1 −Q] = z′z = M∑ i=1 z2i , where z = √ TS−1[Rθ̂1 −Q] ∼ N(0M , IM ) . Using the additive property of chi-square variables given in (4.21), it follows immediately that W ∼ χ2M . 
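The link between squared standard normal random variables and the chi-square distribution is easily checked by simulation. The following MATLAB sketch verifies the mean, variance and 95th percentile of a sum of M squared independent N(0,1) variables against their chi-square values; the replication count and the choice M = 3 are illustrative, and quantile and chi2inv require the Statistics Toolbox.

```matlab
% Simulation check that z1^2 + ... + zM^2 behaves as chi-squared with M
% degrees of freedom, as in equation (4.21). Illustrative settings only.
rng(123); reps = 100000; M = 3;
w = sum(randn(reps, M).^2, 2);              % reps draws of the sum of squares
fprintf('Simulated mean %5.3f (theory %d), variance %5.3f (theory %d)\n', ...
        mean(w), M, var(w), 2*M);
fprintf('Simulated 95th percentile %5.3f, chi2(M) value %5.3f\n', ...
        quantile(w, 0.95), chi2inv(0.95, M));   % Statistics Toolbox functions
```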
The following simulation experiment highlights the theoretical results con- cerning the asymptotic distribution of the Wald statistic. Example 4.10 Simulating the Distribution of the Wald Statistic The multiple regression model yt = β0 + β1x1,t + β2x2,t + β3x3,t + ut, ut ∼ iidN(0, σ 2) , is simulated 10000 times with a sample size of T = 1000 with explanatory variables x1,t ∼ U(0, 1), x2,t ∼ N(0, 1), x3,t ∼ N(0, 1) 2 and population parameter values θ0 = {β0 = 0, β1 = 0, β2 = 0, β3 = 0, σ2 = 0.1}. The Wald statistic is constructed to test the hypotheses H0 : β1 = β2 = β3 = 0 , H1 : at least one restriction does not hold. 142 Hypothesis Testing As there are M = 3 restrictions, the asymptotic distribution under the null hypothesis of the Wald test is χ23. Figure 4.3 shows that the simulated dis- tribution (bar chart) of the test statistic matches its asymptotic distribution (continuous line). f (W ) W 0 5 10 15 0 0.05 0.1 0.15 0.2 Figure 4.3 Simulated distribution of the Wald statistic (bars) and the asymptotic distribution based on a χ23 distribution. 4.7.2 Asymptotic Relationships Among the Tests The previous section establishes that theWald test statistic is asymptotically distributed as χ2M under H0, where M is the number of restrictions being tested. The relationships between the LR, Wald and LM tests are now used to demonstrate that all three test statistics have the same asymptotic null distribution. Suppose the null hypothesis H0 : θ = θ0 is true. Expanding lnLT (θ) in a second-order Taylor series expansion around θ̂1 and evaluating at θ = θ0 gives lnLT (θ0) ≃ lnLT (θ̂1)+GT (θ̂1)(θ0− θ̂1)+ 1 2 (θ0− θ̂1)′HT (θ̂1)(θ0− θ̂1) , (4.22) where GT (θ̂1) = ∂ lnLT (θ) ∂θ ∣∣∣∣ θ=θ̂1 , HT (θ̂1) = ∂2 lnLT (θ) ∂θ∂θ′ ∣∣∣∣ θ=θ̂1 . (4.23) The remainder in this Taylor series expansion is asymptotically negligible because θ̂1 is a √ T -consistent estimator of θ0. The first order conditions of a maximum likelihood estimator require GT (θ̂1) = 0 so that equation (4.22) 4.7 Distribution Theory 143 reduces to lnLT (θ0) ≃ lnLT (θ̂1) + 1 2 (θ0 − θ̂1)′HT (θ̂1)(θ0 − θ̂1) . Multiplying both sides by T and rearranging gives −2 ( T lnLT (θ0)− T lnLT (θ̂1) ) ≃ −T (θ0 − θ̂1)′HT (θ̂1)(θ0 − θ̂1) . The left-hand side of this equation is the LR statistic. The right-hand side is the Wald statistic, thereby showing that the LR and Wald tests are asymp- totically equivalent under H0. To show the relationship between the LM and Wald tests, expand GT (θ) = ∂ lnLT (θ) ∂θ in terms of a first-order Taylor series expansion around θ̂1 and evaluate at θ = θ0 to get GT (θ0) ≃ GT (θ̂1) +HT (θ̂1)(θ0 − θ̂1) = HT (θ̂1)(θ − θ̂1) , where GT (θ̂1) and HT (θ̂1) are as defined in (4.23). Using the first order conditions of the maximum likelihood estimator yields GT (θ̂0) ≃ ∂2 lnLT (θ) ∂θ∂θ′ ∣∣∣∣ θ=θ̂1 (θ̂0 − θ̂1) = I(θ̂1)(θ̂1 − θ̂0) . Substituting this expression into the LM statistic in (4.14) gives LM ≃ T (θ̂1−θ0)′I(θ̂1)′I−1(θ0)I(θ̂1)(θ̂1−θ0) ≃ T (θ̂1−θ0)′I(θ̂1)(θ̂1−θ0) =W , This demonstrates that the LM andWald tests are asymptotically equivalent under the null hypothesis. As the LR, W and LM test statistics have the same asymptotic distri- bution the choice of which to use is governed by convenience. When it is easier to estimate the unrestricted (restricted) model, the Wald (LM) test is the most convenient to compute. The LM test tends to dominate diag- nostic analysis of regression models with normally distributed disturbances because the model under the null hypothesis is often estimated using a least squares estimation procedure. 
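The asymptotic equivalence of the three statistics is easy to verify numerically. The following MATLAB sketch computes W, LR and LM for the normal-mean test of Examples 4.1, 4.4 and 4.6 on data simulated under the null hypothesis; the sample sizes are illustrative. As T grows the three statistics converge to one another, while in smaller samples they obey the finite-sample ordering W ≥ LR ≥ LM discussed next.

```matlab
% Numerical check of the asymptotic equivalence of the LR, Wald and LM
% statistics for H0: mu = mu0 in the normal model. Data simulated under H0.
rng(7); mu0 = 0;
for T = [25 100 1000 10000]
    y  = mu0 + randn(T,1);                  % data generated under the null
    s1 = mean((y - mean(y)).^2);            % unrestricted variance
    s0 = mean((y - mu0).^2);                % restricted variance
    LR = T*log(s0/s1);
    W  = T*(mean(y) - mu0)^2/s1;
    LM = T*(mean(y) - mu0)^2/s0;
    fprintf('T = %6d   W = %6.3f   LR = %6.3f   LM = %6.3f\n', T, W, LR, LM);
end
```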
These features of the LM test are developed in Part TWO.

4.7.3 Finite Sample Relationships

The discussion of the LR, Wald and LM test statistics, so far, is based on asymptotic distribution theory. In general, the finite sample distribution of the test statistics is unknown and is commonly approximated by the asymptotic distribution. In situations where the asymptotic distribution does not provide an accurate approximation of the finite sample distribution, three possible solutions exist.

(1) Second-order approximations. The asymptotic results are based on a first-order Taylor series expansion of the gradient of the log-likelihood function. In some cases, extending the expansions to higher-order terms, for example by using Edgeworth expansions (see Example 2.28), provides a more accurate approximation to the sampling distribution of the maximum likelihood estimator. However, this is more easily said than done, because deriving the sampling distribution of nonlinear functions is much more difficult than deriving the sampling distributions of linear functions.

(2) Monte Carlo methods. To circumvent the analytical problems associated with deriving the sampling distribution of the maximum likelihood estimator in finite samples using second-order, or even higher-order, expansions, a more convenient approach is to use Monte Carlo methods. The approach is to simulate the finite sample distribution of the test statistic for particular values of the sample size, T, and to compute the corresponding critical values from the simulated values.

(3) Transformations. A final approach is to transform the statistic so that the asymptotic distribution provides a better approximation to the finite sample distribution. A well-known example is the distribution of the test of the correlation coefficient, which is asymptotically normally distributed, although convergence is relatively slow as T increases (Stuart and Ord, 1994, p567).

Under the assumption of normality, and confining attention to the case of linear restrictions, an important relationship holds amongst the three test statistics in finite samples:

W \geq LR \geq LM .

This result implies that the LM test tends to be a more conservative test in finite samples: the Wald statistic tends to reject the null hypothesis more frequently than the LR statistic, which, in turn, tends to reject the null hypothesis more frequently than the LM statistic. The relationship is highlighted by the Wald and LM tests of the mean of the normal distribution in Examples 4.4 and 4.6, respectively: because σ̂₁² ≤ σ̂₀², it follows that

W = \frac{T(\bar{y} - \mu_0)^2}{\hat{\sigma}_1^2} \;\geq\; LM = \frac{T(\bar{y} - \mu_0)^2}{\hat{\sigma}_0^2} .

4.8 Size and Power Properties

4.8.1 Size of a Test

The probability of rejecting the null hypothesis when it is true (a Type-1 error) is usually denoted α and called the level of significance, or the size, of a test. For a test with size α = 0.05, therefore, the null hypothesis is rejected for p-values of less than 0.05. Equivalently, the null hypothesis is rejected when the test statistic falls within a rejection region, ω, in which case the size of the test is expressed conveniently (in the case of the Wald test) as

Size = P(W \in \omega \,|\, H_0) . \qquad (4.24)
In a simulation experiment, the size is computed by simulating the model under the null hypothesis, H0, that is, when the restrictions are true, and computing the proportion of simulated values of the test statistic that are greater than the critical value obtained from the asymptotic distribution. The asymptotic distribution of the LR, W and LM tests is χ² with M degrees of freedom under the null hypothesis, so in this case the critical value is χ²_M(0.05). Subject to some simulation error, the simulated and asymptotic sizes should match. In finite samples, however, this may not be true. In the case where the simulated size is greater than 0.05, the test is oversized, with the null hypothesis being rejected more often than predicted by asymptotic theory. In the case where the simulated size is less than 0.05, the test is undersized (conservative), with the null hypothesis being rejected less often than predicted by asymptotic theory.

Example 4.11 Computing the Size of a Test by Simulation
Consider testing the hypotheses

H_0 : \beta_1 = 0 , \qquad H_1 : \beta_1 \neq 0 ,

in the exponential regression model

f(y \,|\, x_t; \theta) = \mu_t^{-1} \exp\left[-\mu_t^{-1} y\right] ,

where µ_t = exp[β0 + β1 x_t] and θ = {β0, β1}. Computing the size of the test requires simulating the model 10000 times under the null hypothesis β1 = 0 for samples of size T = 5, 10, 25, 100, with x_t ∼ iid N(0,1) and intercept β0 = 1. For each simulation, the Wald statistic

W = \frac{(\hat{\beta}_1 - 0)^2}{\mathrm{var}(\hat{\beta}_1)} ,

is computed. The size of the Wald test is computed as the proportion of the 10000 statistics with values greater than χ²₁(0.05) = 3.841. The results are as follows:

T                                   5        10       25       100
Size                                0.066    0.053    0.052    0.051
Critical value (Simulated, 5%)      4.288    3.975    3.905    3.873

The test is slightly oversized for T = 5 since 0.066 > 0.05, but the empirical size approaches the asymptotic size of 0.05 very quickly for T ≥ 10. Also given are the simulated critical values corresponding to the value of the test statistic that is exceeded by 5% of the simulated values. The fact that the test is oversized results in critical values in excess of the asymptotic critical value of 3.841.

4.8.2 Power of a Test

The probability of rejecting the null hypothesis when it is false is called the power of a test. A second type of error that occurs in hypothesis testing is failing to reject the null hypothesis when it is false (a Type-2 error). The power of a test is expressed formally (in the case of the Wald test) as

Power = P(W \in \omega \,|\, H_1) , \qquad (4.25)

so that 1 − Power is the probability of committing a Type-2 error. In a simulation experiment, the power is computed by simulating the model under the alternative hypothesis, H1, that is, when the restrictions stated in the null hypothesis, H0, are false. The proportion of simulated values of the test statistic greater than the critical value then gives the power of the test. Here the critical value is not the one obtained from the asymptotic distribution, but rather from simulating the distribution of the statistic under the null hypothesis and then choosing the value that has a fixed size of, say, 0.05. As the size is fixed at a certain level in computing the power of a test, the power is then referred to as a size-adjusted power.

Example 4.12 Computing the Power of a Test by Simulation
Consider again the exponential regression model of Example 4.11 with the null hypothesis given by β1 = 0.
The power of the Wald test is computed for 10000 samples of size T = 5 with β0 = 1 and with increasing values for β1 given by β1 = {−4,−3,−2,−1, 0, 1, 2, 3, 4}. For each value of β1, the size-adjusted power of the test is computed as the proportion of the 10000 statistics with values greater than 4.288, the critical value from Example 4.11 corresponding to a size of 0.05 for T = 5. The results are as follows: β1 : −4 −3 −2 −1 0 1 2 3 4 Power: 0.99 0.98 0.86 0.38 0.05 0.42 0.89 0.99 0.99 The power of the Wald test at β1 = 0 is 0.05 by construction as the powers are size-adjusted. The size-adjusted power of the test increases monotoni- cally as the value of the parameter β1 moves further and further away from its value under the null hypothesis with a maximum power of 99% attained at β1 = ±4. An important property of any test is that, as the sample increases, the probability of rejecting the null hypothesis when it is false, or the power of the test, approaches unity in the limit lim T→∞ P (W ∈ ω|H1) = 1. (4.26) A test having this property is known as a consistent test. Example 4.13 Illustrating the Consistency of a Test by Simulation The testing problem in the exponential regression model introduced in Examples 4.11 and 4.12 is now developed. The power of the Wald test, with respect to testing the null hypothesisH0 : β1 = 0, is computed for 10000 samples using parameter values β0 = 1 and β1 = 1. The results obtained for increasing sample sizes are as follows: T: 5 10 25 100 Power: 0.420 0.647 0.993 1.000 Critical value (Simulated, 5%): 4.288 3.975 3.905 3.873 In computing the power for each sample size, a different critical value is used to ensure that the size of the test is 0.05 and, therefore, that the power values reported are size adjusted. The results show that the Wald test is consistent because Power → 1 as T is increased. 148 Hypothesis Testing 4.9 Applications Two applications that highlight the details of the calculations of the LR, Wald and LM tests are now presented. The first involves performing tests of the parameters of an exponential regression model. The second extends the exponential regression example by generalizing the distribution to a gamma distribution. Further applications of the three testing procedures to regression models are discussed in Part TWO of the book. 4.9.1 Exponential Regression Model Consider the exponential regression model where yt is assumed to be inde- pendent, but not identically distributed, from an exponential distribution with time-varying mean E [yt] = µt = β0 + β1xt , (4.27) where xt is the explanatory variable held fixed in repeated samples. The aim is to test the hypotheses H0 : β1 = 0 , H1 : β1 6= 0 . (4.28) Under the null hypothesis, the mean of yt is simply β0, which implies that yt is an iid random variable. The parameters under the null and alternative hypotheses are, respectively, θ0 = {β0, 0} and θ1 = {β0, β1}. As the distribution of yt is exponential with mean µt, the log-likelihood function is lnLT (θ) = 1 T T∑ t=1 ( − ln(µt)− yt µt ) = − 1 T T∑ t=1 ln(β0+β1xt)− 1 T T∑ t=1 yt β0 + β1xt . The gradient vector is GT (θ) = 1 T T∑ t=1 (−µ−1t + µ−2t yt) 1 T T∑ t=1 (−µ−1t + µ−2t yt)xt , and the Hessian matrix is HT (θ) = 1 T T∑ t=1 (µ−2t − 2µ−3t yt) 1 T T∑ t=1 (µ−2t − 2µ−3t yt)xt 1 T T∑ t=1 (µ−2t − 2µ−3t yt)xt 1 T T∑ t=1 (µ−2t − 2µ−3t yt)x2t . 4.9 Applications 149 Taking expectations and changing the sign gives the information matrix I(θ) = 1 T T∑ t=1 µ−2t 1 T T∑ t=1 µ−2t xt 1 T T∑ t=1 µ−2t xt 1 T T∑ t=1 µ−2t x 2 t . 
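These expressions translate directly into code. The following MATLAB sketch codes the average log-likelihood, gradient and information matrix of the exponential regression model and obtains the unrestricted estimates by numerical optimisation; the data are simulated with the same design as the application that follows (β0 = 1, β1 = 2, xt ∼ U(0,1), T = 2000), but since the draws are new the estimates will not match its reported figures exactly.

```matlab
% Exponential regression model: average log-likelihood, gradient and
% information matrix coded from the expressions above; theta = [b0; b1].
rng(0); T = 2000; x = rand(T,1);
mu = 1 + 2*x;
y  = -mu.*log(rand(T,1));                         % exponential draws with mean mu

avgll = @(th) mean(-log(th(1)+th(2)*x) - y./(th(1)+th(2)*x));
grad  = @(th) [mean(-1./(th(1)+th(2)*x) + y./(th(1)+th(2)*x).^2); ...
               mean((-1./(th(1)+th(2)*x) + y./(th(1)+th(2)*x).^2).*x)];
info  = @(th) [mean(1./(th(1)+th(2)*x).^2),   mean(x./(th(1)+th(2)*x).^2); ...
               mean(x./(th(1)+th(2)*x).^2),   mean(x.^2./(th(1)+th(2)*x).^2)];

p1  = fminsearch(@(p) -avgll(exp(p)), [0; 0]);    % logs keep mu_t positive here
th1 = exp(p1);                                    % unrestricted MLE [b0; b1]
G1  = grad(th1);                                  % approximately zero at the MLE
I1  = info(th1);
se  = sqrt(diag(inv(I1))/T);                      % standard errors from I(theta)
```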
A sample of T = 2000 observations on yt and xt is generated from the following exponential regression model: f(y; θ) = 1 µt exp [ − y µt ] , µt = β0 + β1xt , with parameters θ = {β0, β1}. The parameters are set at β0 = 1 and β1 = 2 and xt ∼ U(0, 1). The unrestricted parameter estimates, the gradient and log-likelihood function value are, respectively, θ̂1 = [1.101, 1.760] ′ , GT (θ̂1) = [0.000, 0.000] ′ , lnLT (θ̂1) = −1.653 . Evaluating the Hessian, information and outer product of gradient matrices at the unrestricted parameter estimates gives, respectively, HT (θ̂1) = [ −0.315 −0.110 −0.110 −0.062 ] I(θ̂1) = [ 0.315 0.110 0.110 0.062 ] (4.29) JT (θ̂1) = [ 0.313 0.103 0.103 0.057 ] . The restricted parameter estimates, the gradient and log-likelihood func- tion value are, respectively θ̂0 = [1.989, 0.000] ′ , GT (θ̂0) = [0.000, 0.037] ′ , lnLT (θ̂0) = −1.688 . Evaluating the Hessian, information and outer product of gradients matrices at the restricted parameter estimates gives, respectively HT (θ̂0) = [ −0.377 −0.092 −0.092 −0.038 ] I(θ̂0) = [ 0.253 0.128 0.128 0.086 ] (4.30) JT (θ̂0) = [ 0.265 0.165 0.165 0.123 ] . 150 Hypothesis Testing To test the hypotheses in (4.28), compute the LR statistic as LR = −2(T lnLT (θ̂0)−T lnLT (θ̂1)) = −2(−3375.208+3305.996) = 138.425 . Using the χ21 distribution, the p-value is 0.000 indicating a rejection of the null hypothesis that β1 = 0 at conventional significance levels, a result that is consistent with the data-generating process. To perform the Wald test, define R = [ 0 1 ] and Q = [ 0 ]. Three Wald statistics are computed using the Hessian, information and outer product of gradients matrices in (4.29), with all calculations presented to three decimal points WH = T [R θ̂1 −Q]′[R (−H−1(θ̂1))R′]−1[R θ̂1 −Q] = 145.545 WI = T [R θ̂1 −Q]′[RI−1(θ̂1)R′]−1[R θ̂1 −Q] = 147.338 WJ = T [R θ̂1 −Q]′[RJ−1(θ̂1)R′]−1[R θ̂1 −Q] = 139.690 . Using the χ21 distribution, all p-values are 0.000, showing that the null hy- pothesis that β1 = 0 is rejected at the 5% significance level for all three Wald tests. Finally, three Lagrange multiplier statistics are computed using the Hes- sian, information and outer product of gradients matrices, as in (4.30) LMH = TG ′ T (θ̂0)(−H−1T (θ̂0))GT (θ̂0) = 2000 [ 0.000 0.037 ]′ [ 0.377 0.092 0.092 0.038 ]−1 [ 0.000 0.037 ] = 169.698 . LMI = TG ′ T (θ̂0)I −1(θ̂0)GT (θ̂0) = 2000 [ 0.000 0.037 ]′ [ 0.253 0.128 0.128 0.086 ]−1 [ 0.000 0.037 ] = 127.482 . LMJ = TG ′ T (θ̂0)J −1 T (θ̂0)GT (θ̂0) = 2000 [ 0.000 0.037 ]′ [ 0.265 0.165 0.165 0.123 ]−1 [ 0.000 0.037 ] = 129.678 . Using the χ21 distribution, all p-values are 0.000, showing that the null hy- pothesis that β1 = 0 is rejected at the 5% significance level for all three LM tests. 4.9 Applications 151 4.9.2 Gamma Regression Model Consider the gamma regression model where yt is assumed to be independent but not identically distributed from a gamma distribution with time-varying mean E [yt] = µt = β0 + β1xt , where xt is the explanatory variable. The gamma distribution is given by f(y|xt; θ) = 1 Γ(ρ) ( 1 µt )ρ yρ−1 exp [ − y µt ] , Γ(ρ) = ∫ ∞ 0 sρ−1e−sds , where θ = {β0, β1}. As the gamma distribution nests the exponential distri- bution when ρ = 1, a natural hypothesis to test is H0 : ρ = 1 , H1 : ρ 6= 1 . The log-likelihood function is lnLT (θ) = − ln Γ(ρ)− ρ T T∑ t=1 ln(β0+β1xt)+ ρ− 1 T T∑ t=1 ln yt− 1 T T∑ t=1 yt β0 + β1xt . 
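A sketch of the corresponding calculation in MATLAB is given below; it assumes the vectors y and x, and the exponential regression estimates th1, from the previous sketch are still in memory, and it uses gammaln for ln Γ(ρ).

```matlab
% Gamma regression average log-likelihood coded from the expression above,
% together with the LR test of H0: rho = 1; theta = [b0; b1; rho].
avgll_g = @(th) -gammaln(th(3)) - th(3)*mean(log(th(1)+th(2)*x)) ...
          + (th(3)-1)*mean(log(y)) - mean(y./(th(1)+th(2)*x));

p1g  = fminsearch(@(p) -avgll_g(exp(p)), zeros(3,1));   % unrestricted (gamma)
th1g = exp(p1g);
th0g = [th1; 1];              % restricted estimates: exponential MLE with rho = 1

LR = -2*numel(y)*(avgll_g(th0g) - avgll_g(th1g));       % LR test of rho = 1
```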
As the gamma function, Γ(ρ), appears in the likelihood function, it is con- venient to use numerical derivatives to calculate the maximum likelihood estimates and the test statistics. The following numerical illustration uses the data from the previous ap- plication on the exponential regression model. The unrestricted maximum likelihood parameter estimates and log-likelihood function value are, respec- tively, θ̂1 = [1.061, 1.698, 1.037] ′ , lnLT (θ̂1) = −1.652579 . The corresponding restricted values, which are also the unrestricted esti- mates of the exponential model of the previous application, are θ̂0 = [1.101, 1.760, 1.000] ′ , lnLT (θ̂0) = −1.652998 . The LR statistic is LR = −2(T lnLT (θ̂0)− T lnLT (θ̂1)) = −2(−3305.996 + 3305.158) = 1.674 . Using the χ21 distribution, the p-value is 0.196, which means that the null hypothesis that the distribution is exponential cannot be rejected at the 5% significance level, a result that is consistent with the data generating process in Section 4.9.1. 152 Hypothesis Testing The Wald statistic is computed with standard errors based on the Hessian evaluated at the unrestricted estimates. The Hessian matrix is HT (θ̂1) = −0.351 −0.123 −0.560 −0.123 −0.069 −0.239 −0.560 −0.239 −1.560 . Defining R = [ 0 0 1 ] and Q = [ 1 ], the Wald statistic is W = T [R θ̂1−Q]′[R(−H−1T (θ̂1))R′]−1[R θ̂1−Q] = (1.037 − 1.000)2 0.001 = 1.631 . Using the χ21 distribution, the p-value is 0.202, which also shows that the null hypothesis that the distribution is exponential cannot be rejected at the 5% significance level. The LM statistic is based on the outer product of gradients matrix. To cal- culate the LM statistic, the gradient is evaluated at the restricted parameter estimates GT (θ̂0) = [ 0.000, 0.000, 0.023 ] ′ . The outer product of gradients matrix evaluated at θ̂0 is JT (θ̂0) = 0.313 0.103 0.524 0.103 0.057 0.220 0.524 0.220 1.549 , with inverse J−1T (θ̂0) = 9.755 −11.109 −1.728 −11.109 51.696 −3.564 −1.728 −3.564 1.735 . The LM test statistic is LM = TG(θ̂0) ′J−1T (θ̂0)G(θ̂0) = 20000 0.000 0.000 0.023 ′ 9.755 −11.109 −1.728 −11.10951.696 −3.564 −1.728 −3.564 1.735 0.000 0.000 0.023 = 1.853 . Consistent with the results reported for the LR and Wald tests, using the χ21 distribution the p-value of the LM test is 0.173 indicating that the null hypothesis cannot be rejected at the 5% level. 4.10 Exercises 153 4.10 Exercises (1) The Linear Regression Model Gauss file(s) test_regress.g Matlab file(s) test_regress.m Consider the regression model yt = βxt + ut , ut ∼ N(0, σ 2) where the independent variable is xt = {1, 2, 4, 5, 8}. The aim is to test the hypotheses H0 : β = 0 , H1 : β 6= 0. (a) Simulate the model for T = 5 observations using the parameter values β = 1, σ2 = 4. (b) Estimate the restricted model and unrestricted models and compute the corresponding values of the log-likelihood function. (c) Perform a LR test choosing α = 0.05 as the size of the test. Interpret the result. (d) Perform a Wald test choosing α = 0.05 as the size of the test. Interpret the result. (e) Compute the gradient of the unrestricted model, but evaluated at the restricted estimates. (f) Compute the Hessian of the unrestricted model, but evaluated at the restricted estimates, θ̂0, and perform a LM test choosing α = 0.05 as the size of the test. Interpret the result. 
(2) The Weibull Distribution Gauss file(s) test_weibull.g Matlab file(s) test_weibull.m Generate T = 20 observations from the Weibull distribution f(y; θ) = αβyβ−1 exp [ −αyβ ] , where the parameters are θ = {α = 1, β = 2}. (a) Compute the unrestricted maximum likelihood estimates, θ̂1 = {α̂1, β̂1} and the value of the log-likelihood function. (b) Compute the restricted maximum likelihood estimates, θ̂0 = {α̂0, β̂0 = 1} and the value of the log-likelihood function. (c) Test the hypotheses H0 : β = 1 , H1 : β 6= 1 using a LR test, a Wald test and a LM test and interpret the results. 154 Hypothesis Testing (d) Test the hypotheses H0 : β = 2 , H1 : β 6= 2 using a LR test, a Wald test and a LM test and interpret the results. (3) Simulating the Distribution of the Wald Statistic Gauss file(s) test_asymptotic.g Matlab file(s) test_asymptotic.m Simulate the multiple regression model 10000 times with a sample size of T = 1000 yt = β0 + β1x1,t + β2x2,t + β3x3,t + ut, ut ∼ iidN(0, σ 2), where the explanatory variables are x1,t ∼ U(0, 1), x2,t ∼ N(0, 1), x3,t ∼ N(0, 1)2 and θ = {β0 = 0, β1 = 0, β2 = 0, β3 = 0, σ2 = 0.1}. (a) For each simulation, compute the Wald test of the null hypothesis H0 : β1 = 0 and compare the simulated distribution to the asymp- totic distribution. (b) For each simulation, compute the Wald test of the joint null hypoth- esis H0 : β1 = β2 = 0 and compare the simulated distribution to the asymptotic distribution. (c) For each simulation, compute the Wald test of the joint null hypoth- esis H0 : β1 = β2 = β3 = 0 and compare the simulated distribution to the asymptotic distribution. (d) Repeat parts (a) to (c) for T = 10, 20 and compare the finite sample distribution of the Wald statistic with the asymptotic distribution as approximated by the simulated distribution based on T = 1000. (4) Simulating the Size and Power of the Wald Statistic Gauss file(s) test_size.g, test_power.g Matlab file(s) test_size.m, test_power.m Consider testing the hypotheses H0 : β1 = 0, H1 : β1 6= 0, in the exponential regression model f (y| xt; θ) = µ−1t exp [ −µ−1t xt ] , where µt = exp [β0 + β1xt], xt ∼ N(0, 1) and θ = {β0 = 1, β1 = 0}. 4.10 Exercises 155 (a) Compute the sampling distribution of the Wald test by simulating the model under the null hypothesis 10000 times for a sample of size T = 5. Using the 0.05 critical value from the asymptotic distribution of the test statistic, compute the size of the test. Also, compute the critical value from the simulated distribution corresponding to a simulated size of 0.05. (b) Repeat part (a) for samples of size T = 10, 25, 100, 500. Interpret the results of the simulations. (c) Compute the power of the Wald test for a sample of size T = 5, β0 = 1 and for β1 = {−4,−3,−2,−1, 0, 1, 2, 3, 4}. (d) Repeat part (c) for samples of size T = 10, 25, 100, 500. Interpret the results of the simulations. (5) Exponential Regression Model Gauss file(s) test_expreg.g, test_gammareg.g Matlab file(s) test_expreg.m, test_gammareg.m Generate a sample of size T = 2000 observations from the following exponential regression model f(y |xt; θ) = 1 µt exp [ − y µt ] , where µt = β0 +β1xt, xt ∼ U(0, 1) and the parameter values are β0 = 1 and β1 = 2. (a) Compute the unrestricted maximum likelihood estimates, θ̂1 = {β̂0, β̂1} and the value of the log-likelihood function, lnLT (θ̂1). (b) Re-estimate the model subject to the restriction that β1 = 0 and recompute the value of the log-likelihood function, lnLT (θ̂0). 
(c) Test the following hypotheses H0 : β1 = 0 , H1 : β1 6= 0, using a LR test; Wald tests based on the Hessian, information and outer product of gradients matrices, respectively, with analytical and nu- merical derivatives in each case; and LM tests based on the Hessian, information and outer product of gradients matrices, with analytical and numerical derivatives in each case. Interpret the results. (d) Now assume that the true distribution is gamma f(y |xt; θ) = 1 Γ(ρ) ( 1 µt )ρ yρ−1 exp ( − y µt ) , where the unknown parameters are θ = {β1, β2, ρ}. Compute the 156 Hypothesis Testing unrestricted maximum likelihood estimates, θ̂1 = {β̂0, β̂1, ρ̂} and the value of the log-likelihood function, lnL(θ̂1). (e) Test the following hypotheses H0 : ρ = 1 , H1 : ρ 6= 1 , using a LR test; Wald tests based on the Hessian, information and outer product of gradients matrices, respectively, with numerical derivatives in each case; and LM tests based on the Hessian, infor- mation and outer product of gradients matrices, respectively, with numerical derivatives in each case. Interpret the results. (6) Neyman’s Smooth Goodness of Fit Test Gauss file(s) test_smooth.g Matlab file(s) test_smooth.m Let y1,t, y2,t, · · · , yT , be iid random variables with unknown distribution function F . A test that the distribution function is known and equal to F0 is given by the respective null and alternative hypotheses H0 : F = F0 , H1 : F 6= F0 . The Neyman (1937) smooth goodness of fit test (see also Bera, Ghosh and Xiao (2010) for a recent application) is based on the property that the random variable u = F0(y) = y∫ −∞ f0(s)ds , is uniformly distributed under the null hypothesis. The approach is to specify the generalized uniform distribution g(u) = c(θ) exp[1 + θ1φ1(u) + θ2φ2(u) + θ3φ3(u) + θ4φ4(u)] , where c(θ) is the normalizing constant to ensure that 1∫ 0 g(u)du = 1 . 4.10 Exercises 157 The terms φi(u) are the Legendre orthogonal polynomials given by φ1(u) = √ 32 ( u− 1 2 ) φ2(u) = √ 5 ( 6 ( u− 1 2 )2 − 1 2 ) φ3(u) = √ 7 ( 20 ( u− 1 2 )3 − 3 ( u− 1 2 )) φ4(u) = 3 ( 70 ( u− 1 2 )4 − 15 ( u− 1 2 )2 + 3 8 ) , satisfying the orthogonality property 1∫ 0 φi(u)φj(u)du = { 1 : i = j 0 : i 6= j . A test of the null and alternative hypotheses is given by the joint re- strictions H0 : θ1 = θ2 = θ3 = θ4 = 0 , H1 : at least one restriction fails , as the distribution of u under H0 is uniform. (a) Derive the log-likelihood function, lnLT (θ), in terms of ut where ut = F0(yt) = yt∫ −∞ f0(s)ds , as well as the gradient vector GT (θ) and the information matrix I(θ). In writing out the log-likelihood function it is necessary to use the expression of the Legendre polynomials φi(z). (b) Derive a LR test. (c) Derive a Wald test. (d) Show that a LM test is based on the statistic LM = 4∑ i=1 ( 1√ T T∑ t=1 φi(ut) )2 . In deriving the LM statistic use the result that c (θ)−1 = 1∫ 0 exp[1 + θ1φ1(u) + θ2φ2(u) + θ3φ3(u) + θ4φ4(u)]du . (e) Briefly discuss the advantages and disadvantages of the alternative test statistics in parts (b) to (d). 158 Hypothesis Testing (f) To examine the performance of thethree testing procedures in parts (b) to (d) under the null hypothesis, assume that F0 is the normal distribution and that the random variables are drawn from N(0, 1). (g) To examine the performance of the three testing procedures in parts (b) to (d) under the alternative hypothesis, assume that F0 is the normal distribution and that the random variables are drawn from χ21. 
PART TWO REGRESSION MODELS 5 Linear Regression Models 5.1 Introduction The maximum likelihood framework set out in Part ONE is now applied to estimating and testing regression models. This chapter focuses on lin- ear models, where the conditional mean of a dependent variable is specified to be a linear function of a set of explanatory variables. Both single equa- tion and multiple equations models are discussed. Extensions of the linear class of models are discussed in Chapter 6 (nonlinear regression), Chapter 7 (autocorrelation) and Chapter 8 (heteroskedasticity). Many of the examples considered in Part ONE specify the distribution of the observable random variable, yt. Regression models, by contrast, specify the distribution of the unobservable disturbances, ut. Specifying the dis- tribution in terms ut means that maximum likelihood estimation cannot be used directly, since this method requires evaluating the log-likelihood function at the observed values of the data. This problem is circumvented by using the transformation of variable technique (see Appendix A), which transforms the distribution of ut to the distribution of yt. This technique is used implicitly in the regression examples considered in Part ONE. In Part TWO, however, the form of this transformation must be made explicit. A second important feature of regression models is that the distribution of ut is usually chosen to be the normal distribution. One of the gains in adopting this assumption is that it can simplify the computation of the maximum likelihood estimators so that they can be obtained simply by least squares regressions. 162 Linear Regression Models 5.2 Specification The different types of linear regression models can usefully be illustrated by means of examples which are all similar in the sense that each model includes: one or more endogenous or dependent variables, yi,t, that are si- multaneously determined by an interrelated series of equations; exogenous variables, xi,t, that are assumed to be determined outside the model; and predetermined or lagged dependent variables, yi,t−j. Together the exogenous and predetermined variables are referred to as the independent variables. 5.2.1 Model Classification Example 5.1 Univariate Regression Model Consider a linear relationship between a single dependent (endogenous) variable, yt, and a single exogenous variable, xt, given by yt = αxt + ut , ut ∼ iidN(0, σ 2) , where ut is the disturbance term. By definition, xt is independent of the disturbance term, E[xtut] = 0. Example 5.2 Seemingly Unrelated Regression Model An extension of the univariate equation containing two dependent vari- ables, y1,t, y2,t, and one exogenous variable, xt, is y1,t = α1xt + u1,t y2,t = α2xt + u2,t , where the disturbance term ut = (u1,t, u2,t) ′ has the properties ut ∼ iidN ([ 0 0 ] , [ σ11 σ12 σ12 σ22 ]) . This system is commonly known as a seemingly unrelated regression model (SUR) and is discussed in greater detail later on. An important feature of the SUR model is that the dependent variables are expressed only in terms of the exogenous variable(s). Example 5.3 Simultaneous System of Equations Systems of equations in which the dependent variables are determinants of other dependent variables, and not just independent variables, are referred to as simultaneous systems of equations. 
Consider the following system of 5.2 Specification 163 equations y1,t = βy2,t + u1,t y2,t = γy1,t + αxt + u2,t , where the disturbance term ut = (u1,t, u2,t) ′ has the properties ut ∼ iidN ([ 0 0 ] , [ σ11 0 0 σ22 ]) . This system is characterized by the dependent variables y1,t and y2,t being functions of each other, with y2,t also being a function of the exogenous variable xt. Example 5.4 Recursive System A special case of the simultaneous model is the recursive model. An ex- ample of a trivariate recursive model is y1,t = α1xt + u1,t y2,t = β1y1,t + α2xt + u2,t, y3,t = β2y1,t + β3y2,t + α3xt + u3,t, where the disturbance term ut = (u1,t, u2,t, u3,t) ′ has the properties ut ∼ iidN 0 0 0 , σ11 0 0 0 σ22 0 0 0 σ33 . 5.2.2 Structural and Reduced Forms Before generalizing the previous examples to many dependent variables and many independent variables, it is helpful to introduce some matrix notation. For example, consider rewriting the simultaneous model of Example 5.3 as y1,t − βy2,t = u1,t −γy1,t + y2,t − αxt = u2,t , or more compactly as ytB + xtA = ut, (5.1) where yt = [y1,t y2,t] , B = [ 1 −γ −β 1 ] , A = [ 0 −α ] , ut = [u1,t u2,t] . 164 Linear Regression Models The covariance matrix of the disturbances is V = E [ u′tut ] = E [ u21,t u1,tu2,t u1,tu2,t u 2 2,t ] = [ σ11 0 0 σ22 ] . Equation (5.1) is known as the structural form where yt represents the endogenous variables and xt the exogenous variables. The bivariate system of equations in (5.1) is easily generalized to a system of N equations with K exogenous variables by simply extending the dimen- sions of the pertinent matrices. For example, the dependent and exogenous variables become yt = [ y1,t y2,t · · · yN,t ] xt = [ x1,t x2,t · · · xK,t ] , and the disturbance terms become ut = [ u1,t u2,t · · · uN,t ] , so that in equation (5.1) B is now (N×N), A is (K×N) and V is a (N×N) covariance matrix of the disturbances. An alternative way to write the system of equations in (5.1) is to express the system in terms of yt, yt = −xtAB−1 + utB−1 = xtΠ+ vt , (5.2) where Π = −AB−1 , vt = utB−1 , (5.3) and the disturbance term vt has the properties E [vt] = E [ utB −1] = E [ut]B−1 = 0 , E [ v′tvt ] = E [ (B−1)′u′tutB −1] = (B−1)′E [ u′tut ] B−1 = (B−1)′V B−1 . Equation (5.2) is known as the reduced form. The reduced form of a set of structural equations serves a number of important purposes. (1) It forms the basis for simulating a system of equations. (2) It can be used as an alternative way to estimate a structural model. A popular approach is estimating structural vector autoregression models, which is discussed in Chapter 14. (3) The reduced form is used to compute forecasts and perform experiments on models. 5.2 Specification 165 Example 5.5 Simulating a Simultaneous Model Consider simulating T = 500 observations from the bivariate model y1,t = β1y2,t + α1x1,t + u1,t y2,t = β2y1,t + α2x2,t + u2,t, with parameters β1 = 0.6, α1 = 0.4, β2 = 0.2, α2 = −0.5 and covariance matrix of ut V = [ σ11 σ12 σ12 σ22 ] = [ 1 0.5 0.5 1 ] . Define the structural parameter matrices B = [ 1 −β2 −β1 1 ] = [ 1.000 −0.200 −0.600 1.000 ] A = [ −α1 0 0 −α2 ] = [ −0.400 0.000 0.000 0.500 ] . From equation (5.3) the reduced form parameter matrix is Π = −AB−1 = − [ −0.400 0.000 0.000 0.500 ] [ 1.000 −0.200 −0.600 1.000 ]−1 = − [ −0.400 0.000 0.000 0.500 ] [ 1.136 0.227 0.681 1.136 ] = [ 0.454 0.090 −0.340 −0.568 ] . 
The reduced form at time t is [ y1,t y2,t ] = [ x1,t x2,t ] [ 0.454 0.090 −0.340 −0.568 ] + [ v1,t v2,t ] , where the reduced form disturbances are given by equation (5.3) [ v1,t v2,t ] = [ u1,t u2,t ] [ 1.136 0.227 0.681 1.136 ] . The simulated series of y1,t and y2,t are given in Figure 5.1, together with scatter plots corresponding to the two equations, where the exogenous vari- ables are chosen as x1,t ∼ N(0, 100) and x2,t ∼ N(0, 9). 166 Linear Regression Models (a) t y 1 ,t (b) t y 2 ,t (c) x1,ty1,t y 2 ,t (d) x2,ty2,t y 1 ,t -10 -5 0 5 10 -10 -5 0 5 10 0 100 200 300 400 5000 100 200 300 400 500 -10 -5 0 5 10 -10 -5 0 5 10 -10 -50 5 10 -10 -5 0 5 10 -10 0 10 -10 0 10 Figure 5.1 Simulating a bivariate regression model. 5.3 Estimation 5.3.1 Single Equation: Ordinary Least Squares Consider the linear regression model yt = β0 + β1x1,t + β2x2,t + ut ut ∼ iidN(0, σ 2) , (5.4) where yt is the dependent variable, x1,t and x2,t are the independent vari- ables and ut is the disturbance term. To estimate the parameters θ = {β0, β1, β2, σ} by maximum likelihood, it is necessary to use the transforma- tion of variable technique to transform the distribution of the unobservable disturbance, ut, into the distribution of yt. From equation (5.4) the pdf of ut is f(ut) = 1√ 2πσ2 exp [ − u 2 t 2σ2 ] . 5.3 Estimation 167 Using the transformation of variable technique, the pdf of yt is f(yt) = f(ut) ∣∣∣∣ ∂ut ∂yt ∣∣∣∣ = 1√ 2πσ2 exp [ −(yt − β0 − β1x1,t − β2x2,t) 2 2σ2 ] , (5.5) where ∂ut/∂yt is ∂ut ∂yt = ∂ ∂yt [ yt − β0 − β1x1,t − β2x2,t ] = 1 . Given the distribution of yt in (5.5), the log-likelihood function at time t is ln lt(θ) = − 1 2 ln(2π)− 1 2 lnσ2 − 1 2σ2 (yt − β0 − β1x1,t − β2x2,t)2 . For a sample of t = 1, 2, · · · , T observations the log-likelihood function is lnLT (θ) = − 1 2 ln(2π)− 1 2 lnσ2 − 1 2σ2T T∑ t=1 (yt − β0 − β1x1,t − β2x2,t)2. Differentiating lnLT (θ) with respect to θ yields ∂ lnLT (θ) ∂β0 = 1 σ2T T∑ t=1 (yt − β0 − β1x1,t − β2x2,t) ∂ lnLT (θ) ∂β1 = 1 σ2T T∑ t=1 (yt − β0 − β1x1,t − β2x2,t)x1,t ∂ lnLT (θ) ∂β2 = 1 σ2T T∑ t=1 (yt − β0 − β1x1,t − β2x2,t)x2,t ∂ lnLT (θ) ∂σ2 = − 1 2σ2 + 1 2σ4T T∑ t=1 (yt − β0 − β1x1,t − β2x2,t)2 . (5.6) Setting these derivatives to zero 1 σ̂2T T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t) = 0 1 σ̂2T T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)x1,t = 0 1 σ̂2T T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)x2,t = 0 − 1 2σ̂2 + 1 2σ̂4T T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)2 = 0 , (5.7) 168 Linear Regression Models and solving for θ̂ = {β̂0, β̂1, β̂2, σ̂2} yields the maximum likelihood estima- tors. For the system of equations in (5.7) an analytical solution exists. To de- rive this solution, first notice that the first three equations can be written independently of σ̂2 by multiplying both sides by T σ̂2 to give T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t) = 0 T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)x1,t = 0 T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)x2,t = 0 , which is a system of three equations and three unknowns. Writing this sys- tem in matrix form, ∑T t=1 yt∑T t=1 ytx1,t∑T t=1 ytx2,t − T ∑T t=1 x1,t ∑T t=1 x2,t∑T t=1 x1,t ∑T t=1 x 2 1,t ∑T t=1 x1,tx2,t∑T t=1 x2,t ∑T t=1 x1,tx2,t ∑T t=1 x 2 2,t β̂0 β̂1 β̂2 = 0 0 0 , and solving for [ β̂0 β̂1 β̂2 ] ′ gives β̂0 β̂1 β̂2 = T ∑T t=1 x1,t ∑T t=1 x2,t∑T t=1 x1,t ∑T t=1 x 2 1,t ∑T t=1 x1,tx2,t∑T t=1 x2,t ∑T t=1 x1,tx2,t ∑T t=1 x 2 2,t −1 ∑T t=1 yt∑T t=1 x1,tyt∑T t=1 x2,tyt , which is the ordinary least squares estimator (OLS) of [β0 β1 β2] ′. 
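The following MATLAB sketch solves the stacked normal equations on simulated data; the data-generating values follow Example 5.6 below, although the simulated draws mean the estimates will not reproduce its reported figures.

```matlab
% Solving the normal equations above for [b0; b1; b2] on simulated data.
% Data-generating values follow Example 5.6 (b0 = 1.0, b1 = 0.7, b2 = 0.3, sig2 = 4).
rng(5); T = 200;
x1 = randn(T,1); x2 = randn(T,1);
y  = 1.0 + 0.7*x1 + 0.3*x2 + 2*randn(T,1);

X    = [ones(T,1) x1 x2];          % stack the regressors, intercept first
bhat = (X'*X) \ (X'*y);            % solution of the normal equations (OLS = MLE)
uhat = y - X*bhat;                 % residuals
sig2 = mean(uhat.^2);              % ML estimator of the variance (divides by T)
```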
Once [β̂0 β̂1 β̂2] ′ is computed, the ordinary least squares estimator of the variance, σ̂2, is obtained by rearranging the last equation in (5.7) to give σ̂2 = 1 T T∑ t=1 (yt − β̂0 − β̂1x1,t − β̂2x2,t)2 . (5.8) This result establishes the relationship between the maximum likelihood estimator and the ordinary least squares estimator in the case of the single equation linear regression model. In computing σ̂2, it is common to express the denominator in (5.8) in terms of degrees of freedom, T −K, instead of merely T . Expressing σ̂2 analytically in terms of the β̂s given in (5.8) means that σ̂2 can be concentrated out of the log-likelihood function. Standard errors can 5.3 Estimation 169 be computed from the negative of the inverse Hessian. If estimation is based on the concentrated log-likelihood function, the estimated variance of σ̂2 is var(σ̂2) = 2σ̂4 T . Example 5.6 Estimating a Regression Model Consider the model yt = β0 + β1x1,t + β2x2,t + ut , ut ∼ N(0, 4) , where θ = {β0 = 1.0, β1 = 0.7, β2 = 0.3, σ2 = 4} and x1,t and x2,t are generated as N(0, 1). For a sample of size T = 200, the maximum likelihood parameter estimates without concentrating the log-likelihood function are θ̂ = {β̂0 = 1.129, β̂1 = 0.719, β̂2 = 0.389, σ̂2 = 3.862}, with covariance matrix based on the Hessian given by 1 T Ω̂ = 0.019 0.001 −0.001 0.000 0.001 0.018 0.000 0.000 −0.001 0.000 0.023 0.000 0.000 0.000 0.000 0.149 . The maximum likelihood parameter estimates obtained by concentrating the log-likelihood function are θ̂conc = { β̂0 = 1.129, β̂1 = 0.719, β̂2 = 0.389 } , with covariance matrix based on the Hessian given by 1 T Ω̂conc = 0.019 0.001 −0.001 0.001 0.018 0.000 −0.001 0.000 0.023 . The residuals at the second stage are computed as ût = yt − 1.129 − 0.719x1,t − 0.389x2,t . The residual variance is computed as σ̂2 = 1 T T∑ t=1 û2t = 1 200 200∑ t=1 û2t = 3.862, with variance var(σ̂2) = 2σ̂4 T = 2× 3.8622 200 = 0.149 . 170 Linear Regression Models For the case of K exogenous variables, the linear regression model is yt = β0 + β1x1,t + β2x2,t + · · ·+ βKxK,t + ut . This equation can also be written in matrix form, Y = Xβ + u , E[u] = 0 , cov[u] = E[uu′] = σ2IT , where IT is the T × T identity matrix and Y = y1 y2 y3 ... yT , X = 1 x1,1 . . . xK,1 1 x1,2 . . . xK,2 1 x1,3 . . . xK,3 ... ... . . . ... 1 x1,T . . . xK,T , β = β1 β2 β3 ... βK and u = u1 u2 u3 ... uT . Referring to the K = 2 case solved previously, the matrix solution is β̂ = (X ′X)−1X ′Y . (5.9) Once β̂ has been computed, an estimate of the variance σ̂2 is σ̂2 = û′û T −K . 5.3.2 Multiple Equations: FIML The maximum likelihood estimator for systems of equations is commonly referred to as the full-information maximum likelihood estimator (FIML). Consider the system of equations in (5.1). For a system of N equations, the density of ut is assumed to be the multivariate normal distribution f(ut) = ( 1√ 2π )N |V |−1/2 exp [ −1 2 utV −1u′t ] . Using the transformation of variable technique, the density of yt becomes f(yt) = f(ut) ∣∣∣∣ ∂ut ∂yt ∣∣∣∣ = ( 1√ 2π )N |V |−1/2 exp [ −1 2 (ytB + xtA)V −1(ytB + xtA) ′ ] |B| , because from equation (5.1) ut = ytB + xtA ⇒ ∂ut ∂yt = B . 5.3 Estimation 171 The log-likelihood function at time t is ln lt(θ) = − N 2 ln(2π) − 1 2 ln |V |+ ln |B| − 1 2 (ytB + xtA)V −1(ytB + xtA) ′ , and given t = 1, 2, · · · , T observations, the log-likelihood function is lnLT (θ) = − N 2 ln(2π)−1 2 ln |V |+ln |B|− 1 2T T∑ t=1 (ytB+xtA)V −1(ytB+xtA) ′. 
(5.10) The FIML estimator of the parameters of the model is obtained by differ- entiating lnLT (θ) with respect to θ, setting these derivatives to zero and solving to find θ̂. As in the estimation of the single equation model, estima- tion can be simplified by concentrating the likelihood with respect to the estimated covariance matrix V̂ . For the N system of equations, the residual covariance matrix is computed as V̂ = 1 T ∑T t=1 û 2 1,t ∑T t=1 û1,tû2,t · · · ∑T t=1 û1,tûN,t∑T t=1 û2,tû1,t ∑T t=1 û 2 2,t ∑T t=1 û2,tûN,t ... ... ...∑T t=1 ûN,tû1,t ∑T t=1 ûN,tû2,t · · · ∑T t=1 û 2 N,t , and V̂ can be substituted for V in equation (5.10). This eliminates the need to estimate the variance parameters directly, thus reducing the dimension- ality of the estimation problem. Note that this approach is appropriate for simultaneous models based on normality. For other models based on non- normal distributions, all the parameters may need to be estimated jointly. Further, if standard errors of V̂ are also required then these can be conve- niently obtained by estimating all the parameters. Example 5.7 FIML Estimation of a Structural Model Consider the bivariate model introduced in Example 5.3, where the un- knownparameters are θ = {β, γ, α, σ11, σ22}. The log-likelihood function is lnLT (θ) = − N 2 ln(2π) − 1 2 ln |σ11σ22|+ ln |1− βγ| − 1 2σ11T T∑ t=1 (y1,t − βy2,t)2 − 1 2σ22T T∑ t=1 (y2,t − γy1,t − αxt)2 . 172 Linear Regression Models The first-order derivatives of lnLT (θ) with respect to θ are ∂ lnLT (θ) ∂β = γ 1− βγ + 1 σ11T T∑ t=1 (y1,t − βy2,t)y2,t ∂ lnLT (θ) ∂γ = − β 1− βγ + 1 σ22T T∑ t=1 (y2,t − γy1,t − αxt)y1,t ∂ lnLT (θ) ∂α = 1 σ22T T∑ t=1 (y2,t − γy1,t − αxt)xt ∂ lnLT (θ) ∂σ11 = − 1 2σ11 + 1 2σ211T T∑ t=1 (y1,t − βy2,t)2 ∂ lnLT (θ) ∂σ22 = − 1 2σ22 + 1 2σ222T T∑ t=1 (y2,t − γy1,t − αxt)2. Setting these derivatives to zero yields γ̂ 1− β̂γ̂ + 1 σ̂11T T∑ t=1 (y1,t − β̂y2,t)y2,t = 0 (5.11) − β̂ 1− β̂γ̂ + 1 σ̂22T T∑ t=1 (y2,t − γ̂y1,t − α̂xt)y1,t = 0 (5.12) 1 σ̂22T T∑ t=1 (y2,t − γ̂y1,t − α̂xt)xt = 0 (5.13) − 1 2σ̂11 + 1 2 σ̂211T T∑ t=1 (y1,t − β̂y2,t)2 = 0 (5.14) − 1 2σ̂22 + 1 2 σ̂222T T∑ t=1 (y2,t − γ̂y1,t − α̂xt)2 = 0, (5.15) and solving for θ̂ = {β̂, γ̂, α̂, σ̂11, σ̂22} gives the maximum likelihood estima- 5.3 Estimation 173 tors β̂ = ∑T t=1 y1,txt∑T t=1 y2,txt γ̂ = ∑T t=1 y2,tû1,t ∑T t=1 x 2 t − ∑T t=1 xtû1,t ∑T t=1 y2,txt∑T t=1 y1,tû1,t ∑T t=1 x 2 t − ∑T t=1 xtû1,t ∑T t=1 y1,txt α̂ = ∑T t=1 y1,tû1,t ∑T t=1 y2,txt − ∑T t=1 y1,txt ∑T t=1 y2,tû1,t∑T t=1 y1,tû1,t ∑T t=1 x 2 t − ∑T t=1 xtû1,t ∑T t=1 y1,txt σ̂11 = 1 T T∑ t=1 (y1,t − β̂y2,t)2 σ̂22 = 1 T T∑ t=1 (y2,t − γ̂y1,t − α̂xt)2 . Full details of the derivation of these equations are given in Appendix C. Note that σ̂11 and σ̂22 are obtained having already computed the estimators β̂, γ̂ and α̂. This suggests that a further simplification can be achieved by concentrating the variances and covariances of ût out of the log-likelihood function, by defining û1,t = y1,t − β̂y2,t û2,t = y2,t − γ̂y1,t − α̂xt, and then maximizing lnLT (θ) with respect to β̂, γ̂, and α̂ where V̂ = 1 T T∑ t=1 û21,t 0 0 T∑ t=1 û22,t . The key result from Section 5.3.1 is that an analytical solution for the maximum likelihood estimator exists for a single linear regression model. It does not necessarily follow, however, that an analytical solution always exists for systems of linear equations. While Example 5.7 is an exception, such exceptions are rare and an iterative algorithm, as discussed in Chapter 3, must usually be used to obtain the maximum likelihood estimates. 
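A direct way to carry out this maximisation is to code the average log-likelihood in (5.10), concentrating V out as the residual covariance matrix, and pass it to a numerical optimiser. The following MATLAB sketch does this for the bivariate model of Example 5.5, assuming the (T × 2) data matrices Y = [y1 y2] and X = [x1 x2] are already in memory; the starting values are illustrative, and in a script file the local function must appear at the end.

```matlab
% FIML for the bivariate model of Example 5.5 by direct numerical
% maximisation of (5.10); theta = [beta1; alpha1; beta2; alpha2].
theta0   = [0.7; 0.3; 0.1; 0.2];                       % illustrative starting values
thetahat = fminsearch(@(th) -fiml_avgll(th, Y, X), theta0);
disp(thetahat')                                        % compare with Example 5.5 values

function logl = fiml_avgll(theta, Y, X)
    % average log-likelihood (5.10) with V concentrated out
    [T, N] = size(Y);
    B = [ 1,         -theta(3) ;
         -theta(1),   1        ];
    A = [-theta(2),   0        ;
          0,         -theta(4) ];
    U = Y*B + X*A;                 % structural disturbances u_t = y_t*B + x_t*A
    V = (U'*U)/T;                  % concentrated residual covariance matrix
    logl = -N/2*log(2*pi) + log(abs(det(B))) - 0.5*log(det(V)) ...
           - 0.5*mean(sum((U/V).*U, 2));
end
```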
Example 5.8 FIML Estimation Based on Iteration This example uses the simulated data with T = 500 given in Figure 5.1 174 Linear Regression Models based on the model specified in Example 5.5. The steps to estimate the parameters of this model by FIML are as follows. Step 1: Starting values are chosen at random to be θ(0) = {β1 = 0.712, α1 = 0.290, β2 = 0.122, α2 = 0.198} . Step 2: Evaluate the parameter matrices at the starting values B(0) = [ 1 −β2 −β1 1 ] = [ 1 −0.122 −0.712 1 ] A(0) = [ −α1 0 0 −α2 ] = [ −0.290 0.000 0.000 −0.198 ] . Step 3: Compute the residuals at the starting values û1,t = y1,t − 0.712 y2,t − 0.290x1,t û2,t = y2,t − 0.122 y1,t − 0.198x2,t . Step 4: Compute the residual covariance matrix at the starting estimates V(0) = 1 500 T∑ t=1 û21,t T∑ t=1 û1,tû2,t T∑ t=1 û1,tû2,t T∑ t=1 û22,t = [ 1.213 0.162 0.162 5.572 ] . Step 5: Compute the log-likelihood function for each observation at the starting values ln lt(θ) = − N 2 ln(2π)− 1 2 ln ∣∣V(0) ∣∣+ ln ∣∣B(0) ∣∣ −1 2 (ytB(0) + xtA(0))V −1 (0) (ytB(0) + xtA(0)) ′ . Step 6: Iterate until convergence using a gradient algorithm with the deriva- tives computed numerically. The residual covariance matrix is com- puted using the final estimates as follows V̂ = 1 500 T∑ t=1 û21,t T∑ t=1 û1,tû2,t T∑ t=1 û1,tû2,t T∑ t=1 û22,t = [ 0.952 0.444 0.444 0.967 ] . 5.3 Estimation 175 Table 5.1 FIML estimates of the bivariate model. Standard errors are based on the Hessian. Population Estimate Std error t-stat. β1 = 0.6 0.592 0.027 21.920 α1 = 0.4 0.409 0.008 50.889 β2 = 0.2 0.209 0.016 12.816 α2 = −0.5 -0.483 0.016 -30.203 The FIML estimates are given in Table 5.1 with standard errors based on the Hessian. The parameter estimates are in good agreement with their population counterparts given in Example 5.5. 5.3.3 Identification The set of first-order conditions given by equations (5.11) - (5.15) is a sys- tem of five equations and five unknowns θ̂ = {β̂, γ̂, α̂, σ̂11, σ̂22}. The issue as to whether there is a unique solution is commonly referred to as the identification problem. There exist two conditions for identification: (1) A necessary condition for identification is that there are at least as many equations as there are unknowns. This is commonly known as the order condition. (2) A necessary and sufficient condition for the system of equations to have a solution is that the Jacobian of this system needs to be nonsingular, which is equivalent to the Hessian or information matrix being nonsin- gular. This is known as the rank condition for identification. An alternative way to understand the identification problem is to note that the structural system in (5.1) and the reduced form system in (5.2) are alternative representations of the same system of equations bound by the relationships Π = −AB−1, E [v′tvt] = ( B−1 )′ V B−1 , (5.16) where the dimensions of the relevant parameter matrices are as follows Reduced form: Π is (N ×K) E[v′tvt] is (N(N + 1)/2) Structural form: A is (N ×K), B is (N ×N) V is (N(N + 1)/2). 176 Linear Regression Models This equivalence implies that estimation can proceed directy via the struc- tural form to compute A, B and V directly, or indirectly via the reduced form with these parameter matrices being recovered from Π and E[v′tvt]. For this latter step to be feasible, the system of equations in (5.16) needs to have a solution. The total number of parameters in the reduced form isNK+N (N + 1) /2, while the structural system has at most N2+NK+N(N+1)/2 parameters. 
This means that there are potentially (NK +N2 +N(N + 1)/2) − (NK +N(N + 1)/2) = N2 , more parameters in the structural form than in the reduced form. In order to obtain unique estimates of the structural parameters from the reduced form parameters, it is necessary to reduce the number of unknown structural parameters by at least N2. Normalization of the system, by designating yi,t as the dependent variable in the ith equation for i = 1, · · · , N , imposes N restrictions leaving a further N2 −N restrictions yet to be imposed. These additional restrictions can take several forms, including zero restrictions, cross-equation restrictions and restrictions on the covariance matrix of the disturbances, V . Restrictions on the covariance matrix of the disturbances are fundamental to identification in the structural vector autoregression lit- erature (Chapter 14). Example 5.9 Identification in a Bivariate Simultaneous System Consider the bivariate simultaneous system introduced in Example 5.3 and developed in Example 5.7 where the structural parameter matrices are B = [ 1 −γ −β 1 ] , A = [ 0 −α ] , V = [ σ11 0 0 σ22 ] . The system of equations to be solved consists of the two equations Π = −AB−1 = − [ 0 −α ] [ 1 −γ −β 1 ]−1 = [ − αβ βγ − 1 − α βγ − 1 ] , and three unique equations obtained from the covariance restrictions E [ v′tvt ] = ( B−1 )′ V B−1 = [ 1 −β −γ 1 ]−1 [ σ1,1 0 0 σ2,2 ] [ 1 −γ −β 1 ]−1 = σ11 + β 2σ22 (βγ − 1)2 γσ11 + βσ22 (βγ − 1)2 γσ11 + βσ22 (βγ − 1)2 σ22 + γ 2σ11 (βγ − 1)2 , 5.3 Estimation 177 representing a system of 5 equations in 5 unknowns θ = {β, γ, α, σ11, σ22}. If the number of parameters in the reduced form and the structural model are equal, the system is just identified resulting in an unique solution. If the reduced form has more parameters in than the structuralmodel, the system is over identified. In this case, the system (5.16) has more equations than unknowns yielding non-unique solutions, unless the restrictions of the model are imposed. The system (5.16) is under identified if the number of reduced form parameters is less than the number of structural parameters. A solution of the system of first-order conditions of the log-likelihood function now does not exist. This means that the Jacobian of this system, which of course is also the Hessian of the log-likelihood function, is singular. Any attempt to estimate an under-identified model using the iterative algorithms from Chapter 3 will be characterised by a lack of convergence and an inability to compute standard errors since it is not possible to invert the Hessian or information matrix. 5.3.4 Instrumental Variables Instrumental variables estimation is another method that is important in es- timating the parameters of simultaneous systems of equations. The ordinary least squares estimator of the structural parameter β in the set of equations y1,t = βy2,t + u1,t y2,t = γy1,t + αxt + u2,t, (5.17) is β̂OLS = ∑T t=1 y1,ty2,t∑T t=1 y2,ty2,t . The ordinary least squares estimator, however, is not a consistent estimator of β because y2,t is not independent of the disturbance term u1,t. From Example 5.7, the FIML estimator of β is β̂ = ∑T t=1 y1,txt∑T t=1 y2,txt , (5.18) which from the properties of the FIML estimator is a consistent estimator. The estimator in (5.18) is also known as an instrumental variable (IV) es- timator. 
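The difference between the ordinary least squares and instrumental variables estimators of beta is easily illustrated by simulation. The MATLAB sketch below simulates the system in (5.17) via its reduced form and compares the two estimators; the parameter values and disturbance variances are purely illustrative and are not taken from the text.

    % Simultaneity bias of OLS and its correction by the IV estimator (5.18).
    % Parameter values are illustrative only.
    T = 10000;  beta = 0.6;  gamma = 0.4;  alpha = -0.5;
    x  = 10*randn(T,1);                                  % exogenous variable
    u1 = randn(T,1);  u2 = randn(T,1);                   % structural disturbances
    y2 = (alpha*x + gamma*u1 + u2)/(1 - beta*gamma);     % reduced form for y2
    y1 = beta*y2 + u1;

    beta_ols = sum(y1.*y2)/sum(y2.^2);                   % inconsistent: y2 is correlated with u1
    beta_iv  = sum(y1.*x)/sum(y2.*x);                    % consistent: x is used as the instrument

In a large sample the IV estimate settles near the true value of beta, whereas the OLS estimate does not, reflecting the dependence between y2,t and u1,t.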
While the variable xt is not included as an explanatory variable in the first structural equation in (5.17), it nonetheless is used to correct the dependence between y2,t and u1,t by acting as an instrument for y2,t. A 178 Linear Regression Models quick way to see this is to multiply both sides of the structural equation by xt and take expectations E [y1,txt] = βE [y2,txt] + E [u1,txt] . As xt is exogenous in the system of equations, E [u1,txt] = 0 and rearranging gives β = E [y1,txt] /E [y2,txt]. Replacing the expectations in this expression by the corresponding sample moments gives the instrumental variables esti- mator in (5.18). The FIML estimator of all of the structural parameters of the bivariate simultaneous system derived in Example 5.7 can be interpreted in an instru- mental variables context. To demonstrate this point, rearrange the first-order conditions from Example 5.7 to be T∑ t=1 ( y1,t − β̂y2,t ) xt = 0 T∑ t=1 (y2,t − γ̂y1,t − α̂xt) û1,t = 0 (5.19) T∑ t=1 (y2,t − γ̂y1,t − α̂xt) xt = 0. The first equation shows that β is estimated by using xt as an instrument for y2,t. The second and third equations show that γ and α are estimated jointly by using û1,t = y1,t− β̂y2,t as an instrument for y1,t, and xt as its own instrument, where û1,t is obtained as the residuals from the first instrumental variables regression. Thus, the FIML estimator is equivalent to using an instrumental variables estimator applied to each equation separately. This equivalence is explored in a numerical simulation in Exercise 7. The discussion of the instrumental variables estimator highlights two key properties that an instrument needs to satisfy, namely, that the instruments are correlated with the variables they are instrumenting and the instruments are uncorrelated with the disturbance term. The choice of the instrument xt in (5.18) naturally arises from having specified the full model in the first place. Moreover, the construction of the other instrument û1,t also naturally arises from the first-order conditions in (5.19) to derive the FIML estimator. In many applications, however, only the single equation is specified leaving the choice of the instrument(s) xt to the discretion of the researcher. Whilst the properties that a candidate instrument needs to satisfy in theory are transparent, whether a candidate instrument satisfies the two properties in practice is less transparent. 5.3 Estimation 179 If the instruments are correlated with the variables they are instrument- ing, the distribution of the instrumental variables (and FIML) estimators are asymptotically normal. In this example, the focus is on understanding the properties of the sampling distribution of the estimator where this re- quirement is not satisfied. This is known as the weak instrument problem. Example 5.10 Weak Instruments Consider the simple model y1,t = βy2,t + u1,t y2,t = φxt + u2,t, where ut ∼ N ([ 0 0 ] , [ σ11 σ12 σ12 σ22 ]) . in which y1,t and y2,t are the dependent variables and xt is an exogenous variable. The parameter σ12 controls the strength of the simultaneity bias, where a value of σ12 = 0 would mean that an ordinary least squares regres- sion of y1,t on y2,t results in a consistent estimator of β that is asymptotically normal. The parameter φ controls the strength of the instrument. A value of φ = 0 means that there is no correlation between y2,t and xt, in which case xt is not a valid instrument. The weak instrument problem occurs when the value of φ is ‘small’ relative to σ22. 
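A Monte Carlo experiment makes the problem concrete. The MATLAB sketch below is a simplified stand-in for the book's weak-instrument programs; it uses the design described in the remainder of this example and stores the instrumental variables estimate of beta from each replication so that its sampling distribution can be inspected.

    % Sampling distribution of the IV estimator with a weak instrument (Example 5.10).
    T = 5;  nrep = 10000;
    beta = 0;  phi = 0.25;  s12 = 0.99;                  % design given below
    b_iv = zeros(nrep,1);
    for r = 1:nrep
        x  = randn(T,1);
        u1 = randn(T,1);
        u2 = s12*u1 + sqrt(1 - s12^2)*randn(T,1);        % corr(u1,u2) = 0.99, unit variances
        y2 = phi*x + u2;
        y1 = beta*y2 + u1;
        b_iv(r) = sum(y1.*x)/sum(y2.*x);                 % IV estimate for this replication
    end
    hist(b_iv(abs(b_iv) < 2), 50)                        % crude view of the sampling distribution

A histogram of the trimmed draws is used here purely for display; a kernel density estimate of the same draws gives the smoother picture reported below.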
Let the parameter values be β = 0, φ = 0.25, σ11 = 1, σ22 = 1 and σ12 = 0.99. Assume further that xt ∼ N(0, 1). The sampling distribution of the instrumental variables estimator, computed by Monte Carlo methods for a sample of size T = 5 with 10, 000 replications, is given in Figure 5.2. The sampling distribution is far from being normal or centered on the true value of β = 0. In fact, the sampling distribution is bimodal with neither of the two modes being located near the true value of β. By increasing the value of φ, the sampling distribution of the instrumental variables estimator approaches normality with its mean located at the true value of β = 0. A necessary condition for instrumental variable estimation is that there are at least as many instruments, K, as variables requiring to be instru- mented, M . From the discussion of the identification problem in Section 5.3.3, the model is just identified when K = M , is over identified when K > M and is under identified when K < M . Letting X represent a (T×K) matrix containing the K instruments, Y1 a (T ×M) matrix of dependent variables and Y2 represents a (T ×M) matrix containing the M variables to be instrumented. In matrix notation, the instrumental variables estimator 180 Linear Regression Models β̂IV f ( β̂ IV ) -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Figure 5.2 Sampling distribution of the instrumental variables estimator in the presence of a weak instrument. The distribution is approximated using a kernel estimate of density based on a Gaussian kernel with bandwidth h = 0.07. of a single equation is θ̂IV = (Y ′ 2X(X ′X)−1X ′Y2) −1(Y ′2X(X ′X)−1X ′Y1) . (5.20) The covariance matrix of the instrumental variable estimator is Ω̂IV ) = σ̂ 2(Y ′2X(X ′X)−1X ′Y1) −1, (5.21) where σ̂2 is the residual variance. For the case of a just identified model, M = K, and the instrumental variable estimator reduces to θ̂IV = (X ′Y2) −1X ′Y1, (5.22) which is the multiple regression version of (5.18) expressed in matrix nota- tion. Example 5.11 Modelling Contagion Favero and Giavazzi (2002) propose the following bivariate model to test for contagion r1,t = α1,2r2,t + θ1r1,t−1 + γ1,1d1,t + γ1,2d2,t + u1,t r2,t = α2,1r1,t + θ2r2,t−1 + γ2,1d1,t + γ2,2d2,t + u2,t, 5.3 Estimation 181 where r1,t and r2,t are the returns in two asset markets and d1,t and d2,t are dummy variables representing an outlier in the returns of the ith asset. A test of contagion from asset market 2 to 1 is given by the null hypothesis γ1,2 = 0. As each equation includes an endogenous explanatory variable the model is estimated by FIML. FIML is equivalent to instrumental variables with instruments r1,t−1 and r2,t−1 because the model is just identified. However, the autocorrelation in returns is likely to be small and potentially zero from anefficient-markets point of view, resulting in weak instrument problems. 5.3.5 Seemingly Unrelated Regression An important special case of the simultaneous equations model is the seem- ingly unrelated regression model (SUR) where each dependent variable only occurs in one equation, so that the structural coefficient matrix B in equa- tion (5.1) is an (N ×N) identity matrix. Example 5.12 Trivariate SUR Model An example of a trivariate SUR model is y1,t = α1x1,t + u1,t y2,t = α2x2,t + u2,t y3,t = α3x3,t + u3,t, where the disturbance term ut = [u1,t u2,t u3,t] has the properties ut ∼ iidN 0 0 0 , σ1,1 σ2,1 σ3,1 σ2,1 σ2,2 σ2,3 σ3,1 σ3,2 σ3,3 . 
In matrix notation, this system is written as yt + xtA = ut , where yt = [y1,t y2,t y3,t] and xt = [x1,t x2,t x3,t] and A is a diagonal matrix A = −α1 0 0 0 −α2 0 0 0 −α3 . The log-likelihood function is lnLT (θ) = − N 2 ln(2π)− 1 2 ln |V | − 1 2T T∑ t=1 (yt + xtA) ′V −1(yt + xtA) , 182 Linear Regression Models where N = 3. This expression is maximized by differentiating lnLT (θ) with respect to the vector of parameters θ = {α1, α2, α3, σ1,1, σ2,1, σ2,2, σ3,1, σ3,2, σ3,3} and setting these derivatives to zero to find θ̂. Example 5.13 Equivalence of SUR and OLS Estimates Consider the class of SUR models where the independent variables are the same in each equation. An example is yi,t = αixt + ui,t, where ut = (u1,t, u2,t, · · · , uN,t) ∼ N(0, V ). For this model, A = [−α1 − α2 · · · −αN ] and estimation of the model by maximum likelihood yields the same estimates as ordinary least squares applied to each equation individu- ally. 5.4 Testing The three tests developed in Chapter 4, namely the likelihood ratio (LR), Wald (W) and Lagrange Multiplier (LM) statistics are now applied to test- ing the parameters of single and multiple equation linear regression models. Depending on the choice of covariance matrix, various asymptotically equiv- alent forms of the test statistics are available (see Chapter 4). Example 5.14 Testing a Single Equation Model Consider the regression model yt = β0 + β1x1,t + β2x2,t + ut ut ∼ iidN(0, 4) , where θ = {β0 = 1.0, β1 = 0.7, β2 = 0.3, σ2 = 4} and x1,t and x2,t are generated as N(0, 1). The model is simulated with a sample of size T = 200 and maximum likelihood estimates of the parameters are reported in Example 5.6. Now consider testing the hypotheses H0 : β1 + β2 = 1 , H0 : β1 + β2 6= 1 . The unrestricted and restricted maximum likelihood parameter estimates are given in Table 5.2. The restricted parameter estimates are obtained by imposing the restric- tion β1 + β2 = 1, by writing the model as yt = β0 + β1x1,t + (1− β1)x2,t + ut. The LR statistic is computed as LR = −2(T lnLT (θ̂0)− T lnLT (θ̂1)) = −2× (−419.052 + 418.912) = 0.279 , 5.4 Testing 183 Table 5.2 Unrestricted and restricted parameter estimates of the single equation regression model. Parameter Unrestricted Restricted β0 1.129 1.129 β1 0.719 0.673 β2 0.389 0.327 σ2 3.862 3.868 lnLT (θ) −2.0946 −2.0953 which is distributed asymptotically as χ21 under H0. The p-value is 0.597 showing that the restriction is not rejected at the 5% level. Based on the assumption of a normal distribution for the disturbance term, an alternative form for the LR statistic for a single equation model is LR = T (ln σ̂20 − ln σ̂21). The alternative form of this statistic yields the same value: LR = T (ln σ̂20 − ln σ̂21) = 200 × (ln 3.876 − ln 3.8622) = 0.279 . To compute the Wald statistic, define R = [ 0 1 1 0 ], Q = [ 1 ] , and compute the negative Hessian matrix −HT (θ̂1) = 0.259 −0.016 0.014 0.000 −0.016 0.285 −0.007 0.000 0.014 −0.007 0.214 0.000 0.000 0.000 0.000 0.034 . The Wald statistic is then W = T [Rθ̂1 −Q]′[R (−H−1T (θ̂1))R′]−1[Rθ̂1 −Q] = 0.279 , which is distributed asymptotically as χ21 under H0. The p-value is 0.597 showing that the restriction is not rejected at the 5% level. 
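In code, the alternative form of the LR statistic and the Wald statistic just reported amount to a few lines. The MATLAB sketch below assumes the restricted and unrestricted models have already been estimated, with the quantities stored under the names given in the comments (the names themselves are illustrative); here T = 200 and the negative Hessian is the matrix reported above.

    % LR and Wald tests of H0: beta1 + beta2 = 1 (Example 5.14).
    % s2_r, s2_u : restricted and unrestricted ML residual variances
    % theta1     : unrestricted estimates [beta0; beta1; beta2; sig2]
    % nH         : negative Hessian -H_T(theta1);  T : sample size
    LR = T*(log(s2_r) - log(s2_u));                      % alternative form of the LR statistic

    R = [0 1 1 0];  Q = 1;
    W = T*(R*theta1 - Q)'*inv(R*inv(nH)*R')*(R*theta1 - Q);   % Wald statistic
    % p-values follow from the chi-squared(1) distribution,
    % e.g. 1 - chi2cdf(LR,1) with the Statistics Toolbox.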
The LM statistic requires evaluating the gradients of the unrestricted model at the restricted estimates G′T (θ̂0) = [ 0.000 0.013 0.013 0.000 ] , 184 Linear Regression Models and computing the inverse of the outer product of gradients matrix evaluated at θ̂0 J−1T (θ̂0) = 3.967 −0.122 0.570 −0.934 −0.122 4.158 0.959 −2.543 0.570 0.959 5.963 −1.260 −0.934 −2.543 −1.260 28.171 . Using these terms in the LM statistic gives LM = TG′T (θ̂0)J −1 T (θ̂0)GT (θ̂0) = 0.399 , which is distributed asymptotically as χ21 under H0. The p-value is 0.528 showing that the restriction is still not rejected at the 5% level. The form of the LR, Wald and LM test statistics in the case of multiple equation regression models is the same as it is for single equation regression models. Once again an alternative form of the LR statistic is available as a result of the assumption of normality. Recall from equation (5.10) that the log-likelihood function for a multiple equation model is lnLT (θ) = − N 2 ln(2π)−1 2 ln |V |+ln |B|− 1 2T T∑ t=1 (ytB+xtA)V −1(ytB+xtA) ′. The unrestricted maximum likelihood estimator of V is V̂1 = 1 T T∑ t=1 û′tût, ût = ytB̂1 + xtÂ1 . The log-likelihood function evaluated at the unrestricted estimator is lnLT (θ̂1) = − N 2 ln(2π)− 1 2 ln |V̂1|+ ln |B̂1| − 1 2T T∑ t=1 (ytB̂1 + xtÂ1)V̂ −1 1 (ytB̂1 + xtÂ1) ′ = −N 2 (1 + ln 2π)− 1 2 ln |V̂1|+ ln |B̂1| , which uses the result from Chapter 4 that 1 T T∑ t=1 ûtV̂ −1 1 û ′ t = N. 5.4 Testing 185 Similarly, the log-likelihood function evaluated at the restricted estimator is lnLT (θ̂0) = − N 2 ln(2π)− 1 2 ln |V̂0|+ ln |B̂0| − 1 2T T∑ t=1 (ytB̂0 + xtÂ0)V̂ −1 0 (ytB̂0 + xtÂ0) ′ = −N 2 (1 + ln 2π)− 1 2 ln |V̂0|+ ln |B̂0|, where V̂0 = 1 T T∑ t=1 v′tvt, vt = ytB̂0 + xtÂ0 . The LR statistic is LR = −2[lnLT (θ̂0)− lnLT (θ̂1)] = T (ln |V̂0| − ln |V̂1|)− 2T (ln |B̂0| − ln |B̂1|). In the special case of the SUR model where B = IN , the LR statistic is LR = T (ln |V̂0| − ln |V̂1|) , which is the alternative form given in Chapter 4. Example 5.15 Testing a Multiple Equation Model Consider the model y1,t = β1y2,t + α1x1,t + u1,t y2,t = β2y1,t + α2x2,t + u2,t, ut ∼ iidN ([ 0 0 ] , V = [ σ11 σ12 σ12 σ22 ]) , in which the hypotheses H0 : α1 + α2 = 0 , H0 : α1 + α2 6= 0 , are to be tested. The unrestricted and restricted maximum likelihood pa- rameter estimates are given in Table 5.3. The restricted parameter estimates are obtained by imposing the restriction α2 = −α1, by writing the model as y1,t = β1y2,t + α1x1,t + u1,t y2,t = β2y1,t − α1x2,t + u2,t . The LR statistic is computed as LR = −2(T lnLT (θ̂0)−T lnLT (θ̂1)) = −2×(−1410.874+1403.933) = 13.88 , which is distributed asymptotically as χ21 under H0. The p-value is 0.000 186 Linear Regression Models Table 5.3 Unrestricted and restricted parameter estimates of the multiple equation regression model. Parameter Unrestricted Restricted β1 0.592 0.533 α1 0.409 0.429 β2 0.209 0.233 α2 −0.483 −0.429 σ̂11 0.952 1.060 σ̂12 0.444 0.498 σ̂22 0.967 0.934 lnLT (θ) −2.8079 −2.8217 showing that the restriction is rejected at the 5% level. The alternative form of this statistic gives LR = T (ln |V̂0| − ln |V̂1|)− 2T (ln |B̂0| − ln |B̂1|) = 500 ( ln ∣∣∣∣ 1.060 0.498 0.498 0.934 ∣∣∣∣− ln ∣∣∣∣ 0.952 0.444 0.444 0.967 ∣∣∣∣ ) −2× 500 ( ln ∣∣∣∣ 1.000 −0.233 −0.533 1.000 ∣∣∣∣− ln ∣∣∣∣ 1.000 −0.209 −0.592 1.000 ∣∣∣∣ ) = 13.88. 
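The determinant form is easily checked numerically. A minimal MATLAB calculation using the rounded estimates reported in Table 5.3, so that the result matches 13.88 only up to rounding error in those reported values, is:

    % Determinant form of the LR statistic, Example 5.15 (T = 500).
    T  = 500;
    V0 = [1.060 0.498; 0.498 0.934];   B0 = [1 -0.233; -0.533 1];
    V1 = [0.952 0.444; 0.444 0.967];   B1 = [1 -0.209; -0.592 1];
    LR = T*(log(det(V0)) - log(det(V1))) - 2*T*(log(abs(det(B0))) - log(abs(det(B1))));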
To compute the Wald statistic, define R = [ 0 1 0 1 ], Q = [ 0 ], and compute the negative Hessian matrix −HT (θ̂1) = 3.944 4.513 −1.921 2.921 4.513 44.620 −9.613 0.133 −1.921 −9.613 10.853 −3.823 2.921 0.133 −3.823 11.305 , where θ̂1 corresponds to the concentrated parameter vector. TheWald statis- tic is W = T [Rθ̂1 −Q]′[R (−H−1T (θ̂1))R′]−1[Rθ̂1 −Q] = 13.895 , which is distributed asymptotically as χ21 under H0. The p-value is 0.000 showing that the restriction is rejected at the 5% level. 5.5 Applications 187 The LM statisticrequires evaluating the gradients of the unrestricted model at the restricted estimates G′T (θ̂0) = [ 0.000 −0.370 0.000 −0.370 ], and computing the inverse of the outer product of gradients matrix evaluated at θ̂0 J−1T (θ̂0) = 0.493 −0.071 −0.007 −0.133 −0.071 0.042 0.034 0.025 −0.007 0.034 0.123 0.040 −0.133 0.025 0.040 0.131 . Using these terms in the LM statistic gives LM = TG′T (θ̂0)J −1 T (θ̂0)GT (θ̂0) = 15.325, which is distributed asymptotically as χ21 under H0. The p-value is 0.000 showing that the restriction is rejected at the 5% level. 5.5 Applications To highlight the details of estimation and testing in linear regression models two applications are now presented. The first involves estimating a static version of the Taylor rule for the conduct of monetary policy using U.S. macroeconomic data. The second estimates the well-known Klein macroe- conomic model for the U.S. 5.5.1 Linear Taylor Rule In a seminal paper, Taylor (1993) suggests that the monetary authorities follow a simple rule for setting monetary policy. The rule requires policy- makers to adjust the quarterly average of the money market interest rate (Federal Funds Rate), it, in response to four-quarter inflation, πt, and the gap between output and its long-run potential level, yt, according to it = β0 + β1πt + β2yt + ut , ut ∼ N(0, σ 2) . Taylor suggested values of β1 = 1.5 and β2 = 0.5. This static linear version of the so-called Taylor rule is a linear regression model with two independent variables of the form discussed in detail in Section 5.3. The parameters of the model are estimated using data from the U.S. for the period 1987:Q1 to 1999:Q4, a total of T = 52 observations. The variables 188 Linear Regression Models are defined in Rudebusch (2002, p1164) in his study of the Taylor rule, with πt and yt computed as πt = 400 × 3∑ j=0 (log pt−j − log pt−j−1) , yt = 100× ((qt − q∗t )/qt , and where pt is the U.S. GDP deflator, qt is real U.S. GDP and q ∗ t is real potential GDP as estimated by the Congressional Budget Office. The data are plotted in Figure 5.3. P er ce n t 1965 1970 1975 1980 1985 1990 1995 2000 -5 0 5 10 15 Figure 5.3 U.S. data on the Federal Funds Rate (dashed line), the inflation gap (solid line) and the output gap (dotted line) as defined by Rudebusch (2002, p1164). The log-likelihood function is lnLT (θ) = − 1 2 ln(2π) − 1 2 lnσ2 − 1 2σ2T T∑ t=1 (it − β0 − β1πt − β2yt)2 , with θ = {β0, β1, β2, σ2}. In this particular case, the first-order conditions are solved to yield closed-form solutions for the maximum likelihood esti- mators that are also the ordinary least squares estimators. The maximum likelihood estimates are β̂0 β̂1 β̂2 = 53.000 132.92 −40.790 132.92 386.48 −123.79 −40.790 −123.79 147.77 −1 305.84 822.97 −192.15 = 2.98 1.30 0.61 . Once [ β̂0 β̂1 β̂2 ] ′ is computed, the ordinary least squares estimate of the 5.5 Applications 189 variance, σ̂2, is obtained from σ̂2 = 1 T T∑ t=1 (it − 2.98− 1.30πt − 0.61yt)2 = 1.1136 . 
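In code these estimates follow from a single regression, since the maximum likelihood and ordinary least squares estimators coincide for this model. The MATLAB sketch below is a simplified version of what the Taylor rule program would do; the variable names ffr, infl and ygap for the interest rate, inflation and output gap data are illustrative.

    % ML (= OLS) estimation of the static Taylor rule.
    % ffr, infl, ygap : (T x 1) vectors holding i_t, pi_t and y_t.
    T  = length(ffr);
    X  = [ones(T,1) infl ygap];
    b  = (X'*X)\(X'*ffr);                                % [beta0; beta1; beta2]
    u  = ffr - X*b;
    s2 = (u'*u)/T;                                       % ML estimate of sigma^2 (divisor T)
    Vb = s2*inv(X'*X);                                   % covariance matrix of the estimates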
The covariance matrix of θ̂ = {β̂0, β̂1, β̂2} is 1 T Ω̂ = 0.1535 −0.0536 −0.0025 −0.0536 0.0227 0.0042 −0.0025 0.0042 0.0103 . The estimated monetary policy response coefficients, namely, β̂1 = 1.30 for inflation and β̂2 = 0.61 for the response to the output gap, are not dissimilar to the suggested values of 1.5 and 0.5, respectively. A Wald test of the restrictions β1 = 1.50 and β2 = 0.5 yields a test statistic of 4.062. From the χ22 distribution, the p-value of this statistic is 0.131 showing that the restrictions cannot be rejected at conventional significance levels. 5.5.2 The Klein Model of the U.S. Economy One of the first macroeconomic models constructed for the U.S. is the Klein (1950) model, which consists of three structural equations and three identi- ties Ct = α0 + α1Pt + α2Pt−1 + α3(PWt +GWt) + u1,t It = β0 + β1Pt + β2Pt−1 + β3Kt−1 + u2,t PWt = γ0 + γ1Dt + γ2Dt−1 + γ3TRENDt + u3,t Dt = Ct + It +Gt Pt = Dt − TAXt − PWt Kt = Kt−1 + It , 190 Linear Regression Models where the key variables are defined as Ct = Consumption Pt = Profits PWt = Private wages GWt = Government wages It = Investment Kt = Capital stock Dt = Aggregate demand Gt = Government spending TAXt = Indirect taxes plus nex exports TRENDt = Time trend, base in 1931 . The first equation is a consumption function, the second equation is an investment function and the third equation is a labor demand equation. The last three expressions are identities for aggregate demand, private profits and the capital stock, respectively. The variables are classified as Endogenous : Ct, It, PWt, Dt, Pt, Kt Exogenous : CONST, Gt, TAXt, GWt, TREND, Predetermined : Pt−1, Dt−1, Kt−1 . To estimate the Klein model by FIML, it is necessary to use the three identities to write the model as a three-equation system just containing the three endogenous variables. Formally, this requires using the identities to substitute Pt and Dt out of the three structural equations. This is done by combining the first two identities to derive an expression for Pt Pt = Dt − TAXt − PWt = Ct + It +Gt − TAXt − PWt , while an expression forDt is given directly from the first identity. Notice that the third identity, the capital stock accumulation equation, does not need to be used as Kt does not appear in any of the three structural equations. Substituting the expressions for Pt andDt into the three structural equations gives Ct = α0 + α1(Ct + It +Gt − TAXt − PWt) +α2Pt−1 + α3(PWt +GWt) + u1,t It = β0 + β1(Ct + It +Gt − TAXt − PWt) +β2Pt−1 + β3Kt−1 + u2,t PWt = γ0 + γ1(Ct + It +Gt) + γ2Dt−1 + γ3TRENDt + u3,t . 5.6 Exercises 191 This is now a system of three equations and three endogenous variables (Ct, It, PWt), which can be estimated by FIML. Let yt = [ Ct It PWt ] xt = [ CONST Gt TAXt GWt TRENDt Pt−1 Dt−1 Kt−1 ] ut = [ u1,t u2,t u3,t ] B = 1− α1 −β1 −γ1 −α1 1− β1 −γ1 α1 − α2 β1 1 A = −α0 −β0 −γ0 −α1 −β1 −γ1 α1 β1 0 −α2 0 0 0 0 −γ3 −α3 −β2 0 0 0 −γ2 0 −β3 0 , then, from (5.1), the system is written as ytB + xtA = ut . The Klein macroeconomic model is estimated over the period 1920 to 1941 using U.S. annual data. As the system contains one lag the effective sample begins in 1921, resulting in a sample of size T = 21. The FIML parameter estimates are contained in the last column of Table 5.4. The value of the log-likelihood function is lnLT (θ̂) = −85.370. For comparison the ordinary least squares and instrumental variables estimates are also given. 
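For readers wishing to reproduce the FIML column of Table 5.4, the main coding task is a function that maps the twelve structural coefficients into the matrices B and A and returns the negative of the concentrated log-likelihood in (5.10). The MATLAB sketch below is one possible layout, not the book's own program; the rows of A are derived directly from the three substituted structural equations and labelled by the regressor they correspond to, and the function and variable names are illustrative.

    % Concentrated negative log-likelihood of the Klein system y_t B + x_t A = u_t.
    % y : (T x 3) matrix [C I PW];  x : (T x 8) matrix ordered as x_t above.
    % theta = [a0 a1 a2 a3 b0 b1 b2 b3 g0 g1 g2 g3]'
    function f = negloglik_klein(theta, y, x)
        a = theta(1:4);  b = theta(5:8);  g = theta(9:12);
        [T,N] = size(y);
        B = [1-a(2)     -b(2)    -g(2);
             -a(2)      1-b(2)   -g(2);
             a(2)-a(4)   b(2)     1  ];    % PW coefficient in the consumption equation is a1 - a3
        A = [-a(1)  -b(1)  -g(1);          % constant
             -a(2)  -b(2)  -g(2);          % G
              a(2)   b(2)   0   ;          % TAX
             -a(4)   0      0   ;          % GW
              0      0     -g(4);          % TREND
             -a(3)  -b(3)   0   ;          % P(-1)
              0      0     -g(3);          % D(-1)
              0     -b(4)   0   ];         % K(-1)
        u = y*B + x*A;
        V = (u'*u)/T;                      % concentrated residual covariance matrix
        f = N/2*(1 + log(2*pi)) + 0.5*log(det(V)) - log(abs(det(B)));
    end

Passing this function to an iterative optimizer, for example fminsearch(@(t) negloglik_klein(t, y, x), theta0), yields estimates of the kind reported in the final column of Table 5.4, provided sensible starting values are supplied.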
The instrumental variables estimates are computed using the 8 variables given in xt as the instrument set for each equation. Noticeable differences in the magnitudes of the parameter estimates are evident in some cases, particularly in the second equation {β0, β1, β2, β3}. In this instance, the IV estimates appear to be closer to the FIML estimates than to the ordinary least squares estimates, indicating potential simultaneity problems with the ordinary least squares approach. 5.6 Exercises (1) Simulating a Simultaneous System 192 Linear Regression Models Table 5.4 Parameter estimates of the Klein macroeconomic model for the U.S., 1921 to 1941. Parameter OLS IV FIML α0 16.237 16.555 16.461 α1 0.193 0.017 0.177 α2 0.090 0.216 0.210 α3 0.796 0.810 0.728 β0 10.126 20.278 24.130 β1 0.480 0.150 0.007 β2 0.333 0.616 0.670 β3 -0.112 -0.158 -0.172 γ0 1.497 1.500 1.028 γ1 0.439 0.439 0.317 γ2 0.146 0.147 0.253 γ3 0.130 0.130 0.096 Gauss file(s) linear_simulation.g Matlab file(s) linear_simulation.m Consider the bivariate model y1,t = β1y2,t + α1x1,t + u1,t y2,t = β2y1,t + α2x2,t + u2,t, where y1,t and y2,t are the dependent variables, x1,t ∼ N(0, 100) and x1,t ∼ N(0, 9) are the independent variables, u1,t and u2,t are normally distributed disturbance terms with zero means and covariance matrix V = [ σ11 σ12 σ12 σ22 ] = [ 1 0.50.5 1 ] , and β1 = 0.6, α1 = 0.4, β2 = 0.2 and α2 = −0.5. (a) Construct A, B and hence compute Π = −AB−1. (b) Simulate the model for T = 500 observations and plot the simulated series of y1,t and y2,t. (2) ML Estimation of a Regression Model Gauss file(s) linear_estimate.g Matlab file(s) linear_estimate.m 5.6 Exercises 193 Simulate the model for a sample of size T = 200 yt = β0 + β1x1,t + β2x2,t + ut ut ∼ N(0, 4), where β0 = 1.0, β1 = 0.7, β2 = 0.3, σ 2 = 4 and x1,t and x2,t are generated as N(0, 1). (a) Compute the maximum likelihood parameter estimates using the Newton-Raphson algorithm, without concentrating the log-likelihood function. (b) Compute the maximum likelihood parameter estimates using the Newton-Raphson algorithm, by concentrating the log-likelihood func- tion. (c) Compute the parameter estimates by ordinary least squares. (d) Compare the estimates obtained in parts (a) to (c). (e) Compute the covariance matrix of the parameter estimates in parts (a) to (c) and compare the results. (3) Testing a Single Equation Model Gauss file(s) linear_lr.g, linear_w.g, linear_lm.g Matlab file(s) linear_lr.m, linear_w.m, linear_lm.m This exercise is an extension of Exercise 2. Test the hypotheses H0 : β1 + β2 = 1 H1 : β1 + β2 6= 1. (a) Perform a LR test of the hypotheses. (b) Perform a Wald test of the hypotheses. (c) Perform a LM test of the hypotheses. (4) FIML Estimation of a Structural Model Gauss file(s) linear_fiml.g Matlab file(s) linear_fiml.m This exercise uses the simulated data generated in Exercise 1. (a) Estimate the parameters of the structural model y1,t = β1y2,t + α1x1,t + u1,t y2,t = β2y1,t + α2x2,t + u2,t , by FIML using an iterative algorithm with the starting estimates taken as draws from a uniform distribution. 194 Linear Regression Models (b) Repeat part (a) by choosing the starting estimates as draws from a normal distribution. Compare the final estimates with the estimates obtained in part (a). (c) Re-estimate the model’s parameters using an IV estimator and com- pare these estimates with the FIML estimates obtained in parts (a) and (b). 
(5) Weak Instruments Gauss file(s) linear_weak.g Matlab file(s) linear_weak.m This exercise extends the results on weak instruments in Example 5.10. Consider the model y1,t = βy2,t + u1,t y2,t = φxt + u2,t, ut ∼ N ([ 0 0 ] , [ 1.00 0.99 0.99 1.00 ]) , where y1,t and y2,t are dependent variables, xt ∼ U(0, 1) is the exogenous variable and the parameter values are β = 0, φ = 0.5. The sample size is T = 5 and 10, 000 replications are used to generate the sampling distribution of the estimator. (a) Generate the sampling distribution of the IV estimator and discuss its properties. (b) Repeat part (a) except choose φ = 1. Compare the sampling dis- tribution of the IV estimator to the distribution obtained in part (a). (c) Repeat part (a) except choose φ = 10. Compare the sampling dis- tribution of the IV estimator to the distribution obtained in part (a). (d) Repeat part (a) except choose φ = 0. Compare the sampling dis- tribution of the IV estimator to the distribution obtained in part (a). Also compute the sampling distribution of the ordinary least squares estimator for this case. Note that for this model the ordi- nary least squares estimator has the property (see Stock, Wright and Yogo, 2002) plim(β̂OLS) = σ12 σ22 = 0.99 . (e) Repeat parts (a) to (d) for a larger sample of T = 50 and a very large sample of T = 500. Are the results in parts (a) to (d) affected by asymptotic arguments? 5.6 Exercises 195 (6) Testing a Multiple Equation Model Gauss file(s) linear_fiml_lr.g, linear_fiml_wd.g, linear_fiml_lm.g Matlab file(s) linear_fiml_lr.m, linear_fiml_wd.m, linear_fiml_lm.m This exercise is an extension of Exercise 4. Test the hypotheses H0 : α1 + α2 = 0 H1 : α1 + α2 6= 0 . (a) Perform a LR test of the hypotheses. (b) Perform a Wald test of the hypotheses. (c) Perform a LM test of the hypotheses. (7) Relationship Between FIML and IV Gauss file(s) linear_iv.g Matlab file(s) linear_iv.m Simulate the following structural model for T = 500 observations y1,t = βy2,t + u1,t y2,t = γy1,t + αxt + u2,t, where y1,t and y2,t are the dependent variables, xt ∼ N(0, 100) is the independent variable, u1,t and u2,t are normally distributed disturbance terms with zero means and covariance matrix V = [ σ11 σ12 σ12 σ22 ] = [ 2.0 0.0 0.0 1.0 ] , and the parameters are set at β = 0.6, γ = 0.4 and α = −0.5. (a) Compute the FIML estimates of the model’s parameters using an iterative algorithm with the starting estimates taken as draws from a uniform distribution. (b) Recompute the FIML estimates using the analytical expressions given in equation (5.16). Compare these estimates with the esti- mates obtained in part (a). (c) Re-estimate the model’s parameters using an IV estimator and com- pare these estimates with the FIML estimates in parts (a) and (b). (8) Recursive Structural Models Gauss file(s) linear_recursive.g Matlab file(s) linear_recursive.m 196 Linear Regression Models Simulate the trivariate structural model for T = 200 observations y1,t = α1x1,t + u1,t y2,t = β1y1,t + α2x1,t + u2,t y3,t = β2y1,t + β3y2,t + α3x1,t + u3,t, where {x1,t, x2,t, x3,t} are normal random variables with zero means and respective standard deviations of {1, 2, 3}. The parameters are β1 = 0.6, β2 = 0.2, β3 = 1.0, α1 = 0.4, α2 = −0.5 and α3 = 0.2. The disturbance vector ut = (u1,t, u2,t, u3,t) is normally distributed with zero means and covariance matrix V = 2 0 0 0 1 0 0 0 5 . (a) Estimate the model by maximum likelihood and compare the pa- rameter estimates with the population parameter values. 
(b) Estimate each equation by ordinary least squares and compare the parameter estimates to the maximum likelihood estimates. Briefly discuss why the two sets of estimates are the same. (9) Seemingly Unrelated Regression Gauss file(s) linear_sur.g Matlab file(s) linear_sur.m Simulate the following trivariate SUR model for T = 500 observations yi,t = αixi,t + ui,t, i = 1, 2, 3 , where {x1,t, x2,t, x3,t} are normal random variables with zero means and respective standard deviations of {1, 2, 3}. The parameters are α1 = 0.4, α2 = −0.5 and α3 = 1.0. The disturbance vector ut = (u1,t, u2,t, u3,t) is normally distributed with zero means and covariance matrix V = 1.0 0.5 −0.1 0.5 1.0 0.2 −0.1 0.2 1.0 . (a) Estimate the model by maximum likelihood and compare the pa- rameter estimates with the population parameter values. (b) Estimate each equation by ordinary least squares and compare the parameter estimates to the maximum likelihood estimates. 5.6 Exercises 197 (c) Simulate the model using the following covariance matrix V = 2 0 0 0 1 0 0 0 5 . Repeat parts (a) and (b) and comment on the results. (d) Simulate the model yi,t = αix1,t + ui,t, i = 1, 2, 3 , for T = 500 observations and using the original covariance matrix. Repeat parts (a) and (b) and comment on the results. (10) Linear Taylor Rule Gauss file(s) linear_taylor.g, taylor.dat Matlab file(s) linear_taylor.m, taylor.mat. The data are T = 53 quarterly observations for the U.S. on the Federal Funds Rate, it, the inflation gap, πt, and the output gap, yt. (a) Plot the data and hence reproduce Figure 5.3. (b) Estimate the static linear Taylor rule equation it = β0 + β1πt + β2yt + ut , ut ∼ N(0, σ 2) , by maximum likelihood. Compute the covariance matrix of β̂. (c) Use a Wald test to test the restrictions β1 = 1.5 and β2 = 0.5. (11) Klein’s Macroeconomic Model of the U.S. Gauss file(s) linear_klein.g, klein.dat Matlab file(s) linear_klein.m, klein.mat The data file contains contains 22 annual observations from 1920 to 1941 198 Linear Regression Models on the following U.S. macroeconomic variables Ct = Consumption Pt = Profits PWt = Private wages GWt = Government wages It = Investment Kt = Capital stock Dt = Aggregate demand Gt= Government spending TAXt = Indirect taxes plus nex exports TRENDt = Time trend, base in 1931 The Klein (1950) macroeconometric model of the U.S. is Ct = α0 + α1Pt + α2Pt−1 + α3(PWt +GWt) + u1,t It = β0 + β1Pt + β2Pt−1 + β3Kt−1 + u2,t PWt = γ0 + γ1Dt + γ2Dt−1 + γ3TRENDt + u3,t Dt = Ct + It +Gt Pt = Dt − TAXt − PWt Kt = Kt−1 + It . (a) Estimate each of the three structural equations by ordinary least squares. What is the problem with using this estimator to compute the parameter estimates of this model? (b) Estimate the model by IV using the following instruments for each equation xt = [CONST, Gt, TAXt, GWt, TRENDt, Pt−1, Dt−1, Kt−1] . What are the advantages over ordinary least squares with using IV to compute the parameter estimates of this model? (c) Use the three identities to re-express the three structural equations as a system containing the three endogenous variables, Ct, It and PWt, and estimate this model by FIML. What are the advantages over IV with using FIML to compute the parameter estimates of this model? (d) Compare the parameter estimates obtained in parts (a) to (c), and compare your parameter estimates with Table 5.4. 
6 Nonlinear Regression Models 6.1 Introduction The class of linear regression models discussed in Chapter 5 is now extended to allow for nonlinearities in the specification of the conditional mean. Non- linearity in the specification of the mean of time series models is the subject matter of Chapter 19 while nonlinearity in the specification of the variance is left until Chapter 20. As with the treatment of linear regression models in the previous chapter, nonlinear regression models are examined within the maximum likelihood framework. Establishing this link ensures that meth- ods typically used to estimate nonlinear regression models, including Gauss- Newton, nonlinear least squares and robust estimators, immediately inherit the same asymptotic properties as the maximum likelihood estimator. More- over, it is also shown that many of the statistics used to test nonlinear re- gression models are special cases of the LR, Wald or LM tests discussed in Chapter 4. An important example of this property, investigated at the end of the chapter, is that a class of non-nested tests used to discriminate between models is shown to be a LR test. 6.2 Specification A typical form for the nonlinear regression model is g(yt;α) = µ(xt;β) + ut , ut ∼ iidN(0, σ 2) , (6.1) where yt is the dependent variable and xt is the independent variable. The nonlinear functions g(·) and µ(·) of yt and xt have parameter vectors α = {α1, α2, · · · , αm} and β = {β0, β1, · · · , βk}, respectively. The unknown parameters to be estimated are given by the (m+k+2) vector θ = {α, β, σ2}. Example 6.1 Zellner-Revankar Production Function 200 Nonlinear Regression Models Consider the production function relating output, yt, to capital, kt, and labour, lt, given by ln yt + αyt = β0 + β1 ln kt + β2 ln lt + ut , with g(yt;α) = ln yt + αyt , µ(xt;β) = β0 + β1 ln kt + β2 ln lt . Example 6.2 Exponential Regression Model Consider the nonlinear model yt = β0 exp [β1xt] + ut , where g(yt;α) = yt , µ(xt;β) = β0 exp [β1xt] . Examples 6.1 and 6.2 present models that are intrinsically nonlinear in the sense that they cannot be transformed into linear representations of the form of models discussed in Chapter 5. A model that is not intrinsically nonlinear is given by yt = β0 exp [β1xt + ut] . (6.2) By contrast with the model in Example 6.2, this model can be transformed into a linear representation using the logarithmic transformation ln yt = ln β0 + β1xt + ut . The properties of these two exponential models are compared in the following example. Example 6.3 Alternative Exponential Regression Models Figure 6.1 plots simulated series based on the two exponential models y1,t = β0 exp [β1xt + u1,t] y2,t = β0 exp [β1xt] + u2,t , where the sample size is T = 50, and the explanatory variable xt is a linear trend, u1,t, u2,t ∼ iidN(0, σ 2) and the parameter values are β0 = 1.0, β1 = 0.05 and σ = 0.5. Panel (a) of Figure 6.1 shows that both series are increasing exponentially as xt increases; however, y1,t exhibits increasing volatility for higher levels of xt whereas y2,t does not. Transforming the series using a 6.3 Maximum Likelihood Estimation 201 natural log transformation, illustrated in panel (b) of Figure 6.1, renders the volatility of y1,t constant, but this transformation is inappropriate for y2,t where it now exhibits decreasing volatility for higher levels of xt. 
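The two series are easy to generate. The MATLAB sketch below uses the design stated in the example (T = 50, beta0 = 1.0, beta1 = 0.05, sigma = 0.5 and a linear trend for xt) and plots the levels and logarithms side by side in the manner of Figure 6.1.

    % Simulate the two exponential models of Example 6.3.
    T = 50;  b0 = 1.0;  b1 = 0.05;  sig = 0.5;
    x  = (1:T)';                                % explanatory variable: a linear trend
    u1 = sig*randn(T,1);  u2 = sig*randn(T,1);
    y1 = b0*exp(b1*x + u1);                     % disturbance inside the exponential
    y2 = b0*exp(b1*x) + u2;                     % additive disturbance (intrinsically nonlinear)

    subplot(1,2,1); plot(x, [y1 y2]);           % levels, cf. panel (a) of Figure 6.1
    subplot(1,2,2); plot(x, log([y1 y2]));      % logs, cf. panel (b); y2 is assumed positive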
(a) Levels xt y t (b) Logs xt ln y t 0 10 20 30 40 500 10 20 30 40 50 -1 0 1 2 3 4 0 5 10 15 20 Figure 6.1 Simulated realizations from two exponential models, y1,t (solid line) and y2,t (dot-dashed line), in levels and in logarithms with T = 50. 6.3 Maximum Likelihood Estimation The iterative algorithms discussed in Chapter 3 can be used to find the maximum likelihood estimates of the parameters of the nonlinear regression model in equation (6.1), together with their standard errors. The disturbance term, u, is assumed to be normally distributed given by f(u) = 1√ 2πσ2 exp [ − u 2 2σ2 ] . (6.3) The transformation of variable technique (see Appendix A) can be used to derive the corresponding density of y as f(y) = f(u) ∣∣∣∣ du dy ∣∣∣∣ . (6.4) 202 Nonlinear Regression Models Taking the derivative with respect to yt on both sides of equation (6.1) gives dut dyt = dg(yt;α) dyt , so the probability distribution of yt is f(yt |xt; θ) = 1√ 2πσ2 exp [ −(g(yt;α)− µ(xt;β)) 2 2σ2 ] ∣∣∣∣ dg(yt;α) dyt ∣∣∣∣ , where θ = {α, β, σ2}. The log-likelihood function for t = 1, 2, · · · , T obser- vations, is lnLT (θ) = − 1 2 ln(2π) − 1 2 ln(σ2)− 1 2σ2T T∑ t=1 (g(yt;α) − µ(xt;β))2 + 1 T T∑ t=1 ln ∣∣∣∣ dg(yt;α) dyt ∣∣∣∣ , which is maximized with respect to θ. The elements of the gradient and Hessian at time t are, respectively, ∂ ln lt(θ) ∂α = − 1 σ2 (g(yt;α)− µ(xt;β)) ∂g(yt;α) ∂α + ∂ ∂α ln ∣∣∣∣ dg(yt;α) dyt ∣∣∣∣ ∂ ln lt(θ) ∂β = 1 σ2 (g(yt;α)− µ(xt;β)) ∂µ(xt;β) ∂β ∂ ln lt(θ) ∂σ2 = − 1 2σ2 + 1 2σ4 (g(yt;α) − µ(xt;β))2 , 6.3 Maximum Likelihood Estimation 203 and ∂2 ln lt(θ) ∂α∂α′ = − 1 σ2 (g(yt;α)− µ(xt;β)) ∂g(yt;α) ∂α∂α′ − 1 σ2 ( ∂g(yt;α) ∂α∂α′ )2 + ∂2 ∂α∂α′ ln ∣∣∣∣ dg(yt;α) dyt ∣∣∣∣ ∂2 ln lt(θ) ∂α∂β′ = 1 σ2 (g(yt;α) − µ(xt;β)) ∂g(yt;α) ∂α ∂µ(xt;β) ∂β′ ∂2 ln lt(θ) ∂β∂β′ = 1 σ2 (g(yt;α) − µ(xt;β)) ∂2µ(xt;β) ∂β∂β′ − 1 σ2 ∂2µ(xt;β) ∂β∂β′ ∂2 ln lt ∂(σ2)2 = − 1 2σ4 + 1 σ6 (g(yt;α) − µ(xt;β))2 ∂2 ln lt(θ) ∂α∂σ2 = 1 σ4 (g(yt;α) − µ(xt;β)) ∂g(yt;α) ∂α ∂2 ln lt(θ) ∂β∂σ2 = − 1 σ4 (g(yt;α)− µ(xt;β)) ∂µ(xt;β) ∂β . The generic parameter updating scheme of the Newton-Raphson algo- rithm is θ(k) = θ(k−1) −H(k−1)G(k−1) , (6.5) which, in the context of the nonlinear regression model may be simplified slightly as follows. Averaging over the t = 1, 2, · · · , T observations, setting the first-order condition for σ2 equal to zero and solving for σ̂2 yields σ̂2 = 1 T T∑ t=1 (g(yt; α̂)− µ(xt; β̂))2 . (6.6) This result is used to concentrate σ̂2 out of the log-likelihood function, which is then maximized with respect to θ = {α, β}. The Newton-Raphson algo- rithm then simplifies to θ(k) = θ(k−1) −H−11,1 ( θ(k−1) ) G1(θ(k−1)) , (6.7) where G1 = 1 T T∑ t=1 ∂ ln lt(θ) ∂α 1 T T∑ t=1 ∂ ln lt(θ) ∂β (6.8) 204 Nonlinear Regression Models and H1,1 = 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂β′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂β′ . (6.9) The method of scoring replaces −H(k−1) in (6.5), by the information ma- trix I(θ). The updated parameter vector is calculated as θ(k) = θ(k−1) + I −1 (k−1))G(k−1), (6.10) where the information matrix, I(θ), is given by I (θ) = −E 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂β′ 1 T T∑ t=1 ∂2 ln lT(θ) ∂α∂σ2 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂β′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂σ2 1 T T∑ t=1 ∂2 ln lt(θ) ∂σ2∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂σ2∂β′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂(σ2)2 . For this class of models I(θ) is a block-diagonal matrix. 
To see this, note that from equation (6.1) E[g(yt;α)] = E[µ(xt;β) + ut] = µ(xt;β) , so that E [ 1 T T∑ t=1 ∂2 ln lT (θ) ∂α∂σ2 ] = E [ 1 σ4T T∑ t=1 (g(yt;α)− µ(xt;β)) ∂g(yt;α) ∂α ] = 0 E [ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂σ2 ] = −E [ 1 σ4T T∑ t=1 (g(yt;α)− µ(xt;β)) ∂µ(xt;β) ∂β ] = 0 . In this case I(θ) reduces to I(θ) = [ I1,1 0 0 I2,2 ] , (6.11) where I1,1 = −E[H1,1] = −E 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂α∂β′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂α′ 1 T T∑ t=1 ∂2 ln lt(θ) ∂β∂β′ , 6.3 Maximum Likelihood Estimation 205 and I2,2 = −E [ 1 T T∑ t=1 ∂2 ln lt(θ) ∂(σ2)2 ] . The scoring algorithm now proceeds in two parts [ α(k) β(k) ] = [ α(k−1) β(k−1) ] + I−11,1 (θ(k−1))G1(θ(k−1)) (6.12) [ σ2(k) ] = [ σ2(k−1) ] + I−12,2 (θ(k−1))G2(θ(k−1)), (6.13) where G1 is defined in equation (6.8) and G2 = [ 1 T T∑ t=1 ∂ ln lt(θ) ∂σ2 ] . The covariance matrix of the parameter estimators is obtained by invert- ing the relevant blocks of the information matrix at the last iteration. For example, the variance of σ̂2 is simply given by var(σ̂2) = 2σ̂4 T . Example 6.4 Estimation of a Nonlinear Production Function Con- sider the Zellner-Revankar production function introduced in Example 6.1. The probability density function of ut is f(u) = 1√ 2πσ2 exp [ − u 2 2σ2 ] . Using equation (6.4) with dut dyt = 1 yt + α , the density for yt is f(yt; θ) = 1√ 2πσ2 exp [ −(ln yt + αyt − β0 − β1 ln kt − β2 ln lt) 2 2σ2 ] ∣∣∣∣ 1 yt + α ∣∣∣∣ . The log-likelihood function for a sample of t = 1, · · · , T observations is lnLT (θ) = − 1 2 ln(2π)− 1 2 ln(σ2) + 1 T T∑ t=1 ln ∣∣∣∣ 1 yt + α ∣∣∣∣ − 1 2σ2T T∑ t=1 (ln yt + αyt − β0 − β1 ln kt − β2 ln lt)2. 206 Nonlinear Regression Models This function is then maximized with respect to the unknown parameters θ = {α, β0, β1, β2, σ2}. The problem can be simplified by concentrating the log-likelihood function with respect to σ̂2 which is given by the variance of the residuals σ̂2 = 1 T T∑ t=1 (ln yt + α̂yt − β̂0 − β̂1 ln kt − β̂2 ln lt)2. Example 6.5 Estimation of a Nonlinear Exponential Model Consider the nonlinear model in Example 6.2. The disturbance term u is assumed to have a normal distribution f(u) = 1√ 2πσ2 exp [ − u 2 2σ2 ] , so the density of yt is f(yt |xt; θ) = 1√ 2πσ2 exp [ −(yt − β0 exp [β1xt]) 2 2σ2 ] . The log-likelihood function for a sample of t = 1, · · · , T observations is lnLT (θ) = − 1 2 ln(2π) − 1 2 ln(σ2)− 1 2σ2T T∑ t=1 (yt − β0 exp [β1xt])2 . This function is to be maximized with respect to θ = {β0, β1, σ2}. The derivatives of the log-likelihood function with respect θ are ∂ lnLT (θ) ∂β0 = 1 σ2T T∑ t=1 (yt − β0 exp[β1xt]) exp[β1xt] ∂ lnLT (θ) ∂β1 = 1 σ2T T∑ t=1 (yt − β0 exp[β1xt])β0 exp[β1xt]xt ∂ lnLT (θ) ∂σ2 = − 1 2σ2 + 1 2σ4T T∑ t=1 (yt − β0 exp[β1xt])2. The maximum likelihood estimators of the parameters are obtained by set- 6.3 Maximum Likelihood Estimation 207 ting these derivatives to zero and solving the system of equations 1 σ̂2T T∑ t=1 (yt − β̂0 exp[β̂1xt]) exp[β̂1xt] = 0 1 σ̂2T T∑ t=1 (yt − β̂0 exp[β̂1xt])β0 exp[β̂1xt]xt = 0 − 1 2σ̂2 + 1 2σ̂4T T∑ t=1 (yt − β̂0 exp[β̂1xt])2 = 0. Estimation of the parameters is simplified by noting that the first two equa- tions can be written independently of σ̂2 and that the information matrix is block diagonal. In this case, an iterative algorithm is used to find β̂0 and β̂1. Once these estimates are computed, σ̂ 2 is obtained immediately from rearranging the last expression as σ̂2 = 1 T T∑ t=1 (yt − β̂0 exp[β̂1xt])2. 
(6.14) Using the simulated y2,t data in Panel (a) of Figure 6.1, the maximum likelihood estimates are revealed to be β̂0 = 1.027 and β̂1 = 0.049. The estimated negative Hessian matrix is −HT (β̂) = [ 117.521 4913.334 4913.334 215992.398 ] , so that the covariance matrix of β̂ is 1 T Ω̂ = − 1 T H−1T (β̂) [ 0.003476 −0.000079 −0.000079 0.000002 ] . The standard errors of the maximum likelihood estimates of β0 and β1 are found by taking the square roots of the diagonal terms of Ω̂/T se(β̂0) = √ 0.003476 = 0.059 se(β̂1) = √ 0.000002 = 0.001 . The residual at time t is computed as ût = yt − β̂0 exp[β̂1xt] = yt − 1.027 exp[0.049 xt], and the residual sum of squares is given by ∑T t=1 û 2 t = 12.374. Finally, the 208 Nonlinear Regression Models residual variance is computed as σ̂2 = 1 T T∑ t=1 (yt − β̂0 exp[β̂1xt])2 = 12.374 50 = 0.247 , with standard error se(σ̂2) = √ 2σ̂4 T = √ 2× 0.2472 50 = 0.049 . 6.4 Gauss-Newton For the special case of the nonlinear regression models where g (yt;α) = yt in (6.1), the scoring algorithm can be simplified further so that parameter updating can be achieved by means of a least squares regression. This form of the scoring algorithm is known as the Gauss-Newton algorithm. Consider the model yt = µ(xt;β) + ut , ut ∼ iidN(0, σ 2), (6.15) where the unknown parameters are θ = {β, σ2}. The distribution of yt is f(yt |xt; θ) = 1√ 2πσ2 exp [ − 1 2σ2 T∑ t=1 (yt − µ(xt; β))2 ] , (6.16) and the corresponding log-likelihood function at time t is ln lt(θ) = − 1 2 ln(2π)− 1 2 ln(σ2)− 1 2σ2 (yt − µ(xt; β))2 , (6.17) with first derivative gt(β) = 1 σ2 ∂(µ(xt; β)) ∂β (yt − µ(xt; β)) = 1 σ2 ztut , (6.18) where ut = yt − µ(xt;β), zt = ∂(µ(xt; β)) ∂β . The gradient with respect to β is GT (β) = 1 T T∑ t=1 gt(β) = 1 σ2T T∑ t=1 ztut , (6.19) 6.4 Gauss-Newton 209 and the information matrix is, therefore, I(β) = E [ 1 T T∑ t=1 gt(β)gt(β) ′ ] = 1 T T∑ t=1 E [( 1 σ2 ztut )( 1 σ2 ztut )′] = 1 σ4T E [ T∑ t=1 u2t ztz ′ t ] = 1 σ2T T∑ t=1 ztz ′ t , (6.20) where use has been made of the assumption that ut iid so that E[u 2 t ] = σ 2. Because of the block-diagonal property of the information matrix in equa- tion (6.11), the update of β is obtained by using the expressions for GT (β) and I(β) in (6.19) and (6.20), respectively, β(k) = β(k−1) + I −1(β(k−1))G(β(k−1)) = β(k−1) + ( T∑ t=1 ztz ′ t )−1 T∑ t=1 ztut . Let the change in the parameters at iteration k be defined as ∆̂ = β(k) − β(k−1) = ( T∑ t=1 ztz ′ t )−1 T∑ t=1 ztut . (6.21) The Gauss-Newton algorithm, therefore, requires the evaluation of ut and zt at β(k−1) followed by a simple linear regression of ut on zt to obtain ∆̂. The updated parameter vector β(k) is simply obtained by adding the parameter estimates from this regression on to β(k−1). Once the Gauss-Newton scheme has converged, the final estimates of β̂ are the maximum likelihood estimates. In turn, the maximum likelihood estimate of σ2 is computed as σ̂2 = 1 T T∑ t=1 (yt − µ(xt; β̂))2 . (6.22) Example 6.6 Nonlinear Exponential Model Revisited Consider again the nonlinear exponential model in Examples 6.2 and 6.5. Estimating this model using the Gauss-Newton algorithm requires the fol- lowing steps. 210 Nonlinear Regression Models Step 1: Compute the derivatives of µ(xt;β) with respect to β = {β0, β1} z1,t = ∂µ(xt;β) ∂β0 = exp [β1xt] z2,t = ∂µ(xt;β) ∂β1 = β0 exp [β1xt]xt . Step 2: Evaluate ut, z1,t and z2,t at the starting values of β. Step 3: Regress ut on z1,t and z2,t to obtain ∆̂β0 and ∆̂β1 . 
Step 4: Update the parameter estimates [ β0 β1 ] (k) = [ β0 β1 ] (k−1) + [ ∆̂β0 ∆̂β1 ] . Step 5: The iterations continue until convergence is achieved, |∆̂β0 |, |∆̂β1 | < ε, where ε is the tolerance level. Example 6.7 Estimating a Nonlinear Consumption Function Con- sider the following nonlinear consumption function ct = β0 + β1y β2 t + ut , ut ∼ iidN(0, σ 2) , where ct is real consumption, yt is real disposable income, ut isa disturbance term N(0, σ2), and θ = {β0, β1, β2, σ2} are unknown parameters. Estimating this model using the Gauss-Newton algorithm requires the following steps. Step 1: Compute the derivatives of µ(yt;β) = β0 + β1y β2 t with respect to β = {β0, β1, β2} z1,t = ∂µ(yt;β) ∂β0 = 1 z2,t = ∂µ(yt;β) ∂β1 = yβ2t z3,t = ∂µ(yt;β) ∂β2 = β1y β2 t ln(yt) . Step 2: Evaluate ut, z1,t, z2,t and z3,t at the starting values for β. Step 3: Regress ut on z1,t, z2,t and z3,t, to get ∆̂ = {∆̂β0 , ∆̂β1 , ∆̂β2} from this auxiliary regression. 6.4 Gauss-Newton 211 Step 4: Update the parameter estimates β0 β1 β2 (k) = β0 β1 β2 (k−1) + ∆̂β0 ∆̂β1 ∆̂β2 . Step 5: The iterations continue until convergence, |∆̂β0 |, |∆̂β1 |, |∆̂β2 | < ε, where ε is the tolerance level. U.S. quarterly data for real consumption expenditure and real disposable personal income for the period 1960:Q1 to 2009:Q4, downloaded from the Federal Reserve Bank of St. Louis, are used to estimate the parameters of this nonlinear consumption function. Nonstationary time series The starting values for β0 and β1, obtained from a linear model with β2 = 1, are β(0) = [−228.540, 0.950, 1.000] . After constructing ut and the derivatives zt = {z1,t, z2,t, z3,t}, ut is regressed on zt to give the parameter values ∆̂ = [600.699,−1.145, 0.125] . The updated parameter estimates are β(1) = [−228.540, 0.950, 1.000]+[600.699,−1.145, 0.125] = [372.158,−0.195, 1.125] . The final estimates, achieved after five iterations, are β(5) = [299.019, 0.289, 1.124] . The estimated residual for time t, using the parameter estimates at the final iteration, is computed as ût = ct − 299.019 − 0.289 y1.124t , yielding the residual variance σ̂2 = 1 T T∑ t=1 û2t = 1307348.531 200 = 6536.743 . The estimated information matrix is I(β̂) = 1 σ̂2T T∑ t=1 ztz ′ t = 0.000 2.436 6.145 2.436 48449.106 124488.159 6.145 124488.159 320337.624 , 212 Nonlinear Regression Models from which the covariance matrix of β̂ is computed 1 T Ω̂ = 1 T I−1(β̂) = 2350.782 −1.601 0.577 −1.601 0.001 −0.0004 0.577 −0.0004 0.0002 . The standard errors of β̂ are given as the square roots of the elements on the main diagonal of Ω̂/T se(β̂0) = √ 2350.782 = 48.485 se(β̂1) = √ 0.001 = 0.034 se(β̂2) = √ 0.0002 = 0.012 . 6.4.1 Relationship to Nonlinear Least Squares A standard procedure used to estimate nonlinear regression models is known as nonlinear least squares. Consider equation (6.15) where for simplicity β is a scalar. By expanding µ (xt;β) as a Taylor series expansion around β(k−1) µ (xt;β) = µ ( xt;β(k−1) ) + dµ dβ (β − βk−1) + · · · , equation (6.15) is rewritten as yt − µ ( xt;β(k−1) ) = dµ dβ (β − βk−1) + vt, (6.23) where vt is the disturbance which contains ut and the higher-order terms from the Taylor series expansion. The kth iteration of the nonlinear regres- sion estimation procedure involves regressing yt−µ ( xt;β(k−1) ) on the deriva- tive dµ/dβ, to generate the parameter estimate ∆̂ = β(k) − βk−1. The updated value of the parameter estimate is then computed as β(k) = βk−1 + ∆̂, which is used to recompute yt − µ ( xt;β(k−1) ) and dµ/dβ. The iterations proceed until convergence. 
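In code, one pass of this procedure is nothing more than a least squares regression of the current residual on the current derivatives. A minimal MATLAB sketch of the full iteration for the exponential model of Example 6.6 is given below; the starting values and convergence settings are illustrative, and y and x are assumed to be (T x 1) data vectors already in memory.

    % Gauss-Newton estimation of y_t = b0*exp(b1*x_t) + u_t (Example 6.6).
    b = [1; 0.1];  tol = 1e-8;
    for k = 1:200
        u = y - b(1)*exp(b(2)*x);                        % residuals at the current estimates
        z = [exp(b(2)*x)   b(1)*exp(b(2)*x).*x];         % derivatives z1 and z2
        d = (z'*z)\(z'*u);                               % regress u on z to obtain the step
        b = b + d;                                       % update the parameters
        if max(abs(d)) < tol, break, end
    end
    s2 = mean((y - b(1)*exp(b(2)*x)).^2);                % ML estimate of sigma^2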
An alternative way of expressing the linearized regression equation in equation (6.23) is to write it as ut = zt ( β(k) − βk−1 ) + vt, (6.24) 6.4 Gauss-Newton 213 where ut = yt − µ ( xt;β(k−1) ) , zt = dµ ( xt;β(k−1) ) dβ . Comparing this equation with the updated Gauss-Newton estimator in (6.21) shows that the two estimation procedures are equivalent. 6.4.2 Relationship to Ordinary Least Squares For classes of models where the mean function, µ(xt;β), is linear, the Gauss- Newton algorithm converges in one step regardless of the starting value. Consider the linear regression model where µ(xt;β) = βxt and the expres- sions for ut and zt are respectively ut = yt − βxt , zt = ∂µ(xt;β) ∂β = xt . Substituting these expressions into the Gauss-Newton algorithm (6.21) gives β(k) = β(k−1) + [ T∑ t=1 xtx ′ t ]−1 T∑ t=1 xt(yt − β(k−1)xt) = β(k−1) + [ T∑ t=1 xtx ′ t ]−1 T∑ t=1 xtyt − β(k−1) [ T∑ t=1 xtx ′ t ]−1 T∑ t=1 xtxt = β(k−1) + [ T∑ t=1 xtx ′ t ]−1 T∑ t=1 xtyt − β(k−1) = [ T∑ t=1 xtx ′ t ]−1 T∑ t=1 xtyt, (6.25) which is just the ordinary least squares estimator obtained when regressing yt on xt. The scheme converges in just one step for an arbitrary choice of β(k−1) because β(k−1) does not appear on the right hand side of equation (6.25). 6.4.3 Asymptotic Distributions As Chapter 2 shows, maximum likelihood estimators are asymptotically nor- mally distributed. In the context of the nonlinear regression model, this means that θ̂ a ∼ N(θ0, 1 T I(θ0) −1) , (6.26) 214 Nonlinear Regression Models where θ0 = {β0, σ20} is the true parameter vector and I(θ0) is the information matrix evaluated at θ0. The fact that I(θ) is block diagonal in the class of models considered here means that the asymptotic distribution of β̂ can be considered separately from that of σ̂2 without any loss of information. From equation (6.20), the relevant block of the information matrix is I(β0) = 1 σ20T T∑ t=1 ztz ′ t , so that the asymptotic distribution is β̂ a ∼ N ( β0, σ 2 0 ( T∑ t=1 ztz ′ t )−1) . In practice σ20 is unknown and is replaced by the maximum likelihood estima- tor given in equation (6.6). The standard errors of β̂ are therefore computed by taking the square root of the diagonal elements of the covariance matrix 1 T Ω̂ = σ̂2 [ T∑ t=1 ztz ′ t ]−1 . The asymptotic distribution of σ̂2 is σ̂2 a ∼ N ( σ20, 1 T 2σ40 ) . As with the standard error of β̂, σ20 is replaced by the maximum likelihood estimator of σ2 given in equation (6.6), so that the standard error is se(σ̂2) = √ 2σ̂4 T . 6.5 Testing 6.5.1 LR, Wald and LM Tests The LR, Wald and LM tests discussed in Chapter 4 can all be applied to test the parameters of nonlinear regression models. For those cases where the unrestricted model is relatively easier to estimate than the restricted model, the Wald test is particularly convenient. Alternatively, where the restricted model is relatively easier to estimate than the unrestricted model, the LM test is the natural strategy to adopt. Examples of these testing strategies for nonlinear regression models are given below. 6.5 Testing 215 Example 6.8 Testing a Nonlinear Consumption Function A special case of the nonlinear consumption function used in Example 6.7 is the linear version where β2 = 1. This suggests that a test of linearity is given by the hypotheses H0 : β2 = 1 H1 : β2 6= 1. This restriction is tested using the same U.S. quarterly data for the pe- riod 1960:Q1 - 2009:Q4 on real personal consumption expenditure and real disposable income as in Example 6.7. 
To perform the likelihood ratio test, the values of the restricted (β2 = 1) and unrestricted (β2 6= 1) log-likelihood functions are respectively lnLT (θ̂0) = − 1 2 ln(2π) − 1 2 ln(σ2)− 1 T T∑ t=1 (ct − β0 − β1yt)2 2σ2 lnLT (θ̂1) = − 1 2 ln(2π) − 1 2 ln(σ2)− 1 T T∑ t=1 (ct − β0 − β1yβ2t )2 2σ2 . The restricted and unrestricted parameter estimates are [ −228.540 0.950 1.000 ]′ and [ 298.739 0.289 1.124 ]′ . These estimates produce the respective values of the log-likelihood functions T lnLT (θ̂0) = −1204.645 , T lnLT (θ̂1) = −1162.307 . The value of the LR statistic is LR = −2(T lnLT (θ̂0)−T lnLT (θ̂1)) = −2(−1204.645+−1162.307) = 84.676. From the χ21 distribution, the p-value of the LR test statistic is 0.000 showing that the restriction is rejected at conventional significance levels. To perform a Wald test define R = [ 0 0 1 ], Q = [ 1 ], and compute the negative Hessian matrix based on numerical derivatives at the unrestricted parameter estimates, β̂1, −HT (θ) = 0.000 2.435 6.145 2.435 48385.997 124422.745 6.145 124422.745 320409.562 . The Wald statistic is W = T [R β̂1 −Q]′[R (−H−1T (θ̂1))R′]−1[Rβ̂1 −Q] = 64.280 . 216 Nonlinear Regression Models The p-value of the Wald test statistic obtained from the χ21 distribution is 0.000, once again showing that the restriction is strongly rejected at con- ventional significance levels. To perform a LM test, the gradient vector of the unrestricted model eval- uated at the restricted parameter estimates, β̂0, is GT (β̂0) = [ 0.000 0.000 2.810 ] ′ , and the outer product of gradients matrix is JT (β̂0) = 0.000 0.625 5.257 0.625 4727.411 40412.673 5.257 40412.673 345921.880 . The LM statistic is LM = TG′T (β̂0)J −1 T (β̂0)GT (β̂0) = 39.908 , which, from the χ21 distribution, has a p-value of 0.000 showing that the restriction is still strongly rejected. Example 6.9 Constant Marginal Propensity to Consume The nonlinear consumption function used in Examples 6.7 and 6.8 has a marginal propensity to consume (MPC) given by MPC = dct dyt = β1β2y β2−1 t , whose value depends on the value of income, yt, at which it is measured. Testing the restriction that the MPC is constant and does not depend on yt involves testing the hypotheses H0 : β2 = 1 H1 : β2 6= 1. Define Q = 0 and C(β) = β1β2y β2 t − β1 D(β) = ∂C(β) ∂β = [ 0 β2y β2−1 t − 1 β1yβ2−1t (1 + β2 ln yt) ]′ , then from Chapter 4 the general form of the Wald statistic in the case of nonlinear restrictions is W = T [C(β̂)−Q]′[D(β̂) Ω̂D(β̂)′]−1[C(β̂)−Q] , where it is understood that all terms are to be evaluated at the unrestricted maximum likelihood estimates. This statistic is asymptotically distributed as χ21 under the null hypothesis and large values of the test statistic constitute 6.5 Testing 217 rejection of the null hypothesis. The test can be performed for each t or it can be calculated for a typical value of yt, usually the sample mean. The LM test has a convenient form for nonlinear regression models because of the assumption of normality. To demonstrate this feature, consider the standard LM statistic, discussed in Chapter 4, which has the form LM = TG′T (β̂)I −1(β̂)GT (β̂) , (6.27) where all terms are evaluated at the restricted parameter estimates. Under the null hypothesis, this statistic is distributed asymptotically as χ2M where M is the number of restrictions. 
From the expression for GT (β) and I(β) in (6.19) and (6.20), respectively, the LM statistic is LM = [ 1 σ̂2 T∑ t=1 ztut ]′[ 1 σ̂2 T∑ t=1 ztz ′ t ]−1[ 1 σ̂2 T∑ t=1 ztut ] = 1 σ̂2 [ T∑ t=1 ztut ]′[ T∑ t=1 ztz ′ t ]−1[ T∑ t=1 ztut ] = TR2, (6.28) where all quantities are evaluated under H0, ut = yt − µ(xt; β̂) zt = − ∂ut ∂β ∣∣∣∣ β=β̂ σ̂2 = 1 T T∑ t=1 (yt − µ(xt; β̂))2, and R2 is the coefficient of determination obtained by regressing ut on zt. The LM test in (6.28) is implemented by means of two linear regressions. The first regression estimates the constrained model. The second or auxil- iary regression requires regressing ut on zt, where all of the quantities are evaluated at the constrained estimates. The test statistic is LM = TR2, where R2 is the coefficient of determination from the auxiliary regression. The implementation of the LM test in terms of two linear regressions is revisited in Chapters 7 and 8. Example 6.10 Nonlinear Consumption Function Example 6.9 uses a Wald test to test for a constant marginal propensity to consume in a nonlinear consumption function. To perform an LM test of the same restriction, the following steps are required. 218 Nonlinear Regression Models Step 1: Write the model in terms of ut ut = ct − β0 − β1yβ2t . Step 2: Compute the following derivatives z1,t = − ∂ut ∂β0 = 1 , z2,t = − ∂ut ∂β1 = yβ2t , z3,t = − ∂ut ∂β2 = β1y β2 t ln(yt) . Step 3: Estimate the restricted model ct = β0 + β1yt + ut, by regressing ct on a constant and yt to generate the restricted esti- mates β̂0 and β̂1. Step 4: Evaluate ut at the restricted estimates ût = ct − β̂0 − β̂1 yt . Step 5: Evaluate the derivatives at the constrained estimates z1,t = 1 , z2,t = yt , z3,t = β̂0 yt ln(yt) . Step 6: Regress ût on {z1,t, z2,t, z3,t} and compute R2 from this regression. Step 7: Evaluate the test statistic, LM = TR2. This statistic is asymp- totically distributed as χ21 under the null hypothesis. Large values of the test statistic constitute rejection of the null hypothesis. Notice that the strength of the nonlinearity in the consumption function is determined by the third term in the auxiliary regression in Step 6. If no significant nonlinearity exists, this term should not add to the explanatory power of this regression equation. If the nonlinearity is significant, then it acts as an excluded variable which manifests itself through a non-zero value of R2. 6.5.2 Nonnested Tests Two models are nonnested if one model cannot be expressed as a subset of the other. While a number of procedures have been developed to test nonnested models, in this application a maximum likelihood approach is 6.5 Testing 219 discussed following Vuong (1989). The basic idea is to convert the likelihood functions of the two competing models into a common likelihood function using the transformation of variable technique and perform a variation of a LR test. Example 6.11 Vuong’s Test Applied to U.S. Money Demand Consider the following two alternative money demand equations Model 1: mt = β0 + β1rt + β2yt + u1,t , u1,t ∼ iidN(0, σ 2 1), Model 2: lnmt = α0 + α1 ln rt + α2 ln yt + u2,t , u2,t ∼ iidN(0, σ 2 2) , where mt is real money, yt is real income, rt is the nominal interest rate and θ1 = {β0, β1, β2, σ21} and θ2 = {α0, α1, α2, σ22} are the unknown parameters of the two models, respectively. The models are not nested since one model cannot be expressed as a subset of the other. 
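Before turning to the nonnested comparison, the two-regression form of the LM test set out in Example 6.10 can be sketched in a few lines. The code again uses the simulated consumption and income series from the Gauss-Newton sketch above in place of the actual data, so the value of the statistic is illustrative only.

% Two-regression LM test of H0: b2 = 1 (Example 6.10) on the simulated data.
X0    = [ones(T,1) yt];
b0hat = X0\ct;                                   % Step 3: restricted (linear) estimates
uhat  = ct - X0*b0hat;                           % Step 4: restricted residuals
z     = [ones(T,1), yt, b0hat(2)*yt.*log(yt)];   % Step 5: derivatives of Step 2 at the
                                                 %         restricted estimates
e     = uhat - z*((z'*z)\(z'*uhat));             % Step 6: auxiliary regression residuals
R2    = 1 - (e'*e)/sum((uhat - mean(uhat)).^2);  %         coefficient of determination
LM    = T*R2;                                    % Step 7: chi-squared(1) under H0
disp(LM)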
Another way to view this problem is to observe that Model 1 is based on the distribution ofmt whereas Model 2 is based on the distribution of lnmt, f1(mt) = 1√ 2πσ21 exp [ −(mt − β0 − β1rt − β2yt) 2 2σ21 ] f2(lnmt) = 1√ 2πσ22 exp [ −(lnmt − α0 − α1 ln rt − α2 ln yt) 2 2σ22 ] . To enable the comparison of the two models, use the transformation of variable technique to convert the distribution f2 into a distribution of the level of mt. Formally this link between the two distributions is given by f1(mt) = f2(lnmt) ∣∣∣∣ d lnmt dmt ∣∣∣∣ = f2(lnmt) ∣∣∣∣ 1 mt ∣∣∣∣ , which allows the log-likelihood functions of the two models to be compared. The steps to perform the test are as follows. Step 1: Estimate Model 1 by regressing mt on {c, rt, yt} and construct the log-likelihood function at each observation ln l1,t(θ̂1) = − 1 2 ln(2π)− 1 2 ln(σ̂21)− (mt − β̂0 − β̂1rt − β̂2yt)2 2σ̂21 . Step 2: Estimate Model 2 by regressing lnmt on {c, ln rt, ln yt} and con- struct the log-likelihood function at each observation for mt by using ln l2,t(θ̂2) = − 1 2 ln(2π)− 1 2 ln(σ̂22)− (lnmt − α̂0 − α̂1 ln rt − α̂2 ln yt)2 2σ̂22 − lnmt . 220 Nonlinear Regression Models Step 3: Compute the difference in the log-likelihood functions of the two models at each observation dt = ln l1,t(θ̂1)− ln l2,t(θ̂2) . Step 4: Construct the test statistic V = √ T d s , where d = 1 T T∑ t=1 dt, s 2 = 1 T T∑ t=1 (dt − d)2 , are the mean and the variance of dt, respectively. Step 5: Using the result in Vuong (1989), the statistic V is asymptotically normally distributed under the null hypothesis that the two models are equivalent V d→ N(0, 1) . The nonnested money demand models are estimated using quarterly data for the U.S. on real money,mt, the nominal interest rate, rt, and real income, yt, for the period 1959 to 2005. The estimates of Model 1 are m̂t = 7.131 + 7.660 rt + 0.449 yt. The estimates of Model 2 are l̂nmt = 0.160 + 0.004 ln rt + 0.829 ln yt. The mean and variance of dt are, respectively, d = −0.159 s2 = 0.054, yielding the value of the test statistic V = √ T d s = √ 188 −0.159√ 0.054 = −9.380. Since the p-value of the statistic obtained from the standard normal distri- bution is 0.000, the nullhypothesis that the models are equivalent represen- tations of money demand is rejected at conventional significance levels. The statistic being negative suggests that Model 2 is to be preferred because it has the higher value of log-likelihood function at the maximum likelihood estimates. 6.6 Applications 221 6.6 Applications Two applications are discussed in this section, both focussing on relaxing the assumption of normal disturbances in the nonlinear regression model. The first application is based on the capital asset pricing model (CAPM). A fat- tailed distribution is used to model outliers in the data and thus avoid bias in the parameter estimates of a regression model based on the assumption of normally distributed disturbances. The second application investigates the stochastic frontier model where the disturbance term is specified as a mixture of normal and non-normal terms. 6.6.1 Robust Estimation of the CAPM One way to ensure that parameter estimates of the nonlinear regression model are robust to the presence of outliers is to use a heavy-tailed dis- tribution such as the Student t distribution. This is a natural approach to modelling outliers since, by definition, an outlier represents an extreme draw from the tails of the distribution. 
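Returning briefly to Example 6.11, the five steps of Vuong's test are easily sketched in MATLAB. The money, interest rate and income series used here are simulated placeholders rather than the U.S. data, so the statistic will not match the value of −9.380 reported above; the point is the mechanics, in particular the Jacobian term −ln mt in the second log-likelihood.

% Vuong's nonnested test (Example 6.11, Steps 1-5) on simulated placeholder data.
T  = 188;
yt = exp(0.01*(1:T)' + 0.1*randn(T,1));             % artificial real income
rt = exp(-2 + 0.2*randn(T,1));                      % artificial interest rate
mt = exp(0.2 + 0.8*log(yt) + 0.05*randn(T,1));      % artificial real money (log-linear)

X1 = [ones(T,1) rt yt];            b1 = X1\mt;      % Step 1: Model 1 in levels
e1 = mt - X1*b1;                   s1 = mean(e1.^2);
l1 = -0.5*log(2*pi) - 0.5*log(s1) - e1.^2/(2*s1);

X2 = [ones(T,1) log(rt) log(yt)];  a2 = X2\log(mt); % Step 2: Model 2 in logs
e2 = log(mt) - X2*a2;              s2 = mean(e2.^2);
l2 = -0.5*log(2*pi) - 0.5*log(s2) - e2.^2/(2*s2) - log(mt);  % Jacobian term -ln(mt)

d  = l1 - l2;                                       % Step 3
V  = sqrt(T)*mean(d)/std(d,1);                      % Step 4: std(d,1) divides by T
disp(V)                                             % Step 5: compare with N(0,1)

Because the simulated money series is generated from a log-linear relationship, the statistic will typically be negative, favouring Model 2 and mirroring the conclusion drawn from the actual data.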
The general idea is that the additional pa- rameters of the heavy-tailed distribution capture the effects of the outliers and thereby help reduce any potential contamination of the parameter esti- mates that may arise from these outliers. The approach can be demonstrated by means of the capital asset pricing model rt = β0 + β1mt + ut , ut ∼ N(0, σ 2), where rt is the return on the i th asset relative to a risk-free rate and mt is the return on the market portfolio relative to a risk-free rate. The parameter β1 is of importance in finance because it provides a measure of the risk of the asset. Outliers in the data can properly be accounted for by specifying the model as rt = β0 + β1mt + σ √ ν − 2 ν vt , (6.29) where the disturbance term vt now has a Student-t distribution given by f(vt) = Γ ( ν + 1 2 ) √ πν Γ (ν 2 ) ( 1 + v2t ν )−(ν+1)/2 , where ν is the degrees of freedom parameter and Γ(·) is the Gamma function. The term σ √ (ν − 2)/ν in equation (6.29) ensures that the variance of rt is σ2, because the variance of a Student t distribution is ν/(ν − 2). 222 Nonlinear Regression Models The transformation of variable technique reveals that the distribution of rt is f(rt) = f(vt) ∣∣∣∣ dvt drt ∣∣∣∣ = Γ ( ν + 1 2 ) √ πν Γ (ν 2 ) ( 1 + v2t ν )−(ν+1)/2 ∣∣∣∣ 1 σ √ ν ν − 2 ∣∣∣∣ . The log-likelihood function at observation t is therefore ln lt(θ) = ln Γ ( ν + 1 2 ) √ πν Γ (ν 2 ) − ν + 1 2 ln ( 1 + v2t ν ) − lnσ + ln √ ν ν − 2 . The parameters θ = {β0, β1, σ2, ν} are estimated by maximum likelihood using one of the iterative algorithms discussed in Section 6.3. As an illustration, consider the monthly returns on the company Martin Marietta, over the period January 1982 to December 1986, taken from But- ler, McDonald, Nelson and White (1990, pp.321-327). A scatter plot of the data in Figure 6.2 suggests that estimation of the CAPM by least squares may yield an estimate of β1 that is biased upwards as a result of the outlier in rt where the monthly excess return of the asset in one month is 0.688. mt r t -0.1 -0.05 0 0.05 0.1 0.15 -0.2 -0.1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Figure 6.2 Scatter plot of the monthly returns on the company Martin Marietta and return on the market index, both relative to the risk free rate, over the period January 1982 to December 1986. The results of estimating the CAPM by maximum likelihood assuming normal disturbances, are r̂t = 0.001 + 1.803 mt, 6.6 Applications 223 Table 6.1 Maximum likelihood estimates of the robust capital asset pricing model. Standard errors based on the inverse of the Hessian. Parameter Estimate Std error t-stat. β0 -0.007 0.008 -0.887 β1 1.263 0.190 6.665 σ2 0.008 0.006 1.338 ν 2.837 1.021 2.779 where the estimates are obtained by simply regressing rt on a constant and mt. The estimate of 1.803 suggests that this asset is very risky relative to the market portfolio since on average changes in the asset returns amplify the contemporaneous movements in the market excess returns, mt. A test of the hypothesis that β1 = 1, provides a test that movements in the returns on the asset mirror the market one-to-one. The Wald statistic is W = ( 1.803 − 1 0.285 )2 = 7.930. The p-value of the statistic obtained from the χ21 distribution is 0.000, show- ing strong rejection of the null hypothesis. The maximum likelihood estimates of the robust version of the CAPM model are given in Table 6.1. The estimate of β1 is now 1.263, which is much lower than the OLS estimate of 1.803. A Wald test of the hypothesis that β1 = 1 now yields W = ( 1.263 − 1 0.190 )2 = 1.930. 
The p-value is 0.164 showing that the null hypothesis that the asset tracks the market one-to-one fails to be rejected. The use of the Student-t distribution to model the outlier has helped to reduce the effect of the outlier on the estimate of β1. The degrees of freedom parameter estimate of ν̂ = 2.837 shows that the tails of the distribution are indeed very fat, with just the first two moments of the distribution existing. Another approach to estimate regression models that are robust to outliers is to specify the distribution as the Laplace distribution, also known as the 224 Nonlinear Regression Models double exponential distribution f(yt; θ) = 1 2 exp [− |yt − θ|] . To estimate the unknown parameter θ, for a sample of size T , the log- likelihood function is lnLT (θ) = 1 T T∑ t=1 f (yt; θ) = − ln(2) − 1 T T∑ t=1 |yt − θ| . In contrast to the log-likelihood functions dealt with thus far, this function is not differentiable everywhere. However, the maximum likelihood estimator can still be derived, which is given as the median of the data (Stuart and Ord, 1999, p. 59) θ̂ = median (yt) . This result is a reflection of the well-known property that the median is less affected by outliers than is the mean. A generalization of this result forms the basis of the class of estimators known as M-estimators and quantile regression. 6.6.2 Stochastic Frontier Models In stochastic frontier models the disturbance term ut of a regression model is specified as a mixture of two random disturbances u1,t and u2,t. The most widely used application of this model is in production theory where the production process is assumed to be affected by two types of shocks (Aigner, Lovell and Schmidt, 1977), namely, (1) idiosyncratic shocks, u1,t, which are either positive or negative; and (2) technological shocks, u2,t, which are either zero or negative, with a zero (negative) shock representing the production function operates ef- ficiently (inefficiently). Consider the stochastic frontier model yt = β0 + β1xt + ut ut = u1,t − u2,t , (6.30) 6.6 Applications 225 where ut is a composite disturbance term with independent components, u1,t and u2,t, with respective distributions f (u1) = 1√ 2πσ21 exp [ − u 2 1 2σ21 ] , −∞ < u1 <∞ , [Normal] f (u2) = 1 σ2 exp [ − u2 σ2 ] , 0 ≤ u2 <∞ . [Exponential] (6.31) The distribution of ut has support on the real line (−∞,∞), but the effect of −u2,t is to skew the normal distribution to the left as highlighted in Figure 6.3. The strength of the asymmetry is controlled by the parameter σ2 in the exponential distribution. f (u ) u -10 -5 0 5 0 0.1 0.2 0.3 Figure 6.3 Stochastic frontier disturbance distribution as given by expres- sion (6.36) based on a mixture of N(0, σ21) with standard deviation σ1 = 1 and exponential distribution with standard deviation σ2 = 1.5. To estimate the parameters θ = {β0, β1, σ1, σ2} in (6.30) and (6.31) by maximum likelihood it is necessary to derive the distribution of yt from ut. Since ut is a mixture distribution oftwo components, its distribution is derived from the joint distribution of u1,t and u2,t using the change of variable technique. However, because the model consists of mapping two random variables, u1,t and u2,t, into one random variable ut, it is necessary to choose an additional variable, vt, to fill out the mapping for the Jaco- bian to be nonsingular. Once the joint distribution of (ut, vt) is derived, the marginal distribution of ut is obtained by integrating the joint distribution with respect to vt. 
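Before deriving this marginal distribution analytically, its shape can be previewed by simulation: draws of the two components are combined directly and the histogram of their difference displays the left skewness of Figure 6.3. The sketch below uses the parameter values from that figure; the exponential draws are generated by inverting the exponential cumulative distribution function, which requires only uniform random numbers.

% Simulation preview of the composite disturbance u = u1 - u2
% with u1 ~ N(0, sigma1^2) and u2 exponential, as in Figure 6.3.
n      = 100000;
sigma1 = 1.0;  sigma2 = 1.5;
u1 = sigma1*randn(n,1);
u2 = -sigma2*log(rand(n,1));                 % exponential draws with mean sigma2
u  = u1 - u2;
histogram(u,100,'Normalization','pdf')       % left-skewed density, cf. Figure 6.3

The analytical form of this density is now derived.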
Let u = u1 − u2 , v = u1 , (6.32) 226 Nonlinear Regression Models where the t subscript is excluded for convenience. To derive the Jacobian rearrange these equations as u1 = v , u2 = v − u , (6.33) so the Jacobian is |J | = ∣∣∣∣∣∣∣∣ ∂u1 ∂u ∂u1 ∂v ∂u2 ∂u ∂u2 ∂v ∣∣∣∣∣∣∣∣ = ∣∣∣∣ 0 1 −1 1 ∣∣∣∣ = |1| = 1 . Using the property that u1,t and u2,t are independent and |J | = 1, the joint distribution of (u, v) is g (u, v) = |J | f (u1) f (u2) = |1| 1√ 2πσ21 exp [ − u 2 1 2σ21 ] × 1 σ2 exp [ − u2 σ2 ] = 1√ 2πσ21 1 σ2 exp [ − u 2 1 2σ21 − u2 σ2 ] . (6.34) Using the substitution u1 = v and u2 = v − u, the term in the exponent is − v 2 2σ21 − v − u σ2 = − v 2 2σ21 − v σ2 + u σ2 = − ( v + σ21 σ2 )2 2σ21 + σ21 2σ22 + u σ2 , where the last step is based on completing the square. Placing this expression into (6.34) and rearranging gives the joint probability density g (u, v) = 1 σ2 exp [ σ21 2σ22 + u σ2 ] 1√ 2πσ21 exp − ( v + σ21 σ2 )2 2σ21 . (6.35) To derive the marginal distribution of u, as v = u1 = u+ u2 and remem- bering that u2 is positive, the range of integration of v is (u,∞) because Lower: u2 = 0 ⇒ v = u , Upper: u2 > 0 ⇒ v > u . 6.6 Applications 227 The marginal distribution of u is now given by integrating out v in (6.35) g(u) = ∫ ∞ u g(u, v)dv = 1 σ2 exp [ σ21 2σ22 + u σ2 ] ∞∫ u 1√ 2πσ21 exp − ( v + σ21 σ2 )2 2σ21 dv = 1 σ2 exp [ σ21 2σ22 + u σ2 ] 1− Φ u+ σ21 σ2 σ1 = 1 σ2 exp [ σ21 2σ22 + u σ2 ] Φ − u+ σ21 σ2 σ1 , (6.36) where Φ(·) is the cumulative normal distribution function and the last step follows from the symmetry property of the normal distribution. Finally, the distribution in terms of y conditional on xt is given by using (6.30) to substitute out u in (6.36) g(y|xt) = 1 σ2 exp [ σ21 2σ22 + (y − β0 − β1xt) σ2 ] Φ − yt − β0 − β1xt − σ21 σ2 σ1 . (6.37) Using expression (6.37) the log-likelihood function for a sample of T ob- servations is lnLT (θ) = 1 T T∑ t=1 ln g(yt|xt) = − lnσ2 + σ21 2σ22 + 1 σ2T T∑ t=1 (yt − β0 − β1xt) + 1 T T∑ t=1 ln Φ − yt − β0 − β1xt − σ21 σ2 σ1 . This expression is nonlinear in the parameter θ and can be maximized using an iterative algorithm. 228 Nonlinear Regression Models A Monte Carlo experiment is performed to investigate the properties of the maximum likelihood estimator of the stochastic frontier model in (6.30) and (6.31). The parameters are θ0 = {β0 = 1, β1 = 0.5, σ1 = 1.0, σ2 = 1.5}, the explanatory variable is xt ∼ iidN(0, 1), the sample size is T = 1000 and the number of replications is 5000. The dependent variable, yt, is simulated using the inverse cumulative density technique. This involves computing the cumulative density function of u from its marginal distribution in (6.37) for a grid of values of u ranging from −10 to 5. Uniform random variables are then drawn to obtain draws of ut which are added to β0 + β1xt to obtain a draw of yt. Table 6.2 Bias and mean square error (MSE) of the maximum likelihood estimator of the stochastic frontier model in (6.30) and (6.31). Based on samples of size T = 1000 and 5000 replications. Parameter True Mean Bias MSE β0 1.0000 0.9213 -0.0787 0.0133 β1 0.5000 0.4991 -0.0009 0.0023 σ1 1.0000 1.0949 0.0949 0.0153 σ2 1.5000 1.3994 -0.1006 0.0184 The results of the Monte Carlo experiment are given in Table 6.2 which reports the bias and mean square error, respectively, for each parameter. The estimate of β0 is biased downwards by about 8% while the slope estimate of β1 exhibits no bias at all. 
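The inverse cumulative density draws used in this experiment are easily sketched. The fragment below is a single replication evaluated at the true parameter values rather than the full Monte Carlo: it draws ut from (6.36) over the grid from −10 to 5 and evaluates the average log-likelihood, whereas the experiment itself maximises this function with an iterative algorithm and repeats the exercise over 5000 replications. The standard normal cdf is written with erfc so that only base MATLAB functions are required.

% One replication of the stochastic frontier experiment: draw u_t from (6.36)
% by the inverse cumulative density technique and evaluate the log-likelihood.
T  = 1000;  b0 = 1;  b1 = 0.5;  s1 = 1.0;  s2 = 1.5;
Phi = @(x) 0.5*erfc(-x/sqrt(2));                       % standard normal cdf
g   = @(u) (1/s2)*exp(s1^2/(2*s2^2) + u/s2).*Phi(-(u + s1^2/s2)/s1);   % (6.36)

ugrid = linspace(-10,5,2000)';
cdf   = cumtrapz(ugrid,g(ugrid));  cdf = cdf/cdf(end); % numerical cdf on the grid
[cdfu,idx] = unique(cdf);                              % interp1 needs distinct abscissae
ut = interp1(cdfu,ugrid(idx),rand(T,1));               % inverse cdf draws of u_t

xt = randn(T,1);
yt = b0 + b1*xt + ut;                                  % simulated dependent variable
e  = yt - b0 - b1*xt;
lnL = mean(-log(s2) + s1^2/(2*s2^2) + e/s2 + log(Phi(-(e + s1^2/s2)/s1)));
disp(lnL)                                              % average log-likelihood at theta_0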
The estimates of the standard deviations exhibit bias in different directions with the estimate of σ1 biased upwards and the estimate of σ2 biased downwards. 6.7 Exercises (1) Simulating Exponential Models Gauss file(s) nls_simulate.g Matlab file(s) nls_simulate.m Simulate the following exponential models y1,t = β0 exp [β1xt] + ut y2,t = β0 exp [β1xt + ut] , 6.7 Exercises 229 for a sample size of T = 50, where the explanatory variable and the disturbance term are, respectively, ut ∼ iidN(0, σ 2) , xt ∼ t, t = 0, 1, 2, · · · Set the parameters to be β0 = 1.0, β1 = 0.05, and σ = 0.5. Plot the series and compare their time-series properties. (2) Estimating the Exponential Model by Maximum Likelihood Gauss file(s) nls_exponential.g Matlab file(s) nls_exponential.m Simulate the model yt = β0 exp [β1xt] + ut , ut ∼ iidN(0, σ 2) , for a sample size of T = 50, where the explanatory variable, the distur- bance term and the parameters are as defined in Exercise 1. (a) Use the Newton-Raphson algorithm to estimate the parameters θ = {β0, β1, σ2}, by concentrating out σ2. Choose as starting values β0 = 0.1 and β1 = 0.1. (b) Compute the standard errors of β̂0 and β̂1 based on the Hessian. (c) Estimate the parameters of the model without concentrating the log-likelihood function with respect to σ2 and compute the standard errors of β̂0, β̂1 and σ̂ 2, based on the Hessian. (3) Estimating the Exponential Model by Gauss-Newton Gauss file(s) nls_exponential_gn.g Matlab file(s) nls_exponential_gn.m Simulate the model yt = β0 exp [β1xt] + ut , ut ∼ iidN(0, σ 2) , for a sample size of T = 50, where the explanatory variable and the disturbance term and the parameters are as defined in Exercise 1. (a) Use the Gauss-Newton algorithm to estimate the parameters θ = {β0, β1, σ2}. Choose as starting values β0 = 0.1 and β1 = 0.1. (b) Compute the standard errors of β̂0 and β̂1 and compare these esti- mates with those obtained using the Hessian in Exercise 2. 230 Nonlinear Regression Models (4) Nonlinear Consumption Function Gauss file(s) nls_conest.g, nls_contest.g Matlab file(s) nls_conest.m, nls_contest.m This exercise is based on U.S. quarterly data for real consumption ex- penditure and real disposable personal income for the period 1960:Q1 to 2009:Q4, downloaded from the Federal Reserve Bank of St. Louis. Consider the nonlinear consumption function ct = β0 + β1 y β2 t + ut , ut ∼ iidN(0, σ 2) . (a) Estimate a linear consumption function by setting β2 = 1. (b) Estimate the unrestricted nonlinear consumption function using the Gauss-Newton algorithm. Choose the linear parameter estimates computed in part (a) for β0 and β1 and β2 = 1 as the starting values. (c) Test the hypotheses H0 : β2 = 1 H1 : β2 6= 1, using a LR test, a Wald test and a LM test. (5) Nonlinear Regression Consider the nonlinear regression model yβ2t = β0 + β1xt + ut , ut ∼ iidN(0, 1) . (a) Write down the distributions of ut and yt. (b) Show how you would estimate this model’s parameters by maximum likelihood using: (i) the Newton-Raphson algorithm; and (ii) the BHHH algorithm. (c) Briefly discuss why the Gauss-Newton algorithm is not appropriate in this case. (d) Construct a test of the null hypothesis β2 = 1, using: (i) a LR test; (ii) a Wald test; (iii) a LM test with the information matrix based on the outer prod- uct of gradients; and (iv) a LM test based on two linear regressions. 
6.7 Exercises 231 (6) Vuong’s Nonnested Test of Money Demand Gauss file(s) nls_money.g Matlab file(s) nls_money.m This exercise is based on quarterly data for the U.S. on real money, mt, the nominal interest rate, rt, and real income, yt, for the period 1959 to 2005. Consider the following nonnested money demand equations Model 1: mt = β0 + β1rt + β2yt + u1,t u1,t ∼ iidN(0,σ 2 1) Model 2: lnmt = α0 + α1 ln rt + α2 ln yt + u2,t u2,t ∼ iidN(0, σ 2 2). (a) Estimate Model 1 by regressing mt on {c, rt, yt} and construct the log-likelihood at each observation ln l1,t = − 1 2 ln(2π)− 1 2 ln(σ̂21)− (mt − β̂0 − β̂1rt − β̂2yt)2 2σ̂21 . (b) Estimate Model 2 by regressing lnmt on {c, ln rt, ln yt} and con- struct the log-likelihood function of the transformed distribution at each observation ln l2,t = − 1 2 ln(2π) − 1 2 ln(σ̂22)− (lnmt − α̂0 − α̂1 ln rt − α̂2 ln yt)2 2σ̂22 − lnmt . (c) Perform Vuong’s nonnested test and interpret the result. (7) Robust Estimation of the CAPM Gauss file(s) nls_capm.g Matlab file(s) nls_capm.m This exercise is based on monthly returns data on the company Martin Marietta from January 1982 to December 1986. The data are taken from Butler et. al. (1990, pp.321-327). (a) Identify any outliers in the data by using a scatter plot of rt against mt. 232 Nonlinear Regression Models (b) Estimate the following CAPM model rt = β0 + β1mt + ut , ut ∼ iidN(0, σ 2) , and interpret the estimate of β1. Test the hypothesis that β1 = 1. (c) Estimate the following CAPM model rt = β0 + β1mt + σ √ ν − 2 ν vt , vt ∼ Student t(0, ν) , and interpret the estimate of β1. Test the hypothesis that β1 = 1. (d) Compare the parameter estimates of {β0, β1} in parts (b) and (c) and discuss the robustness properties of these estimates. (e) An alternative approach to achieving robustness is to exclude any outliers from the data set and re-estimate the model by OLS using the trimmed data set. A common way to do this is to compute the standardized residual zt = ût s2 diag(I −X(X ′X)−1X ′) , where ût is the least squares residual using all of the data and s 2 is the residual variance. The standardized residual is approximately distributed as N(0, 1), with absolute values in excess of 3 represent- ing extreme observations. Compare the estimates of {β0, β1} using the trimmed data approach with those obtained in parts (b) and (c). Hence discuss the role of the degrees of freedom parameter ν in achieving robust parameter estimates to outliers. (f) Construct a Wald test of normality based on the CAPM equation assuming Student t errors. (8) Stochastic Frontier Model Gauss file(s) nls_frontier.g Matlab file(s) nls_frontier.m The stochastic frontier model is yt = β0 + β1xt + ut ut = u1,t − u2,t , where u1,t and u2,t are distributed as normal and exponential as defined in (6.31), with standard deviations σ1 and σ2, respectively. 6.7 Exercises 233 (a) Use the change of variable technique to show that g(u) = 1 σ2 exp [ σ21 2σ22 + u σ2 ] Φ − u+ σ21 σ2 σ1 . Plot the distribution and discuss its shape. (b) Choose the parameter values θ0 = {β0 = 1, β1 = 0.5, σ1 = 1.0, σ2 = 1.5}. Use the inverse cumulative density technique to simulate ut, by computing its cumulative density function from its marginal dis- tribution in part (a) for a grid of values of ut ranging from −10 to 5 and then drawing uniform random numbers to obtain draws of ut. 
(c) Investigate the sampling properties of the maximum likelihood es- timator using a Monte Carlo experiment based on the parameters in part (b), xt ∼ N(0, 1), with T = 1000 and 5000 replications. (d) Repeat parts (a) to (c) where now the disturbance is ut = u1,t+u2,t with density function g (u) = 1 σ2 exp [ σ21 2σ22 − u σ2 ] Φ u− σ 2 1 σ2 σ1 . (e) Let ut = u1,t − u2,t, where ut is normal but now u2,t is half-normal f (u2) = 2√ 2πσ22 exp [ − u 2 2 2σ22 ] , 0 ≤ u2 <∞ . Repeat parts (a) to (c) by defining σ2 = σ21 + σ 2 2 and λ = σ2/σ1, hence show that g (u) = √ 2 π 1 σ exp [ − u 2 2σ2 ] Φ ( −uλ σ ) . 7 Autocorrelated Regression Models 7.1 Introduction An important feature of the regression models presented in Chapters 5 and 6 is that the disturbance term is assumed to be independent across time. This assumption is now relaxed and the resultant models are referred to as autocorrelated regression models. The aim of this chapter is to use the maximum likelihood framework set up in Part ONE to estimate and test autocorrelated regression models. The structure of the autocorrelation may be autoregressive, moving average or a combination of the two. Both single equation and multiple equation models are analyzed. Significantly, the maximum likelihood estimator of the autocorrelated re- gression model nests a number of other estimators, including conditional maximum likelihood, Gauss-Newton, Zig-zag algorithms and the Cochrane- Orcutt procedure. Tests of autocorrelation are derived in terms of the LR, Wald and LM tests set out in Chapter 4. In the case of LM tests of autocor- relation, the statistics are shown to be equivalent to a number of diagnostic test statistics widely used in econometrics. 7.2 Specification In Chapter 5, the focus is on estimating and testing linear regression models of the form yt = β0 + β1xt + ut , (7.1) where yt is the dependent variable, xt is the explanatory variable and ut is the disturbance term assumed to be independently and identically distributed. For a sample of t = 1, 2, · · · , T observations, the joint density function of 7.2 Specification 235 this model is f(y1, y2, . . . yT |x1, x2, . . . xT ; θ) = T∏ t=1 f(yt |xt; θ) , (7.2) where θ is the vector of parameters to be estimated. The assumption that ut in (7.1) is independent is now relaxed by augment- ing the model to include an equation for ut that is a function of information at time t−1. Common parametric specifications of the disturbance term are the autoregressive (AR) models and moving average (MA) models 1. AR(1) : ut = ρ1ut−1 + vt 2. AR(p) : ut = ρ1ut−1 + ρ2ut−2 + · · ·+ ρput−p + vt 3. MA(1) : ut = vt + δ1vt−1 4. MA(q) : ut = vt + δ1vt−1 + δ2vt−2 + · · · + δqvt−q 5. ARMA(p,q) : ut = ∑p i=1 ρiut−i + vt + ∑q i=1 δivt−i , where vt is independently and identically distributed with zero mean and constant variance σ2. A characteristic of autocorrelated regression models is that a shock at time t, as represented by vt, has an immediate effect on yt and continues to have an effect at times t + 1, t + 2, etc. This suggests that the conditional mean in equation (7.1), β0 + β1xt, underestimates y for some periods and overestimates it for other periods. 
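This persistence is easily seen in the sample autocorrelations of simulated disturbances. The short sketch below, which anticipates the parameter values of the example that follows, generates AR(1) and MA(1) disturbances and compares their autocorrelations at the first few lags: the AR(1) correlations die away slowly, while the MA(1) correlations are essentially zero beyond the first lag.

% Sample autocorrelations of simulated AR(1) and MA(1) disturbances
% (rho1 = 0.95, delta1 = 0.95, sigma = 3).
T   = 200;  v = 3*randn(T,1);
uar = filter(1,[1 -0.95],v);              % u_t = 0.95*u_{t-1} + v_t
uma = filter([1 0.95],1,v);               % u_t = v_t + 0.95*v_{t-1}
acf = @(u,k) sum((u(1+k:end)-mean(u)).*(u(1:end-k)-mean(u)))/sum((u-mean(u)).^2);
for k = 1:5
    fprintf('lag %d:  AR(1) %6.3f   MA(1) %6.3f\n', k, acf(uar,k), acf(uma,k));
end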
Example 7.1 A Regression Model with Autocorrelation Figure 7.1 panel (a) gives a scatter plot of simulated data for a sample of T = 200 observations from the following regression model with an AR(1) disturbance term yt = β0 + β1xt + ut ut = ρ1ut−1 + vt vt ∼ iidN(0, σ 2) , with β0 = 2, β1 = 1, ρ1 = 0.95, σ = 3 and the explanatory variable is generated as xt = 0.5t+N(0, 1). For comparative purposes, the conditional mean of yt, β0+β1xt, is also plotted. This figure shows that there are periods when the conditional mean, µt, consistently underestimates yt and other periods when it consistently overestimates yt. A similar pattern, although less pronounced than that observed in panel (a), occurs in Figure 7.1 panel 236 Autocorrelated Regression Models (a) AR(1) Regression Model y t xt (b) MA(1) Regression Model y t xt 40 60 80 100 120 140 16040 60 80 100 120 140 160 20 40 60 80 100 120 140 160 20 40 60 80 100 120 140 160 Figure 7.1 Scatter plots of the simulated data from the regression model with an autocorrelated disturbance. (b), where the disturbance is MA(1) yt = β0 + β1xt + ut ut = vt + δ1vt−1 vt ∼ iidN(0, σ 2) , where xt is as before and β0 = 2, β1 = 1, δ1 = 0.95, σ = 3. 7.3 Maximum Likelihood Estimation From Chapter 1, the joint pdf of y1, y2, . . . , yT dependent observations is f(y1, y2, . . . yT |x1, x2, . . . xT ; θ) = f(ys, ys−1, · · · , y1|xs, xs−1, · · · , x1; θ) × T∏ t=s+1 f(yt| yt−1, · · · , xt, xt−1, · · · ; θ) , (7.3) where θ = {β0, β1, ρ1, ρ2, · · · , ρp, δ1, δ2, · · · , δq, σ2} and s = max(p, q). The first term in equation (7.3) represents the marginal distribution of ys, ys−1, · · · , y1, while thesecond term contains the sequence of conditional distributions of yt. When both terms in the likelihood function in equation (7.3) are used, 7.3 Maximum Likelihood Estimation 237 the estimator is also known as the exact maximum likelihood estimator. By contrast, when only the second term of equation (7.3) is used the estima- tor is known as the conditional maximum likelihood estimator. These two estimators are discussed in more detail below. 7.3.1 Exact Maximum Likelihood From equation (7.3), the log-likelihood function for exact maximum likeli- hood estimation is lnLT (θ) = 1 T ln f(ys, ys−1, · · · , y1|xs, xs−1, · · · , x1; θ) + 1 T T∑ t=s+1 ln f(yt| yt−1, · · · , xt, xt−1, · · · ; θ) , (7.4) that is to be maximised by choice of the unknown parameters θ. The log- likelihood function is normally nonlinear in θ and must be maximised using one of the algorithms presented in Chapter 3. Example 7.2 AR(1) Regression Model Consider the model yt = β0 + β1xt + ut ut = ρ1ut−1 + vt vt ∼ iidN(0, σ 2) , where θ = {β0, β1, ρ1, σ2}. The distribution of v is f(v) = 1√ 2πσ2 exp [ − v 2 2σ2 ] . The conditional distribution of ut for t > 1, is f(ut| ut−1; θ) = f(vt) ∣∣∣∣ dvt dut ∣∣∣∣ = 1√ 2πσ2 exp [ −(ut − ρ1ut−1) 2 2σ2 ] , because |dvt/dut| = 1 and vt = ut − ρ1ut−1. Consequently, the conditional distribution of yt for t > 1 is f(yt| xt, xt−1; θ) = f(ut) ∣∣∣∣ dut dyt ∣∣∣∣ = 1√ 2πσ2 exp [ −(ut − ρ1ut−1) 2 2σ2 ] , because |dut/dyt| = 1, ut = yt − β0 − β1xt and ut−1 = yt−1 − β0 − β1xt−1. To derive the marginal distribution of ut at t = 1, use the result that for the AR(1) model with ut = ρ1ut−1 + vt, where vt ∼ N(0, σ2), the marginal 238 Autocorrelated Regression Models distribution of ut is N(0, σ 2/(1 − ρ21)). 
The marginal distribution of u1 is, therefore, f(u1) = 1√ 2πσ2/(1− ρ21) exp [ − (u1 − 0) 2 2σ2/(1− ρ21) ] , so that the marginal distribution of y1 is f(y1| x1; θ) = f(u1) ∣∣∣∣ du1 dy1 ∣∣∣∣ = 1√ 2πσ2/(1 − ρ21) exp [ −(y1 − β0 − β1x1) 2 2σ2/(1 − ρ21) ] , because |du1/dy1| = 1, and u1 = y1 − β0 − β1x1. It follows, therefore, that the joint probability distribution of yt is f(y1, y2, . . . yT | x1, x2, . . . xT ; θ) = f(y1|x1; θ)× T∏ t=2 f(yt| yt−1, xt, xt−1; θ) , and the log-likelihood function is lnLT (θ) = 1 T ln f(y1|x1; θ) + 1 T T∑ t=2 ln f(yt| yt−1, xt, xt−1; θ) = −1 2 ln(2π) − 1 2 lnσ2 + 1 T ln(1− ρ21)− 1 2T (y1 − β0 − β1x1)2 σ2/(1− ρ21) − 1 2σ2T T∑ t=2 (yt − ρ1yt−1 − β0(1− ρ1)− β1(xt − ρ1xt−1))2 . This expression shows that the log-likelihood function is a nonlinear function of the parameters. 7.3.2 Conditional Maximum Likelihood The maximum likelihood example presented above is for a regression model with an AR(1) disturbance term. Estimation of the regression model with an ARMA(p,q) disturbance term is more difficult, however, since it requires deriving the marginal distribution of f(y1,y2, · · · , ys), where s = max(p, q). One solution is to ignore this term, in which case the log-likelihood function in (7.4) is taken with respect to an average of the log-likelihoods correspond- ing to the conditional distributions from s+ 1 onwards lnLT (θ) = 1 T − s T∑ t=s+1 ln f(yt| yt−1, · · · , xt, xt−1, · · · ; θ) . (7.5) 7.3 Maximum Likelihood Estimation 239 As the likelihood is now constructed by treating the first s observations as fixed, estimates based on maximizing this likelihood are referred to as condi- tional maximum likelihood estimates. Asymptotically the exact and condi- tional maximum likelihood estimators are equivalent because the contribu- tion of ln f(ys, ys−1, · · · , y1| xs, xs−1, · · · , x1; θ) to the overall log-likelihood function vanishes for T → ∞. Example 7.3 AR(2) Regression Model Consider the model yt = β0 + β1xt + ut ut = ρ1ut−1 + ρ2ut−2 + vt vt ∼ N(0, σ 2) . The conditional log-likelihood function is constructed by computing ut = yt − β0 − β1xt , t = 1, 2, · · · , T vt = ut − ρ1ut−1 − ρ2ut−2 , t = 3, 4, · · · , T , where the parameters are replaced by starting values θ(0). The conditional log-likelihood function is then computed as lnLT (θ) = − 1 2 ln(2π)− 1 2 lnσ2 − 1 2σ2(T − 2) T∑ t=3 v2t . In evaluating the conditional log-likelihood function for ARMA(p,q) mod- els, it is necessary to choose starting values for the first q values of vt. A common choice is v1 = v2 = · · · vq = 0. Example 7.4 ARMA(1,1) Regression Model Consider the model yt = β0 + β1xt + ut ut = ρ1ut−1 + vt + δ1vt−1 vt ∼ iidN(0, σ 2) . The conditional log-likelihood is constructed by computing ut = yt − β0 − β1xt , t = 1, 2, · · · , T vt = ut − ρ1ut−1 − δ1vt−1 , t = 2, 3, · · · , T , with v1 = 0 and where the parameters are replaced by starting values θ(0). The conditional log-likelihood function is then lnLT (θ) = − 1 2 ln(2π)− 1 2 lnσ2 − 1 2σ2(T − 1) T∑ t=2 v2t . 240 Autocorrelated Regression Models Example 7.5 Dynamic Model of U.S. Investment This example uses quarterly data for the U.S. from March 1957 to September 2010 to estimate the following model of investment drit−1 = β0 + β1dryt + β2rintt + ut ut = ρ1ut−1 + vt vt ∼ iidN(0, σ 2) , where drit is the quarterly percentage change in real investment, dryt is the quarterly percentage change in real income, rintt is the real inter- est rate expressed as a quarterly percentage, and the parameters are θ ={ β0, β1, β2, ρ1, σ 2 } . 
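Before turning to the investment data, the exact log-likelihood of Example 7.2 can be coded and maximised directly. The sketch below uses simulated data and, purely as a convenience of the sketch, transforms ρ1 and σ2 through tanh and exp so that the numerical search is unconstrained; the examples in the text instead use the Newton-Raphson algorithm with numerical derivatives.

% Exact maximum likelihood for the AR(1) regression of Example 7.2,
% illustrated on simulated data.
T  = 214;
xt = 0.05*(1:T)' + randn(T,1);
yt = 1 + 0.5*xt + filter(1,[1 -0.6],randn(T,1));      % true b0=1, b1=0.5, rho1=0.6, s2=1

% Average exact log-likelihood: marginal of y1 plus conditionals for t = 2,...,T.
nllu  = @(u,rho,s2) -( -0.5*log(2*pi) - 0.5*log(s2) + 0.5*log(1-rho^2) ...
                        - (1-rho^2)*u(1)^2/(2*s2) ...
                        + sum( -0.5*log(2*pi) - 0.5*log(s2) ...
                        - (u(2:end)-rho*u(1:end-1)).^2/(2*s2) ) )/T;
negLT = @(p) nllu(yt - p(1) - p(2)*xt, tanh(p(3)), exp(p(4)));
p0    = [[ones(T,1) xt]\yt; 0; 0];                    % OLS starting values, rho1=0, s2=1
phat  = fminsearch(negLT,p0,optimset('MaxFunEvals',1e4,'MaxIter',1e4));
disp([phat(1); phat(2); tanh(phat(3)); exp(phat(4))]) % b0, b1, rho1, sigma^2

Dropping the contribution of the first observation and averaging the remaining terms over T − 1 gives the conditional log-likelihood of Section 7.3.2; asymptotically the two estimators coincide.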
The sample begins in June 1957 as one observation is lost from constructing the variables, resulting in a sample of size T = 214. The log-likelihood function is constructed by computing ut = drit − β0 − β1dryt − β2rintt , t = 1, 2, · · · , T vt = ut − ρ1ut−1 t = 2, 3, · · · , T , where the parameters are replaced by the starting parameter values θ(0). The log-likelihood function at t = 1 is ln l1(θ) = − 1 2 ln(2π) − 1 2 lnσ2 + 1 2 ln(1− ρ21)− (u1 − 0)2 2σ2/(1 − ρ21) , while for t > 1 it is ln lt(θ) = − 1 2 ln(2π)− 1 2 lnσ2 − v 2 t 2σ2 . The exact maximum likelihood estimates of the investment model are given in Table 7.1 under the heading Exact. The iterations are based on the Newton-Raphson algorithm with all derivatives computed numerically. The standard errors reported are computed using the negative of the inverse of the Hessian. All parameter estimates are statistically significant at the 5% level with the exception of the estimate of ρ1. The conditional maximum likelihood estimates which are also given in Table 7.1, yield qualitatively similar results to the exact maximum likelihood estimates. 7.4 Alternative Estimators Under certain conditions, the maximum likelihood estimator of the auto- correlated regression model nests a number of other estimation methods as special cases. 7.4 Alternative Estimators 241 Table 7.1 Maximum likelihood estimates of the investment model using the Newton-Raphson algorithm with derivatives computed numerically. Standard errors are based on the Hessian. Parameter Exact Conditional Estimate SE t-stat Estimate SE t-stat β0 -0.281 0.157 -1.788 -0.275 0.159 -1.733 β1 1.570 0.130 12.052 1.567 0.131 11.950 β2 -0.332 0.165 -2.021 -0.334 0.165 -2.023 ρ1 0.090 0.081 1.114 0.091 0.081 1.125 σ2 2.219 0.215 10.344 2.229 0.216 10.320 lnLT (θ̂) -1.817 -1.811 7.4.1 Gauss-Newton The exact and conditional maximum likelihood estimators of the autocorre- lated regression model discussed above are presented in terms of the Newton- Raphson algorithm with the derivatives computed numerically. In the case of the conditional likelihood constructing analytical derivatives is straight- forward. As the log-likelihood function is based on the normal distribution, the variance of the disturbance, σ2, can be concentrated out and the non- linearities arising from the contribution of the marginal distribution of y1 are no longer present. Once the Newton-Raphson algorithm is re-expressed in terms of analytical derivatives, it reduces to a sequence of least squares regressions known as the Gauss-Newton algorithm. To motivate the