Econometric Modelling with Time Series: Specification, Estimation and Testing
V. L. Martin, A. S. Hurn and D. Harris

Preface

This book provides a general framework for specifying, estimating and testing time series econometric models. Special emphasis is given to estimation by maximum likelihood, but other methods are also discussed, including quasi-maximum likelihood estimation, generalized method of moments, nonparametrics and estimation by simulation. An important advantage of adopting the principle of maximum likelihood as the unifying framework for the book is that many of the estimators and test statistics proposed in econometrics can be derived within a likelihood framework, thereby providing a coherent vehicle for understanding their properties and interrelationships.

In contrast to many existing econometric textbooks, which deal mainly with the theoretical properties of estimators and test statistics through a theorem-proof presentation, this book is very concerned with implementation issues in order to provide a fast track between the theory and applied work. Consequently, many of the econometric methods discussed in the book are illustrated by means of a suite of programs written in GAUSS and MATLAB®.[1] The computer code emphasizes the computational side of econometrics and follows the notation in the book as closely as possible, thereby reinforcing the principles presented in the text. More generally, the computer code also helps to bridge the gap between theory and practice by enabling the reproduction of both theoretical and empirical results published in recent journal articles. The reader, as a result, may build on the code and tailor it to more involved applications.

[1] GAUSS is a registered trademark of Aptech Systems, Inc. (http://www.aptech.com/) and MATLAB® is a registered trademark of The MathWorks, Inc. (http://www.mathworks.com/).

Organization of the Book

Part ONE of the book is an exposition of the basic maximum likelihood framework. To implement this approach, three conditions are required: the probability distribution of the stochastic process must be known and specified correctly, the parametric specifications of the moments of the distribution must be known and specified correctly, and the likelihood must be tractable. The properties of maximum likelihood estimators are presented and three fundamental testing procedures – namely, the Likelihood Ratio test, the Wald test and the Lagrange Multiplier test – are discussed in detail. There is also a comprehensive treatment of iterative algorithms to compute maximum likelihood estimators when no analytical expressions are available.

Part TWO is the usual regression framework taught in standard econometric courses but presented within the maximum likelihood framework. Both nonlinear regression models and non-spherical models exhibiting either autocorrelation or heteroskedasticity, or both, are presented. A further advantage of the maximum likelihood strategy is that it provides a mechanism for deriving new estimators and new test statistics, which are designed specifically for non-standard problems.

Part THREE provides a coherent treatment of a number of alternative estimation procedures which are applicable when the conditions to implement maximum likelihood estimation are not satisfied. For the case where the probability distribution is incorrectly specified, quasi-maximum likelihood is appropriate.
If the joint probability distribution of the data is treated as unknown, then a generalized method of moments estimator is adopted. This estimator has the advantage of circumventing the need to specify the distribution and hence avoids any potential misspecification from an incorrect choice of the distribution. An even less restrictive approach is not to specify either the distribution or the parametric form of the moments of the distribution and to use nonparametric procedures to model either the distribution of variables or the relationships between variables. Simulation estimation methods are used for models where the likelihood is intractable, arising, for example, from the presence of latent variables. Indirect inference, efficient method of moments and simulated method of moments estimators are presented and compared.

Part FOUR examines stationary time series models with a special emphasis on using maximum likelihood methods to estimate and test these models. Both single equation models, including the autoregressive moving average class of models, and multiple equation models, including vector autoregressions and structural vector autoregressions, are dealt with in detail. Also discussed are linear factor models where the factors are treated as latent. The presence of the latent factor means that the full likelihood is generally not tractable. However, if the models are specified in terms of the normal distribution with moments based on linear parametric representations, a Kalman filter is used to rewrite the likelihood in terms of the observable variables, thereby making estimation and testing by maximum likelihood feasible.

Part FIVE focusses on nonstationary time series models and in particular tests for unit roots and cointegration. Some important asymptotic results for nonstationary time series are presented, followed by a comprehensive discussion of testing for unit roots. Cointegration is tackled from the perspective that the well-known Johansen estimator may be usefully interpreted as a maximum likelihood estimator based on the assumption of a normal distribution applied to a system of equations that is subject to a set of cross-equation restrictions arising from the assumption of common long-run relationships. Further, the trace and maximum eigenvalue tests of cointegration are shown to be likelihood ratio tests.

Part SIX is concerned with nonlinear time series models. Models that are nonlinear in mean include the threshold class of model, bilinear models and also artificial neural network modelling, which, contrary to many existing treatments, is again addressed from the econometric perspective of estimation and testing based on maximum likelihood methods. Nonlinearities in variance are dealt with in terms of the GARCH class of models. The final chapter focusses on models that deal with discrete or truncated time series data.

Even in a project of this size and scope, sacrifices have had to be made to keep the length of the book manageable. Accordingly, there are a number of important topics that have had to be omitted.

(i) Although Bayesian methods are increasingly being used in many areas of statistics and econometrics, no material on Bayesian econometrics is included. This is an important field in its own right and the interested reader is referred to recent books by Koop (2003), Geweke (2005), Koop, Poirier and Tobias (2007) and Greenberg (2008), inter alia. Where appropriate, references to Bayesian methods are provided in the body of the text.
(ii) With great reluctance, a chapter on bootstrapping was not included because of space issues. A good place to start reading is the introductory text by Efron and Tibshirani (1993) and the useful surveys by Horowitz (1997) and Li and Maddala (1996a, 1996b).

(iii) In Part SIX, in the chapter dealing with modelling the variance of time series, there are important recent developments in stochastic volatility and realized volatility that would be worthy of inclusion. For stochastic volatility, there is an excellent volume of readings edited by Shephard (2005), while the seminal articles in the area of realized volatility are Andersen et al. (2001, 2003).

The fact that these areas have not been covered should not be regarded as a value judgement about their relative importance. Instead, the subject matter chosen for inclusion reflects a balance between the interests of the authors and purely operational decisions aimed at preserving the flow and continuity of the book.

Computer Code

Computer code is available from a companion website to reproduce relevant examples in the text, to reproduce figures in the text that are not part of an example, to reproduce the applications presented in the final section of each chapter, and to complete the exercises. Where applicable, the time series data used in these examples, applications and exercises are also available in a number of different formats. Presenting numerical results in the examples immediately gives rise to two important issues concerning numerical precision.

(1) In all of the examples listed in the front of the book where computer code has been used, the numbers appearing in the text are rounded versions of those generated by the code. Accordingly, the rounded numbers should be interpreted as such and not be used independently of the computer code to try to reproduce the numbers reported in the text.

(2) In many of the examples, simulation has been used to demonstrate a concept. Since GAUSS and MATLAB have different random number generators, the results generated by the different sets of code will not be identical to one another. For consistency, the GAUSS output is always used for reporting purposes.

Although GAUSS and MATLAB are very similar high-level programming languages, there are some important differences that require explanation. Probably the most important difference is one of programming style. GAUSS programs are script files that allow calls to both inbuilt GAUSS and user-defined procedures. MATLAB, on the other hand, does not support the use of user-defined functions in script files. Furthermore, MATLAB programming style favours writing user-defined functions in separate files and then calling them as if they were inbuilt functions. This style of programming does not suit the learning-by-doing environment that the book tries to create. Consequently, the MATLAB programs are written mainly as function files comprising a main function together with all the user-defined functions required to implement the procedure in the same file, as illustrated in the sketch below. The only exceptions to this rule are a few MATLAB utility files, which greatly facilitate the conversion and interpretation of code from GAUSS to MATLAB and which are provided as separate stand-alone MATLAB function files. Finally, all the figures in the text were created using MATLAB together with a utility file laprint.m written by Arno Linnemann of the University of Kassel.[2]

[2] A user guide is available at http://www.uni-kassel.de/fb16/rat/matlab/laprint/laprintdoc.ps.
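To make the stylistic difference concrete, the following fragment is a minimal sketch, not taken from the companion code, of the function-file layout just described: a main function and the user-defined function it calls live together in a single file, here hypothetically named demo_style.m.

    % demo_style.m: illustrative only; the file name and contents are
    % assumptions, not part of the book's companion code.
    function demo_style()
        y = [2.1 2.2 3.1 1.6 2.5 0.5];        % sample data
        theta = 1/mean(y);                    % exponential MLE (see Chapter 1)
        fprintf('Average log-likelihood = %.3f\n', avgloglike(theta,y));
    end

    % User-defined function kept in the same file, mirroring the
    % learning-by-doing layout adopted for the MATLAB programs.
    function lnL = avgloglike(theta,y)
        lnL = mean(log(theta) - theta*y);     % exponential density in logs
    end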
Acknowledgements

Creating a manuscript of this scope and magnitude is a daunting task and there are many people to whom we are indebted. In particular, we would like to thank Kenneth Lindsay, Adrian Pagan and Andy Tremayne for their careful reading of various chapters of the manuscript and for many helpful comments and suggestions. Gael Martin helped with compiling a suitable list of references to Bayesian econometric methods. Ayesha Scott compiled the index, a painstaking task for a manuscript of this size. Many others have commented on earlier drafts of chapters and we are grateful to the following individuals: our colleagues, Gunnar Bårdsen, Ralf Becker, Adam Clements, Vlad Pavlov and Joseph Jeisman; and our graduate students, Tim Christensen, Christopher Coleman-Fenn, Andrew McClelland, Jessie Wang and Vivianne Vilar.

We also wish to express our deep appreciation to the team at Cambridge University Press, particularly Peter C. B. Phillips for his encouragement and support throughout the long gestation period of the book as well as for reading and commenting on earlier drafts. Scott Parris, with his energy and enthusiasm for the project, was a great help in sustaining the authors during the long slog of completing the manuscript. Our thanks are also due to our CUP readers who provided detailed and constructive feedback at various stages in the compilation of the final document. Michael Erkelenz of Fine Line Writers edited the entire manuscript, helped to smooth out the prose and provided particular assistance with the correct use of adjectival constructions in the passive voice.

It is fair to say that writing this book was an immense task that involved the consumption of copious quantities of chillies, champagne and port over a protracted period of time. The biggest debt of gratitude we owe, therefore, is to our respective families. To Gael, Sarah and David; Cath, Iain, Robert and Tim; and Fiona and Caitlin: thank you for your patience, your good humour in putting up with and cleaning up after many a pizza night, your stoicism in enduring yet another vacant stare during an important conversation and, ultimately, for making it all worthwhile.
Vance Martin, Stan Hurn & David Harris
November 2011

Contents

List of Illustrations
Computer Code used in the Examples

PART ONE MAXIMUM LIKELIHOOD

1 The Maximum Likelihood Principle: 1.1 Introduction; 1.2 Motivating Examples; 1.3 Joint Probability Distributions; 1.4 Maximum Likelihood Framework (1.4.1 The Log-Likelihood Function, 1.4.2 Gradient, 1.4.3 Hessian); 1.5 Applications (1.5.1 Stationary Distribution of the Vasicek Model, 1.5.2 Transitional Distribution of the Vasicek Model); 1.6 Exercises

2 Properties of Maximum Likelihood Estimators: 2.1 Introduction; 2.2 Preliminaries (2.2.1 Stochastic Time Series Models and Their Properties, 2.2.2 Weak Law of Large Numbers, 2.2.3 Rates of Convergence, 2.2.4 Central Limit Theorems); 2.3 Regularity Conditions; 2.4 Properties of the Likelihood Function (2.4.1 The Population Likelihood Function, 2.4.2 Moments of the Gradient, 2.4.3 The Information Matrix); 2.5 Asymptotic Properties (2.5.1 Consistency, 2.5.2 Normality, 2.5.3 Efficiency); 2.6 Finite-Sample Properties (2.6.1 Unbiasedness, 2.6.2 Sufficiency, 2.6.3 Invariance, 2.6.4 Non-Uniqueness); 2.7 Applications (2.7.1 Portfolio Diversification, 2.7.2 Bimodal Likelihood); 2.8 Exercises

3 Numerical Estimation Methods: 3.1 Introduction; 3.2 Newton Methods (3.2.1 Newton-Raphson, 3.2.2 Method of Scoring, 3.2.3 BHHH Algorithm, 3.2.4 Comparative Examples); 3.3 Quasi-Newton Methods; 3.4 Line Searching; 3.5 Optimisation Based on Function Evaluation; 3.6 Computing Standard Errors; 3.7 Hints for Practical Optimization (3.7.1 Concentrating the Likelihood, 3.7.2 Parameter Constraints, 3.7.3 Choice of Algorithm, 3.7.4 Numerical Derivatives, 3.7.5 Starting Values, 3.7.6 Convergence Criteria); 3.8 Applications (3.8.1 Stationary Distribution of the CIR Model, 3.8.2 Transitional Distribution of the CIR Model); 3.9 Exercises

4 Hypothesis Testing: 4.1 Introduction; 4.2 Overview; 4.3 Types of Hypotheses (4.3.1 Simple and Composite Hypotheses, 4.3.2 Linear Hypotheses, 4.3.3 Nonlinear Hypotheses); 4.4 Likelihood Ratio Test; 4.5 Wald Test (4.5.1 Linear Hypotheses, 4.5.2 Nonlinear Hypotheses); 4.6 Lagrange Multiplier Test; 4.7 Distribution Theory (4.7.1 Asymptotic Distribution of the Wald Statistic, 4.7.2 Asymptotic Relationships Among the Tests, 4.7.3 Finite Sample Relationships); 4.8 Size and Power Properties (4.8.1 Size of a Test, 4.8.2 Power of a Test); 4.9 Applications (4.9.1 Exponential Regression Model, 4.9.2 Gamma Regression Model); 4.10 Exercises

PART TWO REGRESSION MODELS

5 Linear Regression Models: 5.1 Introduction; 5.2 Specification (5.2.1 Model Classification, 5.2.2 Structural and Reduced Forms); 5.3 Estimation (5.3.1 Single Equation: Ordinary Least Squares, 5.3.2 Multiple Equations: FIML, 5.3.3 Identification, 5.3.4 Instrumental Variables, 5.3.5 Seemingly Unrelated Regression); 5.4 Testing; 5.5 Applications (5.5.1 Linear Taylor Rule, 5.5.2 The Klein Model of the U.S. Economy); 5.6 Exercises

6 Nonlinear Regression Models: 6.1 Introduction; 6.2 Specification; 6.3 Maximum Likelihood Estimation; 6.4 Gauss-Newton (6.4.1 Relationship to Nonlinear Least Squares, 6.4.2 Relationship to Ordinary Least Squares, 6.4.3 Asymptotic Distributions); 6.5 Testing (6.5.1 LR, Wald and LM Tests, 6.5.2 Nonnested Tests); 6.6 Applications (6.6.1 Robust Estimation of the CAPM, 6.6.2 Stochastic Frontier Models); 6.7 Exercises

7 Autocorrelated Regression Models: 7.1 Introduction; 7.2 Specification; 7.3 Maximum Likelihood Estimation (7.3.1 Exact Maximum Likelihood, 7.3.2 Conditional Maximum Likelihood); 7.4 Alternative Estimators (7.4.1 Gauss-Newton, 7.4.2 Zig-zag Algorithms, 7.4.3 Cochrane-Orcutt); 7.5 Distribution Theory (7.5.1 Maximum Likelihood Estimator, 7.5.2 Least Squares Estimator); 7.6 Lagged Dependent Variables; 7.7 Testing (7.7.1 Alternative LM Test I, 7.7.2 Alternative LM Test II, 7.7.3 Alternative LM Test III); 7.8 Systems of Equations (7.8.1 Estimation, 7.8.2 Testing); 7.9 Applications (7.9.1 Illiquidity and Hedge Funds, 7.9.2 Beach-MacKinnon Simulation Study); 7.10 Exercises

8 Heteroskedastic Regression Models: 8.1 Introduction; 8.2 Specification; 8.3 Estimation (8.3.1 Maximum Likelihood, 8.3.2 Relationship with Weighted Least Squares); 8.4 Distribution Theory; 8.5 Testing; 8.6 Heteroskedasticity in Systems of Equations (8.6.1 Specification, 8.6.2 Estimation, 8.6.3 Testing, 8.6.4 Heteroskedastic and Autocorrelated Disturbances); 8.7 Applications (8.7.1 The Great Moderation, 8.7.2 Finite Sample Properties of the Wald Test); 8.8 Exercises

PART THREE OTHER ESTIMATION METHODS

9 Quasi-Maximum Likelihood Estimation: 9.1 Introduction; 9.2 Misspecification; 9.3 The Quasi-Maximum Likelihood Estimator; 9.4 Asymptotic Distribution (9.4.1 Misspecification and the Information Equality, 9.4.2 Independent and Identically Distributed Data, 9.4.3 Dependent Data: Martingale Difference Score, 9.4.4 Dependent Data and Score, 9.4.5 Variance Estimation); 9.5 Quasi-Maximum Likelihood and Linear Regression (9.5.1 Nonnormality, 9.5.2 Heteroskedasticity, 9.5.3 Autocorrelation, 9.5.4 Variance Estimation); 9.6 Testing; 9.7 Applications (9.7.1 Autoregressive Models for Count Data, 9.7.2 Estimating the Parameters of the CKLS Model); 9.8 Exercises

10 Generalized Method of Moments: 10.1 Introduction; 10.2 Motivating Examples (10.2.1 Population Moments, 10.2.2 Empirical Moments, 10.2.3 GMM Models from Conditional Expectations, 10.2.4 GMM and Maximum Likelihood); 10.3 Estimation (10.3.1 The GMM Objective Function, 10.3.2 Asymptotic Properties, 10.3.3 Estimation Strategies); 10.4 Over-Identification Testing; 10.5 Applications (10.5.1 Monte Carlo Evidence, 10.5.2 Level Effect in Interest Rates); 10.6 Exercises

11 Nonparametric Estimation: 11.1 Introduction; 11.2 The Kernel Density Estimator; 11.3 Properties of the Kernel Density Estimator (11.3.1 Finite Sample Properties, 11.3.2 Optimal Bandwidth Selection, 11.3.3 Asymptotic Properties, 11.3.4 Dependent Data); 11.4 Semi-Parametric Density Estimation; 11.5 The Nadaraya-Watson Kernel Regression Estimator; 11.6 Properties of Kernel Regression Estimators; 11.7 Bandwidth Selection for Kernel Regression; 11.8 Multivariate Kernel Regression; 11.9 Semi-parametric Regression of the Partial Linear Model; 11.10 Applications (11.10.1 Derivatives of a Nonlinear Production Function, 11.10.2 Drift and Diffusion Functions of SDEs); 11.11 Exercises

12 Estimation by Simulation: 12.1 Introduction; 12.2 Motivating Example; 12.3 Indirect Inference (12.3.1 Estimation, 12.3.2 Relationship with Indirect Least Squares); 12.4 Efficient Method of Moments (EMM) (12.4.1 Estimation, 12.4.2 Relationship with Instrumental Variables); 12.5 Simulated Generalized Method of Moments (SMM); 12.6 Estimating Continuous-Time Models (12.6.1 Brownian Motion, 12.6.2 Geometric Brownian Motion, 12.6.3 Stochastic Volatility); 12.7 Applications (12.7.1 Simulation Properties, 12.7.2 Empirical Properties); 12.8 Exercises

PART FOUR STATIONARY TIME SERIES

13 Linear Time Series Models: 13.1 Introduction; 13.2 Time Series Properties of Data; 13.3 Specification (13.3.1 Univariate Model Classification, 13.3.2 Multivariate Model Classification, 13.3.3 Likelihood); 13.4 Stationarity (13.4.1 Univariate Examples, 13.4.2 Multivariate Examples, 13.4.3 The Stationarity Condition, 13.4.4 Wold's Representation Theorem, 13.4.5 Transforming a VAR to a VMA); 13.5 Invertibility (13.5.1 The Invertibility Condition, 13.5.2 Transforming a VMA to a VAR); 13.6 Estimation; 13.7 Optimal Choice of Lag Order; 13.8 Distribution Theory; 13.9 Testing; 13.10 Analyzing Vector Autoregressions (13.10.1 Granger Causality Testing, 13.10.2 Impulse Response Functions, 13.10.3 Variance Decompositions); 13.11 Applications (13.11.1 Barro's Rational Expectations Model, 13.11.2 The Campbell-Shiller Present Value Model); 13.12 Exercises

14 Structural Vector Autoregressions: 14.1 Introduction; 14.2 Specification (14.2.1 Short-Run Restrictions, 14.2.2 Long-Run Restrictions, 14.2.3 Short-Run and Long-Run Restrictions, 14.2.4 Sign Restrictions); 14.3 Estimation; 14.4 Identification; 14.5 Testing; 14.6 Applications (14.6.1 Peersman's Model of Oil Price Shocks, 14.6.2 A Portfolio SVAR Model of Australia); 14.7 Exercises

15 Latent Factor Models: 15.1 Introduction; 15.2 Motivating Examples (15.2.1 Empirical, 15.2.2 Theoretical); 15.3 The Recursions of the Kalman Filter (15.3.1 Univariate, 15.3.2 Multivariate); 15.4 Extensions (15.4.1 Intercepts, 15.4.2 Dynamics, 15.4.3 Nonstationary Factors, 15.4.4 Exogenous and Predetermined Variables); 15.5 Factor Extraction; 15.6 Estimation (15.6.1 Identification, 15.6.2 Maximum Likelihood, 15.6.3 Principal Components Estimator); 15.7 Relationship to VARMA Models; 15.8 Applications (15.8.1 The Hodrick-Prescott Filter, 15.8.2 A Factor Model of Spreads with Money Shocks); 15.9 Exercises

PART FIVE NON-STATIONARY TIME SERIES

16 Nonstationary Distribution Theory: 16.1 Introduction; 16.2 Specification (16.2.1 Models of Trends, 16.2.2 Integration); 16.3 Estimation (16.3.1 Stationary Case, 16.3.2 Nonstationary Case: Stochastic Trends, 16.3.3 Nonstationary Case: Deterministic Trends); 16.4 Asymptotics for Integrated Processes (16.4.1 Brownian Motion, 16.4.2 Functional Central Limit Theorem, 16.4.3 Continuous Mapping Theorem, 16.4.4 Stochastic Integrals); 16.5 Multivariate Analysis; 16.6 Applications (16.6.1 Least Squares Estimator of the AR(1) Model, 16.6.2 Trend Misspecification); 16.7 Exercises

17 Unit Root Testing: 17.1 Introduction; 17.2 Specification; 17.3 Detrending (17.3.1 Ordinary Least Squares: Dickey and Fuller, 17.3.2 First Differences: Schmidt and Phillips, 17.3.3 Generalized Least Squares: Elliott, Rothenberg and Stock); 17.4 Testing (17.4.1 Dickey-Fuller Tests, 17.4.2 M Tests); 17.5 Distribution Theory (17.5.1 Ordinary Least Squares Detrending, 17.5.2 Generalized Least Squares Detrending, 17.5.3 Simulating Critical Values); 17.6 Power (17.6.1 Near Integration and the Ornstein-Uhlenbeck Process, 17.6.2 Asymptotic Local Power, 17.6.3 Point Optimal Tests, 17.6.4 Asymptotic Power Envelope); 17.7 Autocorrelation (17.7.1 Dickey-Fuller Test with Autocorrelation, 17.7.2 M Tests with Autocorrelation); 17.8 Structural Breaks (17.8.1 Known Break Point, 17.8.2 Unknown Break Point); 17.9 Applications (17.9.1 Power and the Initial Value, 17.9.2 Nelson-Plosser Data Revisited); 17.10 Exercises

18 Cointegration: 18.1 Introduction; 18.2 Long-Run Economic Models; 18.3 Specification: VECM (18.3.1 Bivariate Models, 18.3.2 Multivariate Models, 18.3.3 Cointegration, 18.3.4 Deterministic Components); 18.4 Estimation (18.4.1 Full-Rank Case, 18.4.2 Reduced-Rank Case: Iterative Estimator, 18.4.3 Reduced-Rank Case: Johansen Estimator, 18.4.4 Zero-Rank Case); 18.5 Identification (18.5.1 Triangular Restrictions, 18.5.2 Structural Restrictions); 18.6 Distribution Theory (18.6.1 Asymptotic Distribution of the Eigenvalues, 18.6.2 Asymptotic Distribution of the Parameters); 18.7 Testing (18.7.1 Cointegrating Rank, 18.7.2 Cointegrating Vector, 18.7.3 Exogeneity); 18.8 Dynamics (18.8.1 Impulse Responses, 18.8.2 Cointegrating Vector Interpretation); 18.9 Applications (18.9.1 Rank Selection Based on Information Criteria, 18.9.2 Effects of Heteroskedasticity on the Trace Test); 18.10 Exercises

PART SIX NONLINEAR TIME SERIES

19 Nonlinearities in Mean: 19.1 Introduction; 19.2 Motivating Examples; 19.3 Threshold Models (19.3.1 Specification, 19.3.2 Estimation, 19.3.3 Testing); 19.4 Artificial Neural Networks (19.4.1 Specification, 19.4.2 Estimation, 19.4.3 Testing); 19.5 Bilinear Time Series Models (19.5.1 Specification, 19.5.2 Estimation, 19.5.3 Testing); 19.6 Markov Switching Model; 19.7 Nonparametric Autoregression; 19.8 Nonlinear Impulse Responses; 19.9 Applications (19.9.1 A Multiple Equilibrium Model of Unemployment, 19.9.2 Bivariate Threshold Models of G7 Countries); 19.10 Exercises

20 Nonlinearities in Variance: 20.1 Introduction; 20.2 Statistical Properties of Asset Returns; 20.3 The ARCH Model (20.3.1 Specification, 20.3.2 Estimation, 20.3.3 Testing); 20.4 Univariate Extensions (20.4.1 GARCH, 20.4.2 Integrated GARCH, 20.4.3 Additional Variables, 20.4.4 Asymmetries, 20.4.5 GARCH-in-Mean, 20.4.6 Diagnostics); 20.5 Conditional Nonnormality (20.5.1 Parametric, 20.5.2 Semi-Parametric, 20.5.3 Nonparametric); 20.6 Multivariate GARCH (20.6.1 VECH, 20.6.2 BEKK, 20.6.3 DCC, 20.6.4 DECO); 20.7 Applications (20.7.1 DCC and DECO Models of U.S. Zero Coupon Yields, 20.7.2 A Time-Varying Volatility SVAR Model); 20.8 Exercises

21 Discrete Time Series Models: 21.1 Introduction; 21.2 Motivating Examples; 21.3 Qualitative Data (21.3.1 Specification, 21.3.2 Estimation, 21.3.3 Testing, 21.3.4 Binary Autoregressive Models); 21.4 Ordered Data; 21.5 Count Data (21.5.1 The Poisson Regression Model, 21.5.2 Integer Autoregressive Models); 21.6 Duration Data; 21.7 Applications (21.7.1 An ACH Model of U.S. Airline Trades, 21.7.2 EMM Estimator of Integer Models); 21.8 Exercises

Appendix A Change of Variable in Probability Density Functions
Appendix B The Lag Operator: B.1 Basics; B.2 Polynomial Convolution; B.3 Polynomial Inversion; B.4 Polynomial Decomposition
Appendix C FIML Estimation of a Structural Model: C.1 Log-likelihood Function; C.2 First-order Conditions; C.3 Solution
Appendix D Additional Nonparametric Results: D.1 Mean; D.2 Variance; D.3 Mean Square Error; D.4 Roughness (D.4.1 Roughness Results for the Gaussian Distribution, D.4.2 Roughness Results for the Gaussian Kernel)

References
Author index
Subject index

Illustrations

1.1 Probability distributions of y for various models; 1.2 Probability distributions of y for various models; 1.3 Log-likelihood function for Poisson distribution; 1.4 Log-likelihood function for exponential distribution; 1.5 Log-likelihood function for the normal distribution; 1.6 Eurodollar interest rates; 1.7 Stationary density of Eurodollar interest rates; 1.8 Transitional density of Eurodollar interest rates
2.1 Demonstration of the weak law of large numbers; 2.2 Demonstration of the Lindeberg-Levy central limit theorem; 2.3 Convergence of log-likelihood function; 2.4 Consistency of sample mean for normal distribution; 2.5 Consistency of median for Cauchy distribution; 2.6 Illustrating asymptotic normality; 2.7 Bivariate normal distribution; 2.8 Scatter plot of returns on Apple and Ford stocks; 2.9 Gradient of the bivariate normal model
3.1 Stationary density of Eurodollar interest rates: CIR model; 3.2 Estimated variance function of CIR model
4.1 Illustrating the LR and Wald tests; 4.2 Illustrating the LM test; 4.3 Simulated and asymptotic distributions of the Wald test
5.1 Simulating a bivariate regression model; 5.2 Sampling distribution of a weak instrument; 5.3 U.S. data on the Taylor Rule
6.1 Simulated exponential models; 6.2 Scatter plot of Martin Marietta returns data; 6.3 Stochastic frontier disturbance distribution
7.1 Simulated models with autocorrelated disturbances; 7.2 Distribution of maximum likelihood estimator in an autocorrelated regression model
8.1 Simulated data from heteroskedastic models; 8.2 The Great Moderation; 8.3 Sampling distribution of Wald test; 8.4 Power of Wald test
9.1 Comparison of true and misspecified log-likelihood functions; 9.2 U.S. Dollar/British Pound exchange rates; 9.3 Estimated variance function of CKLS model
11.1 Bias and variance of the kernel estimate of density; 11.2 Kernel estimate of distribution of stock index returns; 11.3 Bivariate normal density; 11.4 Semiparametric density estimator; 11.5 Parametric conditional mean estimates; 11.6 Nadaraya-Watson nonparametric kernel regression; 11.7 Effect of bandwidth on kernel regression; 11.8 Cross validation bandwidth selection; 11.9 Two-dimensional product kernel; 11.10 Semiparametric regression; 11.11 Nonparametric production function; 11.12 Nonparametric estimates of drift and diffusion functions
12.1 Simulated AR(1) model; 12.2 Illustrating Brownian motion
13.1 U.S. macroeconomic data; 13.2 Plots of simulated stationary time series; 13.3 Choice of optimal lag order
14.1 Bivariate SVAR model; 14.2 Bivariate SVAR with short-run restrictions; 14.3 Bivariate SVAR with long-run restrictions; 14.4 Bivariate SVAR with short- and long-run restrictions; 14.5 Bivariate SVAR with sign restrictions; 14.6 Impulse responses of Peersman's model
15.1 Daily U.S. zero coupon rates; 15.2 Alternative priors for latent factors in the Kalman filter; 15.3 Factor loadings of a term structure model; 15.4 Hodrick-Prescott filter of real U.S. GDP
16.1 Nelson-Plosser data; 16.2 Simulated distribution of AR(1) parameter; 16.3 Continuous-time processes; 16.4 Functional Central Limit Theorem; 16.5 Distribution of a stochastic integral; 16.6 Mixed normal distribution
17.1 Real U.S. GDP; 17.2 Detrending; 17.3 Near unit root process; 17.4 Asymptotic power curve of ADF tests; 17.5 Asymptotic power envelope of ADF tests; 17.6 Structural breaks in U.S. GDP; 17.7 Union of rejections approach
18.1 Permanent income hypothesis; 18.2 Long run money demand; 18.3 Term structure of U.S. yields; 18.4 Error correction phase diagram
19.1 Properties of an AR(2) model; 19.2 Limit cycle; 19.3 Strange attractor; 19.4 Nonlinear error correction model; 19.5 U.S. unemployment; 19.6 Threshold functions; 19.7 Decomposition of an ANN; 19.8 Simulated bilinear time series models; 19.9 Markov switching model of U.S. output; 19.10 Nonparametric estimate of a TAR(1) model; 19.11 Simulated TAR models for G7 countries
20.1 Statistical properties of FTSE returns; 20.2 Distribution of FTSE returns; 20.3 News impact curve; 20.4 ACF of GARCH(1,1) models; 20.5 Conditional variance of FTSE returns; 20.6 Risk-return preferences; 20.7 BEKK model of U.S. zero coupon bonds; 20.8 DECO model of interest rates; 20.9 SVAR model of U.K. Libor spread
21.1 U.S. Federal funds target rate from 1984 to 2009; 21.2 Money demand equation with a floor interest rate; 21.3 Duration descriptive statistics for AMR

Computer Code used in the Examples

(Code is written in GAUSS, in which case the extension is .g, or in MATLAB, in which case the extension is .m.)

1.1-1.8 basic_sample.*; 1.10, 1.14 basic_poisson.*; 1.11, 1.15, 1.18 basic_exp.*; 1.12, 1.16 basic_normal_like.*; 1.19 basic_normal.*
2.5 prop_wlln1.*; 2.6 prop_wlln2.*; 2.8 prop_moment.*; 2.10 prop_lindlevy.*; 2.21 prop_consistency.*; 2.22 prop_normal.*; 2.23 prop_cauchy.*; 2.25 prop_asymnorm.*; 2.28 prop_edgeworth.*; 2.29 prop_bias.*
3.2, 3.3, 3.4, 3.7, 3.8 max_exp.*; 3.6 max_weibull.*
4.3, 4.5, 4.7 test_weibull.*; 4.10 test_asymptotic.*; 4.11 test_size.*; 4.12, 4.13 test_power.*
5.5 linear_simulation.*; 5.6 linear_estimate.*; 5.7, 5.8 linear_fiml.*; 5.10 linear_weak.*; 5.14 linear_lr.*, linear_wd.*, linear_lm.*; 5.15 linear_fiml_lr.*, linear_fiml_wd.*, linear_fiml_lm.*
6.3 nls_simulate.*; 6.5 nls_exponential.*; 6.7 nls_consumption_estimate.*; 6.8 nls_contest.*; 6.11 nls_money.*
7.1 auto_simulate.*; 7.5 auto_invest.*; 7.8 auto_distribution.*; 7.11 auto_test.*; 7.12 auto_system.*
8.1 hetero_simulate.*; 8.3 hetero_estimate.*; 8.7 hetero_test.*; 8.9, 8.10 hetero_system.*; 8.11 hetero_general.*
10.2, 10.3 gmm_table.*; 10.11 gmm_ccapm.*
11.1 npd_kernel.*; 11.2 npd_property.*; 11.3 npd_ftse.*; 11.4 npd_bivariate.*; 11.5 npd_seminonlin.*; 11.6 npr_parametric.*; 11.7 npr_nadwatson.*; 11.8 npr_property.*; 11.10 npr_bivariate.*; 11.11 npr_semi.*
12.1 sim_mom.*; 12.3 sim_accuracy.*; 12.4 sim_ma1indirect.*; 12.5 sim_ma1emm.*; 12.6 sim_ma1overid.*; 12.7 sim_brownind.*, sim_brownemm.*
13.1 stsm_simulate.*; 13.8, 13.9 stsm_root.*; 13.17 stsm_varma.*; 13.21 stsm_anderson.*; 13.24-13.27 stsm_recursive.*
14.2, 14.5, 14.9, 14.10, 14.12 svar_bivariate.*; 14.13 svar_shortrun.*; 14.14 svar_longrun.*; 14.15 svar_recursive.*; 14.17, 14.18 svar_test.*
15.1 kalman_termfig.*; 15.5, 15.9 kalman_uni.*; 15.6 kalman_multi.*; 15.8 kalman_smooth.*; 15.10 kalman_term.*; 15.11 kalman_fvar.*; 15.12 kalman_panic.*
16.1-16.3 nts_nelplos.*; 16.4-16.6 nts_moment.*; 16.7 nts_yts.*; 16.8 nts_fclt.*; 16.10 nts_stochint.*; 16.11 nts_mixednormal.*
17.1, 17.2, 17.6, 17.8, 17.9 unit_qusgdp.*; 17.3 unit_asypower1.*; 17.4 unit_asypowerenv.*; 17.5 unit_maicsim.*
18.1-18.4 coint_lrgraphs.*; 18.6-18.8, 18.10, 18.16 coint_bivterm.*; 18.9 coint_permincome.*; 18.11 coint_triterm.*; 18.13 coint_simevals.*
19.1-19.4 nlm_features.*; 19.6 nlm_tarsim.*; 19.7 nlm_annfig.*; 19.8 nlm_bilinear.*; 19.9 nlm_hamilton.*; 19.10 nlm_tar.*; 19.11 nlm_girf.*
20.1 garch_nic.*; 20.2, 20.5 garch_estimate.*; 20.3 garch_test.*; 20.4 garch_simulate.*; 20.6 garch_seasonality.*; 20.7 garch_mean.*; 20.9 mgarch_bekk.*
21.2 discrete_mpol.*; 21.3 discrete_floor.*; 21.4 discrete_simulation.*; 21.7, 21.8 discrete_probit.*; 21.9 discrete_ordered.*; 21.11 discrete_thinning.*; 21.12 discrete_poissonauto.*
Code Disclaimer Information

Note that the computer code is provided for illustrative purposes only and, although care has been taken to ensure that it works properly, it has not been thoroughly tested under all conditions and on all platforms. The authors and Cambridge University Press cannot guarantee or imply the reliability, serviceability, or function of this computer code. All code is therefore provided 'as is' without any warranties of any kind.

PART ONE MAXIMUM LIKELIHOOD

1 The Maximum Likelihood Principle

1.1 Introduction

Maximum likelihood estimation is a general method for estimating the parameters of econometric models from observed data. The principle of maximum likelihood plays a central role in the exposition of this book, since a number of estimators used in econometrics can be derived within this framework. Examples include ordinary least squares, generalized least squares and full-information maximum likelihood. In deriving the maximum likelihood estimator, a key concept is the joint probability density function (pdf) of the observed random variables, $y_t$. Maximum likelihood estimation requires that the following conditions are satisfied.

(1) The form of the joint pdf of $y_t$ is known.
(2) The specification of the moments of the joint pdf is known.
(3) The joint pdf can be evaluated for all values of the parameters, $\theta$.

Parts ONE and TWO of this book deal with models in which all these conditions are satisfied. Part THREE investigates models in which these conditions are not satisfied and considers four important cases. First, if the distribution of $y_t$ is misspecified, resulting in both conditions 1 and 2 being violated, estimation is by quasi-maximum likelihood (Chapter 9). Second, if condition 1 is not satisfied, a generalized method of moments estimator (Chapter 10) is required. Third, if condition 2 is not satisfied, estimation relies on nonparametric methods (Chapter 11). Fourth, if condition 3 is violated, simulation-based estimation methods are used (Chapter 12).

1.2 Motivating Examples

To highlight the role of probability distributions in maximum likelihood estimation, this section emphasizes the link between observed sample data and the probability distribution from which they are drawn. This relationship is illustrated with a number of simulation examples in which samples of size $T = 5$ are drawn from a range of alternative models. The realizations of these draws for each model are listed in Table 1.1.

Table 1.1. Realisations of $y_t$ from alternative models: $t = 1, 2, \cdots, 5$.

    Model                     t=1       t=2       t=3       t=4       t=5
    Time Invariant           -2.720     2.470     0.495     0.597    -0.960
    Count                     2.000     4.000     3.000     4.000     0.000
    Linear Regression         2.850     3.105     5.693     8.101    10.387
    Exponential Regression    0.874     8.284     0.507     3.722     5.865
    Autoregressive            0.000    -1.031    -0.283    -1.323    -2.195
    Bilinear                  0.000    -2.721     0.531     1.350    -2.451
    ARCH                      0.000     3.558     6.989     7.925     8.118
    Poisson                   3.000    10.000    17.000    20.000    23.000
Example 1.1 Time Invariant Model
Consider the model
$$y_t = \sigma z_t,$$
where $z_t$ is a disturbance term and $\sigma$ is a parameter. Let $z_t$ be drawn from a standardized normal distribution, $N(0,1)$, defined by
$$f(z) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{z^2}{2}\right].$$
The distribution of $y_t$ is obtained from the distribution of $z_t$ using the change of variable technique (see Appendix A for details)
$$f(y;\theta) = f(z)\left|\frac{\partial z}{\partial y}\right|,$$
where $\theta = \{\sigma^2\}$. Applying this rule, and recognising that $z = y/\sigma$, yields
$$f(y;\theta) = \frac{1}{\sqrt{2\pi}} \exp\left[-\frac{(y/\sigma)^2}{2}\right]\left|\frac{1}{\sigma}\right| = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{y^2}{2\sigma^2}\right],$$
or $y_t \sim N(0,\sigma^2)$. In this model, the distribution of $y_t$ is time invariant because neither the mean nor the variance depends on time. This property is highlighted in panel (a) of Figure 1.1, where the parameter is $\sigma = 2$. For comparative purposes, the distributions of both $y_t$ and $z_t$ are given. As $y_t = 2z_t$, the distribution of $y_t$ is flatter than the distribution of $z_t$.

Figure 1.1 Probability distributions of $y$ generated from the time invariant, count, linear regression and exponential regression models (panels (a) to (d) respectively). Except for the time invariant and count models, the solid line represents the density at $t = 1$, the dashed line represents the density at $t = 3$ and the dotted line represents the density at $t = 5$.

As the distribution of $y_t$ in Example 1.1 does not depend on lagged values $y_{t-i}$, $y_t$ is independently distributed. In addition, since the distribution of $y_t$ is the same at each $t$, $y_t$ is identically distributed. These two properties are abbreviated as iid. Conversely, the distribution is dependent if $y_t$ depends on its own lagged values and non-identical if it changes over time.

Example 1.2 Count Model
Consider a time series of counts modelled as a series of draws from a Poisson distribution
$$f(y;\theta) = \frac{\theta^y \exp[-\theta]}{y!}, \qquad y = 0, 1, 2, \cdots,$$
where $\theta > 0$ is an unknown parameter. A sample of $T = 5$ realizations of $y_t$, given in Table 1.1, is drawn from the Poisson probability distribution in panel (b) of Figure 1.1 for $\theta = 2$. By assumption, this distribution is the same at each point in time. In contrast to the data in the previous example, where the random variable is continuous, the data here are discrete as they are non-negative integers that measure counts.

Example 1.3 Linear Regression Model
Consider the regression model
$$y_t = \beta x_t + \sigma z_t, \qquad z_t \sim iid\ N(0,1),$$
where $x_t$ is an explanatory variable that is independent of $z_t$ and $\theta = \{\beta, \sigma^2\}$. The distribution of $y$ conditional on $x_t$ is
$$f(y|x_t;\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \beta x_t)^2}{2\sigma^2}\right],$$
which is a normal distribution with conditional mean $\beta x_t$ and variance $\sigma^2$, or $y_t \sim N(\beta x_t, \sigma^2)$. This distribution is illustrated in panel (c) of Figure 1.1 with $\beta = 3$, $\sigma = 2$ and explanatory variable $x_t = \{0, 1, 2, 3, 4\}$. The effect of $x_t$ is to shift the distribution of $y_t$ over time into the positive region, resulting in the draws of $y_t$ given in Table 1.1 becoming increasingly positive. As the variance at each point in time is constant, the spread of the distributions of $y_t$ is the same for all $t$.
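As an illustration of how draws such as those in the linear regression row of Table 1.1 can be generated, the following MATLAB sketch simulates the model of Example 1.3 with the parameter values quoted in the text; being random, its draws will not reproduce the table exactly.

    % Simulate y_t = beta*x_t + sigma*z_t with beta = 3, sigma = 2 and
    % x_t = {0,1,2,3,4}, as in Example 1.3 (an illustrative sketch only).
    beta = 3; sigma = 2;
    x = (0:4)';                 % explanatory variable
    z = randn(5,1);             % z_t ~ iid N(0,1)
    y = beta*x + sigma*z;       % y_t ~ N(beta*x_t, sigma^2)
    disp(y')                    % draws shift upwards as x_t increases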
Example 1.4 Exponential Regression Model
Consider the exponential regression model
$$f(y|x_t;\theta) = \frac{1}{\mu_t} \exp\left[-\frac{y}{\mu_t}\right],$$
where $\mu_t = \beta_0 + \beta_1 x_t$ is the time-varying conditional mean, $x_t$ is an explanatory variable and $\theta = \{\beta_0, \beta_1\}$. This distribution is highlighted in panel (d) of Figure 1.1 with $\beta_0 = 1$, $\beta_1 = 1$ and $x_t = \{0, 1, 2, 3, 4\}$. As $\beta_1 > 0$, the effect of $x_t$ is to cause the distribution of $y_t$ to become more positively skewed over time.

Figure 1.2 Probability distributions of $y$ generated from the autoregressive, bilinear, autoregressive with heteroskedasticity and ARCH models (panels (a) to (d) respectively). The solid line represents the density at $t = 1$, the dashed line represents the density at $t = 3$ and the dotted line represents the density at $t = 5$.

Example 1.5 Autoregressive Model
An example of a first-order autoregressive model, denoted AR(1), is
$$y_t = \rho y_{t-1} + u_t, \qquad u_t \sim iid\ N(0, \sigma^2),$$
with $|\rho| < 1$ and $\theta = \{\rho, \sigma^2\}$. The distribution of $y$, conditional on $y_{t-1}$, is
$$f(y|y_{t-1};\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \rho y_{t-1})^2}{2\sigma^2}\right],$$
which is a normal distribution with conditional mean $\rho y_{t-1}$ and variance $\sigma^2$, or $y_t \sim N(\rho y_{t-1}, \sigma^2)$. If $0 < \rho < 1$, then a large positive (negative) value of $y_{t-1}$ shifts the distribution into the positive (negative) region for $y_t$, raising the probability that the next draw from this distribution is also positive (negative). This property of the autoregressive model is highlighted in panel (a) of Figure 1.2 with $\rho = 0.8$, $\sigma = 2$ and initial value $y_1 = 0$.

Example 1.6 Bilinear Time Series Model
The autoregressive model discussed above specifies a linear relationship between $y_t$ and $y_{t-1}$. The following bilinear model is an example of a nonlinear time series model
$$y_t = \rho y_{t-1} + \gamma y_{t-1} u_{t-1} + u_t, \qquad u_t \sim iid\ N(0, \sigma^2),$$
where $y_{t-1} u_{t-1}$ represents the bilinear term and $\theta = \{\rho, \gamma, \sigma^2\}$. The distribution of $y_t$ conditional on $y_{t-1}$ is
$$f(y|y_{t-1};\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y - \mu_t)^2}{2\sigma^2}\right],$$
which is a normal distribution with conditional mean $\mu_t = \rho y_{t-1} + \gamma y_{t-1} u_{t-1}$ and variance $\sigma^2$. To highlight the nonlinear property of the model, substitute out $u_{t-1}$ in the equation for the mean
$$\mu_t = \rho y_{t-1} + \gamma y_{t-1}(y_{t-1} - \rho y_{t-2} - \gamma y_{t-2} u_{t-2}) = \rho y_{t-1} + \gamma y_{t-1}^2 - \gamma\rho y_{t-1} y_{t-2} - \gamma^2 y_{t-1} y_{t-2} u_{t-2},$$
which shows that the mean is a nonlinear function of $y_{t-1}$. Setting $\gamma = 0$ yields the linear AR(1) model of Example 1.5. The distribution of the bilinear model is illustrated in panel (b) of Figure 1.2 with $\rho = 0.8$, $\gamma = 0.4$, $\sigma = 2$ and initial value $y_1 = 0$.

Example 1.7 Autoregressive Model with Heteroskedasticity
An example of an AR(1) model with heteroskedasticity is
$$y_t = \rho y_{t-1} + \sigma_t z_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 w_t, \qquad z_t \sim iid\ N(0,1),$$
where $\theta = \{\rho, \alpha_0, \alpha_1\}$ and $w_t$ is an explanatory variable. The distribution of $y_t$ conditional on $y_{t-1}$ and $w_t$ is
$$f(y|y_{t-1}, w_t;\theta) = \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\left[-\frac{(y - \rho y_{t-1})^2}{2\sigma_t^2}\right],$$
which is a normal distribution with conditional mean $\rho y_{t-1}$ and conditional variance $\alpha_0 + \alpha_1 w_t$. For this model, the distribution shifts because of the dependence on $y_{t-1}$, and the spread of the distribution changes because of $w_t$. These features are highlighted in panel (c) of Figure 1.2 with $\rho = 0.8$, $\alpha_0 = 0.8$ and $\alpha_1 = 0.8$, where $w_t$ is defined as a uniform random number on the unit interval and the initial value is $y_1 = 0$.
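A corresponding sketch for the AR(1) model of Example 1.5 shows how the dependence on $y_{t-1}$ is built up recursively (parameter values as in the text; the draws will differ from Table 1.1).

    % Simulate the AR(1) model y_t = rho*y_{t-1} + u_t of Example 1.5
    % with rho = 0.8, sigma = 2 and initial value y_1 = 0 (a sketch).
    rho = 0.8; sigma = 2; T = 5;
    y = zeros(T,1);                        % y(1) = 0 is the initial value
    for t = 2:T
        y(t) = rho*y(t-1) + sigma*randn;   % u_t ~ iid N(0, sigma^2)
    end
    disp(y')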
Example 1.8 Autoregressive Conditional Heteroskedasticity
The autoregressive conditional heteroskedasticity (ARCH) class of models is a special case of the heteroskedastic regression model in which $w_t$ in Example 1.7 is expressed in terms of lagged values of the disturbance term squared. An example of a regression model as in Example 1.3 with ARCH is
$$y_t = \beta x_t + u_t, \qquad u_t = \sigma_t z_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2, \qquad z_t \sim iid\ N(0,1),$$
where $x_t$ is an explanatory variable and $\theta = \{\beta, \alpha_0, \alpha_1\}$. The distribution of $y$ conditional on $y_{t-1}$, $x_t$ and $x_{t-1}$ is
$$f(y|y_{t-1}, x_t, x_{t-1};\theta) = \frac{1}{\sqrt{2\pi\left(\alpha_0 + \alpha_1(y_{t-1} - \beta x_{t-1})^2\right)}} \exp\left[-\frac{(y - \beta x_t)^2}{2\left(\alpha_0 + \alpha_1(y_{t-1} - \beta x_{t-1})^2\right)}\right].$$
For this model, a large shock, represented by a large value of $u_t$, results in an increased variance in the next period if $\alpha_1 > 0$. The distribution from which $y_t$ is drawn in the next period will therefore have a larger variance. The distribution of this model is shown in panel (d) of Figure 1.2 with $\beta = 3$, $\alpha_0 = 0.8$, $\alpha_1 = 0.8$ and $x_t = \{0, 1, 2, 3, 4\}$.

1.3 Joint Probability Distributions

The motivating examples of the previous section focus on the distribution of $y_t$ at time $t$, which is generally a function of its own lags and the current and lagged values of explanatory variables $x_t$. The derivation of the maximum likelihood estimator of the model parameters requires using all of the information $t = 1, 2, \cdots, T$ by defining the joint probability density function (pdf). In the case where both $y_t$ and $x_t$ are stochastic, the joint pdf for a sample of $T$ observations is
$$f(y_1, y_2, \cdots, y_T, x_1, x_2, \cdots, x_T; \psi), \qquad (1.1)$$
where $\psi$ is a vector of parameters. An important feature of the previous examples is that $y_t$ depends on the explanatory variable $x_t$. To capture this conditioning, the joint distribution in (1.1) is expressed as
$$f(y_1, \cdots, y_T, x_1, \cdots, x_T; \psi) = f(y_1, \cdots, y_T | x_1, \cdots, x_T; \psi) \times f(x_1, \cdots, x_T; \psi), \qquad (1.2)$$
where the first term on the right-hand side of (1.2) represents the conditional distribution of $\{y_1, y_2, \cdots, y_T\}$ on $\{x_1, x_2, \cdots, x_T\}$ and the second term is the marginal distribution of $\{x_1, x_2, \cdots, x_T\}$. Assuming that the parameter vector $\psi$ can be decomposed into $\{\theta, \theta_x\}$, expression (1.2) becomes
$$f(y_1, \cdots, y_T, x_1, \cdots, x_T; \psi) = f(y_1, \cdots, y_T | x_1, \cdots, x_T; \theta) \times f(x_1, \cdots, x_T; \theta_x). \qquad (1.3)$$
In these circumstances, the maximum likelihood estimation of the parameters $\theta$ is based on the conditional distribution without loss of information from the exclusion of the marginal distribution $f(x_1, x_2, \cdots, x_T; \theta_x)$.

The conditional distribution on the right-hand side of expression (1.3) simplifies further in the presence of additional restrictions.

Independent and identically distributed (iid)
In the simplest case, $\{y_1, y_2, \cdots, y_T\}$ is independent of $\{x_1, x_2, \cdots, x_T\}$ and $y_t$ is iid with density function $f(y;\theta)$. The conditional pdf in equation (1.3) is then
$$f(y_1, y_2, \cdots, y_T | x_1, x_2, \cdots, x_T; \theta) = \prod_{t=1}^{T} f(y_t; \theta). \qquad (1.4)$$
Examples of this case are the time invariant model (Example 1.1) and the count model (Example 1.2). If both $y_t$ and $x_t$ are iid and $y_t$ is dependent on $x_t$, then the decomposition in equation (1.3) implies that inference can be based on
$$f(y_1, y_2, \cdots, y_T | x_1, x_2, \cdots, x_T; \theta) = \prod_{t=1}^{T} f(y_t | x_t; \theta). \qquad (1.5)$$
Examples include the regression models in Examples 1.3 and 1.4 if sampling is iid.
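The practical content of equations (1.4) and (1.5) is that, under iid sampling, the log of the joint density is simply the sum of the individual log densities. A two-line numerical check of this factorization, using standard normal data purely for illustration:

    % Under iid sampling the joint log density is the sum of the individual
    % log densities (equation (1.4)); N(0,1) data are used for illustration.
    y = randn(5,1);
    logf = -0.5*log(2*pi) - 0.5*y.^2;   % log f(y_t) for the N(0,1) density
    disp(sum(logf))                     % log of the joint density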
Dependent
Now assume that $\{y_1, y_2, \cdots, y_T\}$ depends on its own lags but is independent of the explanatory variables $\{x_1, x_2, \cdots, x_T\}$. The joint pdf is expressed as a sequence of conditional distributions where the conditioning is based on lags of $y_t$. Using standard rules of probability, the distributions of the first three observations are, respectively,
$$f(y_1; \theta) = f(y_1; \theta)$$
$$f(y_1, y_2; \theta) = f(y_2|y_1; \theta) f(y_1; \theta)$$
$$f(y_1, y_2, y_3; \theta) = f(y_3|y_2, y_1; \theta) f(y_2|y_1; \theta) f(y_1; \theta),$$
where $y_1$ is the initial value with marginal probability density $f(y_1; \theta)$. Extending this sequence to a sample of $T$ observations yields the joint pdf
$$f(y_1, y_2, \cdots, y_T; \theta) = f(y_1; \theta) \prod_{t=2}^{T} f(y_t | y_{t-1}, y_{t-2}, \cdots, y_1; \theta). \qquad (1.6)$$
Examples of this general case are the AR model (Example 1.5), the bilinear model (Example 1.6) and the ARCH model (Example 1.8). Extending the model to allow for dependence on explanatory variables, $x_t$, gives
$$f(y_1, \cdots, y_T | x_1, \cdots, x_T; \theta) = f(y_1|x_1; \theta) \prod_{t=2}^{T} f(y_t | y_{t-1}, y_{t-2}, \cdots, y_1, x_t, x_{t-1}, \cdots, x_1; \theta). \qquad (1.7)$$
An example is the autoregressive model with heteroskedasticity (Example 1.7).

Example 1.9 Autoregressive Model
The joint pdf for the AR(1) model in Example 1.5 is
$$f(y_1, y_2, \cdots, y_T; \theta) = f(y_1; \theta) \prod_{t=2}^{T} f(y_t|y_{t-1}; \theta),$$
where the conditional distribution is
$$f(y_t|y_{t-1}; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y_t - \rho y_{t-1})^2}{2\sigma^2}\right],$$
and the marginal distribution is
$$f(y_1; \theta) = \frac{1}{\sqrt{2\pi\sigma^2/(1-\rho^2)}} \exp\left[-\frac{y_1^2}{2\sigma^2/(1-\rho^2)}\right].$$
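A sketch of Example 1.9's decomposition in logs: the marginal log density of $y_1$ plus the sum of the conditional log densities of $y_2, \ldots, y_T$. The parameter values are illustrative and the data are the autoregressive draws from Table 1.1.

    % Joint log density of the AR(1) model (Example 1.9): marginal of y_1
    % plus conditionals of y_2,...,y_T. Parameter values are illustrative.
    rho = 0.8; sig2 = 4;
    y = [0.000 -1.031 -0.283 -1.323 -2.195]';    % AR row of Table 1.1
    lnf1 = -0.5*log(2*pi*sig2/(1-rho^2)) - y(1)^2*(1-rho^2)/(2*sig2);
    u = y(2:end) - rho*y(1:end-1);               % deviations from cond. means
    lnft = -0.5*log(2*pi*sig2) - u.^2/(2*sig2);  % conditional log densities
    disp(lnf1 + sum(lnft))                       % log of the joint pdf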
Non-stochastic explanatory variables
In the case of non-stochastic explanatory variables, because $x_t$ is deterministic its probability mass is degenerate. Explanatory variables of this form are also referred to as fixed in repeated samples. The joint probability in expression (1.3) simplifies to
$$f(y_1, \cdots, y_T, x_1, \cdots, x_T; \psi) = f(y_1, \cdots, y_T | x_1, \cdots, x_T; \theta).$$
Now $\psi = \theta$ and there is no potential loss of information from using the conditional distribution to estimate $\theta$.

1.4 Maximum Likelihood Framework

As emphasized previously, a time series of data represents the observed realization of draws from a joint pdf. The maximum likelihood principle makes use of this result by providing a general framework for estimating the unknown parameters, $\theta$, from the observed time series data, $\{y_1, y_2, \cdots, y_T\}$.

1.4.1 The Log-Likelihood Function

The standard interpretation of the joint pdf in (1.7) is that $f$ is a function of $y_t$ for given parameters, $\theta$. In defining the maximum likelihood estimator this interpretation is reversed, so that $f$ is taken as a function of $\theta$ for given $y_t$. The motivation behind this change in the interpretation of the arguments of the pdf is to regard $\{y_1, y_2, \cdots, y_T\}$ as a realized data set which is no longer random. The maximum likelihood estimator is then obtained by finding the value of $\theta$ which is "most likely" to have generated the observed data. Here the phrase "most likely" is loosely interpreted in a probability sense.

It is important to remember that the likelihood function is simply a redefinition of the joint pdf in equation (1.7). For many problems it is simpler to work with the logarithm of this joint density function. The log-likelihood function is defined as
$$\ln L_T(\theta) = \frac{1}{T} \ln f(y_1|x_1; \theta) + \frac{1}{T} \sum_{t=2}^{T} \ln f(y_t | y_{t-1}, y_{t-2}, \cdots, y_1, x_t, x_{t-1}, \cdots, x_1; \theta), \qquad (1.8)$$
where the change of status of the arguments in the joint pdf is highlighted by making $\theta$ the sole argument of this function, and the $T$ subscript indicates that the log-likelihood is an average over the sample of the logarithm of the density evaluated at $y_t$. It is worth emphasizing that the term log-likelihood function, used here without any qualification, is also known as the average log-likelihood function. This convention is also used by, among others, Newey and McFadden (1994) and White (1994). This definition of the log-likelihood function is consistent with the theoretical development of the properties of maximum likelihood estimators discussed in Chapter 2, particularly Sections 2.3 and 2.5.1.

For the special case where $y_t$ is iid, the log-likelihood function is based on the joint pdf in (1.4) and is
$$\ln L_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \ln f(y_t; \theta).$$
In all cases, the log-likelihood function, $\ln L_T(\theta)$, is a scalar that represents a summary measure of the data for given $\theta$.

The maximum likelihood estimator of $\theta$ is defined as that value of $\theta$, denoted $\hat\theta$, that maximizes the log-likelihood function. In a large number of cases, this may be achieved using standard calculus. Chapter 3 discusses numerical approaches to the problem of finding maximum likelihood estimates when no analytical solutions exist or are difficult to derive.

Example 1.10 Poisson Distribution
Let $\{y_1, y_2, \cdots, y_T\}$ be iid observations from a Poisson distribution
$$f(y; \theta) = \frac{\theta^y \exp[-\theta]}{y!},$$
where $\theta > 0$. The log-likelihood function for the sample is
$$\ln L_T(\theta) = \frac{1}{T} \sum_{t=1}^{T} \ln f(y_t; \theta) = \frac{1}{T} \sum_{t=1}^{T} y_t \ln\theta - \theta - \frac{\ln(y_1! y_2! \cdots y_T!)}{T}.$$
Consider the following $T = 3$ observations, $y_t = \{8, 3, 4\}$. The log-likelihood function is
$$\ln L_T(\theta) = \frac{15}{3}\ln\theta - \theta - \frac{\ln(8!\,3!\,4!)}{3} = 5\ln\theta - \theta - 5.191.$$
A plot of the log-likelihood function is given in panel (a) of Figure 1.3 for values of $\theta$ ranging from 0 to 10. Even though the Poisson distribution is a discrete distribution in terms of the random variable $y$, the log-likelihood function is continuous in the unknown parameter $\theta$. Inspection shows that a maximum occurs at $\hat\theta = 5$ with a log-likelihood value of
$$\ln L_T(5) = 5 \times \ln 5 - 5 - 5.191 = -2.144.$$
The contribution to the log-likelihood function of the first observation, $y_1 = 8$, evaluated at $\hat\theta = 5$, is
$$\ln f(y_1; 5) = y_1 \ln 5 - 5 - \ln(y_1!) = 8 \times \ln 5 - 5 - \ln(8!) = -2.729.$$
For the other two observations, the contributions are $\ln f(y_2; 5) = -1.963$ and $\ln f(y_3; 5) = -1.740$. The probabilities $f(y_t;\theta)$ are between 0 and 1 by definition and therefore all of the contributions are negative because they are computed as the logarithm of $f(y_t;\theta)$. The average of these $T = 3$ contributions is $\ln L_T(5) = -2.144$, which corresponds to the value already given above. A plot of $\ln f(y_t; 5)$ in panel (b) of Figure 1.3 shows that observations closer to $\hat\theta = 5$ have a relatively greater contribution to the log-likelihood function than observations further away, in the sense that they are smaller negative numbers.
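The calculations of Example 1.10 can be reproduced in a few lines. The following is a sketch rather than the companion file basic_poisson.*.

    % Poisson example: T = 3 observations y = {8,3,4}; the MLE is the
    % sample mean and lnL(5) = -2.144, as reported in Example 1.10.
    y = [8 3 4]';
    theta_hat = mean(y);                  % MLE of theta
    lnL = mean(y*log(theta_hat) - theta_hat - log(factorial(y)));
    fprintf('theta_hat = %g, lnL = %.3f\n', theta_hat, lnL)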
Example 1.11 Exponential Distribution
Let $\{y_1, y_2, \cdots, y_T\}$ be iid drawings from an exponential distribution
$$f(y; \theta) = \theta \exp[-\theta y],$$
where $\theta > 0$. The log-likelihood function for the sample is
$$\ln L_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \ln f(y_t; \theta) = \frac{1}{T}\sum_{t=1}^{T} (\ln\theta - \theta y_t) = \ln\theta - \theta\,\frac{1}{T}\sum_{t=1}^{T} y_t.$$
Consider the following $T = 6$ observations, $y_t = \{2.1, 2.2, 3.1, 1.6, 2.5, 0.5\}$. The log-likelihood function is
$$\ln L_T(\theta) = \ln\theta - \theta\,\frac{1}{T}\sum_{t=1}^{T} y_t = \ln\theta - 2\theta.$$
Plots of the log-likelihood, $\ln L_T(\theta)$, and likelihood, $L_T(\theta)$, functions are given in Figure 1.4, which show that a maximum occurs at $\hat\theta = 0.5$.

Figure 1.3 Plots of $\ln L_T(\theta)$ (panel (a)) and $\ln f(y_t; \hat\theta = 5)$ (panel (b)) for the Poisson distribution example with a sample size of $T = 3$.

Figure 1.4 Plots of $\ln L_T(\theta)$ (panel (a)) and $L_T(\theta)$ (panel (b)) for the exponential distribution example.

Table 1.2 provides details of the calculations. Let the log-likelihood function at each observation evaluated at the maximum likelihood estimate be denoted $\ln l_t(\theta) = \ln f(y_t; \theta)$. The second column shows $\ln l_t(\theta)$ evaluated at $\hat\theta = 0.5$,
$$\ln l_t(0.5) = \ln(0.5) - 0.5 y_t,$$
resulting in a maximum value of the log-likelihood function of
$$\ln L_T(0.5) = \frac{1}{6}\sum_{t=1}^{6} \ln l_t(0.5) = \frac{-10.159}{6} = -1.693.$$

Table 1.2. Maximum likelihood calculations for the exponential distribution example. The maximum likelihood estimate is $\hat\theta_T = 0.5$.

    y_t      ln l_t(0.5)      g_t(0.5)      h_t(0.5)
    2.1        -1.743          -0.100        -4.000
    2.2        -1.793          -0.200        -4.000
    3.1        -2.243          -1.100        -4.000
    1.6        -1.493           0.400        -4.000
    2.5        -1.943          -0.500        -4.000
    0.5        -0.943           1.500        -4.000

    ln L_T(0.5) = -1.693    G_T(0.5) = 0.000    H_T(0.5) = -4.000

Example 1.12 Normal Distribution
Let $\{y_1, y_2, \cdots, y_T\}$ be iid observations drawn from a normal distribution
$$f(y; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right],$$
with unknown parameters $\theta = \{\mu, \sigma^2\}$. The log-likelihood function is
$$\ln L_T(\theta) = \frac{1}{T}\sum_{t=1}^{T} \ln f(y_t; \theta) = \frac{1}{T}\sum_{t=1}^{T}\left(-\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{(y_t-\mu)^2}{2\sigma^2}\right) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2 T}\sum_{t=1}^{T}(y_t - \mu)^2.$$
Consider the following $T = 6$ observations, $y_t = \{5, -1, 3, 0, 2, 3\}$. The log-likelihood function is
$$\ln L_T(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{12\sigma^2}\sum_{t=1}^{6}(y_t - \mu)^2.$$
A plot of this function in Figure 1.5 shows that a maximum occurs at $\hat\mu = 2$ and $\hat\sigma^2 = 4$.

Figure 1.5 Plot of $\ln L_T(\mu, \sigma^2)$ for the normal distribution example.

Example 1.13 Autoregressive Model
From Example 1.9, the log-likelihood function for the AR(1) model is
$$\ln L_T(\theta) = \frac{1}{T}\left(\frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2)y_1^2\right) - \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2 T}\sum_{t=2}^{T}(y_t - \rho y_{t-1})^2.$$
The first term is commonly excluded from $\ln L_T(\theta)$ as its contribution disappears asymptotically since
$$\lim_{T\to\infty} \frac{1}{T}\left(\frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2)y_1^2\right) = 0.$$

As the aim of maximum likelihood estimation is to find the value of $\theta$ that maximizes the log-likelihood function, a natural way to do this is to use the rules of calculus. This involves computing the first and second derivatives of the log-likelihood function with respect to the parameter vector $\theta$.

1.4.2 Gradient

Differentiating $\ln L_T(\theta)$ with respect to a $(K \times 1)$ parameter vector, $\theta$, yields a $(K \times 1)$ gradient vector, also known as the score, given by
$$G_T(\theta) = \frac{\partial \ln L_T(\theta)}{\partial\theta} = \begin{bmatrix} \dfrac{\partial \ln L_T(\theta)}{\partial\theta_1} \\ \vdots \\ \dfrac{\partial \ln L_T(\theta)}{\partial\theta_K} \end{bmatrix} = \frac{1}{T}\sum_{t=1}^{T} g_t(\theta), \qquad (1.9)$$
where the subscript $T$ emphasizes that the gradient is the sample average of the individual gradients $g_t(\theta) = \partial \ln l_t(\theta)/\partial\theta$.
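Analytical gradients such as (1.9) can always be checked numerically, a theme taken up in Chapter 3. A sketch for the exponential model of Example 1.11, comparing the analytical score at $\theta = 0.5$ with a central finite-difference approximation:

    % Check the analytical gradient of the exponential log-likelihood
    % against a central finite-difference approximation (a sketch).
    y = [2.1 2.2 3.1 1.6 2.5 0.5]';
    lnL = @(th) log(th) - th*mean(y);          % average log-likelihood
    th = 0.5; h = 1e-6;
    g_analytic = 1/th - mean(y);               % G_T(theta) = 1/theta - ybar
    g_numeric  = (lnL(th+h) - lnL(th-h))/(2*h);
    fprintf('%.6f  %.6f\n', g_analytic, g_numeric)   % both zero at the MLE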
Example 1.13 Autoregressive Model
From Example 1.9, the log-likelihood function for the AR(1) model is

\ln L_T(\theta) = \frac{1}{T}\left( \frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2) y_1^2 \right) - \frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2 T} \sum_{t=2}^{T} (y_t - \rho y_{t-1})^2 .

The first term is commonly excluded from \ln L_T(\theta) as its contribution disappears asymptotically since

\lim_{T \to \infty} \frac{1}{T}\left( \frac{1}{2}\ln(1-\rho^2) - \frac{1}{2\sigma^2}(1-\rho^2) y_1^2 \right) = 0 .

As the aim of maximum likelihood estimation is to find the value of \theta that maximizes the log-likelihood function, a natural way to do this is to use the rules of calculus. This involves computing the first and second derivatives of the log-likelihood function with respect to the parameter vector \theta.

1.4.2 Gradient
Differentiating \ln L_T(\theta) with respect to a (K \times 1) parameter vector, \theta, yields the (K \times 1) gradient vector, also known as the score, given by

G_T(\theta) = \frac{\partial \ln L_T(\theta)}{\partial \theta} = \begin{bmatrix} \partial \ln L_T(\theta)/\partial\theta_1 \\ \partial \ln L_T(\theta)/\partial\theta_2 \\ \vdots \\ \partial \ln L_T(\theta)/\partial\theta_K \end{bmatrix} = \frac{1}{T} \sum_{t=1}^{T} g_t(\theta) ,   (1.9)

where the subscript T emphasizes that the gradient is the sample average of the individual gradients

g_t(\theta) = \frac{\partial \ln l_t(\theta)}{\partial \theta} .

The maximum likelihood estimator of \theta, denoted \hat{\theta}, is obtained by setting the gradient equal to zero and solving the resultant K first-order conditions. The maximum likelihood estimator, \hat{\theta}, therefore satisfies the condition

G_T(\hat{\theta}) = \left. \frac{\partial \ln L_T(\theta)}{\partial \theta} \right|_{\theta=\hat{\theta}} = 0 .   (1.10)

Example 1.14 Poisson Distribution
From Example 1.10, the first derivative of \ln L_T(\theta) with respect to \theta is

G_T(\theta) = \frac{1}{T\theta} \sum_{t=1}^{T} y_t - 1 .

The maximum likelihood estimator is the solution of the first-order condition

\frac{1}{T\hat{\theta}} \sum_{t=1}^{T} y_t - 1 = 0 ,

which yields the sample mean as the maximum likelihood estimator

\hat{\theta} = \frac{1}{T} \sum_{t=1}^{T} y_t = \bar{y} .

Using the data for y_t in Example 1.10, the maximum likelihood estimate is \hat{\theta} = 15/3 = 5. Evaluating the gradient at \hat{\theta} = 5 verifies that it is zero at the maximum likelihood estimate:

G_T(\hat{\theta}) = \frac{1}{T\hat{\theta}} \sum_{t=1}^{T} y_t - 1 = \frac{15}{3 \times 5} - 1 = 0 .

Example 1.15 Exponential Distribution
From Example 1.11, the first derivative of \ln L_T(\theta) with respect to \theta is

G_T(\theta) = \frac{1}{\theta} - \frac{1}{T} \sum_{t=1}^{T} y_t .

Setting G_T(\hat{\theta}) = 0 and solving the resultant first-order condition yields

\hat{\theta} = \frac{T}{\sum_{t=1}^{T} y_t} = \frac{1}{\bar{y}} ,

which is the reciprocal of the sample mean. Using the same observed data for y_t as in Example 1.11, the maximum likelihood estimate is \hat{\theta} = 6/12 = 0.5. The third column of Table 1.2 gives the gradients at each observation evaluated at \hat{\theta} = 0.5,

g_t(0.5) = \frac{1}{0.5} - y_t .

The gradient is

G_T(0.5) = \frac{1}{6} \sum_{t=1}^{6} g_t(0.5) = 0 ,

which follows from the properties of the maximum likelihood estimator.

Example 1.16 Normal Distribution
From Example 1.12, the first derivatives of the log-likelihood function are

\frac{\partial \ln L_T(\theta)}{\partial \mu} = \frac{1}{\sigma^2 T} \sum_{t=1}^{T} (y_t-\mu) ,   \frac{\partial \ln L_T(\theta)}{\partial \sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4 T} \sum_{t=1}^{T} (y_t-\mu)^2 ,

yielding the gradient vector

G_T(\theta) = \begin{bmatrix} \frac{1}{\sigma^2 T} \sum_{t=1}^{T} (y_t-\mu) \\ -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4 T} \sum_{t=1}^{T} (y_t-\mu)^2 \end{bmatrix} .

Evaluating the gradient at \hat{\theta} and setting G_T(\hat{\theta}) = 0 gives

G_T(\hat{\theta}) = \begin{bmatrix} \frac{1}{\hat{\sigma}^2 T} \sum_{t=1}^{T} (y_t-\hat{\mu}) \\ -\frac{1}{2\hat{\sigma}^2} + \frac{1}{2\hat{\sigma}^4 T} \sum_{t=1}^{T} (y_t-\hat{\mu})^2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} .

Solving for \hat{\theta} = {\hat{\mu}, \hat{\sigma}^2}, the maximum likelihood estimators are

\hat{\mu} = \frac{1}{T} \sum_{t=1}^{T} y_t = \bar{y} ,   \hat{\sigma}^2 = \frac{1}{T} \sum_{t=1}^{T} (y_t - \bar{y})^2 .

Using the data from Example 1.12, the maximum likelihood estimates are

\hat{\mu} = \frac{5 - 1 + 3 + 0 + 2 + 3}{6} = 2 ,
\hat{\sigma}^2 = \frac{(5-2)^2 + (-1-2)^2 + (3-2)^2 + (0-2)^2 + (2-2)^2 + (3-2)^2}{6} = 4 ,

which agree with the values given in Example 1.12.
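The first-order conditions of Example 1.16 are easily checked numerically. A minimal MATLAB sketch, illustrative only:

% Normal example: verify that G_T(theta_hat) = 0
y        = [5; -1; 3; 0; 2; 3];
T        = length(y);
mu_hat   = mean(y);                        % 2
sig2_hat = mean((y - mu_hat).^2);          % 4 (divisor T, not T-1)
G1 = sum(y - mu_hat)/(sig2_hat*T);                             % derivative wrt mu
G2 = -1/(2*sig2_hat) + sum((y - mu_hat).^2)/(2*sig2_hat^2*T);  % derivative wrt sigma^2
disp([G1, G2]);                            % 0  0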
1.4.3 Hessian
To establish that \hat{\theta} maximizes the log-likelihood function, it is necessary to determine that the Hessian

H_T(\theta) = \frac{\partial^2 \ln L_T(\theta)}{\partial\theta \partial\theta'} ,   (1.11)

associated with the log-likelihood function is negative definite. As \theta is a (K \times 1) vector, the Hessian is the (K \times K) symmetric matrix

H_T(\theta) = \begin{bmatrix} \partial^2 \ln L_T(\theta)/\partial\theta_1\partial\theta_1 & \partial^2 \ln L_T(\theta)/\partial\theta_1\partial\theta_2 & \cdots & \partial^2 \ln L_T(\theta)/\partial\theta_1\partial\theta_K \\ \partial^2 \ln L_T(\theta)/\partial\theta_2\partial\theta_1 & \partial^2 \ln L_T(\theta)/\partial\theta_2\partial\theta_2 & \cdots & \partial^2 \ln L_T(\theta)/\partial\theta_2\partial\theta_K \\ \vdots & \vdots & & \vdots \\ \partial^2 \ln L_T(\theta)/\partial\theta_K\partial\theta_1 & \partial^2 \ln L_T(\theta)/\partial\theta_K\partial\theta_2 & \cdots & \partial^2 \ln L_T(\theta)/\partial\theta_K\partial\theta_K \end{bmatrix} = \frac{1}{T} \sum_{t=1}^{T} h_t(\theta) ,

where the subscript T emphasizes that the Hessian is the sample average of the individual terms

h_t(\theta) = \frac{\partial^2 \ln l_t(\theta)}{\partial\theta \partial\theta'} .

The second-order condition for a maximum requires that the Hessian matrix evaluated at \hat{\theta},

H_T(\hat{\theta}) = \left. \frac{\partial^2 \ln L_T(\theta)}{\partial\theta \partial\theta'} \right|_{\theta=\hat{\theta}} ,   (1.12)

is negative definite. Negative definiteness requires the leading principal minors of H_T(\hat{\theta}) to alternate in sign, beginning with a negative sign:

H_{11} < 0 ,   \begin{vmatrix} H_{11} & H_{12} \\ H_{21} & H_{22} \end{vmatrix} > 0 ,   \begin{vmatrix} H_{11} & H_{12} & H_{13} \\ H_{21} & H_{22} & H_{23} \\ H_{31} & H_{32} & H_{33} \end{vmatrix} < 0 , \cdots

where H_{ij} is the ij-th element of H_T(\hat{\theta}). In the case of K = 1, the condition is

H_{11} < 0 .   (1.13)

For the case of K = 2, the conditions are

H_{11} < 0 ,   H_{11}H_{22} - H_{12}H_{21} > 0 .   (1.14)

Example 1.17 Poisson Distribution
From Examples 1.10 and 1.14, the second derivative of \ln L_T(\theta) with respect to \theta is

H_T(\theta) = -\frac{1}{\theta^2 T} \sum_{t=1}^{T} y_t .

Evaluating the Hessian at the maximum likelihood estimator, \hat{\theta} = \bar{y}, yields

H_T(\hat{\theta}) = -\frac{1}{\bar{y}^2 T} \sum_{t=1}^{T} y_t = -\frac{1}{\bar{y}} < 0 .

As \bar{y} is the mean of a sample of non-negative counts, it is positive for any sample that is not identically zero, so the Hessian is negative and a maximum is achieved. Using the data for y_t in Example 1.10 verifies that the Hessian at \hat{\theta} = 5 is negative:

H_T(\hat{\theta}) = -\frac{1}{\hat{\theta}^2 T} \sum_{t=1}^{T} y_t = -\frac{15}{5^2 \times 3} = -0.200 .

Example 1.18 Exponential Distribution
From Examples 1.11 and 1.15, the second derivative of \ln L_T(\theta) with respect to \theta is

H_T(\theta) = -\frac{1}{\theta^2} .

Evaluating the Hessian at the maximum likelihood estimator yields

H_T(\hat{\theta}) = -\frac{1}{\hat{\theta}^2} < 0 .

As this term is negative for any \hat{\theta}, the condition in equation (1.13) is satisfied and a maximum is achieved. The last column of Table 1.2 shows that the Hessian at each observation, evaluated at the maximum likelihood estimate, is constant. The value of the Hessian is

H_T(0.5) = \frac{1}{6} \sum_{t=1}^{6} h_t(0.5) = \frac{-24.000}{6} = -4 ,

which is negative, confirming that a maximum has been reached.

Example 1.19 Normal Distribution
From Examples 1.12 and 1.16, the second derivatives of \ln L_T(\theta) with respect to \theta are

\frac{\partial^2 \ln L_T(\theta)}{\partial\mu^2} = -\frac{1}{\sigma^2}
\frac{\partial^2 \ln L_T(\theta)}{\partial\mu\,\partial\sigma^2} = -\frac{1}{\sigma^4 T} \sum_{t=1}^{T} (y_t-\mu)
\frac{\partial^2 \ln L_T(\theta)}{\partial(\sigma^2)^2} = \frac{1}{2\sigma^4} - \frac{1}{\sigma^6 T} \sum_{t=1}^{T} (y_t-\mu)^2 ,

so that the Hessian is

H_T(\theta) = \begin{bmatrix} -\frac{1}{\sigma^2} & -\frac{1}{\sigma^4 T} \sum_{t=1}^{T} (y_t-\mu) \\ -\frac{1}{\sigma^4 T} \sum_{t=1}^{T} (y_t-\mu) & \frac{1}{2\sigma^4} - \frac{1}{\sigma^6 T} \sum_{t=1}^{T} (y_t-\mu)^2 \end{bmatrix} .

Given that G_T(\hat{\theta}) = 0, from Example 1.16 it follows that \sum_{t=1}^{T} (y_t-\hat{\mu}) = 0 and \sum_{t=1}^{T} (y_t-\hat{\mu})^2 = T\hat{\sigma}^2, and therefore

H_T(\hat{\theta}) = \begin{bmatrix} -\frac{1}{\hat{\sigma}^2} & 0 \\ 0 & -\frac{1}{2\hat{\sigma}^4} \end{bmatrix} .

From equation (1.14),

H_{11} = -\frac{1}{\hat{\sigma}^2} < 0 ,   H_{11}H_{22} - H_{12}H_{21} = \left(-\frac{1}{\hat{\sigma}^2}\right)\left(-\frac{1}{2\hat{\sigma}^4}\right) - 0^2 = \frac{1}{2\hat{\sigma}^6} > 0 ,

establishing that the second-order condition for a maximum is satisfied. Using the maximum likelihood estimates from Example 1.16, the Hessian is

H_T(\hat{\mu}, \hat{\sigma}^2) = \begin{bmatrix} -\frac{1}{4} & 0 \\ 0 & -\frac{1}{2 \times 4^2} \end{bmatrix} = \begin{bmatrix} -0.250 & 0.000 \\ 0.000 & -0.031 \end{bmatrix} .
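Similarly, the second-order condition of Example 1.19 can be verified by constructing the Hessian at the maximum likelihood estimates and inspecting its eigenvalues, which must all be negative for negative definiteness. A minimal sketch, illustrative only:

% Normal example: Hessian at the MLE and a negative definiteness check
y        = [5; -1; 3; 0; 2; 3];
T        = length(y);
mu_hat   = mean(y);
sig2_hat = mean((y - mu_hat).^2);
H12 = -sum(y - mu_hat)/(sig2_hat^2*T);     % off-diagonal term (zero at the MLE)
H22 = 1/(2*sig2_hat^2) - sum((y - mu_hat).^2)/(sig2_hat^3*T);
H   = [-1/sig2_hat, H12; H12, H22];
disp(H);                                   % [-0.250 0; 0 -0.031]
disp(eig(H));                              % both eigenvalues negative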
1.5 Applications
To highlight the features of maximum likelihood estimation discussed thus far, two applications are presented that focus on estimating the discrete time version of the Vasicek (1977) model of the interest rate, r_t. The first application is based on the marginal (stationary) distribution, while the second focuses on the conditional (transitional) distribution, which gives the distribution of r_t conditional on r_{t-1}. The interest rate data are from Aït-Sahalia (1996). The data, plotted in Figure 1.6, consist of daily 7-day Eurodollar rates (expressed as percentages) for the period 1 June 1973 to 25 February 1995, a total of T = 5505 observations.

[Figure 1.6: Daily 7-day Eurodollar interest rates from 1 June 1973 to 25 February 1995, expressed as a percentage.]

The Vasicek model expresses the change in the interest rate as a function of a constant and the lagged interest rate,

r_t - r_{t-1} = \alpha + \beta r_{t-1} + u_t ,   u_t \sim iid\,N(0, \sigma^2) ,   (1.15)

where \theta = {\alpha, \beta, \sigma^2} are unknown parameters, with the restriction \beta < 0.

1.5.1 Stationary Distribution of the Vasicek Model
As a preliminary step to estimating the parameters of the Vasicek model in equation (1.15), consider the alternative model where the level of the interest rate is independent of previous interest rates,

r_t = \mu_s + v_t ,   v_t \sim iid\,N(0, \sigma_s^2) .

The stationary distribution of r_t for this model is

f(r; \mu_s, \sigma_s^2) = \frac{1}{\sqrt{2\pi\sigma_s^2}} \exp\left[ -\frac{(r-\mu_s)^2}{2\sigma_s^2} \right] .   (1.16)

The relationship between the parameters of the stationary distribution and the parameters of the model in equation (1.15) is

\mu_s = -\frac{\alpha}{\beta} ,   \sigma_s^2 = -\frac{\sigma^2}{\beta(2+\beta)} ,   (1.17)

which are obtained as the unconditional mean and variance of (1.15). The log-likelihood function based on the stationary distribution in equation (1.16) for a sample of T observations is

\ln L_T(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma_s^2 - \frac{1}{2\sigma_s^2 T} \sum_{t=1}^{T} (r_t-\mu_s)^2 ,

where \theta = {\mu_s, \sigma_s^2}. Maximizing \ln L_T(\theta) with respect to \theta gives

\hat{\mu}_s = \frac{1}{T} \sum_{t=1}^{T} r_t ,   \hat{\sigma}_s^2 = \frac{1}{T} \sum_{t=1}^{T} (r_t-\hat{\mu}_s)^2 .   (1.18)

Using the Eurodollar interest rates, the maximum likelihood estimates are

\hat{\mu}_s = 8.362 ,   \hat{\sigma}_s^2 = 12.893 .   (1.19)

The stationary distribution is estimated by evaluating equation (1.16) at the maximum likelihood estimates in (1.19), giving

f(r; \hat{\mu}_s, \hat{\sigma}_s^2) = \frac{1}{\sqrt{2\pi\hat{\sigma}_s^2}} \exp\left[ -\frac{(r-\hat{\mu}_s)^2}{2\hat{\sigma}_s^2} \right] = \frac{1}{\sqrt{2\pi \times 12.893}} \exp\left[ -\frac{(r-8.362)^2}{2 \times 12.893} \right] ,   (1.20)

which is presented in Figure 1.7.

[Figure 1.7: Estimated stationary distribution of the Vasicek model, based on evaluating (1.16) at the maximum likelihood estimates (1.19), using daily Eurodollar rates from 1 June 1973 to 25 February 1995.]

Inspection of the estimated distribution shows a potential problem with the Vasicek stationary distribution, namely that its support is not restricted to positive interest rates. The probability of negative values for the interest rate is

\Pr(r < 0) = \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi \times 12.893}} \exp\left[ -\frac{(r-8.362)^2}{2 \times 12.893} \right] dr = 0.01 .

To avoid this problem, alternative models of interest rates are specified where the stationary distribution is defined only over the positive region. A well-known example is the CIR interest rate model (Cox, Ingersoll and Ross, 1985), which is discussed in Chapters 2, 3 and 12.

1.5.2 Transitional Distribution of the Vasicek Model
In contrast to the stationary model specification of the previous section, the full dynamics of the Vasicek model in equation (1.15) are now used by specifying the transitional distribution

f(r | r_{t-1}; \alpha, \rho, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(r - \alpha - \rho r_{t-1})^2}{2\sigma^2} \right] ,   (1.21)

where \theta = {\alpha, \rho, \sigma^2} and the substitution \rho = 1 + \beta is made for convenience. This distribution is of the same form as the conditional distribution of the AR(1) model in Examples 1.5, 1.9 and 1.13. The log-likelihood function based on the transitional distribution in equation (1.21) is

\ln L_T(\theta) = -\frac{1}{2}\ln 2\pi - \frac{1}{2}\ln\sigma^2 - \frac{1}{2\sigma^2 (T-1)} \sum_{t=2}^{T} (r_t - \alpha - \rho r_{t-1})^2 ,

where the sample size is reduced by one observation as a result of the lagged term r_{t-1}. This form of the log-likelihood function does not contain the marginal distribution f(r_1; \theta), a point that is made in Example 1.13. The first derivatives of the log-likelihood function are

\frac{\partial \ln L_T(\theta)}{\partial\alpha} = \frac{1}{\sigma^2 (T-1)} \sum_{t=2}^{T} (r_t - \alpha - \rho r_{t-1})
\frac{\partial \ln L_T(\theta)}{\partial\rho} = \frac{1}{\sigma^2 (T-1)} \sum_{t=2}^{T} (r_t - \alpha - \rho r_{t-1}) r_{t-1}
\frac{\partial \ln L_T(\theta)}{\partial\sigma^2} = -\frac{1}{2\sigma^2} + \frac{1}{2\sigma^4 (T-1)} \sum_{t=2}^{T} (r_t - \alpha - \rho r_{t-1})^2 .

Setting these derivatives to zero yields the maximum likelihood estimators

\hat{\alpha} = \bar{r}_t - \hat{\rho}\,\bar{r}_{t-1}
\hat{\rho} = \frac{\sum_{t=2}^{T} (r_t - \bar{r}_t)(r_{t-1} - \bar{r}_{t-1})}{\sum_{t=2}^{T} (r_{t-1} - \bar{r}_{t-1})^2}
\hat{\sigma}^2 = \frac{1}{T-1} \sum_{t=2}^{T} (r_t - \hat{\alpha} - \hat{\rho} r_{t-1})^2 ,

where

\bar{r}_t = \frac{1}{T-1} \sum_{t=2}^{T} r_t ,   \bar{r}_{t-1} = \frac{1}{T-1} \sum_{t=2}^{T} r_{t-1} .

The maximum likelihood estimates for the Eurodollar interest rates are

\hat{\alpha} = 0.053 ,   \hat{\rho} = 0.994 ,   \hat{\sigma}^2 = 0.165 .   (1.22)

An estimate of \beta is obtained by using the relationship \rho = 1 + \beta. Rearranging for \beta and evaluating at \hat{\rho} gives \hat{\beta} = \hat{\rho} - 1 = -0.006.
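The closed-form estimators above amount to a least-squares regression of r_t on a constant and r_{t-1}. A minimal MATLAB sketch of the calculation, assuming the Eurodollar rates have already been loaded into a column vector r (the variable name is an illustrative assumption, and this fragment is not the book's program for this application):

% Vasicek transitional distribution: closed-form maximum likelihood estimates
% assumes r is a (T x 1) vector of interest rates already in memory
y = r(2:end);                              % r_t for t = 2,...,T
x = r(1:end-1);                            % r_{t-1}
rho_hat   = sum((y - mean(y)).*(x - mean(x))) / sum((x - mean(x)).^2);
alpha_hat = mean(y) - rho_hat*mean(x);
u         = y - alpha_hat - rho_hat*x;     % residuals
sig2_hat  = mean(u.^2);                    % divisor T-1, matching the text
beta_hat  = rho_hat - 1;
disp([alpha_hat, rho_hat, sig2_hat, beta_hat]);  % approx. 0.053 0.994 0.165 -0.006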
The estimated transitional distribution is obtained by evaluating (1.21) at the maximum likelihood estimates in (1.22),

f(r | r_{t-1}; \hat{\alpha}, \hat{\rho}, \hat{\sigma}^2) = \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \exp\left[ -\frac{(r - \hat{\alpha} - \hat{\rho} r_{t-1})^2}{2\hat{\sigma}^2} \right] .   (1.23)

Plots of this distribution are given in Figure 1.8 for three values of the conditioning variable r_{t-1}, corresponding to the minimum (2.9%), median (8.1%) and maximum (24.3%) interest rates in the sample.

[Figure 1.8: Estimated transitional distribution of the Vasicek model, based on evaluating (1.23) at the maximum likelihood estimates in (1.22) using Eurodollar rates from 1 June 1973 to 25 February 1995. The dashed line is the transitional density for the minimum (2.9%), the solid line for the median (8.1%) and the dotted line for the maximum (24.3%) Eurodollar rate.]

The location of the three transitional distributions changes over time, while the spread of each distribution remains constant at \hat{\sigma}^2 = 0.165. A comparison of the estimated variances of the stationary and transitional distributions, in equations (1.19) and (1.22) respectively, shows that \hat{\sigma}^2 < \hat{\sigma}_s^2. This result reflects the property that, by conditioning on information, in this case r_{t-1}, the transitional distribution is better at tracking the time series behaviour of the interest rate, r_t, than the stationary distribution, where there is no conditioning on lagged dependent variables.

Having obtained the estimated transitional distribution using the maximum likelihood estimates in (1.22), it is also possible to use these estimates to reestimate the stationary interest rate distribution in (1.20) by using the expressions in (1.17). The alternative estimates of the mean and variance of the stationary distribution, evaluated using the unrounded parameter estimates, are

\tilde{\mu}_s = -\frac{\hat{\alpha}}{\hat{\beta}} = \frac{0.053}{0.006} = 8.308 ,   \tilde{\sigma}_s^2 = -\frac{\hat{\sigma}^2}{\hat{\beta}(2+\hat{\beta})} = \frac{0.165}{0.006 (2 - 0.006)} = 12.967 .

As these estimates are based on the transitional distribution, which incorporates the full dynamic specification of the Vasicek model, they represent the maximum likelihood estimates of the parameters of the stationary distribution. This relationship between the maximum likelihood estimators of the transitional and stationary distributions is based on the invariance property of maximum likelihood estimators, which is discussed in Chapter 2. While the parameter estimates of the stationary distribution obtained from the transitional estimates are numerically close to those obtained in the previous section, the latter come from a misspecified model, as the stationary model excludes the dynamic structure in equation (1.15). Issues relating to misspecified models are discussed in Chapter 9.
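Continuing the previous sketch, the invariance calculation is a direct application of (1.17) (this assumes alpha_hat, beta_hat and sig2_hat from the preceding fragment are in memory):

% stationary mean and variance implied by the transitional estimates, (1.17)
mu_s_tilde   = -alpha_hat/beta_hat;                  % approx. 8.31
sig2_s_tilde = -sig2_hat/(beta_hat*(2 + beta_hat));  % approx. 12.97
disp([mu_s_tilde, sig2_s_tilde]);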
1.6 Exercises

(1) Sampling Data

Gauss file(s)   basic_sample.g
Matlab file(s)  basic_sample.m

This exercise reproduces the simulation results in Figures 1.1 and 1.2. For each model, simulate T = 5 draws of y_t and plot the corresponding distribution at each point in time. Where applicable, the explanatory variable in these exercises is x_t = {0, 1, 2, 3, 4} and w_t are draws from a uniform distribution on the unit circle.

(a) Time invariant model: y_t = 2 z_t , z_t \sim iid\,N(0, 1).
(b) Count model: f(y; 2) = 2^y \exp[-2]/y! , y = 0, 1, 2, \cdots.
(c) Linear regression model: y_t = 3 x_t + 2 z_t , z_t \sim iid\,N(0, 1).
(d) Exponential regression model: f(y; \theta) = (1/\mu_t) \exp[-y/\mu_t] , \mu_t = 1 + 2 x_t.
(e) Autoregressive model: y_t = 0.8 y_{t-1} + 2 z_t , z_t \sim iid\,N(0, 1).
(f) Bilinear time series model: y_t = 0.8 y_{t-1} + 0.4 y_{t-1} u_{t-1} + 2 z_t , z_t \sim iid\,N(0, 1).
(g) Autoregressive model with heteroskedasticity: y_t = 0.8 y_{t-1} + \sigma_t z_t , \sigma_t^2 = 0.8 + 0.8 w_t , z_t \sim iid\,N(0, 1).
(h) ARCH regression model: y_t = 3 x_t + u_t , u_t = \sigma_t z_t , \sigma_t^2 = 4 + 0.9 u_{t-1}^2 , z_t \sim iid\,N(0, 1).

(2) Poisson Distribution

Gauss file(s)   basic_poisson.g
Matlab file(s)  basic_poisson.m

A sample of T = 4 observations, y_t = {6, 2, 3, 1}, is drawn from the Poisson distribution

f(y; \theta) = \frac{\theta^y \exp[-\theta]}{y!} .

(a) Write the log-likelihood function, \ln L_T(\theta).
(b) Derive and interpret the maximum likelihood estimator, \hat{\theta}.
(c) Compute the maximum likelihood estimate, \hat{\theta}.
(d) Compute the log-likelihood function at \hat{\theta} for each observation.
(e) Compute the value of the log-likelihood function at \hat{\theta}.
(f) Compute g_t(\hat{\theta}) = d \ln l_t(\theta)/d\theta |_{\theta=\hat{\theta}} and h_t(\hat{\theta}) = d^2 \ln l_t(\theta)/d\theta^2 |_{\theta=\hat{\theta}} for each observation.
(g) Compute G_T(\hat{\theta}) = (1/T) \sum_{t=1}^{T} g_t(\hat{\theta}) and H_T(\hat{\theta}) = (1/T) \sum_{t=1}^{T} h_t(\hat{\theta}).

(3) Exponential Distribution

Gauss file(s)   basic_exp.g
Matlab file(s)  basic_exp.m

A sample of T = 4 observations, y_t = {5.5, 2.0, 3.5, 5.0}, is drawn from the exponential distribution

f(y; \theta) = \theta \exp[-\theta y] .

A minimal computational sketch for this exercise appears after Exercise 7.

(a) Write the log-likelihood function, \ln L_T(\theta).
(b) Derive and interpret the maximum likelihood estimator, \hat{\theta}.
(c) Compute the maximum likelihood estimate, \hat{\theta}.
(d) Compute the log-likelihood function at \hat{\theta} for each observation.
(e) Compute the value of the log-likelihood function at \hat{\theta}.
(f) Compute g_t(\hat{\theta}) = d \ln l_t(\theta)/d\theta |_{\theta=\hat{\theta}} and h_t(\hat{\theta}) = d^2 \ln l_t(\theta)/d\theta^2 |_{\theta=\hat{\theta}} for each observation.
(g) Compute G_T(\hat{\theta}) = (1/T) \sum_{t=1}^{T} g_t(\hat{\theta}) and H_T(\hat{\theta}) = (1/T) \sum_{t=1}^{T} h_t(\hat{\theta}).

(4) Alternative Form of Exponential Distribution

Consider a random sample of size T, {y_1, y_2, \cdots, y_T}, of iid random variables from the exponential distribution with parameter \theta,

f(y; \theta) = \frac{1}{\theta} \exp\left[ -\frac{y}{\theta} \right] .

(a) Derive the log-likelihood function, \ln L_T(\theta).
(b) Derive the first derivative of the log-likelihood function, G_T(\theta).
(c) Derive the second derivative of the log-likelihood function, H_T(\theta).
(d) Derive the maximum likelihood estimator of \theta. Compare the result with that obtained in Exercise 3.

(5) Normal Distribution

Gauss file(s)   basic_normal.g, basic_normal_like.g
Matlab file(s)  basic_normal.m, basic_normal_like.m

A sample of T = 5 observations consisting of the values {1, 2, 5, 1, 2} is drawn from the normal distribution

f(y; \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{(y-\mu)^2}{2\sigma^2} \right] ,

where \theta = {\mu, \sigma^2}.

(a) Assume that \sigma^2 = 1.
    (i) Derive the log-likelihood function, \ln L_T(\theta).
    (ii) Derive and interpret the maximum likelihood estimator, \hat{\theta}.
    (iii) Compute the maximum likelihood estimate, \hat{\theta}.
    (iv) Compute \ln l_t(\hat{\theta}), g_t(\hat{\theta}) and h_t(\hat{\theta}).
    (v) Compute \ln L_T(\hat{\theta}), G_T(\hat{\theta}) and H_T(\hat{\theta}).
(b) Repeat part (a) for the case where both the mean and the variance are unknown, \theta = {\mu, \sigma^2}.
(6) A Model of the Number of Strikes

Gauss file(s)   basic_count.g, strike.dat
Matlab file(s)  basic_count.m, strike.mat

The data are the number of strikes per annum, y_t, in the U.S. from 1968 to 1976, taken from Kennan (1985). The number of strikes is specified as a Poisson-distributed random variable with unknown parameter \theta,

f(y; \theta) = \frac{\theta^y \exp[-\theta]}{y!} .

(a) Write the log-likelihood function for a sample of T observations.
(b) Derive and interpret the maximum likelihood estimator of \theta.
(c) Estimate \theta and interpret the result.
(d) Use the estimate from part (c) to plot the distribution of the number of strikes and interpret this plot.
(e) Compute a histogram of y_t and comment on its consistency with the distribution of strike numbers estimated in part (d).

(7) A Model of the Duration of Strikes

Gauss file(s)   basic_strike.g, strike.dat
Matlab file(s)  basic_strike.m, strike.mat

The data are 62 observations, taken from the same source as Exercise 6, of the duration of strikes in the U.S. per annum expressed in days, y_t. Durations are assumed to be drawn from an exponential distribution
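The following MATLAB fragment is the sketch referred to in Exercise 3: a minimal starting point for the calculations required in Exercises 2 to 5 and 7, shown here for the Exercise 3 exponential sample. It is written for this text as a stand-in for, not a copy of, the basic_exp.m program.

% Exercise 3: maximum likelihood calculations for an exponential sample
y = [5.5; 2.0; 3.5; 5.0];                   % observed data, T = 4
T = length(y);
theta_hat = 1/mean(y);                      % part (c): MLE is 1/sample mean = 0.25
lnl_t = log(theta_hat) - theta_hat*y;       % part (d): contributions ln l_t(theta_hat)
g_t   = 1/theta_hat - y;                    % part (f): gradients at theta_hat
h_t   = -ones(T,1)/theta_hat^2;             % part (f): second derivatives at theta_hat
disp([mean(lnl_t), mean(g_t), mean(h_t)]);  % part (g): lnL_T, G_T and H_T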