The importance of synthetic datasets in empirical testing: comparison of NL models and MNL with error components models.

Author(s)
Garrow, L. & Bodea, T.
Year
Abstract

There are two key purposes of this paper. First, the results of an empirical analysis are presented that show that the ability to recover logsum parameter estimates that exhibit a low degree of variation across multiple datasets depends critically on the number of observations, the frequency of chosen alternatives, and the amount of correlation in the nests. Moreover, even after controlling for variation across datasets, it is found that logsum parameter estimates are biased in specific ways. Specifically, it is found that: (1) the wide range of logsum parameter estimates of 0.125 obtained from 20 datasets of 10,000 observations decreases to a range of 0.006 when 20 datasets of 1,000,000 observations are used, (2) as the frequency of chosen alternatives in a nest decreases, so too does the ability to recover logsum parameter estimates for that nest, and (3) as the logsum coefficients decrease (or the amount of correlation in the nest increases), the ability to recover parameter estimates slightly decreases. With regard to coefficient bias, it is found that: (1) all logsum coefficients are biased upwards in synthetic datasets generated using the procedure described in Garrow and Bodea (2005), (2) the bias dramatically increases for those nests that have a low choice frequency, and (3) the bias tends to be more pronounced for those nests with high correlations among alternatives. The second key purpose of this paper is to complete the empirical comparison of NL models and MNL with error components models. Specifically, there are two common approaches for incorporating correlation among alternatives. The first approach uses a mixed MNL model and allows the parameters of the utility function to vary across alternatives in such a way that analogs to GEV models, such as the NL model, are created. The attributes used to create correlation and/or heterogeneity are called error components.
The second approach uses a more complicated GEV model, such as the NL model, to represent correlation among alternatives. The benefit of this approach is that it has fewer dimensions of integration and so should require less computational time. The disadvantage is that the researcher needs to program a more complicated log-likelihood function. Based on the findings described above, 100,000 observations are used to empirically compare these two approaches for representing correlation among alternatives. Specifically, this study explores the sensitivity of empirical identification in mixed MNL models to different factors, including the choice frequency of alternatives, the amount of correlation in the nests, and the number of Halton draws used as support points. Results indicate a clear lack of empirical identification for nests that have a low choice frequency (defined as each alternative in the low-frequency nest being chosen approximately 2,500 times). Moreover, while models with equal choice frequencies (defined as each alternative being chosen approximately 16,667 times) converge, the coefficients associated with error components are biased. Both of these findings provide a motivation for estimation of mixed GEV models. For the covering abstract see ITRD E135582.
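
As an illustration of the role the logsum parameter plays in the NL model discussed in the abstract, the following is a minimal sketch of two-level nested logit choice probabilities, assuming the common utility-maximising form normalised at the root. It is not the authors' estimation or data-generation code. The function name, the three-nest layout with six alternatives, and all utility and logsum values are hypothetical. In this form a logsum parameter of 1 reproduces the MNL model, and smaller values imply more correlation among the alternatives in a nest.

```python
import math


def nested_logit_probs(utilities, nests, logsum):
    """Two-level nested logit choice probabilities (illustrative sketch).

    utilities: dict mapping alternative -> systematic utility V_j
    nests:     dict mapping nest name -> list of alternatives in that nest
    logsum:    dict mapping nest name -> logsum parameter mu in (0, 1]
    """
    # Inclusive value of each nest: I_m = log(sum_j exp(V_j / mu_m))
    inclusive = {}
    for m, alts in nests.items():
        mu = logsum[m]
        inclusive[m] = math.log(sum(math.exp(utilities[j] / mu) for j in alts))

    # Denominator of the nest-level (marginal) probabilities
    denom = sum(math.exp(logsum[m] * inclusive[m]) for m in nests)

    probs = {}
    for m, alts in nests.items():
        mu = logsum[m]
        p_nest = math.exp(mu * inclusive[m]) / denom  # P(nest m)
        for j in alts:
            p_cond = math.exp(utilities[j] / mu - inclusive[m])  # P(j | nest m)
            probs[j] = p_nest * p_cond
    return probs


# Hypothetical layout: six alternatives in three nests of two.
nests = {"A": ["a1", "a2"], "B": ["b1", "b2"], "C": ["c1", "c2"]}
utilities = {"a1": 0.2, "a2": -0.1, "b1": 0.0, "b2": 0.4, "c1": -0.3, "c2": 0.1}

# All logsum parameters equal to 1: the model collapses to MNL.
mnl_like = nested_logit_probs(utilities, nests, {"A": 1.0, "B": 1.0, "C": 1.0})

# Nest A with a logsum parameter of 0.5: its alternatives are correlated.
correlated = nested_logit_probs(utilities, nests, {"A": 0.5, "B": 1.0, "C": 1.0})
```

Comparing `mnl_like` with `correlated` shows how lowering the logsum parameter of one nest shifts probability between that nest and the others while the probabilities still sum to one.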

Publication

Library number
C 46434 (In: C 46251 [electronic version only]) /71 / ITRD E135978
Source

In: Proceedings of the European Transport Conference ETC, Strasbourg, France, 18-20 September 2006, Pp.
