Doing Good Research in the Face of “Missing Data”
Denis Leung
Associate Professor, School of Economics & Social Sciences
All statistical analysis requires the availability of good data sets. Yet, in most studies, it is not easy to obtain a complete data set. Incompleteness may arise at random or by design. In some data collection processes, cost considerations, whether in time or in money, mean that only limited information on the sample is collected. Incompleteness may also be non-random. For example, some respondents may refuse to give information relating to sensitive personal issues, such as income or sexual preferences. Whatever its source, data incompleteness poses a serious problem for researchers. At best, it reduces the accuracy of a study's results; at worst, it could lead to biased conclusions.
In recent years, Prof Denis Leung has spent much time finding ways to deal with the problem of “missing data”. In particular, he has extensively studied the use of surrogate data to recover information lost through missing data. Surrogate data are proxies of the intended but missing data. For example, a proxy for household income may be the home address and a proxy for household spending may be the household size. Even though a surrogate does not directly provide the intended information, it nevertheless provides some information about the missing data.
According to Prof Leung, in some cases, researchers might in fact prefer to use surrogate data rather than the actual intended data. Quite often, surrogate data are much cheaper and easier to obtain than the actual data. Furthermore, it is conceivable that surrogate data could provide more accurate information than the actual data, even if the latter were available. For example, in surveying the sexual preferences of individuals, the information obtained by putting direct questions to the interviewees may not be as accurate as that provided by surrogate data.
There are two fundamental steps involved in the attempt to recover information through the use of surrogate data, said Prof Leung. The first step is to establish the relationship between the missing data and the surrogate data, through the use of some model. This is often a difficult task, as the modeling is targeted at data that are already “missing”. The second step, once a model has been chosen, is to “fill in” the missing data using the surrogate. The “complete” data thus obtained are then used for research purposes.
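The two steps described above can be sketched in a few lines of Python. This is only an illustration under strong assumptions, not Prof Leung's method: it assumes the relationship between the intended variable and the surrogate is a straight line, and the function name `impute_with_surrogate` and the toy numbers are hypothetical.

```python
import numpy as np

def impute_with_surrogate(y, s):
    """Fill missing entries of y (marked as NaN) using a surrogate s.

    Step 1: fit a straight line y ~ intercept + slope * s on the cases
            where y was observed (an assumed, deliberately simple model).
    Step 2: use the fitted line to "fill in" the cases where y is missing.
    """
    y = np.asarray(y, dtype=float)
    s = np.asarray(s, dtype=float)
    observed = ~np.isnan(y)

    # Step 1: model the relationship using the complete cases only.
    slope, intercept = np.polyfit(s[observed], y[observed], 1)

    # Step 2: replace each missing value with the model's prediction.
    filled = y.copy()
    filled[~observed] = intercept + slope * s[~observed]
    return filled

# Toy data: the surrogate is always observed, the intended variable is not.
surrogate = [1.0, 2.0, 3.0, 4.0, 5.0]
intended = [2.0, 4.0, np.nan, 8.0, np.nan]
completed = impute_with_surrogate(intended, surrogate)
```

If the assumed straight-line model is wrong, the filled-in values inherit that error, which is the robustness concern raised in the article.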
As Prof Leung noted, the process of information recovery must fulfill two objectives: it must be robust to errors in model specification and it must also be efficient. Unfortunately, in most practical situations, these two aims cannot be satisfied simultaneously. In other words, a very robust approach is bound to be inefficient and vice versa. “The challenge is to come up with models that strike a good balance between these two aims,” he said. “Meeting such challenges has been the focus of my work in the past few years.”
Specifically, Prof Leung has focused on using models that require as few assumptions as possible, through a method called empirical likelihood. “An advantage of this method is that it combines the ‘filling in’ and the analysis in one step,” he said. In his studies, he found that the method could recover almost all the information lost as long as the surrogate data are “representative” of the intended but missing data. In addition, he found that this method performs no worse than any other method that does not make use of surrogate data, even when the surrogate is completely unrelated to the intended data.
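For readers unfamiliar with empirical likelihood, the sketch below shows the idea in its simplest textbook form: estimating weights for a sample under a constraint on the mean, with no distributional assumption. It is not the surrogate-data procedure from Prof Leung's papers; the function name `empirical_likelihood_weights` and the sample values are hypothetical.

```python
import numpy as np

def empirical_likelihood_weights(x, mu, iterations=200):
    """Weights p_i that maximize prod(p_i) subject to
    sum(p_i) = 1 and sum(p_i * x_i) = mu.

    The solution has the form p_i = 1 / (n * (1 + lam * (x_i - mu))),
    where lam is a Lagrange multiplier found here by bisection.
    """
    x = np.asarray(x, dtype=float)
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:
        raise ValueError("mu must lie strictly inside the range of x")

    # lam must keep every 1 + lam * z_i positive, which bounds it
    # to the open interval (-1 / max(z), -1 / min(z)).
    lo = (-1.0 / z.max()) * (1 - 1e-9)
    hi = (-1.0 / z.min()) * (1 - 1e-9)

    # sum(z / (1 + lam * z)) is strictly decreasing in lam, so bisection
    # locates the unique root.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if np.sum(z / (1.0 + mid * z)) > 0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 1.0 / (len(x) * (1.0 + lam * z))

sample = [1.0, 2.0, 3.0, 4.0, 10.0]
weights = empirical_likelihood_weights(sample, mu=3.5)
```

The weights play the role that a parametric likelihood would otherwise play, which is why the method needs so few modeling assumptions.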
Examples of Prof Leung’s work may be seen in two recent publications: “Information recovery in a study with surrogate endpoints” in Journal of the American Statistical Association 2003 98: 1052-1062 (with Chen, S.X. and Qin, J.) and “Estimation with survey data under nonignorable non-response or informative sampling” in Journal of the American Statistical Association 97: 193-200 (with Qin, J. and Shao, J.).