Relationship between Sparse Estimators and the Correct Selection of Variables

Project Motivation:

A good dataset looks like Swiss cheese: regions with big holes and regions rich in cheese. Sparsity refers to the big holes. While this may seem counterintuitive, having large empty regions is good because it usually means there are also large clusters where the data actually has something.

Now, this is important because as more variables are added to a dataset, more noise gets introduced. The holes shrink smaller and smaller until the data looks totally uniform. At that point everything looks the same, and statisticians will tell you there is really nothing you can do with it. This is the "curse of dimensionality."

This project explores the relationship between selecting the correct set of variables (to help avoid the curse of dimensionality) and using a sparse estimator: a quantity calculated from the sample dataset to give information about an unknown quantity in the population, where most of the estimate's elements are zero. Sparse estimators are especially important because they carry fewer nonzero entries, which can speed up the computation of that unknown population quantity.
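To make this concrete, here is a minimal sketch of one simple way to produce a sparse estimate: fit ordinary least squares, then soft-threshold the coefficients so that small entries become exactly zero. This is an illustration, not the estimator used in this project, and every name and parameter value in it is hypothetical.

```python
import numpy as np

def soft_threshold(x, lam):
    """Shrink each entry toward zero; entries with magnitude below lam become exactly 0."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(0)
beta_true = np.zeros(20)
beta_true[:3] = [4.0, -2.5, 1.5]          # only 3 of 20 true coefficients are nonzero
X = rng.normal(size=(100, 20))
y = X @ beta_true + rng.normal(scale=0.5, size=100)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]   # dense ordinary least squares fit
beta_sparse = soft_threshold(beta_ols, 0.5)       # sparse estimate of the same coefficients

print("nonzero OLS coefficients:   ", np.count_nonzero(beta_ols))
print("nonzero sparse coefficients:", np.count_nonzero(beta_sparse))
```

The sparse estimate stores and uses far fewer nonzero values than the dense one, which is exactly why sparsity can reduce computation.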

What was my role in this research?

Compressed sensing, a relatively new field in signal processing in which sparsity is a primary component, has a great deal of overlap with statistics that has yet to be explored. I focused on one area: implementing sparse estimators across a set of candidate models. To select the best model for our noisy sample data, I used the model selection criteria AIC and BIC. Each criterion outputs a score; the lower the score, the better the model choice. My results suggested that BIC performs better than AIC at selecting the most important variables, but the mean squared error of the parameter estimates under the BIC-selected models was greater than under the AIC-selected models. However, this was only the one-dimensional case. While my work helped begin showing that better model selection can be achieved at the cost of more bias in the parameter estimates, more research is needed to determine whether this holds in higher dimensions.
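As an illustration of how these scores are computed, below is a minimal sketch of AIC/BIC-based model selection for Gaussian linear models on simulated data. It is not the code or data from the project, and the formulas drop additive constants that are identical across models: AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where k is the number of fitted parameters.

```python
import numpy as np

def fit_rss(X, y):
    """Least-squares fit; return the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def aic_bic(rss, n, k):
    """Gaussian AIC and BIC up to a shared constant; k counts fitted parameters."""
    loglik_term = n * np.log(rss / n)
    return loglik_term + 2 * k, loglik_term + k * np.log(n)

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)   # the true model is linear

# Candidate models: polynomials of increasing degree in the single variable x.
for degree in (1, 2, 3):
    X = np.vander(x, degree + 1)        # columns x^degree, ..., x^0
    rss = fit_rss(X, y)
    aic, bic = aic_bic(rss, n, k=degree + 1)
    print(f"degree {degree}: AIC={aic:.1f}  BIC={bic:.1f}")
```

Because BIC's penalty k ln(n) grows with the sample size while AIC's penalty 2k does not, BIC punishes extra variables more harshly, which matches the observation above that it selects variables more conservatively.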

Most Challenging Part of the Project:

The most challenging part of the project was finding the right dataset for model selection. Many of the datasets I found online had more than 30 different variables. While it would have been easy to select the most important variables from them, modelling such a high-dimensional dataset would have been very complex. Instead, I chose a dataset with a single variable, which was simpler to model and to draw conclusions from.