NTD in AI: Best Subset Selection
Non-technical definitions in AI
Best subset selection is an algorithm for discovering which predictor variables (also known as features) in a training data set yield the best model. That is, it tries to weed out the irrelevant variables.
In a simple example with 3 features, a model, for instance a linear regression fitted by minimizing the residual sum of squares, would be trained on each of the 3 variables in turn, then on all possible combinations of 2 variables ({1,2}, {1,3}, {2,3}), then on all 3 variables {1,2,3}. (The algorithm also calls for a model with no predictors at all, one that simply predicts the mean of the observations; this is called the null model.) Let's call each of these levels: the null model is level zero, then one variable was used in level one, two in level two, and three in level three. The enumeration step is sketched below.
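To make the enumeration concrete, here is a minimal sketch in Python that lists every subset of three hypothetical features, grouped by level; the feature names are placeholders for illustration.

```python
# Enumerate every subset of three features, grouped by level
# (the number of predictors used). Level 0 is the null model.
from itertools import combinations

features = ["x1", "x2", "x3"]  # placeholder feature names

for level in range(len(features) + 1):
    subsets = list(combinations(features, level))
    print(f"level {level}: {subsets}")

# Output:
# level 0: [()]                                        <- the null model
# level 1: [('x1',), ('x2',), ('x3',)]
# level 2: [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x3')]
# level 3: [('x1', 'x2', 'x3')]
```

Note that the number of subsets doubles with every extra feature (2^p in total), which is why best subset selection becomes impractical when there are many predictors.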
For each of those levels we then find the subset with the best fit (in our example, the one with the smallest residual sum of squares), so we end up with one candidate per level.
We then use cross-validation on this shortlist of selected models to find the best one. The result is the subset with the best performance given the method we used to fit the model; the whole procedure is sketched below.
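Here is a minimal end-to-end sketch of the procedure described above, assuming a small synthetic data set and scikit-learn for the fitting and cross-validation steps; names like `best_per_level` are illustrative, not a standard API.

```python
# A sketch of best subset selection: pick the best subset per level by
# RSS, then pick the overall winner by cross-validated MSE.
from itertools import combinations

import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data (assumed for illustration): x2 (index 1) is irrelevant.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

# Step 1: within each level, keep the subset with the smallest
# residual sum of squares on the training data.
best_per_level = {0: ()}  # level 0 is the null model (predict the mean)
for level in range(1, p + 1):
    rss = {}
    for subset in combinations(range(p), level):
        model = LinearRegression().fit(X[:, subset], y)
        residuals = y - model.predict(X[:, subset])
        rss[subset] = np.sum(residuals ** 2)
    best_per_level[level] = min(rss, key=rss.get)

# Step 2: cross-validate the shortlist and pick the overall winner.
cv_mse = {}
for level, subset in best_per_level.items():
    # DummyRegressor(strategy="mean") stands in for the null model.
    model = DummyRegressor(strategy="mean") if level == 0 else LinearRegression()
    Xs = X if level == 0 else X[:, subset]
    scores = cross_val_score(model, Xs, y, cv=5, scoring="neg_mean_squared_error")
    cv_mse[subset] = -scores.mean()

winner = min(cv_mse, key=cv_mse.get)
print("best subset:", winner, "cv mse:", cv_mse[winner])
```

In this synthetic example the second feature carries no signal, so the cross-validation step should favour the subset containing only the first and third features.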
Machine learning is a technical subject, and the use of technical terms by engineers has the potential to get in the way of clear communication with non-engineers, especially in a business setting. In spare moments I have started to put together simple, non-technical definitions of nouns and verbs used in the field of machine learning as a kind of Rosetta Stone for non-engineers. This is a work in progress which I may collect into a book one day. This is one of those definitions.
Other non-technical definitions:
- NTD in AI: 1 of K Encoding
- NTD in AI: Activation Function
- NTD in AI: Active Learning
- NTD in AI: Accuracy
- NTD in AI: Autoencoder
- NTD in AI: Backward Stepwise Selection
- NTD in AI: Bagging
- NTD in AI: Batch Normalization
- NTD in AI: Bayesian Hyperparameter Optimization
- NTD in AI: BERT
- NTD in AI: Best Subset Selection
- NTD in AI: Bias
- NTD in AI: Clustering
- NTD in AI: Collaborative Filtering
- NTD in AI: Confusion Set Disambiguation
- NTD in AI: Convolutional Neural Network
- NTD in AI: Cosine Similarity
- NTD in AI: Cost-Sensitive Accuracy
- NTD in AI: Cloze Test
- NTD in AI: Credit Assignment Problem
- NTD in AI: Data Augmentation
- NTD in AI: Data Imputation
- NTD in AI: Dataset
- NTD in AI: DBSCAN
- NTD in AI: Decision Boundary
- NTD in AI: Decoder
- NTD in AI: Deep Learning
- NTD in AI: Denoising Autoencoder
- NTD in AI: Density Estimation
- NTD in AI: Domain Expert
- NTD in AI: Dropout
- NTD in AI: Early Stopping
- NTD in AI: Embedding
- NTD in AI: Encoder
- NTD in AI: Ensemble Learning
- NTD in AI: Expected Test MSE
- NTD in AI: Exploding Gradient
- NTD in AI: Feature
- NTD in AI: Feature Selection
- NTD in AI: Feed Forward Neural Network
- NTD in AI: Filter (Matrix)
- NTD in AI: Forward Propagation
- NTD in AI: Forward Stepwise Selection
- NTD in AI: Fully Connected Neural Network Layers
- NTD in AI: Fully Visible Belief Network
- NTD in AI: Fuzzy Set
- NTD in AI: Gated Recurrent Neural Network
- NTD in AI: Gaussian Kernel Regression
- NTD in AI: Gaussian Mixture Model
- NTD in AI: Generalize
- NTD in AI: Gradient
- NTD in AI: Gradient Boosting
- NTD in AI: Gradient Descent
- NTD in AI: Grid Search
- NTD in AI: Ground Truth
- NTD in AI: Hidden Layers
- NTD in AI: Hyperbolic Tangent (tanH)
- NTD in AI: Hyperparameter
- NTD in AI: Input Vectors
- NTD in AI: Intrinsic Motivation
- NTD in AI: Irreducible Errors
- NTD in AI: k-Means
- NTD in AI: Kernel (Trick)
- NTD in AI: Kernel Regression
- NTD in AI: Label/Labeled Examples
- NTD in AI: LambdaMART
- NTD in AI: Linear Models
- NTD in AI: Logistic Regression (Softmax)
- NTD in AI: Long Short Term Memory (LSTM)
- NTD in AI: Meta-Model
- NTD in AI: Manhattan Taxicab Norm
- NTD in AI: MNIST
- NTD in AI: Model Cards
- NTD in AI: Moment Matching
- NTD in AI: MP Neuron
- NTD in AI: Multi-Label Classification
- NTD in AI: Multi-Layer Perceptron
- NTD in AI: Munging
- NTD in AI: NADE
- NTD in AI: Non-Parametric Methods
- NTD in AI: Norm
- NTD in AI: Observation
- NTD in AI: One Class Classification
- NTD in AI: One-Hot Encoding
- NTD in AI: One Shot Learning
- NTD in AI: One Versus Rest
- NTD in AI: Oracle
- NTD in AI: Overfitting
- NTD in AI: Oversampling
- NTD in AI: Padding
- NTD in AI: Perceptron
- NTD in AI: Pooling
- NTD in AI: Prediction Strength
- NTD in AI: Predictors
- NTD in AI: Preprocessing
- NTD in AI: Principal Component Analysis (PCA)
- NTD in AI: Random Search
- NTD in AI: ReLU
- NTD in AI: Recurrent Neural Network (RNN)
- NTD in AI: ROC Curve
- NTD in AI: Semi-Supervised Learning
- NTD in AI: Sequence Labeling
- NTD in AI: Siamese Neural Network
- NTD in AI: SMOTE - Synthetic Minority Oversampling Technique
- NTD in AI: Softmax
- NTD in AI: Softplus
- NTD in AI: Stepwise Selection
- NTD in AI: Stride
- NTD in AI: Subset Selection
- NTD in AI: Supervised Learning
- NTD in AI: t-SNE
- NTD in AI: Target Vectors
- NTD in AI: Training Instance
- NTD in AI: Training Set
- NTD in AI: Triplet Loss Function
- NTD in AI: UMAP - Uniform Manifold Approximation and Projection
- NTD in AI: Unary Classification
- NTD in AI: Validation Set
- NTD in AI: Vanishing Gradient
- NTD in AI: Variational Autoencoder
- NTD in AI: Volume (Convolution)
- NTD in AI: Voting
- NTD in AI: WaveNet
- NTD in AI: Weak Learners
- NTD in AI: Word Embeddings
- NTD in AI: word2vec