Tuesday, December 18, 2012

Bagging

http://blog.bigml.com/2012/12/14/the-rewards-of-ignoring-data/

Fortunately, there is a way out of this hot mess, and it comes in the form of bootstrap aggregation, or bagging as it is known in the biz.  The basic idea is this:  We’re going to create not just one, but many models on this dataset, each one trained on a random sample of the data rather than the whole thing.  Then, when we want to make a prediction, the models will all vote on the correct outcome.
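To make that concrete, here is a minimal sketch of the idea in Python. It assumes scikit-learn and NumPy are available; the fruit diameters, labels, and parameter values are invented for illustration and are not from the BigML post.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Hypothetical training data: fruit diameter in cm -> label.
# Mostly small plums and large apples, plus one huge 6 cm plum.
X = np.array([[2.5], [3.0], [3.5], [4.0], [4.5],
              [6.0],                      # the anomalous huge plum
              [7.0], [7.5], [8.0]])
y = np.array(["plum"] * 6 + ["apple"] * 3)

def bagged_models(X, y, n_models=25, sample_size=7):
    """Fit n_models trees, each on a random sample drawn with replacement."""
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(X), size=sample_size, replace=True)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def vote(models, diameter):
    """Majority vote of the ensemble for one fruit, plus the raw counts."""
    preds = [m.predict([[diameter]])[0] for m in models]
    labels, counts = np.unique(preds, return_counts=True)
    return labels[np.argmax(counts)], dict(zip(labels, counts))
```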

How will this affect the problem above?  Depending on the size of the subset, that huge plum may not appear in the data at all.  Thus, a majority of the models may not have the splits associated with the plums, and six-centimeter fruits will be classified as apples.  In practice, we typically use the majority of the data in each subset, so the majority prediction for a four-centimeter fruit will still be plum on average, but if you look at the votes of the individual models, the majority will be quite weak.
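Continuing the sketch above, we can inspect the raw votes to see how confident the ensemble really is; the exact counts will vary from run to run with the random samples.

```python
models = bagged_models(X, y)

for diameter in [4.0, 6.0]:
    label, counts = vote(models, diameter)
    print(f"{diameter} cm -> {label}, votes: {counts}")

# Trees whose sample happened to miss the huge plum put the class
# boundary between the small plums and the apples, and call a 6 cm
# fruit an apple; trees that kept it still say plum.  The winning
# label matters less than the margin, which tells you how fragile
# the prediction is.
```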

What we’ve done, in theory, is reduce the variance of the classifier: any single model overfits the quirks of its particular sample, but those quirks tend to cancel out when the models vote.
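One rough way to see that, again assuming scikit-learn: repeatedly draw fresh training sets from the same made-up fruit distribution, fit a single tree and a bagged ensemble on each, and compare how much their scores for one borderline fruit jump around. (The distribution parameters and the query point here are invented for illustration.)

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

rng = np.random.default_rng(1)

def sample_fruit(n=40):
    """Hypothetical data source: small plums (class 0), large apples (class 1)."""
    plums = rng.normal(3.0, 1.0, size=n // 2)
    apples = rng.normal(7.0, 1.0, size=n // 2)
    X = np.concatenate([plums, apples]).reshape(-1, 1)
    y = np.array([0] * (n // 2) + [1] * (n // 2))
    return X, y

query = np.array([[5.0]])  # a diameter near the class boundary
single, bagged = [], []
for _ in range(200):
    X, y = sample_fruit()
    # A lone tree commits to a hard 0-or-1 score at the query point...
    single.append(DecisionTreeClassifier().fit(X, y).predict_proba(query)[0, 1])
    # ...while the bagged score is an average over 25 bootstrap trees.
    ens = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25).fit(X, y)
    bagged.append(ens.predict_proba(query)[0, 1])

print("variance of single-tree scores:", np.var(single))
print("variance of bagged scores:     ", np.var(bagged))
```

The single tree's score flips between 0 and 1 as the boundary wobbles from one training set to the next; the bagged score, being an average, wobbles far less.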
