This content comes primarily from the notes of Mark Herbster of University College London, with contributions from Massi Pontil and John Shawe-Taylor.
Introduction
The Wisdom of the Crowds, or majority rule, and related ideas tend to come up pretty often. Democracy is based (partly) on the majority of people being able to make the correct decision, you might often make decisions in a group of friends based on what most people want, and it is logical to take popular opinion into account when reasoning on issues where you have imperfect information. On the other hand, of course, there is the Argumentum ad Populum fallacy, which points out that a popular belief isn't necessarily true.
This idea also appears in applied Machine Learning: ensemble methods such as Random Forests, Gradient Boosted Models (especially XGBoost) and stacking of Neural Networks have resulted in overall more powerful models. This is especially notable in Kaggle competitions, where it is almost always an ensemble model (a combination of models) that achieves the best score.
These different methods of 'ensembling' or 'consulting the crowd' are all roughly based on having several different models/predictors that make prediction or classification errors in slightly different ways. When there is a new data point for which to predict a value, the 'noise' or error in the individual predictions is smoothed out by there being a number of them; when the task is classification, the (sometimes weighted) majority gets the prediction correct.
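As a rough illustration of the smoothing effect (my own sketch, not from the original notes, with arbitrary numbers and noise level): average many noisy predictors of the same quantity and the error of the average shrinks as more predictors are combined.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0

# Each 'predictor' returns the true value plus its own independent noise.
for n_predictors in [1, 5, 25, 125]:
    predictions = true_value + rng.normal(scale=2.0, size=(1000, n_predictors))
    ensemble = predictions.mean(axis=1)  # average the crowd
    rmse = np.sqrt(np.mean((ensemble - true_value) ** 2))
    print(f"{n_predictors:4d} predictors -> RMSE of the average: {rmse:.3f}")
```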
A bound
A significant part of classical Supervised Learning is proving bounds on algorithms. We will now do a bit of maths to show that under certain conditions, we can use an ensemble to drive model error to zero. Thereafter we will discuss why this bound is unlikely to be achieved in practice, but how we may strive for it nonetheless.
This idea comes from 1785, when the Marquis de Condorcet noted that for votes with reasonably aware voters, the probability of a correct outcome increases with the number of voters.
Say we have 2T + 1 (this non-obvious-looking number will become clearer in a bit) individuals who have to predict an outcome of some binary classification problem. This could be an election, referendum or something as simple as the result of a sports match.
We will call each individual $v_i$, such that our list of individuals is $v_1, v_2, \dots, v_{2T+1}$. Each of these individuals will predict either 1 or 0, each representing one of the options. The probability of each individual being correct is $0.5 + \gamma$ with $\gamma > 0$ (we are okay though with $\gamma$ being very small, i.e. each person just has to be slightly better than random at coming up with the correct answer).
The key assumption, however, is that each person comes up with the answer independently from one another.
What we are interested in is the overall prediction that comes from our group. This could be a weighted average, but we will make our overall prediction based on whether the sum of the individual predictions is greater than $T$ (i.e. simple majority rule). As a formula this is:

$$\hat{y} = \mathbb{I}\left[\sum_{i=1}^{2T+1} \hat{y}_i > T\right]$$

where $\hat{y}_i \in \{0, 1\}$ is individual $i$'s prediction and $\mathbb{I}[\cdot]$ is the indicator function.
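As a quick sketch (my own illustration, not part of the original notes), the majority-vote rule above is just:

```python
import numpy as np

def majority_vote(predictions):
    """Return 1 if more than half of the 0/1 predictions are 1, else 0.

    `predictions` is assumed to have odd length 2T + 1, so ties cannot occur.
    """
    predictions = np.asarray(predictions)
    T = (len(predictions) - 1) // 2
    return int(predictions.sum() > T)

print(majority_vote([1, 0, 1, 1, 0]))  # 3 of 5 vote 1 -> prints 1
```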
We want to know the probability that $\hat{y}$ will be wrong. That is to say, we want to figure out the probability that our crowd is not so wise after all.
We will invoke 1) that a sum of independent Bernoulli random variables (with the same success probability) follows a Binomial distribution and 2) the Chernoff bound.
Consequence of 1: We have that the probability of being wrong is just the probability of $T$ or fewer individuals choosing the right answer. Letting $X_i \in \{0, 1\}$ indicate whether individual $i$ is correct, so that $X = \sum_{i=1}^{2T+1} X_i \sim \mathrm{Binomial}(2T+1,\ 0.5+\gamma)$, this is therefore:

$$P(\text{wrong}) = P(X \le T) = \sum_{k=0}^{T} \binom{2T+1}{k}(0.5+\gamma)^{k}(0.5-\gamma)^{2T+1-k}$$
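For a concrete (purely illustrative) example, this tail probability can be evaluated directly; with, say, $T = 10$ (21 voters) and $\gamma = 0.1$, the crowd is wrong noticeably less often than any single voter:

```python
from scipy.stats import binom

T, gamma = 10, 0.1             # illustrative values, not from the notes
n, p = 2 * T + 1, 0.5 + gamma  # 21 voters, each correct with probability 0.6

p_single_wrong = 1 - p
p_crowd_wrong = binom.cdf(T, n, p)  # P(X <= T): at most half the crowd is correct

print(f"P(single voter wrong) = {p_single_wrong:.3f}")
print(f"P(majority wrong)     = {p_crowd_wrong:.3f}")
```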
Statement of 2: Let $X_1, \dots, X_n$ be independent random variables and assume that $0 \le X_i \le 1$ for each $i$. Let $X = \sum_{i=1}^{n} X_i$ and $\mu = \mathbb{E}[X]$; then for $0 \le k \le \mu$:

$$P(X \le k) \le \exp\left(-\frac{(\mu - k)^2}{2\mu}\right)$$
So now we have a probability and a bound that we can use together to make some sense of this situation. As you might expect, the next step is to use the Chernoff bound to find an upper limit on our probability.
To relate the two, we note that each $X_i$ (whether individual $i$'s prediction is correct) is a random variable in the required interval $[0, 1]$ (in fact it can only take on the two boundary values of the interval, which is acceptable). Then we note that we have the mean, $\mu$, of the sum of these for free from the Binomial distribution (the mean of a binomial random variable with $n$ observations and probability of success $p$ is simply $np$). So in terms of the bound we have discussed, we already have $X$ and $\mu$.
Now to get $k$ we just note that we are interested in the total number of votes for the correct option being at most $T$ (i.e. fewer than half of the people being correct), which corresponds to the probability from the Binomial distribution that we have set up above, so we can set $k = T$. To make this explicit:

$$P(\text{wrong}) = P(X \le T) \le \exp\left(-\frac{(\mu - T)^2}{2\mu}\right)$$
Substituting our values for the binomial distribution into the formula for its mean gives $\mu = (2T+1)\left(\tfrac{1}{2} + \gamma\right)$.
Now we have all the ingredients we need to find a non-trivial bound for our probability. We will do some algebra and show the required result.
Substituting our probability into the bound gives:

$$P(\text{wrong}) \le \exp\left(-\frac{\left((2T+1)\left(\tfrac{1}{2}+\gamma\right) - T\right)^2}{2(2T+1)\left(\tfrac{1}{2}+\gamma\right)}\right) = \exp\left(-\frac{\left(2T\gamma + \gamma + \tfrac{1}{2}\right)^2}{2T + 4T\gamma + 2\gamma + 1}\right)$$
At this point we will do some bounding. Looking at the exponent, we note the minus sign: this tells us that if we decrease the numerator of the exponent we will get a larger overall number, so we still have a valid upper bound. Likewise for increasing the denominator.
We note that the numerator is greater than $4T^2\gamma^2$, as when the squared bracket in the numerator is expanded all of the terms are positive and $4T^2\gamma^2$ is one of these terms.
We note that the denominator is at most $6T$, as $4T\gamma \le 2T$ (since $\gamma \le \tfrac{1}{2}$) and $2\gamma + 1 \le 2T$ (for $T \ge 1$).
So then we have:

$$P(\text{wrong}) \le \exp\left(-\frac{4T^2\gamma^2}{6T}\right) = \exp\left(-\frac{2T\gamma^2}{3}\right)$$
The bound is admittedly crude, but it is non-trivial and it shows that as T grows larger, the probability of our crowd being wrong decays exponentially to 0. This is good! It motivates utilising a crowd (under certain conditions) to make the probability of error as small as we like.
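To see both the exponential decay and the crudeness of the bound, here is a small numerical check (my own sketch, with an arbitrary choice of $\gamma = 0.1$) comparing the exact Binomial tail probability to $\exp(-2T\gamma^2/3)$:

```python
import numpy as np
from scipy.stats import binom

gamma = 0.1  # each voter correct with probability 0.6 (illustrative value)

for T in [5, 25, 125, 625]:
    exact = binom.cdf(T, 2 * T + 1, 0.5 + gamma)  # P(majority wrong)
    bound = np.exp(-2 * T * gamma**2 / 3)         # our crude Chernoff bound
    print(f"T = {T:4d}: exact = {exact:.2e}, bound = {bound:.2e}")
```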
Conclusion
Now we have our bound, but it only works under the 2 conditions that we stated above: 1) that each individual is better than chance and 2) that each individual predicts independently.
Taking these into account when considering how 'majority rule' fails despite our bound makes things clear: often individuals are no better than chance when it comes to prediction, and these individuals will likely rely on the same information, e.g. election campaigns, news articles, common misconceptions.
So how do we get around these problems in Machine Learning?
The first is addressed by using an algorithm that is a 'weak learner': one that can consistently find rules that are slightly better than guessing. An example is a 'decision stump', a one-level decision tree that splits on a single feature; decision trees more generally are the base learners used to create Random Forests.
The second is addressed by giving the 'weak learner' slightly different information from which to generate its rules. This is achieved by learning each individual rule from subsets of the data and of the features in the data.
In human terms, we might only allow each person to use information from certain sources and on certain issues to make decisions, for example, if we were to predict the winner of a rugby match, Person 1 might only be allowed to use what SuperSport, News24 and Sky News said about the forwards in each team, whilst Person 2 might only be allowed to use what News24, BBC News and The Sunday Times said about the coaches’ ability and the defensive organisation of each team.*
Each person has some information that should make them better than chance at predicting, but this information is (roughly) independent (it isn't really independent, but this is where we try to make it 'more independent' to aim for our bound), so when the predictions are wrong they will not be wrong in the same way. The idea is then to use a lot of these sorts of predictions, whose different ways of being wrong will cancel each other out.
*This method of encouraging independence is specific to Bagging (e.g. Random Forests); Boosting algorithms use a different method of giving each individual rule/model different information.
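As a minimal sketch of the bagging version of this idea (my own example using scikit-learn, not code from the notes): each decision stump sees only a random subset of the rows and of the features, and the ensemble takes a majority vote. The dataset is synthetic and the parameter values are arbitrary; note that older scikit-learn versions call the `estimator` argument `base_estimator`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A single decision stump: a 'weak learner' only slightly better than guessing.
stump = DecisionTreeClassifier(max_depth=1)

# Many stumps, each trained on a random subset of rows and features,
# combined by majority vote.
crowd = BaggingClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=201,   # an odd 'crowd size', in the spirit of 2T + 1
    max_samples=0.5,    # each stump sees half the training rows
    max_features=0.5,   # ... and half the features
    random_state=0,
)

print("single stump:   ", stump.fit(X_train, y_train).score(X_test, y_test))
print("crowd of stumps:", crowd.fit(X_train, y_train).score(X_test, y_test))
```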