<h1 class="entry-title">Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process</h1>
<p class="meta">Mar 20th, 2012 </p>
<div class="entry-content">
<p>Imagine you’re a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and asking 10 friends how much of each they ate in the past day.</p>
<p>Your goal: to find natural <em>groups</em> of foodies, so that you can better cater to each cluster’s tastes. For example, your fratboy friends might love <a href="https://twitter.com/#%21/edchedch/status/166343879547822080" rel="noopener noreferrer" target="_blank"> wings and beer</a>, your anime friends might love soba and sushi, your hipster friends probably dig tofu, and so on.</p>
<p>So how can you use the data you’ve gathered to discover different kinds of groups?</p>
<p><a href="http://dl.dropbox.com/u/10506/blog/dirichlet-process/clustering-example.png" rel="noopener noreferrer" target="_blank"><img alt="Clustering Example" src="https://beijingoptbbs.oss-cn-beijing.aliyuncs.com/cs/5606289-7a531478d6b790555cac5d43745c7147.png"></a></p>
<p>One way is to use a standard clustering algorithm like <strong>k-means</strong> or <strong>Gaussian mixture modeling</strong> (see <a href="http://blog.echen.me/2011/03/19/counting-clusters/" rel="noopener noreferrer" target="_blank"> this previous post</a> for a brief introduction). The problem is that these both assume a <em>fixed</em> number of clusters, which they need to be told to find. There are a couple methods for selecting the number of clusters to learn (e.g., the <a href="http://blog.echen.me/2011/03/19/counting-clusters/" rel="noopener noreferrer" target="_blank">gap and prediction strength statistics</a>), but the problem is a more fundamental one: most real-world data simply doesn’t have a fixed number of clusters.</p>
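<p>To make the fixed-cluster limitation concrete, here’s a minimal sketch (my illustration, not code from this post), assuming a hypothetical matrix of food counts: with k-means, we have to commit to a number of clusters before seeing what the data actually supports.</p>
<pre><code>import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: rows are 10 friends, columns are servings of six foods
# (pizza, salad, spaghetti, wings, sushi, tofu) eaten in the past day.
rng = np.random.default_rng(0)
food_counts = rng.poisson(lam=1.0, size=(10, 6))

# k-means makes us fix the number of clusters up front...
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(food_counts)
print(kmeans.labels_)  # ...even though 3 is just a guess, data or no data.
</code></pre>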
<p>That is, suppose we’ve asked 10 of our friends what they ate in the past day, and we want to find groups of eating preferences. There’s really an infinite number of foodie types (carnivore, vegan, snacker, Italian, healthy, fast food, heavy eaters, light eaters, and so on), but with only 10 friends, we simply don’t have enough data to detect them all. (Indeed, we’re limited to 10 clusters!) So whereas k-means starts with the incorrect assumption that there’s a fixed, finite number of clusters that our points come from, <em>no matter how much data we feed it</em>, what we’d really like is a method positing an infinite number of hidden clusters that naturally arise as we ask more friends about their food habits. (For example, with only 2 data points, we might not be able to tell the difference between vegans and vegetarians, but with 200 data points, we probably could.)</p>
<p>Luckily for us, this is precisely the purview of <strong>nonparametric Bayes</strong>.*</p>
<p>*Nonparametric Bayes refers to a class of techniques that allow some parameters to change with the data. In our case, for example, instead of fixing the number of clusters to be discovered, we allow it to grow as more data comes in.</p>
<h1>A Generative Story</h1>
<p>Let’s describe a generative model for finding clusters in any set of data. We assume an infinite set of latent groups, where each group is described by some set of parameters. For example, each group could be a Gaussian with a specified mean \(\mu_i\) and standard deviation \(\sigma_i\).</p>
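<p>Here’s a minimal sketch of one way to simulate such a generative story (my illustration, not this post’s code): a Chinese restaurant process, a standard construction of the Dirichlet process, assigns each new point to a cluster, and each cluster draws its Gaussian parameters \(\mu_i\) and \(\sigma_i\) the first time it appears. The concentration parameter <code>alpha</code> and the priors on the cluster parameters are assumptions chosen purely for illustration.</p>
<pre><code>import numpy as np

rng = np.random.default_rng(1)
alpha = 1.0         # assumed concentration parameter: larger => more clusters
n_points = 200

cluster_sizes = []  # number of points assigned to each cluster so far
params = []         # (mu_i, sigma_i) per cluster, drawn from assumed priors
assignments = []
data = []

for _ in range(n_points):
    # Chinese restaurant process: join an existing cluster with probability
    # proportional to its size, or open a new one with probability ∝ alpha.
    weights = np.array(cluster_sizes + [alpha], dtype=float)
    k = rng.choice(len(weights), p=weights / weights.sum())

    if k == len(cluster_sizes):  # a brand-new cluster: draw its parameters
        cluster_sizes.append(0)
        params.append((rng.normal(0.0, 10.0), rng.gamma(2.0, 1.0)))  # assumed priors

    cluster_sizes[k] += 1
    assignments.append(k)
    mu, sigma = params[k]
    data.append(rng.normal(mu, sigma))  # emit a point from cluster k's Gaussian

print(len(cluster_sizes), "clusters arose from", n_points, "points")
</code></pre>
<p>Note that the number of clusters isn’t fixed anywhere: new clusters keep arising (roughly logarithmically) as more points come in, which is exactly the behavior we asked for above.</p>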