(Warning: this is a post with a lot of statistical jargon. The too long; didn’t read version: I’ve settled on logistic regression as the approach I’m going to use to build a model. The next step is to collect data to build the model with, but this will take some time.)
Here’s a little update on my progress in developing better statistical methods for MMA…
I have skimmed through the statistical learning course I mentioned earlier. I have watched every video lecture and read some of the book. Some of the course material (like linear regression) was simply review for me. But most of it has been useful. I’ve been introduced to methods I had heard of, such as random forests, but never really understood.
Having gone through the course, it’s clear to me that logistic regression is the approach I should use for making MMA predictions. Supervised statistical learning comes in two broad types: learning for a numerical output and learning for a categorical output. For MMA, we’re looking at a categorical output… win, loss, or draw.
This is also known as a classification problem. There are a number of methods that can produce more accurate classifiers than logistic regression… but many of those methods don’t directly yield a probability estimate. Logistic regression both builds a model and produces a probability estimate… so instead of the model simply saying “Frankie Edgar win, Cub Swanson loss,” the model will give a percentage chance of each fighter winning. It might say “Frankie Edgar 65%, Cub Swanson 35%” or something like that.
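To make the probability idea concrete, here’s a rough Python sketch of how a logistic model turns a comparison between two fighters into a percentage. Everything in it is made up for illustration: the function name win_probability, the single feature (a difference in significant strikes landed per minute), and the coefficient of 0.6 are assumptions, not values from a fitted model. It also treats the fight as a two-way outcome and ignores draws.

```python
import math


def win_probability(features, coefficients, intercept=0.0):
    """Map a weighted sum of features to a probability between 0 and 1.

    This is the logistic (sigmoid) function applied to a linear score.
    The coefficients would normally be estimated from historical fights;
    here they are typed in by hand purely for illustration.
    """
    z = intercept + sum(c * x for c, x in zip(coefficients, features))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical input: fighter A lands 1.0 more significant strikes per
# minute than fighter B, and the (made-up) coefficient on that gap is 0.6.
p_a = win_probability(features=[1.0], coefficients=[0.6])
p_b = 1.0 - p_a  # two-way outcome only; draws are ignored in this sketch

print(f"Fighter A {p_a:.0%}, Fighter B {p_b:.0%}")
```

With no difference between the fighters (a feature value of 0) and no intercept, the function returns exactly 50%, which is the behavior you’d want from a model that has no reason to prefer either side.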
The next step is to collect the data I need to build the model. Unfortunately, this is going to be a long process. Here’s why:
For each fight, I want to input each fighter’s prior statistics, including things like significant strikes landed per minute. However, different fighters have different amounts of data. For a fighter like Edgar, who has a lot of data to work with, the resulting statistic of 3.52 significant strikes landed per minute should be pretty reliable. But what about a fighter like Doo Ho Choi, whose 10 significant strikes landed in 18 seconds results in a rate of 33.33 significant strikes per minute?
Obviously Choi’s “true” rate of strikes landed isn’t nearly that high. The solution is to regress each fighter’s data toward the mean, with more regression for fighters who have less data, but how much regression is needed? That’s what I’m trying to figure out now, for each of the statistics I’m going to use to build this model.
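One common way to do this kind of regression is to pad each fighter’s record with some amount of average-fighter time before computing the rate. Here’s a minimal sketch of that idea, not necessarily the method I’ll end up using. The names raw_rate and regressed_rate, the overall average of 3.0 strikes per minute, and the 15 minutes of padding are all placeholder assumptions; the right amount of padding is exactly what I still have to work out from the data.

```python
def raw_rate(landed, seconds):
    """Strikes landed per minute, with no adjustment."""
    return landed / (seconds / 60.0)


def regressed_rate(landed, seconds, mean_rate, prior_minutes):
    """Rate per minute after padding with `prior_minutes` of average output.

    mean_rate:      assumed average rate across all fighters (placeholder)
    prior_minutes:  assumed amount of padding; larger means more regression
    """
    minutes = seconds / 60.0
    padded_landed = landed + mean_rate * prior_minutes
    padded_minutes = minutes + prior_minutes
    return padded_landed / padded_minutes


# Choi's numbers from the post: 10 significant strikes in 18 seconds.
choi_raw = raw_rate(10, 18)  # about 33.33 per minute

# Placeholder values, NOT estimates: average of 3.0 per minute, 15 minutes of padding.
choi_regressed = regressed_rate(10, 18, mean_rate=3.0, prior_minutes=15.0)  # about 3.59

print(round(choi_raw, 2), round(choi_regressed, 2))
```

The nice property of this approach is that the padding matters less and less as real fight time piles up, so a fighter with hours of recorded data ends up very close to his raw rate while a fighter with 18 seconds ends up close to the average.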
That means going through every fighter’s Fight Metric page, recording each fighter’s data, and dividing the data in a number of ways. It would be nice if Fight Metric had a true stats page, with a list of fighters and their resulting statistics… but it doesn’t. This is going to take a while. I’ve already spent many hours recording, and so far I’ve gotten through Evan Dunham, working alphabetically… so I have a very long way to go.
I’m still a long way off from having an end product, but I wanted to provide an update on my progress, since those of you reading this blog have been so supportive. I’ll be sure to write another update soon… and who knows, I might even give some very quick predictions for the upcoming UFC 181. Stay tuned for that.