(Warning: this is a post with a lot of statistical jargon. The too long; didn’t read version: I’ve settled on logistic regression as the approach I’m going to use to build a model. The next step is to collect data to build the model with, but this will take some time.)

Here’s a little update on my progress in developing better statistical methods for MMA…

I have skimmed through the statistical learning course I mentioned earlier: I watched every video lecture and read some of the book. Some of the course material (like linear regression) was simply review for me, but most of it has been useful. I’ve been introduced to methods I had heard of, such as random forests, but never really understood.

Having skimmed through the course, it’s clear that logistic regression is the approach I should use for making MMA predictions. Broadly, there are two types of prediction problems in statistical learning: learning for a numerical output and learning for a categorical output. For MMA, we’re looking at a categorical output… win, loss, or draw.

This is also known as a classification problem. There are a number of methods that can produce better models for classification than logistic regression… but many of those methods don’t naturally yield a reliable probability estimate. Logistic regression both builds a model and produces a probability estimate… so instead of the model simply saying “Frankie Edgar win, Cub Swanson loss,” the model will give a percentage chance of each fighter winning. It might say “Frankie Edgar 65%, Cub Swanson 35%” or something like that.
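To make the "probability instead of a pick" idea concrete, here is a minimal sketch of how a fitted logistic regression turns fighter statistics into a win probability. Everything specific here is an assumption for illustration: the two input variables, the weights, and the function names are hypothetical, not the model I will end up with. It also treats the outcome as win or loss only and ignores draws.

```python
import math


def logistic(z: float) -> float:
    """Map any real-valued score to a probability between 0 and 1."""
    # Two algebraically equal forms, chosen to avoid overflow in math.exp.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


def win_probability(stat_diffs, weights, intercept=0.0):
    """Probability that fighter A beats fighter B.

    stat_diffs: fighter A's statistic minus fighter B's, one per variable.
    weights:    one coefficient per variable (hypothetical values here).
    """
    score = intercept + sum(w * x for w, x in zip(weights, stat_diffs))
    return logistic(score)


# Hypothetical weights and stat differences, for illustration only.
weights = [0.4, 0.1]
diffs = [1.2, -0.5]

p_a = win_probability(diffs, weights)
p_b = 1.0 - p_a
print(f"Fighter A {p_a:.0%}, Fighter B {p_b:.0%}")
```

In a real model the weights would be estimated from past fights rather than typed in by hand; the point is only that the output is a percentage for each fighter, not just a pick.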

The next step is to collect the data I need to use to build the model. Unfortunately this is going to be a long process. Here’s why:

For each fight I want to input each fighter’s prior statistics, including things like significant strikes landed per minute. However, different fighters have different amounts of data. For a fighter like Edgar, who has a lot of data to work with, the resulting statistic of 3.52 significant strikes per minute landed should be pretty reliable. But what about a fighter like Doo Ho Choi, whose 10 significant strikes landed in 18 seconds results in a rate of 33.33 significant strikes per minute?

Obviously Choi’s “true” rate of strikes landed isn’t nearly that high. The solution is to regress each fighter’s data to the mean, but how much regression does that require? That’s what I’m trying to figure out now – for each of the statistics I’m going to use to build this model.
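Here is a rough sketch of what regressing a rate to the mean can look like, using Choi's 10 significant strikes in 18 seconds from above. One common way to do it is to blend the fighter's own data with some amount of "average fighter" data. The average rate (`LEAGUE_RATE`) and the amount of average data to add (`PRIOR_MINUTES`) are placeholder values I made up for this example; figuring out the right amount for each statistic is exactly the open question.

```python
def rate_per_minute(strikes: int, seconds: float) -> float:
    """Raw rate: strikes landed divided by minutes of fight time."""
    return strikes / (seconds / 60.0)


def regressed_rate(strikes: int, seconds: float,
                   league_rate: float, prior_minutes: float) -> float:
    """Blend a fighter's data with `prior_minutes` of average-fighter data.

    With little fight time the result stays close to league_rate;
    with lots of fight time it approaches the fighter's raw rate.
    """
    minutes = seconds / 60.0
    return (strikes + league_rate * prior_minutes) / (minutes + prior_minutes)


# Placeholder assumptions, NOT measured values.
LEAGUE_RATE = 3.0     # hypothetical average significant strikes per minute
PRIOR_MINUTES = 10.0  # hypothetical amount of regression

raw = rate_per_minute(10, 18)  # Choi: 10 strikes in 18 seconds
shrunk = regressed_rate(10, 18, LEAGUE_RATE, PRIOR_MINUTES)
print(f"raw: {raw:.2f} per minute, regressed: {shrunk:.2f} per minute")
```

A larger `PRIOR_MINUTES` pulls small samples harder toward the average, while a fighter with hours of recorded fight time barely moves.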

That means going through every fighter’s Fight Metric page, recording each fighter’s data, and dividing the data in a number of ways. It would be nice if Fight Metric had a true stats page, with a list of fighters and their resulting statistics… but they don’t. This is going to take a while, and I’ve already spent a large number of hours recording… so far, I’ve gotten through Evan Dunham, working alphabetically… so I have a very long way to go.

I’m still a long way off from having an end product, but I wanted to provide an update on my progress, since those of you reading this blog have been so supportive. I’ll be sure to write another update soon… and who knows, I might even give some very quick predictions for the upcoming UFC 181. Stay tuned for that.

It’s great to hear from you again. Collecting stats from all fighters is a lot of work, maybe you should first do the popular ones, main card materials.

Thanks Mirko. The data collection process has been slower than I anticipated, so I’ve decided to collect only enough data to reach statistical significance. I’ve now determined when a number of statistics become reliable: for example, significant strikes landed per minute becomes reliable with 40 minutes of data.

Appreciate the work you do, but isn’t going strictly with numbers-based criteria a bit limiting? Are you trying to incorporate things like “strength of opposition” into your analysis? Numbers are great, but there is so much more involved (momentum, home court, height advantage, lefty vs. righty, etc.). What I’m saying is that it might be a good idea to run the numbers and then add some objective situational considerations as an aside…

just thinking out loud….

g.l. with it… I know it’s not easy…

I’m not going to use only Fight Metric data. I plan on having variables for a lot of different things. Strength of opposition is one of those things, and VERY important to include. Height and reach will also be included. Looking at lefty/righty matchups is a good idea too – you may have just given me an idea for another variable to include in the model!

What do you think about weigh-in criteria, like how a fighter looks and acts? Will you include that as well?

I think the only thing I could do there is include a variable for whether or not the fighter misses weight. I don’t think there have been very many instances of that happening over the years, so I’m going to leave that out of the model for now.

UFC 181 quick picks? awesome.