A Monte Carlo Study on Methods for Handling Class Imbalance in Machine Learning

I recently ran a simulation study comparing methods for handling class imbalance (in this case, when the class of interest is less than about 3% of the data) for a statistical computing course. I simulated 500 data sets, varying some characteristics like sample size and minority class size, and tested a number of preprocessing techniques (e.g., SMOTE) and algorithms (e.g., XGBoost). You can view the working paper by clicking here.
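To make the setup concrete, here is a minimal Python sketch of one way to simulate an imbalanced binary outcome and then randomly oversample the minority class. It is not the code from the working paper: the function names, the 3% minority rate, and the single Gaussian feature are assumptions chosen for illustration, and plain random oversampling stands in for more elaborate techniques like SMOTE.

```python
import random


def simulate_imbalanced(n, minority_rate=0.03, seed=0):
    """Return a list of (x, y) pairs where roughly `minority_rate` of y are 1.

    Hypothetical data-generating process: one Gaussian feature whose mean
    is shifted up by 1 for the minority class.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = 1 if rng.random() < minority_rate else 0
        x = rng.gauss(1.0 if y == 1 else 0.0, 1.0)
        data.append((x, y))
    return data


def random_oversample(data, seed=0):
    """Duplicate minority rows (with replacement) until the classes are balanced."""
    rng = random.Random(seed)
    minority = [row for row in data if row[1] == 1]
    majority = [row for row in data if row[1] == 0]
    if not minority or len(minority) >= len(majority):
        return list(data)  # nothing to draw from, or already balanced
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return majority + minority + extra


sample = simulate_imbalanced(5000)
balanced = random_oversample(sample)
```

In a real study the resampling would be applied to the training split only, so that the test data keep the original class balance.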

If you don't want to slog through the whole paper, the plot below shows densities of how each model (a combination of sampling technique and algorithm) performed. I left off models that used no preprocessing and models that used oversampling, since they made so few positive predictions that metrics like F1 scores couldn't even be calculated most of the time!
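The reason F1 can fail to exist is that precision divides by the number of positive predictions. The small sketch below, with a hypothetical helper name, computes F1 from confusion-matrix counts and returns None when it is undefined; it illustrates the arithmetic and is not taken from the paper's code.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 from true positives, false positives, and false negatives.

    Returns None when precision or recall is undefined (a 0/0 division).
    """
    if tp + fp == 0:
        return None  # no positive predictions: precision is 0/0
    if tp + fn == 0:
        return None  # no actual positives: recall is 0/0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    if precision + recall == 0:
        return 0.0  # both are zero; F1 is conventionally reported as 0
    return 2 * precision * recall / (precision + recall)


# A model that never predicts the minority class has no defined F1.
never_positive = f1_from_counts(tp=0, fp=0, fn=30)
```

Some libraries report 0 with a warning in the undefined case rather than a missing value, so how this is handled is a choice the analyst has to make.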

Feel free to check out the GitHub repository, as well.

ridgeplot.png

Using Beta Regression to Better Model Norms in Political Psychology

Update 2018-08-23: The link below is outdated. A full, more detailed paper can be found at my GitHub.

I recently wrote a short working paper on how to use beta regression and how it helps take into account norms in correlational studies of ideology, politics, and prejudice. It is a little long for a blog post (and this platform does not support LaTeX), so I uploaded it as a working paper. Click here to download the paper.
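For readers who want a feel for the model before opening the paper, here is a rough Python sketch of the likelihood behind beta regression under the common mean-precision parameterization, where the outcome y lies strictly between 0 and 1 and follows Beta(mu * phi, (1 - mu) * phi) with a logit link on mu. The function names and the single-predictor setup are assumptions for illustration; the paper's own specification may differ.

```python
import math


def beta_logpdf_mean_precision(y, mu, phi):
    """Log-density of Beta(mu * phi, (1 - mu) * phi) at y, for 0 < y < 1."""
    a = mu * phi
    b = (1.0 - mu) * phi
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(y) + (b - 1.0) * math.log(1.0 - y)


def inv_logit(eta):
    """Map a linear predictor onto the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-eta))


def neg_log_likelihood(beta0, beta1, phi, xs, ys):
    """Negative log-likelihood for one predictor with a logit link on the mean."""
    total = 0.0
    for x, y in zip(xs, ys):
        mu = inv_logit(beta0 + beta1 * x)
        total -= beta_logpdf_mean_precision(y, mu, phi)
    return total
```

Minimizing `neg_log_likelihood` over `beta0`, `beta1`, and `phi` with any numerical optimizer would give the maximum likelihood estimates; in practice a dedicated package handles this.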

I hope it is instructive and informative, and that it fills in a few gaps from previous papers. Please email me if you have any questions about it. As always, the code can be found over at my GitHub.

Updated 2017-12-11


In Support of Open Seeding in the NBA

This post argues that the NBA should adopt open seeding in the playoffs: instead of taking the top 8 teams in each conference, the top 16 teams in the NBA should make the playoffs.

The first thing I wanted to do was diagnose the problem. I looked at every year from 1984 (when the NBA adopted the 16-team playoff structure) through 2017. For each year, I tallied the number of teams making the playoffs that had a lower win percentage than a team that missed the playoffs in the other conference. These data come from Basketball-Reference.com, and the code for scraping, cleaning, and visualizing these data can be found over at GitHub.
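The tally for a single season can be sketched in a few lines of Python. This is not the scraping and cleaning code from the repository: the function names are hypothetical, the standings are made-up toy numbers, and the sketch assumes the playoff teams are simply the top teams by win percentage in each conference, ignoring tiebreakers and division-winner rules.

```python
def count_displaced(playoff_pcts, other_conf_missed_pcts):
    """Count playoff teams with a lower win percentage than the best
    team that missed the playoffs in the other conference."""
    if not other_conf_missed_pcts:
        return 0
    best_missed = max(other_conf_missed_pcts)
    return sum(1 for pct in playoff_pcts if pct < best_missed)


def season_tally(east, west, spots=8):
    """Take every team's win percentage by conference and return the tally
    of 'undeserving' playoff teams for each conference."""
    east_sorted = sorted(east, reverse=True)
    west_sorted = sorted(west, reverse=True)
    east_in, east_out = east_sorted[:spots], east_sorted[spots:]
    west_in, west_out = west_sorted[:spots], west_sorted[spots:]
    return {
        "East": count_displaced(east_in, west_out),
        "West": count_displaced(west_in, east_out),
    }


# Toy example with 3 teams per conference and 2 playoff spots each.
toy = season_tally(east=[0.70, 0.45, 0.40], west=[0.65, 0.60, 0.55], spots=2)
```

Here `toy` is `{"East": 1, "West": 0}`: the East's second playoff team (.450) sits below the West's best non-playoff team (.550). At most one conference can have a nonzero tally in a given season, which is why each year can be assigned a single color in the figure.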

The following figure shows this tally per year, with the years colored by which conference had a playoff team with the worse record. The years between the dashed vertical lines are the years in which division winners were guaranteed a top three or four seed.

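[Figure: number of playoff teams per year, 1984 to 2017, with a worse record than a non-playoff team in the other conference, colored by conference]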

This analysis spans 34 seasons. Of these 34 seasons:

  • There were 10 seasons where the 16 teams with the best records were the 16 in the playoffs (although not seeded as such, since the playoff bracket is split by conference).
  • Of the 24 (71%) seasons where at least one team with a worse win percentage than a lottery team in the other conference made the playoffs, the offending conference was the East 16 times and the West 8 times.
  • The worst year was 2008, when more than half of the playoff teams in the Eastern Conference had a worse record than the 9th-place Golden State Warriors, who went .585. In fact, the 10th-place Portland Trail Blazers went .500, placing them ahead of the 7th- and 8th-seeded Eastern Conference teams, and the 11th-place Sacramento Kings had a better record than the 8th-seeded Atlanta Hawks (a team that won only 37 games).
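The percentages in the list above are easy to check by hand. The short Python sketch below redoes the arithmetic; the helper name is hypothetical, and it assumes a standard 82-game regular season, under which a .585 record corresponds to 48 wins and a .500 record to 41.

```python
def win_pct(wins, games=82):
    """Win percentage for a regular season of `games` games."""
    return wins / games


seasons = 34
seasons_with_displacement = 24
east_offending, west_offending = 16, 8

# Share of seasons with at least one displaced team: 24 / 34.
share = seasons_with_displacement / seasons

# How much more often the East was the offending conference: 16 / 8.
east_to_west_ratio = east_offending / west_offending

# 2008 records mentioned above.
warriors = win_pct(48)  # 9th in the West
blazers = win_pct(41)   # 10th in the West
hawks = win_pct(37)     # 8th seed in the East

print(f"{share:.0%}", east_to_west_ratio)
print(f"{warriors:.3f} {blazers:.3f} {hawks:.3f}")
```

The 37-win Hawks come out at about .451, well below both Western teams that stayed home.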

I have heard the argument that we should not worry about unbalanced conferences in any one year, because “Sometimes the East is better, sometimes the West is better; it balances out in the long run!” While my analyses don't control for strength of schedule in each conference, it simply isn't true that the conference imbalance evens out over time. Over the past 34 seasons, the East was the worse conference twice as often as the West (at least in terms of worse teams making the playoffs).

That argument also doesn't make sense to me because championships are not decided over multiple years. They are awarded at the end of every season. So even if it balanced out between conferences over time, this would not matter, because in any single season a worse team can still take a playoff spot from a better one. And from these data, we can see that 71% of the last 34 seasons have resulted in at least one team making the playoffs with a worse record than a lottery team in the opposite conference.