Using SQL to analyze 5 years of my own NBA Predictions Game

5 min readJul 29, 2022

Hi there! My name is Felipe Gollnick and it’s now been a few months since I started a journey through my very first studies in Data Analytics, back in March. This Medium page aims to be some kind of portfolio. I have already published a couple of articles in Portuguese and have now decided to re-publish the last one, now translating it into English.

You may find me on LinkedIn here.

As a basketball lover, and for the sake of having fun with a few friends of mine, five years ago I started making my own predictions game — the one and only Gollnick’s NBA Playoffs Predictions Game. When the NBA postseason approaches, my friends already know that I’ll be sending them several Google Forms where they may put their predictions for the Playoffs’ matchups as the action unfolds.

This was all stored in dozens of spreadsheets; and, when I started studying SQL and relational databases, it was soon very clear that I could use it as my own case study. After all, it was five years of data that I collected from my friends predictions. I was then eager to transform it into a relational database that I could use for some queries in MySQL Workbench and see if I could discover something from the numbers.

So the first thing to do was to gather all the data in one place and make all the 1,198 predictions received in 5 seasons to follow the same pattern (this took me several hours of work since they were not standardized); after that, I could normalize the data from one flat file to five different tables that I could then import to Workbench, where I could finally create the Primary and Foreign Keys to each table, leading to the diagram:

The 5 tables being:
1) People: containing the ID for each participant, their full name and the year they debuted in Gollnick’s Predictions Game;
2) Predictions: the biggest table from this data set, with almost 1,200 rows containing the participants’ IDs, their guesses for each series, if the prediction was right, and if it was a conservative prediction (will explain it later);
3) Series: containing matchup information like which teams were playing, which was higher-seeded, and the conference, round and season they were playing;
4) Rd (short for “round”): containing the full name for each round/stage (“Conference Semi-Finals”, “The Finals” etc);
5) Team: containing each team’s ID and full name.

For an instance: if I predicted that Boston would make 4–1 in the 2018 Eastern Conference Semi-Finals, my full name would be in the People table; my prediction (and if it was right or wrong) would be in Predictions; the year, conference and teams playing, in Series; the full name of the round (“Conference Semi-Finals”), in Rd; and the full name of the team (“Boston Celtics”) in Team.

***For this study, please note that I am considering just the final score of games won in each series (as in 4–1), not the score for each individual game (as in, say 121–109).***

With all that set, I could now start querying! For an instance, let’s find out all the predictions I made through all the five years of my own game:

This is just a fraction of the result, since there’s 67 series that I predicted in total. But you may see that I was confident enough to think that the Warriors would sweep the Clippers in the 1st round of the 2019 playoffs (it was actually 4–2)

To reach that output, I used several INNER JOINs as below. I needed to join the Teams table twice, so the full team name would appear twice in the result:

The People ID is made of 3 or 4 letters; I used my ID (‘GOLL’) to filter my predictions apart from the others

—

From there, I started thinking that there was probably a way to query for a ranking of the best participants after these five seasons, with the results accumulated. Well, the Predictions table contains a column that indicates if the guess was right about the team who won each matchup. With ‘1’ (true) meaning the prediction was right about the winning team and ‘0’ (false) meaning it was wrong, it shouldn’t be impossible to sum that up. And voilà:

Above, wherever it’s NULL, that means that the person didn’t participate in that year. And please let’s not raise any suspicions about me (‘GOLL’) leading my own ranking, ok? :D

The “TOTAL” column is returning the number of times each person was right about the team who won a series.

To reach that output, I used a window function to get the rank in the first column and a simple SUM to get the results. The best part was using CASE to group the results by year:

I need to include a **WHERE** clause so the query won’t return results from the participant ‘REAL’, which is not a human being but the actual, final result of the series. **The actual results are in the Predictions table as if they were from a regular participant, and I have to keep this in mind for every query I build.**

—

The table above, however, contains lots of right predictions. After all, it’s not hard to guess the winning team of a series — you have 50% of chances of doing so (or even more, in the case of a great team facing a much worse team).

To find out who really masters the Gollnick’s Predictions Game, I can make a ranking of BANGERS. A BANGER means a prediction that was right about the final score of the series (say Raptors 4–2 Warriors). In Portuguese, I call the BANGER as a “little flip-flop”, but it doesn’t really have the same appeal, has it?

So this new ranking makes me realize that, actually, I do not master my own game (as the previous ranking suggested), since I am now in the eighth place, behind four folks that participated in less seasons than I did.

The query is pretty much the same as the previous, just changing the column being summed up.

I need to include a **WHERE** clause so the query won’t return results from the participant ‘REAL’, which is not a human being but the actual result of the series. **The actual results are in the Predictions table as if they were from a regular participant, and I have to keep this in mind for every query I build.**

—

Please continue to the second part, where I’ll show some more interesting queries.

Using SQL to analyze 5 years of my own NBA Predictions Game

Written by Felipe Gollnick