Using SQL to analyze 5 years of my own NBA Predictions Game (part 2)

Felipe Gollnick
4 min readJul 29, 2022

If you skipped Part 1, please click here to start from the beginning. Also, find me on LinkedIn here.

Photo by Tim Hart on Unsplash

So, from dealing with all those predictions in five years, I had a feeling that my friends would send more conservative guesses in the early stages of the Playoffs (where higher-seeded teams with better campaigns played against lower-seeded teams with worse campaigns). In the following rounds, as only the best teams remains, predictions would gradually be less conservative, as matchups gets more balanced.

And by a Conservative Prediction, I mean a prediction where the participant thought that the team with the best campaign in the regular season would win that series. So, in average, predictions in the 1st round would be much more conservative than in the grand Final. But were them, really?

The Predictions table has a column indicating the conservative guess, with ‘1’ for true and ‘0’ for false. With this, I could use easily use some CASEs to make conditional calculations and reach the conservative predictions for each round and year:

I started making the Predictions Game after the 1st Round of 2018.

So my feeling was wrong, as there’s not really a pattern for the percentage to decrease in each round within a year. BUT, when looking at the last column, my feeling then turns right, as the percentage does decreases. So I could not get a conclusion from that.

To reach that output, together with the CASEs, I used AVG and ROUND:

I need to include a WHERE clause so the query won’t return results from the participant ‘REAL’, which is not a human being but the actual, final result of the series. The actual results are in the Predictions table as if they were from a regular participant, and I have to keep this in mind for every query I build.

In total, 76.8% of all predictions made by all participants through five seasons of game were conservative. That’s really something. But does being conservative pays off?

What is the percentage of the conservative predictions that were successful in the end of each series? Let’s find out:

When there’s 100% in a field, it means that that of all the series of that round of a particular year were won by the higher-seeded team. For each year, the last row will always mean a single series, as it is the grand final — so the conservative predictions will always be 100% wrong or right, except by the TOTAL column.

Within these five years, it’s possible to note that, in the 1st Round, it’s always better to play safe; in the conference semi-finals, however, some underdogs came to stun Those Who Not Believe (remember Trae Young’s Hawks versus the Sixers in 2021).

To reach that output, I queried for all the conservative predictions and then averaged by year, grouping by round:

For the sake of having fun in building queries, it’s also possible to reach the same result using a CTE in order to have a new table with only the conservative predictions (a subquery in FROM also could do it). That was my first thought, actually, but it didn’t make sense since it’s just more work

I could go on with lots of queries more, but, for now, let’s close with a last one: from all the NBA Playoffs’ series in those five years, I wanted to know in which of them the participants were most confident that one team would easily beat its opponent.

In a best-of-seven series, when Team A wins 4 games and Team B wins none, they say that Team A swept Team B. So I wanted to know if I could rank all the series based on how the average predictions were most close to a sweep.

Yes, I could, but that required a bit of extra thinking:

In the first row, you can see that, in average, the 14 participants of the 2019 Gollnick’s Prediction Game predicted that the Warriors would win 4.0 games in that series, and the Clippers 0.29. The difference between those averages (“dif” column) is 3.71, which is the largest difference in this database. In real life, though, that series was much more balanced than the participants expected, as it ended 4–2.

So the thing here was: first, to calculate the average predictions of the scores of the higher and the lower-seeded teams; second, to subtract one average from the other to get a difference or an index that I could use to make a rank; third, to use a window function that could rank those results by absolute values; and fourth, to also return the real results of the series, in order to compare if the average predictions were close to reality. And, well:

So I need to remove the real scores from the predictions in every single calculation, except when I do want to show the real result for comparing

This was surely the biggest query I built by myself, outside of the guidance from online courses. And just in case you had read this article down to this point and are wondering what were supposed to be the most balanced series according to my friends’ predictions, here they are:

Back in the first season of this game, those few 7 participants averaged equal scores for Boston and Philadelphia in the Eastern Conference Semi-Finals. They couldn’t be more wrong, as the Celtics won 4–1. Funny fact: Boston is in seven of the 15 series with the most balanced predictions.

Well, there it is. If you asked me a few months ago, I would never had expected that I would amuse myself by typing codes in the search of interesting facts in statistics. For me, those few queries were like small samples of the ridiculous amount of possibilities that SQL commands gives to someone who knows how to explore its powers.

Thank you for reading! If you feel like it, do drop me a message on LinkedIn and let’s talk more about SQL, basketball, analytics and… international container freight, maybe?

--

--

Felipe Gollnick

SQL | BI | Data | Power BI & Qlik | Passionate about sea trade logistics, music and Bahia