The analysis begins with a query that joins all the tables containing useful variables within the SQLite database, using sqlite3. The output will serve as the master table that drives the subsequent analyses. There is minimal information about the songs themselves in the database, which focuses more on the reviews. As such, it would be interesting to apply natural language processing to this dataset to extract sentiment from the review text as an extra variable to analyze; that is out of scope for this particular project, but it is a promising avenue for extending it. We will focus on genre, record label, year of song release, review date, score, and whether the review gave the song the "best new music" award.
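A minimal sketch of such a join is shown below, assuming the database keeps reviews, genres, labels, and release years in separate tables keyed by a shared review id; the table and column names are illustrative, not taken from the actual schema.

```python
import sqlite3
import pandas as pd

# Table and column names are assumptions about the schema.
conn = sqlite3.connect("database.sqlite")

query = """
SELECT r.reviewid,
       r.score,
       r.best_new_music,
       r.pub_year,
       g.genre,
       l.label,
       y.year AS release_year
FROM reviews r
LEFT JOIN genres g ON g.reviewid = r.reviewid
LEFT JOIN labels l ON l.reviewid = r.reviewid
LEFT JOIN years  y ON y.reviewid = r.reviewid;
"""

# Load the joined result into a single "master" dataframe.
master = pd.read_sql_query(query, conn)
conn.close()
```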
Beginning with a general exploration, going clockwise from the upper left, we see that rock is the most reviewed genre. It could be that the magazine focuses more on rock than on any other genre, but it could just as well be that rock songs are more mainstream than other songs. In terms of release years, most reviews cover songs released since the early 2000s. Next, we see the number of record labels falling into five categories, which represent label size based on the number of their songs that have been reviewed. The relationship here is fairly intuitive, with most labels having only a single reviewed release. Finally, we see the density of scores for reviews that gave out the "best new music" award versus those that didn't. Songs that received "best new music" have a higher average score than songs that didn't receive the award, a difference that was also confirmed statistically with a t-test (the p-value was very close to 0).
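The score comparison can be reproduced with an independent two-sample t-test, continuing with the hypothetical `master` table and column names from the sketch above.

```python
from scipy import stats

# Split scores by whether the review handed out the award.
bnm_scores = master.loc[master["best_new_music"] == 1, "score"]
other_scores = master.loc[master["best_new_music"] == 0, "score"]

# Welch's t-test (does not assume equal variances between groups).
t_stat, p_value = stats.ttest_ind(bnm_scores, other_scores, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.2e}")
```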
Now we take a look at scores over time to answer the first question posed. No clear pattern appears in the first half of the review years captured in the dataset, where less data is available. For the later years, however, a general trend emerges: average scores increase over time. This is an interesting finding, as it could mean that songs are getting better over time or that audience behavior is changing. Overall, average scores remain in the 6-8 range.
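A simple way to produce this view, again assuming the hypothetical `pub_year` and `score` columns, is to average scores per review year.

```python
import matplotlib.pyplot as plt

# Average score (and review count) per review publication year.
yearly = master.groupby("pub_year")["score"].agg(["mean", "count"])

ax = yearly["mean"].plot(marker="o")
ax.set_xlabel("Review year")
ax.set_ylabel("Average score")
plt.show()
```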
Next, we look at how scores change as the time between a song's release and its review increases. The first thing we notice is that some reviews were written for songs that were released after the review date. These are probably from reviewers who had early access to the song, but they could also be errors in the dataset. The next thing we notice is that most reviews are written in the same year a song is released or the year after; these years offer the most reliable aggregates (least potential error). We also see that the number of reviews decreases as the gap between release and review grows, which makes sense since there are relatively few old songs compared to new ones. So, although average scores appear to increase with the gap between release and review (what we referred to as a potential "nostalgia effect"), that trend might be driven entirely by the shrinking amount of data.
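A sketch of how the gap variable and its aggregates could be computed, using the same assumed column names; negative gaps correspond to the reviews dated before the release.

```python
# Years elapsed between song release and review publication.
master["gap_years"] = master["pub_year"] - master["release_year"]

# Average score and number of reviews for each gap value.
gap_summary = master.groupby("gap_years")["score"].agg(["mean", "count"])
print(gap_summary)
```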
Next, we look at record label types to see if there is any evidence of preferential treatment of larger record labels. Beginning with score, we see that average scores remain essentially constant regardless of label size.
On the other hand, looking at the number of reviews that gave out "best new music" awards (graph on the left), there is a slight indication of preferential treatment. Although the largest record labels (those with 100+ reviewed songs) do not account for the most reviewed songs, they do have the most songs with "best new music" awards. This is interesting and could prompt further analysis. On the right is a similar graph showing the same breakdown by genre; in this case, genre does not show the same pattern that record label type does.
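These label- and genre-level summaries could be tabulated as below, assuming a `label_category` column was derived earlier by binning labels into the five size categories; the column name is hypothetical.

```python
# Mean score, award count, and review count per label-size category.
label_stats = master.groupby("label_category").agg(
    mean_score=("score", "mean"),
    bnm_awards=("best_new_music", "sum"),
    n_reviews=("score", "size"),
)

# Same award/review counts broken down by genre.
genre_stats = master.groupby("genre").agg(
    bnm_awards=("best_new_music", "sum"),
    n_reviews=("score", "size"),
)

print(label_stats)
print(genre_stats)
```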
So far we have looked at score, the time between release and review, record label type, and genre. We will now see how useful these variables are in predicting whether a song receives the "best new music" award. To decide whether any of them should be excluded, we first run a simple logistic regression predicting the best new music label (no award = 0, award = 1) on the whole dataset and check the p-values of the coefficients. Before fitting, we normalize all numeric variables and encode genre using dummy variables. The p-value for the difference between review and release times is far higher than desired and might therefore be hurting the analysis, so we remove that variable from the models that follow.
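A sketch of this screening step with statsmodels, under the assumption that `label_category` is treated as an ordinal numeric variable and that the other column names follow the earlier hypothetical schema.

```python
import pandas as pd
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler

# Assumed numeric features: score, review/release gap, ordinal label size.
X = master[["score", "gap_years", "label_category"]].copy()
X[X.columns] = StandardScaler().fit_transform(X)

# One-hot encode genre, dropping one level to avoid collinearity.
X = pd.concat([X, pd.get_dummies(master["genre"], drop_first=True)], axis=1)
X = X.astype(float)

y = master["best_new_music"]

# Fit the logistic regression and inspect coefficient p-values.
logit = sm.Logit(y, sm.add_constant(X)).fit()
print(logit.summary())
```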
We now use a random forest model to predict the best new music label. To find the optimal set of hyper-parameters, we run a grid search over the hyper-parameter values seen below, optimizing F1-score with 5-fold cross-validation. The dataset is split 75/25 into training and testing sets, stratified so that the proportion of 0s and 1s remains the same in both subsets.
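A minimal sketch of the search, continuing from the feature matrix above with the time-gap variable removed; the grid values shown are illustrative, not the ones actually used in the report.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Drop the variable flagged by the logistic regression screen.
X = X.drop(columns=["gap_years"])

# Stratified 75/25 split preserves the 0/1 proportion in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Illustrative hyper-parameter grid.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="f1",
    cv=5,
)
search.fit(X_train, y_train)
best_rf = search.best_estimator_
```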
A series of performance measures for the model with optimized hyper-parameters can be found below. Precision, recall, and F1-score all show that the model is not performing well. Note that in this dataset there are many more reviews that did not give out "best new music" awards than reviews that did. Low values for these measures combined with a higher accuracy indicate a model that is over-predicting the majority class and making many false negative errors in the process. We can also visualize this with a confusion matrix.
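These metrics can be obtained on the held-out test set with scikit-learn's classification report, continuing the names from the previous sketch.

```python
from sklearn.metrics import classification_report

# Per-class precision, recall, F1, plus overall accuracy.
y_pred = best_rf.predict(X_test)
print(classification_report(y_test, y_pred, digits=3))
```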
Below is the confusion matrix on the testing subset. Over-predicting the majority class pays off for the model because there are so many more 0s than 1s, which indicates that the variables included in the model do not contain a strong enough pattern.
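A quick way to render it, again continuing from the fitted model above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Confusion matrix of the tuned forest on the test split.
ConfusionMatrixDisplay.from_estimator(best_rf, X_test, y_test)
plt.show()
```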
Even though the model is not performing at its best, it is still worth looking at its outcomes. Below is a feature importance plot, showing how much each variable contributed while training the random forest. Score is by far the most important variable, with record label type a distant second. A single variable dominating the feature importance plot is another sign of an underperforming model, so this isn't great. Nonetheless, score clearly matters, most likely because of the difference in score distributions between "best new music" reviews and the rest that we saw in the initial visualization.
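The plot can be built directly from the forest's impurity-based importances, using the same hypothetical objects as before.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importance for each feature, sorted for readability.
importances = pd.Series(best_rf.feature_importances_, index=X_train.columns)
importances.sort_values().plot.barh()
plt.xlabel("Feature importance")
plt.show()
```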
A way to dig further into the feature importance plot is to look at partial dependence plots, which show how the model's output changes as a single variable is varied while the others are held constant (for binary classification, the y-axis runs from 0 to 1 and indicates how often the model predicts one class over the other). Here we see the partial dependence plots for score and record label type. For score, the model's output spikes once the score rises to about one standard deviation above the mean, indicating that at that point the model has roughly a 50/50 chance of assigning "best new music" to a review. The effect is much smaller for record label type, but the output does increase for larger labels. In general, the results of this model shouldn't be trusted too much, but within the scope of this project we can say with some confidence that score and record label size are the most important variables in determining "best new music".
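These plots can be generated with scikit-learn's partial dependence utilities; `"score"` and `"label_category"` are the assumed column names carried through the earlier sketches.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# Partial dependence of the tuned forest on the two dominant features.
PartialDependenceDisplay.from_estimator(
    best_rf, X_train, features=["score", "label_category"]
)
plt.show()
```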
Overall, the questions posed were answered sufficiently, although more work could be done to take this analysis deeper. The dataset only contains data up to 2016, and more recent data would help establish whether the rising-score pattern persists. I would also like to explore the role of the larger record labels in influencing who receives best new music awards, which would mean digging deeper into the labels themselves and looking for trends within each of them. Finally, it would be nice to acquire technical data about the songs to see whether patterns appear (best captured by machine learning models) that make it easier to predict whether a song gets best new music or not.