Billboard Top 100 Songs of 2000
The data is on 317 different tracks that were in the top 100 in the year 2000, including genre, duration of song, title, artist, date peaked, date entered and weekly positions in the top 100.
For next week’s podcast episode of ‘Are You Entertained?’, we analyzed the top 100 songs of 2000. By performing exploratory analyses on the data, cleaning and loading the data, forming the hypothesis, observing correlations, plotting using visualizations and evaluating the hypothesis, we hope to figure out what makes a hit soar to the top and how they stay there.
Hypothesis:
The genre Rap takes the longest time to peak.
The hypothesis hopes to answer the question, “Does genre have an influence on time it takes to peak?” The hypothesis is formed from the assumption that generally Pop music is usually the most popular genre of music and tends to peak rapidly. Also, compared with other genres such as rock and country, rap in 2000 would be less popular than songs from the the rock or country genre.
Risks & Assumptions:
Billboard Top 100 data in 2000 is based on a combination of radio airplay and sales* Sales of tracks are calculated using CD sales and not independant tracks in 2000. During this time streaming is not a relevant metric in figuring out the Billboard Top 100 formula. Another assumption to remember is that the time is in seconds. Furthemore, there is only one artist per song and doesn’t take into account featuring or multiple artists on one song. Another assumption is that Rock and Rock n’ Roll are essentially the same genre and have been combined.
The risk involved with the data is that some songs are not clearly definied in the genre. For example, some songs with certain genres made not necessarily be categorized correctly. Futhermore, the names are artist aren’t always alphabetized by last name and sometimes includes names of groups. Another risk is the low amount of songs in certain genres. Some genres only have 1 or 2 songs within that specific genre and forming analysis may be difficult especially when comparing various statistical analyses.
Cleaning the Data/ Data Munging
The data was cleaned by:
1 Creation of NaN values for songs that may have fall off from the top 100 list on a given week.
2 Converting time (duration of song) to seconds
3 Converting object types into date types or integers
4 Cleaning Genre types and format of genre types
5 Sorting by Artist Name
Data Analysis (Observatory)
The Billboard Top 100 data library is imported using Pandas. Other packages that were imported include numpy, scipy, maplotlib, seaborn, and plotly. These python packages are used to form the analysis on the data in meaningful ways.
Some exploratory data analysis taken were the longevity and rank of all 317 songs. throughout 2000. Also, we looked at aggregate functions like mean, median, mode, standard deviation, variance and length by genre.
Hypothesis Evaluation
In order to analyze the hypothesis of rap music, the first step was to create another column with the days it took each song to peak. This was calculated by finding the difference of date the song entered the charts and the day the song peaked. By finding the mean day to peak by genre, we were able to see, which genre takes the longest to peak.
The graph below shows the average days a song takes to peak by genre.
The results show Latin, Electronica, and Country were actually the genre types that take the longest to peak. Rap was actually third in terms of fastest to rising to the peak (Jazz and R&B being the fastest)
Futhermore, to test the statistical relevance, P Value and T- Statistic models were used to determine wheter to reject the null or reject the alternative hypothesis.
P Value & T- Statistic
By comparing each individual rap songs’ time to peak compared to rest of the songs’ times to peak, we were able to find the P Value and T- Statistic
The null hypothesis is that there is no difference regardless of genre on how long it takes for a song to peak.
The alternative hypothesis is that there is a difference, specifically with rap music.
The T- test revealed T statistic of 1.90 and a P Value of .0577
Since the p-value is higher than the significance level, assuming an alpha of .05, the alternative hypothesis is rejected and the null hypothesis is true.
TO NOTE: ** It is hard to exactly determine whether to reject null or alternative hypothesis in this particular scenario because the P Value is only very slightly higher than the .05 (difference of .007) **
Considerations
There are multiple considerations to observe when looking at this data. As mentioned earlier, some of the risk involved with this data is to observe how there is not sufficient data in each genre to correctly analyze and test the hypothesis. For example by finding the amount of songs in each genre that have it to the top 100, it is disproportionately in favor of Rock, Country, Rap, and R&B. Genres like Jazz, Reggae and Gospel only have one song in the top 100 in 2000. This can skew the data as it is only one point and the previous graph shows Jazz as the fastest genre to peak.
In order to account for this looking at the relationship of mean days to peak with the amount of songs in each genre is crucial.
Below are charts showing the relationship between the days to peak by genre and the amount of songs in each genre.
The plot shows only 4 genres with more than 10 songs in the top 100 billboard and some correlation between amount of songs and average days to peak.
Lastly, it is important to consider proportion of time to peak (days on the chart) and how long the song has been in the top 100 by genre.
Conclusion:
Using genre to predict time to peak is a mostly inconclusive process and prediction model. Other factors need to be taken into account to form a more accurate picture into which songs reach the peak the fastest and whether or not genre has an affect at all on time a song takes to peak.