IOWA LIQUOR SALES 2015 - 2016
Market Opportunity / Business Expansion
A liquor store owner in Iowa is looking to expand to new locations and has hired you to investigate the market data for potential new locations. The business owner is interested in the details of the best model you can fit to the data so that his team can evaluate potential locations for a new storefront. Include recommendations for volume sold and price per bottle.
Using statewide Iowa liquor store transactional data for 2105 and Q1 2016, recommendations for viable expansion locations were made. A model was also created using this data to create a prediction for future sales and suggestions for average bottle price and the amount of bottles sold, need to be presented.
Which Iowa location is the best to open a new store?
Strategic thinking behind the decision of a new Iowa liquor store location is dervied from looking at a variety of factors and variables associated with sales and profit of stores in specific areas. Some notable factors include county, zipcode, city, item description, price per bottle, cost per bottle, bottles sold, and sales. The data had over 2.7 million liquor sales transactions and 18 characteristics.
Risks & Assumptions:
When looking at location, there are three possible ways to segment out Iowa into location territories. The three ways to segment Iowa in our dataset is through counties, zipcodes, and cities. When looking at the quantity of stores by zipcode, cities, and counties, counties contained the most amount of stores as it represents the largest area out of the three location types. A further dive into the data shows that after the top 10 counties, the amount of stores per county drops to less than 20. Therefore, drilling down into zipcodes and cities would cause the amount of stores in each location to be much smaller.
Another assumption/ risk to consider is the the multi- colinearlity between many of the variables. The amount of bottles sold and the price of the bottles has a direct correlation with profit and sales. Therefore, our model’s may seem too perfect and may be a refelection of the relationship of price of bottle * bottles sold = total sales.
Lastly, because of very large outliers (presumably due to alcohol distributors in the state of Iowa, with huge sales) we took the route of looking at both the mean and the median to model out our predictions. To account for the outliers without directly removing them, the median gave us a better estimate of our average sales and we looked at both to come to that conclusion.
Data Cleaning/ Munging
Steps in cleaning data:
1 Convert money columns from string to float
2 Checking to see if all zipcodes have length of 5 and changed to integer
Found 1 zipcode value '712-2', turns out to be 51529 (changed)
Found 1 zipcode value outside of Iowa, switched from 56201 to 52601
3 All city values changed to uppercase
4 Replace NaN by creating a dictionary with “City” as key and filled in corresponding “County” and “County Name” as values of the dictionary Updated remaining NaN values for “County” manually
5 Replace NaN values for “Category Name” using “Item Description” and correlating matching values * Manually updated related NaN values in “Category Name”
Data Analysis / EDA
In order to start forming decisions on locations to open a new store, it is necessary to evaluate possibilities for further data analysis and perform exploratory data analysis. To do this, the date was separated out into seperate columns of months and year to account for seasonality and a new year. Overall gross profit and gross profit per bottle was calculated to see expected profits by store. Number of transactions per county was also calculated.
In addition, 2010 demographic data for land area per store and population per store was appended to provide a more robust and detailed analysis of our data. The thinking was that reach per store by populatin and alnd size would have an affect on sales. Correlating median sales in relationship to area per store and population per store resulted in no direct correlation.
To account for seasonality affect and differences between months and years, the data was split into 2015 sales and 2016 sales. After calculating the top 10 counties for gross profit and sales, Polk came out overwhelming at the top for both. This may be attributed to the population in the county and the capital city of Iowa being located in Polk.
Lastly, through the exploration of the data and plotting the data, we noticed two extreme outliers. The outliers skewed the liquor sales data to be overwhelmingly right- skewed. The outliers showed two stores having sales over $10 MM. To account for the right skew in our data, in determining average sales to predict our models, both median and mean were used to determine average.
Data Mining
The EDA process helped formulate and ready the stage for the data mining process in analyzing which counties would be ideal for opening up a new liquor store location. After aggregating the data, we found general metrics such as sales and bottles sold to be highly correlated with
Below are some metric that describe the Iowa Liquor Sales 2015 - 2016 Dataset:
Sales
The Mean total sale per store across state is $206,784.42 The Median total sale per store across state is $71,428.33 The Standard deviation of total sale per store across state is $503,127.27
Bottles Sold
Mean total bottles sold per store across state is 15,689.05 bottles. Median total bottles sold per store across state is 6,579.0 bottles. Standard deviation of total bottles sold per store across state is 31,518.96 bottles.
Before moving forward into the data mining process, a few considerations and decisions needed to be taken.
Market Size vs Sales
The idea of market size vs sales for determingin the best locaiton for a new store needs to taken account. Basically, the quesiton boils down to whether reach and size of the market is more important or sales and profit are the most important in determining the next location. Both are important to the decision making process, but one is more important than the other.
The factors considered in the model include:
Market Size:
- Total sales per county
- Land area per store
- Population per store
Sales:
- Average Sales Per Store
- Average Gross Profit Per Store
Maximum Total Sale vs. Highest Average Sale
Next, a decision had to be made based on whether the maximum total sale or the highest average sale was a better indicator for predicting future sales.
Maximum total sales:
By using maximum total sale as the indicative performance metric for choosing location of a new store, Polk County is the recommended county.
Earlier data explorations reveal the reasons for Polk’s stellar performance:
1) The presence of 2 extreme outlying stores, with annual sale of between $9 to $10mil, 50 times more than 75% of all stores.
2) 208 stores in Polk, twice as many as the next highest county, Linn with 104. 75% of of the counties have 13 or less stores.
A client with strong financials, proven management record and liquor-store expertise, with the desire to gain market share, would be interested in establishing a presence in Polk.
Highest average sales:
Another metric to describe liquor sale performance for a county is to consider the average store sale within the county and to pick the county with the “Highest average” sale. Winnishiek, although having only 5 stores, had the highest average (median) sales.
Two models were created to account for these differences. One emphasizing market size and maximum total sales, while the other emphasizes sales and highest average sales.
Modeling
In predicting sales for both Polk and Winneshiek, both a Lasso Regression and Ridge Regression were used. The 2015 Q1 Iowa Liqour sales for Polk and Winneshiek was represented the training data set to predict 2016 Q1 data, the test set. In order to fit both models, both ridge regularization and cross- validation were implemented.
MODEL FACTORS:
- MEDIAN BOTTLE COST
- MEDIAN BOTTLE PRICE
- MEDIAN BOTTLE PROFIT
- SUM OF BOTTLES SOLD
- SUM VOLUME SOLD
- NO. OF TRANSACTIONS
TARGET:
- SALES
Linear regression, as a model, performed slightly better in Polk than in Winneshiek, possibily due to more data points in Winneshiek compared to Polk (More stores in Winneshiek (5) than in Polk).
POLK REGRESSION MODEL
WINNESHIEK REGRESSION MODEL
The alphas for the Ridge algorithm was 0.2 for regularization and was minimally effective in reducing RMSE. Model performance was considered mediocre with high r^2 (~ 0.95) but had a relatively high Root Mean Squared Error. Coefficients of the models were also significant.
Suggestions for model improvement:
1. Scaling the features using a normalization method is necessary to account for the huge differences in measurements.
2. Adding features that do not have multi- colinearity and not directly related in predicting liquor sales will provide a more meaningful model.
3. Including more historical data for time-series analyses of market trends over the years. This will also give our model the ability to predict Q2 2016 and so on since our model is currently only based on Q1 data.
4. In future iterations of the models, a classification or log regression approach would be more affective in predicting future locations for liquor stores. Using a linear regression proved to have limitations when it came to counties and types of liquors sold because of its discrete, non- continous, nature.
Conclusion:
The liquor store data from the state of IOWA in 2015 and 2016 Q1, to answer the following:
- Identify suitable locations for a liquor store owner's expansion plans
- Create a model to predict future sales
- Recommend target for bottles sold and average bottle price
1) Location recommendations
After examining all 99 counties across the state of IOWA, 2 counties were selected as possible locations depending on the type of client and their needs.
For a typical store expansion plan (annual revenue of ~ $200k), Winneshiek was recommended for the following reasons:
- Top county for average sale per store
- Top county for average bottle sold per store
- 97th percentile for average bottle price
- 97th percentile for average bottle profit
- 68th percentile for average number of transactions per store
- 46th percentile for mean bottles per transaction
- Recommended average price per bottle ~ $15.50
- Recommended annual bottles sold ~ 21,000 bottles
For a large scale expansion (annual revenue of ~ $9mil per store), which requires access to a large market base, we recommended Polk County for the following reasons:
- Top county for total sales, i.e. largest market
- Top county for total bottle sold per store
- Top county for total number of transactions
- 54th percentile for average bottle price and average bottle profit
- Recommended average price per bottle ~ $19
- Recommended annual bottle sold ~ 500,000 bottles