Data Science Salaries
WebScraping & Log Regression
Premise
A contracting firm that’s rapidly expanding wants to leverage data to win more contracts. The firm offers technology and scientific solutions and wants to be competitive in the hiring market. The principal at this firm thinks the best way to gauge salary amounts is to take a look at what industry factors influence the pay scale for these professionals.
Assumptions and Risks
Assumptions and risks related to this data set is the low percentage of job positions with salaries and even then most postings are salries from recruiting agencies. Also, salaries are listed in the form of weekly, monthly, yearly and so on. Since the variance in how salary is listed can be a factor in compensation, only the salaries with yearly salaries were taken (most common salary expression). Another risk associated with the data is the reccurence of similar job postings. For example, one company or recruiting firm can post multiple times for one company with slight differences in job description or job title. Many times the job could possibly be the same position or it could be multiple recruiting agencies promoting a certain company/ job. This can skew the data to generalize according to these specific positions and companies.
Some assumptions we are making is that the mean for each salary range per job listing will serve as our basis for each salary. For example if the salary is between 80K and 100K, the salary used for analysis will be 90K (the mean of the two).
Web Scraping - Indeed
Using Beautiful Soup, a webscraping python library, a request was sent to Indeed.com to scrape content on Indeed. Next a function was created to extract location, job title, company name, salary, and job description and summary for data scientist positions from indeed. Job positions with the keywords “data” and “science/scientist” and position locations in either New York, Chicago, Boston, San Francisco, Los Angeles, Austin, and Atlanta were scraped 1000 max entries per city.
Data Muning/ Data Cleaning
After extracting the data, the data needed to be cleaned. First, jobs without salaries were removed as our predictive analysis and modeling of the data requires data on salaries. Also dollar signs($), commas (,), and ‘nobr’ were removed from the salaries column. Salaries are also paid either yearly, monthly, weekly, and hourly. Since only the yearly salaries are pertinent for analysis and is the most common way of noting salaries, monthly, weekly, and hourly salaries were dropped. In addition, since job positings usually present a range of salaries for each posting, the mean of the highest and lowest number in the range was averaged and was used as the single salary amount for each position.
Exploratory Data Analysis (EDA)
Before performing predictive analysis on the indeed scraped data, general exploring of the salary data is necessary. From the data gathered, the mean salary for data science positions is $103,939 with a standard deviation for $47,204. The highest salary listed is $250,000 and the lowest is $10,000. Next we looked at how many times a value or word shows up in our data. Generic job titles such as Certifying Scientist, Data Scientist, and Research Analyst were job titles that showed up the most often.
Not surprisingly, New York showed up the most at 78 times for cities. Boston, Los Angeles, Chicago, and San Francisco were other cities that showed up often. This can probably be attributted to the nature of the scrapping.
New York and Boston contained the most job listings with salaries. This may mean that the cities on the east coast were more likely to post salaries. The higher salaried listings could also have to do with the fact that the current location of this analysis is being done in New York on the east coast.
Universities, government entities, and recruiting firms seemed to have the highest number of job postings/ positions with salaries in Indeed. Some examples include Averity, Columbia University, Commonwealth of Massachusetts, Center for Disease Control
When looking at the salary range of the top 20 companies with multiple listings, government entities tend to fall on the lower range of the salary range. For cities, Los Angeles seems to have the lowest salaries, while New York City had the highest.
Data Mining
Using the salary data for data science jobs, salaries were divided between above 100K and below 100k. This was determined by finding the mean of all the salaries. The three features used to predict whether the salary for a specific position is above or below 100K are location/ city, job title, and company. A patsy matrix was created since the values were classifications and not linear or numeric. A logistic regression was performed with grid search to find the best parameters and to predict “high” or “low” salaries. The best parameters is a C of 2.5 and a Lasso (or L1) regression.
Looking at the classification report, a precision score of .85, recall score of .85, and a f1 score of .85 was obtained.
A ROC AUC (Area Under Curve) score of .90. The ROC curve plots the true positive rate vs the false positive rate. A score of .9 signifies the model is a good predictor of classifying whether a job position might be above or below 100K.
Conclusion:
Based on the data provided, recruiters dominated job postings for data scientits because of the nature of job postings with salaries being primarily recruiting firms in career sites. So seeing Workbridge Associates, a recruiting firm, be the highest predictor of high salaries is not too surprising. Lighthouse Recruiting is in Smyrna and recruit for “certifying scientists” so it is no wonder they are all in the top 5 predictors for low salary.
Surprisingly, Boston is a higher predictor of salary after San Francisco. I assumed that New York might be the second highest city for being a high predictor of data science salaries. Furthemore, the Commonwealth of Massachusetts, a company located in Boston, is a low predictor of salary, but Boston is high, which was interesting to see.
Below are the top 5 predictors of high salary and top 5 predictors of low salary.
The top 5 Predictors of High Salary
1 Workbridge Associates (Company)
2 San Francisco (City)
3 Data Scientist (Job Title)
4 Boston (City)
5 Center for Disease Control and Prevention (Company)
Top Five Predictors of Low Salary
1 Lighthouse Recruiting (Company)
2 Columbia University (Company)
3 Commonwealth of Massachusetts (Company)
4 Smyrna (City)
5 Certifying Scientist (Job Title)
Considerations and Follow Up
There are multiple considerations to observe when looking at this data. There is significant risk with scraping the terms data and scientist as this brings in “Certifying Scientist” and “Scientists”, unrelated to data science and positions for people trying to become scientists. Therefore, the salary data could be highly skewed as the mean salary is lower because of the lower salaries. Removing duplicates or manually eliminating very similar positions would be another follow up step. For example, Lighthouse Recruiting Company has multiple Certifying Scientists located in Smyrna, GA and this position shows up more than 50+ times in our scrapped data.
Another consideration is the position title of data scientist is constantly in flux and changing. In addition, since the definition of data scientist is not fully understood by some employers and potential job candidates, it may be difficult to fully capture the salaries of “data scientists”.
Also, as mentioned earlier, some of the risk involved with this data is the lack of sufficient data to analyze further all the salaries. More data would need to be scraped for the analysis to be meaningful. Appending and scraping other job sites and salary sites like glassdoor might address some risks with insufficient data.
If this predictive analysis was to be done again, positions and salaries will be vetted carefully to exclude positions unrelated to data science and those that have to do with general scientists. Furthemore, more data will be included to provide a more thorough and accurate prediction for indicators of high salaries.