Predicting prices in the Sneaker Aftermarket
By: Derek Aborde
Introduction
In this notebook we will look at data obtained from Stockx, an online marketplace for shoes. We will focus on transaction data for Adidas Yeezy sneakers, all to hopefully gain some insight into the sneaker aftermarket.
The Sneaker Aftermarket
What is the Sneaker Aftermarket?
Well, let's take a look at the primary market or the retail market first. The primary market is where sneakers are first released for retail to the general public, usually through major retailers like Nike stores, Adidas stores, Footlocker, and many more.
For popular sneakers, though, you can expect a very limited supply that usually sells out almost immediately. This leaves many shoppers wanting sneakers that are sold out and no longer available in the primary market.
Enter the aftermarket or the secondary market, which is where these shoes are sold once again, except now at a premium. Here, individual sellers who were able to purchase highly coveted sneakers from the primary market can list their sneakers for sale, usually at prices much higher than retail, and resell them to buyers.
Sites like Stockx and FlightClub are major platforms of the sneaker aftermarket.
Motivation
Stockx prides itself on being the 'Stock Market of Things', which got me thinking. We've seen data science applied to stock market analysis and prediction, so could the same be done for shoes? Stockx keeps a lot of transaction data, and I want to see whether it can be used to predict prices on the aftermarket.
Yeezy
The shoes we will focus on in this notebook are Yeezys. A collaboration between Adidas and Kanye West, this line has grown to be one of the most coveted in sneakers, which has made it very popular on the aftermarket, where some models sell for upwards of 200-300% above retail.
Goal
Train a model to predict the price that a Yeezy sneaker will sell for on the aftermarket.
1. Data Collection
The code below contains everything needed to get transaction data for every men's Yeezy model. Feel free to modify how much data you want by changing some of the parameters in 'params'. Let's skip this part and take a look at the CSV file that the code below produces.
import requests
from bs4 import BeautifulSoup
import json
import time
import pandas as pd

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
params = {'state':'480','currency':'USD','limit':'10000','page':'1','order':'DESC','country':'US'}

# Gets some quantitative and qualitative data about every men's Yeezy shoe
identifying_data = pd.DataFrame()
for i in range(1, 5):
    url = 'https://stockx.com/api/browse?_tags=yeezy,adidas&productCategory=sneakers&gender=men&page={}'.format(i)
    try:
        r = requests.get(url, headers=headers)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = json.loads(str(soup))
    identifying_data = identifying_data.append(pd.json_normalize(data['Products']))
    # sleep so you don't get rate-limited
    time.sleep(5)
identifying_data = identifying_data.dropna(subset=['market.deadstockSold'])
# Function that gets a good number to slice with
def get_split(x):
    y = x / 2500
    if y <= 1:
        return 1
    else:
        return int(y)
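# For example, a model with 10,000 recorded sales gives get_split(10000) == 4,
# so the [::x] slice below would keep roughly every 4th transaction.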
# Gets actual transaction data, most importantly, how much and when a yeezy model sold for
df = pd.DataFrame()
count = 0
for i in identifying_data['urlKey']:
    url = 'https://stockx.com/api/products/{}/activity'.format(i)
    try:
        r = requests.get(url, params=params, headers=headers)
    except requests.exceptions.RequestException as e:
        raise SystemExit(e)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = json.loads(soup.string)
    x = get_split(identifying_data.iloc[count, 62])
    df1 = pd.json_normalize(data['ProductActivity'][::x])
    df1['Color'] = identifying_data.iloc[count, 7]
    df1['Release Date'] = identifying_data.iloc[count, 19]
    df1['Retail'] = identifying_data.iloc[count, 22]
    df1['Model'] = identifying_data.iloc[count, 23]
    df1['Name'] = identifying_data.iloc[count, 27]
    df1['Annual High'] = identifying_data.iloc[count, 57]
    df1['Annual Low'] = identifying_data.iloc[count, 58]
    df1['Volatility'] = identifying_data.iloc[count, 61]
    df1['Total Sold'] = identifying_data.iloc[count, 62]
    df1['Total Dollars'] = identifying_data.iloc[count, 71]
    df = df.append(df1)
    count += 1
    # sleep so you don't get rate-limited
    time.sleep(5)

# Save the collected transaction data to the CSV file read in below
df.to_csv('stockx_yeezy_data.csv')
Let's first read the CSV file into a pandas dataframe and take a look at our dataset.
# Reading in the data
import pandas as pd
df = pd.read_csv('stockx_yeezy_data.csv')
df
From our dataset, you can see we have 491,235 rows and 21 columns.
That means we have data on 491,235 different Yeezy sales.
Each transaction has 21 columns, but we will only focus on the ones used later in this notebook: the sale price (localAmount), the sale date (createdAt), Color, Release Date, Retail, Model, Name, Annual High, Annual Low, Volatility, Total Sold, and Total Dollars.
2. Data Cleaning
Taking a quick look at the data, you may notice that some cleaning needs to be done.
# Gets rid of the columns that we do not want
df = df.drop(columns=['Unnamed: 0','chainId', 'amount', 'productId', 'skuUuid', 'state', 'customerId', 'localCurrency',])
# Converts columns to a datetime
df['createdAt'] = pd.to_datetime(df['createdAt'])
df['Release Date'] = pd.to_datetime(df['Release Date'])
# Check which columns have missing values and how many are missing
pd.isnull(df).sum()
# Gets rid of the shoes that have not released yet
df = df.dropna(subset=['Release Date'])
# Replaces NaN elements with 0
df = df.fillna(0)
3. Feature Engineering
Here we will create more influential variables from the existing raw data by creating new columns. This will help us categorize our dataset, and potentially help us gain insight into some good predictors for our training model.
Let's first do some manual one-hot encodings, which represent categorical variables as binary values.
I believe that whether or not a shoe is white or black, whether it is a 350 model (the most popular model), and whether it is a Static version (a special colorway of a model with reflective accents, much more limited than its non-reflective counterpart) will all be useful predictors of resale prices.
# Makes a new column checking if White is found in the Color column
df['is_white'] = [1 if 'White' in x else 0 for x in df['Color']]
# Makes a new column checking if Black is found in the Color column
df['is_black'] = [1 if 'Black' in x else 0 for x in df['Color']]
# Makes a new column checking if 350 is found in the Name column
df['350'] = [1 if '350' in x else 0 for x in df['Name']]
# Makes a new column checking if Static is found in the Name column
df['Static'] = [1 if 'Static' in x else 0 for x in df['Name']]
Now let's quantify some columns for our training models to interact with:
# Makes new column by subtracting the two dates, keeping only the number of elapsed days
df['days_until_sold'] = (df['createdAt'].sub(df['Release Date'])).dt.days
# Makes new column by dividing the two prices
df['price_ratio'] = (df['localAmount'])/(df['Retail'])
df.head()
4. Data Analysis
Now let's take a look at our dataset and examine some interesting relationships within it.
Let's first take a look at our target variable, price_ratio. From the histogram below, we see that it is right skewed. Clearly there are some really big outliers within the dataset.
import matplotlib.pyplot as plt
# Histogram plot of our target variable
plt.hist(x=df['price_ratio'], bins=100)
plt.xlabel('Price ratio (sale price / retail)')
plt.ylabel('Number of Sales')
plt.title('Distribution of Sales')
Now we will look at one of our categorical variables, 'Model', and its relation to our target variable. You may notice that 350 models, on average, tend to have some of the highest price ratios.
# Group our dataframe by each Model
grouped = df.groupby('Model')
x = []
y = []
# Go through each Model and record the mean of its price ratio
for model, group in grouped:
    x.append(model)
    y.append(group['price_ratio'].mean())
# Set labels
plt.xticks(rotation=45, ha='right')
plt.title('Price ratios for each model')
plt.xlabel('Model')
plt.ylabel('Price Ratio')
plt.bar(x,y)
Now let's look at our numerical variables and one hot encodings, and their relation to our target variable.
We can visualize this relationship by creating a correlation matrix.
import numpy as np
# Make a new dataframe, consisting of only columns that are numerical
numerical = df.select_dtypes(include=[np.number])
# Creates matrix for each predictor and its relation to other variables
correlation_matrix = numerical.corr()
correlation_matrix
The heatmap below plots the correlation data between our variables.
You may notice that Annual High, Annual Low, is_black, 350, and days_until_sold have some of the strongest positive correlation with our target variable.
localAmount has a positive correlation, but must be disregarded since it is directly related to price_ratio and thus not a good predictor.
import seaborn as sns
# Creates heatmap
sns.heatmap(correlation_matrix, square=True)
5. Modeling
After all our data cleaning, feature engineering, and data exploration, we can now start our predictive modeling.
We will train a multiple linear regression model to predict our target variable, price_ratio. Multiple linear regression is like simple linear regression, except that it can handle more than one feature.
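Concretely, with features $x_1, \dots, x_p$ standing in for whichever columns we keep as predictors, the model learns coefficients $\beta_0, \dots, \beta_p$ such that

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p,$$

where $\hat{y}$ is the predicted price_ratio.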
However, because it can handle more than one feature, it is important that the features not be strongly correlated with each other. So, to avoid multicollinearity, we must first drop 'localAmount' and 'Retail', which are directly related to our target variable. In any case, price_ratio already captures the information in these two columns.
'createdAt' and 'Release Date' should be dropped because we already have a good measure of these two values in our 'days_until_sold' column.
'Total Sold' and 'Total Dollars' are also correlated with each other, so we will drop one of them. I decided to keep 'Total Dollars' as a predictor.
Lastly, 'Model', 'Name', and 'Color' will be dropped since we have one-hot encoded columns in their place.
df = df.drop(columns=['localAmount','Retail', 'createdAt','Release Date','Total Sold','Model','Name','Color'])
df.head()
With that done, we can start by splitting our dataset into training and testing data.
from sklearn.model_selection import train_test_split
# Separate the features and target
X = df.drop(columns=['price_ratio'])
y = df['price_ratio']
# Do an 80/20 split of our data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20)
Here we make our linear regression object and train it with our training data.
Then we plot our predictions and our actual test data to compare results. A summary of our model is also shown to highlight our R2 score.
from sklearn import linear_model
from regressors import stats as stats_reg
# Define the multiple linear regression model
lr = linear_model.LinearRegression()
# Fitting the model
lr.fit(X_train,y_train)
# Predict on the test data
y_pred = lr.predict(X_test)
# Make scatter plot of actual vs predictions
plt.scatter(y_test, y_pred)
plt.title('Actual vs predictions')
plt.xlabel('Actual')
plt.ylabel('Predicted')
# Get summary of our model
stats_reg.summary(lr, X_train, y_train, X_train.columns)
So our R2 value is mediocre, and our graph confirms that our model can definitely be improved.
But before we do that, let's first check whether or not our model violates the linear regression assumptions.
So let's use statsmodels to do a normality test on the residuals.
Looking at the resulting Q-Q plot, the S-shape indicates significant non-normality, so the model fails the normality test.
import statsmodels.api as sm
# Get residual
Residual = y_pred - y_test
# Plotting the residuals
sm.qqplot(Residual,line="r");
5.1 Modeling with a Log-Transformed Target
Perhaps we can solve this violation of normality by performing a nonlinear transformation of variables.
Let's see what a natural log transformation might do by comparing our original target variable to its natural log.
You may notice in the histogram below that the frequencies and spread change drastically after natural log transformation.
# Histogram plot of the original target
plt.hist(df["price_ratio"], bins=100);
# Histogram plot of the log-transformed target
plt.hist(np.log(df["price_ratio"]),bins=100);
plt.legend(["price ratio", "log transformed price ratio"]);
plt.show()
Now let's fit our model again, this time with a log-transformed price_ratio.
X_train_l = X_train
X_test_l = X_test
y_train_l = np.log(y_train)
y_test_l = np.log(y_test)
# Define the multiple linear regression model
lr = linear_model.LinearRegression()
# Fitting the model
lr.fit(X_train_l,y_train_l)
# Predict on the test data
y_pred_l = lr.predict(X_test_l)
# Make scatter plot of actual vs predictions
plt.scatter(y_test_l, y_pred_l)
plt.title('Actual vs predictions')
plt.xlabel('Actual')
plt.ylabel('Predicted')
# Get summary of our model
stats_reg.summary(lr, X_train_l, y_train_l, X_train_l.columns)
# Get residual
Residual_l = y_pred_l - y_test_l
# Plotting the residuals
sm.qqplot(Residual_l,line="r");
Although log transforming our target variable helps improve our model, as shown by a higher R2 score, it fails to pass the normality test once again.
A good next step would be to train a different model.
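As a minimal sketch of what that next step could look like, assuming the same X_train/y_train and X_test/y_test splits from above and that scikit-learn is available, a tree-based model such as a random forest does not rely on the linearity and normality assumptions we just violated:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
# Fit a random forest on the same training features and target
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)
# Evaluate on the held-out test set
rf_pred = rf.predict(X_test)
print('R2 on test set:', r2_score(y_test, rf_pred))
This is only a starting point, but it sidesteps the distributional assumptions of ordinary least squares while still using the same features.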
Conclusions
Although we were only able to train a relatively mediocre multiple linear regression model, and were unable to satisfy the linear regression assumptions, we still gained some insight into the sneaker aftermarket. Most importantly, we identified the features that drive resale prices on Yeezys, like annual highs, the time between release date and sale date, and even color.
I also believe that the number of pairs released of a given shoe is the ultimate driver of aftermarket prices. This article explains how Nike controls the sneaker aftermarket by deliberately limiting supply to increase demand, encouraging a cult-like customer devotion to Jordans, much like Adidas has done with Yeezys. With limited supply, you can only expect aftermarket prices to rise to match that demand. The number of pairs released of each Yeezy model would be difficult to obtain (most are unknown), but if it were attainable, it would greatly improve our model as another feature.
Without this, the best next course would be to train another model; here is a great article if you would like to learn more about alternatives to multiple regression: https://www.quality-control-plan.com/StatGuide/mulreg_alts.htm