Predicting prices in the Sneaker Aftermarket

By: Derek Aborde

Introduction

In this notebook we will look at data obtained from StockX, an online resale marketplace for sneakers. We will focus on transaction data for Adidas Yeezy sneakers in the hope of gaining some insight into the sneaker aftermarket.

The Sneaker Aftermarket

What is the Sneaker Aftermarket?

Well, let's take a look at the primary market, or retail market, first. The primary market is where sneakers are first released at retail to the general public, usually through major retailers like Nike stores, Adidas stores, Foot Locker, and many more.

For popular sneakers, however, you can expect a very limited supply that usually sells out almost immediately. This leaves many shoppers wanting sold-out sneakers that are no longer available in the primary market.

Enter the aftermarket, or secondary market, where these shoes are sold once again, except now at a premium. Here, individual sellers who managed to purchase highly coveted sneakers in the primary market can list them for sale, usually at prices much higher than retail, and resell them to buyers.

Sites like StockX and Flight Club are major platforms in the sneaker aftermarket.

Motivation

StockX prides itself on being the 'Stock Market of Things', which got me thinking: we've seen data science applied to stock market analysis and prediction, so would it be possible for shoes? StockX keeps a lot of transaction data, and I want to see if it can be used to predict prices on the aftermarket.

Yeezy

The shoes we will focus on in this notebook are Yeezys. Created as a collaboration between Adidas and Kanye West, this line has grown into one of the most coveted in sneakers, which has made it very popular on the aftermarket; some models sell for upwards of 200-300% above retail.

Goal
Train a model to predict the price that a Yeezy sneaker will sell for on the aftermarket.


1. Data Collection
The code below contains everything needed to get transaction data for every men's Yeezy model. Feel free to adjust how much data you collect by changing some of the values in 'params'. Let's skip past this part and take a look at the CSV file that the code below produces.

In [ ]:
import requests
from bs4 import BeautifulSoup
import json
import time
import pandas as pd

headers = {'user-agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'}
params = {'state':'480','currency':'USD','limit':'10000','page':'1','order':'DESC','country':'US'}

# Gets some quantitative and qualitative data about every men's Yeezy shoe
identifying_data = pd.DataFrame()
for i in range(1,5):
    url = 'https://stockx.com/api/browse?_tags=yeezy,adidas&productCategory=sneakers&gender=men&page={}'.format(i)
    try:
        r = requests.get(url, headers=headers)
    except requests.exceptions.RequestException as e:  # bail out if the request fails
        raise SystemExit(e)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = json.loads(str(soup))
    identifying_data = pd.concat([identifying_data, pd.json_normalize(data['Products'])])
    time.sleep(5)  # sleep so you don't get rate-limited
identifying_data = identifying_data.dropna(subset=['market.deadstockSold'])

# Returns a step size for slicing, so each shoe contributes at most ~2,500 transactions
def get_split(x):
    y = x / 2500
    if y <= 1:
        return 1
    else:
        return int(y)

# Gets actual transaction data: most importantly, how much each Yeezy model sold for and when
df = pd.DataFrame()
count = 0
for i in identifying_data['urlKey']:
    url = 'https://stockx.com/api/products/{}/activity'.format(i)
    try:
        r = requests.get(url, params=params, headers=headers)
    except requests.exceptions.RequestException as e:  # bail out if the request fails
        raise SystemExit(e)
    soup = BeautifulSoup(r.text, 'html.parser')
    data = json.loads(soup.string)
    x = get_split(identifying_data.iloc[count,62])
    df1 = pd.json_normalize(data['ProductActivity'][::x])
    df1['Color'] = identifying_data.iloc[count, 7]
    df1['Release Date'] = identifying_data.iloc[count, 19]
    df1['Retail'] = identifying_data.iloc[count, 22]
    df1['Model'] = identifying_data.iloc[count, 23]
    df1['Name'] = identifying_data.iloc[count, 27]
    df1['Annual High'] = identifying_data.iloc[count, 57]
    df1['Annual Low'] = identifying_data.iloc[count, 58]
    df1['Volatility'] = identifying_data.iloc[count, 61]
    df1['Total Sold'] = identifying_data.iloc[count, 62]
    df1['Total Dollars'] = identifying_data.iloc[count, 71]
    df = pd.concat([df, df1])
    count += 1
    time.sleep(5)

# Save the transaction data to the CSV that we read back in below
df.to_csv('stockx_yeezy_data.csv')

Let's first read the CSV file into a pandas dataframe and take a look at our dataset.

In [35]:
# Reading in the data
import pandas as pd
df = pd.read_csv('stockx_yeezy_data.csv')
df
Out[35]:
Unnamed: 0 chainId amount createdAt shoeSize productId skuUuid state customerId localAmount ... Color Release Date Retail Model Name Annual High Annual Low Volatility Total Sold Total Dollars
0 0 12479908586289232936 1900.0 2017-02-21T23:52:48+00:00 8.5 NaN 07c69548-8a79-427f-b3b2-14a3b0d463e8 480 NaN 1900 ... White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0
1 1 12484419258910299232 1750.0 2017-02-28T16:45:47+00:00 8.5 NaN 07c69548-8a79-427f-b3b2-14a3b0d463e8 480 NaN 1750 ... White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0
2 2 12482557743936126591 1405.0 2017-03-07T03:56:36+00:00 8.5 NaN 07c69548-8a79-427f-b3b2-14a3b0d463e8 480 NaN 1405 ... White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0
3 3 12486416309791974108 1295.0 2017-03-20T03:36:58+00:00 8.5 NaN 07c69548-8a79-427f-b3b2-14a3b0d463e8 480 NaN 1295 ... White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0
4 4 12518755073694717688 1250.0 2017-04-16T14:30:30+00:00 8.5 NaN 07c69548-8a79-427f-b3b2-14a3b0d463e8 480 NaN 1250 ... White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
491230 12 12845045342143458938 450.0 2018-07-11T03:00:38+00:00 8.5 NaN e330718c-38c7-4fa1-b046-3815a34d9651 480 NaN 450 ... Shadow Black/Shadow Black/Shadow Black 2018-07-07 23:59:59 200.0 adidas Yeezy 500 adidas Yeezy 500 Shadow Black (Friends &amp; F... 368.0 200.0 0.228261 2.0 568.0
491231 13 12846048067616868422 300.0 2018-07-12T04:09:36+00:00 12.0 NaN e6e4dfb6-1ed2-40d1-9614-4c7d8c12539e 480 NaN 300 ... Shadow Black/Shadow Black/Shadow Black 2018-07-07 23:59:59 200.0 adidas Yeezy 500 adidas Yeezy 500 Shadow Black (Friends &amp; F... 368.0 200.0 0.228261 2.0 568.0
491232 14 12864744039539808601 350.0 2018-08-09T17:45:22+00:00 12.0 NaN e6e4dfb6-1ed2-40d1-9614-4c7d8c12539e 480 NaN 350 ... Shadow Black/Shadow Black/Shadow Black 2018-07-07 23:59:59 200.0 adidas Yeezy 500 adidas Yeezy 500 Shadow Black (Friends &amp; F... 368.0 200.0 0.228261 2.0 568.0
491233 15 13126144066807342257 200.0 2019-08-02T15:05:15+00:00 12.0 NaN e6e4dfb6-1ed2-40d1-9614-4c7d8c12539e 480 NaN 200 ... Shadow Black/Shadow Black/Shadow Black 2018-07-07 23:59:59 200.0 adidas Yeezy 500 adidas Yeezy 500 Shadow Black (Friends &amp; F... 368.0 200.0 0.228261 2.0 568.0
491234 0 13378915605124868988 260.0 2020-07-16T09:17:10+00:00 7.0 NaN 03c50378-dafa-4279-93cd-55262f20a5a1 480 NaN 260 ... Eliada/Eliada/Eliada NaN 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Eliada 260.0 260.0 0.000000 1.0 260.0

491235 rows × 21 columns

From our dataset, you can see we have 491,235 rows and 21 columns.

That means we have data on 491,235 individual Yeezy sales.

Each transaction has 21 columns, but we will only focus on the following:

  1. createdAt - Date the shoe was sold
  2. shoeSize - Size of the shoe
  3. localAmount - The premium price paid for the shoe (resell price)
  4. Color - Color of the shoe
  5. Release Date - Date the shoe originally released in the primary market
  6. Retail - The original price of the shoe (retail price)
  7. Model - The Model (Yeezy 350, 380, 500, 700...)
  8. Name - Name of the shoe
  9. Annual High - The highest price the shoe sold for over the past year
  10. Annual Low - The lowest price the shoe sold for over the past year
  11. Volatility - Rate at which the shoe's selling price changed
  12. Total Sold - How many pairs sold on StockX
  13. Total Dollars - Total dollar amount of all sales of the shoe on StockX

2. Data Cleaning
Taking a quick look at the data, you may notice that some cleaning needs to be done.

  1. Delete unneeded columns - because some contain information we will not use
In [36]:
# Gets rid of the columns that we do not want
df = df.drop(columns=['Unnamed: 0','chainId', 'amount', 'productId', 'skuUuid', 'state', 'customerId', 'localCurrency',])
  2. Convert columns to an appropriate type - for easier data manipulation
In [37]:
# Converts columns to datetimes (createdAt carries a UTC offset, so we drop
# the timezone to keep it comparable with the naive Release Date)
df['createdAt'] = pd.to_datetime(df['createdAt'], utc=True).dt.tz_localize(None)
df['Release Date'] = pd.to_datetime(df['Release Date'])
  3. Handle missing data points - to avoid errors from our learning models
In [38]:
# Check which columns have missing values and how many are missing
pd.isnull(df).sum()
Out[38]:
createdAt           0
shoeSize            0
localAmount         0
Color               0
Release Date      159
Retail              0
Model               0
Name                0
Annual High         0
Annual Low       2779
Volatility          0
Total Sold          0
Total Dollars       0
dtype: int64
In [39]:
# Gets rid of the shoes that have not been released yet
df = df.dropna(subset=['Release Date'])

# Replaces NaN elements with 0
df = df.fillna(0)

3. Feature Engineering
Here we will derive more influential variables from the existing raw data by creating new columns. This will help us categorize our dataset and potentially surface some good predictors for our training model.

Let's first do some manual one-hot encodings, which represent categorical variables as binary values:

  1. df['is_white'] - 1 if 'White' in 'Color', 0 if not
  2. df['is_black'] - 1 if 'Black' in 'Color', 0 if not
  3. df['350'] - 1 if '350' in 'Name', 0 if not
  4. df['Static'] - 1 if 'Static' in 'Name', 0 if not

I believe that whether a shoe is white or black, whether it is a 350 (the most popular model), and whether it is a Static colorway (a special version of a model with reflective accents, much more limited than its non-reflective counterpart) will all be useful predictors of resale prices.

In [40]:
# Makes a new column checking if White is found in the Color column
df['is_white'] = [1 if 'White' in x else 0 for x in df['Color']]

# Makes a new column checking if Black is found in the Color column
df['is_black'] = [1 if 'Black' in x else 0 for x in df['Color']]

# Makes a new column checking if 350 is found in the Name column
df['350'] = [1 if '350' in x else 0 for x in df['Name']]

# Makes a new column checking if Static is found in the Name column
df['Static'] = [1 if 'Static' in x else 0 for x in df['Name']]

Now let's quantify some columns for our training models to work with:

  1. df['days_until_sold'] - Calculated by subtracting the release date from the sale date. This lets us compare shoes by how long after release they were sold.
  2. df['price_ratio'] - Calculated by dividing the sale price by the retail price. This standardizes each shoe against its retail price, since Yeezys have varying retail prices.
In [41]:
# Makes a new column by subtracting the two dates, keeping only the number of days
df['days_until_sold'] = (df['createdAt'].sub(df['Release Date'])).dt.days

# Makes new column by dividing the two prices
df['price_ratio'] = (df['localAmount'])/(df['Retail'])
In [42]:
df.head()
Out[42]:
createdAt shoeSize localAmount Color Release Date Retail Model Name Annual High Annual Low Volatility Total Sold Total Dollars is_white is_black 350 Static days_until_sold price_ratio
0 2017-02-21 23:52:48+00:00 8.5 1900 White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0 1 1 1 0 -5 8.636364
1 2017-02-28 16:45:47+00:00 8.5 1750 White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0 1 1 1 0 2 7.954545
2 2017-03-07 03:56:36+00:00 8.5 1405 White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0 1 1 1 0 9 6.386364
3 2017-03-20 03:36:58+00:00 8.5 1295 White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0 1 1 1 0 22 5.886364
4 2017-04-16 14:30:30+00:00 8.5 1250 White/Core Black/Red 2017-02-25 23:59:59 220.0 adidas Yeezy Boost 350 V2 adidas Yeezy Boost 350 V2 Zebra 795.0 240.0 0.093012 43852.0 14898400.0 1 1 1 0 49 5.681818

4. Data analysis
Now let's take a look at our dataset and examine some interesting relationships within it.

Let's first look at our target variable, price_ratio. From the histogram below, we see that it is right-skewed; clearly there are some very large outliers in the dataset.

In [43]:
import matplotlib.pyplot as plt 

# Histogram plot of our target variable
plt.hist(x=df['price_ratio'], bins=100)
plt.xlabel('Price ratio')
plt.ylabel('Number of Sales')
plt.title('Distribution of Sales')
Out[43]:
Text(0.5, 1.0, 'Distribution of Sales')
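
To put a number on that skew, we can compute the sample skewness directly. Below is a quick sketch of my own (it assumes scipy is available, which the original analysis does not otherwise use):

In [ ]:
from scipy.stats import skew

# Right-skewed distributions have a skewness well above 0
print(df['price_ratio'].describe())
print('skewness:', skew(df['price_ratio']))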

Now we will look at one of our categorical variables, 'Model', and its relation to our target variable. You may notice that 350 models, on average, tend to have some of the highest price ratios.

In [44]:
# Group our dataframe by each Model
grouped = df.groupby('Model')
x = []
y = []

# Go through each Model and record the mean of its price ratio
for name, group in grouped:
    x.append(name)
    y.append(group['price_ratio'].mean())

# Set labels
plt.xticks(rotation=45, ha='right')
plt.title('Price ratios for each model')
plt.xlabel('Model')
plt.ylabel('Price Ratio')
plt.bar(x,y)
Out[44]:
<BarContainer object of 19 artists>

Now let's look at our numerical variables and one hot encodings, and their relation to our target variable.

We can visualize this relationship by creating a correlation matrix.

In [45]:
import numpy as np

# Make a new dataframe, consisting of only columns that are numerical
numerical = df.select_dtypes(include=[np.number])

# Creates a matrix relating each predictor to every other variable
correlation_matrix = numerical.corr()
correlation_matrix
Out[45]:
shoeSize localAmount Retail Annual High Annual Low Volatility Total Sold Total Dollars is_white is_black 350 Static days_until_sold price_ratio
shoeSize 1.000000 -0.034040 -0.018138 0.043257 0.053645 -0.003941 -0.058769 -0.054049 0.016919 0.016013 0.007335 -0.032904 0.054933 -0.032731
localAmount -0.034040 1.000000 0.188779 0.684746 0.797130 -0.116144 -0.246750 -0.145108 0.092942 0.288643 0.318114 0.101822 0.282630 0.945043
Retail -0.018138 0.188779 1.000000 0.204439 0.265161 -0.334810 0.093847 0.132985 -0.233395 -0.074635 -0.072043 0.071808 -0.015671 -0.078229
Annual High 0.043257 0.684746 0.204439 1.000000 0.757629 -0.193593 -0.217600 -0.034445 0.042052 0.286530 0.415975 0.180736 0.216902 0.619168
Annual Low 0.053645 0.797130 0.265161 0.757629 1.000000 -0.257214 -0.281419 -0.155236 0.115505 0.437701 0.381372 0.113696 0.325368 0.722766
Volatility -0.003941 -0.116144 -0.334810 -0.193593 -0.257214 1.000000 -0.024666 -0.116580 -0.056458 0.100900 -0.112545 0.037206 0.045494 -0.074649
Total Sold -0.058769 -0.246750 0.093847 -0.217600 -0.281419 -0.024666 1.000000 0.956442 0.070519 -0.065438 0.077054 -0.079571 -0.089143 -0.269435
Total Dollars -0.054049 -0.145108 0.132985 -0.034445 -0.155236 -0.116580 0.956442 1.000000 0.080911 -0.008001 0.165801 0.032354 -0.080686 -0.175199
is_white 0.016919 0.092942 -0.233395 0.042052 0.115505 -0.056458 0.070519 0.080911 1.000000 0.136891 -0.012604 -0.087662 0.158469 0.172526
is_black 0.016013 0.288643 -0.074635 0.286530 0.437701 0.100900 -0.065438 -0.008001 0.136891 1.000000 0.056350 -0.051567 0.261502 0.302886
350 0.007335 0.318114 -0.072043 0.415975 0.381372 -0.112545 0.077054 0.165801 -0.012604 0.056350 1.000000 0.144962 0.041389 0.353919
Static -0.032904 0.101822 0.071808 0.180736 0.113696 0.037206 -0.079571 0.032354 -0.087662 -0.051567 0.144962 1.000000 -0.087191 0.082349
days_until_sold 0.054933 0.282630 -0.015671 0.216902 0.325368 0.045494 -0.089143 -0.080686 0.158469 0.261502 0.041389 -0.087191 1.000000 0.284903
price_ratio -0.032731 0.945043 -0.078229 0.619168 0.722766 -0.074649 -0.269435 -0.175199 0.172526 0.302886 0.353919 0.082349 0.284903 1.000000

The heatmap below plots the correlation data between our variables.

You may notice that Annual High, Annual Low, is_black, 350, and days_until_sold have some of the strongest positive correlation with our target variable.

localAmount also has a strong positive correlation, but it must be disregarded since it is directly related to price_ratio and is therefore not a fair predictor.

In [46]:
import seaborn as sns

# Creates heatmap
sns.heatmap(correlation_matrix, square=True)
Out[46]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb03c2f8710>
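
Rather than eyeballing the heatmap, we can also read this ranking off directly from the matrix we already computed:

In [ ]:
# Sort every predictor by its correlation with the target
print(correlation_matrix['price_ratio'].drop('price_ratio').sort_values(ascending=False))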

5. Modeling
After all our data cleaning, feature engineering, and data exploration, we can now start our predictive modeling.

We will train a multiple linear regression model to predict our target variable, price_ratio. Multiple linear regression is like simple linear regression, except it can handle more than one feature, fitting price_ratio ≈ b0 + b1*x1 + ... + bn*xn over all of our predictors.

However, because it handles more than one feature, it is important that the features not be correlated with each other. So, to avoid multicollinearity, we must first drop 'localAmount' and 'Retail', which are directly related to our target variable; price_ratio already captures the information in these two values.

'createdAt' and 'Release Date' should be dropped because 'days_until_sold' already measures the relationship between these two dates.

'Total Sold' and 'Total Dollars' are also highly correlated with each other, so we will drop one of them. I decided to keep 'Total Dollars' as a predictor.

Lastly, 'Model', 'Name', and 'Color' will be dropped since we have one-hot encoded columns in their place.
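
If you want to confirm the multicollinearity numerically rather than from the correlation matrix alone, variance inflation factors are one standard check. Here is a sketch of my own (it uses statsmodels, which this notebook only imports later, so treat it as an optional aside):

In [ ]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# A VIF well above ~10 flags a feature as collinear with the others
numeric = df.select_dtypes(include=[np.number])
for i, col in enumerate(numeric.columns):
    print(col, variance_inflation_factor(numeric.values, i))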

In [47]:
df = df.drop(columns=['localAmount','Retail', 'createdAt','Release Date','Total Sold','Model','Name','Color'])
In [48]:
df.head()
Out[48]:
shoeSize Annual High Annual Low Volatility Total Dollars is_white is_black 350 Static days_until_sold price_ratio
0 8.5 795.0 240.0 0.093012 14898400.0 1 1 1 0 -5 8.636364
1 8.5 795.0 240.0 0.093012 14898400.0 1 1 1 0 2 7.954545
2 8.5 795.0 240.0 0.093012 14898400.0 1 1 1 0 9 6.386364
3 8.5 795.0 240.0 0.093012 14898400.0 1 1 1 0 22 5.886364
4 8.5 795.0 240.0 0.093012 14898400.0 1 1 1 0 49 5.681818

With that done, we can start by splitting our dataset into training and testing data.

In [49]:
from sklearn.model_selection import train_test_split

# Separate the features and target
X = df.drop(columns=['price_ratio'])
y = df['price_ratio']

# Do an 80/20 split of our data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.20)

Here we create our linear regression object and train it on our training data.

Then we plot our predictions against the actual test data to compare results. A summary of the model is also shown to highlight our R2 score.

In [50]:
from sklearn import linear_model
from regressors import stats as stats_reg

# Define the multiple linear regression model
lr = linear_model.LinearRegression()

# Fitting the model 
lr.fit(X_train,y_train)

# predict with the data
y_pred = lr.predict(X_test)

# Make scatter plot of actual vs predictions
plt.scatter(y_test, y_pred)
plt.title('Actual vs predictions')
plt.xlabel('Actual')
plt.ylabel('Predicted')

# Get summary of our model
stats_reg.summary(lr, X_train, y_train, X_train.columns)
Residuals:
     Min      1Q  Median      3Q     Max
-75.0473 -0.2185  0.1178  0.4102  6.1145


Coefficients:
                 Estimate  Std. Error   t value  p value
_intercept       0.385242    0.006851   56.2305      0.0
shoeSize        -0.046422    0.000360 -128.8901      0.0
Annual High      0.000483    0.000003  192.5320      0.0
Annual Low       0.004795    0.000014  351.6830      0.0
Volatility       1.354547    0.013527  100.1339      0.0
Total Dollars   -0.000000    0.000000 -122.2528      0.0
is_white         0.554789    0.005187  106.9592      0.0
is_black        -0.139551    0.004364  -31.9792      0.0
350              0.269708    0.003219   83.7852      0.0
Static          -0.103517    0.005400  -19.1692      0.0
days_until_sold  0.000220    0.000006   38.8793      0.0
---
R-squared:  0.57767,    Adjusted R-squared:  0.57766
F-statistic: 53735.29 on 10 features

So our R2 value is mediocre, and our graph confirms that our model can definitely be improved.

But before we do that, let's first check whether our model violates the linear regression assumptions.

Let's use statsmodels to run a normality check on the residuals.

Looking at the Q-Q plot below, the S-shape indicates significant non-normality, so the residuals fail the normality test.

In [51]:
import statsmodels.api as sm

# Get residual
Residual = y_pred - y_test

# Plotting the residuals
sm.qqplot(Residual,line="r");
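
If you prefer a numeric check to reading the plot, a formal normality test can back up the Q-Q plot. Below is a sketch using D'Agostino and Pearson's test from scipy (an addition of mine, not part of the original analysis):

In [ ]:
from scipy import stats

# Null hypothesis: the residuals are normally distributed.
# A tiny p-value rejects normality, matching the S-shape above.
stat, p = stats.normaltest(Residual)
print(f'statistic={stat:.2f}, p-value={p:.3g}')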

5.1 Modeling with a Log Transformation
Perhaps we can fix this violation of normality by performing a nonlinear transformation of the variables.

Let's see what a natural log transformation of the variables might do by comparing our original target variable to the natural log of it.

You may notice in the histogram below that the frequency and spread change drastically after the log transformation.

In [52]:
# Histogram plot of the original target
plt.hist(df["price_ratio"], bins=100);

# Histogram plot of the log-transformed target
plt.hist(np.log(df["price_ratio"]),bins=100);

plt.legend(["price ratio", "log transformed price ratio"]);
plt.show()

Now let's fit our training model again, this time with a log-transformed price_ratio.

In [53]:
X_train_l = X_train
X_test_l = X_test

y_train_l = np.log(y_train)
y_test_l = np.log(y_test)
In [54]:
# Define the multiple linear regression model
lr = linear_model.LinearRegression()

# Fitting the model 
lr.fit(X_train_l,y_train_l)

# predict with the data
y_pred_l = lr.predict(X_test_l)

# Make scatter plot of actual vs predictions
plt.scatter(y_test_l, y_pred_l)
plt.title('Actual vs predictions')
plt.xlabel('Actual')
plt.ylabel('Predicted')

# Get summary of our model
stats_reg.summary(lr, X_train_l, y_train_l, X_train_l.columns)
Residuals:
    Min      1Q  Median      3Q     Max
-3.2596 -0.1484    0.04  0.1778  1.9169


Coefficients:
                 Estimate  Std. Error   t value  p value
_intercept      -0.021554    0.002206   -9.7707      0.0
shoeSize        -0.018540    0.000116 -159.8704      0.0
Annual High      0.000274    0.000001  339.0615      0.0
Annual Low       0.001371    0.000004  312.2526      0.0
Volatility       0.172036    0.004356   39.4978      0.0
Total Dollars   -0.000000    0.000000 -183.2716      0.0
is_white         0.234277    0.001670  140.2774      0.0
is_black        -0.014263    0.001405  -10.1509      0.0
350              0.192296    0.001036  185.5287      0.0
Static           0.035994    0.001739   20.7007      0.0
days_until_sold  0.000065    0.000002   35.6023      0.0
---
R-squared:  0.66500,    Adjusted R-squared:  0.66499
F-statistic: 77984.51 on 10 features
In [55]:
# Get residual
Residual_l = y_pred_l - y_test_l

# Plotting the residuals
sm.qqplot(Residual_l,line="r");

Although log-transforming our target variable improves our model, as shown by the higher R2 score, it fails the normality test once again.

A good next step would be to train a different kind of model.
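
As a sketch of what that next step might look like (not something the original analysis ran), a tree-based model such as a random forest makes no linearity or normality assumptions:

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Tree ensembles do not assume linearity or normal residuals,
# so the failed checks above do not rule them out
rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print('Test R2:', r2_score(y_test, rf.predict(X_test)))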

Conclusions

Although we were only able to train a relatively mediocre multiple linear regression model, and we were unable to satisfy the linear regression assumptions, we still gained some insight into the sneaker aftermarket. Most importantly, we learned which features drive resale prices on Yeezys, like annual highs, the time between sale date and release date, and even color.

Also, I believe that the number of pairs released of a given shoe is the ultimate driver of aftermarket prices. This article explains how Nike controls the sneaker aftermarket by deliberately limiting supply to increase demand, encouraging a cult-like customer devotion to Jordans, much as Adidas has done with Yeezys. With such limited supply, you can only expect aftermarket prices to rise to match that demand. The number of pairs released of each Yeezy model would be difficult to obtain (most are unknown), but if it were attainable, it would greatly improve our model as another feature.

But without this, the next best course would be to train another model. Here is a great article if you would like to learn more about alternatives to multiple linear regression: https://www.quality-control-plan.com/StatGuide/mulreg_alts.htm