ML/Random Forest & SGD Regressor: Vehicle Mileage Prediction

Data Preprocessing

In [ ]:

import pandas as pd
import sklearn

Pandas is most commonly used for handling missing data and cleaning up data.
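As a rough illustration of what "handling missing data" can look like in pandas, the sketch below counts missing values and then shows two common options: dropping incomplete rows or filling them with the column median. The `toy` frame and its values are made up for this example and are not part of the Auto MPG dataset; median imputation is just one possible alternative to dropping rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "MPG": [18.0, 15.0, 25.0, 30.0],
    "Horsepower": [130.0, np.nan, 95.0, np.nan],
})

# Count the missing values in each column.
missing_per_column = toy.isnull().sum()

# Option 1: discard every row that contains a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps with the column median.
filled = toy.fillna({"Horsepower": toy["Horsepower"].median()})

print(missing_per_column)
print(len(dropped), len(filled))
```

Dropping rows is the simpler choice when only a handful of entries are missing; imputation keeps more rows at the cost of introducing estimated values.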

In [ ]:

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

df = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

In [ ]:

df.isnull().sum()

Out[ ]:

MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64

In [ ]:

df

Out[ ]:

	MPG	Cylinders	Displacement	Horsepower	Weight	Acceleration	Model Year	Origin
0	18.0	8	307.0	130.0	3504.0	12.0	70	1
1	15.0	8	350.0	165.0	3693.0	11.5	70	1
2	18.0	8	318.0	150.0	3436.0	11.0	70	1
3	16.0	8	304.0	150.0	3433.0	12.0	70	1
4	17.0	8	302.0	140.0	3449.0	10.5	70	1
...	...	...	...	...	...	...	...	...
393	27.0	4	140.0	86.0	2790.0	15.6	82	1
394	44.0	4	97.0	52.0	2130.0	24.6	82	2
395	32.0	4	135.0	84.0	2295.0	11.6	82	1
396	28.0	4	120.0	79.0	2625.0	18.6	82	1
397	31.0	4	119.0	82.0	2720.0	19.4	82	1

398 rows × 8 columns

In [ ]:

df["MPG"].mean()

Out[ ]:

23.514572864321607

In [ ]:

df = df.dropna()

In [ ]:

# columns = ["Cylinders", "Model Year", "Origin"]
df = df.astype(float)

Split data into 3 parts

Train
Validation
Test

Here we merge the validation and test sets into one because the dataset has too few entries.
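For reference, when enough data is available, a full three-way split can be sketched by calling `train_test_split` twice. The synthetic arrays, the 60/20/20 proportions, and the `random_state` values below are assumptions chosen for illustration; this notebook itself uses a single train/test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (100 rows, 7 features); not the Auto MPG dataset.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 7))
y_demo = rng.normal(size=100)

# Step 1: set aside 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Step 2: split the remaining 80% into train and validation.
# 0.25 of the remaining 80% equals 20% of the original data.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(X_tr.shape, X_val.shape, X_te.shape)
```

The validation set would be used for choosing models or hyperparameters, and the test set only for the final evaluation.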

In [ ]:

from sklearn.model_selection import train_test_split
import numpy as np

X = df.drop('MPG', axis=1)  # Capital X: the input feature matrix
y = df['MPG']  # Lowercase y: the output (target) array

# test_size=0.2 -> 20% of the rows go to the test set, 80% to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

X_train.shape, X_test.shape, y_train.shape, y_test.shape
type(y_train)

Out[ ]:

pandas.core.series.Series

Model Training

Random Forest Regressor

In [ ]:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

model2 = RandomForestRegressor()
model2.fit(X_train, y_train)

y_preds = model2.predict(X_test)
mae(y_test, y_preds), model2.score(X_test, y_test)

Out[ ]:

(2.460025316455698, 0.7769961840589887)
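The tuple above holds the mean absolute error (in MPG) followed by the value of `score`, which for scikit-learn regressors is the coefficient of determination R². To make the two numbers concrete, the sketch below computes both by hand on a few made-up values and compares them with scikit-learn's own functions. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the model output above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical targets and predictions, invented for illustration.
y_true = np.array([18.0, 15.0, 27.0, 32.0])
y_pred = np.array([20.0, 14.0, 25.0, 30.0])

# MAE: average of the absolute differences.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

An R² of 1 means perfect predictions, while 0 means the model does no better than always predicting the mean of the targets.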

SGD Regressor

In [ ]:

from sklearn.linear_model import SGDRegressor
model3 = SGDRegressor()
model3.fit(X_train, y_train)
model3.score(X_test, y_test)
mae(y_test, model3.predict(X_test))

Out[ ]:

7688061220403291.0
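An MAE this large suggests the optimizer diverged. Stochastic gradient descent is sensitive to feature scaling, and columns such as Weight (in the thousands) are on a very different scale from the others, so unscaled inputs are a likely cause. A minimal sketch of one common remedy, standardizing the features in a pipeline before `SGDRegressor`, is shown below. It uses synthetic data with made-up coefficients rather than the Auto MPG data, and the names `X_syn`, `y_syn`, and `scaled_sgd` are introduced here only for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data on scales similar to weight and horsepower (assumed values).
rng = np.random.default_rng(0)
n = 400
weight = rng.uniform(1600, 5200, size=n)
horsepower = rng.uniform(45, 230, size=n)
X_syn = np.column_stack([weight, horsepower])
# Made-up linear relationship plus noise, standing in for MPG.
y_syn = 45.0 - 0.006 * weight - 0.03 * horsepower + rng.normal(0, 1.0, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# Standardize each feature to zero mean and unit variance, then fit SGD.
# The scaler is fitted on the training data only, inside the pipeline.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

scaled_mae = mean_absolute_error(y_te, scaled_sgd.predict(X_te))
print(scaled_mae)
```

Applied to this notebook, the same pipeline could be fitted on `X_train` and `y_train` in place of the bare `SGDRegressor()`; whether it matches the random forest's accuracy would need to be checked by running it.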