ML/Random Forest & SGD Regressor: Vehicle Mileage Prediction¶
Data Preprocessing¶
In [ ]:
import pandas as pd
import sklearn
Pandas is most likely to be used here for handling missing data and cleaning up the dataset.
In [ ]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names,
na_values='?', comment='\t',
sep=' ', skipinitialspace=True)
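The `read_csv` arguments above do most of the parsing work, so it may help to see them on a tiny in-memory sample. The two lines below are hypothetical and only imitate the layout assumed for `auto-mpg.data` (space-padded numeric columns, then a tab and a quoted car name); `raw` and `sample` are illustrative names, not part of the notebook.

```python
import io
import pandas as pd

# Hypothetical sample lines in the layout assumed for auto-mpg.data:
# space-padded numbers, then a tab followed by a quoted car name.
raw = (
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"car a"\n'
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"car b"\n'
)

column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

sample = pd.read_csv(
    io.StringIO(raw),
    names=column_names,       # the data has no header row
    na_values='?',            # treat '?' as a missing value (NaN)
    comment='\t',             # ignore everything after the tab (the car name)
    sep=' ',                  # columns are separated by spaces
    skipinitialspace=True,    # collapse the run of spaces after each separator
)

print(sample)
print(sample.isnull().sum())
```

In this sketch the `?` in the second line ends up as a `NaN` in `Horsepower`, which is the kind of entry that `df.isnull().sum()` counts in the next cell.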
In [ ]:
df.isnull().sum()
Out[ ]:
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
In [ ]:
df
Out[ ]:
Index | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |
398 rows × 8 columns
In [ ]:
df["MPG"].mean()
Out[ ]:
23.514572864321607
In [ ]:
df = df.dropna()
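Dropping rows is the simplest way to deal with the six missing `Horsepower` values, but it discards the other columns of those rows too. As a point of comparison, here is a minimal sketch on a made-up four-row frame (`toy` is a hypothetical name) that contrasts `dropna()` with filling the gap with the column median. Median imputation is an alternative shown for illustration, not what this notebook does.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Horsepower value (made-up numbers).
toy = pd.DataFrame({
    "MPG": [18.0, 25.0, 30.0, 22.0],
    "Horsepower": [130.0, np.nan, 70.0, 90.0],
})

# Option 1 (used in this notebook): drop every row that contains a NaN.
dropped = toy.dropna()

# Option 2 (alternative): keep the row and fill the gap with the column median.
imputed = toy.fillna({"Horsepower": toy["Horsepower"].median()})

print(len(toy), len(dropped))   # rows before and after dropping
print(imputed)
```

With only 6 of 398 rows affected, dropping them costs little here; imputation becomes more attractive when missing values are widespread.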
In [ ]:
# columns = ["Cylinders", "Model Year", "Origin"]
df = df.astype(float)
Split data into 3 parts¶
- Train
- Validation
- Test
Here we merge the validation and test sets into a single test set because the dataset has too few entries.
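For reference, a full three-way split can be sketched by calling `train_test_split` twice. The example below uses synthetic arrays and a 60/20/20 ratio, both of which are assumptions chosen for illustration; the `_demo` and `_tr`/`_val`/`_te` names are hypothetical and kept distinct from the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 samples with 2 features each.
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# First split: 60% train, 40% held out.
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, test_size=0.4, random_state=0
)

# Second split: divide the held-out 40% evenly into validation and test.
X_val, X_te, y_val, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```

With fewer than 400 rows in this dataset, a 20% validation set plus a 20% test set would leave each with under 80 samples, which is the motivation for the two-way split used below.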
In [ ]:
from sklearn.model_selection import train_test_split
import numpy as np
X = df.drop('MPG', axis=1)  # Capital X because it is a matrix -> input matrix
y = df['MPG']  # Lowercase y because it is a 1-D array -> output array
# test_size=0.2 -> 20% of the rows go to the test set, 80% to training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
type(y_train)
Out[ ]:
pandas.core.series.Series
Model Training¶
Random Forest Regressor¶
In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
model2 = RandomForestRegressor()
model2.fit(X_train, y_train)
y_preds = model2.predict(X_test)
mae(y_test, y_preds), model2.score(X_test, y_test)
Out[ ]:
(2.460025316455698, 0.7769961840589887)
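The two numbers in the output are the mean absolute error (in MPG) and the value returned by `score`, which for scikit-learn regressors is the coefficient of determination R². The sketch below recomputes both by hand on four made-up predictions to show what they measure; `y_true_demo` and `y_pred_demo` are hypothetical values, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions (MPG values) for illustration only.
y_true_demo = np.array([20.0, 30.0, 25.0, 15.0])
y_pred_demo = np.array([22.0, 27.0, 25.0, 16.0])

# MAE: average absolute difference between prediction and target.
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))

# R^2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true_demo, y_pred_demo))
print(r2_manual, r2_score(y_true_demo, y_pred_demo))
```

Read this way, the result above says the forest is off by about 2.5 MPG on average and explains roughly 78% of the variance in the test targets. Because neither the split nor the forest uses a fixed `random_state`, these figures will vary from run to run.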
SGD Regressor¶
In [ ]:
from sklearn.linear_model import SGDRegressor
model3 = SGDRegressor()
model3.fit(X_train, y_train)
model3.score(X_test, y_test)
mae(y_test, model3.predict(X_test))
Out[ ]:
7688061220403291.0
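An MAE of this size is many orders of magnitude larger than any MPG value, which suggests the SGD fit diverged. A likely cause (an assumption, not something the notebook verifies) is that the features were passed in unscaled: `Weight` is in the thousands while `Acceleration` is in the tens, and stochastic gradient descent is sensitive to feature scale. The sketch below shows one common remedy, standardizing the features in a pipeline before `SGDRegressor`. It runs on synthetic data with a made-up linear relationship, so the names and coefficients are illustrative only.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on very different scales (made-up relationship).
rng = np.random.default_rng(0)
n = 400
weight = rng.uniform(1600, 5200, n)   # thousands, like vehicle weight
accel = rng.uniform(8, 25, n)         # tens, like acceleration
X_syn = np.column_stack([weight, accel])
y_syn = 45 - 0.006 * weight + 0.2 * accel + rng.normal(0, 1.0, n)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

# StandardScaler rescales each feature to zero mean and unit variance,
# using statistics learned from the training data only.
scaled_sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
scaled_sgd.fit(X_tr, y_tr)

mae_scaled = mean_absolute_error(y_te, scaled_sgd.predict(X_te))
print(mae_scaled)
```

The same pipeline could be fitted on the notebook's `X_train` and `y_train` in place of the bare `SGDRegressor()`; whether it then matches the random forest is something to check rather than assume.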