NN/Numerical Regressor: Vehicle Mileage Prediction
In [ ]:
import pandas as pd
import sklearn
Pandas is mainly used here to detect missing values and clean up the data.
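As a minimal sketch of what handling missing data with pandas can look like, the snippet below counts missing values and then shows two common ways to deal with them. The `toy` DataFrame and its values are made up for illustration; they are not taken from the Auto MPG dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: NaN marks a missing horsepower reading.
toy = pd.DataFrame({
    "MPG": [18.0, 25.0, 31.0, 22.0],
    "Horsepower": [130.0, np.nan, 82.0, np.nan],
})

# Count missing values per column.
missing_per_column = toy.isna().sum()

# Option 1: drop every row that contains a missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill the gaps with the column median.
filled = toy.fillna({"Horsepower": toy["Horsepower"].median()})

print(missing_per_column)
print(dropped.shape, filled.shape)
```

Dropping rows is the simpler choice when only a few rows are affected; filling keeps the row count intact at the cost of introducing estimated values.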
In [ ]:
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
'Acceleration', 'Model Year', 'Origin']
df = pd.read_csv(url, names=column_names,
na_values='?', comment='\t',
sep=' ', skipinitialspace=True)
In [ ]:
df.isnull().sum()
Out[ ]:
MPG             0
Cylinders       0
Displacement    0
Horsepower      6
Weight          0
Acceleration    0
Model Year      0
Origin          0
dtype: int64
In [ ]:
df
Out[ ]:
| | MPG | Cylinders | Displacement | Horsepower | Weight | Acceleration | Model Year | Origin |
---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504.0 | 12.0 | 70 | 1 |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693.0 | 11.5 | 70 | 1 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436.0 | 11.0 | 70 | 1 |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433.0 | 12.0 | 70 | 1 |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449.0 | 10.5 | 70 | 1 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790.0 | 15.6 | 82 | 1 |
394 | 44.0 | 4 | 97.0 | 52.0 | 2130.0 | 24.6 | 82 | 2 |
395 | 32.0 | 4 | 135.0 | 84.0 | 2295.0 | 11.6 | 82 | 1 |
396 | 28.0 | 4 | 120.0 | 79.0 | 2625.0 | 18.6 | 82 | 1 |
397 | 31.0 | 4 | 119.0 | 82.0 | 2720.0 | 19.4 | 82 | 1 |
398 rows × 8 columns
In [ ]:
df["MPG"].mean()
Out[ ]:
23.514572864321607
In [ ]:
df = df.dropna()
In [ ]:
# columns = ["Cylinders", "Model Year", "Origin"]
df = df.astype(float)
Split data into 3 parts
- Train
- Validation
- Test
Here the validation and test sets are merged into a single held-out set because the dataset has too few entries to split three ways.
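For comparison, a full three-way split can be sketched by calling `train_test_split` twice. The arrays `X_demo` and `y_demo` are random stand-ins (an assumption for illustration, not the Auto MPG data), and the 60/20/20 proportions are just one common choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 7 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 7))
y_demo = rng.normal(size=100)

# First hold out 20% of all rows as the test set.
X_rest, X_te, y_rest, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Then take 25% of the remaining 80% as validation (0.25 * 0.8 = 0.2 of the total).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))  # 60 20 20
```

With only a few hundred rows, each of the three parts becomes small, which is the trade-off behind merging validation and test in this notebook.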
In [ ]:
from sklearn.model_selection import train_test_split
import numpy as np
X = df.drop('MPG', axis=1) # Capital because that is a matrix -> Input Matrix
y = df['MPG'] # Lowercase since the data is an array -> Output Array
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# test_size=0.2 -> 20% of the rows go to the test set, the remaining 80% to training
X_train.shape, X_test.shape, y_train.shape, y_test.shape
type(y_train)
Out[ ]:
pandas.core.series.Series
In [ ]:
from tensorflow.keras.layers import Dense, SimpleRNN
import tensorflow as tf
In [ ]:
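my_model = tf.keras.Sequential([
    # Input((7,))  # 7 feature columns remain after dropping MPG
    Dense(100, activation="relu"),
    # parameters: 100 * (7 + 1) = 800
    Dense(10, activation="relu"),
    # parameters: 10 * (100 + 1) = 1010
    # SimpleRNN(units=1, activation='tanh', name='Hidden-Recurrent-Layer'),
    # (disabled: the line had an unbalanced ")" and a recurrent layer
    #  expects 3D sequence input, not the 2D output of a Dense layer)
    Dense(10, activation="relu"),
    # parameters: 10 * (10 + 1) = 110
    Dense(1)
    # parameters: 1 * (10 + 1) = 11
])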
my_model.compile(loss = tf.keras.losses.MeanSquaredError(), optimizer=tf.keras.optimizers.Adam(), metrics=["mae"])
# accuracy -> classification metric: fraction of predictions that match the labels exactly
# MSE, MSLE, MAE -> regression metrics: size of the difference between predicted and expected values
model_history = my_model.fit(X_train, y_train, epochs=10, validation_data=(X_test, y_test))
Epoch 1/10
10/10 [==============================] - 9s 45ms/step - loss: 17373.5020 - mae: 121.5429 - val_loss: 1365.5695 - val_mae: 35.7540
Epoch 2/10
10/10 [==============================] - 0s 13ms/step - loss: 698.9968 - mae: 24.9100 - val_loss: 587.5173 - val_mae: 23.0756
Epoch 3/10
10/10 [==============================] - 0s 23ms/step - loss: 377.1506 - mae: 16.9788 - val_loss: 125.7078 - val_mae: 9.3703
Epoch 4/10
10/10 [==============================] - 0s 17ms/step - loss: 180.9805 - mae: 11.3088 - val_loss: 140.1292 - val_mae: 9.8517
Epoch 5/10
10/10 [==============================] - 0s 19ms/step - loss: 151.1945 - mae: 10.1285 - val_loss: 139.6962 - val_mae: 9.6343
Epoch 6/10
10/10 [==============================] - 0s 15ms/step - loss: 149.4257 - mae: 9.9768 - val_loss: 124.9488 - val_mae: 9.3445
Epoch 7/10
10/10 [==============================] - 0s 15ms/step - loss: 142.7760 - mae: 9.9751 - val_loss: 122.1221 - val_mae: 9.2163
Epoch 8/10
10/10 [==============================] - 0s 11ms/step - loss: 139.6175 - mae: 9.7384 - val_loss: 119.7234 - val_mae: 9.0681
Epoch 9/10
10/10 [==============================] - 0s 11ms/step - loss: 135.7001 - mae: 9.5561 - val_loss: 116.0868 - val_mae: 8.9237
Epoch 10/10
10/10 [==============================] - 0s 16ms/step - loss: 134.3057 - mae: 9.5375 - val_loss: 113.3275 - val_mae: 8.7397
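The per-layer parameter counts noted in the model cell follow the rule that a fully connected layer has one weight per input-unit pair plus one bias per unit. A small sketch of that arithmetic, assuming 7 input features (the columns left after dropping MPG) and a plain stack of Dense layers with widths 100, 10, 10 and 1; `dense_params` is a hypothetical helper, not a Keras function:

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return n_units * (n_inputs + 1)

# Assumed layer widths: 7 input features, then Dense(100), Dense(10), Dense(10), Dense(1).
widths = [7, 100, 10, 10, 1]

# Pair each layer's input width with its output width.
per_layer = [dense_params(n_in, n_out) for n_in, n_out in zip(widths, widths[1:])]
total = sum(per_layer)

print(per_layer, total)  # [800, 1010, 110, 11] 1931
```

On a built Keras model, `my_model.summary()` reports these counts directly and can be used to check the arithmetic.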
val_mae
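A val_mae of about 8.7 means the predictions on the held-out data are off by roughly 8.7 miles per gallon on average. Compared with the mean MPG of about 23.5 computed above, that is a relative error of roughly 37%, so the model is not yet accurate. The loss is still falling at epoch 10, which suggests that training for more epochs could reduce the error further.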
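To judge whether an MAE is large or small, it helps to compute it by hand and compare it with the typical size of the target. The sketch below does this with NumPy on made-up numbers; `y_true` and `y_pred` are hypothetical values, not outputs of the model above.

```python
import numpy as np

# Hypothetical true and predicted MPG values.
y_true = np.array([18.0, 25.0, 31.0, 22.0])
y_pred = np.array([20.0, 21.0, 35.0, 24.0])

errors = y_true - y_pred

# Mean absolute error: average size of the error, in the target's own units (MPG).
mae = np.mean(np.abs(errors))

# Mean squared error: the training loss used above; penalizes large errors more.
mse = np.mean(errors ** 2)

# MAE relative to the mean target value gives a unit-free sense of scale.
relative_mae = mae / y_true.mean()

print(mae, mse, relative_mae)  # 3.0 10.0 0.125
```

Applying the same ratio to the run above (8.74 divided by a mean of about 23.5) gives roughly 0.37.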
In [ ]:
pd.DataFrame(model_history.history).plot()
Out[ ]:
<Axes: >