Problem definition

We want to predict the path of hurricanes from previous records. I decided to use this final exercise to build a time series forecasting model (my first). The task is treated as a sequence prediction problem and approached with an RNN. I will closely follow the approach of Alemany et al. (2019), where the data are mapped onto a grid based on geographic coordinates and the prediction consists of estimating the next grid ID the hurricane will occupy, given previous sequences of grid ID, distance traveled and bearing. My focus is to replicate this approach and understand it in as much detail as possible, while making some changes based on my own considerations.

In [1]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import scipy
import re 
from geopy.distance import great_circle as gcd
import math as Math
import time
from mpl_toolkits.basemap import Basemap
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
import keras
import tensorflow as tf
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation
from keras.layers.recurrent import LSTM
from keras.models import model_from_json

Import and cleaning data

The data are loaded. Almost all of the original variables will be deleted, considering that most of the recorded data correspond to older phenomena for which atmospheric variables such as wind speed or pressure were not captured. We keep the hurricane ID, the date, the geographic coordinates of the center and the maximum wind. Name is also dropped because it is only useful for events after 1950.

In [79]:
data = pd.read_csv('atlantic.csv')


In [80]:
data
Out[80]:
ID Name Date Time Event Status Latitude Longitude Maximum Wind Minimum Pressure ... Low Wind SW Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW Moderate Wind NW High Wind NE High Wind SE High Wind SW High Wind NW
0 AL011851 UNNAMED 18510625 0 HU 28.0N 94.8W 80 -999 ... -999 -999 -999 -999 -999 -999 -999 -999 -999 -999
1 AL011851 UNNAMED 18510625 600 HU 28.0N 95.4W 80 -999 ... -999 -999 -999 -999 -999 -999 -999 -999 -999 -999
2 AL011851 UNNAMED 18510625 1200 HU 28.0N 96.0W 80 -999 ... -999 -999 -999 -999 -999 -999 -999 -999 -999 -999
3 AL011851 UNNAMED 18510625 1800 HU 28.1N 96.5W 80 -999 ... -999 -999 -999 -999 -999 -999 -999 -999 -999 -999
4 AL011851 UNNAMED 18510625 2100 L HU 28.2N 96.8W 80 -999 ... -999 -999 -999 -999 -999 -999 -999 -999 -999 -999
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
49100 AL122015 KATE 20151112 1200 EX 41.3N 50.4W 55 981 ... 180 120 120 120 60 0 0 0 0 0
49101 AL122015 KATE 20151112 1800 EX 41.9N 49.9W 55 983 ... 180 120 120 120 60 0 0 0 0 0
49102 AL122015 KATE 20151113 0 EX 41.5N 49.2W 50 985 ... 200 220 120 120 60 0 0 0 0 0
49103 AL122015 KATE 20151113 600 EX 40.8N 47.5W 45 985 ... 180 220 0 0 0 0 0 0 0 0
49104 AL122015 KATE 20151113 1200 EX 40.7N 45.4W 45 987 ... 150 220 0 0 0 0 0 0 0 0

49105 rows × 22 columns

The data are analyzed to eliminate variables with less than 50% valid data, treating -999 and UNNAMED as missing values. Following this approach, most of the variables are eliminated, keeping only the ID, the geographic coordinates and the maximum wind velocity. As an example, the value frequencies of Name and Minimum Pressure are shown below.
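As a minimal sketch of this screening (treating the dataset's sentinels as NaN; string columns may need stripping first), the share of missing values per column could be computed as follows:

# Fraction of missing values per column, treating -999 and UNNAMED as NaN.
# Columns above 50% missing would be candidates for removal.
missing = data.replace([-999, '-999', 'UNNAMED'], np.nan).isna().mean() * 100
print(missing.sort_values(ascending=False))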

In [81]:
data['Minimum Pressure'].value_counts(normalize=True)*100
Out[81]:
-999     62.455962
 1005     1.798188
 1008     1.722839
 1006     1.645454
 1009     1.629162
           ...    
 902      0.002036
 899      0.002036
 889      0.002036
 888      0.002036
 907      0.002036
Name: Minimum Pressure, Length: 130, dtype: float64
In [82]:
data['Name'].value_counts(normalize=True)*100
Out[82]:
            UNNAMED    54.094288
            FRANCES     0.645555
             ARLENE     0.576316
             BERTHA     0.545769
             DENNIS     0.519295
                         ...    
              SHARY     0.018328
              TAMMY     0.018328
            SIXTEEN     0.016292
            FERNAND     0.014255
             AMELIA     0.012219
Name: Name, Length: 288, dtype: float64
In [83]:
data=data.drop(['Minimum Pressure','Event','Status','Name','Low Wind NE','Low Wind SE','Low Wind SW','Low Wind NW','Moderate Wind NE','Moderate Wind SE','Moderate Wind SW','Moderate Wind NW','High Wind SW','High Wind NE','High Wind SE','High Wind NW'],axis=1)

We manipulate the data so we can access records through an index built from the ID key. We then count the total number of hurricanes and obtain a total of 1814 different events recorded.

In [84]:
keys = list(enumerate(pd.unique(data['ID'])))
total_hurricane_count = len(pd.unique(data['ID']))
print("There are "+str(total_hurricane_count)+" hurricanes in the database")
y = np.zeros((total_hurricane_count))
for x in range(0,total_hurricane_count):
    y[x] = len(pd.DataFrame(data[data['ID'] == keys[x][1]], columns = data.keys()).reset_index(drop = True))
hurricane_amount = pd.DataFrame(y)
There are 1814 hurricanes in the database

The maximum number of observations per hurricane is 133 and the minimum is just 1. The 25th percentile is 14 observations; we use a minimum-length threshold to eliminate events that are too short to be useful (the cell below keeps hurricanes with more than 20 observations). The mean number of observations per event is 27, and only a few events have an extreme number of records, over 100, with 133 observations for the longest event recorded.
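As a minimal sketch, such a percentile-based cutoff could be read directly from the distribution of observation counts:

# 25th percentile of the number of observations per hurricane (14.0 here)
p25 = hurricane_amount[0].quantile(0.25)
print("25th percentile of observations per hurricane:", p25)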

In [85]:
hurricane_amount.describe()
Out[85]:
0
count 1814.000000
mean 27.070011
std 16.983268
min 1.000000
25% 14.000000
50% 23.000000
75% 37.000000
max 133.000000
In [86]:
minObs= (hurricane_amount>20)
keysMinObs=[]
for  i in minObs.index:
    if minObs.loc[i][0]:
        keysMinObs.append(keys[i][1])
data = data[data['ID'].isin(keysMinObs)]
keys = list(enumerate(pd.unique(data['ID'])))

After deleting the short hurricanes, the histogram shows a strong positive skew: as expected, most observations are concentrated between 20 and 40. The long right tail accounts for the longest hurricanes, which are not treated as outliers at this stage. A second model will be created that also eliminates events above the 75th percentile.

In [87]:
total_hurricane_count = len(pd.unique(data['ID']))

print(total_hurricane_count)

y = np.zeros((total_hurricane_count))
for x in range(0,total_hurricane_count):
    y[x] = len(pd.DataFrame(data[data['ID'] == keys[x][1]], columns = data.keys()).reset_index(drop = True))
    
hurricane_amount = pd.DataFrame(y)
hurricane_amount.describe()
dist = hurricane_amount.plot.hist(bins=30, edgecolor='white', histtype='barstacked', color='black', legend=None, linewidth=1.2)
plt.show()
1039

Feature Engineering

The distance traveled since the last observation and the bearing (travel direction based on the current and previous points) will be estimated from the original geographic coordinates. For this we transform the coordinate format into signed numeric values according to the hemisphere, and from there calculate the great-circle distance and the bearing.

In [88]:
def Nostr(x):
    sp=re.split('(N|W|S|E)',x)
    if sp[1] == 'W' or sp[1] == 'S':
        num=float(sp[0])*-1
    else:
        num=float(sp[0])
    return num
data['Latitude'] = data['Latitude'].apply(Nostr)
data['Longitude'] = data['Longitude'].apply(Nostr)
In [89]:
total_data_count = len(data)
y = np.zeros(total_hurricane_count)
data['distance'] = np.zeros(total_data_count)
data['bearing'] = np.zeros(total_data_count)

for x in range(0, total_hurricane_count):
    # All records for one hurricane, keeping the original row index
    t = pd.DataFrame(data[data['ID'] == keys[x][1]], columns = data.keys()).reset_index(drop = False)
    dst = 0
    prev = (0,0)
    
    for p in zip(t['Latitude'], t['Longitude']):
        
        # The first fix of each hurricane has no previous point
        if prev == (0,0):
            prev = p
            continue 
        # Great-circle distance (km) from the previous fix, written back by row index
        data.at[t[(t['Latitude'] == p[0]) & (t['Longitude'] == p[1])]['index'].values[0],'distance']=gcd(prev,p).kilometers
        
        dLon = p[1] - prev[1];  
        temp = float(p[0])
        y_x = Math.sin(dLon) * Math.cos(temp)
        
        x_x = Math.cos(p[1]) * Math.sin(temp) - Math.sin(p[1]) * Math.cos(temp) * Math.cos(dLon)
        # Bearing estimated from the coordinate differences, folded into [0, 360)
        brng = Math.degrees(Math.atan2(y_x, x_x)) 
        if (brng < 0):
            brng+= 360
        
        data.at[t[(t['Latitude'] == p[0]) & (t['Longitude'] == p[1])]['index'].values[0], 'bearing']= brng
        dst += gcd(prev,p).kilometers
        prev = p
    # Total path length of this hurricane
    y[x] = dst

hurricane_distance = pd.DataFrame(y)
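For reference, the textbook initial-bearing formula works in radians and uses the latitude of the previous point; a minimal sketch of that formula, written as a hypothetical helper, would be:

# Textbook initial bearing between two (lat, lon) points in decimal degrees,
# returned in [0, 360). Hypothetical helper, for reference only.
def initial_bearing(prev_point, point):
    lat1, lon1 = map(Math.radians, prev_point)
    lat2, lon2 = map(Math.radians, point)
    d_lon = lon2 - lon1
    y = Math.sin(d_lon) * Math.cos(lat2)
    x = Math.cos(lat1) * Math.sin(lat2) - Math.sin(lat1) * Math.cos(lat2) * Math.cos(d_lon)
    return (Math.degrees(Math.atan2(y, x)) + 360) % 360

# First step of AL041851 (13.4N 48.0W -> 13.7N 49.5W): roughly 282 degrees (WNW)
print(initial_bearing((13.4, -48.0), (13.7, -49.5)))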

We now have a table of per-observation features for the 1039 hurricanes, describing the vectorial behavior of their movement. The distance traveled shows great variability: some hurricanes travel only about 718 kilometers in total, the mean traveled distance is about 4810 kilometers, and some hurricanes traveled up to about 13902 kilometers.

In [90]:
data
Out[90]:
ID Date Time Latitude Longitude Maximum Wind distance bearing
16 AL041851 18510816 0 13.4 -48.0 40 0.000000 0.000000
17 AL041851 18510816 600 13.7 -49.5 40 165.545433 326.268781
18 AL041851 18510816 1200 14.0 -51.0 50 165.342833 349.579563
19 AL041851 18510816 1800 14.4 -52.8 50 199.067179 161.573455
20 AL041851 18510817 0 14.9 -54.6 60 201.466132 121.516233
... ... ... ... ... ... ... ... ...
49080 AL112015 20151014 0 36.0 -9.0 20 67.316208 359.140120
49081 AL112015 20151014 600 35.5 -8.7 20 61.838635 323.341560
49082 AL112015 20151014 1200 35.1 -8.4 20 52.148787 210.415062
49083 AL112015 20151014 1800 35.0 -8.0 15 38.071961 204.811894
49084 AL112015 20151015 0 35.2 -7.7 15 35.205672 195.594607

39255 rows × 8 columns

In [91]:
hurricane_distance.describe()
Out[91]:
0
count 1039.000000
mean 4810.571326
std 2367.998761
min 718.601374
25% 3047.282880
50% 4348.485268
75% 6149.359920
max 13902.037275

In the following table, the IDs of the 6 hurricanes that traveled the longest distances are presented along with the distance in kilometers and the number of observations. As expected, hurricanes with a higher number of observations are those that travel the most.

In [92]:
print ('Top 6 Hurricanes with longest paths')
for x in hurricane_distance.nlargest(6, 0).index:
    print (keys[x][1], "-", hurricane_distance.loc[x][0], "kilometers -", hurricane_amount.loc[x][0], " ","number of observations")
Top 6 Hurricanes with longest paths
AL031899 - 13902.037275075161 kilometers - 133.0   number of observations
AL032000 - 13519.340617835665 kilometers - 87.0   number of observations
AL061966 - 13439.368335126639 kilometers - 69.0   number of observations
AL011900 - 12774.077723429358 kilometers - 81.0   number of observations
AL092004 - 12484.604805793771 kilometers - 94.0   number of observations
AL122011 - 12481.711146561658 kilometers - 64.0   number of observations

The positive correlation between distance and number of observations is best illustrated in the following scatter plot and by calculating Pearson's correlation coefficient, which is 0.77, confirming a strong positive correlation.
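Since scipy is already imported, a quick alternative sketch is scipy.stats.pearsonr, which also returns a p-value for the correlation:

# Pearson correlation coefficient with significance test
from scipy import stats
r, p_value = stats.pearsonr(hurricane_distance[0], hurricane_amount[0])
print("Pearson r = %.3f, p-value = %.3g" % (r, p_value))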

In [93]:
print( np.corrcoef(hurricane_distance[0], hurricane_amount[0]) )

corr = plt.scatter(hurricane_distance[0], hurricane_amount[0], color='black')
[[1.         0.77093697]
 [0.77093697 1.        ]]

For illustration, the following figures plot the paths of the 5 longest and 5 shortest hurricanes. It is clear that the behavior differs greatly depending on the length of the path, which may complicate the regression task.

In [94]:
n = 5

plt.figure(figsize=(10,5))
m = Basemap(llcrnrlon=-110.,llcrnrlat=5.,urcrnrlon=10.,urcrnrlat=60.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',
            projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.drawparallels(np.arange(10,90,20),labels=[1,1,0,1])
m.drawmeridians(np.arange(-180,180,30),labels=[1,1,0,1])

colo=['red','blue','green','black','orange']
i=0
for x in hurricane_amount.nlargest(n,0).index:
    largest_hurr = data[data['ID'] == keys[x][1]]
    lat = largest_hurr['Latitude'].values
    long = largest_hurr['Longitude'].values
    xpt, ypt = m(long, lat)
    m.plot(xpt, ypt, linewidth=2, color=colo[i])
    i+=1
plt.show()
In [95]:
n = 5 

plt.figure(figsize=(10,5))
m = Basemap(llcrnrlon=-110.,llcrnrlat=5.,urcrnrlon=10.,urcrnrlat=60.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',
            projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.drawparallels(np.arange(10,90,20),labels=[1,1,0,1])
m.drawmeridians(np.arange(-180,180,30),labels=[1,1,0,1])

colo=['red','blue','green','black','orange']
i=0
for x in hurricane_amount.nsmallest(n,0).index:
    largest_hurr = data[data['ID'] == keys[x][1]]
    lat = largest_hurr['Latitude'].values
    long = largest_hurr['Longitude'].values
    xpt, ypt = m(long, lat)
    m.plot(xpt, ypt, linewidth=2, color=colo[i])
    i+=1
plt.show()

The distributions of the final features are presented. The distribution of distance traveled in each 6-hour interval has positive kurtosis, with most values close to the mean, and is positively skewed by the extreme values. The bearing distribution shows the most common directions in which hurricanes travel; the odd thing is the peak at 180 degrees, which represents due south and might be biased by extreme observations.
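A quick sketch to probe that peak is to count how many observations have a bearing near 180 degrees and see which hurricanes contribute most of them:

# Observations heading roughly due south (bearing within +/- 5 degrees of 180)
south = data[(data['bearing'] > 175) & (data['bearing'] < 185)]
print(len(south), "observations with bearing near 180 degrees")
print(south['ID'].value_counts().head())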

In [96]:
dist = data[data['distance'] > 0]['distance']
ser = pd.Series(dist)
ser.plot(kind='kde', cmap='gray')
Out[96]:
<AxesSubplot:ylabel='Density'>
In [97]:
direc = data[data['bearing'] > 0]
direc = (direc['bearing'])
ser = pd.Series(direc)
ser.plot(kind='kde', cmap='gray')
Out[97]:
<AxesSubplot:ylabel='Density'>
In [98]:
plt.figure(figsize=(10,5))
m = Basemap(llcrnrlon=-110.,llcrnrlat=5.,urcrnrlon=10.,urcrnrlat=60.,
            rsphere=(6378137.00,6356752.3142),
            resolution='l',
            projection='merc',
            lat_0=40.,lon_0=-20.,lat_ts=20.)
m.drawcoastlines()
m.drawcountries()
m.drawmapboundary(fill_color='#99ffff')
m.fillcontinents(color='#cc9966',lake_color='#99ffff')
m.drawparallels(np.arange(10,90,20),labels=[1,1,0,1])
m.drawmeridians(np.arange(-180,180,30),labels=[1,1,0,1])

lat = data['Latitude'].values
long = data['Longitude'].values
xpt, ypt = m(long, lat)
m.scatter(xpt, ypt, .3, color='red')
plt.show()

Now the label is created. The label consists of a grid ID calculated from the coordinates. There is a loss of accuracy as values are truncated to integers in several parts of the calculation; this is also a generalization step in which only relatively large displacements are captured.

In [99]:
data['gridID'] = np.zeros(total_data_count)
lat_min = 7.2
long_min = -109.3
lat_interval = round(66 - 7.2)
long_interval = round(13.5 + 109.3)

data['gridID'] = np.floor(data['Latitude'] - 7.200)* long_interval  + np.floor(data['Longitude'] + 109.3)
data['gridID'] = round(data['gridID'])
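For orientation, the grid encoding can be inverted to recover the approximate south-west corner of a cell; a minimal sketch, written as a hypothetical helper and assuming the same origin (7.2 N, 109.3 W) and the 123-column grid implied by long_interval:

# South-west corner (lat, lon) of a 1-degree grid cell given its gridID
def grid_to_corner(grid_id, n_cols=123, lat0=7.2, lon0=-109.3):
    return lat0 + (grid_id // n_cols), lon0 + (grid_id % n_cols)

# e.g. gridID 799 (first row of the table below) maps back to roughly (13.2, -48.3),
# the cell containing latitude 13.4, longitude -48.0
print(grid_to_corner(799))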
In [100]:
names = data['ID'].unique()
data.drop(['Date', 'Time', 'Latitude', 'Longitude'], axis = 1, inplace = True)
data.head()
Out[100]:
ID Maximum Wind distance bearing gridID
16 AL041851 40 0.000000 0.000000 799.0
17 AL041851 40 165.545433 326.268781 797.0
18 AL041851 50 165.342833 349.579563 796.0
19 AL041851 50 199.067179 161.573455 917.0
20 AL041851 60 201.466132 121.516233 915.0
In [102]:
names
Out[102]:
array(['AL041851', 'AL011852', 'AL041852', ..., 'AL062015', 'AL102015',
       'AL112015'], dtype=object)
In [ ]:
scalers = {}
gridScalers = {}
for name in names:
    # Scale each hurricane's features to [0, 1] independently, keeping the
    # fitted scaler so values could later be transformed back
    scalers[name] = MinMaxScaler(feature_range=(0, 1))
    hurricane = data[data['ID'] == name]
    hurricane.drop('ID', axis = 1, inplace= True)
    
    hurricane = pd.DataFrame(scalers[name].fit_transform(hurricane), columns=['Maximum Wind','distance','bearing','gridID'])
    data.loc[data['ID'] == name, ['Maximum Wind','distance', 'bearing', 'gridID']] = hurricane.values

The following lines build a per-hurricane data structure, which is then converted into a padded sequence array with Keras, used to manage and predict sequences of different lengths.
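As a toy illustration of the options used below: with padding='post' short sequences are zero-padded at the end, while with truncating='pre' long sequences keep only their most recent steps.

# Toy example of the padding/truncation behavior (values are arbitrary)
toy = [[1, 2, 3], [4, 5, 6, 7, 8]]
print(keras.preprocessing.sequence.pad_sequences(
    toy, maxlen=4, padding='post', truncating='pre', value=0))
# [[1 2 3 0]
#  [5 6 7 8]]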

In [104]:
data_pad = []
for key in np.unique(data['ID']):
    data_pad += [data[ data.loc[:, 'ID'] == key].loc[:, ['Maximum Wind','distance', 'bearing', 'gridID']]]
In [105]:
padded_data = keras.preprocessing.sequence.pad_sequences(data_pad, maxlen=60, dtype='float', padding='post', truncating='pre', value=0.0)
print(len(padded_data[0]))
print(len(padded_data))
padded_data
60
1039
Out[105]:
array([[[3.33333333e-01, 0.00000000e+00, 0.00000000e+00, 1.21852153e-03],
        [3.33333333e-01, 4.45292032e-01, 6.05519716e-02, 8.12347685e-04],
        [3.33333333e-01, 3.98119661e-01, 5.70225160e-02, 4.06173842e-04],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 7.93650794e-03],
        [0.00000000e+00, 2.84649530e-01, 4.42534039e-01, 6.61375661e-03],
        [1.42857143e-01, 2.05527817e-01, 4.35757391e-01, 6.61375661e-03],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 6.11081810e-01, 3.67786230e-01, 4.55893254e-02],
        [0.00000000e+00, 5.81227832e-01, 1.89498076e-01, 4.52186805e-02],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       ...,

       [[0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.44690517e-01],
        [0.00000000e+00, 1.00079070e-01, 5.01424950e-01, 3.44690517e-01],
        [0.00000000e+00, 1.33050103e-01, 5.01270684e-01, 3.45258376e-01],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[4.00000000e-01, 0.00000000e+00, 0.00000000e+00, 6.91808597e-01],
        [4.00000000e-01, 5.81923475e-01, 7.56319799e-01, 6.90997567e-01],
        [4.00000000e-01, 6.00322425e-01, 8.28346716e-01, 6.90186537e-01],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]],

       [[1.66666667e-01, 0.00000000e+00, 0.00000000e+00, 4.24092409e-01],
        [5.00000000e-01, 2.70432545e-01, 5.83997261e-01, 6.27062706e-01],
        [6.66666667e-01, 3.34688951e-01, 5.73023215e-01, 6.25412541e-01],
        ...,
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
        [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]]])

The padded data is reshaped into a two-dimensional array (60 × 1039 rows by the 4 features defined), concatenating all hurricanes into one long sequence.

In [106]:
temp_flat = padded_data.reshape(60*1039,4)
temp_flat
Out[106]:
array([[3.33333333e-01, 0.00000000e+00, 0.00000000e+00, 1.21852153e-03],
       [3.33333333e-01, 4.45292032e-01, 6.05519716e-02, 8.12347685e-04],
       [3.33333333e-01, 3.98119661e-01, 5.70225160e-02, 4.06173842e-04],
       ...,
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00],
       [0.00000000e+00, 0.00000000e+00, 0.00000000e+00, 0.00000000e+00]])

Now the model architecture is defined. As described by the designers, five layers in total were employed: an input layer, three hidden vector-state layers, and an output layer. The input layer takes in a data tuple, i.e. a sequence of features. The output layer contains an LSTM unit along with a dropout value, a dense layer, and the final activation layer.

In [107]:
def load_data(stock, seq_len, amount_of_features):
    sequence_length = seq_len + 1
    result = []

    # Sliding windows of seq_len + 1 consecutive rows
    for index in range(len(stock) - sequence_length):
        seq = stock[index: index + sequence_length]
        result.append(seq)
    
    result = np.array(result)
    # 90/10 train/test split; the target is the gridID (last feature)
    # of the last row of each window
    row = len(result) * 0.9
    train = result[:int(row), :]
    x_train = train[:, :-1]
    y_train = train[:, -1][:,-1]
    x_test = result[int(row):, :-1]
    y_test = result[int(row):, -1][:,-1]

    x_train = np.reshape(x_train, (x_train.shape[0], x_train.shape[1], amount_of_features))
    x_test = np.reshape(x_test, (x_test.shape[0], x_test.shape[1], amount_of_features))  

    return [x_train, y_train, x_test, y_test]
def build_model(layers):
    model = Sequential()

    # Three stacked LSTM layers returning full sequences, each followed by dropout
    for x in range(0,3):
        model.add(LSTM(units=layers[1], input_shape=(None, layers[0]), return_sequences=True))
        model.add(Dropout(0.1))

    # Final LSTM collapses the sequence, then a dense layer with tanh activation
    model.add(LSTM(layers[2], return_sequences=False)) 
    model.add(Dropout(0.1))

    model.add(Dense(units=layers[2]))
    model.add(Activation("tanh"))

    start = time.time()
    opt = tf.keras.optimizers.SGD(learning_rate=1)
    model.compile(loss="mse", optimizer=opt,metrics=['accuracy'])
    print("Compilation Time : ", time.time() - start)
    return model

The data are divided into training and test sets.

In [109]:
seq_len = 10
feature_count = 4
X_train, y_train, X_test, y_test = load_data(temp_flat[::-1], seq_len, feature_count)
print("X_train", X_train.shape)
print("y_train", y_train.shape)
print("X_test", X_test.shape)
print("y_test", y_test.shape)
X_train (56096, 10, 4)
y_train (56096,)
X_test (6233, 10, 4)
y_test (6233,)

The model is compiled.

In [110]:
model = build_model([feature_count, seq_len, 1])
Compilation Time :  0.8611505031585693

And trained. The accuracy on the validation set does not vary significantly during training, improving only slowly, while the loss tends to decrease but not by much. After trying several combinations of learning rate, activation function and loss function, this was the best result, and due to computational constraints no more epochs were run.

In [111]:
model.fit(X_train, y_train, batch_size=512, epochs=100, validation_split=0.1, verbose=1)
Epoch 1/100
99/99 [==============================] - 41s 321ms/step - loss: 0.1139 - accuracy: 0.3959 - val_loss: 0.1057 - val_accuracy: 0.4262
Epoch 2/100
99/99 [==============================] - 31s 314ms/step - loss: 0.1106 - accuracy: 0.3993 - val_loss: 0.1044 - val_accuracy: 0.4262
Epoch 3/100
99/99 [==============================] - 31s 315ms/step - loss: 0.1175 - accuracy: 0.3905 - val_loss: 0.0834 - val_accuracy: 0.4250
Epoch 4/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0992 - accuracy: 0.3947 - val_loss: 0.0660 - val_accuracy: 0.4271
Epoch 5/100
99/99 [==============================] - 32s 323ms/step - loss: 0.0857 - accuracy: 0.3970 - val_loss: 0.0866 - val_accuracy: 0.4273
Epoch 6/100
99/99 [==============================] - 32s 325ms/step - loss: 0.0693 - accuracy: 0.4003 - val_loss: 0.0429 - val_accuracy: 0.4289
Epoch 7/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0594 - accuracy: 0.4023 - val_loss: 0.0412 - val_accuracy: 0.4299
Epoch 8/100
99/99 [==============================] - 33s 330ms/step - loss: 0.0485 - accuracy: 0.4032 - val_loss: 0.0391 - val_accuracy: 0.4301
Epoch 9/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0379 - accuracy: 0.4039 - val_loss: 0.0412 - val_accuracy: 0.4305
Epoch 10/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0388 - accuracy: 0.4041 - val_loss: 0.0206 - val_accuracy: 0.4307
Epoch 11/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0340 - accuracy: 0.4044 - val_loss: 0.0208 - val_accuracy: 0.4307
Epoch 12/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0332 - accuracy: 0.4049 - val_loss: 0.0242 - val_accuracy: 0.4307
Epoch 13/100
99/99 [==============================] - 31s 308ms/step - loss: 0.0309 - accuracy: 0.4047 - val_loss: 0.0193 - val_accuracy: 0.4307
Epoch 14/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0313 - accuracy: 0.4046 - val_loss: 0.0199 - val_accuracy: 0.4308
Epoch 15/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0318 - accuracy: 0.4046 - val_loss: 0.0234 - val_accuracy: 0.4307
Epoch 16/100
99/99 [==============================] - 32s 327ms/step - loss: 0.0303 - accuracy: 0.4045 - val_loss: 0.0180 - val_accuracy: 0.4307
Epoch 17/100
99/99 [==============================] - 32s 323ms/step - loss: 0.0313 - accuracy: 0.4046 - val_loss: 0.0187 - val_accuracy: 0.4308
Epoch 18/100
99/99 [==============================] - 31s 309ms/step - loss: 0.0301 - accuracy: 0.4049 - val_loss: 0.0196 - val_accuracy: 0.4308
Epoch 19/100
99/99 [==============================] - 32s 320ms/step - loss: 0.0291 - accuracy: 0.4050 - val_loss: 0.0179 - val_accuracy: 0.4307
Epoch 20/100
99/99 [==============================] - 33s 329ms/step - loss: 0.0309 - accuracy: 0.4047 - val_loss: 0.0177 - val_accuracy: 0.4307
Epoch 21/100
99/99 [==============================] - 32s 322ms/step - loss: 0.0300 - accuracy: 0.4047 - val_loss: 0.0179 - val_accuracy: 0.4307
Epoch 22/100
99/99 [==============================] - 31s 315ms/step - loss: 0.0295 - accuracy: 0.4047 - val_loss: 0.0193 - val_accuracy: 0.4310
Epoch 23/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0310 - accuracy: 0.4049 - val_loss: 0.0189 - val_accuracy: 0.4307
Epoch 24/100
99/99 [==============================] - 32s 321ms/step - loss: 0.0324 - accuracy: 0.4047 - val_loss: 0.0230 - val_accuracy: 0.4310
Epoch 25/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0308 - accuracy: 0.4046 - val_loss: 0.0179 - val_accuracy: 0.4307
Epoch 26/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0296 - accuracy: 0.4046 - val_loss: 0.0203 - val_accuracy: 0.4307
Epoch 27/100
99/99 [==============================] - 32s 322ms/step - loss: 0.0298 - accuracy: 0.4048 - val_loss: 0.0176 - val_accuracy: 0.4307
Epoch 28/100
99/99 [==============================] - 32s 327ms/step - loss: 0.0297 - accuracy: 0.4048 - val_loss: 0.0192 - val_accuracy: 0.4307
Epoch 29/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0304 - accuracy: 0.4048 - val_loss: 0.0189 - val_accuracy: 0.4307
Epoch 30/100
99/99 [==============================] - 31s 318ms/step - loss: 0.0304 - accuracy: 0.4049 - val_loss: 0.0186 - val_accuracy: 0.4310
Epoch 31/100
99/99 [==============================] - 31s 318ms/step - loss: 0.0293 - accuracy: 0.4048 - val_loss: 0.0265 - val_accuracy: 0.4307
Epoch 32/100
99/99 [==============================] - 31s 309ms/step - loss: 0.0301 - accuracy: 0.4047 - val_loss: 0.0200 - val_accuracy: 0.4307
Epoch 33/100
99/99 [==============================] - 31s 308ms/step - loss: 0.0310 - accuracy: 0.4047 - val_loss: 0.0176 - val_accuracy: 0.4308
Epoch 34/100
99/99 [==============================] - 31s 313ms/step - loss: 0.0285 - accuracy: 0.4048 - val_loss: 0.0187 - val_accuracy: 0.4310
Epoch 35/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0305 - accuracy: 0.4048 - val_loss: 0.0225 - val_accuracy: 0.4310
Epoch 36/100
99/99 [==============================] - 31s 309ms/step - loss: 0.0289 - accuracy: 0.4048 - val_loss: 0.0177 - val_accuracy: 0.4308
Epoch 37/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0289 - accuracy: 0.4049 - val_loss: 0.0171 - val_accuracy: 0.4307
Epoch 38/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0307 - accuracy: 0.4046 - val_loss: 0.0198 - val_accuracy: 0.4307
Epoch 39/100
99/99 [==============================] - 32s 328ms/step - loss: 0.0314 - accuracy: 0.4048 - val_loss: 0.0177 - val_accuracy: 0.4307
Epoch 40/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0285 - accuracy: 0.4047 - val_loss: 0.0182 - val_accuracy: 0.4307
Epoch 41/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0303 - accuracy: 0.4045 - val_loss: 0.0176 - val_accuracy: 0.4308
Epoch 42/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0290 - accuracy: 0.4049 - val_loss: 0.0199 - val_accuracy: 0.4307
Epoch 43/100
99/99 [==============================] - 33s 330ms/step - loss: 0.0307 - accuracy: 0.4047 - val_loss: 0.0201 - val_accuracy: 0.4310
Epoch 44/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0294 - accuracy: 0.4047 - val_loss: 0.0210 - val_accuracy: 0.4307
Epoch 45/100
99/99 [==============================] - 33s 337ms/step - loss: 0.0305 - accuracy: 0.4048 - val_loss: 0.0213 - val_accuracy: 0.4307
Epoch 46/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0289 - accuracy: 0.4048 - val_loss: 0.0184 - val_accuracy: 0.4310
Epoch 47/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0295 - accuracy: 0.4047 - val_loss: 0.0186 - val_accuracy: 0.4307
Epoch 48/100
99/99 [==============================] - 31s 310ms/step - loss: 0.0308 - accuracy: 0.4046 - val_loss: 0.0265 - val_accuracy: 0.4310
Epoch 49/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0286 - accuracy: 0.4051 - val_loss: 0.0184 - val_accuracy: 0.4307
Epoch 50/100
99/99 [==============================] - 31s 313ms/step - loss: 0.0294 - accuracy: 0.4048 - val_loss: 0.0175 - val_accuracy: 0.4307
Epoch 51/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0317 - accuracy: 0.4045 - val_loss: 0.0189 - val_accuracy: 0.4307
Epoch 52/100
99/99 [==============================] - 31s 310ms/step - loss: 0.0295 - accuracy: 0.4048 - val_loss: 0.0191 - val_accuracy: 0.4310
Epoch 53/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0299 - accuracy: 0.4049 - val_loss: 0.0172 - val_accuracy: 0.4307
Epoch 54/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0292 - accuracy: 0.4048 - val_loss: 0.0176 - val_accuracy: 0.4307
Epoch 55/100
99/99 [==============================] - 32s 325ms/step - loss: 0.0302 - accuracy: 0.4049 - val_loss: 0.0208 - val_accuracy: 0.4307
Epoch 56/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0285 - accuracy: 0.4048 - val_loss: 0.0179 - val_accuracy: 0.4307
Epoch 57/100
99/99 [==============================] - 30s 308ms/step - loss: 0.0306 - accuracy: 0.4050 - val_loss: 0.0216 - val_accuracy: 0.4307
Epoch 58/100
99/99 [==============================] - 31s 309ms/step - loss: 0.0281 - accuracy: 0.4048 - val_loss: 0.0181 - val_accuracy: 0.4310
Epoch 59/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0299 - accuracy: 0.4050 - val_loss: 0.0222 - val_accuracy: 0.4310
Epoch 60/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0300 - accuracy: 0.4047 - val_loss: 0.0198 - val_accuracy: 0.4310
Epoch 61/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0287 - accuracy: 0.4048 - val_loss: 0.0186 - val_accuracy: 0.4307
Epoch 62/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0300 - accuracy: 0.4050 - val_loss: 0.0259 - val_accuracy: 0.4305
Epoch 63/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0293 - accuracy: 0.4049 - val_loss: 0.0171 - val_accuracy: 0.4308
Epoch 64/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0286 - accuracy: 0.4047 - val_loss: 0.0204 - val_accuracy: 0.4310
Epoch 65/100
99/99 [==============================] - 31s 310ms/step - loss: 0.0282 - accuracy: 0.4048 - val_loss: 0.0183 - val_accuracy: 0.4310
Epoch 66/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0295 - accuracy: 0.4045 - val_loss: 0.0185 - val_accuracy: 0.4310
Epoch 67/100
99/99 [==============================] - 31s 315ms/step - loss: 0.0305 - accuracy: 0.4047 - val_loss: 0.0234 - val_accuracy: 0.4305
Epoch 68/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0303 - accuracy: 0.4048 - val_loss: 0.0170 - val_accuracy: 0.4308
Epoch 69/100
99/99 [==============================] - 31s 310ms/step - loss: 0.0285 - accuracy: 0.4046 - val_loss: 0.0180 - val_accuracy: 0.4307
Epoch 70/100
99/99 [==============================] - 32s 323ms/step - loss: 0.0277 - accuracy: 0.4051 - val_loss: 0.0176 - val_accuracy: 0.4307
Epoch 71/100
99/99 [==============================] - 33s 332ms/step - loss: 0.0299 - accuracy: 0.4048 - val_loss: 0.0264 - val_accuracy: 0.4310
Epoch 72/100
99/99 [==============================] - 32s 318ms/step - loss: 0.0283 - accuracy: 0.4050 - val_loss: 0.0169 - val_accuracy: 0.4308
Epoch 73/100
99/99 [==============================] - 32s 319ms/step - loss: 0.0290 - accuracy: 0.4050 - val_loss: 0.0175 - val_accuracy: 0.4307
Epoch 74/100
99/99 [==============================] - 34s 339ms/step - loss: 0.0301 - accuracy: 0.4046 - val_loss: 0.0185 - val_accuracy: 0.4310
Epoch 75/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0286 - accuracy: 0.4047 - val_loss: 0.0220 - val_accuracy: 0.4310
Epoch 76/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0290 - accuracy: 0.4049 - val_loss: 0.0179 - val_accuracy: 0.4310
Epoch 77/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0289 - accuracy: 0.4049 - val_loss: 0.0200 - val_accuracy: 0.4310
Epoch 78/100
99/99 [==============================] - 31s 309ms/step - loss: 0.0285 - accuracy: 0.4049 - val_loss: 0.0209 - val_accuracy: 0.4310
Epoch 79/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0288 - accuracy: 0.4048 - val_loss: 0.0203 - val_accuracy: 0.4307
Epoch 80/100
99/99 [==============================] - 31s 310ms/step - loss: 0.0302 - accuracy: 0.4047 - val_loss: 0.0175 - val_accuracy: 0.4310
Epoch 81/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0288 - accuracy: 0.4049 - val_loss: 0.0193 - val_accuracy: 0.4310
Epoch 82/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0302 - accuracy: 0.4049 - val_loss: 0.0171 - val_accuracy: 0.4307
Epoch 83/100
99/99 [==============================] - 31s 315ms/step - loss: 0.0269 - accuracy: 0.4048 - val_loss: 0.0173 - val_accuracy: 0.4308
Epoch 84/100
99/99 [==============================] - 31s 315ms/step - loss: 0.0293 - accuracy: 0.4050 - val_loss: 0.0170 - val_accuracy: 0.4308
Epoch 85/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0276 - accuracy: 0.4049 - val_loss: 0.0219 - val_accuracy: 0.4307
Epoch 86/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0288 - accuracy: 0.4048 - val_loss: 0.0219 - val_accuracy: 0.4307
Epoch 87/100
99/99 [==============================] - 31s 316ms/step - loss: 0.0288 - accuracy: 0.4049 - val_loss: 0.0185 - val_accuracy: 0.4307
Epoch 88/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0286 - accuracy: 0.4047 - val_loss: 0.0192 - val_accuracy: 0.4307
Epoch 89/100
99/99 [==============================] - 31s 311ms/step - loss: 0.0282 - accuracy: 0.4048 - val_loss: 0.0177 - val_accuracy: 0.4310
Epoch 90/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0283 - accuracy: 0.4048 - val_loss: 0.0188 - val_accuracy: 0.4310
Epoch 91/100
99/99 [==============================] - 31s 313ms/step - loss: 0.0298 - accuracy: 0.4048 - val_loss: 0.0254 - val_accuracy: 0.4305
Epoch 92/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0288 - accuracy: 0.4050 - val_loss: 0.0175 - val_accuracy: 0.4307
Epoch 93/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0284 - accuracy: 0.4047 - val_loss: 0.0185 - val_accuracy: 0.4307
Epoch 94/100
99/99 [==============================] - 31s 317ms/step - loss: 0.0290 - accuracy: 0.4048 - val_loss: 0.0171 - val_accuracy: 0.4308
Epoch 95/100
99/99 [==============================] - 32s 322ms/step - loss: 0.0286 - accuracy: 0.4049 - val_loss: 0.0181 - val_accuracy: 0.4307
Epoch 96/100
99/99 [==============================] - 32s 322ms/step - loss: 0.0292 - accuracy: 0.4049 - val_loss: 0.0174 - val_accuracy: 0.4308
Epoch 97/100
99/99 [==============================] - 32s 328ms/step - loss: 0.0285 - accuracy: 0.4045 - val_loss: 0.0182 - val_accuracy: 0.4307
Epoch 98/100
99/99 [==============================] - 31s 312ms/step - loss: 0.0287 - accuracy: 0.4047 - val_loss: 0.0199 - val_accuracy: 0.4310
Epoch 99/100
99/99 [==============================] - 31s 313ms/step - loss: 0.0291 - accuracy: 0.4048 - val_loss: 0.0221 - val_accuracy: 0.4307
Epoch 100/100
99/99 [==============================] - 31s 314ms/step - loss: 0.0278 - accuracy: 0.4048 - val_loss: 0.0172 - val_accuracy: 0.4307
Out[111]:
<keras.callbacks.History at 0x1bb9efd3b20>

After evaluating the model on the test dataset, the MSE and RMSE change only slightly and maintain the overall result. The accuracy achieved was around 41 percent on the test data, while the accuracy reported in the original publication was about 60 percent. The difference might be due to the number of samples used, the thresholds for eliminating events, or the features used.

In [112]:
trainScore = model.evaluate(X_train, y_train, verbose=0)
print('Train Score: %.4f MSE (%.4f RMSE)' % (trainScore[0], Math.sqrt(trainScore[0])))

testScore = model.evaluate(X_test, y_test, verbose=0)
print('Test Score: %.4f MSE (%.4f RMSE)' % (testScore[0], Math.sqrt(testScore[0])))
Train Score: 0.0173 MSE (0.1314 RMSE)
Test Score: 0.0174 MSE (0.1318 RMSE)

The model is saved and then reloaded to make some predictions on the test data.

In [113]:
model_json = model.to_json()
with open("model.json", "w") as json_file:
    json_file.write(model_json)

model.save_weights("model.h5")
print("Saved Model to Disk")
Saved Model to Disk
In [114]:
# Load json and create model
json_file = open('model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)

# Load weights into new model
loaded_model.load_weights("model.h5")
print("Loaded Model from Disk")
model = loaded_model
WARNING:tensorflow:Layer lstm will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_1 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_2 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
WARNING:tensorflow:Layer lstm_3 will not use cuDNN kernels since it doesn't meet the criteria. It will use a generic GPU kernel as fallback when running on GPU.
Loaded Model from Disk
In [115]:
pred = model.predict(X_test)

In these graphs the real and predicted grid locations are compared. Overall, the fit is reasonable, except in the critical zones where grid values are low or high. As these values are normalized, values near 0 and 1 show the largest errors. Zooming in a little (blue line graph), it is clear that small errors also exist everywhere, but the trend holds: extreme values of grid ID show the largest shifts.
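A minimal sketch to quantify this impression would be to bin the absolute prediction error by the true normalized grid value:

# Mean absolute error of the predictions, grouped by bins of the true grid value
errors = pd.DataFrame({'true': y_test, 'abs_err': np.abs(pred.flatten() - y_test)})
print(errors.groupby(pd.cut(errors['true'], bins=10))['abs_err'].mean())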

In [116]:
plt.figure(figsize=(15, 4), dpi=100)
plt.plot(pred[:1000],
         linewidth=1, 
         color='red', 
         label='Predicted Grid Locations')
plt.plot(y_test[:1000],
         linewidth=1, 
         color='black',
         marker='+',
         markersize=4,
         label='Real Grid Locations')
plt.xlabel('Data Tuple')
plt.ylabel('Grid Locations')

plt.legend(loc='upper left')
plt.show()
In [38]:
plt.figure(figsize=(15, 4), dpi=100)
plt.plot(pred[:150],
         linewidth=1, 
         color='blue', 
         label='Predicted Grid Locations')
plt.plot(y_test[:150],
         linewidth=1, 
         color='black',
         marker='+',
         markersize=4,
         label='Real Grid Locations')
plt.xlabel('Data Tuple')
plt.ylabel('Grid Locations')

plt.legend(loc='upper left')
plt.show()

Conclusions:

  • I have found a way to predict sequences of data with an RNN and explored a little of the LSTM architecture and the other layers used to improve results on this kind of problem. The approach followed is also in line with an idea I presented in the first assignment, where a grid definition came to mind. Reviewing the literature, however, it is also possible to predict multi-dimensional data, for example the x and y coordinates directly; I am not sure whether that accounts for spatiality, but it is a matter for further review on my part.
  • Differences in feature definition, as small as they might be, drastically change the result of a prediction problem. Between the processing presented here and the guide publication I based it on, the changes might not seem relevant (eliminating some features, or keeping only hurricanes with a certain number of observations), yet my result was about 40% accuracy against around 60% reported in the publication.
  • As true as it might be that there is a wide range of architectures designed for different problems, deep knowledge is needed to modify even the smallest parameters of a model. After trying to improve the results by changing some settings, I realize that even knowing some basics of the proposed architecture, modifying it substantially is not yet within my reach.

Alemany, S., Beltran, J., Perez, A., & Ganzfried, S. (2019, July). Predicting hurricane trajectories using a recurrent neural network. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 33, No. 01, pp. 468-475).