Modeling


In order to predict the number of COVID-19 cases in each county in California, we decided to build 3 models: a Long short-term Memory(LSTM) neural network using Tensorflow, an ARIMA model using pmdarima, and a SARIMA model using statsmodels. As neural networks tend to overfit, we solved part of this issue by using dropout layers, L2 regularization in our hidden layers, as well as early stopping. The neural network structure we decided to go with was 5 layers in the following order: A 128 node LSTM input layer, a 64 node LSTM layer, a 32 node dense layer with an L2 regularization of 0.001, a 16 node dense layer with an L2 regularization of 0.001, and finally an output layer of 1 containing the predictions of the model. Each hidden layer had a dropout layers of 0.2. One additional issue with using an LSTM neural network was we were only able to use one feature to generate our predictions, so we decided to use the number of hospitalized COVID-19 patients to predict the number of cases as the 2 are highly correlated. Also, as we had to do modeling for all 58 counties, we did not gridsearch any hyperparameters to improve our models. Using the following topology we were able to achieve a root mean squared error of 8.51 for our training data and a root mean squared error 17.39 for our testing data. Below is a graph of the training vs testing loss of our model over 124 epochs for Los Angeles county.

Of course, one of the challenges we faced while trying to make predictions for each county is being able to predict drastic changes in the number of COVID-19 cases in a county. A perfect example of this is LA County where we can see that our model predicts early data quite well, however it is late on the spike between November and December and doesn't adapt well to the sudden decrease in the past few datas as seen below.

In the case of our ARIMA model, we used the auto arima function to automatically pick the best parameters for us. We found that the best parameters were 0 time lags for our p, subtracting 2 times from past values of our data for d, and a moving average of 2 for our q. Here we have our diagnostics for our ARIMA model. We see with the residuals that they are fairly steady in the beginning but there is a lot more varience in our forcast as time moves on. For our Q-Q plot, optimally we are looking for a straight line which we almost have, however the linearity is problem at the extremities of the plot. Of course, the story is similar for our histogram plus estimated density and correlogram where we see that we can capture about 70-80% of the data accurately, but can't really predict extremes.



Unfortunately, we did not see much improvements between our LSTM neural network and ARIMA as our predictions and the actual results are still differ a lot as can be seen below using LA County as an example again. Also the root mean squared error is 22.89 for our testing set using ARIMA, so it is slightly worse than our LSTM model.



Next we moved on to testing our a SARIMAX model to see if adding in a seasonality factor to our time series model could improve it. Again, we tested a varity of parameters for p, d, and q to see which parameters would be the best to use. In this case our parameters are slight different from our ARIMA model where our p, d, and q are 0, 1, and 1 respectively. Looking at the diagnostics for our SARIMAX model, we again have a similar story compared to our ARIMA model. In most cases, it unfortunately looks worse that our ARIMA model as can be seen below.



Again we see that our predictions are much worse than the actual values, and actually do worse than both ARIMA and LSTM. Our RMSE when predicting cases for LA County was 41.43 using SARIMAX.



In the end, we see that LSTM was able to predict one of the counties with the most varience the best so we decided to use it to predict the number of COVID-19 Cases in every county using LSTM.


Obviously, looking at these predictions for 58 different counties and making a decision isn't easy, so I've decided to display these predictions graphically on a map: Click here to see the map


If you want to see in depth how all our models were built, as well as our LSTM model for all 58 counties check out the following notebooks:


Making an LSTM Model to Predict COVID-19 Cases
Making an ARIMA Model to Predict COVID-19 Cases
Making a SARIMAX Model to Predict COVID-19 Cases