Adjustments for Time Series Video Lecture Transcript

This transcript was automatically generated, so there may be discrepancies between the video and the text.

Hi everybody. Welcome back. In this video we're going to talk about some adjustments that need to be made for time series data when making forecasts. In this notebook we'll define the horizon of a forecast and then discuss the adjustments we need to make to data splitting techniques for sequential data, particularly time series data.

The temporal nature of time series changes the way we have to approach data splits. When we created data splits before, we would just do a uniform random split, or maybe a random split that took into account something like a categorical variable. But when we create splits for time series, we have to respect the sequential nature of the data. That means our training split can't include data from future values. If we did a completely random split, we could end up with observations in our training set that occur after observations in the holdout or test set. That's not good for time series, because it would mean we're training a model on observations that happened after the ones we're trying to predict and assess performance on. This is problematic because with a time series we're assuming that later observations may depend on earlier observations, so we would be getting our causality backwards. We don't want to use the future to predict the past when, in some sense, we already know what happened in the past.

So we're going to talk about the types of adjustments we have to make for train, test, and validation splits, as well as for cross-validation.
To help us understand what we should do, we're going to define the idea of a horizon. In a forecast, we're not typically trying to predict all values into the future at one time. We usually set a small window into the future that we would like to know. For instance, think back to the weather example: there's the 10-day forecast, and maybe some sites or weather services will give you a month into the future, but after a certain point they stop predicting that far ahead because it would just be a nonsense prediction. The number of time steps into the future that you're trying to forecast, let's call it little h, is known as your forecast horizon. So in the 10-day weather example, the horizon of that forecast would be 10 days.

For train test splits or validation splits with time series data, you don't take a random split. Instead, you set aside the last one, two, or possibly three horizons' worth of data, out to where you would like to predict, as your test or validation set. Let's look at an example. Say our full time series is these 12 blue dots here. When we do our train test split (or, if this were already a training set, a validation split), we would split it like this: the first eight observations are the training set, and the final four observations are the test or validation set. With a horizon of two, that gives us two horizons' worth of data to evaluate on; with a horizon of four, it gives us one horizon's worth of data to evaluate on. So that's train test splits and validation splits with time series data.

In Python, we usually don't need to import any special functions or objects to do this.
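As a small sketch of this kind of split, here is one way it might look for the 12-point example above (the series values here are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical series of 12 observations, standing in for the
# 12 blue dots in the lecture's figure.
ts = pd.Series(np.arange(12), name="value")

h = 2           # assumed forecast horizon, for illustration
n_horizons = 2  # hold out two horizons' worth of data

# Training set: everything except the final n_horizons * h observations.
# Test (or validation) set: the final n_horizons * h observations.
train = ts.iloc[:-n_horizons * h]
test = ts.iloc[-n_horizons * h:]

print(len(train), len(test))  # 8 4
```

The same slicing works on a plain NumPy array with ordinary indexing; the key point is that the holdout always comes from the end of the series.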
Typically we can just subset using indexing if it's a NumPy array, or using .loc or .iloc if it's a pandas DataFrame.

Cross-validation works in much the same way as the split described above. For each of the k splits, you incrementally add the next h (or maybe 2h) observations. Here's an example that should hopefully help illustrate. Let's say we want our horizon to be three time steps, we want to do five-fold cross-validation, and our entire training set has 18 total observations. We make the cross-validation splits sequentially. First we set aside, in this case, the first three observations as the training set for cross-validation split one, followed by the next three observations as the holdout set. For the second split, all six points from the first split are now the training set, and we add in the next three as the holdout set. For the third split, the same thing: all the points from split two become the training set, and the next three points become the holdout set. We keep going this way until we build all the way up to the entire training set. For the last split, everything that was in cross-validation split four is now the training set, and the last three observations of the original training set are the holdout.

If h were smaller, say h = 2, we would build the splits the same way: in essence starting from the entire data set and removing two observations at a time from the end. So we might end up with a first training set that has more than just a single horizon to train on.
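The scheme above can be hand-rolled as a small generator. This is only a sketch of the expanding-window idea described in the lecture, with a hypothetical helper name:

```python
import numpy as np

def expanding_window_splits(n_obs, n_splits, h):
    """Yield (train_idx, test_idx) pairs for expanding-window CV.

    Each successive training set grows by h observations, and the
    next h observations form the holdout set.
    """
    idx = np.arange(n_obs)
    for k in range(1, n_splits + 1):
        # End of the training window for split k, counting back from the end.
        train_end = n_obs - (n_splits - k + 1) * h
        yield idx[:train_end], idx[train_end:train_end + h]

# 18 observations, 5 splits, horizon of 3, as in the lecture example.
for train_idx, test_idx in expanding_window_splits(18, 5, 3):
    print(len(train_idx), test_idx)
```

With these settings the training sets have sizes 3, 6, 9, 12, and 15, and each holdout is the next three observations, matching the picture described above.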
That's just the way this image worked out. Typically, you take the amount you want for your holdout set off sequentially, one split at a time.

In Python, we could implement this by hand, and that's what we used to do. But scikit-learn has added a nice time series cross-validation object within the past year or so, and we can use it. It's called TimeSeriesSplit, and it's stored in the same place as KFold, so we'll do: from sklearn.model_selection import TimeSeriesSplit. Now we can apply this to a fake time series. Here is the time series I've created; maybe it represents the price of some stock.
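A minimal sketch of TimeSeriesSplit on a made-up series (the random-walk "stock price" here is invented to stand in for the lecture's example; test_size=3 pins each holdout at one horizon of h = 3):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A fake random-walk series standing in for a stock price.
rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(size=18))

# 5 sequential splits, each holding out the next 3 observations.
tscv = TimeSeriesSplit(n_splits=5, test_size=3)

for train_idx, test_idx in tscv.split(prices):
    # The training indices always precede the holdout indices.
    print(len(train_idx), test_idx)
```

Note that split() yields index arrays, just like KFold, so you subset the series yourself with those indices inside the cross-validation loop.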