Brisbane City Council (BCC) is considering upgrades to the bikeway network. To plan the upgrades, BCC is using data gathered from sensors placed along the bike paths, which record the number of cyclists. However, a number of sensor failures have left the dataset with missing entries. BCC has requested that you investigate whether it is possible to predict the missing data from data gathered by other sensors on the bike path network.
You have been provided with three years’ worth of data (Bike-Ped-Auto-Counts-2014.csv, Bike-Ped-Auto-Counts-2015.csv, and Bike-Ped-Auto-Counts-2016.csv), and the corresponding three years of weather data (the files named IDCJAC00XX_040913_201X_Data.csv). As an initial investigation, you have been asked to consider only these data series from the bikeway data:
BicentennialBikewayCyclistsInbound
GoBetweenBridgeCyclistsInbound
KangarooPointBikewayCyclistsInbound
NormanParkCyclistsInbound
RiverwalkCyclistsInbound
StoryBridgeWestCyclistsInbound
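As a starting point, the six series can be pulled out of the yearly files and checked for gaps. The sketch below is illustrative only: the column names are those listed above, but the function names `combine_years` and `missing_summary` are hypothetical, and it assumes each yearly CSV loads into a table with one column per counter (the real files may need extra handling for headers or dates).

```python
import pandas as pd

# The six series named in the brief.
SERIES = [
    "BicentennialBikewayCyclistsInbound",
    "GoBetweenBridgeCyclistsInbound",
    "KangarooPointBikewayCyclistsInbound",
    "NormanParkCyclistsInbound",
    "RiverwalkCyclistsInbound",
    "StoryBridgeWestCyclistsInbound",
]


def combine_years(frames):
    """Stack the yearly tables and keep only the six series of interest.

    A series absent from every table becomes an all-missing column
    rather than raising an error.
    """
    combined = pd.concat(frames, ignore_index=True)
    return combined.reindex(columns=SERIES)


def missing_summary(counts):
    """Count and percentage of missing entries for each series."""
    n_missing = counts.isna().sum()
    return pd.DataFrame(
        {"missing": n_missing, "percent": 100 * n_missing / len(counts)}
    )


# Possible usage (assumes the files sit in the working directory):
# frames = [pd.read_csv(f"Bike-Ped-Auto-Counts-{year}.csv")
#           for year in (2014, 2015, 2016)]
# counts = combine_years(frames)
# print(missing_summary(counts))
```

The summary gives a baseline for judging later how much of the missing data was actually recovered.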
Using the three years of data, you are to:
Determine which counters are best suited to predicting the missing data in others (i.e. which counters, if any, could be used to predict BicentennialBikewayCyclistsInbound).
Investigate if weather data can be used to help support this prediction, and if so, indicate what weather data is most helpful.
Predict missing data where appropriate to generate a more “complete” database.
Comment on the resulting corrected dataset. In particular:
What problems, if any, may arise from this approach?
How effective has this been in reducing missing data?
How trustworthy are the predicted values?
You should draw on the unit content concerning correlation and regression to answer this question. Note that you are not expected to use training/validation/testing data splits, although you are welcome to do so. No marks will be lost/gained for using/not using data splits.
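One way the correlation and regression steps might fit together is sketched below. It is a minimal illustration under stated assumptions, not a prescribed method: it assumes the counts are held in a pandas DataFrame with one column per counter, uses a single predictor with an ordinary least-squares line, and the function names `rank_predictors` and `fill_with_regression` are hypothetical.

```python
import numpy as np
import pandas as pd


def rank_predictors(counts, target):
    """Pearson correlation of every other series with `target`.

    DataFrame.corr() ignores rows where either series in a pair is
    missing, so sensor gaps do not need to be removed beforehand.
    Results are ordered by absolute correlation, strongest first.
    """
    corr = counts.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


def fill_with_regression(counts, target, predictor):
    """Fill gaps in `target` using a straight-line fit on `predictor`.

    The line is fitted only on rows where both series were observed.
    Returns the filled series and a boolean mask marking which entries
    are predictions rather than measurements.
    """
    both = counts[[target, predictor]].dropna()
    # polyfit returns coefficients from highest degree down.
    slope, intercept = np.polyfit(both[predictor], both[target], deg=1)

    filled = counts[target].astype(float)
    can_fill = filled.isna() & counts[predictor].notna()
    filled[can_fill] = intercept + slope * counts.loc[can_fill, predictor]
    return filled, can_fill
```

Keeping the mask of predicted entries makes it possible to report how many gaps were filled and to keep predicted values separate from measured ones when commenting on trustworthiness. Rows where the predictor is also missing remain unfilled.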
Sample Solution