Analysis and visualisation
Time series/time-based analysis
(def dataset (tc/dataset qa/clean-dataset-file-name
                         {:key-fn keyword
                          :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))
We’ll start by breaking the datetime column into year, month, day-of-week, date, and hour components so we can examine some trends:
(def get-hour (memfn getHour))

(def with-temporal-components
  (-> dataset
      (tc/map-columns :year :datetime (comp jt/value jt/year))
      (tc/map-columns :month :datetime (comp jt/value jt/month))
      (tc/map-columns :day-of-week :datetime jt/day-of-week)
      (tc/map-columns :date :datetime jt/local-date)
      (tc/map-columns :hour :datetime get-hour)
      (tc/order-by :datetime)))
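To make the derived columns concrete, here is a minimal sketch of what each component looks like for a single timestamp. It uses plain `java.time` interop instead of the `jt` wrappers so that it runs without extra dependencies. The helper name `temporal-components` and the sample timestamp are assumptions made for illustration; they are not part of the notebook.

```clojure
(import '(java.time LocalDateTime))

;; Hypothetical helper: extracts the same kinds of components that the
;; map-columns calls above derive, for one LocalDateTime value.
(defn temporal-components [^LocalDateTime dt]
  {:year        (.getYear dt)
   :month       (.getMonthValue dt)
   :day-of-week (str (.getDayOfWeek dt))
   :date        (str (.toLocalDate dt))
   :hour        (.getHour dt)})

(temporal-components (LocalDateTime/parse "2023-05-17T08:00"))
;; => {:year 2023, :month 5, :day-of-week "WEDNESDAY", :date "2023-05-17", :hour 8}
```

In the dataset version the `:day-of-week` and `:date` columns hold java.time objects rather than strings; they are stringified here only to keep the printed result readable.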
Yearly trends
(defn- calculate-daily-trends [ds]
  (-> ds
      (tc/group-by [:date])
      (tc/aggregate {:total-count (comp int tcc/sum :count)
                     :avg-count (comp int tcc/mean :count)
                     :stations (comp count distinct :station-id)})
      (tc/order-by :date)))
(-> (calculate-daily-trends with-temporal-components)
    (tc/map-columns :epoch-day :date #(jt/as % :epoch-day))
    (plotly/base {:=width 1000
                  :=title "Total average bicycle traffic trend over time"
                  :=x :epoch-day
                  :=x-type :temporal
                  :=y :avg-count})
    (plotly/layer-point {:=name "Daily average across all stations"})
    (plotly/layer-smooth {:=name "Trend"}))
Monthly trends
(defn calculate-monthly-average [ds]
  (-> ds
      (tc/group-by [:year :month])
      (tc/aggregate {:avg-count (comp int tcc/mean :count)})
      (tc/map-columns :datetime [:year :month] (fn [year month]
                                                 (jt/local-date year month)))
      (tc/order-by :datetime)))
(-> (calculate-monthly-average with-temporal-components)
    (plotly/base {:=width 1000
                  :=title "Monthly bicycle traffic trend"})
    (plotly/layer-line {:=x :datetime
                        :=y :avg-count}))
How has bike traffic changed from season to season over time?
(-> with-temporal-components
    (tc/group-by [:year :month])
    (tc/aggregate {:monthly-total (comp int tcc/sum :count)
                   :daily-avg (comp int tcc/mean :count)})
    (tc/map-columns :season :month {12 "Winter" 1 "Winter" 2 "Winter"
                                    3 "Spring" 4 "Spring" 5 "Spring"
                                    6 "Summer" 7 "Summer" 8 "Summer"
                                    9 "Fall" 10 "Fall" 11 "Fall"})
    (plotly/base {:=width 1000
                  :=title "Seasonal daily average counts"})
    (plotly/layer-bar {:=x :year
                       :=x-title "Year"
                       :=y :daily-avg
                       :=y-title "Daily average count across all stations"
                       :=color :season
                       :=mark-opacity 0.8}))
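The season column relies on the fact that a Clojure map can be called as a function of its keys, which is why a literal map can be passed where a mapping function is expected. A minimal sketch in plain Clojure, with the hypothetical name `month->season` introduced only for this example:

```clojure
;; The same month-to-season lookup as above, bound to a name so it can be
;; called directly. A map used as a function returns the value for the key.
(def month->season
  {12 "Winter" 1 "Winter" 2 "Winter"
   3 "Spring" 4 "Spring" 5 "Spring"
   6 "Summer" 7 "Summer" 8 "Summer"
   9 "Fall" 10 "Fall" 11 "Fall"})

(mapv month->season [1 4 7 10 12])
;; => ["Winter" "Spring" "Summer" "Fall" "Winter"]
```

A key that is missing from the map returns `nil`, so a month value outside 1 to 12 would produce an empty season rather than an error.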
How does weekend traffic compare to weekday traffic?
(-> with-temporal-components
    (tc/map-columns :is-weekend :datetime jt/weekend?)
    (tc/group-by [:year :is-weekend])
    (tc/aggregate {:avg-count (comp int tcc/mean :count)})
    (plotly/base {:=width 1000})
    (plotly/layer-bar {:=x :year
                       :=y :avg-count
                       :=color :is-weekend}))
What is the busiest day of the week?
(-> with-temporal-components
    (tc/group-by [:day-of-week])
    (tc/aggregate {:avg-count (comp int tcc/mean :count)})
    (tc/order-by :day-of-week)
    (plotly/base {:=title "Average bicycle traffic by day of week"
                  :=width 1000})
    (plotly/layer-bar {:=x :day-of-week
                       :=y :avg-count}))
What is the busiest time of day?
(-> with-temporal-components
    (tc/group-by [:hour])
    (tc/aggregate {:avg-count (comp int tcc/mean :count)})
    (plotly/base {:=width 1000
                  :=title "Hourly traffic average"})
    (plotly/layer-bar {:=x :hour
                       :=x-title "Hour of day"
                       :=y :avg-count
                       :=y-title "Average count across all stations"}))
How did the pandemic impact bicycle traffic?
(def pandemic-periods
  (let [march-2020 (jt/local-date-time 2020 03 01)
        may-2023 (jt/local-date-time 2023 05 01)
        pre-pandemic (tc/select-rows with-temporal-components #(jt/< (:datetime %) march-2020))
        pandemic (tc/select-rows with-temporal-components #(jt/<= march-2020 (:datetime %) may-2023))
        post-pandemic (tc/select-rows with-temporal-components #(jt/> (:datetime %) may-2023))]
    {:pre-pandemic pre-pandemic
     :pandemic pandemic
     :post-pandemic post-pandemic}))
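The three periods are split at two boundaries: strictly before the first boundary is pre-pandemic, anything between the two boundaries (inclusive on both ends) is pandemic, and strictly after the second is post-pandemic. The sketch below shows the same boundary logic on single timestamps, using plain `java.time` interop rather than the `jt` comparison functions. The helper `period-of` is hypothetical, and the boundaries are assumed to fall at midnight on 2020-03-01 and 2023-05-01.

```clojure
(import '(java.time LocalDateTime))

;; Assumed boundaries: midnight at the start of each boundary day.
(def march-2020 (LocalDateTime/parse "2020-03-01T00:00"))
(def may-2023 (LocalDateTime/parse "2023-05-01T00:00"))

;; Hypothetical helper: classifies one timestamp into a period.
(defn period-of [^LocalDateTime dt]
  (cond
    (.isBefore dt march-2020) :pre-pandemic   ; strictly before the start
    (.isAfter dt may-2023)    :post-pandemic  ; strictly after the end
    :else                     :pandemic))     ; both boundaries inclusive

(mapv (comp period-of #(LocalDateTime/parse %))
      ["2020-02-29T23:59" "2020-03-01T00:00" "2023-05-01T00:00" "2023-05-01T00:01"])
;; => [:pre-pandemic :pandemic :pandemic :post-pandemic]
```

Because every timestamp falls into exactly one branch, the three subsets do not overlap and together cover the whole dataset.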
(-> (calculate-monthly-average (:pre-pandemic pandemic-periods))
    (plotly/base {:=width 1000
                  :=title "Impact of the pandemic on bicycle traffic"})
    (plotly/layer-line {:=x :datetime
                        :=y :avg-count
                        :=name "Pre pandemic"})
    (plotly/layer-line {:=dataset (calculate-monthly-average (:pandemic pandemic-periods))
                        :=x :datetime
                        :=y :avg-count
                        :=name "Pandemic"})
    (plotly/layer-line {:=dataset (calculate-monthly-average (:post-pandemic pandemic-periods))
                        :=x :datetime
                        :=y :avg-count
                        :=name "Post-pandemic"}))
(-> (calculate-monthly-average (:pre-pandemic pandemic-periods))
    (plotly/base {:=width 1000
                  :=title "Impact of the pandemic on bicycle traffic"})
    (plotly/layer-bar {:=x :datetime
                       :=y :avg-count
                       :=name "Pre pandemic"})
    (plotly/layer-bar {:=dataset (calculate-monthly-average (:pandemic pandemic-periods))
                       :=x :datetime
                       :=y :avg-count
                       :=name "Pandemic"})
    (plotly/layer-bar {:=dataset (calculate-monthly-average (:post-pandemic pandemic-periods))
                       :=x :datetime
                       :=y :avg-count
                       :=name "Post-pandemic"}))
What are the busiest times for cyclists in Berlin?
(-> with-temporal-components
    (tc/group-by [:day-of-week :hour])
    (tc/aggregate {:avg-count (comp int tcc/mean :count)})
    (tc/order-by :day-of-week)
    (plotly/layer-heatmap {:=x :hour
                           :=y :day-of-week
                           :=z :avg-count}))
Find anomalous days (very high or low traffic) across all stations
(-> with-temporal-components
    (tc/group-by [:date])
    (tc/aggregate {:daily-total (comp int tcc/sum :count)
                   :daily-avg (comp int tcc/mean :count)
                   :std-dev (comp int tcc/standard-deviation :count)})
    (tc/drop-rows (comp zero? :std-dev))
    (tc/map-columns :z-score [:daily-total :daily-avg :std-dev]
                    (fn [total avg stdev]
                      (/ (- total avg) stdev)))
    (plotly/base {:=width 1200 :=title "Days with unusual bicycle traffic"})
    (plotly/layer-point {:=x :date
                         :=x-type :temporal
                         :=mark-size 4
                         :=y :z-score}))
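The rows with a zero standard deviation are dropped before the score is computed because the aggregates are cast to integers, and integer division by zero throws in Clojure. The following plain-Clojure sketch applies the same formula to a few made-up rows; the `z-score` helper and the sample values are illustrative assumptions, not data from the notebook.

```clojure
;; Same formula as the map-columns step above.
(defn z-score [total avg stdev]
  (/ (- total avg) stdev))

;; Made-up rows; the second has a zero standard deviation.
(def sample-rows
  [{:daily-total 1200 :daily-avg 100 :std-dev 50}
   {:daily-total 40   :daily-avg 40  :std-dev 0}
   {:daily-total 45   :daily-avg 40  :std-dev 10}])

(->> sample-rows
     (remove (comp zero? :std-dev))   ; guard against dividing by zero
     (mapv (fn [{:keys [daily-total daily-avg std-dev]}]
             (z-score daily-total daily-avg std-dev))))
;; => [22 1/2]
```

Note that integer division in Clojure returns an exact ratio such as `1/2` when the result is not a whole number. Wrapping the result in `double` is one way to get a decimal value if the plotting layer expects one.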
Spatial analysis
For this we’ll make use of our other dataset that includes metadata about the stations.
How does bicycle traffic vary across different locations in Berlin?
Station comparison - traffic volume
(-> with-temporal-components
    (tc/group-by [:station-id])
    (tc/aggregate {:avg-count (comp int tcc/mean :count)})
    (tc/inner-join qa/location-info-ds :station-id)
    (tc/order-by [:avg-count :desc])
    (plotly/base {:=width 1000 :=title "Station traffic comparison"})
    (plotly/layer-bar {:=x :direction :=y :avg-count}))
Station growth over time (year-over-year)
(-> with-temporal-components
    (tc/group-by [:year :station-id])
    (tc/aggregate {:yearly-avg (comp int tcc/mean :count)})
    (tc/inner-join qa/location-info-ds :station-id)
    (plotly/base {:=width 1000 :=title "Station growth over time"})
    (plotly/layer-line {:=x :year :=y :yearly-avg :=color :direction}))