Quality control
Bonus: Data quality analysis and cleaning
In any real-world dataset, there are always data quality issues that we should address before proceeding with an analysis. If we don’t, we risk drawing incorrect conclusions because the source data may be unreliable or unrealistic. We need to expose any issues like this ahead of time so that we can correctly interpret our results.
In this case we have some very helpful advice in the dataset description pointing out several things we should look for, including:
- Gaps in the data from technical issues (battery failures, defective equipment, transmission issues, etc.)
- Different station start dates
- Environmental factors affecting measurement
- Infrastructure changes affecting traffic (detours, construction, network updates)
- Mis-measurement due to incorrect bike lane usage (cyclists travelling in the wrong direction)
We’ll explore and fix these data quality issues so that we can conduct an accurate analysis.
Missing and inconsistent data
First we’ll calculate the uptime percentage for each station. To do this we’ll:
- Look at the location metadata dataset to determine when a station came online
- Determine how many measurements it should have, if it had been working 24/7
- Examine how many measurements we actually have, and compare
Then we’ll flag suspicious values by checking for ones that appear significantly higher or lower than the mean.
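That mean-based check never gets spelled out explicitly below, so here is a minimal sketch of what it could look like. Note that `flag-outliers`, the `k`-standard-deviation threshold, and the per-station grouping are illustrative choices, not part of the original analysis:

```clojure
(require '[tablecloth.api :as tc]
         '[tech.v3.datatype.statistics :as stats])

;; Hypothetical helper: flag counts more than k standard deviations
;; away from their station's mean as suspicious.
(defn flag-outliers [ds k]
  (-> ds
      (tc/group-by [:station-id])
      (tc/add-column :suspicious?
                     (fn [group]
                       (let [counts (:count group)
                             mean   (stats/mean counts)
                             sd     (stats/standard-deviation counts)]
                         (map #(> (Math/abs (- (double %) mean)) (* k sd))
                              counts))))
      (tc/ungroup)))

;; e.g. (flag-outliers combined-dataset 3), once the dataset is loaded below
```

For hourly traffic data a per-hour-of-day baseline would arguably be fairer, since 3am and 5pm counts have very different distributions, but a global per-station threshold is enough to surface gross anomalies.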
Looking at our entire dataset together, we can start by grouping by station ID, since many of the quality metrics we care about are most relevant at the station level. Note that tablecloth knows how to handle a gzipped file; no extra processing is needed.
```clojure
(def combined-dataset
  (tc/dataset transform/combined-dataset-file-name
              {:key-fn keyword
               :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))

(def grouped-by-station-id (tc/group-by combined-dataset :station-id))

grouped-by-station-id
```
_unnamed [38 3]:
:name | :group-id | :data |
---|---|---|
12-PA-SCH | 0 | Group: 12-PA-SCH [121287 3]: |
26-LI-PUP | 1 | Group: 26-LI-PUP [83341 3]: |
27-RE-MAR | 2 | Group: 27-RE-MAR [84784 3]: |
03-MI-SAN-O | 3 | Group: 03-MI-SAN-O [84011 3]: |
02-MI-JAN-S | 4 | Group: 02-MI-JAN-S [85106 3]: |
05-FK-OBB-W | 5 | Group: 05-FK-OBB-W [71380 3]: |
02-MI-JAN-N | 6 | Group: 02-MI-JAN-N [85504 3]: |
18-TS-YOR-W | 7 | Group: 18-TS-YOR-W [85504 3]: |
03-MI-SAN-W | 8 | Group: 03-MI-SAN-W [84032 3]: |
19-TS-MON | 9 | Group: 19-TS-MON [84756 3]: |
… | … | … |
14-CW-JU-W | 27 | Group: 14-CW-JU-W [21334 3]: |
14-CW-JU-O | 28 | Group: 14-CW-JU-O [21334 3]: |
01-MI-AL-W | 29 | Group: 01-MI-AL-W [17537 3]: |
03-SP-NO-O | 30 | Group: 03-SP-NO-O [948 3]: |
04-MI-NO | 31 | Group: 04-MI-NO [10617 3]: |
03-SP-NO-W | 32 | Group: 03-SP-NO-W [948 3]: |
11-PA-SE-N | 33 | Group: 11-PA-SE-N [17542 3]: |
16-SP-NO-W | 34 | Group: 16-SP-NO-W [8783 3]: |
16-SP-NO-O | 35 | Group: 16-SP-NO-O [8783 3]: |
09-PA-SA-S | 36 | Group: 09-PA-SA-S [6084 3]: |
07-FK-ST | 37 | Group: 07-FK-ST [6096 3]: |
There are 38 groups, but only 35 rows in our location metadata sheet:
```clojure
(def location-info-ds
  (tc/dataset explore/location-info-file-name
              {:key-fn keyword
               :parser-fn {:installed [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))

(tc/row-count location-info-ds)
```

```
35
```
First, we should reconcile this. Clojure has some useful built-in helpers: `clojure.data/diff` returns a tuple of `[things-only-in-a things-only-in-b things-in-both]`, which we can use to spot the discrepancy.
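For example, on two toy sets:

```clojure
(data/diff #{:a :b :c} #{:b :c :d})
;; => [#{:a} #{:d} #{:b :c}]
```

Applying it to our station ids: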
```clojure
(data/diff (set (:name grouped-by-station-id))
           (set (:station-id location-info-ds)))
```

```
[#{"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"}
 #{"17-SK-BRE-O" "17-SK-BRE-W"}
 #{"01-MI-AL-W"
   "16-SP-NO-W"
   "09-PA-SA-S"
   "05-FK-OBB-W"
   "15-SP-KLO-S"
   "21-NK-MAY"
   "16-SP-NO-O"
   "13-CW-PRI"
   "04-MI-NO"
   "11-PA-SE-N"
   "14-CW-JU-O"
   "20-TS-MAR-S"
   "10-PA-BER-S"
   "05-FK-OBB-O"
   "27-RE-MAR"
   "15-SP-KLO-N"
   "02-MI-JAN-S"
   "12-PA-SCH"
   "26-LI-PUP"
   "02-MI-JAN-N"
   "14-CW-JU-W"
   "07-FK-ST"
   "06-FK-FRA-W"
   "03-MI-SAN-O"
   "23-TK-KAI"
   "10-PA-BER-N"
   "24-MH-ALB"
   "18-TS-YOR-O"
   "06-FK-FRA-O"
   "20-TS-MAR-N"
   "18-TS-YOR-W"
   "19-TS-MON"
   "03-MI-SAN-W"}]
```
Here we can easily spot the problem: it appears that there are two stations that probably got mis-named. `"17-SK-BRE-O"` and `"17-SK-BRE-W"` are in the location-info dataset, but do not appear at all in the timeseries data. However, in the timeseries we do have these five station ids that are not in the location-info dataset:
"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"} #{
"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"} #{
It seems likely that `"17-SZ-BRE-W"` and `"17-SZ-BRE-O"` are meant to be `"17-SK-BRE-W"` and `"17-SK-BRE-O"`. We can be more sure by inspecting the column headers of the raw datasets we had, which included the installation date in the same cell as the station id:
```clojure
(->> explore/raw-datasets
     (map tc/column-names)
     (map (partial filter #(str/includes? % "17"))))
```

```
(()
 ()
 ()
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O\n01.05.2016" "17-SZ-BRE-W\n01.05.2016"))
```
Comparing these to the rows from the location-info dataset, we can see the installation dates are the same:
```clojure
(-> location-info-ds
    (tc/select-rows (comp #{"17-SK-BRE-O" "17-SK-BRE-W"} :station-id)))
```
data/prepared/location-info.csv [2 5]:
:station-id | :direction | :lat | :long | :installed |
---|---|---|---|---|
17-SK-BRE-O | Breitenbachplatz Ost | 52.4668 | 13.3092 | 2016-05-01T00:00 |
17-SK-BRE-W | Breitenbachplatz West | 52.4672 | 13.3084 | 2016-05-01T00:00 |
So it’s safe to conclude that these two sets of IDs are actually the same.
There are still 3 other station ids in our timeseries data that do not appear in the location info dataset: `"02-MI-AL-W"`, `"03-SP-NO-W"`, and `"03-SP-NO-O"`. We can check if there’s anything similar there:
```clojure
(-> location-info-ds
    (tc/select-rows #(re-find #"MI-AL|SP-NO" (:station-id %))))
```
data/prepared/location-info.csv [3 5]:
:station-id | :direction | :lat | :long | :installed |
---|---|---|---|---|
01-MI-AL-W | Karl-Marx-Allee | 52.5219 | 13.4175 | 2021-12-16T00:00 |
16-SP-NO-O | Nonnendammallee Ost | 52.5379 | 13.2479 | 2023-11-22T00:00 |
16-SP-NO-W | Nonnendammallee West | 52.5384 | 13.2470 | 2023-11-22T00:00 |
We can check again how these compare to the raw data to see if we can infer anything about what may be going on here:
```clojure
(defn- show-column-names-with-substring [datasets val]
  (->> datasets
       (map tc/column-names)
       (map (partial filter #(str/includes? % val)))))
```

```
#'notebooks.6-quality-control/show-column-names-with-substring
```

```clojure
(show-column-names-with-substring explore/raw-datasets "MI-AL")
```
```
(()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ("02-MI-AL-W 16.12.2021")
 ("02-MI-AL-W 16.12.2021")
 ("01-MI-AL-W 16.12.2021")
 ("01-MI-AL-W\n16.12.2021"))
```
"SP-NO") (show-column-names-with-substring explore/raw-datasets
(()
()
()
()
()
()
()
()
()
()
()
()
()
()"03-SP-NO-O 22.11.2023" "03-SP-NO-W 22.11.2023")
("16-SP-NO-O\n22.11.2023" "16-SP-NO-W\n22.11.2023")) (
Again we can see this is most likely a case of station ids changing over time. The original column headers reveal that these are most likely the same stations, just with different IDs in different years. We’ll update this in our combined dataset. To review, we have the following mapping from station ids present in the raw data to station ids in the location info. We will use the location-info ones as the canonical ones:
```clojure
(def station-id-mapping
  {"17-SZ-BRE-O" "17-SK-BRE-O"
   "17-SZ-BRE-W" "17-SK-BRE-W"
   "02-MI-AL-W" "01-MI-AL-W"
   "03-SP-NO-W" "16-SP-NO-W"
   "03-SP-NO-O" "16-SP-NO-O"})
```
We’ll apply this mapping to our `:station-id` column, then verify that it worked. We can see that after the update we have the exact same collection of unique station ids in our timeseries data as we do in the location-info metadata we were given.
```clojure
(defn correct-station-ids [ds]
  (tc/update-columns ds :station-id (partial replace station-id-mapping)))

(def corrected-station-ids
  (correct-station-ids combined-dataset))

(set/difference (-> corrected-station-ids (tc/group-by :station-id) :name set)
                (set (:station-id location-info-ds)))
```

```
#{}
```
Station reliability
Now we’ll figure out each station’s reliability. This is where we can really see the power of tablecloth’s handling of grouped datasets: all of the regular tablecloth functions work transparently on grouped datasets, operating on each group.
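As a quick illustration with toy data (not from this dataset), aggregating over a grouped dataset hands each group’s sub-dataset to the aggregation function and returns an ordinary dataset with one row per group:

```clojure
(-> (tc/dataset {:station ["a" "a" "b"] :count [1 2 5]})
    (tc/group-by [:station])
    (tc/aggregate {:total #(reduce + (:count %))}))
;; => a regular dataset again:
;;    | :station | :total |
;;    |----------|--------|
;;    | a        |      3 |
;;    | b        |      5 |
```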
First we get the latest measurement time, and double-check that it’s the same for every station:
```clojure
(-> corrected-station-ids
    (tc/group-by [:station-id])
    (tc/aggregate {:latest-date #(apply jt/max (:datetime %))})
    :latest-date
    distinct)
```

```
(#object[java.time.LocalDateTime 0x273ca37f "2024-12-31T23:00"])
```
Yes, all stations have their last measurement at 2024-12-31T23:00, so this is what we’ll use as the expected last measurement value:
```clojure
(def latest
  (apply jt/max (:datetime corrected-station-ids)))

latest
```

```
#object[java.time.LocalDateTime 0x4d7fda4e "2024-12-31T23:00"]
```
Then get the total number of measurements for each station:
```clojure
(defn- add-location-info-and-station-count [ds location-info]
  (-> ds
      (tc/group-by [:station-id])
      ;; Count the total rows per station
      (tc/aggregate {:total-measurement-count tc/row-count})
      ;; Then join this with the location-info dataset
      (tc/inner-join location-info :station-id)))
```

```
#'notebooks.6-quality-control/add-location-info-and-station-count
```
```clojure
(defn- add-uptime-col [ds end-time]
  (-> ds
      (tc/map-columns :expected-measurement-count
                      [:installed]
                      (fn [installation-date]
                        (jt/time-between installation-date end-time :hours)))
      (tc/map-columns :uptime
                      [:total-measurement-count :expected-measurement-count]
                      (fn [total expected]
                        (-> total
                            (/ expected)
                            (* 100.0))))))
```

```
#'notebooks.6-quality-control/add-uptime-col
```
```clojure
(-> corrected-station-ids
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))
```
inner-join [35 8]:
:station-id | :total-measurement-count | :direction | :lat | :long | :installed | :expected-measurement-count | :uptime |
---|---|---|---|---|---|---|---|
12-PA-SCH | 121287 | Schwedter Steg | 52.5491 | 13.4004 | 2012-03-01T00:00 | 112535 | 107.77713600 |
02-MI-JAN-N | 85504 | Jannowitzbrücke Nord | 52.5139 | 13.4178 | 2015-04-01T00:00 | 85511 | 99.99181392 |
02-MI-JAN-S | 85106 | Jannowitzbrücke Süd | 52.5139 | 13.4176 | 2015-04-01T00:00 | 85511 | 99.52637672 |
13-CW-PRI | 85489 | Prinzregentenstraße | 52.4881 | 13.3331 | 2015-04-01T00:00 | 85511 | 99.97427232 |
18-TS-YOR-O | 85493 | Yorckstraße Ost | 52.4919 | 13.3735 | 2015-04-01T00:00 | 85511 | 99.97895008 |
18-TS-YOR-W | 85504 | Yorkstraße West | 52.4923 | 13.3732 | 2015-04-01T00:00 | 85511 | 99.99181392 |
19-TS-MON | 84756 | Monumentenstraße | 52.4881 | 13.3698 | 2015-05-01T00:00 | 84791 | 99.95872203 |
27-RE-MAR | 84784 | Markstraße | 52.5582 | 13.3649 | 2015-05-01T00:00 | 84791 | 99.99174441 |
03-MI-SAN-O | 84011 | Invalidenstraße Ost | 52.5272 | 13.3720 | 2015-06-01T00:00 | 84047 | 99.95716682 |
03-MI-SAN-W | 84032 | Invalidenstraße West | 52.5277 | 13.3731 | 2015-06-01T00:00 | 84047 | 99.98215284 |
… | … | … | … | … | … | … | … |
06-FK-FRA-O | 75190 | Frankfurter Allee Ost | 52.5136 | 13.4742 | 2016-06-01T00:00 | 75263 | 99.90300679 |
06-FK-FRA-W | 75256 | Frankfurter Allee West | 52.5138 | 13.4744 | 2016-06-01T00:00 | 75263 | 99.99069928 |
14-CW-JU-W | 21334 | Straße des 17. Juni West | 52.5131 | 13.3268 | 2021-07-27T00:00 | 30095 | 70.88885197 |
14-CW-JU-O | 21334 | Straße des 17. Juni Ost | 52.5126 | 13.3266 | 2021-07-27T00:00 | 30095 | 70.88885197 |
01-MI-AL-W | 17905 | Karl-Marx-Allee | 52.5219 | 13.4175 | 2021-12-16T00:00 | 26687 | 67.09259190 |
11-PA-SE-N | 17542 | Senefelder Platz Nord | 52.5314 | 13.4124 | 2022-10-06T00:00 | 19631 | 89.35866741 |
04-MI-NO | 10617 | Nordufer | 52.5139 | 13.4179 | 2023-10-16T00:00 | 10631 | 99.86830966 |
16-SP-NO-O | 9731 | Nonnendammallee Ost | 52.5379 | 13.2479 | 2023-11-22T00:00 | 9743 | 99.87683465 |
16-SP-NO-W | 9731 | Nonnendammallee West | 52.5384 | 13.2470 | 2023-11-22T00:00 | 9743 | 99.87683465 |
07-FK-ST | 6096 | Strausberger Platz | 52.5189 | 13.4261 | 2024-04-22T00:00 | 6095 | 100.01640689 |
09-PA-SA-S | 6084 | Schönhauser Allee | 52.5430 | 13.4121 | 2024-04-22T00:00 | 6095 | 99.81952420 |
These look mostly good, but two stations have an uptime of more than 100%. The most likely cause, and something we want to check for anyway, is duplicate measurements. In our combined dataset, there should only ever be one count per station/time combination. We can double-check this:
Count all rows:
```clojure
(tc/row-count corrected-station-ids)
```

```
2224785
```
Count unique rows by date/station-id combination:
```clojure
(-> corrected-station-ids
    (tc/unique-by [:datetime :station-id])
    tc/row-count)
```

```
2216025
```
So there are some dupes. We’ll find them and see what we’re dealing with:
```clojure
(comment
  ;; Note: this is how you could do this using tablecloth's grouping and
  ;; aggregating functionality, but `tcutils` (which is included in noj)
  ;; includes a helper for finding duplicates that is more performant,
  ;; which we'll use here.
  (def duplicate-rows-by-date-station-id
    (-> corrected-station-ids
        (tc/group-by [:datetime :station-id])
        (tc/aggregate {:row-count tc/row-count})
        (tc/select-rows (comp (partial < 1) :row-count)))))
```
Get the combinations of date/station-id that have more than one count:
```clojure
(def duplicate-rows (tcu/duplicate-rows corrected-station-ids))
```
We can see they’re all for a single station, which is great:
```clojure
(-> duplicate-rows
    :station-id
    distinct)
```

```
("12-PA-SCH")
```
And since these rows are exact duplicates, we can just do a very naive de-dupe on our data. From there we have a clean dataset, to which we’ll add back our uptime stats and inspect:
```clojure
(defn dedupe [ds]
  (tc/unique-by ds [:datetime :station-id]))

(def clean-dataset
  (-> corrected-station-ids
      dedupe))
```
```clojure
(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))
```
inner-join [35 8]:
:station-id | :total-measurement-count | :direction | :lat | :long | :installed | :expected-measurement-count | :uptime |
---|---|---|---|---|---|---|---|
12-PA-SCH | 112527 | Schwedter Steg | 52.5491 | 13.4004 | 2012-03-01T00:00 | 112535 | 99.99289110 |
02-MI-JAN-N | 85504 | Jannowitzbrücke Nord | 52.5139 | 13.4178 | 2015-04-01T00:00 | 85511 | 99.99181392 |
02-MI-JAN-S | 85106 | Jannowitzbrücke Süd | 52.5139 | 13.4176 | 2015-04-01T00:00 | 85511 | 99.52637672 |
13-CW-PRI | 85489 | Prinzregentenstraße | 52.4881 | 13.3331 | 2015-04-01T00:00 | 85511 | 99.97427232 |
18-TS-YOR-O | 85493 | Yorckstraße Ost | 52.4919 | 13.3735 | 2015-04-01T00:00 | 85511 | 99.97895008 |
18-TS-YOR-W | 85504 | Yorkstraße West | 52.4923 | 13.3732 | 2015-04-01T00:00 | 85511 | 99.99181392 |
19-TS-MON | 84756 | Monumentenstraße | 52.4881 | 13.3698 | 2015-05-01T00:00 | 84791 | 99.95872203 |
27-RE-MAR | 84784 | Markstraße | 52.5582 | 13.3649 | 2015-05-01T00:00 | 84791 | 99.99174441 |
03-MI-SAN-O | 84011 | Invalidenstraße Ost | 52.5272 | 13.3720 | 2015-06-01T00:00 | 84047 | 99.95716682 |
03-MI-SAN-W | 84032 | Invalidenstraße West | 52.5277 | 13.3731 | 2015-06-01T00:00 | 84047 | 99.98215284 |
… | … | … | … | … | … | … | … |
06-FK-FRA-O | 75190 | Frankfurter Allee Ost | 52.5136 | 13.4742 | 2016-06-01T00:00 | 75263 | 99.90300679 |
06-FK-FRA-W | 75256 | Frankfurter Allee West | 52.5138 | 13.4744 | 2016-06-01T00:00 | 75263 | 99.99069928 |
14-CW-JU-W | 21334 | Straße des 17. Juni West | 52.5131 | 13.3268 | 2021-07-27T00:00 | 30095 | 70.88885197 |
14-CW-JU-O | 21334 | Straße des 17. Juni Ost | 52.5126 | 13.3266 | 2021-07-27T00:00 | 30095 | 70.88885197 |
01-MI-AL-W | 17905 | Karl-Marx-Allee | 52.5219 | 13.4175 | 2021-12-16T00:00 | 26687 | 67.09259190 |
11-PA-SE-N | 17542 | Senefelder Platz Nord | 52.5314 | 13.4124 | 2022-10-06T00:00 | 19631 | 89.35866741 |
04-MI-NO | 10617 | Nordufer | 52.5139 | 13.4179 | 2023-10-16T00:00 | 10631 | 99.86830966 |
16-SP-NO-O | 9731 | Nonnendammallee Ost | 52.5379 | 13.2479 | 2023-11-22T00:00 | 9743 | 99.87683465 |
16-SP-NO-W | 9731 | Nonnendammallee West | 52.5384 | 13.2470 | 2023-11-22T00:00 | 9743 | 99.87683465 |
07-FK-ST | 6096 | Strausberger Platz | 52.5189 | 13.4261 | 2024-04-22T00:00 | 6095 | 100.01640689 |
09-PA-SA-S | 6084 | Schönhauser Allee | 52.5430 | 13.4121 | 2024-04-22T00:00 | 6095 | 99.81952420 |
And now the data looks better: the duplicate measurements are gone, and 12-PA-SCH’s uptime is back under 100%. The one remaining value above 100%, 07-FK-ST at 100.016%, is most likely a fencepost effect rather than bad data: the station has exactly one more measurement than the number of whole hours between its installation and the final timestamp.
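We can sanity-check that arithmetic. `jt/time-between` counts the whole hours between its two arguments, while an hourly sensor reporting at both endpoints produces one more sample than that (dates taken from the 07-FK-ST row above):

```clojure
(jt/time-between (jt/local-date-time 2024 4 22 0 0)
                 (jt/local-date-time 2024 12 31 23 0)
                 :hours)
;; => 6095 expected measurements, against 6096 actual samples:
;; 6096 / 6095 * 100 ≈ 100.016%, the uptime shown in the table above.
```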
We’ll do one last check, looking for suspicious values, like counts that are not whole numbers:
```clojure
(-> clean-dataset
    (tc/update-columns :count (partial map #(mod % 1)))
    :count
    distinct)
```

```
(0.0)
```
Here we’re good. With our clean dataset, we can add some visualizations to make it easier to assess our data quality at a glance.
```clojure
(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest)
    (tc/order-by :uptime)
    (plotly/base {:=width 600
                  :=title "Uptime by station"})
    (plotly/layer-bar {:=x :uptime
                       :=y :station-id})
    plotly/plot
    (assoc-in [:data 0 :orientation] "h"))
```
So we’ll store this cleaned-up data and use it in our analysis going forward:
```clojure
(def clean-dataset-file-name "data/prepared/cleaned-dataset.csv.gz")

(let [csv-file-name "data/prepared/cleaned-dataset.csv"]
  (-> corrected-station-ids
      (tc/unique-by [:datetime :station-id])
      (tc/write-csv! csv-file-name))
  (transform/compress-file csv-file-name clean-dataset-file-name))
```

```
true
```
All of this illustrates the importance of defining and using consistent IDs as a data publisher. Failing to do this makes it very hard for downstream consumers of your data to have confidence that they are interpreting it correctly.