Quality control

Bonus: Data quality analysis and cleaning

In any real-world dataset there are data quality issues we should address before proceeding with an analysis. If we don’t, we risk drawing incorrect conclusions because the source data may be unreliable or unrealistic. Surfacing these issues ahead of time lets us interpret our results correctly.

In this case we have some very helpful advice in the dataset description pointing out several things we should look for, including:

  1. Gaps in the data from technical issues (battery failures, defective equipment, transmission issues, etc.)
  2. Different station start dates
  3. Environmental factors affecting measurement
  4. Infrastructure changes affecting traffic (detours, construction, network updates)
  5. Mis-measurement due to incorrect bike lane usage (cyclists travelling in the wrong direction)

We’ll explore and fix these data quality issues so that we can conduct an accurate analysis.

Missing and inconsistent data

First we’ll calculate the uptime percentage for each station. To do this we’ll

  1. Look at the location metadata dataset to determine when a station came online
  2. Determine how many measurements it should have, if it had been working 24/7
  3. Examine how many measurements we actually have, and compare

Then we’ll flag suspicious values by checking for ones that appear significantly higher or lower than the mean.
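As a preview, here is a minimal sketch of what that flagging could look like, assuming the hourly measurements live in a :count column and aliasing tech.v3.datatype.functional as dfn; the three-standard-deviation threshold is an arbitrary choice for illustration:

(require '[tech.v3.datatype.functional :as dfn])

(defn flag-suspicious-counts [ds]
  (let [counts (:count ds)
        mean   (dfn/mean counts)
        sd     (dfn/standard-deviation counts)]
    ;; mark any measurement more than 3 standard deviations from the mean
    (tc/map-columns ds :suspicious? [:count]
                    (fn [c] (> (abs (- c mean)) (* 3.0 sd))))))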

Looking at our entire dataset together, we can start by grouping by station ID, since many of the quality metrics we care about are most relevant at the station level. Note that tablecloth handles gzipped files transparently; no extra processing is needed.

(def combined-dataset (tc/dataset transform/combined-dataset-file-name
                                  {:key-fn keyword
                                   :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))
(def grouped-by-station-id (tc/group-by combined-dataset :station-id))
grouped-by-station-id

_unnamed [38 3]:

:name :group-id :data
12-PA-SCH 0 Group: 12-PA-SCH [121287 3]:
26-LI-PUP 1 Group: 26-LI-PUP [83341 3]:
27-RE-MAR 2 Group: 27-RE-MAR [84784 3]:
03-MI-SAN-O 3 Group: 03-MI-SAN-O [84011 3]:
02-MI-JAN-S 4 Group: 02-MI-JAN-S [85106 3]:
05-FK-OBB-W 5 Group: 05-FK-OBB-W [71380 3]:
02-MI-JAN-N 6 Group: 02-MI-JAN-N [85504 3]:
18-TS-YOR-W 7 Group: 18-TS-YOR-W [85504 3]:
03-MI-SAN-W 8 Group: 03-MI-SAN-W [84032 3]:
19-TS-MON 9 Group: 19-TS-MON [84756 3]:
…
14-CW-JU-W 27 Group: 14-CW-JU-W [21334 3]:
14-CW-JU-O 28 Group: 14-CW-JU-O [21334 3]:
01-MI-AL-W 29 Group: 01-MI-AL-W [17537 3]:
03-SP-NO-O 30 Group: 03-SP-NO-O [948 3]:
04-MI-NO 31 Group: 04-MI-NO [10617 3]:
03-SP-NO-W 32 Group: 03-SP-NO-W [948 3]:
11-PA-SE-N 33 Group: 11-PA-SE-N [17542 3]:
16-SP-NO-W 34 Group: 16-SP-NO-W [8783 3]:
16-SP-NO-O 35 Group: 16-SP-NO-O [8783 3]:
09-PA-SA-S 36 Group: 09-PA-SA-S [6084 3]:
07-FK-ST 37 Group: 07-FK-ST [6096 3]:

There are 38 groups, but only 35 rows in our location metadata sheet:

(def location-info-ds
  (tc/dataset explore/location-info-file-name {:key-fn keyword
                                               :parser-fn {:installed [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))
(tc/row-count location-info-ds)
35

First, we should reconcile this. Clojure has some useful built-in helpers: clojure.data/diff returns a tuple of [things-only-in-a things-only-in-b things-in-both], which we can use to spot the discrepancy.
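For example, on two small sets:

(data/diff #{1 2 3} #{2 3 4})
;; => [#{1} #{4} #{3 2}]

Applying it to our two sets of station ids: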

(data/diff (set (:name grouped-by-station-id)) (set (:station-id location-info-ds)))
[#{"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"}
 #{"17-SK-BRE-O" "17-SK-BRE-W"}
 #{"01-MI-AL-W"
   "16-SP-NO-W"
   "09-PA-SA-S"
   "05-FK-OBB-W"
   "15-SP-KLO-S"
   "21-NK-MAY"
   "16-SP-NO-O"
   "13-CW-PRI"
   "04-MI-NO"
   "11-PA-SE-N"
   "14-CW-JU-O"
   "20-TS-MAR-S"
   "10-PA-BER-S"
   "05-FK-OBB-O"
   "27-RE-MAR"
   "15-SP-KLO-N"
   "02-MI-JAN-S"
   "12-PA-SCH"
   "26-LI-PUP"
   "02-MI-JAN-N"
   "14-CW-JU-W"
   "07-FK-ST"
   "06-FK-FRA-W"
   "03-MI-SAN-O"
   "23-TK-KAI"
   "10-PA-BER-N"
   "24-MH-ALB"
   "18-TS-YOR-O"
   "06-FK-FRA-O"
   "20-TS-MAR-N"
   "18-TS-YOR-W"
   "19-TS-MON"
   "03-MI-SAN-W"}]

Here we can spot the problem: it appears that two stations were mis-named. "17-SK-BRE-O" and "17-SK-BRE-W" are in the location-info dataset but do not appear at all in the timeseries data, while the timeseries contains these five station ids that are missing from the location-info dataset:

#{"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"}

It seems likely that "17-SZ-BRE-W" and "17-SZ-BRE-O" are meant to be "17-SK-BRE-W" and "17-SK-BRE-O". We can be more confident by inspecting the column headers of the raw datasets, which included the installation date in the same cell as the station id:

(->> explore/raw-datasets
     (map tc/column-names)
     (map (partial filter #(str/includes? % "17"))))
(()
 ()
 ()
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O\n01.05.2016" "17-SZ-BRE-W\n01.05.2016"))

Comparing these to the rows from the location-info dataset, we can see the installation dates are the same:

(-> location-info-ds
    (tc/select-rows (comp #{"17-SK-BRE-O" "17-SK-BRE-W"} :station-id)))

data/prepared/location-info.csv [2 5]:

:station-id :direction :lat :long :installed
17-SK-BRE-O Breitenbachplatz Ost 52.4668 13.3092 2016-05-01T00:00
17-SK-BRE-W Breitenbachplatz West 52.4672 13.3084 2016-05-01T00:00
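As an aside, the (comp #{…} :station-id) row predicate works because a Clojure set is itself a function that tests membership:

(#{"17-SK-BRE-O" "17-SK-BRE-W"} "17-SK-BRE-O")
;; => "17-SK-BRE-O" (truthy)
(#{"17-SK-BRE-O" "17-SK-BRE-W"} "12-PA-SCH")
;; => nil (falsey)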

So it’s safe to conclude that these two sets of IDs are actually the same.

There are still 3 other station ids in our timeseries data that do not appear in the location info dataset: "02-MI-AL-W", "03-SP-NO-W", and "03-SP-NO-O". We can check if there’s anything similar there:

(-> location-info-ds
    (tc/select-rows #(re-find #"MI-AL|SP-NO" (:station-id %))))

data/prepared/location-info.csv [3 5]:

:station-id :direction :lat :long :installed
01-MI-AL-W Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00
16-SP-NO-O Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00
16-SP-NO-W Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00

We can again compare these against the raw data to see if we can infer what is going on here:

(defn- show-column-names-with-substring [datasets val]
  (->> datasets
       (map tc/column-names)
       (map (partial filter #(str/includes? % val)))))
#'notebooks.6-quality-control/show-column-names-with-substring
(show-column-names-with-substring explore/raw-datasets "MI-AL")
(()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ("02-MI-AL-W 16.12.2021")
 ("02-MI-AL-W 16.12.2021")
 ("01-MI-AL-W 16.12.2021")
 ("01-MI-AL-W\n16.12.2021"))
(show-column-names-with-substring explore/raw-datasets "SP-NO")
(()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ("03-SP-NO-O 22.11.2023" "03-SP-NO-W 22.11.2023")
 ("16-SP-NO-O\n22.11.2023" "16-SP-NO-W\n22.11.2023"))

Again, this looks like a case of station ids changing over time: the original column headers suggest these are the same stations, recorded under different IDs in different years. We’ll correct this in our combined dataset. To review, we have the following mapping from station ids present in the raw data to station ids in the location info, treating the location-info ids as canonical:

(def station-id-mapping
  {"17-SZ-BRE-W" "17-SK-BRE-W"
   "17-SZ-BRE-O" "17-SK-BRE-O"
   "02-MI-AL-W" "01-MI-AL-W"
   "03-SP-NO-W" "16-SP-NO-W"
   "03-SP-NO-O" "16-SP-NO-O"})

We’ll apply this mapping to our station-id column, then verify that it worked. We can see that after the update we have the exact same collection of unique station-ids in our timeseries data as we do in the location-info metadata we were given.

(defn correct-station-ids [ds]
  (tc/update-columns ds :station-id (partial replace station-id-mapping)))
(def corrected-station-ids
  (correct-station-ids combined-dataset))
(set/difference (-> corrected-station-ids (tc/group-by :station-id) :name set)
                (set (:station-id location-info-ds)))
#{}
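This works because clojure.core/replace substitutes any element that appears as a key in the mapping and passes everything else through unchanged:

(replace {"17-SZ-BRE-W" "17-SK-BRE-W"} ["17-SZ-BRE-W" "12-PA-SCH"])
;; => ["17-SK-BRE-W" "12-PA-SCH"]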

Station reliability

Now we’ll figure out each station’s reliability. This is where tablecloth’s handling of grouped datasets shines: the regular tablecloth functions work transparently on grouped datasets, applying to each group.
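For example, row filtering on a grouped dataset is applied within each group, and tc/ungroup flattens the result back into a plain dataset. A minimal sketch, assuming the :count measurement column from our combined dataset:

(-> corrected-station-ids
    (tc/group-by [:station-id])
    (tc/select-rows #(zero? (:count %)))
    tc/ungroup
    tc/row-count)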

First we get the latest measurement time, and double-check that it’s the same for every station:

(-> corrected-station-ids
    (tc/group-by [:station-id])
    (tc/aggregate {:latest-date #(apply jt/max (:datetime %))})
    :latest-date
    distinct)
(#object[java.time.LocalDateTime 0x273ca37f "2024-12-31T23:00"])

Yes, all stations have their last measurement at 2024-12-31T23:00, so this is what we’ll use as the expected last measurement value:

(def latest
  (apply jt/max (:datetime corrected-station-ids)))
latest
#object[java.time.LocalDateTime 0x4d7fda4e "2024-12-31T23:00"]

Then get the total number of measurements for each station:

(defn- add-location-info-and-station-count [ds location-info]
  (-> ds
      (tc/group-by [:station-id])
      ;; Count the total rows in each group
      (tc/aggregate {:total-measurement-count tc/row-count})
      ;; Then join this with the location-info dataset
      (tc/inner-join location-info :station-id)))
#'notebooks.6-quality-control/add-location-info-and-station-count
(defn- add-uptime-col [ds end-time]
  (-> ds
      (tc/map-columns :expected-measurement-count
                      [:installed]
                      (fn [installation-date]
                        (jt/time-between installation-date end-time :hours)))
      (tc/map-columns :uptime
                      [:total-measurement-count :expected-measurement-count]
                      (fn [total expected]
                        (-> total
                            (/ expected)
                            (* 100.0))))))
#'notebooks.6-quality-control/add-uptime-col
(-> corrected-station-ids
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))

inner-join [35 8]:

:station-id :total-measurement-count :direction :lat :long :installed :expected-measurement-count :uptime
12-PA-SCH 121287 Schwedter Steg 52.5491 13.4004 2012-03-01T00:00 112535 107.77713600
02-MI-JAN-N 85504 Jannowitzbrücke Nord 52.5139 13.4178 2015-04-01T00:00 85511 99.99181392
02-MI-JAN-S 85106 Jannowitzbrücke Süd 52.5139 13.4176 2015-04-01T00:00 85511 99.52637672
13-CW-PRI 85489 Prinzregentenstraße 52.4881 13.3331 2015-04-01T00:00 85511 99.97427232
18-TS-YOR-O 85493 Yorckstraße Ost 52.4919 13.3735 2015-04-01T00:00 85511 99.97895008
18-TS-YOR-W 85504 Yorkstraße West 52.4923 13.3732 2015-04-01T00:00 85511 99.99181392
19-TS-MON 84756 Monumentenstraße 52.4881 13.3698 2015-05-01T00:00 84791 99.95872203
27-RE-MAR 84784 Markstraße 52.5582 13.3649 2015-05-01T00:00 84791 99.99174441
03-MI-SAN-O 84011 Invalidenstraße Ost 52.5272 13.3720 2015-06-01T00:00 84047 99.95716682
03-MI-SAN-W 84032 Invalidenstraße West 52.5277 13.3731 2015-06-01T00:00 84047 99.98215284
06-FK-FRA-O 75190 Frankfurter Allee Ost 52.5136 13.4742 2016-06-01T00:00 75263 99.90300679
06-FK-FRA-W 75256 Frankfurter Allee West 52.5138 13.4744 2016-06-01T00:00 75263 99.99069928
…
14-CW-JU-W 21334 Straße des 17. Juni West 52.5131 13.3268 2021-07-27T00:00 30095 70.88885197
14-CW-JU-O 21334 Straße des 17. Juni Ost 52.5126 13.3266 2021-07-27T00:00 30095 70.88885197
01-MI-AL-W 17905 Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00 26687 67.09259190
11-PA-SE-N 17542 Senefelder Platz Nord 52.5314 13.4124 2022-10-06T00:00 19631 89.35866741
04-MI-NO 10617 Nordufer 52.5139 13.4179 2023-10-16T00:00 10631 99.86830966
16-SP-NO-O 9731 Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00 9743 99.87683465
16-SP-NO-W 9731 Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00 9743 99.87683465
07-FK-ST 6096 Strausberger Platz 52.5189 13.4261 2024-04-22T00:00 6095 100.01640689
09-PA-SA-S 6084 Schönhauser Allee 52.5430 13.4121 2024-04-22T00:00 6095 99.81952420

These look mostly good, but we can see that two stations have an uptime of more than 100%. The most likely cause, and something we want to check for anyway, is duplicate measurements. In our combined dataset there should only ever be one count per station/time combination. We can double-check this:

Count all rows:

(tc/row-count corrected-station-ids)
2224785

Count unique rows by date/station-id combination:

(-> corrected-station-ids
    (tc/unique-by [:datetime :station-id])
    tc/row-count)
2216025

So there are 8,760 duplicate rows (exactly 365 × 24, suggesting a full year of duplicated hourly measurements). We’ll find them and see what we’re dealing with:

(comment
  ;; Note: this is how you could find the duplicates using tablecloth's
  ;; grouping and aggregating functionality. `tcutils` (included in noj)
  ;; provides a more performant helper for finding duplicates, which
  ;; we'll use here instead.
  (def duplicate-rows-by-date-station-id
    (-> corrected-station-ids
        (tc/group-by [:datetime :station-id])
        (tc/aggregate {:row-count tc/row-count})
        (tc/select-rows (comp (partial < 1) :row-count)))))

Get the combinations of date/station-id that have more than one count:

(def duplicate-rows (tcu/duplicate-rows corrected-station-ids))

We can see they’re all for a single station, which is great:

(-> duplicate-rows
    :station-id
    distinct)
("12-PA-SCH")

And since these rows are exact duplicates, we can do a very naive de-dupe on our data. From there we have a clean dataset; then we’ll add back our uptime stats and inspect:

;; note: this shadows clojure.core/dedupe within this namespace
(defn dedupe [ds]
  (tc/unique-by ds [:datetime :station-id]))
(def clean-dataset
  (-> corrected-station-ids
      dedupe))
(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))

inner-join [35 8]:

:station-id :total-measurement-count :direction :lat :long :installed :expected-measurement-count :uptime
12-PA-SCH 112527 Schwedter Steg 52.5491 13.4004 2012-03-01T00:00 112535 99.99289110
02-MI-JAN-N 85504 Jannowitzbrücke Nord 52.5139 13.4178 2015-04-01T00:00 85511 99.99181392
02-MI-JAN-S 85106 Jannowitzbrücke Süd 52.5139 13.4176 2015-04-01T00:00 85511 99.52637672
13-CW-PRI 85489 Prinzregentenstraße 52.4881 13.3331 2015-04-01T00:00 85511 99.97427232
18-TS-YOR-O 85493 Yorckstraße Ost 52.4919 13.3735 2015-04-01T00:00 85511 99.97895008
18-TS-YOR-W 85504 Yorkstraße West 52.4923 13.3732 2015-04-01T00:00 85511 99.99181392
19-TS-MON 84756 Monumentenstraße 52.4881 13.3698 2015-05-01T00:00 84791 99.95872203
27-RE-MAR 84784 Markstraße 52.5582 13.3649 2015-05-01T00:00 84791 99.99174441
03-MI-SAN-O 84011 Invalidenstraße Ost 52.5272 13.3720 2015-06-01T00:00 84047 99.95716682
03-MI-SAN-W 84032 Invalidenstraße West 52.5277 13.3731 2015-06-01T00:00 84047 99.98215284
06-FK-FRA-O 75190 Frankfurter Allee Ost 52.5136 13.4742 2016-06-01T00:00 75263 99.90300679
06-FK-FRA-W 75256 Frankfurter Allee West 52.5138 13.4744 2016-06-01T00:00 75263 99.99069928
…
14-CW-JU-W 21334 Straße des 17. Juni West 52.5131 13.3268 2021-07-27T00:00 30095 70.88885197
14-CW-JU-O 21334 Straße des 17. Juni Ost 52.5126 13.3266 2021-07-27T00:00 30095 70.88885197
01-MI-AL-W 17905 Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00 26687 67.09259190
11-PA-SE-N 17542 Senefelder Platz Nord 52.5314 13.4124 2022-10-06T00:00 19631 89.35866741
04-MI-NO 10617 Nordufer 52.5139 13.4179 2023-10-16T00:00 10631 99.86830966
16-SP-NO-O 9731 Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00 9743 99.87683465
16-SP-NO-W 9731 Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00 9743 99.87683465
07-FK-ST 6096 Strausberger Platz 52.5189 13.4261 2024-04-22T00:00 6095 100.01640689
09-PA-SA-S 6084 Schönhauser Allee 52.5430 13.4121 2024-04-22T00:00 6095 99.81952420

And now the data looks better: the duplicate-driven 107% uptime for 12-PA-SCH is gone. 07-FK-ST still reads a hair above 100%, but that is a fencepost artifact: counting the hours between the installation timestamp and the last measurement gives 6095, while hourly measurements taken at both endpoints inclusive number 6096.
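We can verify the fencepost arithmetic for 07-FK-ST directly:

(jt/time-between (jt/local-date-time 2024 4 22 0 0)
                 (jt/local-date-time 2024 12 31 23 0)
                 :hours)
;; => 6095 hours, spanning 6096 hourly timestamps inclusive of both endpoints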

We’ll do one last check, looking for suspicious values, like counts that are not whole numbers:

(-> clean-dataset
    (tc/update-columns :count (partial map #(mod % 1)))
    :count
    distinct)
(0.0)

Here we’re good: every count is a whole number. With our clean dataset, we can add some visualizations to make it easier to assess our data quality at a glance.

(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest)
    (tc/order-by :uptime)
    (plotly/base {:=width 600
                  :=title "Uptime by station"})
    (plotly/layer-bar {:=x :uptime
                       :=y :station-id})
    plotly/plot
    (assoc-in [:data 0 :orientation] "h"))
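Note the final assoc-in: plotly/plot returns the underlying Plotly specification as a plain data structure, so we can set the first trace’s :orientation to "h" directly, rendering the bars horizontally with station ids on the y axis.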

Finally, we’ll store this cleaned-up dataset and use it in our analysis going forward:

(def clean-dataset-file-name "data/prepared/cleaned-dataset.csv.gz")
(let [csv-file-name "data/prepared/cleaned-dataset.csv"]
  (-> clean-dataset
      (tc/write-csv! csv-file-name))
  (transform/compress-file csv-file-name clean-dataset-file-name))
true
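As a quick sanity check (a sketch, assuming the same parser options we used when loading the combined dataset), we can confirm that the written file round-trips to the de-duplicated row count:

(-> (tc/dataset clean-dataset-file-name
                {:key-fn keyword
                 :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}})
    tc/row-count)
;; => expected: 2216025, the unique row count from above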

All of this illustrates the importance of defining and using consistent IDs as a data publisher. Failing to do this makes it very hard for downstream consumers of your data to have confidence that they are interpreting it correctly.

source: bobkonf_2025/src/notebooks/6_quality_control.clj