Quality control

Bonus: Data quality analysis and cleaning

In any real-world dataset there are data quality issues we should address before proceeding with an analysis. If we don’t, we risk drawing incorrect conclusions because the source data may be unreliable or unrealistic. Surfacing these issues ahead of time lets us interpret our results correctly.

In this case we have some very helpful advice in the dataset description pointing out several things we should look for, including:

  1. Gaps in the data from technical issues (battery failures, defective equipment, transmission issues, etc.)
  2. Different station start dates
  3. Environmental factors affecting measurement
  4. Infrastructure changes affecting traffic (detours, construction, network updates)
  5. Mis-measurement due to incorrect bike lane usage (cyclists travelling in the wrong direction)

We’ll explore and fix these data quality issues so that we can conduct an accurate analysis.

Missing and inconsistent data

First we’ll calculate the uptime percentage for each station. To do this we’ll

  1. Look at the location metadata dataset to determine when a station came online
  2. Determine how many measurements it should have, if it had been working 24/7
  3. Examine how many measurements we actually have, and compare

Then we’ll flag suspicious values by checking for ones that appear significantly higher or lower than the mean.
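As a preview, here is a minimal sketch of what that flagging could look like, assuming the hourly measurements live in a :count column and aliasing tech.v3.datatype.functional as dfn; the three-standard-deviation threshold is an arbitrary choice for illustration:

(require '[tech.v3.datatype.functional :as dfn])

(defn flag-suspicious-counts [ds]
  (let [counts (:count ds)
        mean   (dfn/mean counts)
        sd     (dfn/standard-deviation counts)]
    ;; mark any measurement more than 3 standard deviations from the mean
    (tc/map-columns ds :suspicious? [:count]
                    (fn [c] (> (abs (- c mean)) (* 3.0 sd))))))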

Looking at our entire dataset together, we can start by grouping by station ID, since many of the quality metrics we care about are most relevant at the station level. Note that tablecloth handles gzipped files transparently; no extra processing is needed.

(def combined-dataset (tc/dataset transform/combined-dataset-file-name
                                  {:key-fn keyword
                                   :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))
(def grouped-by-station-id (tc/group-by combined-dataset :station-id))
grouped-by-station-id

_unnamed [38 3]:

:name :group-id :data
12-PA-SCH 0 Group: 12-PA-SCH [121287 3]:
26-LI-PUP 1 Group: 26-LI-PUP [83341 3]:
27-RE-MAR 2 Group: 27-RE-MAR [84784 3]:
03-MI-SAN-O 3 Group: 03-MI-SAN-O [84011 3]:
02-MI-JAN-S 4 Group: 02-MI-JAN-S [85106 3]:
05-FK-OBB-W 5 Group: 05-FK-OBB-W [71380 3]:
02-MI-JAN-N 6 Group: 02-MI-JAN-N [85504 3]:
18-TS-YOR-W 7 Group: 18-TS-YOR-W [85504 3]:
03-MI-SAN-W 8 Group: 03-MI-SAN-W [84032 3]:
19-TS-MON 9 Group: 19-TS-MON [84756 3]:
…
14-CW-JU-W 27 Group: 14-CW-JU-W [21334 3]:
14-CW-JU-O 28 Group: 14-CW-JU-O [21334 3]:
01-MI-AL-W 29 Group: 01-MI-AL-W [17537 3]:
03-SP-NO-O 30 Group: 03-SP-NO-O [948 3]:
04-MI-NO 31 Group: 04-MI-NO [10617 3]:
03-SP-NO-W 32 Group: 03-SP-NO-W [948 3]:
11-PA-SE-N 33 Group: 11-PA-SE-N [17542 3]:
16-SP-NO-W 34 Group: 16-SP-NO-W [8783 3]:
16-SP-NO-O 35 Group: 16-SP-NO-O [8783 3]:
09-PA-SA-S 36 Group: 09-PA-SA-S [6084 3]:
07-FK-ST 37 Group: 07-FK-ST [6096 3]:

There are 38 groups, but only 35 rows in our location metadata sheet:

(def location-info-ds
  (tc/dataset explore/location-info-file-name {:key-fn keyword
                                               :parser-fn {:installed [:local-date-time "yyyy-MM-dd'T'HH:mm"]}}))
(tc/row-count location-info-ds)
35

First, we should reconcile this. Clojure has some useful built-in helpers: clojure.data/diff returns a tuple of [things-only-in-a things-only-in-b things-in-both], which we can use to spot the discrepancy.
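For example, on two small sets:

(data/diff #{1 2 3} #{2 3 4})
;; => [#{1} #{4} #{3 2}]

Applying it to our two sets of station ids: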

(data/diff (set (:name grouped-by-station-id)) (set (:station-id location-info-ds)))
[#{"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"}
 #{"17-SK-BRE-O" "17-SK-BRE-W"}
 #{"01-MI-AL-W"
   "16-SP-NO-W"
   "09-PA-SA-S"
   "05-FK-OBB-W"
   "15-SP-KLO-S"
   "21-NK-MAY"
   "16-SP-NO-O"
   "13-CW-PRI"
   "04-MI-NO"
   "11-PA-SE-N"
   "14-CW-JU-O"
   "20-TS-MAR-S"
   "10-PA-BER-S"
   "05-FK-OBB-O"
   "27-RE-MAR"
   "15-SP-KLO-N"
   "02-MI-JAN-S"
   "12-PA-SCH"
   "26-LI-PUP"
   "02-MI-JAN-N"
   "14-CW-JU-W"
   "07-FK-ST"
   "06-FK-FRA-W"
   "03-MI-SAN-O"
   "23-TK-KAI"
   "10-PA-BER-N"
   "24-MH-ALB"
   "18-TS-YOR-O"
   "06-FK-FRA-O"
   "20-TS-MAR-N"
   "18-TS-YOR-W"
   "19-TS-MON"
   "03-MI-SAN-W"}]

Here we can spot the problem: it appears that two stations were mis-named. "17-SK-BRE-O" and "17-SK-BRE-W" are in the location-info dataset but do not appear at all in the timeseries data, while the timeseries contains these five station ids that are missing from the location-info dataset:

#{"17-SZ-BRE-W" "02-MI-AL-W" "03-SP-NO-W" "03-SP-NO-O" "17-SZ-BRE-O"}

It seems likely that "17-SZ-BRE-W" and "17-SZ-BRE-O" are meant to be "17-SK-BRE-W" and "17-SK-BRE-O". We can be more confident by inspecting the column headers of the raw datasets, which included the installation date in the same cell as the station id:

(->> explore/raw-datasets
     (map tc/column-names)
     (map (partial filter #(str/includes? % "17"))))
(()
 ()
 ()
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O 01.05.2016" "17-SZ-BRE-W 01.05.2016")
 ("17-SZ-BRE-O\n01.05.2016" "17-SZ-BRE-W\n01.05.2016"))

Comparing these to the rows from the location-info dataset, we can see the installation dates are the same:

(-> location-info-ds
    (tc/select-rows (comp #{"17-SK-BRE-O" "17-SK-BRE-W"} :station-id)))

data/prepared/location-info.csv [2 5]:

:station-id :direction :lat :long :installed
17-SK-BRE-O Breitenbachplatz Ost 52.4668 13.3092 2016-05-01T00:00
17-SK-BRE-W Breitenbachplatz West 52.4672 13.3084 2016-05-01T00:00
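As an aside, the (comp #{…} :station-id) row predicate works because a Clojure set is itself a function that tests membership:

(#{"17-SK-BRE-O" "17-SK-BRE-W"} "17-SK-BRE-O")
;; => "17-SK-BRE-O" (truthy)
(#{"17-SK-BRE-O" "17-SK-BRE-W"} "12-PA-SCH")
;; => nil (falsey)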

So it’s safe to conclude that these two sets of IDs are actually the same.

There are still 3 other station ids in our timeseries data that do not appear in the location info dataset: "02-MI-AL-W", "03-SP-NO-W", and "03-SP-NO-O". We can check if there’s anything similar there:

(-> location-info-ds
    (tc/select-rows #(re-find #"MI-AL|SP-NO" (:station-id %))))

data/prepared/location-info.csv [3 5]:

:station-id :direction :lat :long :installed
01-MI-AL-W Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00
16-SP-NO-O Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00
16-SP-NO-W Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00

We can again compare these against the raw data to see if we can infer what is going on here:

(defn- show-column-names-with-substring [datasets val]
  (->> datasets
       (map tc/column-names)
       (map (partial filter #(str/includes? % val)))))
#'notebooks.6-quality-control/show-column-names-with-substring
(show-column-names-with-substring explore/raw-datasets "MI-AL")
(()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ("02-MI-AL-W 16.12.2021")
 ("02-MI-AL-W 16.12.2021")
 ("01-MI-AL-W 16.12.2021")
 ("01-MI-AL-W\n16.12.2021"))
(show-column-names-with-substring explore/raw-datasets "SP-NO")
(()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ()
 ("03-SP-NO-O 22.11.2023" "03-SP-NO-W 22.11.2023")
 ("16-SP-NO-O\n22.11.2023" "16-SP-NO-W\n22.11.2023"))

Again, this looks like a case of station ids changing over time: the original column headers suggest these are the same stations, recorded under different IDs in different years. We’ll correct this in our combined dataset. To review, we have the following mapping from station ids present in the raw data to station ids in the location info, treating the location-info ids as canonical:

(def station-id-mapping
  {"17-SZ-BRE-W" "17-SK-BRE-W"
   "17-SZ-BRE-O" "17-SK-BRE-O"
   "02-MI-AL-W" "01-MI-AL-W"
   "03-SP-NO-W" "16-SP-NO-W"
   "03-SP-NO-O" "16-SP-NO-O"})

We’ll apply this mapping to our station-id column, then verify that it worked. We can see that after the update we have the exact same collection of unique station-ids in our timeseries data as we do in the location-info metadata we were given.

(defn correct-station-ids [ds]
  (tc/update-columns ds :station-id (partial replace station-id-mapping)))
(def corrected-station-ids
  (correct-station-ids combined-dataset))
(set/difference (-> corrected-station-ids (tc/group-by :station-id) :name set)
                (set (:station-id location-info-ds)))
#{}
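This works because clojure.core/replace substitutes any element that appears as a key in the mapping and passes everything else through unchanged:

(replace {"17-SZ-BRE-W" "17-SK-BRE-W"} ["17-SZ-BRE-W" "12-PA-SCH"])
;; => ["17-SK-BRE-W" "12-PA-SCH"]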

Station reliability

Now we’ll figure out each station’s reliability. This is where tablecloth’s handling of grouped datasets shines: the regular tablecloth functions work transparently on grouped datasets, applying to each group.
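For example, row filtering on a grouped dataset is applied within each group, and tc/ungroup flattens the result back into a plain dataset. A minimal sketch, assuming the :count measurement column from our combined dataset:

(-> corrected-station-ids
    (tc/group-by [:station-id])
    (tc/select-rows #(zero? (:count %)))
    tc/ungroup
    tc/row-count)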

First we get the latest measurement time, and double-check that it’s the same for every station:

(-> corrected-station-ids
    (tc/group-by [:station-id])
    (tc/aggregate {:latest-date #(apply jt/max (:datetime %))})
    :latest-date
    distinct)
(#object[java.time.LocalDateTime 0x273ca37f "2024-12-31T23:00"])

Yes, all stations have their last measurement at 2024-12-31T23:00, so this is what we’ll use as the expected last measurement value:

(def latest
  (apply jt/max (:datetime corrected-station-ids)))
latest
#object[java.time.LocalDateTime 0x4d7fda4e "2024-12-31T23:00"]

Then get the total number of measurements for each station:

(defn- add-location-info-and-station-count [ds location-info]
  (-> ds
      (tc/group-by [:station-id])
      ;; Count the total rows in each group
      (tc/aggregate {:total-measurement-count tc/row-count})
      ;; Then join this with the location-info dataset
      (tc/inner-join location-info :station-id)))
#'notebooks.6-quality-control/add-location-info-and-station-count
(defn- add-uptime-col [ds end-time]
  (-> ds
      (tc/map-columns :expected-measurement-count
                      [:installed]
                      (fn [installation-date]
                        (jt/time-between installation-date end-time :hours)))
      (tc/map-columns :uptime
                      [:total-measurement-count :expected-measurement-count]
                      (fn [total expected]
                        (-> total
                            (/ expected)
                            (* 100.0))))))
#'notebooks.6-quality-control/add-uptime-col
(-> corrected-station-ids
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))

inner-join [35 8]:

:station-id :total-measurement-count :direction :lat :long :installed :expected-measurement-count :uptime
12-PA-SCH 121287 Schwedter Steg 52.5491 13.4004 2012-03-01T00:00 112535 107.77713600
02-MI-JAN-N 85504 Jannowitzbrücke Nord 52.5139 13.4178 2015-04-01T00:00 85511 99.99181392
02-MI-JAN-S 85106 Jannowitzbrücke Süd 52.5139 13.4176 2015-04-01T00:00 85511 99.52637672
13-CW-PRI 85489 Prinzregentenstraße 52.4881 13.3331 2015-04-01T00:00 85511 99.97427232
18-TS-YOR-O 85493 Yorckstraße Ost 52.4919 13.3735 2015-04-01T00:00 85511 99.97895008
18-TS-YOR-W 85504 Yorkstraße West 52.4923 13.3732 2015-04-01T00:00 85511 99.99181392
19-TS-MON 84756 Monumentenstraße 52.4881 13.3698 2015-05-01T00:00 84791 99.95872203
27-RE-MAR 84784 Markstraße 52.5582 13.3649 2015-05-01T00:00 84791 99.99174441
03-MI-SAN-O 84011 Invalidenstraße Ost 52.5272 13.3720 2015-06-01T00:00 84047 99.95716682
03-MI-SAN-W 84032 Invalidenstraße West 52.5277 13.3731 2015-06-01T00:00 84047 99.98215284
06-FK-FRA-O 75190 Frankfurter Allee Ost 52.5136 13.4742 2016-06-01T00:00 75263 99.90300679
06-FK-FRA-W 75256 Frankfurter Allee West 52.5138 13.4744 2016-06-01T00:00 75263 99.99069928
…
14-CW-JU-W 21334 Straße des 17. Juni West 52.5131 13.3268 2021-07-27T00:00 30095 70.88885197
14-CW-JU-O 21334 Straße des 17. Juni Ost 52.5126 13.3266 2021-07-27T00:00 30095 70.88885197
01-MI-AL-W 17905 Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00 26687 67.09259190
11-PA-SE-N 17542 Senefelder Platz Nord 52.5314 13.4124 2022-10-06T00:00 19631 89.35866741
04-MI-NO 10617 Nordufer 52.5139 13.4179 2023-10-16T00:00 10631 99.86830966
16-SP-NO-O 9731 Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00 9743 99.87683465
16-SP-NO-W 9731 Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00 9743 99.87683465
07-FK-ST 6096 Strausberger Platz 52.5189 13.4261 2024-04-22T00:00 6095 100.01640689
09-PA-SA-S 6084 Schönhauser Allee 52.5430 13.4121 2024-04-22T00:00 6095 99.81952420

These look mostly good, but we can see that two stations have an uptime of more than 100%. The most likely cause, and something we want to check for anyway, is duplicate measurements. In our combined dataset there should only ever be one count per station/time combination. We can double-check this:

Count all rows:

(tc/row-count corrected-station-ids)
2224785

Count unique rows by date/station-id combination:

(-> corrected-station-ids
    (tc/unique-by [:datetime :station-id])
    tc/row-count)
2216025

So there are 8,760 duplicate rows (exactly 365 × 24, suggesting a full year of duplicated hourly measurements). We’ll find them and see what we’re dealing with:

(comment
  ;; Note: this is how you could find the duplicates using tablecloth's
  ;; grouping and aggregating functionality. `tcutils` (included in noj)
  ;; provides a more performant helper for finding duplicates, which
  ;; we'll use here instead.
  (def duplicate-rows-by-date-station-id
    (-> corrected-station-ids
        (tc/group-by [:datetime :station-id])
        (tc/aggregate {:row-count tc/row-count})
        (tc/select-rows (comp (partial < 1) :row-count)))))

Get the combinations of date/station-id that have more than one count:

(def duplicate-rows (tcu/duplicate-rows corrected-station-ids))

We can see they’re all for a single station, which is great:

(-> duplicate-rows
    :station-id
    distinct)
("12-PA-SCH")

And since these rows are exact duplicates, we can do a very naive de-dupe on our data. From there we have a clean dataset; then we’ll add back our uptime stats and inspect:

;; note: this shadows clojure.core/dedupe within this namespace
(defn dedupe [ds]
  (tc/unique-by ds [:datetime :station-id]))
(def clean-dataset
  (-> corrected-station-ids
      dedupe))
(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest))

inner-join [35 8]:

:station-id :total-measurement-count :direction :lat :long :installed :expected-measurement-count :uptime
12-PA-SCH 112527 Schwedter Steg 52.5491 13.4004 2012-03-01T00:00 112535 99.99289110
02-MI-JAN-N 85504 Jannowitzbrücke Nord 52.5139 13.4178 2015-04-01T00:00 85511 99.99181392
02-MI-JAN-S 85106 Jannowitzbrücke Süd 52.5139 13.4176 2015-04-01T00:00 85511 99.52637672
13-CW-PRI 85489 Prinzregentenstraße 52.4881 13.3331 2015-04-01T00:00 85511 99.97427232
18-TS-YOR-O 85493 Yorckstraße Ost 52.4919 13.3735 2015-04-01T00:00 85511 99.97895008
18-TS-YOR-W 85504 Yorkstraße West 52.4923 13.3732 2015-04-01T00:00 85511 99.99181392
19-TS-MON 84756 Monumentenstraße 52.4881 13.3698 2015-05-01T00:00 84791 99.95872203
27-RE-MAR 84784 Markstraße 52.5582 13.3649 2015-05-01T00:00 84791 99.99174441
03-MI-SAN-O 84011 Invalidenstraße Ost 52.5272 13.3720 2015-06-01T00:00 84047 99.95716682
03-MI-SAN-W 84032 Invalidenstraße West 52.5277 13.3731 2015-06-01T00:00 84047 99.98215284
06-FK-FRA-O 75190 Frankfurter Allee Ost 52.5136 13.4742 2016-06-01T00:00 75263 99.90300679
06-FK-FRA-W 75256 Frankfurter Allee West 52.5138 13.4744 2016-06-01T00:00 75263 99.99069928
…
14-CW-JU-W 21334 Straße des 17. Juni West 52.5131 13.3268 2021-07-27T00:00 30095 70.88885197
14-CW-JU-O 21334 Straße des 17. Juni Ost 52.5126 13.3266 2021-07-27T00:00 30095 70.88885197
01-MI-AL-W 17905 Karl-Marx-Allee 52.5219 13.4175 2021-12-16T00:00 26687 67.09259190
11-PA-SE-N 17542 Senefelder Platz Nord 52.5314 13.4124 2022-10-06T00:00 19631 89.35866741
04-MI-NO 10617 Nordufer 52.5139 13.4179 2023-10-16T00:00 10631 99.86830966
16-SP-NO-O 9731 Nonnendammallee Ost 52.5379 13.2479 2023-11-22T00:00 9743 99.87683465
16-SP-NO-W 9731 Nonnendammallee West 52.5384 13.2470 2023-11-22T00:00 9743 99.87683465
07-FK-ST 6096 Strausberger Platz 52.5189 13.4261 2024-04-22T00:00 6095 100.01640689
09-PA-SA-S 6084 Schönhauser Allee 52.5430 13.4121 2024-04-22T00:00 6095 99.81952420

And now the data looks better: the duplicate-driven 107% uptime for 12-PA-SCH is gone. 07-FK-ST still reads a hair above 100%, but that is a fencepost artifact: counting the hours between the installation timestamp and the last measurement gives 6095, while hourly measurements taken at both endpoints inclusive number 6096.
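We can verify the fencepost arithmetic for 07-FK-ST directly:

(jt/time-between (jt/local-date-time 2024 4 22 0 0)
                 (jt/local-date-time 2024 12 31 23 0)
                 :hours)
;; => 6095 hours, spanning 6096 hourly timestamps inclusive of both endpoints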

We’ll do one last check, looking for suspicious values, like counts that are not whole numbers:

(-> clean-dataset
    (tc/update-columns :count (partial map #(mod % 1)))
    :count
    distinct)
(0.0)

Here we’re good: every count is a whole number. With our clean dataset, we can add some visualizations to make it easier to assess our data quality at a glance.

(-> clean-dataset
    (add-location-info-and-station-count location-info-ds)
    (add-uptime-col latest)
    (tc/order-by :uptime)
    (plotly/base {:=width 600
                  :=title "Uptime by station"})
    (plotly/layer-bar {:=x :uptime
                       :=y :station-id})
    plotly/plot
    (assoc-in [:data 0 :orientation] "h"))
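Note the final assoc-in: plotly/plot returns the underlying Plotly specification as a plain data structure, so we can set the first trace’s :orientation to "h" directly, rendering the bars horizontally with station ids on the y axis.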

Finally, we’ll store this cleaned-up dataset and use it in our analysis going forward:

(def clean-dataset-file-name "data/prepared/cleaned-dataset.csv.gz")
(let [csv-file-name "data/prepared/cleaned-dataset.csv"]
  (-> clean-dataset
      (tc/write-csv! csv-file-name))
  (transform/compress-file csv-file-name clean-dataset-file-name))
true
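As a quick sanity check (a sketch, assuming the same parser options we used when loading the combined dataset), we can confirm that the written file round-trips to the de-duplicated row count:

(-> (tc/dataset clean-dataset-file-name
                {:key-fn keyword
                 :parser-fn {:datetime [:local-date-time "yyyy-MM-dd'T'HH:mm"]}})
    tc/row-count)
;; => expected: 2216025, the unique row count from above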

All of this illustrates the importance of defining and using consistent IDs as a data publisher. Failing to do this makes it very hard for downstream consumers of your data to have confidence that they are interpreting it correctly.

source: bobkonf_2025/src/notebooks/6_quality_control.clj