Explore and understand

Once we have access to our raw data file, we’ll want to explore it and see what we need to do to clean it up. This requires some initial understanding of what our dataset contains.

We’re starting with an excel workbook of bicycle traffic data in Berlin. To learn more about where we found this data and how to download it, see the “Extract” chapter that we skipped over. In the interest of time, we’ll start here and learn:

how to load an excel file as a tablecloth dataset
how to explore and work with Excel workbooks that contain multiple sheets
how to handle Excel’s date format in Clojure
how to extract metadata from reference sheets
how to clean, transform, and pivot messy data to make tidy data

Loading xlsx data in Clojure

To load xlsx data we’ll need to require fastexcel:

(require '[tech.v3.libs.fastexcel :as xlsx])

For consistency, we’ll use the same file name as the one specified in the previous notebook, which means we need to require it, too:

(require '[notebooks.3-extract :as extract])

Attempting to load the excel file as a dataset directly shows us that there are actually 16 datasets contained in this workbook:

(tc/dataset extract/raw-data-file-name)

_unnamed [1 2]:

:\(value \| :\)error
data/prepared/raw-data.xlsx	Multiple (16) datasets found in file

So instead we’ll make a seq of datasets, loading each sheet as its own dataset:

(def raw-datasets
  (xlsx/workbook->datasets extract/raw-data-file-name))

Exploring the Excel sheets

We can see what our sheets are named like this:

(map tc/dataset-name raw-datasets)

("Hinweise"
 "Legende"
 "Standortdaten"
 "Jahresdatei 2012"
 "Jahresdatei 2013"
 "Jahresdatei 2014"
 "Jahresdatei 2015"
 "Jahresdatei 2016"
 "Jahresdatei 2017"
 "Jahresdatei 2018"
 "Jahresdatei 2019"
 "Jahresdatei 2020"
 "Jahresdatei 2021"
 "Jahresdatei 2022"
 "Jahresdatei 2023"
 "Jahresdatei 2024")

The first three are reference datasets and the rest are a time series. It’s pretty common to see Excel workbooks set up like this, with a few sheets that are reference datasets and the rest containing the actual data. We’ll extract some metadata out of these first three datasets to use in our analysis. The first dataset just contains a big description, formatted for aesthetics rather than machine-readability, so we’ll have to dig in to the one random column containing the text to get the content out.

(def info-ds (first raw-datasets))

First we’ll drop all the columns where every value is nil, which we can see leaves us with a single column:

(tc/drop-columns info-ds (fn [col] (every? nil? (info-ds col))))

Hinweise [40 1]:

column-1
Datenbereitsteller:
Senatsverwaltung für Mobilität, Verkehr, Klimaschutz und Umwelt
Abt. VI - Verkehrsmanagement
E-Mail: Verkehrsmanagement@SenMVKU.Berlin.de


Hinweise:

Auf den folgenden Tabellenblättern finden Sie die Rohdaten der Zählstellen für den Radverkehr in Berlin. Rohdaten sind Daten, die direkt aus den Zählgeräten der Dauerzählstellen ausgelesen und nicht weiter bearbeitet werden.
Die Daten werden einmal jährlich nach Abschluss des Zähljahres zusammengestellt und veröffentlicht.

Bei der Auswertung und Interpretation der Daten sind folgende Randbedingungen zu beachten:
…
3. in interne und externe Geschäftsprozesse, Produkte und Anwendungen in öffentlichen und nicht öffentlichen elektronischen
Netzwerken eingebunden werden.

(2) Die Daten unterliegen der folgenden Lizenz:

Datenlizenz Deutschland – Zero – Version 2.0 (dl-de-zero-2.0)
https://www.govdata.de/dl-de/zero-2-0


(3) Veränderungen, Bearbeitungen, neue Gestaltungen oder sonstige Abwandlungen sind im Quellenvermerk mit dem Hinweis
zu versehen, dass die Daten geändert wurden.

(4) Die Bereitstellung der Daten erfolgt unter Ausschluss jeder Haftung für direkte und indirekte Schäden durch die Datennutzung. Der Eigentümer der Daten bemüht sich, aussagekräftige Informationen zur Qualtität der Daten, z.B. hinsichtlich Aktualität,
Richtigkeit, Vollständigkeit und Genauigkeit der veröffentlichen Daten bereitzustellen, wobei keine Garantie hinsichtlich der Qualität übernommen wird. Der Datenabnehmer haftet für alle Schäden, die durch die Verarbeitung oder Nutzung der Daten durch den Datenabnehmer entstehen. Der Datenabnehmer stellt den Datenbereitsteller von allen Ansprüchen Dritter frei.

Then we’ll join all the non-nil values in this column into a single wall of text, and store it as the description for this dataset.

(->> (tc/drop-columns info-ds (fn [col] (every? nil? (info-ds col))))
     tc/rows
     flatten
     (remove nil?)
     (map str/trim)
     (str/join "\n")
     (spit "data/prepared/description-de.txt"))

nil

You’ll notice this description is in German. As a quick bonus step, we can translate it to English using a free online translation API. To follow this particular example you’ll need an API key from DeepL. This code expects the key to be GPG encrypted in secrets/translate-api-key.txt.gpg.

(comment
  (require '[clj-http.client :as http]
           '[clojure.data.json :as json])

  (let [key (u/decrypt-gpg-key "secrets/translate-api-key.txt.gpg")
        response (http/post "https://api-free.deepl.com/v2/translate"
                            {:body (json/write-str {:text [(slurp "data/prepared/description-de.txt")]
                                                    :target_lang "EN"})
                             :headers {"Authorization" (str "DeepL-Auth-Key " key)}
                             :content-type :json})
        english (-> response :body json/read-str (get-in ["translations" 0 "text"]))]
    (spit "data/prepared/description-en.txt"
          (str "NOTE: THIS CONTENT IS A MACHINE TRANSLATION OF THE ORIGINAL IN data/prepared/description-de.txt\n"
               "provided only for convenience with no guarantee of accuracy.\n\n"
               english))))

Next we’ll look at the “Legende” sheet to see what we can learn about our dataset from it.

(second raw-datasets)

Legende [3 3]:

column-0	0.0	Zeiträume, in denen keine (0) oder unplausible Zähldaten an den Zählstellenaufgetreten sind

		Zeiträume, in denen keine Zähldaten an den Zählstellen vorhanden sind
	123.0	Zeiträume, in denen Datenlücken/fehlerhafte Daten durch Rekonstruktion aufgefüllt wurden

This does not appear to be very useful. We’ll come back to it if it seems like it might be interesting later on.

Next is the “Standortdaten” sheet:

(nth raw-datasets 2)

Standortdaten [35 5]:

Zählstelle	Beschreibung - Fahrtrichtung	Breitengrad	Längengrad	Installationsdatum
12-PA-SCH	Schwedter Steg	52.54907100	13.40036684	40969.0
02-MI-JAN-N	Jannowitzbrücke Nord	52.51393216	13.41783461	42095.0
02-MI-JAN-S	Jannowitzbrücke Süd	52.51394299	13.41761107	42095.0
13-CW-PRI	Prinzregentenstraße	52.48813558	13.33311991	42095.0
18-TS-YOR-O	Yorckstraße Ost	52.49193796	13.37347227	42095.0
18-TS-YOR-W	Yorkstraße West	52.49228000	13.37321000	42095.0
19-TS-MON	Monumentenstraße	52.48811700	13.36978500	42125.0
27-RE-MAR	Markstraße	52.55819000	13.36494350	42125.0
03-MI-SAN-O	Invalidenstraße Ost	52.52717690	13.37201589	42156.0
03-MI-SAN-W	Invalidenstraße West	52.52768602	13.37310486	42156.0
…	…	…	…	…
06-FK-FRA-O	Frankfurter Allee Ost	52.51358400	13.47424200	42522.0
06-FK-FRA-W	Frankfurter Allee West	52.51379000	13.47441400	42522.0
14-CW-JU-W	Straße des 17. Juni West	52.51313259	13.32684517	44404.0
14-CW-JU-O	Straße des 17. Juni Ost	52.51259000	13.32663000	44404.0
01-MI-AL-W	Karl-Marx-Allee	52.52190716	13.41753602	44546.0
11-PA-SE-N	Senefelder Platz Nord	52.53137962	13.41235473	44840.0
04-MI-NO	Nordufer	52.51393300	13.41786400	45215.0
16-SP-NO-O	Nonnendammallee Ost	52.53790100	13.24791400	45252.0
16-SP-NO-W	Nonnendammallee West	52.53841400	13.24699600	45252.0
07-FK-ST	Strausberger Platz	52.51894100	13.42613900	45404.0
09-PA-SA-S	Schönhauser Allee	52.54304400	13.41206700	45404.0

This is useful information about the different tracker locations. It also reveals our first encounter with one very annoying thing about working with Excel files – dates. Very annoyingly, dates in excel are stored as the number of days since 1900-1-1. We’ll fix this and do some other basic transformations to this dataset, then store it as a CSV file.

(def location-info-file-name "data/prepared/location-info.csv")

(-> raw-datasets
    (nth 2)
    ;; It's idiomatic to use keywords as column names in Clojure, so we'll translate these to English and make them keywords
    (tc/rename-columns {"Zählstelle" :station-id
                        "Beschreibung - Fahrtrichtung" :direction
                        "Breitengrad" :lat
                        "Längengrad" :long
                        "Installationsdatum" :installed})
    ;; We'll parse the installation date values as dates based on Excel's idiosyncratic date handling
    (tc/update-columns :installed (partial map dates/parse-excel-cell-as-date))
    ;; Also update the lat/long precision to be realistic (https://xkcd.com/2170/)
    (tc/update-columns [:lat :long] #(-> % (tcc/* 10000.0) tcc/round (tcc// 10000.0)))
    ;; Save this transformed data
    (tc/write-csv! location-info-file-name))

We can also quickly plot this to visualize the locations of the stations as our first introduction to tableplot, the visualization library we’ll be using throughout this workshop. To overlay the points on a map of Berlin, we’ll need a Mapbox token.

Here’s how we can see the locations of the stations on a map, starting from the cleaned up dataset we just saved:

Some basic visualization

(let [token (u/decrypt-gpg-key "secrets/mapbox-api-token.txt.gpg")]
  (-> location-info-file-name
      (tc/dataset {:key-fn keyword}) ;; Note this is how we convert the column names to keywords on load
      (plotly/base {:=coordinates :geo
                    :=lat :lat
                    :=lon :long
                    :=width 1000
                    :=height 700})
      (plotly/layer-point {:=name "Bike traffic counting stations"
                           :=mark-color "orange"
                           :=mark-size 10})
      plotly/plot
      (assoc-in [:data 0 :type] "scattermapbox")
      (assoc-in [:layout :mapbox]
                {:style "carto-positron"          ;; Choose a Mapbox style
                 :center {:lat 52.51, :lon 13.37} ;; Center the map on Berlin
                 :zoom 10.5                       ;; Adjust zoom level
                 :accesstoken token}))            ;; Use your Mapbox token
  )

Cleaning up the timeseries data

Now we get into the interesting work. Inspecting one of the annual data files we can see that this dataset is stored in a very wide format with bad column names

(def data-2012 (nth raw-datasets 3))

data-2012

Jahresdatei 2012 [17544 27]:

Zählstelle Inbetriebnahme	02-MI-JAN-N 01.04.2015	02-MI-JAN-S 01.04.2015	03-MI-SAN-O 01.06.2015	03-MI-SAN-W 01.06.2015	05-FK-OBB-O 01.06.2015	05-FK-OBB-W 01.06.2015	06-FK-FRA-O 01.06.2016	06-FK-FRA-W 01.06.2016	10-PA-BER-N 01.05.2016	10-PA-BER-S 01.05.2016	12-PA-SCH 01.03.2012	13-CW-PRI 01.04.2015	15-SP-KLO-N 01.06.2016	15-SP-KLO-S 01.06.2016	17-SZ-BRE-O 01.05.2016	17-SZ-BRE-W 01.05.2016	18-TS-YOR-O 01.04.2015	18-TS-YOR-W 01.04.2015	19-TS-MON 01.05.2015	20-TS-MAR-N 01.05.2016	20-TS-MAR-S 01.05.2016	21-NK-MAY 01.05.2016	23-TK-KAI 01.05.2016	24-MH-ALB 01.07.2015	26-LI-PUP 01.06.2015	27-RE-MAR 01.05.2015
40909.00000000
40909.04166667
40909.08333333
40909.12500000
40909.16666667
40909.20833333
40909.25000000
40909.29166667
40909.33333333
40909.37500000
…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…	…
41639.54166667											108.0
41639.58333333											110.0
41639.62500000											83.0
41639.66666667											65.0
41639.70833333											26.0
41639.75000000											22.0
41639.79166667											28.0
41639.83333333											26.0
41639.87500000											12.0
41639.91666667											18.0
41639.95833333											20.0

This data is a mess. The one saving grace is that all years of data are messy in the same way. We can see all of the timeseries datasets have the same column names:

(map tc/column-names raw-datasets)

(("column-0" "column-1" "column-2" "column-3")
 ("column-0"
  0.0
  "Zeiträume, in denen keine (0) oder unplausible Zähldaten an den Zählstellenaufgetreten sind")
 ("Zählstelle"
  "Beschreibung - Fahrtrichtung"
  "Breitengrad"
  "Längengrad"
  "Installationsdatum")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015"
  "14-CW-JU-O 27.07.2021"
  "14-CW-JU-W 27.07.2021"
  "02-MI-AL-W 16.12.2021"
  "column-30"
  "column-31")
 ("Zählstelle        Inbetriebnahme"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015"
  "14-CW-JU-O 27.07.2021"
  "14-CW-JU-W 27.07.2021"
  "02-MI-AL-W 16.12.2021"
  "02-PA-SE-N\n06.10.2022")
 ("Zählstelle        Inbetriebnahme"
  "01-MI-AL-W 16.12.2021"
  "02-MI-JAN-N 01.04.2015"
  "02-MI-JAN-S 01.04.2015"
  "03-MI-SAN-O 01.06.2015"
  "03-MI-SAN-W 01.06.2015"
  "03-SP-NO-O 22.11.2023"
  "03-SP-NO-W 22.11.2023"
  "04-MI-NO\n16.10.2023"
  "05-FK-OBB-O 01.06.2015"
  "05-FK-OBB-W 01.06.2015"
  "06-FK-FRA-O 01.06.2016"
  "06-FK-FRA-W 01.06.2016"
  "10-PA-BER-N 01.05.2016"
  "10-PA-BER-S 01.05.2016"
  "11-PA-SE-N\n06.10.2022"
  "12-PA-SCH 01.03.2012"
  "13-CW-PRI 01.04.2015"
  "14-CW-JU-O 27.07.2021"
  "14-CW-JU-W 27.07.2021"
  "15-SP-KLO-N 01.06.2016"
  "15-SP-KLO-S 01.06.2016"
  "17-SZ-BRE-O 01.05.2016"
  "17-SZ-BRE-W 01.05.2016"
  "18-TS-YOR-O 01.04.2015"
  "18-TS-YOR-W 01.04.2015"
  "19-TS-MON 01.05.2015"
  "20-TS-MAR-N  01.05.2016"
  "20-TS-MAR-S 01.05.2016"
  "21-NK-MAY 01.05.2016"
  "23-TK-KAI 01.05.2016"
  "24-MH-ALB 01.07.2015"
  "26-LI-PUP 01.06.2015"
  "27-RE-MAR 01.05.2015"
  "column-34"
  "column-35"
  "column-36"
  "column-37")
 ("Zählstelle        Inbetriebnahme"
  "01-MI-AL-W\n16.12.2021"
  "02-MI-JAN-N\n01.04.2015"
  "02-MI-JAN-S\n01.04.2015"
  "03-MI-SAN-O\n01.06.2015"
  "03-MI-SAN-W\n01.06.2015"
  "04-MI-NO\n16.10.2023"
  "05-FK-OBB-O\n01.06.2015"
  "05-FK-OBB-W\n01.06.2015"
  "06-FK-FRA-O\n01.06.2016"
  "06-FK-FRA-W\n01.06.2016"
  "07-FK-ST\n22.04.2024"
  "09-PA-SA-S\n22.04.2024"
  "10-PA-BER-N\n01.05.2016"
  "10-PA-BER-S\n01.05.2016"
  "11-PA-SE-N\n06.10.2022"
  "12-PA-SCH\n01.03.2012"
  "13-CW-PRI\n01.04.2015"
  "14-CW-JU-O\n27.07.2021"
  "14-CW-JU-W\n27.07.2021"
  "15-SP-KLO-N\n01.06.2016"
  "15-SP-KLO-S\n01.06.2016"
  "16-SP-NO-O\n22.11.2023"
  "16-SP-NO-W\n22.11.2023"
  "17-SZ-BRE-O\n01.05.2016"
  "17-SZ-BRE-W\n01.05.2016"
  "18-TS-YOR-O\n01.04.2015"
  "18-TS-YOR-W\n01.04.2015"
  "19-TS-MON\n01.05.2015"
  "20-TS-MAR-N\n01.05.2016"
  "20-TS-MAR-S\n01.05.2016"
  "21-NK-MAY\n01.05.2016"
  "23-TK-KAI\n01.05.2016"
  "24-MH-ALB\n01.07.2015"
  "26-LI-PUP\n01.06.2015"
  "27-RE-MAR\n01.05.2015"))

We’ll figure out how to parse a single dataset first here, then make it more robust and apply the transformations to every year.

We can see the data is very sparse, there are columns for every station in every year, even though most stations were not installed in the early years.

We’ll clean up the data by dropping the empty columns, renaming the columns, and making it tidy (i.e. one variable per column and one observation per row):

(let [;; Get the names of all empty columns that are not empty
      non-empty-columns (-> data-2012
                            tc/info
                            (tc/select-rows (comp pos? :n-valid))
                            :col-name)]
  (-> data-2012
      ;; Select only those
      (tc/select-columns (set non-empty-columns))
      ;; Now fix the column names
      ;; First column is always the date
      (tc/rename-columns {"Zählstelle        Inbetriebnahme" :datetime})
      ;; Now we'll deal with the excel dates
      (tc/update-columns :datetime (partial map dates/parse-excel-cell-as-date))
      ;; Then we'll strip the date part off the end of the rest of the column names. Regex isn't super efficient, but that's fine we're only ever working with a small number of columns. Since these are domain-specific IDs, we'll leave them as-is, rather than converting them to keywords or something else.
      (tc/rename-columns (complement #{:datetime})
                         #(->> % (re-matches #"^(.+)\d{2}\.\d{2}\.\d{4}") second str/trim))

      ;; Now we'll make the data "tidy". It's not super obvious why this is necessary with a small dataset like this, but it will be for the ones that have more columns
      (tc/pivot->longer (complement #{:datetime}) {:target-columns :station-id
                                                   :value-column-name :count})))

Jahresdatei 2012 [16102 3]:

:datetime	:station-id	:count
2012-03-01T00:00	12-PA-SCH	11.0
2012-03-01T01:00	12-PA-SCH	10.0
2012-03-01T02:00	12-PA-SCH	2.0
2012-03-01T03:00	12-PA-SCH	2.0
2012-03-01T04:00	12-PA-SCH	0.0
2012-03-01T05:00	12-PA-SCH	6.0
2012-03-01T06:00	12-PA-SCH	20.0
2012-03-01T07:00	12-PA-SCH	82.0
2012-03-01T08:00	12-PA-SCH	108.0
2012-03-01T09:00	12-PA-SCH	81.0
…	…	…
2013-12-31T13:00	12-PA-SCH	108.0
2013-12-31T14:00	12-PA-SCH	110.0
2013-12-31T15:00	12-PA-SCH	83.0
2013-12-31T16:00	12-PA-SCH	65.0
2013-12-31T17:00	12-PA-SCH	26.0
2013-12-31T18:00	12-PA-SCH	22.0
2013-12-31T19:00	12-PA-SCH	28.0
2013-12-31T20:00	12-PA-SCH	26.0
2013-12-31T21:00	12-PA-SCH	12.0
2013-12-31T22:00	12-PA-SCH	18.0
2013-12-31T23:00	12-PA-SCH	20.0

Now that we’ve seen what kind of data we’re working with, we’ll look at how to apply these transformations systematically to the rest of the data.

source: bobkonf_2025/src/notebooks/4_explore_and_understand.clj