We're migrating our streaming ETL application from Python to Clojure, but the hottest part of the code isn't yet performing as well as our existing implementation. This looks to be down to the algorithmic complexity of our flattening function.
The task: take a nested map, flatten the data structure and rename keys. We take data from multiple sources and have to smudge the data to fit a legacy format. For this, all of the keys/values need to be at the root level of the map with the correct keys. Any 'new' keys (ones with no entry in the rename table) can be left as is and not renamed.
The transformation was originally a threading macro with a flatten step followed by a set/rename-keys; this has been improved to rename keys while looping through each level of the map. The flatten, however, still accounts for ~80% of the time spent processing a record.
An example input would be as follows:
{:foo {:foo-host "host-1" :user {:foo-user "user-1" :foo-id "id-1"}}
 :bar {:bar-var "potato"}}
With an expected output of:
{:host "host-1" :user "user-1" :id "id-1" :var "potato"}
We've gone through and removed any Java reflection and type-hinted the hottest parts of our code, so no more gains can be found there. The last slow pieces of code are as follows:
(def match-table {:foo-host :host
                  :foo-user :user
                  :foo-id   :id
                  :bar-var  :var})
(defn flatten-record
  "Take a nested record and recursively flatten."
  [^clojure.lang.PersistentArrayMap record]
  (into {}
        (for [[k v] record]
          (if (map? v)
            (flatten-record v)
            ;; If the key's not found return itself as the default.
            {(get match-table k k) v}))))
The flatten-record function consumes the stack as it lacks any tail call optimisation; might trampoline be a better solution? The records transformed are at most 5 levels deep, so there's no risk of a stack overflow.
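For comparison, here is a minimal sketch of one alternative shape for the function. It is not a measured improvement, just an illustration: instead of building a single-entry map per key and a lazy seq per level, it walks the record with reduce-kv and accumulates into one transient map. The name flatten-record-kv is hypothetical, and the sketch assumes the input supports reduce-kv (plain Clojure maps do; other map types, such as deserialised protobuf maps, may need checking).

```clojure
(def match-table {:foo-host :host
                  :foo-user :user
                  :foo-id   :id
                  :bar-var  :var})

(defn flatten-record-kv
  "Hypothetical alternative to flatten-record: walks the nested map with
  reduce-kv and adds every leaf entry to a single transient map."
  [record]
  (letfn [(step [acc k v]
            (if (map? v)
              ;; Nested map: keep reducing into the same accumulator.
              (reduce-kv step acc v)
              ;; Leaf: rename the key if match-table knows it, else keep it.
              (assoc! acc (get match-table k k) v)))]
    (persistent! (reduce-kv step (transient {}) record))))

(flatten-record-kv
 {:foo {:foo-host "host-1" :user {:foo-user "user-1" :foo-id "id-1"}}
  :bar {:bar-var "potato"}})
```

This is still recursive, one stack frame per nesting level, which should be harmless at a maximum depth of 5. Whether it is actually faster would need benchmarking on real records.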
I'm mainly concerned about performance here, so am open to any hints!
Larger examples are available here.
^clojure.lang.PersistentArrayMap. For any map with more than 16 entries, this fails. ^clojure.lang.APersistentMap might be better.

^flatland.protobuf.PersistentProtocolBufferMap in reality as it's protobuf we're deserialising. I just grabbed that as an example. Cheers for that though as I didn't know that behaviour.
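As a small illustration of the behaviour mentioned in the comments, the sketch below checks the concrete class of a small and a large map at the REPL. The names small-map and large-map are made up for the example; the point is only that the concrete class depends on the size of the map, while both share the abstract base class.

```clojure
;; A small literal map is backed by an array map.
(def small-map {:a 1 :b 2})

;; A map with 20 entries is backed by a hash map instead.
(def large-map (zipmap (range 20) (range 20)))

(println (class small-map))
(println (class large-map))

;; Both concrete classes extend APersistentMap.
(println (instance? clojure.lang.APersistentMap small-map))
(println (instance? clojure.lang.APersistentMap large-map))
```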