I have this code to create an inverted index from a directory of text files:

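(require '[clojure.string :refer [lower-case]]
         '[clojure.set :refer [union]])

;; Token delimiter: the text is split on non-letter characters.
(def p (re-pattern #"[^\p{L}&&[^\p{M}]]"))

;; Maps every token in the file to a set holding just that file's name.
(defn invert[file]
  (let [f (.getName file)
        tokens (.split p (lower-case (slurp file)))]
        (into {} (mapcat #(hash-map % #{f}) tokens))))

;; Merges the per-file maps, taking the union of file-name sets per token.
(defn build-index[dirname]
  (reduce #(merge-with union %1 %2) (map invert (.listFiles (java.io.File. dirname)))))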
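
To illustrate the shape of the index, here is a minimal sketch that runs on in-memory strings instead of files. The names `invert-text` and `build-index-from-texts` and the sample documents are made up for this example. It also drops empty tokens, which the code above does not do.

```clojure
(require '[clojure.string :as str]
         '[clojure.set :as set])

;; Same delimiter pattern as in the code above.
(def non-letter #"[^\p{L}&&[^\p{M}]]")

(defn invert-text
  "Hypothetical helper: maps each non-empty token of text to #{doc-name}."
  [doc-name text]
  (into {}
        (for [token (str/split (str/lower-case text) non-letter)
              :when (seq token)]
          [token #{doc-name}])))

(defn build-index-from-texts
  "docs is a map of document name -> document text (an assumption of this sketch)."
  [docs]
  (apply merge-with set/union
         (map (fn [[doc-name text]] (invert-text doc-name text)) docs)))

(build-index-from-texts {"a.txt" "Hello world"
                         "b.txt" "hello there"})
;; => {"hello" #{"a.txt" "b.txt"}, "world" #{"a.txt"}, "there" #{"b.txt"}}
;; (key order may differ when printed)
```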

Can it be improved or made more idiomatic? My main concern is performance. I'm testing it with 10k files of 10k each, and I can make it about 20-30% faster by using transients like this:

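;; Adds every token of the file to the transient index idx and returns the
;; updated transient.
(defn add![idx file]
  (let [f (.getName file)]
    (loop [idx idx
           tokens (.split p (lower-case (slurp file)))]
      (if-not (seq tokens)
        idx
        ;; A missing key yields nil, and (union nil #{f}) is just #{f}.
        (recur (assoc! idx (first tokens) (union (idx (first tokens)) #{f}))
               (rest tokens))))))

;; Threads one transient map through all files and makes it persistent at the end.
(defn build-index[dirname]
  (loop [files (.listFiles (java.io.File. dirname))
         idx (transient {})]
    (if-not (seq files)
      (persistent! idx)
      (recur (rest files) (add! idx (first files))))))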
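
The same transient approach can probably be written with `reduce` instead of explicit `loop`/`recur`. This is only a sketch on pre-tokenized, in-memory input; `add-tokens!` and `build-index-transient` are hypothetical names, and I have not measured whether it performs any differently.

```clojure
(defn add-tokens!
  "Hypothetical helper: adds doc-name under every token in the transient map idx.
  The value returned by assoc! is always the one carried forward."
  [idx doc-name tokens]
  (reduce (fn [m token]
            (assoc! m token (conj (get m token #{}) doc-name)))
          idx
          tokens))

(defn build-index-transient
  "docs is a sequence of [doc-name tokens] pairs (an assumption of this sketch)."
  [docs]
  (persistent!
   (reduce (fn [idx [doc-name tokens]]
             (add-tokens! idx doc-name tokens))
           (transient {})
           docs)))

(build-index-transient [["a.txt" ["hello" "world"]]
                        ["b.txt" ["hello"]]])
;; => {"hello" #{"a.txt" "b.txt"}, "world" #{"a.txt"}}
```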

The full code, including a test file generator, is here:

https://github.com/dbasch/closearch

Any feedback is welcome.
