
If you use capturing parentheses in the regular-expression pattern passed to Python's re.split() function, the matched groups are also included in the result (see Python's documentation).
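For reference, this is the documented Python behavior, shown here on the same input as the Clojure example below:

```python
import re

# With capturing parentheses, re.split keeps the separators in the result,
# including empty strings where a match sits at the start or end of the input.
print(re.split(r"(\W+)", "...words, words..."))
# → ['', '...', 'words', ', ', 'words', '...', '']
```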

I need this in my Clojure code, but I couldn't find an existing implementation, nor a Java method that achieves the same result.

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenize [re text]
  (let [matcher (re-matcher re text)]
    (defn inner [last-index result]
      (if (.find matcher)
        (let [start-index (.start matcher)
              end-index (.end matcher)
              match (.group matcher)
              insert (subs text last-index start-index)]
          (if (string/blank? insert)
            (recur end-index (conj result match))
            (recur end-index (conj result insert match))))
        (conj result (subs text last-index))))
    (inner 0 [])))

Example:

(re-tokenize #"(\W+)" "...words, words...")
  => ["..." "words" ", " "words" "..." ""]

How could I make this simpler and / or more efficient (maybe also more Clojure-ish)?

For what it's worth, clojure.contrib.string/partition does this exactly. –  Dave Yarwood Mar 26 '14 at 20:33

1 Answer 1

You can adjust your implementation to produce a lazy sequence for some added efficiency:

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenizer [re text]
  (let [matcher (re-matcher re text)]
    ((fn step [last-index]
       (lazy-seq
         (if (.find matcher)
           (let [start-index (.start matcher)
                 end-index   (.end matcher)
                 match       (.group matcher)
                 insert      (subs text last-index start-index)]
             (if (string/blank? insert)
               (cons match (step end-index))
               (cons insert (cons match (step end-index)))))
           ;; emit the remainder after the last match, as the original does
           (list (subs text last-index)))))
     0)))

This implementation is more efficient when you don't consume the whole result, since elements are computed only as needed. For instance, if you only need the first 10 tokens of a really long string you can use:

(take 10 (re-tokenizer #"(\W+)" really-long-string))

and only the first 10 elements will be computed.

