
If you use capturing parentheses in the regular-expression pattern passed to Python's re.split() function, the matched groups are also included in the result (see Python's documentation).
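For reference, this is the documented Python behavior, shown here on the same input as the Clojure example below:

```python
import re

# With capturing parentheses, re.split keeps the separators in the result,
# including empty strings where a match sits at the start or end of the input.
print(re.split(r"(\W+)", "...words, words..."))
# → ['', '...', 'words', ', ', 'words', '...', '']
```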

I need this in my Clojure code, but I couldn't find an existing implementation, nor a Java method that achieves the same result.

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenize [re text]
  (let [matcher (re-matcher re text)]
    (defn inner [last-index result]
      (if (.find matcher)
        (let [start-index (.start matcher)
              end-index (.end matcher)
              match (.group matcher)
              insert (subs text last-index start-index)]
          (if (string/blank? insert)
            (recur end-index (conj result match))
            (recur end-index (conj result insert match))))
        (conj result (subs text last-index))))
    (inner 0 [])))

Example:

(re-tokenize #"(\W+)" "...words, words...")
  => ["..." "words" ", " "words" "..." ""]

How could I make this simpler and / or more efficient (maybe also more Clojure-ish)?

For what it's worth, clojure.contrib.string/partition does this exactly. –  Dave Yarwood Mar 26 '14 at 20:33

1 Answer 1

You can adjust your implementation to produce a lazy sequence for some added efficiency:

(use '[clojure.string :as string :only [blank?]])

(defn re-tokenizer [re text]
  (let [matcher (re-matcher re text)]
    ((fn step [last-index]
       (lazy-seq
         (if (.find matcher)
           (let [start-index (.start matcher)
                 end-index   (.end matcher)
                 match       (.group matcher)
                 insert      (subs text last-index start-index)]
             (if (string/blank? insert)
               (cons match (step end-index))
               (cons insert (cons match (step end-index)))))
           ;; emit the remainder after the last match, as the original does
           (list (subs text last-index)))))
     0)))

This implementation is more efficient when you don't consume the whole result, since elements are computed only as needed. For instance, if you only need the first 10 tokens of a really long string you can use:

(take 10 (re-tokenizer #"(\W+)" really-long-string))

and only the first 10 elements will be computed.

