I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.
What I have works, but I'm looking for feedback on the following points:
- How can I avoid using massive buffers when I don't want to lose values?
- Am I using `go` blocks efficiently? Are there places where a `thread` would be more appropriate?
- How can I better determine when I can finish crawling? Currently I have a timeout of 3 seconds on taking from `urls-chan`, and if the timeout wins, I assume we're done. This doesn't seem very efficient.
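On the first point, my understanding is that the limit isn't really about buffer size: each `(go (>! urls-chan url))` that can't complete immediately parks as a pending put, and core.async caps pending puts at 1024 per channel. A sketch of an alternative I'm considering (untested against the full crawler) is to park on each put in turn from a single go block per page, so one page never holds more than one pending put:

```clojure
(require '[clojure.core.async :as async :refer [chan go >! <!!]])

;; Instead of spawning one go block (and one pending put) per link:
;;   (doseq [url (get-links doc domain)]
;;     (go (>! urls-chan url)))
;; put the links sequentially from a single go block, so each page
;; contributes at most one pending put to the channel at a time.
(defn enqueue-links
  "Puts each link onto ch in order; returns a channel that closes
  once every link has been accepted."
  [ch links]
  (go
    (doseq [url links]
      (>! ch url))))
```

I believe `async/onto-chan!` (`onto-chan` in older core.async releases) does essentially the same thing.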
Here is the main part of the code:
(def visited-urls (atom #{}))
(def site-map (atom {}))
;; I've given my two channels massive buffers here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;; Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;; (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))
(def exit-chan (chan 1))
(defn get-doc
"Fetches a parsed html page from the given url and places onto a channel"
[url]
(go (let [{:keys [error body opts headers]} (<! (async-get url))
content-type (:content-type headers)]
(if (or error (not (.startsWith content-type "text/html")))
(do (log "error fetching" url)
false)
(Jsoup/parse body (base-url (:url opts)))))))
;; Main event loop
(defn start-consumers
"Spins up n go blocks to take a url from urls-chan, store its assets and then
puts its links onto urls-chan, repeating until there are no more urls to take"
[n domain]
(dotimes [_ n]
(go-loop [url (<! urls-chan)]
(when-not (@visited-urls url)
(log "crawling" url)
(swap! visited-urls conj url)
(when-let [doc (<! (get-doc url))]
(swap! site-map assoc url (get-assets doc))
(doseq [url (get-links doc domain)]
(go (>! urls-chan url)))))
;; Take the next url off the q, if 3 secs go by assume no more are coming
(let [[value channel] (alts! [urls-chan (timeout 3000)])]
(if (= channel urls-chan)
(recur value)
(>! exit-chan true))))))
(defn -main
"Crawls [domain] for links to assets"
[domain]
(let [start-time (System/currentTimeMillis)]
(start-logger)
(log "Beginning crawl of" domain)
(start-consumers 40 domain)
;; Kick off with the first url
(>!! urls-chan domain)
(<!! exit-chan)
(println (json/write-str @site-map))
(<!! (log "Completed after" (seconds-since start-time) "seconds"))))
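On the timeout question: rather than guessing that 3 quiet seconds means the crawl is done, one pattern I'm considering is counting in-flight URLs in an atom — increment on enqueue, decrement once a URL has been fully processed (links and all), and signal exit when the count hits zero. All names here are hypothetical; this is a sketch, not a drop-in replacement:

```clojure
(require '[clojure.core.async :as async :refer [chan close! >!! <!!]])

;; Sketch: every URL is counted when enqueued and un-counted when its
;; page has been fully processed. When the counter returns to zero, no
;; further work can ever appear, so the crawl is provably finished.
(def in-flight (atom 0))
(def urls-chan (chan 1024))
(def exit-chan (chan 1))

(defn enqueue!
  "Counts the url as in-flight, then puts it on urls-chan."
  [url]
  (swap! in-flight inc)
  (>!! urls-chan url))

(defn finished!
  "Called once per url after all of its links have been enqueued.
  Signals exit when nothing is left in flight."
  [_url]
  (when (zero? (swap! in-flight dec))
    (close! urls-chan)   ;; lets the consumer loops drain and stop
    (>!! exit-chan true)))
```

The consumers would need to call `finished!` for every URL they take — including ones skipped as already visited — otherwise the counter never reaches zero.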
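On the `go` vs `thread` question, my understanding is that `go` blocks share a small dispatch thread pool, so blocking or CPU-heavy work inside them (like `Jsoup/parse` on a large page) can starve every other go block. `async/thread` runs its body on a dedicated thread and still returns a channel with the result, so it composes the same way. A sketch, with a stand-in for the real parsing:

```clojure
(require '[clojure.core.async :as async :refer [<!!]])

;; async/thread runs the body off the go dispatch pool and returns a
;; channel yielding the body's result, so a go block can still <! it.
(defn expensive-work-on-thread
  "Hypothetical example of moving heavy work onto its own thread."
  [x]
  (async/thread
    (Thread/sleep 10)   ;; stand-in for Jsoup/parse or blocking IO
    (* x x)))
```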