I'm currently a beginner with Clojure, and I thought I'd try building a web crawler with core.async.
What I have works, but I'm looking for feedback on the following points:
- How can I avoid using massive buffers when I don't want to lose values?
- Am I using `go` blocks efficiently? Are there places where a `thread` would be more appropriate?
- How can I better determine when I can finish crawling? Currently I have a timeout of 3 seconds on taking from `urls-chan`, and if the timeout wins, I assume we're done. This doesn't seem very efficient.
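On the first point, my understanding is that the limit isn't really about buffer size: each `(go (>! urls-chan url))` that can't complete immediately parks as a pending put, and core.async caps pending puts at 1024 per channel. A sketch of an alternative I'm considering (untested against the full crawler) is to park on each put in turn from a single go block per page, so one page never holds more than one pending put:

```clojure
(require '[clojure.core.async :as async :refer [chan go >! <!!]])

;; Instead of spawning one go block (and one pending put) per link:
;;   (doseq [url (get-links doc domain)]
;;     (go (>! urls-chan url)))
;; put the links sequentially from a single go block, so each page
;; contributes at most one pending put to the channel at a time.
(defn enqueue-links
  "Puts each link onto ch in order; returns a channel that closes
  once every link has been accepted."
  [ch links]
  (go
    (doseq [url links]
      (>! ch url))))
```

I believe `async/onto-chan!` (`onto-chan` in older core.async releases) does essentially the same thing.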
Here is the main part of the code:
(def visited-urls (atom #{}))
(def site-map (atom {}))
;; I've given my two channels massive buffers here because I don't want to drop
;; values. I'm not quite sure why they need to be so big, but anything smaller gives me:
;; Exception in thread "async-dispatch-1626" java.lang.AssertionError:
;; Assert failed: No more than 1024 pending puts are allowed on a single channel. Consider using a windowed buffer.
;; (< (.size puts) impl/MAX-QUEUE-SIZE)
(def urls-chan (chan 102400))
(def log-chan (chan 102400))
(def exit-chan (chan 1))
(defn get-doc
"Fetches a parsed html page from the given url and places onto a channel"
[url]
(go (let [{:keys [error body opts headers]} (<! (async-get url))
content-type (:content-type headers)]
(if (or error (not (.startsWith content-type "text/html")))
(do (log "error fetching" url)
false)
(Jsoup/parse body (base-url (:url opts)))))))
;; Main event loop
(defn start-consumers
"Spins up n go blocks to take a url from urls-chan, store its assets and then
puts its links onto urls-chan, repeating until there are no more urls to take"
[n domain]
(dotimes [_ n]
(go-loop [url (<! urls-chan)]
(when-not (@visited-urls url)
(log "crawling" url)
(swap! visited-urls conj url)
(when-let [doc (<! (get-doc url))]
(swap! site-map assoc url (get-assets doc))
(doseq [url (get-links doc domain)]
(go (>! urls-chan url)))))
;; Take the next url off the q, if 3 secs go by assume no more are coming
(let [[value channel] (alts! [urls-chan (timeout 3000)])]
(if (= channel urls-chan)
(recur value)
(>! exit-chan true))))))
(defn -main
"Crawls [domain] for links to assets"
[domain]
(let [start-time (System/currentTimeMillis)]
(start-logger)
(log "Beginning crawl of" domain)
(start-consumers 40 domain)
;; Kick off with the first url
(>!! urls-chan domain)
(<!! exit-chan)
(println (json/write-str @site-map))
(<!! (log "Completed after" (seconds-since start-time) "seconds"))))
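On the timeout question: rather than guessing that 3 quiet seconds means the crawl is done, one pattern I'm considering is counting in-flight URLs in an atom — increment on enqueue, decrement once a URL has been fully processed (links and all), and signal exit when the count hits zero. All names here are hypothetical; this is a sketch, not a drop-in replacement:

```clojure
(require '[clojure.core.async :as async :refer [chan close! >!! <!!]])

;; Sketch: every URL is counted when enqueued and un-counted when its
;; page has been fully processed. When the counter returns to zero, no
;; further work can ever appear, so the crawl is provably finished.
(def in-flight (atom 0))
(def urls-chan (chan 1024))
(def exit-chan (chan 1))

(defn enqueue!
  "Counts the url as in-flight, then puts it on urls-chan."
  [url]
  (swap! in-flight inc)
  (>!! urls-chan url))

(defn finished!
  "Called once per url after all of its links have been enqueued.
  Signals exit when nothing is left in flight."
  [_url]
  (when (zero? (swap! in-flight dec))
    (close! urls-chan)   ;; lets the consumer loops drain and stop
    (>!! exit-chan true)))
```

The consumers would need to call `finished!` for every URL they take — including ones skipped as already visited — otherwise the counter never reaches zero.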
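On the `go` vs `thread` question, my understanding is that `go` blocks share a small dispatch thread pool, so blocking or CPU-heavy work inside them (like `Jsoup/parse` on a large page) can starve every other go block. `async/thread` runs its body on a dedicated thread and still returns a channel with the result, so it composes the same way. A sketch, with a stand-in for the real parsing:

```clojure
(require '[clojure.core.async :as async :refer [<!!]])

;; async/thread runs the body off the go dispatch pool and returns a
;; channel yielding the body's result, so a go block can still <! it.
(defn expensive-work-on-thread
  "Hypothetical example of moving heavy work onto its own thread."
  [x]
  (async/thread
    (Thread/sleep 10)   ;; stand-in for Jsoup/parse or blocking IO
    (* x x)))
```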