
I'm scraping some comments from Reddit using the Reddit JSON API and R. Since the data does not have a flat structure, extracting it is a little tricky, but I've found a way.

To give you a flavour of what I'm having to do, here is a brief example:

library(RJSONIO) # any package providing fromJSON() that returns nested lists, e.g. RJSONIO or rjson

x = "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json" # example url
rawdat   = readLines(x, warn = FALSE) # reading in the raw JSON
rawdat   = fromJSON(rawdat) # parsing it into nested lists
dat_list = repl = rawdat[[2]][[2]][[2]] # comment listing; repl will be reused later
sq       = seq(dat_list)[-1] - 1 # indices of the comments
txt      = unlist(lapply(sq, function(x) dat_list[[x]][[2]][[14]])) # top-level comments (not replies)

# loop time:

for(a in sq){
  repl  = tryCatch(repl[[a]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to comment a

  if(length(repl)>0){ # in case there are no replies
    sq  = seq(repl)[-1]-1 # number of replies
    txt    = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]]))) # this is what I want

    # next level down
    for(b in sq){
      repl  = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL) # getting all replies to reply b of comment a

      if(length(repl)>0){
        sq  = seq(repl)[-1]-1
        txt    = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))   
      }
    }
  }
}

In the above example, I get all comments, the first level of replies to each of these comments, and the second level of replies (i.e. replies to each of the replies), but the nesting could go much deeper, so I'm trying to figure out an efficient way of handling this. To achieve this manually, here is what I have to do:

  1. Copy the following code from the last loop:

    for(b in sq){
      repl  = tryCatch(repl[[b]][[2]][[5]][[2]][[2]],error=function(e) NULL)
    
      if(length(repl)>0){
        sq  = seq(repl)[-1]-1
        txt = c(txt,unlist(lapply(sq,function(x)repl[[x]][[2]][[14]])))   
      }
    }
    
  2. Paste that code right after the line that starts with txt = ... and change b in the loop to c.

  3. Repeat this procedure roughly 20 times to make sure everything is captured, which as you can imagine creates a huge loop. I was hoping there might be a way to fold this loop somehow and make it more elegant...

If you have any ideas on how this loop could be improved, I'd really appreciate if you could share your thoughts.

EDIT:

Following flodel's solution, the discussion below revealed a problem: not all comments are obtained, only 198 of them. This is indeed an API limitation, but a minor alteration captures more observations: append "?limit=500" to the URL, so that the example URL looks like http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json?limit=500. This will get you up to 500 comments. Unfortunately, 500 is the upper limit. Hope this helps.
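A minimal sketch of that alteration in R (paste0 is base R; the limit value of 500 is the API cap mentioned above):

```r
url     <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
url.max <- paste0(url, "?limit=500")  # ask the API for up to 500 comments
```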

Thanks again to flodel for answering this question.


1 Answer

Accepted answer (score 4)

Here are my main recommendations:

  1. use recursion
  2. use names instead of list indices; for example, node$data$replies$data$children reads much better than node[[2]][[5]][[2]][[2]] and is also more robust to data changes
  3. use well-named variables so your code reads easily

Now for the code:

library(RJSONIO) # any package providing fromJSON() that returns nested lists

url       <- "http://www.reddit.com/r/funny/comments/2eerfs/fifa_glitch_cosplay/.json"
rawdat    <- fromJSON(readLines(url, warn = FALSE))
main.node <- rawdat[[2]]$data$children

get.comments <- function(node) {
   comment     <- node$data$body
   replies     <- node$data$replies
   reply.nodes <- if (is.list(replies)) replies$data$children else NULL
   return(list(comment, lapply(reply.nodes, get.comments)))
}

txt <- unlist(lapply(main.node, get.comments))
length(txt)
# [1] 199
is there a difference between unlist(lapply(items, fun)) and sapply(items, fun) ? –  janos Aug 31 at 18:34
Yes. If fun were to return a vector of the same length for each item, then sapply would put all the output vectors into a matrix, otherwise in a list. lapply on the other hand always returns a list. unlist(lapply(...)) will unwrap the list into a vector. –  flodel Aug 31 at 19:07
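A small illustration of the difference flodel describes (the list contents here are made up for the example):

```r
items <- list(a = 1:3, b = 4:6)

# Each call returns a vector of the same length (3), so sapply
# simplifies the collected results into a 3x2 matrix.
m <- sapply(items, function(v) v * 2)

# lapply always returns a list; unlist() flattens it into one vector.
v <- unlist(lapply(items, function(v) v * 2))
```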
    
nice, thanks flodel! –  janos Aug 31 at 19:10
    
Great, thanks @flodel, that is exactly what I'm after! The only thing that concerns me is that there are over 400 responses in total, but the code only returns 199 observations. I thought the API was capped at 500 results. Would you happen to have an idea why this may be so? Because if this is all I'm going to get, I might as well scrape the HTML which could give me up to 500 comments... –  de1pher Sep 3 at 6:47
    
If I browse to the url and search for "body:", chrome tells me there are 198 hits (not sure where the one-off comes from) so I'm leaning towards an API limitation. –  flodel Sep 3 at 11:23
