Import data from XML files into data.frame

Question

I have a number of XML files containing data I would like to analyse. Each XML contains data in a format similar to this:

<?xml version='1.0' encoding='UTF-8'?>
<build>
  <actions>
    ...
  </actions>
  <queueId>1276</queueId>
  <timestamp>1447062398490</timestamp>
  <startTime>1447062398538</startTime>
  <result>ABORTED</result>
  <duration>539722</duration>
  <charset>UTF-8</charset>
  <keepLog>false</keepLog>
  <builtOn></builtOn>
  <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
  <hudsonVersion>1.624</hudsonVersion>
  <scm class="hudson.scm.NullChangeLogParser"/>
  <culprits class="com.google.common.collect.EmptyImmutableSortedSet"/>
</build>

These are build.xml files generated by the continuous integration server Jenkins. The files themselves lack some important data that I would like, such as the Jenkins job name and the number of the build that created the XML. The job name and build number are encoded in the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml

I would like to create a data frame containing job name, build number, duration, and result.

My code to achieve this is the following:

library(XML)

filenames <- list.files("C:/Users/jenkins/", recursive=TRUE, full.names=TRUE, pattern="build.xml")

job <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 3]
}))

build <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 1]
}))

duration <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    xpathSApply(xml, "//duration", xmlValue)
}))

result <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    x <- xpathSApply(xml, "//result", xmlValue)
    return(x)
}))

build.data <- data.frame(job, build, result, duration)

Which gives me a data frame that looks like this:

  job             build result     duration
1 clean-caches    37    SUCCESS    248701
2 clean-caches    38    FAILURE    1200049
3 clean-caches    39    FAILURE    1200060
4 clean-caches    40    FAILURE    1200123
5 clean-caches    41    SUCCESS    358024
6 clean-caches    42    SUCCESS    130462

This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.

My concerns:

- Repeated code: The code blocks that generate the job and build vectors are nearly identical, as are those for duration and result. If I decide to import more nodes from the XML, I'll end up repeating even more code.
- Several iterations must be made over my list of files. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.

flodel · Accepted Answer · 2016-01-12 12:02:29Z

With only a few XML files to read, you could address your concerns with something like this, where each file is read only once but all parsed documents are held in memory at the same time:

subdirs <- strsplit(dirname(filenames),
                    split = .Platform$file.sep)
subdirs <- lapply(subdirs, rev)
job   <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)
xmls <- lapply(filenames, xmlParse)
duration <- sapply(xmls, xpathSApply, "//duration", xmlValue)
result   <- sapply(xmls, xpathSApply, "//result", xmlValue)
build.data <- data.frame(job, build, result, duration)
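To see what the rev and `[[` steps pick out, here is a minimal sketch on two made-up paths that follow the layout described in the question. The job name nightly-tests is hypothetical, and the split uses a literal "/" with fixed = TRUE rather than .Platform$file.sep so the snippet behaves the same on any platform.

```r
# Hypothetical paths following jobs/<JOB_NAME>/builds/<BUILD_NUMBER>/build.xml
filenames <- c(
  "C:/Users/jenkins/jobs/clean-caches/builds/37/build.xml",
  "C:/Users/jenkins/jobs/nightly-tests/builds/102/build.xml"
)

# dirname() drops "build.xml"; strsplit() breaks the rest into components
subdirs <- strsplit(dirname(filenames), split = "/", fixed = TRUE)

# Reverse each vector so the deepest directory comes first:
# position 1 is the build number, 2 is "builds", 3 is the job name
subdirs <- lapply(subdirs, rev)

job   <- sapply(subdirs, `[[`, 3)
build <- sapply(subdirs, `[[`, 1)

print(data.frame(job, build))
```

Counting from the end of the path is what makes this independent of how deep the Jenkins home directory sits.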

With thousands of files though, it makes more sense to process the files one by one and only keep the useful information before moving from one file to the next. It also makes sense to write a function to process each file. It could be:

build.info <- function(file, xml_fields = c("duration", "result")) {
   res <- list()
   # process filepath
   subdirs <- rev(unlist(strsplit(dirname(file),
                                  split = .Platform$file.sep)))
   res$job   <- subdirs[[3]]
   res$build <- subdirs[[1]]
   # process xml data
   doc <- xmlTreeParse(file)
   build <- doc$doc$children$build
   res[xml_fields] <- lapply(build[xml_fields], xmlValue)
   # return as a data.frame
   as.data.frame(res)
}

Note that the function returns a one-row data.frame. You can then call the function on all files via lapply and bind all the outputs together:

build.data <- do.call(rbind, lapply(filenames, build.info))
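The lapply-then-rbind pattern can be tried without any XML files. In this small sketch, fake.info is a hypothetical stand-in for build.info that returns a one-row data frame of made-up values:

```r
# Hypothetical stand-in for build.info(): one row per input, no files read
fake.info <- function(i) {
  data.frame(build  = i,
             result = if (i %% 2 == 0) "SUCCESS" else "FAILURE",
             stringsAsFactors = FALSE)
}

# A list of four one-row data frames
rows <- lapply(37:40, fake.info)

# do.call(rbind, rows) is the same as rbind(rows[[1]], rows[[2]], ...)
build.data <- do.call(rbind, rows)
print(build.data)
```

Because every element has the same column names, rbind stacks them into a single data frame with one row per input.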

With a few changes, you can write a more general function that takes one or more files and does the binding itself (as file.info does):

build.info <- function(file, xml_fields = c("duration", "result")) {
   stopifnot(length(file) > 0L)
   if (length(file) == 1L) {
      res <- list()
      # process filepath
      subdirs <- rev(unlist(strsplit(dirname(file),
                                     split = .Platform$file.sep)))
      res$job   <- subdirs[[3]]
      res$build <- subdirs[[1]]
      # process xml data
      doc <- xmlTreeParse(file)
      build <- doc$doc$children$build
      res[xml_fields] <- lapply(build[xml_fields], xmlValue)
      # return data.frame
      as.data.frame(res)
   } else {
      do.call(rbind, lapply(file, build.info, xml_fields = xml_fields))
   }
}

build.data <- build.info(filenames)
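A way to try the multi-file call end to end, without touching a real Jenkins directory, is to write a couple of throwaway build.xml files under tempdir(). This is only a sketch under a few assumptions: write.demo.build and demo.root are hypothetical names introduced here, the function is repeated in compact form so the snippet runs on its own, it uses xmlRoot() to reach the build node instead of doc$doc$children$build, and it splits on a literal "/". Following the path layout given in the question, the job name is the third component from the end of the directory path and the build number is the first.

```r
library(XML)

# Compact one-or-many version of build.info, repeated so this runs alone
build.info <- function(file, xml_fields = c("duration", "result")) {
  stopifnot(length(file) > 0L)
  if (length(file) > 1L) {
    # Forward xml_fields so a custom field list also applies to each file
    return(do.call(rbind, lapply(file, build.info, xml_fields = xml_fields)))
  }
  subdirs <- rev(unlist(strsplit(dirname(file), split = "/", fixed = TRUE)))
  res <- list(job = subdirs[[3]], build = subdirs[[1]])
  build <- xmlRoot(xmlTreeParse(file))
  res[xml_fields] <- lapply(build[xml_fields], xmlValue)
  as.data.frame(res, stringsAsFactors = FALSE)
}

# Hypothetical helper: writes a minimal build.xml under
# <root>/jobs/<job>/builds/<number>/
write.demo.build <- function(root, job, number, result, duration) {
  dir <- file.path(root, "jobs", job, "builds", number)
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  writeLines(c("<?xml version='1.0' encoding='UTF-8'?>",
               "<build>",
               sprintf("  <result>%s</result>", result),
               sprintf("  <duration>%d</duration>", duration),
               "</build>"),
             file.path(dir, "build.xml"))
}

demo.root <- file.path(normalizePath(tempdir(), winslash = "/"), "jenkins-demo")
write.demo.build(demo.root, "clean-caches", 37, "SUCCESS", 248701L)
write.demo.build(demo.root, "clean-caches", 38, "FAILURE", 1200049L)

filenames <- list.files(demo.root, recursive = TRUE, full.names = TRUE,
                        pattern = "^build\\.xml$")
build.data <- build.info(filenames)
print(build.data)
```

All columns come back as character strings, since both the path components and the values returned by xmlValue are text.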

That's great. It really helps me understand how to use the different methods of array access: [], [[]], and $. I assume that you mean to refer to xml_fields in the body of the function, not fields? — laffoyb, Jan 12 '16 at 11:02

