I have a number of XML files containing data I would like to analyse. Each XML contains data in a format similar to this:
<?xml version='1.0' encoding='UTF-8'?>
<build>
  <actions>
    ...
  </actions>
  <queueId>1276</queueId>
  <timestamp>1447062398490</timestamp>
  <startTime>1447062398538</startTime>
  <result>ABORTED</result>
  <duration>539722</duration>
  <charset>UTF-8</charset>
  <keepLog>false</keepLog>
  <builtOn></builtOn>
  <workspace>/var/lib/jenkins/workspace/clean-caches</workspace>
  <hudsonVersion>1.624</hudsonVersion>
  <scm class="hudson.scm.NullChangeLogParser"/>
  <culprits class="com.google.common.collect.EmptyImmutableSortedSet"/>
</build>
These are build.xml files generated by the continuous integration server, Jenkins. The files themselves lack some important data that I would like to have, such as the Jenkins job name and the build number that produced the file. The job name and build number are encoded in the path of each file, like .\jenkins\jobs\${JOB_NAME}\builds\${BUILD_NUMBER}\build.xml
I would like to create a data frame containing the job name, build number, duration, and result.
My code to achieve this is the following:
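library(XML)

# Find every build.xml below the Jenkins directory.
# The pattern is a regular expression, so the dot is escaped and the name anchored.
filenames <- list.files("C:/Users/jenkins/", recursive=TRUE, full.names=TRUE, pattern="^build\\.xml$")

# Job name: the path component three levels above the file name
job <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 3]
}))

# Build number: the directory that directly contains the file
build <- unlist(lapply(filenames, function(f) {
    s <- unlist(strsplit(f, split=.Platform$file.sep))
    s[length(s) - 1]
}))

# Text of the <duration> node of each file
duration <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    xpathSApply(xml, "//duration", xmlValue)
}))

# Text of the <result> node of each file
result <- unlist(lapply(filenames, function(f) {
    xml <- xmlParse(f)
    x <- xpathSApply(xml, "//result", xmlValue)
    return(x)
}))

build.data <- data.frame(job, build, result, duration)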
Which gives me a data frame that looks like this:
           job build  result duration
1 clean-caches    37 SUCCESS   248701
2 clean-caches    38 FAILURE  1200049
3 clean-caches    39 FAILURE  1200060
4 clean-caches    40 FAILURE  1200123
5 clean-caches    41 SUCCESS   358024
6 clean-caches    42 SUCCESS   130462
This works, but I have serious concerns about it from both a style and a performance point of view. I'm completely new to R, so I don't know what would be a nicer way to do this.
My concerns:
- Repeated code: The code blocks that generate the job and build vectors are nearly identical; they differ only in the index they pick. The same goes for duration and result, which differ only in the XPath expression. If I decide to import more nodes from the XML, I'll end up repeating even more code.
- Several iterations must be made over my list of files, and each file is parsed once per extracted field. There are thousands of XML files, and this number will likely grow. As above, if I wish to extract more data from the XML, I must add more iterations.
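To make the two concerns concrete, here is a rough sketch of what a single-pass version might look like. It is only an illustration under assumptions: `read_build` and `read_builds` are hypothetical helper names (they are not part of the XML package), every file is assumed to sit at `.../jobs/<job>/builds/<number>/build.xml`, the build directory name is assumed to be numeric, and the requested fields are assumed to be direct children of `<build>`.

```r
library(XML)

# Hypothetical helper: parse one build.xml once and return a one-row data frame.
read_build <- function(f, fields = c("result", "duration")) {
  # Assumed layout: .../jobs/<job>/builds/<number>/build.xml
  build_dir <- dirname(f)                    # .../jobs/<job>/builds/<number>
  job_dir <- dirname(dirname(build_dir))     # .../jobs/<job>

  doc <- xmlParse(f)
  # Extract each requested child of <build>; NA if the node is missing.
  values <- lapply(fields, function(field) {
    v <- xpathSApply(doc, paste0("/build/", field), xmlValue)
    if (length(v) == 0) NA_character_ else v[[1]]
  })
  names(values) <- fields

  data.frame(job = basename(job_dir),
             build = as.integer(basename(build_dir)),
             values,
             stringsAsFactors = FALSE)
}

# Hypothetical helper: one pass over all build.xml files below root.
read_builds <- function(root, fields = c("result", "duration")) {
  filenames <- list.files(root, pattern = "^build\\.xml$",
                          recursive = TRUE, full.names = TRUE)
  do.call(rbind, lapply(filenames, read_build, fields = fields))
}
```

With these helpers, something like `build.data <- read_builds("C:/Users/jenkins/")` would build the data frame, and extracting another node would only mean adding its name to `fields`. All XML values come back as character strings, so a column such as duration would still need `as.numeric()` before analysis.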