I have some data in HTML format and would like to import it into R. However, type inference seems much less sophisticated than what functions such as read.table offer. I am using the following workaround, which consists of reading the data table once, writing it out as a CSV, and re-importing it.
# import raw data as HTML
require(XML)
raw_data <- readHTMLTable(doc = ".\\data\\mydata",
                          header = TRUE,
                          as.data.frame = TRUE)

# write to a temporary CSV and re-read it, so that
# read.table's type inference can run
write.table(x = raw_data, file = ".\\temp.csv", quote = TRUE)
clean_data <- read.table(".\\temp.csv")

# strip the "NULL" prefix and the dots that the write/read
# round trip leaves in the column names
removeDots <- function(x) {
  x <- gsub(pattern = "NULL", replacement = "", x = x, fixed = TRUE)
  gsub(pattern = ".", replacement = "", x = x, fixed = TRUE)
}
# gsub is vectorised, so the cleaner can be applied to all names at once
colnames(clean_data) <- removeDots(colnames(clean_data))
For example, the following data is imported as a table of factors:
<table style="width:100%">
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
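Since readHTMLTable's doc argument also accepts a string of HTML content, the snippet above can be read directly to reproduce the problem. A minimal sketch (header = FALSE because the table has no header row; the exact class, factor or character, depends on your R version's stringsAsFactors default):

```r
library(XML)

# the example table from above, as an HTML string
html <- '<table style="width:100%">
<tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>'

# readHTMLTable returns a list of data frames, one per <table>
tbl <- readHTMLTable(doc = html, header = FALSE, as.data.frame = TRUE)[[1]]

# every column comes back as text, including the numeric third column
sapply(tbl, class)
```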
Is there any better option?
doc can be a character vector containing the HTML content to parse and process, so you should be able to make a small reproducible example by dumping the source HTML into a string: x <- "<html>...</html>". Thanks.

readHTMLTable's behavior clearly is surprising and disappointing. A possible alternative to XML is the rvest package and its html_table function. The examples provided in the documentation show that it does differentiate between data types. Otherwise, if you need to stick with the XML package, I'd say your idea of writing/reading via a temp file is good and what I would have used as well.
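For reference, a minimal sketch of the rvest route (assuming rvest >= 1.0, where html_table guesses column types via its convert argument; the inline HTML string stands in for the real file):

```r
library(rvest)  # read_html() is re-exported from xml2

html <- '<table style="width:100%">
<tr><td>Jill</td><td>Smith</td><td>50</td></tr>
<tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>'

# html_table() on a parsed page returns one tibble per <table>;
# with convert = TRUE (the default) it guesses column types
tbl <- html_table(read_html(html), header = FALSE)[[1]]

# the third column is parsed as a number, not a factor
sapply(tbl, class)
```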