\$\begingroup\$

I have some data in HTML format and would like to import it into R. However, the type inference of readHTMLTable seems much less sophisticated than that of functions such as read.table. I am using the following workaround, which consists of reading the table once, writing it to a CSV file, and re-importing it.

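# Import the raw data from HTML
library(XML)
raw_data <- readHTMLTable(doc = ".\\data\\mydata",
                          header = TRUE,
                          as.data.frame = TRUE)

# Round-trip through a temporary file so that read.table infers the column types
write.table(x = raw_data, file = ".\\temp.csv", quote = TRUE)
clean_data <- read.table(".\\temp.csv")

# Strip the "NULL" and "." fragments from the column names
removeDots <- function(x)
{
  x <- gsub(pattern = "NULL", replacement = "", x = x, fixed = TRUE)
  gsub(pattern = ".", replacement = "", x = x, fixed = TRUE)
}

# gsub is vectorised, so the whole name vector can be passed at once
# (lapply would return a list rather than a character vector)
colnames(clean_data) <- removeDots(colnames(clean_data))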

For example, the following data is imported as a table of factors:

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Is there a better option?

\$\endgroup\$
  • \$\begingroup\$ Which types are being falsely inferred? Kind of hard to tell without data to try. \$\endgroup\$
    – Mike Wise
    Commented Nov 4, 2015 at 10:33
  • \$\begingroup\$ It does not infer any type at all... There is the argument elFun that I can play with, but it is hard to reproduce the parsing of read.table \$\endgroup\$
    – RUser4512
    Commented Nov 4, 2015 at 10:45
  • \$\begingroup\$ Apparently, doc can be "a character vector containing the HTML content to parse and process", so you should be able to make a small reproducible example by dumping the source HTML into a string: x <- "<html>...</html>". Thanks. \$\endgroup\$
    – flodel
    Commented Nov 4, 2015 at 12:08
  • \$\begingroup\$ @flodel I added a sample. \$\endgroup\$
    – RUser4512
    Commented Nov 4, 2015 at 12:20
  • \$\begingroup\$ The readHTMLTable behavior is clearly surprising and disappointing. A possible alternative to XML is the rvest package and its html_table function. The examples provided in its documentation show that it does differentiate between data types. Otherwise, if you need to stick with the XML package, I'd say your idea of writing and reading via a temp file is good, and it is what I would have used as well. \$\endgroup\$
    – flodel
    Commented Nov 8, 2015 at 0:30
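
As a possible way to skip the temp file discussed above: read.table infers column types by calling type.convert on each column, so the same inference can be applied directly to a data frame of strings. The sketch below is only an illustration under assumptions: the hand-built raw_data stands in for what readHTMLTable would return for the sample table if its columns were kept as character (for instance by passing stringsAsFactors = FALSE), and the column names V1 to V3 are placeholders.

```r
# Stand-in (assumption) for the sample table as read from HTML,
# with every column still stored as character strings.
raw_data <- data.frame(
  V1 = c("Jill", "Eve"),
  V2 = c("Smith", "Jackson"),
  V3 = c("50", "94"),
  stringsAsFactors = FALSE
)

# Apply type.convert() column by column. as.is = TRUE keeps
# non-numeric columns as character instead of turning them into factors.
clean_data <- raw_data
clean_data[] <- lapply(raw_data, type.convert, as.is = TRUE)

# Inspect the inferred column types
str(clean_data)
```

Assigning into clean_data[] keeps the data frame structure and column names, so no renaming step is needed afterwards.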
