Clean reading of HTML table with R

I have some data in HTML format and would like to import it into R. However, the type inference of readHTMLTable seems much less sophisticated than that of functions such as read.table. I am using the following workaround: read the table once, write it out to a file, and re-import it with read.table.

# Import the raw data from HTML
library(XML)
raw_data <- readHTMLTable(doc = ".\\data\\mydata",
                          header = TRUE,
                          as.data.frame = TRUE)

# Round-trip through a temporary file so that read.table infers the column types
write.table(x = raw_data, file = ".\\temp.csv", quote = TRUE)
clean_data <- read.table(".\\temp.csv")

# Strip "NULL" and dots from the column names
removeDots <- function(x)
{
  x <- gsub(pattern = "NULL", replacement = "", x = x, fixed = TRUE)
  gsub(pattern = ".", replacement = "", x = x, fixed = TRUE)
}

# gsub is vectorised, so the helper can be applied to all names at once
colnames(clean_data) <- removeDots(colnames(clean_data))

For example, the following table is imported with every column as a factor:

<table style="width:100%">
  <tr>
    <td>Jill</td>
    <td>Smith</td> 
    <td>50</td>
  </tr>
  <tr>
    <td>Eve</td>
    <td>Jackson</td> 
    <td>94</td>
  </tr>
</table>

Is there a better option?
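One way to avoid the temporary file, sketched here under assumptions: read.table itself relies on base R's type.convert to guess column types, so the same conversion can be applied in memory. The data frame below is built by hand as a stand-in for what readHTMLTable is reported to return for the sample above (every column a factor); the names raw and clean are illustrative only.

```r
# Hand-built stand-in for the sample table, with every column a factor,
# mimicking the behaviour described in the question (an assumption).
raw <- data.frame(V1 = c("Jill", "Eve"),
                  V2 = c("Smith", "Jackson"),
                  V3 = c("50", "94"),
                  stringsAsFactors = TRUE)

# Turn each column back into character, then let type.convert guess its type.
# as.is = TRUE keeps text columns as character instead of factor.
clean <- raw
clean[] <- lapply(raw, function(col) type.convert(as.character(col), as.is = TRUE))

# Inspect the resulting column types
str(clean)
```

In a real import, raw would presumably be one element of the list returned by readHTMLTable rather than a hand-built data frame.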

edited Nov 4, 2015 at 12:20

asked Nov 4, 2015 at 9:44

RUser4512

1,135 reputation, 6 silver badges, 21 bronze badges

Which types are being wrongly inferred? Kind of hard to tell without data to try.
– Mike Wise
Commented Nov 4, 2015 at 10:33
It does not infer any type at all... There is the argument elFun that I can play with, but it is hard to reproduce the parsing of read.table.
– RUser4512
Commented Nov 4, 2015 at 10:45
Apparently, doc can be "a character vector containing the HTML content to parse and process", so you should be able to make a small reproducible example by dumping the source HTML into a string: x <- "<html>...</html>". Thanks.
– flodel
Commented Nov 4, 2015 at 12:08
@flodel I added some sample data.
– RUser4512
Commented Nov 4, 2015 at 12:20
The readHTMLTable behavior is clearly surprising and disappointing. A possible alternative to XML is the rvest package and its html_table function. The examples provided in the documentation show that it does differentiate between data types. Otherwise, if you need to stick with the XML package, I'd say your idea of writing/reading via a temp file is good and what I would have used as well.
– flodel
Commented Nov 8, 2015 at 0:30
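
The rvest suggestion can be sketched as follows, assuming the rvest package is installed. The HTML string is the sample from the question; the variable names are illustrative, and whether the result is a plain data frame or a tibble depends on the rvest version.

```r
library(rvest)

# Sample table from the question, held in a string
html <- "<table style=\"width:100%\">
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
  <tr><td>Eve</td><td>Jackson</td><td>94</td></tr>
</table>"

# html_table() on a parsed document returns a list with one entry per <table>
tables <- html_table(read_html(html))
people <- tables[[1]]

# Inspect the inferred column types
str(people)
```

Since the sample has no header row, the columns get default names; the third column should come back as a number rather than a factor.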
