
I am searching for existing datasets that we can use to test several datavis techniques we are researching.

I know several resources like those included in R (try plot(Orange) or see here).

But I'd like to take it one step further:

  • Which are the best real-world datasets to test a visualization tool?
  • Which datasets have you used in academic papers or teaching slides about datavis?
  • Which is the best example from the real world to show the advantages of graphing?
Many good real-world examples, with some of the linked projects providing the data sets (but most don't, unfortunately): infosthetics.com –  WSkid Sep 27 '11 at 21:52
    
Are you expressly looking for free data sets? –  Fomite Oct 6 '11 at 18:12
Visualization depends on context and audience (among other things), suggesting that "best" is ambiguous in this context. You may get more focused, pertinent replies by indicating what "techniques" you are researching. –  whuber Oct 6 '11 at 19:53
    
@whuber The techniques are about automating visualization. By "best" I mean best for explaining, and best for benchmarking. –  robermorales Oct 8 '11 at 9:47
    
@EpiGrad Yes, as free as possible. –  robermorales Oct 8 '11 at 9:48

6 Answers

Accepted answer

There are a large number of databases available on the internet, and the best sources depend on the subject.

For example, in the Human Development subject area, you can find data sources at (http://hdrstats.undp.org/):

http://hdrstats.undp.org/en/tables/default.html

For climate change observations, there is a website with high-resolution climate data at (http://www.ipcc-data.org/), for example:

http://www.ipcc-data.org/obs/cru_ts2_1.html

Both examples contain large quantities of real data that have been used in published scientific papers. The data are time-related and/or space-related, so the visualization possibilities are endless.

Which of the possible datasets from these magnificent sources do you like best? Thanks –  robermorales Oct 11 '11 at 12:06
It depends on which "taste" of visualization you want to test. For example, to explore or show time series, the IPCC website has plenty of data and is widely used (obviously for analyzing climate change). To show spatial data, the Human Development website contains a lot of space-related data as well as time-related data. –  Jose Zubcoff Oct 11 '11 at 15:25

I like to use the Anscombe data sets (also available in R) to show the importance of plotting when doing regressions. If you aren't familiar, you get the same regression line and diagnostics from all four data sets, even though the sets themselves all look quite different. You can take the plots below and turn them into residual plots to illustrate problems that you might look for in the residuals after performing a regression.

[Image: Anscombe data sets]
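
To make the point about identical regression lines concrete, here is a minimal sketch in plain Python (no plotting library assumed). The numbers are the standard values of Anscombe's quartet, the same ones shipped as the `anscombe` data frame in R. The helper name `fit_line` and the dictionary `ANSCOMBE` are made up for this illustration; they are not part of any library.

```python
from math import sqrt

# Anscombe's quartet: sets I-III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def fit_line(xs, ys):
    """Ordinary least squares fit of y on x.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Print the fitted line and correlation for each of the four sets.
for name, (xs, ys) in ANSCOMBE.items():
    slope, intercept, r = fit_line(xs, ys)
    print(f"set {name}: y = {intercept:.2f} + {slope:.2f} x, r = {r:.2f}")
```

All four sets should report nearly the same fit (slope close to 0.5, intercept close to 3), which is exactly why the summary numbers alone are not enough and the plots are needed. Residuals for a residual plot could be obtained as `y - (intercept + slope * x)` for each point.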

Yeah, we already knew those datasets. They are a good starting point. –  robermorales Sep 27 '11 at 18:08
    
The main problem is that it is not a real-world dataset. –  robermorales Sep 27 '11 at 18:17
@robermorales, Fair enough, but I think that seeing the "pure" version of the problem makes it easier to understand messier, real-world visualizations/problems. –  Charlie Sep 27 '11 at 18:26

which is the best example from the real world to show the advantages of graphing?

Any big table. For example, do a Google image search for "official census table". You'll see things like the one below.

Also look at Gelman et al. (2002) Let's Practice What We Preach: Turning Tables into Graphs. American Statistician 56:121-130

[Image: huge complicated table]

Good tip! We didn't know that reference. –  robermorales Sep 28 '11 at 6:58

William S. Cleveland has two books full of great uses of graphics, and the data and code to create the graphs in Visualizing Data are on his website.

Great tip! Thanks –  robermorales Oct 8 '11 at 9:39
    
Which of the datasets by Cleveland do you like most? Thanks –  robermorales Oct 11 '11 at 12:06
@robertomorales I think they are all well chosen for their purposes. Anyone interested in statistical graphics should study Cleveland carefully. –  Peter Flom Oct 11 '11 at 16:50

Possibly you already know of these, but here they are anyway:

The UCI Machine Learning Repository has many publicly accessible, real-world data sets.

The US Government makes many of its datasets public at data.gov.

If you want some tricky visualization data, I'd suggest looking at a classification task. Seems to me that the Bag of Words set on the UCI MLR has some nice properties, but I could be mistaken (been a while since I used it).

Thanks! There are a lot! –  robermorales Oct 8 '11 at 9:43

I just noticed loads of datasets here:

http://www.inside-r.org/howto/finding-data-internet

Don't know if that's any use?

I'm afraid I don't teach visualisation so I can't comment on your specific questions.
