
I've been doing some self-directed learning with weather data in Scala, because the data is free, big, and feeds into lots of other stuff I would like to do. I immediately ran into trouble with how to represent data from a text file with more than 22 columns.

What is the idiomatic way of handling data with more than 22 columns? I've been trying to read in the NWS-USAF-NAVY station list at the NOAA and they have 32 pieces of information per line.

My initial inclination was to use case classes, but the most straightforward way of defining them:

  /*
  first goal is to be able to read the inventory of WBAN stations from NOAA at
  ftp.ncdc.noaa.gov/pub/data/inventories/WBAN.TXT
  formats are listed in:
  ftp://ftp.ncdc.noaa.gov/pub/data/inventories/WBAN-FMT.TXT
  although I don't think that file is completely right
  */

case class WBAN(
  CoopStationID:                  Option[String],  // 01 - 06     Coop Station Id
  ClimateDivision:                Option[String],  // 08 - 09     Climate Division
  WBANStationID:                  Option[String],  // 11 - 15     WBAN Station Id
  WMOStationID:                   Option[String],  // 17 - 21     WMO Station Id
  FAALOCID:                       Option[String],  // 23 - 26     FAA LOC ID
  // and so on, for 32 elements!

is not permitted, because Scala does not allow case classes with more than 22 fields; the limit comes from the tuple and function types used behind the scenes, which stop at 22.

Nested case classes seem like a possible solution, so instead of having a field for each of the items listed by the NOAA, things like latitude, longitude, elevation, etc. could be nested:

// class representing a latitude or longitude's information
case class DMS(
  Degrees:   Int,
  Minutes:   Int,
  Seconds:   Int
  )

// class combining a lat lon with elevation data
case class LatLonElevation(
  Latitude:          DMS,
  Longitude:         DMS,
  LatLonPrecision:   String,
  ElevationGround:   Option[Int],
  Elevation:         Option[Int],
  ElevationTypeCode: Option[Int]
  )

Or do you put it into a map with a vector in each value?

It seems like there should be a succinct way of doing this, but after implementing it I ended up repeating much of the same information in several different forms, which was extremely ugly. Is there some way to import this kind of data using SLICK or another library, or will it have the same limitations as case classes? As an aside, is it better to use lazy val, Future, or another library to handle the connection?

Why not just make it a normal class? With so many fields you probably won't want to use it for pattern matching, so you're not losing much. –  Luigi Plinge May 22 '13 at 3:49
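
To illustrate the suggestion in this comment, here is a minimal sketch of a regular (non-case) class. The name `WbanRecord` and the lowercase field names are made up for illustration, and only three of the 32 fields are written out. A regular class is not bound by the case-class limit, but nothing is generated for it either.

```scala
// Hypothetical sketch: a regular class is not subject to the 22-field limit
// on case classes. Only three of the 32 fields are shown.
class WbanRecord(
  val coopStationId: Option[String],
  val climateDivision: Option[String],
  val wbanStationId: Option[String]
  // ... remaining fields would be declared the same way
) {
  // equals, hashCode, copy and toString are not generated for a regular
  // class, so define whichever of them you actually need.
  override def toString: String =
    s"WbanRecord(coop=$coopStationId, division=$climateDivision, wban=$wbanStationId)"
}
```

The trade-off is that structural equality and copy have to be written by hand if they are needed.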

2 Answers

Accepted answer (2 votes)

If you have so many fields you should group them somehow.

Without grouping you lose the advantages of a case class, such as the generated equals and copy, which make immutable data convenient to work with.

It is difficult to use classes with so many fields.

In your case, positions 180-219 are a single "location" and should be grouped into a single field. Latitude Seconds is completely useless without the other Latitude fields.

With location grouped there are only 18 fields, and you can reduce this number further.

The natural way to group data fields is nested case classes.
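
As a rough sketch of what the generated equals and copy mentioned above buy you once the fields are nested: the class and field names below (`Dms`, `Position`, `Site`, `NestedDemo`) are illustrative assumptions, loosely following the question's DMS example, not the actual WBAN layout.

```scala
// Illustrative names only; not the real WBAN record layout.
case class Dms(degrees: Int, minutes: Int, seconds: Int)
case class Position(latitude: Dms, longitude: Dms)
case class Site(wbanId: String, position: Position)

object NestedDemo {
  val original = Site("12345", Position(Dms(40, 46, 0), Dms(73, 58, 0)))

  // copy replaces one field and leaves the rest untouched; a nested update
  // chains one copy per level.
  val moved = original.copy(
    position = original.position.copy(latitude = Dms(41, 0, 0))
  )

  // The generated equals compares structurally, including nested values.
  val sameAsOriginal =
    original == Site("12345", Position(Dms(40, 46, 0), Dms(73, 58, 0)))
}
```

Here `NestedDemo.sameAsOriginal` is true, while `moved` differs from `original` only in its latitude.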

To generate nested classes from a string you could use parser combinators. It seems a little overkill, but you'll get a clean code structure:

import scala.util.parsing.combinator.RegexParsers

class WbanParsers extends RegexParsers {
  // Fixed-width input: blanks are significant (see the " " sign below),
  // so don't let RegexParsers skip them automatically.
  override val skipWhitespace = false

  def wban: Parser[Wban] = stationId ~ ... ~ latLong ~ ... ^^
    { case sid ~ ... ~ latLong ~ ... => Wban(sid, ..., latLong, ...) }
  ...
  def latLong: Parser[LatLonElevation] = dms ~ dms ~ ... ^^
    { case lat ~ long ~ ... => LatLonElevation(lat, long, ...) }
  def dms: Parser[Dms] = (" " ^^^ {Positive} | "-" ^^^ {Negative}) ~ degrees ~ minutes ~ seconds ^^
    { case sign ~ d ~ m ~ s => Dms(sign, d, m, s) }
  ...
}

You could create a trait with some parsers in order to reuse them in different formats.

For instance:

class WbanParsers extends RegexParsers with LatLongParser {
  def wban: ... ~ latLong ~ ...
  ...
}

trait LatLongParser {
  this: RegexParsers =>

  def latLong: ...
  def dms: ...
  ...
}
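
For strictly fixed-width lines, a simpler alternative to combinators is plain column slicing. The sketch below is not the approach of this answer; it uses only the five column ranges quoted in the question's comments (1-based, inclusive), and the names `StationIds`, `FixedWidth`, `field` and `parseIds` are hypothetical.

```scala
// Sketch only: slices the first five columns listed in the question's
// comments. All names here are hypothetical.
case class StationIds(
  coopStationId: Option[String],
  climateDivision: Option[String],
  wbanStationId: Option[String],
  wmoStationId: Option[String],
  faaLocId: Option[String]
)

object FixedWidth {
  // Returns the trimmed text in columns from..to (1-based, inclusive),
  // or None if it is blank or the line is too short to reach that range.
  def field(line: String, from: Int, to: Int): Option[String] = {
    val raw = line.slice(from - 1, to) // slice tolerates short lines
    Some(raw.trim).filter(_.nonEmpty)
  }

  def parseIds(line: String): StationIds =
    StationIds(
      field(line, 1, 6),   // 01 - 06  Coop Station Id
      field(line, 8, 9),   // 08 - 09  Climate Division
      field(line, 11, 15), // 11 - 15  WBAN Station Id
      field(line, 17, 21), // 17 - 21  WMO Station Id
      field(line, 23, 26)  // 23 - 26  FAA LOC ID
    )
}
```

Blank columns come back as None, which matches the Option[String] fields in the question. The resulting small case class can then be embedded in a larger nested record.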
I think you might want to read the question again. Despite the title, the problem is not about parsing. –  Régis Jean-Gilles May 22 '13 at 5:50
    
@RégisJean-Gilles: It's possible I did not understand the question. Could you please tell me what is wrong? It's obvious that positions 180-219 is a single field "location" and it should be grouped in a single field. So nested case classes is the way to go. The author realized it himself. "Is there some way to import this kind of data using ...?". I think the only way to import this kind of data into nested classes is to parse it. –  senia May 22 '13 at 7:10
I also don't see that your post is a possible answer to the OP's problem, since it currently is about parsing only. Your comment, however, made the connection clearer. Please improve your answer by focusing on the "nested case classes" part. The parsing part is probably helpful as well, but does not seem to be the core concern of the OP ("what could be a solution" -> nested case classes; "how to get there" -> for example, parsing via parsing combinators). –  Malte Schwerhoff May 22 '13 at 8:16
    
@mhs: thank you for your improvement suggestions. –  senia May 22 '13 at 8:59
mhs took the words out of my mouth. In its updated form your answer now makes sense and is worth a +1. –  Régis Jean-Gilles May 22 '13 at 9:12

The problem you're describing is more general than Scala. There are two approaches to representing data: denormalized or normalized form. Denormalized data is flat and better suited to storage and transfer, but harder for humans to reason about and manage. Normalized data is the exact opposite.

You get your data in a denormalized form. Since you're planning to use it in a high level language full of abstractions and categorizations, it is natural to normalize this data while parsing.

In practice, 22 fields per case class is far more than enough to represent any kind of data in normalized form. Your case is no exception, as you've already discovered yourself in your "LatLonElevation" example.

Normalizing your data

Looking at your data structure, it's easy to immediately extract some subentities spanning multiple fields: Latitude, Longitude, Elevation. Then you can see that those three can themselves be grouped, since together they represent location information. Then you can look at your structure again and see the FAA LOC ID, NWS Location Identifier, Country, State/Province Abbreviation and County fields, all of which it makes sense to move into the Location too. Continuing this way, you'll end up with a normalized data structure consisting of a bunch of case classes that refer to one another, each having no more than a handful of fields.

case class Station 
  ( coopId : String,
    wbanId : String,
    wmoId : String,
    icaoId : String,
    location : Location )

case class Location
  ( faaId : String,
    nwsId : String,
    country : String,
    stateOrProvince : Option[ String ],
    county : Option[ String ],
    latitude : Latitude,
    longitude : Longitude,
    elevation : Elevation )

case class Latitude
  ( direction : LatitudeDirection.Value,
    degrees : Int,
    minutes : Int,
    seconds : Int )

object LatitudeDirection extends Enumeration {
  val North, South = Value
}

//  and so on
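
One thing a small grouped class makes easy is attaching behaviour to it. The sketch below repeats the Latitude definitions from above so that it stands alone, and adds a hypothetical `toDecimalDegrees` method (not part of the original answer) that applies the usual degrees + minutes/60 + seconds/3600 conversion, negated for South.

```scala
object LatitudeDirection extends Enumeration {
  val North, South = Value
}

case class Latitude(
  direction: LatitudeDirection.Value,
  degrees: Int,
  minutes: Int,
  seconds: Int
) {
  // Hypothetical helper: signed decimal degrees, negative for South.
  def toDecimalDegrees: Double = {
    val magnitude = degrees + minutes / 60.0 + seconds / 3600.0
    if (direction == LatitudeDirection.South) -magnitude else magnitude
  }
}
```

For example, `Latitude(LatitudeDirection.South, 40, 30, 0).toDecimalDegrees` gives -40.5.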

Storing this data

Luckily for you, there is SORM, a Scala DB framework that works well with normalized data.

    
The distinction is helpful and I will check out SORM as well as SLICK which I mentioned above. –  Jonathan Lawlor May 22 '13 at 12:39
