Parse Data with Many Attributes in Scala

Question

I've been doing some self-directed learning with weather data in Scala, because the data is free, big, and feeds into lots of other things I would like to do. I immediately ran into trouble with how to represent data from a text file with more than 22 columns.

What is the idiomatic way of handling data with more than 22 columns? I've been trying to read in the NWS-USAF-NAVY station list from NOAA, which has 32 pieces of information per line.

My initial inclination was to use case classes, but the most straightforward way of defining them:

  /*
  first goal is to be able to read the inventory of WBAN stations from NOAA at
  ftp.ncdc.noaa.gov/pub/data/inventories/WBAN.TXT
  formats are listed in:
  ftp://ftp.ncdc.noaa.gov/pub/data/inventories/WBAN-FMT.TXT
  although I don't think that file is completely right
  */

case class WBAN(
  CoopStationID:                  Option[String],  // 01 - 06     Coop Station Id
  ClimateDivision:                Option[String],  // 08 - 09     Climate Division
  WBANStationID:                  Option[String],  // 11 - 15     WBAN Station Id
  WMOStationID:                   Option[String],  // 17 - 21     WMO Station Id
  FAALOCID:                       Option[String],  // 23 - 26     FAA LOC ID
  // and so on, for 32 elements!

is not permitted, because Scala does not allow case classes with more than 22 fields; it uses tuples to represent the data.

Nested tuples seem like a possible solution, so instead of having a field for each of the items listed by the NOAA, things like latitude, longitude, elevation, etc. could be nested:

// class representing a latitude or longitude's information
case class DMS(
  Degrees:   Int,
  Minutes:   Int,
  Seconds:   Int
)

// class combining a lat lon with elevation data
case class LatLonElevation(
  Latitude:          DMS,
  Longitude:         DMS,
  LatLonPrecision:   String,
  ElevationGround:   Option[Int],
  Elevation:         Option[Int],
  ElevationTypeCode: Option[Int]
)

Or do you put it into a map with a vector in each value?

It seems like there should be a succinct way of doing this, but when I implemented it I ended up repeating the same information in several different formats, which was extremely ugly. Is there some way to import this kind of data using Slick or another library, or will it have the same limitations as the case classes? As an aside, is it better to use lazy val, Future, or another library to handle the connection?

Why not just make it a normal class? With so many fields you probably won't want to use it for pattern matching, so you're not losing much. — Luigi Plinge, May 22 '13 at 3:49

senia · Accepted Answer · 2013-05-22 08:58:42Z


If you have that many fields, you should group them somehow.

Without grouping, you cannot use a case class here, so you lose its advantages, such as the generated equals and copy methods, and keeping the class immutable becomes harder.

Classes with so many fields are also difficult to use.

In your case, positions 180-219 make up a single "location" value and should be grouped into a single field. Latitude Seconds, for example, is useless without the other Latitude fields.

With location grouped, only 18 fields remain, and you can reduce that number further.

The natural way to group data fields is nested case classes.
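As a rough illustration of what grouping buys you, here is a minimal sketch with a trimmed-down, hypothetical model. The names Dms, Position and Station and all sample values are made up for this example and are not the real WBAN layout. It shows that the generated copy and equals keep working through each level of nesting:

```scala
// Hypothetical, trimmed-down model; names and values are illustrative only.
final case class Dms(degrees: Int, minutes: Int, seconds: Int)
final case class Position(latitude: Dms, longitude: Dms, elevation: Option[Int])
final case class Station(wbanId: Option[String], name: String, position: Position)

object NestedDemo {
  // A made-up station record built from the nested pieces.
  val original: Station = Station(
    wbanId = Some("00001"),
    name = "EXAMPLE FIELD",
    position = Position(Dms(40, 26, 46), Dms(79, 58, 56), elevation = Some(300))
  )

  // copy is applied level by level: outer record first, then the nested one.
  val moved: Station =
    original.copy(position = original.position.copy(elevation = None))

  // Restoring the elevation yields a value that is structurally equal to the original.
  val restored: Station =
    moved.copy(position = moved.position.copy(elevation = Some(300)))
}
```

Each class stays small enough to read at a glance, and no single parameter list comes anywhere near the 22-field limit.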

To build the nested classes from a string, you could use parser combinators. It may seem like overkill, but you'll get a clean code structure:

class WbanParsers extends RegexParsers {
  def wban: Parser[Wban] = stationId ~ .... ~ latLong ~ ... ^^
    { case sid ~ ... ~ latLong ~ ... => Wban(sid, ..., latLong, ...) }
  ...
  def latLong: Parser[LatLonElevation] = dms ~ dms ~ ... ^^
    { case lat ~ long ~ ... => LatLonElevation(lat, long, ...) }
  def dms: Parser[Dms] = (" " ^^^ {Positive} | "-" ^^^ {Negative}) ~ degrees ~ minutes ~ seconds ^^
    { case sign ~ d ~ m ~ s => Dms(sign, d, m, s) }
  ...
}

You could put some of the parsers in a trait in order to reuse them across different formats.

For instance:

class WbanParsers extends RegexParsers with LatLongParser {
  def wban: ... ~ latLong ~ ...
  ...
}

trait LatLongParser {
  this: RegexParsers =>

  def latLong: ...
  def dms: ...
  ...
}
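To make the outline above more concrete, here is a small self-contained sketch of the trait-based variant. It assumes the scala-parser-combinators module is on the classpath (it was part of the standard library when this answer was written and has been a separate module since Scala 2.11). The field widths, the Sign type and the DemoParsers object are invented for illustration and do not follow WBAN-FMT.TXT:

```scala
import scala.util.parsing.combinator.RegexParsers

sealed trait Sign
case object Positive extends Sign
case object Negative extends Sign

final case class Dms(sign: Sign, degrees: Int, minutes: Int, seconds: Int)
final case class LatLong(latitude: Dms, longitude: Dms)

// Reusable parsers; the self-type means this trait can only be mixed into a RegexParsers.
trait LatLongParser { this: RegexParsers =>
  // Exactly n digits, converted to Int.
  private def digits(n: Int): Parser[Int] = s"\\d{$n}".r ^^ (_.toInt)

  // Assumed convention: a blank means positive, "-" means negative.
  def sign: Parser[Sign] = " " ^^^ (Positive: Sign) | "-" ^^^ (Negative: Sign)

  // Sign, degrees of the given width, then two-digit minutes and seconds.
  def dms(degreeWidth: Int): Parser[Dms] =
    sign ~ digits(degreeWidth) ~ digits(2) ~ digits(2) ^^
      { case s ~ d ~ m ~ sec => Dms(s, d, m, sec) }

  // Hypothetical layout: latitude with 2-digit degrees, longitude with 3-digit degrees.
  def latLong: Parser[LatLong] =
    dms(2) ~ dms(3) ^^ { case lat ~ lon => LatLong(lat, lon) }
}

object DemoParsers extends RegexParsers with LatLongParser {
  // Blanks are significant in fixed-width data, so do not skip them.
  override def skipWhitespace: Boolean = false

  def parseLatLong(input: String): Option[LatLong] =
    parseAll(latLong, input) match {
      case Success(value, _) => Some(value)
      case _                 => None
    }
}
```

With this made-up layout, DemoParsers.parseLatLong(" 402646-0795856") would be expected to produce a LatLong whose latitude is positive and whose longitude is negative; input that does not match yields None instead of throwing.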

edited May 22 '13 at 8:58

answered May 22 '13 at 4:19


I think you might want to read the question again. Despite the title, the problem is not about parsing. – Régis Jean-Gilles May 22 '13 at 5:50

@RégisJean-Gilles: It's possible I did not understand the question. Could you please tell me what is wrong? It's obvious that positions 180-219 is a single field "location" and it should be grouped in a single field. So nested case classes is the way to go. The author realized it himself. "Is there some way to import this kind of data using ...?". I think the only way to import this kind of data into nested classes is to parse it. – senia May 22 '13 at 7:10


I also don't see that your post is a possible answer to the OP's problem, since it currently is about parsing only. Your comment, however, made the connection clearer. Please improve your answer by focusing on the "nested case classes" part. The parsing part is probably helpful as well, but does not seem to be the core concern of the OP ("what could be a solution" -> nested case classes; "how to get there" -> for example, parsing via parsing combinators). – Malte Schwerhoff May 22 '13 at 8:16

@mhs: thank you for your improvement suggestions. – senia May 22 '13 at 8:59


mhs took the words out of my mouth. In its updated form your answer now makes sense and is worth a +1. – Régis Jean-Gilles May 22 '13 at 9:12


Nikita Volkov · Answer 2 · 2013-05-22 09:40:37Z

The problem you're describing is more general than Scala. There are two approaches to representing data: denormalized and normalized. Denormalized data is flat and better suited to storage and transfer, but harder for humans to reason about and manage. Normalized data is the exact opposite.

You get your data in a denormalized form. Since you're planning to use it in a high level language full of abstractions and categorizations, it is natural to normalize this data while parsing.

In practice, the 22 fields a case class allows are far more than enough to represent any kind of data in normalized form. Your case is no exception, as you've already discovered yourself in your "LatLonElevation" example.

Normalizing your data

Looking at your data structure, it's easy to immediately extract some subentities spanning multiple fields: Latitude, Longitude, Elevation. Then you can see that those three can themselves be grouped, since together they represent Location information. Looking at the structure again, you can see the FAA LOC ID, NWS Location Identifier, Country, State/Province Abbreviation and County fields, all of which it makes sense to move into Location too. Continuing this way, you'll end up with a normalized data structure consisting of a handful of case classes that refer to one another, each with only a few fields.

case class Station 
  ( coopId : String,
    wbanId : String,
    wmoId : String,
    icaoId : String,
    location : Location )

case class Location
  ( faaId : String,
    nwsId : String,
    country : String,
    stateOrProvince : Option[ String ],
    county : Option[ String ],
    latitude : Latitude,
    longitude : Longitude,
    elevation : Elevation )

case class Latitude
  ( direction : LatitudeDirection.Value,
    degrees : Int,
    minutes : Int,
    seconds : Int )

object LatitudeDirection extends Enumeration {
  val North, South = Value
}

//  and so on
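One way to do the normalization while reading each line is plain substring slicing. This is a minimal sketch, not a full reader: the column offsets are invented for the example (they are not taken from WBAN-FMT.TXT), the classes are cut-down versions of the ones above with fewer fields, and the FlatToNormalized object and its helpers are hypothetical names:

```scala
object LatitudeDirection extends Enumeration {
  val North, South = Value
}

// Cut-down versions of the model, with fewer fields to keep the sketch short.
final case class Latitude(direction: LatitudeDirection.Value, degrees: Int, minutes: Int, seconds: Int)
final case class Location(country: String, county: Option[String], latitude: Latitude)
final case class Station(wbanId: String, location: Location)

object FlatToNormalized {
  // Trimmed text in columns [from, until) of a line; short lines give shorter (or empty) text.
  private def column(line: String, from: Int, until: Int): String =
    line.slice(from, until).trim

  // Blank columns are treated as missing values.
  private def optional(text: String): Option[String] =
    if (text.isEmpty) None else Some(text)

  // Assumed layout: sign at column 27, then "DD MM SS" in columns 28 to 35.
  // Non-numeric content in these columns throws NumberFormatException.
  def latitude(line: String): Latitude = {
    val direction =
      if (column(line, 27, 28) == "-") LatitudeDirection.South
      else LatitudeDirection.North
    Latitude(
      direction,
      column(line, 28, 30).toInt,
      column(line, 31, 33).toInt,
      column(line, 34, 36).toInt
    )
  }

  // Assumed layout: id in columns 0 to 4, country in 6 to 7, county in 9 to 25.
  def station(line: String): Station =
    Station(
      wbanId = column(line, 0, 5),
      location = Location(
        country = column(line, 6, 8),
        county = optional(column(line, 9, 26)),
        latitude = latitude(line)
      )
    )
}
```

The flat line is only touched in one place, so the rest of the program works with Station, Location and Latitude values and never sees column numbers.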

Storing this data

Luckily for you, there is SORM, a Scala DB framework that works well with normalized data.

The distinction is helpful and I will check out SORM as well as SLICK which I mentioned above. — Jonathan Lawlor, May 22 '13 at 12:39
