
What is the best way to make key-value pairs out of an HDFS sequence file? The reason I am asking is that I have to sort a sequence file, and the sortByKey method is not available unless your RDD is in the form of key-value pairs. I am using Apache Spark 1.0.2 and HDFS 2.5.0.


1 Answer


From the Spark documentation:

For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.

The key point is that you have to map the Hadoop Writable types to native types (String, Int, ...) and build your desired (k, v) RDD in order to apply the sortByKey method.

// Read the SequenceFile as Hadoop Writables (Text keys and values)
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])
// Convert Writables to plain Strings so sortByKey becomes available
val map = file.map { case (k, v) => (k.toString, v.toString) }
val sortedOutput = map.sortByKey(true)
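For reference, the native-type form mentioned in the quoted documentation lets you skip the explicit Writable-to-String map entirely. A minimal sketch (the input and output paths are placeholders, and this assumes a SparkContext built for your cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions such as sortByKey (Spark 1.x)

val sc = new SparkContext(new SparkConf().setAppName("SortSeqFile"))

// sequenceFile[String, String] converts Text Writables to Strings automatically
val pairs = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // placeholder path
val sorted = pairs.sortByKey(true)
sorted.saveAsTextFile("hdfs:///path/to/output")  // placeholder path
```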
    
That is the code that I came up with too. I am having trouble with it scaling with larger datasets (>= 500GB). It spills to disk during the shuffle phase and this is where my job dies on larger data sets. It runs the workers out of physical disk space. I know that I could move the path where it is writing to a larger drive, but that just seems lame. For a 32 GB dataset, my job will shuffle 74.5 GB. This seems odd to me. Any thoughts? TIA –  Crackerman Oct 3 '14 at 11:24
    
I don't know the reason for that, but release 1.1.0 introduces a new shuffle implementation optimized for very large scale shuffles. issues.apache.org/jira/browse/SPARK-2045 spark.apache.org/releases/spark-release-1-1-0.html –  gasparms Oct 3 '14 at 16:54
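For the spill problem discussed in the comments, two knobs are commonly tried in Spark 1.x: pointing shuffle spill files at larger (or multiple) local disks via spark.local.dir, and giving the sort more partitions so each task spills less. A hedged sketch (all paths and the partition count are placeholder assumptions, not values from the original discussion):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir controls where shuffle spill files land;
// a comma-separated list spreads spills across disks (Spark 1.x).
val conf = new SparkConf()
  .setAppName("SortLargeSeqFile")
  .set("spark.local.dir", "/mnt/disk1/tmp,/mnt/disk2/tmp")  // placeholder paths

val sc = new SparkContext(conf)
val pairs = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // placeholder path
// The second argument raises the number of partitions used for the sort,
// reducing the per-task shuffle footprint.
val sorted = pairs.sortByKey(true, 2048)  // placeholder partition count
sorted.saveAsTextFile("hdfs:///path/to/output")  // placeholder path
```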
