
What is the best way to make key-value pairs out of an HDFS sequence file? The reason I am asking is that I have to sort a sequence file, and the sortByKey method is not available unless your RDD is in the form of key-value pairs. I am using Apache Spark 1.0.2 and HDFS 2.5.0.


1 Answer


From the Spark documentation:

For SequenceFiles, use SparkContext’s sequenceFile[K, V] method where K and V are the types of key and values in the file. These should be subclasses of Hadoop’s Writable interface, like IntWritable and Text. In addition, Spark allows you to specify native types for a few common Writables; for example, sequenceFile[Int, String] will automatically read IntWritables and Texts.

The key point is that you have to map the Hadoop Writable types to native types (String, Int, ...) and build your desired (k, v) RDD in order to apply the sortByKey method.

// Read the SequenceFile as Hadoop Writables (Text keys and values)
val file = sc.sequenceFile(input, classOf[Text], classOf[Text])
// Convert Writables to plain Strings so sortByKey becomes available
val map = file.map { case (k, v) => (k.toString, v.toString) }
val sortedOutput = map.sortByKey(true)
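For reference, the native-type form mentioned in the quoted documentation lets you skip the explicit Writable-to-String map entirely. A minimal sketch (the input and output paths are placeholders, and this assumes a SparkContext built for your cluster):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._  // pair-RDD functions such as sortByKey (Spark 1.x)

val sc = new SparkContext(new SparkConf().setAppName("SortSeqFile"))

// sequenceFile[String, String] converts Text Writables to Strings automatically
val pairs = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // placeholder path
val sorted = pairs.sortByKey(true)
sorted.saveAsTextFile("hdfs:///path/to/output")  // placeholder path
```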
    
That is the code that I came up with too. I am having trouble with it scaling with larger datasets (>= 500GB). It spills to disk during the shuffle phase and this is where my job dies on larger data sets. It runs the workers out of physical disk space. I know that I could move the path where it is writing to a larger drive, but that just seems lame. For a 32 GB dataset, my job will shuffle 74.5 GB. This seems odd to me. Any thoughts? TIA –  Crackerman Oct 3 '14 at 11:24
    
I don't know the reason for that, but release 1.1.0 introduces a new shuffle implementation optimized for very large scale shuffles. issues.apache.org/jira/browse/SPARK-2045 spark.apache.org/releases/spark-release-1-1-0.html –  gasparms Oct 3 '14 at 16:54
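For the spill problem discussed in the comments, two knobs are commonly tried in Spark 1.x: pointing shuffle spill files at larger (or multiple) local disks via spark.local.dir, and giving the sort more partitions so each task spills less. A hedged sketch (all paths and the partition count are placeholder assumptions, not values from the original discussion):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.local.dir controls where shuffle spill files land;
// a comma-separated list spreads spills across disks (Spark 1.x).
val conf = new SparkConf()
  .setAppName("SortLargeSeqFile")
  .set("spark.local.dir", "/mnt/disk1/tmp,/mnt/disk2/tmp")  // placeholder paths

val sc = new SparkContext(conf)
val pairs = sc.sequenceFile[String, String]("hdfs:///path/to/input")  // placeholder path
// The second argument raises the number of partitions used for the sort,
// reducing the per-task shuffle footprint.
val sorted = pairs.sortByKey(true, 2048)  // placeholder partition count
sorted.saveAsTextFile("hdfs:///path/to/output")  // placeholder path
```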
