0
votes
0 answers
9 views

Elastic MapReduce hangs on step because of upload to S3; is this CombineFileInputFormat's fault?

Often when a step (job) has long been complete, Elastic MapReduce with S3 intermediates will hang on 1 or 2 tasks, presumably because data are being uploaded to S3. This hanging can take considerable ...
3
votes
2 answers
71 views

Spark/Hadoop throws exception for large LZO files

I'm running an EMR Spark job on some LZO-compressed log-files stored in S3. There are several logfiles stored in the same folder, e.g.: ... s3://mylogfiles/2014-08-11-00111.lzo ...
1
vote
1 answer
27 views

Number of concurrently running mappers per node drops precipitously on Elastic MapReduce w/ AMI 3.1.0 and Hadoop 2.4.0 as cluster size increases

In a related question (How to set the precise max number of concurrently running tasks per node in Hadoop 2.4.0 on Elastic MapReduce), I ask for formulas relating the number of concurrently running ...
0
votes
1 answer
19 views

Process entire files using Hadoop streaming on Amazon EMR

I have a directory full of gzipped text files on Amazon S3, and I'm trying to use Hadoop streaming on Amazon Elastic MapReduce to apply a function to each file individually (specifically, parse a ...
0
votes
1 answer
28 views

Running a Simple Hadoop Command Using Java Code

I would like to list files using the Hadoop command "hadoop fs -ls filepath". I want to write Java code to achieve this. Can I write a small piece of Java code, make a jar of it and supply it to Map ...
0
votes
0 answers
17 views

Amazon EMR: How to copy logs from S3 and store them in two different locations inside HDFS

I want to copy logs from S3 into HDFS and store them in two different locations. I am doing this: $EMR_BIN/elastic-mapreduce --jobflow $JOBFLOW --jar /home/hadoop/AmazonDistCp-1.1.jar \ --main-class ...
2
votes
0 answers
27 views

Does Hadoop Streaming's performance decrease if I use -mapper cat rather than -mapper org.apache.hadoop.mapred.lib.IdentityMapper?

I have had problems trying to use org.apache.hadoop.mapred.lib.IdentityMapper as the argument of -mapper in Hadoop Streaming 1.0.3. "cat" works though; does using cat affect performance -- especially ...
0
votes
1 answer
31 views

Unable to load Hive-JDBC driver when accessed through MapReduce program on Amazon's Elastic MapReduce

I have written a MapReduce program in which I am storing part of the output data in a Hive table. I have used the Hive-JDBC driver to access the Hive table via MapReduce code. This program has compiled ...
0
votes
0 answers
13 views

Issue with using files in distributed cache in Elastic MapReduce

I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job. However, my script doesn't seem to be able to find the modules in the cache. I archived the ...
0
votes
0 answers
18 views

ChainMappers Hadoop Parallelism

I am trying to use ChainedMappers, and I have some doubts regarding the usage: Mapper1 (M1) -> M1(K1, V1) Now M1 does some processing & emits 3 key-value pairs (K2, V2), (K3, V3), (K4, V4) Mapper2 ...
0
votes
1 answer
37 views

R Reducer is not working properly in Amazon EMR

I have written MapReduce code in R to run on Amazon EMR. My input file format: URL1 word1 word2 word3 URL2 word4 word2 word3 URL3 word1 word7 word2 I'm expecting the output as: URLs are concat ...
0
votes
1 answer
54 views

Map Error - attempt_xxxx Timed out after 600 seconds

I'm using Hadoop 2.2.0, and when I run my map tasks I get the following error: attempt_xxx Timed out after 1800000 seconds (it's 1800000 because I have changed the config for ...
0
votes
0 answers
18 views

Cannot Start & Process MapReduce Job

I created a custom EMR cluster with 1 master node and 3 core nodes (with 0 task nodes), all of them of m1.large configuration. I made a sample MapReduce program to analyze TCPDump data on my Eclipse ...
0
votes
2 answers
34 views

How to read a file from s3 in EMR?

I would like to read a file from S3 in my EMR Hadoop job. I am using the Custom JAR option. I have tried two solutions: org.apache.hadoop.fs.S3FileSystem: throws a NullPointerException. ...
0
votes
1 answer
60 views

Hadoop on EMR - Map Tasks Not Parallel

I've set up an EMR job through Data Pipeline in AWS. This job is to transfer CSV data from S3 to DynamoDB. My data size is 400 MB. I set mapred.max.split.size = 134217728 (i.e. 128 MB). With that, ...
0
votes
0 answers
14 views

How do I use FileOutputCommitter from Java in Hadoop?

This is the beginning of my code: public class LogParserMapReduce extends Configured implements Tool { @Override public int run(String[] args) throws Exception { Configuration conf = ...
1
vote
1 answer
124 views

“Unable to verify integrity of data” while running MR job

I'm running a relatively big MR job using Amazon Elastic Map Reduce. I ran the job plenty of times on small data sets with no problem. But when trying to run it on a large dataset I'm getting the ...
1
vote
1 answer
24 views

How is data distributed among datanodes in MapReduce?

I'm new to MapReduce, and I have the task of processing large data (lines of records). One thing I need to use is the line number of a specific record in my mapper, and then the reducer processes the line number ...
0
votes
3 answers
69 views

How is data partitioned and distributed among datanodes in MapReduce?

I'm new to MapReduce, and I have the task of processing large data (lines of records). One thing I need to use is the line number of a specific record in my mapper, and then the reducer processes the line number ...
1
vote
1 answer
53 views

Copying a large file (~6 GB) from S3 to every node of an Elastic MapReduce cluster

Turns out that copying a large file (~6 GB) from S3 to every node in an Elastic MapReduce cluster in a bootstrap action doesn't scale well; the pipe is only so big, and downloads to the nodes get ...
1
vote
0 answers
84 views

EMR hadoop tasks agonize for hours when losing task nodes

I've set up an Amazon EMR jobflow with 1 on-demand core node and 4 task nodes with bidding. When I run my task on only the core node each step finishes within 1 hour. When I'm lucky and have 1 core + ...
0
votes
1 answer
32 views

How to bid for a spot instance with price: 0.0164

I looked at the charts of last week's EC2 prices for m1.large in us-east-1c, and I saw prices like: 0.0160, 0.0161, 0.0162, 0.0163 so clearly there must be a way to bid for prices like this, but when ...
1
vote
1 answer
53 views

How to find the right proportion between Hadoop instance types

I am trying to find out how many MASTER, CORE, and TASK instances are optimal for my jobs. I couldn't find any tutorial that explains how to figure it out. How do I know if I need more than 1 core ...
1
vote
1 answer
107 views

How can I turn off Hadoop speculative execution from Java?

After reading Hadoop speculative task execution I am trying to turn off speculative execution using the new Java api, but it has no effect. This is my Main class: public class Main { public ...
2
votes
1 answer
294 views

Hadoop failure copying input bz2 file from s3

I have a map-only hadoop job, running on Amazon's EMR, running on the latest ami-version: 3.0.4. Once in a while I get exceptions like this: Error: com.amazonaws.AmazonClientException: Unable to ...
2
votes
2 answers
224 views

Oozie on EMR - tasks hang forever in PREP state

I am running Oozie 4.0.1 on Elastic Mapreduce using the 3.0.4 AMI (Hadoop 2.2.0). I've built Oozie from source, and everything installs and seems to work correctly, up to the point of scheduling a ...
0
votes
1 answer
326 views

FAILED: NullPointerException null in HIVE QUERY

Following is the Hive query I am using; I am also using a ranking function. I am running this on my local machine. SELECT numeric_id, location, Rank(location), followers_count FROM ( SELECT ...
0
votes
0 answers
13 views

Is there a way to get information from the jobflow and steps inside a step?

Is there a way to know from a shell script that is running in elastic mapreduce (from script-runner.jar) whether there are following steps or it is the last step?
0
votes
1 answer
77 views

Running Mappers and Reducers on different Groups of machines

We have a nice, big, complicated elastic-mapreduce job that has wildly different constraints on hardware for the Mapper vs Collector vs Reducer. The issue is: for the Mappers, we need tonnes of ...
2
votes
1 answer
51 views

How to know job flow id, other cluster parameters in script running via script-runner.jar

I'm starting an elastic mapreduce cluster with the following command-line: $ elastic-mapreduce \ --create \ --num-instances "${INSTANCES}" \ --instance-type m1.medium \ --ami-version 3.0.4 \ --name ...
0
votes
1 answer
119 views

BZip2 Native Splitting on Amazon/EMR

We have a question in specific regard to compressed input on an Amazon EMR Hadoop job. According to AWS: "Hadoop checks the file extension to detect compressed files. The compression types ...
0
votes
0 answers
14 views

Custom Grouping and Partitioning in Job Conf

AWS Job not accepting the configuration parameters for Custom Grouping and Custom Sorting. conf3.setOutputValueGroupingComparator(StockKeyGroupingComparator.class); ...
0
votes
0 answers
88 views

Is there an open source version of s3distcp?

I would love to use s3distcp for copying data from S3 buckets to S3 buckets but I have the need to use an external proprietary encryption mechanism to ensure the data is encrypted at rest (keeping the ...
0
votes
1 answer
40 views

setting ssh permission in hadoop installation

I'm trying to install hadoop for the first time and I'm following this tutorial http://www.youtube.com/watch?v=xrxQXfE7t9A & https://sites.google.com/site/howtohadoop/how-to-install-hdp#bmec2 ...
1
vote
2 answers
99 views

How to implement the combiner in Hadoop MapReduce?

I understand that to include a combiner in Hadoop MapReduce the following line is added (which I have done already): conf.setCombinerClass(MyReducer.class); What I don't understand is that ...
0
votes
0 answers
72 views

Unable to parse credentials.json

I have been trying to run Amazon's Elastic MapReduce command line interface, and I have gotten to the point of validating the install. I created my .json file per the instructions, but for some ...
0
votes
0 answers
106 views

How do I convert my Java Hadoop code to run on EC2?

I wrote a Driver, Mapper, and Reducer class in Java that runs the k-nearest neighbor algorithm on test data, and pulls in the training set using Distributed Cache. I used a Cloudera virtual machine ...
0
votes
1 answer
283 views

Trouble using hbase from java on Amazon EMR

So I'm trying to query my HBase cluster on Amazon EC2 using a custom jar I launch as a MapReduce step. In my jar (inside the map function) I call HBase like so: public void map( Text key, BytesWritable ...
0
votes
1 answer
507 views

Class not found exception in eclipse wordcount program

I am running a word count program from Eclipse, and it says class not found. I exported the same program as a jar file and executed it from the command line, and it works fine. Here is the error stack trace ...
0
votes
1 answer
118 views

Writing to a file in S3 from jar on EMR on AWS

Is there any way I can write a file from my Java jar to the S3 folder where my reduce files would be written? I have tried something like: FileSystem fs = FileSystem.get(conf); ...
0
votes
1 answer
86 views

Outputting a custom CSV header in the reducer of MapReduce

I am creating my own reducer as follows: public class MyReducer implements Reducer<K1,V1,K2,V2>{ @Override public void configure(JobConf conf){ } @Override public void close(JobConf ...
3
votes
0 answers
200 views

elastic map reduce timing out java.io.IOException: Unexpected end of stream

I am running a MapReduce job on the Elastic MapReduce (EMR) service. The job works fine for a small data set but gives the following exceptions for a large data set (file size 400 MB). Running another job with the same ...
0
votes
2 answers
330 views

cannot ssh into Elastic MapReduce

I'm using elastic-mapreduce to spin up new clusters from the command line. After reading this tutorial, I have: elastic-mapreduce --create --alive \ --instance-type m1.xlarge\ --num-instances 5 \ ...
3
votes
1 answer
426 views

Error: java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is not class Myclass

I have my mapper and reducers as follows, but I am getting a strange exception. I can't figure out why it is throwing this exception. public static class MyMapper implements ...
0
votes
0 answers
26 views

How to set a custom file name for the leaf files in an Elastic MapReduce job?

I am running Elastic MapReduce jobs. The output files generated by the reducer have names like part-0000. I would rather have names like "mykey-001". Is this possible with EMR?
1
vote
0 answers
93 views

Error: user not authorized to perform: iam:GetInstanceProfile

When trying to create an "Interactive Cluster" using: ruby elastic-mapreduce --create --alive --name "Interactive Cluster" --num-instances=1 --master-instance-type=m1.large --hive-interactive I get ...
0
votes
0 answers
41 views

output format in AWS EMR

I'm running a mapreduce program in AWS EMR, which is similar to the word count example of AWS. The output of this is not well formatted, meaning it is not one item per line, nor is there proper ...
1
vote
2 answers
409 views

Combine output files of MapReduce job

I have written a Mapper and Reducer in Python and have executed them successfully on Amazon's Elastic MapReduce (EMR) using Hadoop Streaming. The final result folder contains the output in three ...
3
votes
1 answer
114 views

MapR client not executing hadoop - Windows

I have an Amazon Windows VM where I installed MapR-Client 2.1.2, and another MapR cluster waiting for the jobs to be executed. I set up MAPR_HOME in C:\opt\mapr, and when I execute hadoop fs -ls / ...
0
votes
2 answers
253 views

Mapper and Reducer in Hadoop

I am confused about the implementation of Hadoop. I notice that when I run my Hadoop MapReduce job with multiple mappers and reducers, I get many part-xxxxx files. Meanwhile, it is true ...