0
votes
1 answer
16 views

Hadoop MapReduce Python command line arguments

In my Python mapper code, I need to access the 'path' given in -input 'path'. How can I access this from the Python code?
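
One possible approach, sketched under assumptions: Hadoop Streaming exports job configuration properties to the task environment with dots replaced by underscores, so a mapper can read the path of the file backing its current split from `mapreduce_map_input_file` (`map_input_file` on older releases). This yields the file currently being processed, not the literal string passed to -input. The helper name `current_input_path` is hypothetical.

```python
import os


def current_input_path(environ=os.environ):
    """Return the input file path for this map task, or None if unavailable."""
    # Newer releases use mapreduce.map.input.file, older ones map.input.file;
    # Streaming exposes both with "." replaced by "_".
    for name in ("mapreduce_map_input_file", "map_input_file"):
        value = environ.get(name)
        if value:
            return value
    # Not running under Hadoop Streaming (e.g. a local test run).
    return None
```

Inside the mapper this would be called once before the loop over `sys.stdin`; when the script is run locally outside Hadoop it returns None.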
0
votes
0 answers
77 views

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey, I'm fairly new to the world of Big Data. I came across this tutorial: http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run ...
0
votes
1 answer
31 views

SequenceFile format with Hadoop Streaming Python

Does Hadoop officially support streaming with binary formats as of 0.21? The hadoop-streaming.jar accepts an inputFormat that is a Java class name. How do you provide the Hadoop streaming job this ...
0
votes
1 answer
51 views

Hadoop MapReduce: Is it possible to write mapper output to separate output files (not intermediate ones) without setting the number of reducers to zero?

I need to anonymize GBs of data consisting of thousands of files. Doing this normally takes forever; hence, I plan to use an already installed pseudo-distributed Hadoop cluster on our server. ...
1
vote
2 answers
32 views

Share specific data between each mapper

I would like to add a specific subset of records to be merged with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrjob?
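
A minimal sketch of one way this could look with plain Hadoop Streaming, assuming the shared subset is small enough to ship to every task as a side file with the generic `-files` option so that it appears in each mapper's working directory. The function names and the file name `shared.txt` are hypothetical, and mrjob's own file-upload mechanism is not shown.

```python
import sys


def load_shared_records(path):
    """Read the side file shipped to every mapper, one record per line."""
    with open(path) as handle:
        return [line.rstrip("\n") for line in handle if line.strip()]


def merge_with_chunk(shared, chunk_lines):
    """Yield the shared records first, then the records of this mapper's chunk."""
    for record in shared:
        yield record
    for line in chunk_lines:
        yield line.rstrip("\n")


def main(shared_path="shared.txt"):
    # shared_path is an assumed name; it must match the file passed via -files.
    shared = load_shared_records(shared_path)
    for record in merge_with_chunk(shared, sys.stdin):
        print(record)
```

A mapper script would call `main()` at the bottom of the file. Every mapper then sees the same shared records ahead of its own chunk, so any deduplication has to happen downstream.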
0
votes
1 answer
54 views

Sorting using Map-Reduce - Possible approach

I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset using a 59th variable, which is calculated from the other 58 variables. The variable happens to be a ...
0
votes
0 answers
25 views

Hadoop streaming - how to inner join two different files using Python

I want to find the top website page visits for users in the age group between 18 and 25. I have two files: one contains username and age, and the other contains username and website name. Example - ...
0
votes
1 answer
44 views

Hadoop streaming - how to inner join two different files using Python

I want to find the top website page visits for users in the age group between 18 and 25. I have two files: one contains username and age, and the other contains username and website name. Examples: users.txt ...
3
votes
2 answers
70 views

Analysis of a realtime geodata stream

I have the following position stream, which I can access via a web interface: http://positionstub/interface/ It delivers the current position (latitude/longitude) of a vehicle. It ...
0
votes
1 answer
35 views

Hive: flush errors of multi-stage jobs to stderr in Python

I was wondering whether it is possible to flush messages from the Hive CLI to stderr as they occur. Currently I am trying to execute a multi-stage query (just a sample, not the actual query): SELECT ...
0
votes
0 answers
20 views

Hadoop streaming job failed (Python)

I'm trying to run a simple word count with Hadoop MapReduce and Python. I'm getting the following exception: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code ...
0
votes
3 answers
89 views

How to pass parameters to Python streaming script in Hive?

A Hive user can stream a table through a script to transform its data: ADD FILE replace-nan-with-zeros.py; SELECT TRANSFORM (...) USING 'python replace-nan-with-zeros.py' AS (...) FROM some_table; ...
0
votes
2 answers
40 views

Processing syslog output to CSV with Python

I need help with taking log events from my SIEM and processing them into a CSV file that can be ingested into Hadoop for further processing. Below is a sample from the SIEM and the desired result. I'm ...
0
votes
1 answer
115 views

Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe

I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas dataframe in order to apply standard ...
0
votes
2 answers
84 views

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...