Tagged Questions
0
votes
1 answer
16 views
Hadoop mapreduce python command line arguments
In my Python mapper code, I need to access the 'path' given in -input 'path'. How can I access this from Python code?
0
votes
0 answers
77 views
Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
Hey, I'm fairly new to the world of Big Data.
I came across this tutorial on
http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/
It describes in detail how to run ...
0
votes
1 answer
31 views
SequenceFile format with Hadoop Streaming Python
Does Hadoop officially support streaming with binary formats as of 0.21?
The hadoop-streaming.jar accepts an inputFormat that is a Java class name. How do you provide the Hadoop streaming job this ...
0
votes
1 answer
51 views
Hadoop MapReduce: Is it possible to write mapper output to separate output files (not intermediate ones) without setting number of reducers to zero?
I need to anonymize GBs of data consisting of thousands of files. Doing this normally takes forever; hence, I plan to use an already installed pseudo-distributed Hadoop cluster on our server.
...
1
vote
2 answers
32 views
Share specific data between each mapper
I would like to add a specific subset of records to be merged with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrjob?
0
votes
1 answer
54 views
Sorting using Map-Reduce - Possible approach
I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset by a 59th variable, which is calculated from the other 58 variables. The variable happens to be a ...
0
votes
0 answers
25 views
Hadoop streaming - how to inner join two different files using Python
I want to find the top website page visits for users in the 18 to 25 age group.
I have two files: one contains username and age, and the other contains username and website name.
Example:
...
0
votes
1 answer
44 views
Hadoop streaming - how to inner join two different files using Python
I want to find the top website page visits for users in the 18 to 25 age group.
I have two files: one contains username and age, and the other contains username and website name. Examples:
users.txt
...
3
votes
2 answers
70 views
Analysis of a realtime geodata stream
I have the following position stream, which I can access via a web interface:
http://positionstub/interface/
It delivers the current position (latitude/longitude) of a vehicle. It ...
0
votes
1 answer
35 views
HIVE flush errors of multi-stage jobs to stderr in Python
I was wondering whether it is possible to flush messages from the Hive CLI to stderr as they occur. Currently I am trying to execute a multi-stage query (just a sample, not the actual one):
SELECT ...
0
votes
0 answers
20 views
hadoop streaming job failed python
I'm trying to run a simple word count with Hadoop MapReduce in Python. I'm getting the following exception:
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code ...
0
votes
3 answers
89 views
How to pass parameters to Python streaming script in Hive?
A Hive user can stream a table through a script to transform the data:
ADD FILE replace-nan-with-zeros.py;
SELECT
TRANSFORM (...)
USING 'python replace-nan-with-zeros.py'
AS (...)
FROM some_table;
...
0
votes
2 answers
40 views
Processing syslog output to csv with python
I need help taking log events from my SIEM and processing them into a CSV file that can be ingested into Hadoop for further processing. Below is a sample from the SIEM and the desired result. I'm ...
0
votes
1 answer
115 views
Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe
I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas DataFrame in order to apply standard ...
0
votes
2 answers
84 views
How do I write the output of an EMR streaming job to HDFS?
I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...