Tagged Questions

votes

1answer

16 views

Hadoop mapreduce python command line arguments

In my Python mapper code, I need to access the path passed as -input 'path'. How can I access this from the Python code?

asked 2 days ago

user2401464
112
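
A possible starting point for the question above about reading the `-input` path inside a Python mapper: Hadoop Streaming exports job configuration properties into the task's environment with dots replaced by underscores, so the file behind the current split is usually visible as `mapreduce_map_input_file` (or `map_input_file` on older releases). The sketch below assumes those variable names apply to your Hadoop version; the helper name `current_input_file` is hypothetical.

```python
import os


def current_input_file(environ=None):
    """Return the path of the file backing the current input split, or None.

    Assumption: the job runs under Hadoop Streaming, which exposes job
    configuration properties as environment variables with "." replaced
    by "_". The exact variable name depends on the Hadoop release.
    """
    env = os.environ if environ is None else environ
    for key in ("mapreduce_map_input_file", "map_input_file"):
        value = env.get(key)
        if value:
            return value
    return None
```

Note that this yields the file of the split being processed, not the literal `-input` argument, which may be a directory or a glob.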

votes

0answers

77 views

Running a job using hadoop streaming and mrjob: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

Hey, I'm fairly new to the world of Big Data. I came across this tutorial at http://musicmachinery.com/2011/09/04/how-to-process-a-million-songs-in-20-minutes/ It describes in detail how to run ...

python hadoop mapreduce hadoop-streaming mrjob

asked Jun 11 at 5:50

Kiran Karanth
12

votes

1answer

31 views

SequenceFile format with Hadoop Streaming Python

Does Hadoop officially support streaming with binary formats as of 0.21? The hadoop-streaming.jar accepts an inputFormat that is a Java class name. How do you provide the Hadoop streaming job this ...

python hadoop mapreduce

asked Jun 9 at 17:23

T. Webster
1,8081427

votes

1answer

51 views

Hadoop MapReduce: Is it possible to write mapper output to separate output files (not intermediate ones) without setting the number of reducers to zero?

I need to anonymize GBs of data consisting of thousands of files. Doing this normally takes forever; hence, I plan to use an already installed pseudo-distributed Hadoop cluster on our server. ...

java python apache hadoop mapreduce

asked Jun 8 at 19:27

user1097128
1

vote

2answers

32 views

Share specific data between each mapper

I would like to add a specific subset of records to be merged with each chunk of records at each mapper. How can I do this in Hadoop generally, and in the Python streaming package mrjob?

python hadoop mapreduce hadoop-streaming mrjob

asked Jun 6 at 14:49

Ahmed Elmorsy
80212

votes

1answer

54 views

Sorting using Map-Reduce - Possible approach

I have a large dataset with 500 million rows and 58 variables. I need to sort the dataset by a 59th variable, which is calculated from the other 58 variables. The variable happens to be a ...

python sorting hadoop bigdata hadoop-streaming

asked Jun 4 at 19:19

Arkid Mitra
52621332

votes

0answers

25 views

Hadoop streaming: how to inner join two different files using Python

I want to find the top website page visits for users in the 18 to 25 age group. I have two files: one contains username and age, the other contains username and website name. Example: ...

python hadoop streaming using

asked Jun 4 at 4:33

user2450086
1

votes

1answer

44 views

Hadoop streaming: how to inner join two different files using Python

I want to find the top website page visits for users in the 18 to 25 age group. I have two files: one contains username and age, the other contains username and website name. Examples: users.txt ...

python hadoop hadoop-streaming

asked Jun 4 at 3:58

user2450086
1
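
One common way to approach the inner-join question above is a reduce-side join: the mapper tags each record with its source file, Hadoop groups records by username, and the reducer pairs them up. The sketch below is a plain-Python illustration under stated assumptions, not the asker's code: the comma-separated layouts (`username,age` and `username,website`), the tags `U` and `V`, and the function names `map_line` and `reduce_join` are all hypothetical.

```python
from itertools import groupby


def map_line(line, source):
    """Tag one record with its source.

    source is "U" for the users file (username,age) or "V" for the
    visits file (username,website). Output is "username<TAB>tag<TAB>value".
    """
    name, value = [part.strip() for part in line.split(",", 1)]
    return "%s\t%s\t%s" % (name, source, value)


def reduce_join(sorted_lines, min_age=18, max_age=25):
    """Inner-join tagged records and keep users inside the age range.

    Expects lines sorted by username, as the shuffle phase delivers them.
    Records for one user are buffered, so the order of tags within a
    user does not matter.
    """
    rows = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for name, group in groupby(rows, key=lambda row: row[0]):
        age, sites = None, []
        for _, tag, value in group:
            if tag == "U":
                age = int(value)
            else:
                sites.append(value)
        # Inner join: emit only when both sides are present for this user.
        if age is not None and min_age <= age <= max_age:
            for site in sites:
                yield name, site
```

Counting visits per site to get the top pages would then be a second, ordinary word-count-style pass over the joined output.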

votes

2answers

70 views

Analysis of a realtime geodata stream

I have the following position stream, which I can access via a web interface: http://positionstub/interface/ It delivers the current position (latitude/longitude) of a vehicle. It ...

php c++ python hadoop storm

asked May 31 at 21:11

mcknight
47212

votes

1answer

35 views

HIVE flush errors of multi-stage jobs to stderr in Python

I was wondering whether it is possible to flush messages from the Hive CLI to stderr as they occur. Currently I am trying to execute a multi-stage query (just a sample, not the actual query): SELECT ...

python hadoop subprocess hive

asked May 29 at 22:38

BigOrangeSU
15110

votes

0answers

20 views

Hadoop streaming job failed (Python)

I'm trying to run a simple word count with Hadoop MapReduce in Python. I'm getting the following exception: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code ...

python exception hadoop mapreduce

asked May 28 at 8:24

user2401464
112

votes

3answers

89 views

How to pass parameters to Python streaming script in Hive?

A Hive user can stream a table through a script to transform its data: ADD FILE replace-nan-with-zeros.py; SELECT TRANSFORM (...) USING 'python replace-nan-with-zeros.py' AS (...) FROM some_table; ...

python script hadoop streaming hive

asked May 24 at 3:51

Bohdan Voloshyn
565514
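
For the question above about passing parameters to a Hive `TRANSFORM` script: the string given to `USING` is run as a command line, so one approach is to append arguments to it (for example `USING 'python replace-nan-with-zeros.py 0'`) and read them from `sys.argv`. The sketch below assumes tab-separated rows on stdin and that NaN values arrive as the literal text `NaN`; both are assumptions about the data, and the function names are hypothetical.

```python
import sys


def replace_nan(lines, replacement="0"):
    """Yield each tab-separated row with literal "NaN" fields replaced."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(replacement if field == "NaN" else field
                        for field in fields)


def main(argv):
    """Read rows from stdin and write transformed rows to stdout.

    argv[1], if present, is the replacement value passed on the
    command line inside the USING clause.
    """
    replacement = argv[1] if len(argv) > 1 else "0"
    for row in replace_nan(sys.stdin, replacement):
        print(row)

# In the script file, finish with: main(sys.argv)
```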

votes

2answers

40 views

Processing syslog output to csv with python

I need help taking log events from my SIEM and processing them into a CSV file that can be ingested into Hadoop for further processing. Below is a sample from the SIEM and the desired result. I'm ...

python hadoop syslog

asked May 23 at 22:09

user2313375
11

votes

1answer

115 views

Reading Files in HDFS (Hadoop filesystem) directories into a Pandas dataframe

I am generating some delimited files from Hive queries into multiple HDFS directories. As the next step, I would like to read the files into a single pandas DataFrame in order to apply standard ...

python hadoop pandas hdfs

asked May 16 at 21:47

SetJmp
3,37462556

votes

2answers

84 views

How do I write the output of an EMR streaming job to HDFS?

I see examples of people writing EMR output to HDFS, but I haven't been able to find examples of how it's done. On top of that, this documentation seems to say that the --output parameter for an EMR ...

python hadoop emr mrjob

asked May 8 at 4:27

Abe
1,301730


212 questions tagged hadoop python

mapreduce × 79
hadoop-streaming × 48
mrjob × 16
streaming × 15
java × 14
pig × 12
hdfs × 11
amazon-web-services × 10
hbase × 9
emr × 8
amazon-emr × 7
elastic-map-reduce × 7
thrift × 7
hive × 6
apache × 5
reduce × 4
mongodb × 4
database × 4
subprocess × 4
bigdata × 3
machine-learning × 3
boto × 3
avro × 3
php × 3
r × 3
