Tagged Questions
0
votes
0 answers
13 views
Invalid syntax hadoop streaming error
I am trying to run a Hadoop streaming Python job:
/home/hduser/hadoop/bin/hadoop jar /home/hduser/hadoop/share/hadoop/tools/lib/hadoop-*streaming*.jar -file audio.py -cacheFile ...
0
votes
0 answers
20 views
Error launching job using mrjob on Hadoop
I am new to hadoop and mrjob and this book really helped me a lot to learn. I was trying to run mrSVM.py on hadoop as it works fine locally.
But I ran the following command: python mrSVM.py -r hadoop ...
0
votes
0 answers
12 views
How to mention a Combiner in Oozie while using streaming jar
I have a streaming job that I am calling through Oozie. I am able to run this successfully with a mapper and reducer. What I fail to understand is how to pass the combiner. All my mapper, ...
0
votes
0answers
16 views
Cannot put a file in HDFS using hadoopy
I have installed hadoopy based on this tutorial:
http://www.hadoopy.com/en/latest/tutorial.html#putting-data-on-hdfs
But when I try to run a simple example, for instance (example.py):
import hadoopy
...
1
vote
1 answer
25 views
How to run a C++ executable from hadoop python wrapper
I am new to the Hadoop streaming library with Python, so the question may look stupid, but I am badly stuck here. Any help is appreciated.
I am trying to run a C++ executable (which takes a local ...
0
votes
0 answers
34 views
Where did the Luigi task go?
First time into the realm of Luigi (and Python!) and have some questions. Relevant code is:
from Database import Database
import luigi
class bbSanityCheck(luigi.Task):
    conn = luigi.Parameter()
    ...
0
votes
1 answer
50 views
Why won't python read stdin input as a dictionary?
I'm sure I'm doing something dumb here, but here goes. I'm working on a class assignment for my Udacity class "Intro to Map Reduce and Hadoop". Our assignment is to make a mapper/reducer that will ...
0
votes
1 answer
20 views
python script for avro conversion using Hadoop Streaming
I have a 10 GB input file which I am trying to convert to Avro using Python Hadoop streaming. The job is successful, but I cannot read the output using the Avro reader.
It is giving 'utf8' codec ...
0
votes
0 answers
17 views
Hive client for Python 3.x
Is it possible to connect to Hadoop and run Hive queries using Python 3.x? I am using Python 3.4.1.
I found out that it can be done as written here:
...
0
votes
0 answers
8 views
Hadoop streaming - wrapper executing binary application issues
I'm new to Hadoop and am attempting to use Hadoop streaming to parallelize a physics simulation that is compiled into a binary. The idea would be to run the binary in parallel using maps with one ...
1
vote
1 answer
56 views
With Spark, how to connect to the master or solve the error: “WARN TaskSchedulerImpl: Initial job has not accepted any resources”
Please tell me how to solve the following problem.
First, I confirmed that the following code runs when the master is "local".
Then I started two EC2 instances (m1.large).
However, when the master is ...
1
vote
1 answer
12 views
Parsing json string generated from org.apache.avro.mapred.AvroAsTextInputFormat using python streaming
In Hadoop streaming with Python, I am reading an Avro data file using the input format below; its documentation says the input key is a string representation in JSON.
-inputformat ...
0
votes
0 answers
16 views
Hive data search & exploration tool
I have several Hive tables. I would like to create a web interface where users could search and explore a small sample dataset and also the schema of the tables. One option could be by exporting a ...
3
votes
0 answers
90 views
What is an efficient way of running a logistic regression for large data sets (200 million by 2 variables)?
I currently am trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I am trying to ...
1
vote
1 answer
25 views
mongodb_hadoop streaming with python: -inputURI not recognized
I'm trying to create a MapReduce application in Python using the mongodb_hadoop connector.
I have a cluster with hadoop 2.2.0 installed.
I've installed the mongodb_hadoop connector v1.3.0.
I've ...
0
votes
1 answer
21 views
How to call filebrowser in HUE
I'll start by saying that I'm very new to HUE and Python and have no prior experience with either.
What I have to do now is make my own HUE application to upload files to HDFS, start an oozie work ...
-1
votes
0 answers
10 views
Pydoop IOError: Cannot connect to localhost
I installed Hadoop 2.2.0 and Pydoop on Fedora 20.
On executing the command
hadoop fs -ls hdfs://localhost:8020/
output:
drwxr-xr-x - root supergroup 0 2014-07-09 17:03 ...
0
votes
0 answers
23 views
Python to Pig. Loading Binary Delimited Text
I'm a little new to Pig/Hadoop. I'm trying to load server logs stored as gzip; however, the logs are stored in binary-delimited form. In Python, I would translate the file as below. Anybody know ...
0
votes
1 answer
18 views
Amazon EMR job with many json files as input
I am writing a hadoop streaming application in python to run on EMR. The input for the EMR job is a directory of files in an S3 bucket, each of which is a json file containing a single json object. I ...
0
votes
1 answer
44 views
Is MapReduce useful for processing big files, crawling a lot of pages for data and inserting them in HBase?
I have some Python scripts that I run every day. These scripts do the following:
parse 1000 text files (gzipped):
~100 GB
30 million rows
Crawl some data from many websites:
40 ...
0
votes
1 answer
26 views
Efficient way to intersect multiple large files containing geodata
Okay, deep breath, this may be a bit verbose, but better to err on the side of detail than lack thereof...
So, in one sentence, my goal is to find the intersection of about 22 files of ~300-400 MB each based ...
0
votes
0 answers
13 views
Issue with using files in distributed cache in Elastic MapReduce
I'm trying to make use of an external library in my Python mapper script in an AWS Elastic MapReduce job.
However, my script doesn't seem to be able to find the modules in the cache. I archived the ...
1
vote
1 answer
40 views
Unable to run MapReduce using Python in Hadoop?
I have written a mapper and reducer in Python for a word count program, and they work fine.
Here is a sample:
echo "hello hello world here hello here world here hello" | wordmapper.py | sort -k1,1 | ...
0
votes
1 answer
30 views
Killing a program with except: pass
Is there any way to kill a program that ignores all exceptions? Stupid, I know. I was testing something (since I wasn't sure what error a failed, embedded pig script would throw), forgot to limit the ...
1
vote
0 answers
19 views
Hadoop streaming accessing files in a directory
I wish to access a directory in Hadoop (via Python streaming) and loop through its image files, calculating hashes of each in my mapper.
Does the following logic make sense (and instead of hard ...
0
votes
0 answers
21 views
MapReduce Task fails Python
I seem to be getting the following error:
14/07/02 23:29:14 INFO mapreduce.Job: Task Id : attempt_1395688818137_1239_r_000001_2, Status : FAILED
Error: java.lang.RuntimeException: ...
0
votes
3 answers
61 views
Why is my Hadoop output split into many part files?
I tried to count word frequencies and write the result to a file:
mapper.py:
#!/usr/bin/env python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and ...
2
votes
1 answer
23 views
Error while executing Python MapReduce tasks in Hadoop?
I have written a mapper and reducer for the wordcount example in Python. The scripts work fine as standalone ones, but I get an error when they run in Hadoop.
I am using Hadoop 2.2.
Here is my command:
...
0
votes
0 answers
66 views
Installing pyspark on hadoop and yarn
I have installed Spark on top of Hadoop and YARN.
When I launch the pyspark shell and try to compute something, I get this error.
Error from python worker:
/usr/bin/python: No module named pyspark
...
0
votes
0 answers
33 views
Shuffle and Sorting in Hadoop
I have been reading about Hadoop and have implemented sample MR programs in Hadoop using Python.
I am confused about shuffle and sorting in Hadoop.
My mapper code emits key value pairs
Example ...
0
votes
0 answers
62 views
Hive UDF with Python - Runtime error
I wrote a Hive query that calls a UDF written in Python, but I get this error:
java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row ...
0
votes
0 answers
41 views
Issues with Python packages on Hadoop distributed system nodes
I use Python to do Hadoop streaming. We use an AWS Hadoop streaming distributed system which has a master node and four slave nodes. If I need to install a Python package, I need to install the ...
0
votes
1 answer
32 views
Hadoop EMR using Python
I'm using Hadoop streaming to use my mapper and reducer code in python to run a Mapreduce job. I have input data in s3, and I'm trying to use that for the job. However, when I run the command like ...
0
votes
1 answer
84 views
Why do these seemingly correct Hadoop streaming Python scripts not work?
I have a set of Hadoop streaming jobs, like below:
bash file:
hadoop fs -rmr /tmp/someone/sentiment/
hadoop jar ...
0
votes
1 answer
201 views
Hive UDF with Python
I'm new to python, pandas, and hive and would definitely appreciate some tips.
I have the python code below, which I would like to turn into a UDF in hive. Only instead of taking a csv as the input, ...
0
votes
0 answers
25 views
how to work with dumbo
I have written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python ...
0
votes
1 answer
42 views
Iterative kmeans based on mapreduce and hadoop
I have written a simple k-means clustering code for Hadoop (two separate programs - mapper and reducer). The code is working over a small dataset of 2d points on my local box. It's written in Python ...
0
votes
1 answer
256 views
Exporting a Scikit Learn Random Forest for use on Hadoop Platform
I've developed a spam classifier using pandas and scikit learn to the point where it's ready for integration into our hadoop-based system. To this end, I need to export my classifier to a more common ...
-1
votes
1 answer
53 views
remove empty line printed from hive query output using python
I am performing a Hive query and storing the output in a TSV file in the local FS. I am running a for loop for the Hive query and passing different parameters. If the Hive query returns no output once ...
0
votes
1 answer
77 views
Preserving column data types in Hadoop UDF output (Streaming)
I'm writing a UDF in Python for a Hive query on Hadoop. My table has several bigint fields, and several string fields.
My UDF modifies the bigint fields, subtracts the modified versions into a new ...
0
votes
0 answers
30 views
Running external Python libs (like NLTK) with Hadoop streaming
I tried using http://blog.cloudera.com/blog/2008/11/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/
zip -r nltkandyaml.zip nltk yaml
mv ntlkandyaml.zip ...
2
votes
1 answer
63 views
Hadoop streaming failed with java.io.FileNotFoundException
I have written a map-only Python MapReduce job which accepts data from standard input and processes it to produce some output. It works fine when executed locally. However, when I am trying to execute ...
-3
votes
2 answers
45 views
How to get output from part-r-0000 in Apache Pig
I am parsing a pcap file using Pig. I am getting output in the part-r-0000 file.
It is showing me the following output.
1101
1646
503
679
556
480
80
471
How to get actual output from that file? What is the ...
1
vote
0 answers
57 views
K-means based on MapReduce in Python
I am going to write a mapper and reducer for the k-means algorithm. I think the best course of action is to put the distance calculation in the mapper and send it to the reducer with the cluster id as ...
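A minimal sketch of the mapper idea described in this excerpt, assuming the cluster id is emitted as the streaming key and each input line holds one whitespace-separated point. The centroid values and all function names here are hypothetical placeholders, not taken from the question:

```python
import sys

# Hypothetical starting centroids; a real job would load these from a
# file shipped alongside the mapper script.
CENTROIDS = {0: (0.0, 0.0), 1: (5.0, 5.0)}


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Return the id of the centroid closest to `point`."""
    return min(centroids, key=lambda cid: squared_distance(point, centroids[cid]))


def map_line(line, centroids):
    """Turn one input line into a 'cluster_id<TAB>point' record."""
    point = tuple(float(value) for value in line.split())
    cluster_id = nearest_centroid(point, centroids)
    return "%s\t%s" % (cluster_id, " ".join(str(value) for value in point))


def run_mapper(lines, centroids):
    """Yield one output record per non-blank input line."""
    for line in lines:
        if line.strip():
            yield map_line(line, centroids)
```

As a streaming mapper, the records would be printed with something like `for record in run_mapper(sys.stdin, CENTROIDS): print(record)`; a reducer could then average the points it receives for each cluster id to produce the updated centroids.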
0
votes
1 answer
19 views
Send output of Hadoop streaming job to STDOUT
For streaming jobs you have to specify an output directory. What if I wanted to output the results of the mapper to stdout instead of an HDFS directory? Is this possible? I want to do this so I can ...
0
votes
2 answers
49 views
Convert list elements into array
I have a TSV file that I am parsing and want to convert into an array.
Here is the file format:
jobname1 queue maphours reducehours
jobname2 queue maphours reducehours
code
with ...
7
votes
0 answers
202 views
Hadoop streaming jobs SUCCEEDED but killed by ApplicationMaster
I just finished setting up a small hadoop cluster (using 3 ubuntu machines and apache hadoop 2.2.0) and am now trying to run python streaming jobs.
Running a test job I encounter the following ...
-1
votes
1 answer
34 views
change a python script to Unix line-ending convention
What is the easiest way to change a python script to Unix line-ending convention?
I am running a python script on Hadoop and seeing the following stderr log:
/usr/bin/env: python
: No such file or ...
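One possible approach, sketched in Python: a stderr message that breaks right after `python`, as above, is typical of a shebang line that ends in a carriage return, so rewriting the script with LF-only line endings usually resolves it. The function names below are hypothetical; a command-line tool such as dos2unix does the same job.

```python
def to_unix_line_endings(data: bytes) -> bytes:
    """Convert Windows (CRLF) and bare CR line endings to Unix (LF)."""
    # Replace CRLF first so a CRLF pair does not become two newlines.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: str) -> None:
    """Rewrite the file at `path` in place with Unix line endings."""
    # Binary mode avoids any newline translation by Python itself.
    with open(path, "rb") as handle:
        original = handle.read()
    converted = to_unix_line_endings(original)
    if converted != original:
        with open(path, "wb") as handle:
            handle.write(converted)
```

Calling `convert_file("mapper.py")` (a placeholder file name) before submitting the job would rewrite the script only if it actually contains carriage returns.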
0
votes
3 answers
54 views
Remove empty lines from hive query output I am saving on local filesystem
I am running a python script on my devbox to remotely ssh on a grid gateway box to launch another python script which runs the hive query and returns the output back and I save it on my devbox in the ...
1
vote
0 answers
27 views
How to determine locality of HDFS file for use in Python?
I have a system that runs Python tasks across a compute cluster using Celery to manage the queue. These tasks operate on data stored in MapR-FS (which exposes the Hadoop HDFS API, so things ...