Tagged Questions
Hadoop is an open-source Apache project that provides software for reliable and scalable distributed computing. The project also includes a variety of complementary subprojects.
9
votes
0 answers
869 views
Distributed local clustering coefficient algorithm (MapReduce/Hadoop)
I have implemented a local clustering coefficient algorithm based on the MapReduce paradigm. However, I have run into serious trouble with larger datasets or with specific datasets (high average node degree). I ...
7
votes
0 answers
147 views
Hadoop streaming jobs SUCCEEDED but killed by ApplicationMaster
I just finished setting up a small Hadoop cluster (using 3 Ubuntu machines and Apache Hadoop 2.2.0) and am now trying to run Python streaming jobs.
Running a test job, I encounter the following ...
6
votes
0 answers
171 views
HiveServer2 cannot fetch the result of a query over a remote connection
I am facing a problem while trying to fetch data from a remote Hadoop cluster using HiveServer2.
The JDBC connection is working in the sense that metadata queries such as SHOW TABLES work ...
5
votes
0 answers
192 views
Cascalog Hadoop version support
I notice that the Cascalog getting-started guide specifies a version of Hadoop:
:profiles { :dev {:dependencies [[org.apache.hadoop/hadoop-core "1.0.3"]]}}
If my group uses a different version of ...
5
votes
0 answers
618 views
Write data that can be read by ProtobufPigLoader from Elephant Bird
For a project of mine, I want to analyse around 2 TB of Protobuf objects. I want to consume these objects in a Pig script via the Elephant Bird library. However, it is not totally clear to me how to ...
4
votes
0 answers
196 views
Garbage Collection duration in Hadoop CDH5
We have a four-datanode cluster running CDH 5.0.2, installed through Cloudera Manager parcels.
In order to import 13M users' rows into HBase, we wrote a simple Python script and used hadoop-streaming ...
4
votes
0 answers
146 views
Programmatically determine Field names of Scalding/Cascading Pipe
I'm using Scalding to process records with many (> 22) fields. At the end of the process, I'd like to write out the final Pipe's field names to a file. I know this is possible as Mapper and Reducer ...
4
votes
0 answers
424 views
How to bundle many files in S3 using Spark
I have 20 million files in S3 spanning roughly 8000 days.
The files are organized by timestamps in UTC, like this: s3://mybucket/path/txt/YYYY/MM/DD/filename.txt.gz. Each file is UTF-8 text ...
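One approach that fits this layout is to let Spark glob the whole tree and rewrite it as fewer, larger files. A minimal Java sketch, reusing the bucket path from the question; the partition count, output path, and path scheme are assumptions:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class BundleS3Files {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("BundleS3Files"));
        // Glob the YYYY/MM/DD layout; textFile handles .gz files transparently.
        JavaRDD<String> lines = sc.textFile("s3://mybucket/path/txt/*/*/*/*.txt.gz");
        // Collapse the many small inputs into a manageable number of output files;
        // 500 is an arbitrary target, not a recommendation.
        lines.coalesce(500).saveAsTextFile("s3://mybucket/path/bundled/");
        sc.stop();
    }
}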
4
votes
0 answers
233 views
Incremental MapReduce implementations (other than CouchDB, preferably)
I work on a project that sits on a large-ish pile of raw data, aggregates from which are used to power a public-facing informational site (some simple aggregates like various totals and top-tens of ...
4
votes
0 answers
3k views
Hadoop Streaming: Chaining Jobs
This is documentation on how to chain two or more streaming jobs using Hadoop
Streaming (currently 1.0.3) only and nothing more.
In order to understand the final code that will do the chaining and ...
4
votes
0 answers
650 views
Error running Hadoop Pipes program: “Server failed to authenticate”
While trying to run a C++ program following this (link) on my Hadoop cluster, I got the error mentioned below.
I consulted related posts (this) regarding the error and tried tweaking my Makefile, ...
3
votes
0 answers
57 views
What is an efficient way of running a logistic regression for large data sets (200 million by 2 variables)?
I am currently trying to run a logistic regression model. My data has two variables, one response variable and one predictor variable. The catch is that I have 200 million observations. I am trying to ...
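One option, if the stack is flexible, is an online stochastic-gradient-descent logistic regression such as Mahout's, which streams observations one at a time instead of materializing a 200-million-row design matrix. A hedged sketch; the feature layout (intercept plus one predictor) and the toy rows are illustrative only:

import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;

public class StreamingLogit {
    public static void main(String[] args) {
        // 2 categories, 2 features (index 0 = intercept, index 1 = predictor).
        OnlineLogisticRegression olr =
            new OnlineLogisticRegression(2, 2, new L1());
        // In practice you would stream the 200M rows from HDFS;
        // these two {label, predictor} rows are placeholders.
        double[][] rows = {{1.0, 0.3}, {0.0, -1.2}};
        for (double[] row : rows) {
            Vector x = new DenseVector(new double[]{1.0, row[1]});
            olr.train((int) row[0], x);
        }
        // For the binary case, classifyScalar returns P(y = 1 | x).
        System.out.println("P(y=1 | x=0.5) = "
            + olr.classifyScalar(new DenseVector(new double[]{1.0, 0.5})));
    }
}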
3
votes
0 answers
28 views
How to set the number of reducers dynamically based on my mapper output size?
I know that the number of mappers can be set based on my DFS split size by setting mapred.min.split.size to dfs.block.size.
Similarly, how can I set the number of reducers based on my mapper output ...
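The exact map output size is not known before the job runs, so one common workaround is to size the reducer count off the job's input in the driver instead. A sketch; the bytes-per-reducer target is an arbitrary assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class ReducerSizing {
    // Target bytes per reducer; illustrative, not a Hadoop constant.
    private static final long BYTES_PER_REDUCER = 1024L * 1024 * 1024;

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        // Total bytes under the input path, as a proxy for map output size.
        long inputBytes = FileSystem.get(conf)
                .getContentSummary(input).getLength();
        int reducers = (int) Math.max(1, inputBytes / BYTES_PER_REDUCER);
        Job job = Job.getInstance(conf, "sized-job");
        job.setNumReduceTasks(reducers);
        // ... set mapper/reducer classes and input/output paths, then submit.
    }
}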
3
votes
0 answers
78 views
Orchestration of Apache Spark using Apache Oozie
We are considering integrating Apache Spark into our calculation process, where we at first wanted to use Apache Oozie and standard MR or map-only (MO) jobs.
After some research, several ...
3
votes
0 answers
413 views
ssh closes connection immediately after login
I was trying to set up Hadoop in pseudo-distributed mode on Fedora 20. I generated the required public keys and copied them to authorized_keys. Now ssh localhost logs in without a password, but it ...
3
votes
0 answers
77 views
How can I use Pig scripts to generate a nested Avro field?
I am new to Pig. My input data is in the following format:
Record 1:
{
label: int,
id: long
},
Record 2:
{
...
}
...
And what I want as output is:
Record 1:
{
data: {
label: int,
id: long
...
3
votes
0 answers
295 views
Hadoop Hive: How can I allow a regular user to continuously write data and create tables in the warehouse directory?
I am running Hadoop 2.2.0.2.0.6.0-101 on a single node.
I am trying to run a Java MRD program that writes data to an existing Hive table from Eclipse as a regular user. I get the exception:
...
3
votes
0 answers
150 views
Create a custom InputFormat from ColumnFamilyInputFormat for Cassandra
I am working on a project using Cassandra 1.2 and Hadoop 1.2.
I have created my normal Cassandra mapper and reducer, but I want to create my own InputFormat class, which will read the records from ...
3
votes
0 answers
187 views
Elastic MapReduce timing out: java.io.IOException: Unexpected end of stream
I am running a MapReduce job on the Elastic MapReduce (EMR) service. The job works fine for a small dataset but gives the following exceptions for a large dataset (file size 400 MB).
Running another job with the same ...
3
votes
0 answers
670 views
Logistic Regression/SVM implementation in Mahout
I am currently working on sentiment analysis of Twitter data for a telecom company. I am loading the data into HDFS and using Mahout's Naive Bayes classifier to predict the sentiments ...
3
votes
0 answers
316 views
Exact steps to kill Hadoop 2.2.0 Configuration deprecation info messages
This question is similar to Hadoop 2.2.0 Configuration deprecation, but the answers to that question did not resolve the issue, so I am asking for specific steps in this question, and providing a ...
3
votes
0 answers
861 views
R-rmr2 PipeMapRed.waitOutputThreads(): subprocess failed with code 2
I am running an rmr2 example from here; this is the code I tried:
Sys.setenv(HADOOP_HOME="/home/istvan/hadoop")
Sys.setenv(HADOOP_CMD="/home/istvan/hadoop/bin/hadoop")
library(rmr2)
library(rhdfs)
...
3
votes
0 answers
84 views
How to add aspects to Hadoop 2.2
I am on Linux, and I don't see a jar file for AspectJ, so I am curious how to add aspects to YARN. Ideally I would like to just use the Fault Injection Framework ...
3
votes
0 answers
230 views
HDInsight new Hive connection not working
I'm using HDInsight Hadoop locally, and after successfully running MapReduce jobs on HDFS I am trying Hive. Unfortunately, I am getting errors when running the Hive query to create a ...
3
votes
0 answers
64 views
How to ensure I do not run into LeaseExpiredException
Right after my job finishes running, I have a program that uploads files to S3 in chunks. I have to do some processing, which is why I didn't write directly to S3. I used ...
3
votes
0 answers
269 views
“Starting flush of map output” takes a very long time in a Hadoop map task
I execute a map task on a small file (3-4 MB), but the map output is relatively large (150 MB). After showing Map 100%, it takes a long time to finish the spill. Please suggest how I can reduce this period. ...
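Long spills of this kind are commonly tuned through the map-side sort buffer, so that 150 MB of map output spills fewer times. A sketch of setting the relevant knobs in the driver; the values are illustrative, not recommendations:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpillTuning {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Enlarge the in-memory sort buffer (old-style key; on Hadoop 2.x
        // the same knob is named mapreduce.task.io.sort.mb).
        conf.setInt("io.sort.mb", 200);
        // Spill later, when the buffer is fuller.
        conf.setFloat("io.sort.spill.percent", 0.9f);
        Job job = Job.getInstance(conf, "spill-tuned-job");
        // ... configure mapper/reducer and paths, then submit as usual.
    }
}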
3
votes
0 answers
325 views
Scaling up Cassandra and Mahout with Hadoop
Is it possible to configure Mahout to retrieve input data from a Cassandra cluster while executing a Recommender Job over Hadoop?
I have found some resources on this topic - see ...
3
votes
0 answers
207 views
Child Error due to javax.security.auth.login.LoginException
I have a 20-node Hadoop cluster where each node has 8 GB of memory and an 8-core processor. I sometimes get the following error at random when I have a long-running job with 300-600 reducers:
...
3
votes
0 answers
125 views
Job-wide custom cleanup after all the map tasks are completed
While running a map-reduce job that has only a mapper, I have a counter that counts the number of failed documents. After all the mappers are done, I want the job to fail if the total number of ...
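A common pattern for this is to let the framework aggregate the custom counter across all map tasks and enforce the threshold in the driver once waitForCompletion returns. A sketch; the counter enum and the threshold are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class FailOnCounter {
    // Incremented in the mapper via context.getCounter(Docs.FAILED).increment(1).
    public enum Docs { FAILED }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only-job");
        // ... configure the mapper, zero reducers, and input/output paths ...
        boolean ok = job.waitForCompletion(true);
        // Counters are aggregated across all tasks once the job finishes.
        long failed = job.getCounters().findCounter(Docs.FAILED).getValue();
        if (!ok || failed > 100) {   // 100 is an assumed threshold
            System.exit(1);          // surface the failure to the caller
        }
    }
}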
3
votes
0 answers
505 views
Exception in Using Hadoop for MapReduce
I am facing an exception using Hadoop on my local box.
Exception in thread "main" java.lang.NoSuchMethodError: ...
3
votes
0 answers
967 views
s3distcp: cannot create path from empty string
While running s3distcp from S3 to HDFS:
sudo -u hdfs hadoop jar /usr/lib/hadoop/lib/s3distcp.jar --src ...
3
votes
0 answers
197 views
Custom InputFormat with Hadoop C++ Pipes
I'd like to use Hadoop C++ Pipes to create my map/reduce code. The input data is binary, and I want to customize the InputFormat to control the getSplits logic, but I am unsure whether that's possible ...
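With Pipes, split logic stays on the Java side: a custom InputFormat written against the old mapred API (which Pipes uses) can be plugged in, for example via the pipes -inputformat option. A skeleton that forces one split per binary file, with the record reader left abstract:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public abstract class BinaryInputFormat
        extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        // One split per file: the simplest way to control split boundaries
        // for binary data that cannot be cut at arbitrary offsets.
        return false;
    }

    @Override
    public abstract RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException;
}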
3
votes
0 answers
180 views
'Stream Closed' error when using s3distcp to copy files from HDFS to Amazon S3
I am using s3distcp to copy files from HDFS to Amazon S3. Recently, I started getting the 'Stream Closed' error for reducer tasks. I noticed that the error only happened when there were multiple ...
3
votes
0 answers
310 views
ClassCastException while using Avro and MRUnit mapDriver
I am using MRUnit 0.9.0, Avro 1.7.0 and Hadoop 0.20.205.0.
I have configured the mapDriver as follows:
@Before
public void setup()
{
AvroWordCount.Map mapper = new AvroWordCount.Map();
...
3
votes
0 answers
664 views
hadoop CompositeInputFormat not joining all data
I'm currently working with Hadoop 0.20.2 and the old API. What I want to do is a map-side join. I have a graph dataset which consists of two files, one with edges and the other with nodes. The edges are in ...
3
votes
0 answers
805 views
How to use Indexing in Hive?
I have written a custom index handler and wanted to test it. However, Hive is not using it. So I checked with the simple table (pokes (int foo, string bar)) that comes with the Hive distribution for testing ...
3
votes
0 answers
548 views
Whirr: Cannot connect to Hadoop cluster on EC2 after launch-cluster
I am new to Whirr and I'm trying to set up a Hadoop cluster on EC2 with Whirr. I have followed the Cloudera tutorial at https://ccp.cloudera.com/display/CDHDOC/Whirr+Installation
Before installing Whirr, ...
3
votes
0 answers
211 views
Deploying custom MBeans to Hadoop
I'm starting development of a Hadoop application and I'd like to manage it via a couple of MBeans. I've experimented with using MBeanUtils.register and MBeanServer's register method in jar files I'm ...
3
votes
0 answers
438 views
How to import the package org.apache.hadoop.mapreduce.lib.chain in a Hadoop 0.20.2 project?
I'm trying to chain map and reduce phases in one job. The problem is that I'm running under Hadoop 0.20.2, and the package org.apache.hadoop.mapred.lib.Chain seems to be deprecated and replaced by ...
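On 0.20.2 the chain classes are still available in the old API under org.apache.hadoop.mapred.lib. A sketch using the bundled IdentityMapper/IdentityReducer as stand-ins for real chain stages:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ChainJob {
    public static void main(String[] args) {
        JobConf job = new JobConf(ChainJob.class);
        // First map stage: (LongWritable, Text) -> (LongWritable, Text).
        ChainMapper.addMapper(job, IdentityMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        // Second map stage consumes the first stage's output types.
        ChainMapper.addMapper(job, IdentityMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        // A single reducer closes the chain.
        ChainReducer.setReducer(job, IdentityReducer.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class,
                true, new JobConf(false));
        // ... set input/output paths, then run with JobClient.runJob(job).
    }
}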
3
votes
0 answers
1k views
Hadoop MapReduce - Pig/Cassandra - Unable to create input splits
I'm trying to run a MapReduce job with Pig and Cassandra, and I always get the error:
ERROR 2118: Unable to create input splits for: cassandra://constellation/logs
[SOLVED]
There were some environment ...
2
votes
0 answers
20 views
Decompressing LZ4 compressed data in Spark
I have LZ4-compressed data in HDFS and I'm trying to decompress it in Apache Spark into an RDD. As far as I can tell, the only method in JavaSparkContext to read data from HDFS is textFile, which only ...
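Since textFile resolves compression codecs by file extension through the Hadoop configuration, registering the LZ4 codec may be enough, assuming the files carry a .lz4 extension and were written with Hadoop's Lz4Codec framing rather than raw block LZ4. A sketch with a hypothetical input path:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class ReadLz4 {
    public static void main(String[] args) {
        JavaSparkContext sc =
            new JavaSparkContext(new SparkConf().setAppName("ReadLz4"));
        // Register the LZ4 codec; in practice, append it to the cluster's
        // existing io.compression.codecs list rather than replacing it.
        sc.hadoopConfiguration().set("io.compression.codecs",
                "org.apache.hadoop.io.compress.Lz4Codec");
        // textFile now decompresses *.lz4 files transparently by extension.
        JavaRDD<String> lines = sc.textFile("hdfs:///data/events/*.lz4");
        System.out.println("lines: " + lines.count());
        sc.stop();
    }
}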
2
votes
0 answers
33 views
Spring XD dynamic deployment manifest
I have been reading the Spring XD documentation fairly heavily and can't really get to grips with two things I'd like to achieve in relation to Hadoop YARN.
Maybe they aren't supported yet or won't ...
2
votes
0 answers
64 views
Loading data into HDFS in parallel
I have a Hadoop cluster consisting of 3 nodes. I want to load a 180 GB file into HDFS as fast as possible. I know that neither -put nor -copyFromLocal is going to help me here, as they are single ...
2
votes
0 answers
71 views
Manual fix of HBase table overlap (multiple regions have the same start key)
I was inserting data into HBase through the Java client, but suddenly the region server crashed. So I restarted HBase, after which the HMaster was not running. When I run the ...
2
votes
0 answers
62 views
Reading SequenceFile written by Spark
I have a bunch of sequence files that I want to read using Scalding, and I am having some trouble. This is my code:
class ReadSequenceFileApp(args:Args) extends ConfiguredJob(args) {
...
2
votes
0 answers
35 views
Why is my test Hadoop code that connects to libhdfs throwing a segmentation fault?
I'm using libhdfs to connect and write to an HDFS system. The program works fine; however, when I attach GDB to it, it segfaults in hdfsConnect, but the connection goes through and I'm able to write ...
2
votes
0 answers
41 views
How to execute the aggregatewordcount example in Hadoop, which uses the Hadoop aggregate framework?
I tried executing the aggregatewordcount example found in the Hadoop examples jar file. Even though the program ran successfully, the output was not what I expected. The output file just has a single line ...
2
votes
0 answers
67 views
Getting NameNode's fsimage size using Java
I'm trying to get metadata about a NameNode from a running Hadoop cluster using Java. Specifically, I would like to get the size of fsimage, the last checkpoint time, and number and size of the edit ...
2
votes
0 answers
55 views
NoSuchMethodError using Guava 15 on Hadoop (2.3.0)
I have a compiled jar for Hadoop including this library:
com.google.guava:guava:jar:15.0:compile
When I submit it to my Hadoop CDH 5.0.1 cluster, I get this error:
java.lang.NoSuchMethodError: ...
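One common mitigation is to ask the framework to put the user's jars first on the task classpath; whether that suffices depends on the specific conflict, and shading Guava into the job jar is the heavier but more reliable alternative. A sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UserClasspathFirst {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Prefer the job's bundled Guava 15 over the cluster's older copy.
        conf.setBoolean("mapreduce.job.user.classpath.first", true);
        Job job = Job.getInstance(conf, "guava-15-job");
        // ... configure and submit as usual.
    }
}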
2
votes
0 answers
37 views
File processing using AWS EMR
I need an architectural suggestion for this problem I'm working on. I have log files coming in every 15 minutes in a gzipped folder. Each of these has about 100,000 further files to process. I have a ...