Big data is a concept that deals with data sets of extreme volume. Questions tend to relate to infrastructure, algorithms, statistics, and data structures.


5 votes | 0 answers | 135 views

Decision tree implementation issue in Apache Spark with Java

I'm trying to implement a simple demo of a decision tree classifier using Java and Apache Spark 1.0.0, based on http://spark.apache.org/docs/1.0.0/mllib-decision-tree.html. So far I've written ...
3 votes | 0 answers | 62 views

How can Kafka limitations be avoided?

We're trying to build a BI system that will collect very large amounts of data to be processed by other components. We decided that it would be a good idea to have an intermediate layer to ...
3 votes | 0 answers | 78 views

Orchestration of Apache Spark using Apache Oozie

We are considering integrating Apache Spark into our calculation process, where we initially wanted to use Apache Oozie and standard MR or MO (Map-Only) jobs. After some research several ...
2 votes | 0 answers | 44 views

Export large amount of data from Cassandra to CSV

I'm using Cassandra 2.0.9 to store fairly large amounts of data, let's say 100 GB, in one column family. I would like to export this data to CSV quickly. I tried: sstable2json - it produces quite ...
2 votes | 0 answers | 71 views

Manual Fix of HBase table Overlap (Multiple regions have same start key)

I was inserting data into HBase through the Java client when the region server suddenly crashed. I restarted HBase, after which the HMaster was not running. When I run the ...
2 votes | 0 answers | 71 views

bigrquery: Error with Google Big Query R interface

I'm using the bigrquery R package to fetch data, but I'm getting the following error. Does anyone know how to fix it? "Waiting for authentication in browser... Authentication ...
2 votes | 0 answers | 49 views

EMR bootstrap action to run Hue on MapR M3

Is there a bootstrap script to get Hue running on EMR MapR, as an alternative to setting it up with this guide: http://doc.mapr.com/display/MapR/Configuring+Hue
2 votes | 0 answers | 84 views

R instability with high numbers of ggplots

Background: I'm trying to generate a whopping ton of histogram plots (about 100) by using ggplot and this multiplot function. multiplot takes a list of plots as its main argument, so I generate a ...
2 votes | 0 answers | 71 views

Django unit test database is empty

We are working on a project that uses more than 10 tables for data flow, but when we unit test the database table query scheme, it returns an empty set. Is there any way we can run './manage.py ...
2 votes | 0 answers | 207 views

R bigmemory attach.big.matrix is very slow for very wide matrices

I am using the package bigmemory to interact with large matrices in R. This works well for large matrices except that the attach.big.matrix() function to reload a binary file created with ...
2 votes | 0 answers | 89 views

How to load RDF data to bigdata using NanoSparqlServer

I have downloaded bigdata.war and deployed it using the Sesame HTTP API. Now I don't understand how I should load RDF triples/provenance with triples to bigdata using NanoSparqlServer. I am using ...
2 votes | 0 answers | 178 views

Scala or Java analogues of PyTables & numexpr

I am looking for Scala or Java analogues of numexpr and PyTables (particularly tables.Expr). This is for a multicore analytics system on multicore machines that needs to perform matrix operations ...
2 votes | 0 answers | 170 views

Store and visualize super large social graph both in a desktop application and online

Does anyone know the most efficient way to store and visualize a large graph with several million edges? I'm aware of Gephi, but it can't visualize such a big data set (at least on my laptop with ...
1 vote | 0 answers | 11 views

Curator framework for ZooKeeper - InterProcessMutex takes 50ms to acquire lock each time

I am using the Curator framework's InterProcessMutex to create a distributed lock to reserve some resource. However, I can see that ZooKeeper takes 50-100 ms each time to acquire a lock and 20-40 ms for ...
1 vote | 0 answers | 15 views

Kafka Spout consumer and another consumer running inside Spring not able to run simultaneously

My application accepts a URL containing the data it needs to process through a REST service it exposes using Spring. Each time it receives a URL from which to accept data, the application sends the ...
1 vote | 0 answers | 35 views

Why is MapReduce processing of Avro files slower than processing flat files?

Why is MapReduce processing of Avro files slower than processing flat files? I expected that processing Avro files would be a lot faster than processing flat files, but my assumption was wrong. Avro ...
1 vote | 0 answers | 32 views

MySQL: Multiple updates from a single select

I have a case where I need to match a group of fields as unique on the Addresses table, but for that I have to detect duplicates in the database, delete them, and update all associated foreign ...
1 vote | 0 answers | 27 views

hadoop2.4.0 namenode -format showing NoClassDefFoundError error

I have installed and configured Hadoop following the website http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os. When I format the ...
1 vote | 0 answers | 25 views

Finding longest common sequences in big data

I have logs from a bunch (millions) of small experiments. Each log contains a list (tens to hundreds) of entries. Each entry is a timestamp and an event ID (there are several thousands of unique ...
1 vote | 0 answers | 31 views

Why does Storm performance get very slow after a few minutes?

I'm running a throughput topology to test performance. In the first two minutes I get a good average of 450k emitted/sec; after 10 minutes it goes down to an average of 100k/sec. ...
1 vote | 0 answers | 17 views

Issue with running more than one topology on Storm cluster

I am unable to run more than one topology on the same cluster. All topologies are registered fine, and I can see them in the UI, but only the first topology runs. No workers, executors, tasks are ...
1 vote | 0 answers | 99 views

Failing to write offset data to ZooKeeper in kafka-storm

I was setting up a Storm cluster to calculate real-time trending and other statistics; however, I have some problems introducing the "recovery" feature into this project, by allowing the offset that ...
1 vote | 0 answers | 18 views

Slow performance when using Storm local cluster

I'm trying to measure Storm's performance as a pipeline. I ran the following code in local cluster mode: http://kaviddiss.com/2013/05/17/how-to-get-started-with-storm-framework-in-5-minutes/ It ...
1 vote | 0 answers | 41 views

Need suggestions implementing recursive logic in Hive UDF

We have a Hive table that has around 500 million rows. Each row represents a "version" of the data, and I've been tasked with creating a table that contains only the final version of each row. ...
1 vote | 0 answers | 38 views

GoogleCloudPlatform/solutions-automated-file-loader-for-bigquery In PHP

I am not familiar with Java development; I am a PHP developer. I want a solution for an automated file loader for BigQuery in PHP like this one. Currently I am using BigQuery REST ...
1 vote | 0 answers | 29 views

Java equivalent to R iPlots package (alternative to Mondrian)

Someone demonstrated to me the power of the R iPlots package: you plot the same individual data two different ways, and when selecting data on one figure, it selects the matching data on the other figure. E.g.: ...
1 vote | 0 answers | 34 views

Is this Hadoop cluster configuration possible? What are the minimum disk space requirements?

My Hadoop clusters are based on virtual machines. The configuration is as follows: 1 master and 9 slaves. master: disk space: 20GB, memory: 16G, CPU cores: 8. slave1 ~ slave9: disk space: 5GB ...
1 vote | 0 answers | 122 views

Upconversion/ Grouping using Map Reduce

I have 2 documents: a list of offerings and associated zip codes, and US postal code data. The first document is of the form: offer, location (currently only zips): 1, 84121; 1, 84101; 1, 58103; 1, 58102; 2, ...
1 vote | 0 answers | 67 views

r - viterbi RHmm Error protection stack overflow

I was looking for an HMM implementation in R to analyze states in a string of characters. The HMM library seems to run slowly, so I am using the RHmm library. My data is a string of 1953138 symbols ...
1 vote | 0 answers | 32 views

Hadoop data nodes die very often

Our Hadoop cluster has 5 data nodes and 2 name nodes. The traffic is very high and a few nodes go down very often, but they come back after a while. Sometimes it takes a long ...
1 vote | 0 answers | 227 views

Apache Kafka consumer client connecting to Apache Zookeeper: EndOfStreamException

I get an error when trying to 'consume' messages from Kafka (2.9.2-0.8.1) with a stand-alone ZooKeeper (3.4.5). You can see the source code below as well as the error message and logfile from ZooKeeper. ...
1 vote | 0 answers | 76 views

Matrix multiplication on Hadoop

I'm looking for the best and easiest way to do matrix multiplication on Hadoop in Java. I looked at this link, http://www.norstad.org/matrix-multiply/index.html, but found it tough to understand. ...
1 vote | 0 answers | 77 views

How to read a large data set at hourly intervals

For example, I have 30 million records stored in our datastore, and I want to read a random fraction of them at 2-hour intervals: e.g. I want to read 1 million random records every 2 hours, and do ...
1 vote | 0 answers | 325 views

Hive query with where clause not working

I am querying an external HBase table from Hive. When I do a simple query, select * from Document_Table_Hive, the query works and I get the records stored in the table. But when I do a query with ...
1 vote | 0 answers | 37 views

Finding and debugging bad records using Hive

Is there any way to pinpoint the bad record when we are loading data using Hive or while processing the data? The scenario goes like this. Suppose I have a file that needs to be loaded as a table using ...
1 vote | 0 answers | 26 views

Best solution for hierarchical and non-typed data

I have a db schema like this: ELEMENT(uuid[string], name[string], status[integer], ...) APPLICATION(uuid[string], name[string], config[string], status[integer], parent[foreign key on self or foreign ...
1 vote | 0 answers | 47 views

Hive optimizer not performing well for joins involving partitioned tables

I am using Hive version 0.7.1-cdh3u2. I have two big tables (let's say) A and B, both partitioned by day. I am running the following query: select col1,col2 from A join B on (A.day=B.day and ...
1 vote | 0 answers | 30 views

Trying to upgrade from CDH4.2 to CDH4.5, but cannot distribute it

I'm trying to upgrade from CDH4.2 to CDH4.5 using Cloudera Manager. I click 'download' for CDH 4.5.0-1.cdh4.5.0.p0.30 and it shows 100%, but the button still shows 'download', not 'distribute'. I click ...
1 vote | 0 answers | 178 views

Cloudera Manager. Failed to detect Cloudera Manager Server

I have two PCs with CentOS 6.5: client86-101.aihs.net (80.94.86.101) and client86-103.aihs.net (80.94.86.103). cloudera-manager-server is installed on client86-101.aihs.net. I have a problem detecting ...
1 vote | 0 answers | 56 views

How to use YARN schedulers and queues?

I need to access YARN schedulers and queues from Java programs to change the priority of submitted MR jobs. Is it possible? If it is, please help with some code snippets. Similar code for ...
1 vote | 0 answers | 11 views

How to process Dynamo objects for querying

I have complex objects (4-level relationships) such as a match and a team and a fixture which has players. I have millions of records, growing daily. How do I prepare them for reporting if I hold ...
1 vote | 0 answers | 306 views

Reading LabVIEW binary files in Matlab?

I have large .bin files (10GB-60GB) created by LabVIEW software; the .bin files represent the output of two sensors used in experiments that I have done. The problem I have is importing the data ...
1 vote | 0 answers | 141 views

Most Efficient Way of Chunking a Large Iterable in Python for Brute Forcing

I am trying to develop a way to address large parallel tasks for bruteforcing a keyspace. I'd like to be able to come up with a way to pass a worker a value in such a way that given a chunk size, that ...
1 vote | 0 answers | 15 views

Searching in a large dataset (Stays in boxes => Meetings)

I am working on a model of social interactions in mice. I have mice and boxes and a simulation that outputs which mouse stays in which box during which time period. The problem is how to obtain, in ...
1 vote | 0 answers | 597 views

Split Large table of Terabytes using MySQL Sharding

I know that horizontal partitioning...you can create many tables. I've seen that in application-based sharding, you will have the same database structure on multiple database servers. But it won't ...
1 vote | 0 answers | 786 views

MongoDB running out of disk space

I have to store around 200GB of raw data in one MongoDB collection, which works fine. After I have inserted all objects, I have to iterate over the whole collection with a cursor and write some new fields ...
1 vote | 0 answers | 70 views

NoSQL read- and write-intensive big data table

I have 10 different queries and a total of 40 columns. I'm looking for solutions among the available big data NoSQL databases that can handle read- and write-intensive jobs (multiple queries with SLA). ...
1 vote | 0 answers | 231 views

How to plot a heatmap of a big matrix with matplotlib (45K * 446)

I am trying to plot a heatmap of a big microarray dataset (45K rows by 446 columns). Using pcolor from matplotlib I am unable to do it because my PC easily runs out of memory (more than 8G). I'd ...
1 vote | 0 answers | 104 views

Error while downloading file from Jasper

I am trying to download a report from Oracle 11g using Jasper Server. The URL is called from APEX. For small files it works correctly, but when the report reaches 15MB or so the exported file ...
1 vote | 0 answers | 85 views

python shelve TypeError on large dictionary object

I have a large dictionary object dict_tmp that takes 40GB in RAM (system has a total of 64GB), which has string keys and float values. I use d = shelve.open(fname, protocol=2) and d['dict_tmp'] = ...