Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.

3 votes · 1 answer · 32 views

Classifying and counting database entries using Scala map and flatMap

I am new to Spark and Scala, and I have solved the following problem. I have a table in a database with the following structure: ...
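Classify-and-count questions like this one usually come down to a flatMap-then-reduceByKey pattern. A minimal plain-Python analogue of that Spark pipeline, using hypothetical rows where each database entry carries one or more category tags:

```python
from collections import Counter
from itertools import chain

# Hypothetical rows: (entry_id, list of category tags).
rows = [("a", ["x", "y"]), ("b", ["y"]), ("c", ["x"])]

# flatMap analogue: each input row expands to zero or more (tag, 1) pairs.
pairs = chain.from_iterable(((tag, 1) for tag in tags) for _, tags in rows)

# reduceByKey analogue: sum the counts per tag.
counts = Counter()
for tag, n in pairs:
    counts[tag] += n
# counts == {"x": 2, "y": 2}
```

In Spark the same shape would be `rdd.flatMap(...).reduceByKey(lambda a, b: a + b)`; the local version just makes the two stages explicit.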
0 votes · 1 answer · 112 views

Unit testing Spark transformation on DataFrame

Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data, passes it to a transformation, then makes assertions on the ...
0 votes · 0 answers · 31 views

Spark code for doing SQL operations on one or many JSON files

The input is one or many JSON files on which I run a query and print the result. I have tried both collect and saveAsTextFile, but both are slow, and I would appreciate suggestions to help speed things up. Here is ...
3 votes · 0 answers · 109 views

PySpark DataFrames program to process huge amounts of server data from a Parquet file

I'm new to Spark and DataFrames, and I'm looking for feedback on any bad or inefficient practices in my code so I can improve and learn. My program reads in a Parquet file that contains ...
0 votes · 1 answer · 117 views

Python + Spark to parse and save logs

I need to parse logs and have the following code. I can see two problems: map().filter() may incur a performance penalty, and there is a copy-pasted block in parser.py: ...
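On the map().filter() concern: both in Spark (where narrow transformations are pipelined within a stage) and in plain Python (where map is lazy), chaining the two rarely costs an extra pass, and they can be fused into one comprehension anyway. A sketch with a hypothetical parse function that returns None on failure:

```python
# Hypothetical log lines and parser; a failed parse yields None.
lines = ["GET /a 200", "garbage", "GET /b 404"]

def parse(line):
    parts = line.split()
    return tuple(parts) if len(parts) == 3 else None

# Chained style, mirroring rdd.map(parse).filter(lambda p: p is not None):
chained = [p for p in map(parse, lines) if p is not None]

# Fused style: parse and filter in a single comprehension.
fused = [p for p in (parse(l) for l in lines) if p is not None]

assert chained == fused  # identical results either way
```

The real win for readability is usually factoring the copy-pasted block into a function like `parse` above, not avoiding the filter.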
2 votes · 1 answer · 201 views

Implementing an inner product using pyspark

I'm trying to implement a dot product using pyspark in order to learn pyspark's syntax. I've currently implemented the dot product like so: ...
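A dot product over two RDDs is typically zip → map(multiply) → reduce(add), assuming the two vectors are partitioned identically so zip is valid. A plain-Python sketch of that pipeline:

```python
from functools import reduce
from operator import add

x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]

# zip the vectors, multiply pairwise, then reduce with addition,
# mirroring rdd1.zip(rdd2).map(lambda ab: ab[0] * ab[1]).reduce(add)
dot = reduce(add, (a * b for a, b in zip(x, y)))
assert dot == 32.0  # 1*4 + 2*5 + 3*6
```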
1 vote · 0 answers · 103 views

Increase performance of a Spark collaborative-recommendation job

This is my first Spark application. I am using ALS.train for training the model (matrix factorization). The total time the application takes is approximately 45 minutes. Note: I think takeOrdered is the ...
1 vote · 0 answers · 91 views

Performance of collect and parallelize in Spark

The code below is working fine, but as this is new to me, please help me improve its performance. I have not yet included my complex logic, but I am confused about collect and parallelize. The aim is to collect ...
10 votes · 1 answer · 9k views

Generic “reduceBy” or “groupBy + aggregate” functionality with Spark DataFrame

Maybe I totally reinvented the wheel, or maybe I've invented something new and useful. Can one of you tell me if there's a better way of doing this? Here's what I'm trying to do: I want a generic ...
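A generic reduce-by-key can be sketched in plain Python as a dictionary fold with a caller-supplied combine function, analogous in spirit to DataFrame `groupBy(...).agg(...)` or RDD `reduceByKey`; the function name here is illustrative, not part of any Spark API:

```python
def reduce_by_key(pairs, combine):
    """Fold the values of each key with a caller-supplied combine function."""
    out = {}
    for k, v in pairs:
        out[k] = combine(out[k], v) if k in out else v
    return out

data = [("a", 1), ("b", 2), ("a", 3)]
assert reduce_by_key(data, lambda x, y: x + y) == {"a": 4, "b": 2}
assert reduce_by_key(data, max) == {"a": 3, "b": 2}
```

Passing the aggregation in as a function is what makes the helper generic: sum, max, min, or any associative combiner all reuse the same grouping loop.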
4 votes · 0 answers · 147 views

RandomForest multi-class classification

Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
7 votes · 2 answers · 402 views

Average movie rankings

Given a list of tuples of the form (a, b, c), is there a more direct or optimized way of calculating the average of all the c's ...
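The direct approach is a single pass that extracts the third element of each tuple and averages it; in Spark this would be a map to c followed by mean(). A plain-Python sketch with made-up tuples:

```python
# Hypothetical (a, b, c) triples; only c matters for the average.
triples = [(1, "x", 2.0), (2, "y", 4.0), (3, "z", 6.0)]

# map to c, then average: mirrors rdd.map(lambda t: t[2]).mean()
cs = [c for _, _, c in triples]
avg_c = sum(cs) / len(cs)
assert avg_c == 4.0
```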
0 votes · 1 answer · 325 views

Class for finding the median of a two-dimensional space

I have a simple static class whose purpose is, given an RDD of Point, to find the median of each dimension and return that as a new ...
8 votes · 2 answers · 19k views

Producing a sorted wordcount with Spark

I'm currently learning how to use Apache Spark. In order to do so, I implemented a simple wordcount (not really original, I know). There already exists an example in the documentation providing the ...
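The canonical Spark pipeline for a sorted wordcount is flatMap(split) → map to (word, 1) → reduceByKey(add) → sortBy(count, descending). A plain-Python analogue of those stages, with Counter standing in for reduceByKey:

```python
from collections import Counter

lines = ["to be or not to be", "to see or not to see"]

# flatMap + reduceByKey: split every line into words, count per word.
counts = Counter(word for line in lines for word in line.split())

# sortBy: descending count, alphabetical order to break ties.
ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
assert ranked[0] == ("to", 4)
```

The tie-break in the sort key is worth keeping even in the Spark version, so output order is deterministic across runs.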
5 votes · 1 answer · 1k views

Why does LR on Spark run so slowly?

Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on a Spark cluster. The settings are: 5 nodes, each with 8 cores (all the ...