Apache Spark is an open source cluster computing system that aims to make data analytics fast — both fast to run and fast to write.
3 votes · 1 answer · 32 views
Classifying and counting database entries using Scala map and flatMap
I am new to Spark and Scala, and I have solved the following problem. I have a table in a database with the following structure:
...
0 votes · 1 answer · 112 views
Unit testing Spark transformation on DataFrame
Looking for suggestions on how to unit test a Spark transformation with ScalaTest. The test class generates a DataFrame from static data and passes it to a transformation, then makes assertions on the ...
0 votes · 0 answers · 31 views
Spark code for doing SQL operations on single or multiple JSON files
Input is one or many JSON files on which to run a query and print the result.
I have tried both collect and saveAsTextFile, but both are slow, and I would appreciate suggestions to help speed things up.
Here is ...
3 votes · 0 answers · 109 views
PySpark Dataframes program to process huge amounts of server data from a parquet file
I'm new to spark and dataframes and I'm looking for feedback on what bad or inefficient processes might be in my code so I can improve and learn. My program reads in a parquet file that contains ...
0 votes · 1 answer · 117 views
Python + spark to parse and save logs
I need to parse logs and have the following code. I can see two problems: map().filter() may incur a performance penalty, and there is a copy-pasted block.
parser.py:
...
2 votes · 1 answer · 201 views
Implementing an inner product using pyspark
I'm trying to implement a dot product using pyspark in order to learn pyspark's syntax.
I've currently implemented the dot product like so:
...
1 vote · 0 answers · 103 views
Increase performance of a Spark collaborative-recommendation job
This is my first Spark application. I am using ALS.train to train the model (matrix factorization). In total, the application takes approximately 45 minutes.
Note: I think takeOrdered is the ...
1 vote · 0 answers · 91 views
Performance of collect and parallelize in Spark
The code below works fine, but as this is new to me, please help me improve its performance. I have not yet included my complex logic, but I am confused about collect and parallelize.
The aim is to collect ...
10 votes · 1 answer · 9k views
Generic “reduceBy” or “groupBy + aggregate” functionality with Spark DataFrame
Maybe I totally reinvented the wheel, or maybe I've invented something new and useful. Can one of you tell me if there's a better way of doing this? Here's what I'm trying to do:
I want a generic ...
4 votes · 0 answers · 147 views
RandomForest multi-class classification
Below is the code I have for a RandomForest multiclass-classification model. I am reading from a CSV file and doing various transformations as seen in the code. I ...
7 votes · 2 answers · 402 views
Average movie rankings
Given a list of tuples of the form (a, b, c), is there a more direct or optimized way of calculating the average of all the c's ...
0 votes · 1 answer · 325 views
Class for finding the median of a two-dimensional space
I have a simple static class whose purpose is, given an RDD of Point, to find the median of each dimension and return that as a new ...
8 votes · 2 answers · 19k views
Producing a sorted wordcount with Spark
I'm currently learning how to use Apache Spark. In order to do so, I implemented a simple wordcount (not really original, I know). There is already an example in the documentation providing the ...
5 votes · 1 answer · 1k views
Why does LR on Spark run so slowly?
Because MLlib does not support sparse input, I ran the following code, which supports the sparse input format, on Spark clusters. The settings are:
5 nodes, each node with 8 cores (all the ...