# apache-spark
Apache Spark is an open source distributed general-purpose cluster-computing framework. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
Here are 1,011 public repositories matching this topic...
Cool Play with Spark: Spark source code analysis, Spark libraries, and more
Updated May 26, 2019 - Scala
Interactive and Reactive Data Science using Scala and Spark.
Updated Jun 2, 2020 - JavaScript
Distributed Tensorflow, Keras and PyTorch on Apache Spark/Flink & Ray
python
scala
apache-spark
pytorch
keras-tensorflow
bigdl
distributed-deep-learning
deep-neural-network
analytics-zoo
Updated Mar 12, 2021 - Jupyter Notebook
Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
Updated Feb 18, 2021 - Java
GoEddie commented on Dec 30, 2019
This is to track implementation of the ML-Features: https://spark.apache.org/docs/latest/ml-features
Bucketizer has been implemented in dotnet/spark#378 but there are more features that should be implemented.
- Feature Extractors
- TF-IDF
- Word2Vec (dotnet/spark#491)
- CountVectorizer (https://github.com/dotnet/spark/p
Kubernetes operator for managing the lifecycle of Apache Spark applications on Kubernetes.
kubernetes
spark
apache-spark
kubernetes-operator
kubernetes-controller
kubernetes-crd
google-cloud-dataproc
Updated Mar 9, 2021 - Go
Apache Spark docker image
Updated Feb 24, 2021 - Dockerfile
PySpark + Scikit-learn = Sparkit-learn
Updated Dec 31, 2020 - Python
A curated list of awesome Apache Spark packages and resources.
Updated Mar 5, 2021
(Deprecated) Scikit-learn integration package for Apache Spark
Updated Dec 3, 2019 - Python
C# and F# language binding and extensions to Apache Spark
streaming
spark
apache-spark
csharp
fsharp
bigdata
dataset
spark-streaming
eventhubs
mapreduce
dataframe
rdd
dstream
mobius
kafka-streaming
near-real-time
Updated Jan 29, 2021 - C#
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
python
airflow
spark
apache-spark
scheduler
s3
data-engineering
data-lake
warehouse
redshift
data-migration
livy
etl-framework
apache-airflow
emr-cluster
etl-pipeline
etl-job
data-engineering-pipeline
airflow-dag
goodreads-data-pipeline
Updated Mar 9, 2020 - Python
R interface for Apache Spark
Updated Mar 11, 2021 - R
Code examples that show how to integrate Apache Kafka 0.8+ with Apache Storm 0.9+ and Apache Spark Streaming 1.1+, while using Apache Avro as the data serialization format.
Updated Jan 24, 2017 - Scala
Distributed Deep Learning, with a focus on distributed training, using Keras and Apache Spark.
data-science
machine-learning
spark
apache-spark
deep-learning
hadoop
tensorflow
keras
keras-models
optimization-algorithms
data-parallelism
distributed-optimizers
Updated Jul 25, 2018 - Python
Apache Spark enhanced with native Kubernetes scheduler back-end: NOTE this repository is being ARCHIVED as all new development for the kubernetes scheduler back-end is now on https://github.com/apache/spark/
Updated Jan 8, 2020 - Scala
A command-line tool for launching Apache Spark clusters.
Updated Mar 5, 2021 - Python
A reading list of papers related to streaming systems
streaming
apache-spark
storm
stream-processing
spark-streaming
dataflow
flink
heron
drizzle
millwheel
s4
streaming-engine
spe
stream-processing-engine
Updated Mar 31, 2018
REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models
Updated Feb 22, 2021 - Java
Code for Agile Data Science 2.0, O'Reilly 2017, Second Edition
python
vagrant
data-science
data
machine-learning
airflow
kafka
spark
apache-spark
analytics
machine-learning-algorithms
python3
amazon-ec2
python-3
apache-kafka
amazon-web-services
predictive-analytics
agile-data
data-syndrome
agile-data-science
Updated Feb 8, 2021 - Jupyter Notebook
A list about Apache Kafka
infrastructure
kafka
apache-spark
stream-processing
apache-kafka
kafka-streams
data-processing
data-pipeline
streaming-data
Updated Jan 21, 2021
The Internals of Spark Structured Streaming
Updated Feb 17, 2021
This is the development repository of SparkMeasure, a tool for performance troubleshooting of Apache Spark workloads. It simplifies the collection and analysis of Spark task metrics data.
Updated Oct 14, 2020 - Scala
Wirbelsturm is a Vagrant and Puppet based tool to perform 1-click local and remote deployments, with a focus on big data tech like Kafka.
Updated Sep 14, 2015 - Shell
A boilerplate for writing PySpark Jobs
Updated Jul 1, 2020 - Python
Spark Gotchas. A subjective compilation of Apache Spark tips and tricks
Updated Jun 6, 2017
Created by Matei Zaharia
Released May 26, 2014
- Repository: apache/spark
- Website: spark.apache.org
- Wikipedia