big-data

Data science Python notebooks: Deep learning (TensorFlow, Theano, Caffe, Keras), scikit-learn, Kaggle, big data (Spark, Hadoop MapReduce, HDFS), matplotlib, pandas, NumPy, SciPy, Python essentials, AWS, and various command lines.

python aws data-science machine-learning caffe theano big-data spark deep-learning hadoop tensorflow numpy scikit-learn keras pandas kaggle scipy matplotlib mapreduce

Updated Jun 29, 2023
Python

apache / flink

Apache Flink

python java scala sql big-data flink

Updated Jul 17, 2023
Java

amark / gun

An open source cybersecurity protocol for syncing decentralized graph data.

Updated Jun 15, 2023
JavaScript

prestodb / presto

The official home of the Presto distributed SQL query engine for big data

java data query sql big-data presto hive hadoop lakehouse

Updated Jul 16, 2023
Java

heibaiying / BigData-Notes

A beginner's guide to big data ⭐

phoenix scala kafka big-data spark yarn hive hadoop storm bigdata hbase zookeeper hdfs mapreduce flume azkaban sqoop

Updated Jul 14, 2023
Java

apache / predictionio

PredictionIO, a machine learning server for developers and ML engineers.

scala big-data predictionio

Updated Jan 9, 2021
Scala

questdb / questdb

An open source time-series database for fast ingest and SQL queries

java iot postgres sql database big-data time-series analytics cpp grafana postgresql simd low-latency financial-analysis tsdb hacktoberfest time-series-database questdb

Updated Jul 14, 2023
Java

andkret / Cookbook

The Data Engineering Cookbook

big-data best-practices cookbook data-engineering data-engineer

Updated Apr 11, 2023

yahoo / CMAK

CMAK is a tool for managing Apache Kafka clusters

scala kafka big-data cluster-management

Updated May 25, 2023
Scala

vesoft-inc / nebula

A distributed, fast open-source graph database featuring horizontal scalability and high availability

distributed-systems database big-data cpp graph raft scalability distributed graph-database graphdb nebula nebula-graph nebulagraph

Updated Jul 17, 2023
C++

trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)

java distributed-systems data-science sql database big-data presto hive hadoop analytics jdbc databases distributed-database query-engine iceberg datalake prestodb trino delta-lake

Updated Jul 16, 2023
Java

cython / cython

The most widely used Python to C compiler

python c performance big-data cpp cython cpython cpython-extensions

Updated Jul 16, 2023
Python

catboost / catboost

A fast, scalable, high-performance Gradient Boosting on Decision Trees library, used for ranking, classification, regression, and other machine learning tasks for Python, R, Java, and C++. Supports computation on CPU and GPU.

python data-science machine-learning data-mining tutorial r big-data gpu cuda kaggle gbdt gbm gpu-computing decision-trees gradient-boosting coreml catboost categorical-features

Updated Jul 16, 2023
Python

apache / beam

Apache Beam is a unified programming model for Batch and Streaming data processing.

python java golang streaming sql big-data beam batch

Updated Jul 17, 2023
Java

apache / storm

Mirror of Apache Storm

java big-data storm

Updated Jul 14, 2023
Java

h2oai / h2o-3

H2O is an Open Source, Distributed, Fast & Scalable Machine Learning Platform: Deep Learning, Gradient Boosting (GBM) & XGBoost, Random Forest, Generalized Linear Modeling (GLM with Elastic Net), K-Means, PCA, Generalized Additive Models (GAM), RuleFit, Support Vector Machine (SVM), Stacked Ensembles, Automatic Machine Learning (AutoML), etc.