# data-lake

Here are 144 public repositories matching this topic...

treeverse / lakeFS

Open

A global permission is required to view a specific repository

3

johnnyaug commented Aug 27, 2021

The fs:ReadConfig permission on the resource "*" is required when viewing a specific repository from the UI.

bug good first issue area/auth

Open

Support minIO-style PutObjectExtract operation

1

Open

Links to the documentation on the webui

3

Teradata / kylo

Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.

spark hadoop data-lake teradata nifi kylo

Updated Aug 10, 2021
Java

san089 / goodreads_etl_pipeline

An end-to-end GoodReads data pipeline for building a data lake, data warehouse, and analytics platform.

Updated Mar 9, 2020
Python

apache / incubator-kyuubi

Apache Kyuubi is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark.

kubernetes sql spark hive hadoop jdbc thrift data-lake spark-sql kyuubi-server thrift-jdbc odbc-server

Updated Sep 1, 2021
Scala

san089 / Udacity-Data-Engineering-Projects

A few projects related to data engineering, including data modeling, cloud infrastructure setup, data warehousing, and data lake development.

Updated Mar 5, 2020
Python

uber / marmaray

Generic Data Ingestion & Dispersal Library for Hadoop

spark hadoop data-lake avro-schema ingest-data schema-format

Updated Jun 3, 2021
Java

cuebook / cuelake

Use SQL to build ELT pipelines on a data lakehouse.

sql apache-spark etl pipelines data-engineering data-lake data-transfer delta data-integration upsert elt data-pipeline datalake data-ingestion spark-sql zeppelin-notebook apache-iceberg lakehouse incremental-updates

Updated Aug 20, 2021
JavaScript

kaiwaehner / hivemq-mqtt-tensorflow-kafka-realtime-iot-machine-learning-training-inference

Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required

Updated Nov 5, 2020
Jupyter Notebook

awslabs / aws-serverless-data-lake-framework

Enterprise-grade, production-hardened, serverless data lake on AWS

aws framework serverless etl analytics best-practices data-engineering iac data-lake lake-formation

Updated Aug 30, 2021
Python

Azure / usql

U-SQL Examples and Issue Tracking

big-data azure data-lake u-sql

Updated May 18, 2021
C#

Azure / AzureDataLake

Samples and Docs for Azure Data Lake Store and Analytics

big-data azure data-lake

Updated Jun 15, 2021

awslabs / amazon-s3-find-and-forget

Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)

aws data privacy big-data s3 data-lake parquet gdpr right-to-be-forgotten amazon-s3 data-erasure ccpa

Updated Aug 25, 2021
Python

aws-samples / aws-dbs-refarch-datalake

Reference Architectures for Datalakes on AWS

glue amazon-emr data-transformation data-lake data-catalog data-analytics hive-metastore emr-cluster ingest-data

Updated May 13, 2020
HTML

datamindedbe / lighthouse

Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.

Updated Jul 12, 2021
Scala

smart-data-lake / smart-data-lake

Framework to quickly build and maintain Smart Data Lakes

scala spark hive hadoop transform-data data-lake data-pipelines comprehensive deltalake smart-data-lake

Updated Aug 30, 2021
Scala

Jayvardhan-Reddy / Azure-Certification-DP-200

Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution

azure data-storage resources data-engineering data-lake azure-storage batch-processing data-engineer azure-data-factory microsoft-azure azure-portal azure-cosmos-db azure-services polybase certification-prep azure-databricks exam-prep azure-certification dp-200 sql-dw

Updated Aug 5, 2020

LearningJournal / SparkProgrammingInScala

Apache Spark Course Material

scala big-data spark apache-spark bigdata data-lake datalake spark-sql spark-scala

Updated Jul 26, 2020
Scala

camunda-community-hub / zeeqs

Query API for aggregated Zeebe data

graphql data-lake hacktoberfest zeebe zeebe-tool

Updated Aug 26, 2021
Kotlin

alanchn31 / Data-Engineering-Projects

Personal Data Engineering Projects

postgres airflow spark cassandra mongodb data-warehouse data-engineering data-lake scrapy data-modeling aws-redshift star-schema ingest-data data-engineering-nanodegree

Updated Apr 1, 2021
Jupyter Notebook

aws-samples / analyzing-reddit-sentiment-with-aws

Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing Reddit comments in real time. 100-200 level tutorial.

real-time reddit sentiment-analysis data-stream tutorials data-lake kinesis-firehose self-learning sentiment-classification amazon-athena aws-glue delivery-stream

Updated Apr 20, 2021
Python

OElesin / querypal

Web UI for Amazon Athena

aws data sql analytics data-lake aws-athena

Updated Feb 10, 2021
Vue

data-mill-cloud / data-mill

A K8s-based infrastructure for analytics

infrastructure data-science machine-learning streaming spark analytics data-lake k8s lambda-architecture data-mill

Updated Jan 15, 2020
Shell

LearningJournal / Spark-Streaming-In-Python

Apache Spark 3 - Structured Streaming Course Material

python big-data apache-spark bigdata pyspark data-lake spark-streaming spark-sql

Updated Sep 26, 2020
Python

datarootsio / terraform-module-azure-datalake

Terraform module for an Azure Data Lake

azure terraform data-lake

Updated Apr 29, 2021
HCL

ExpediaGroup / hiveberg

Demonstration of a Hive Input Format for Iceberg

hive data-lake iceberg

Updated Mar 12, 2021
Java

rayyan17 / jobAnalytics_and_search

JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.

python aws airflow sql spark analytics s3 jobs pyspark data-engineering data-lake redshift jobseeker jobsearch data-modeling data-pipeline jobscheduler

Updated Aug 30, 2021
Python

0xdefendA / defenda-data-lake

defendA Data Lake: a Firehose pipeline to Athena that provides enrichment and normalization for security events.

python security data-science athena data-lake siem firehose

Updated Sep 8, 2020
Python

ec-europa / eubfr-data-lake

EU Budget for Results - Data Lake

Updated Dec 2, 2019
JavaScript

FINRAOS / herd-mdl

Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.

data-lake data-catalog mdl

Updated Aug 25, 2021
Java

vineeths96 / Data-Engineering-Nanodegree

This repository holds the Python files and notebooks associated with the Udacity Data Engineering Nanodegree.

aws airflow cassandra aws-s3 postgresql data-engineering data-lake data-modeling udacity-nanodegree data-pipeline data-warehousing aws-redshift

Updated Aug 30, 2021
PLpgSQL
