vmware/versatile-data-kit

Latest commit

* vdk-csv: add export-csv command

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* vdk-csv: modified requirements.txt

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* vdk-csv: changes on failing tests

* vdk-csv: changes on failing tests and adding examples

* cicd: increase memory limits for cicd runner (#927)


what: We intend to increase the limits of our CI/CD containers. Updated the values.yaml file. The script "install-runners.sh" has to be manually triggered for the changes to take effect.

why: Control Service Integration Tests have been failing with OOM errors. Although this doesn't directly update the values,
it serves as a reference point for future changes.

testing: n/a

Signed-off-by: Momchil Zhivkov [email protected]

* versatile-data-kit: make easier slack instructions (#925)

Add a link to Slack inviter (https://communityinviter.com/apps/cloud-native/cncf) to make it easier for people to join CNCF slack. 
Otherwise, they need to read the docs and it becomes more complex.

Signed-off-by: Antoni Ivanov <[email protected]>

* vdk-airflow: populate readme (#924)

* control-service: increase integration test builder memory (#929)


what: Increased the gradle builder/worker limits

why: Integration tests keep failing with memory issues.

testing: tests complete

Signed-off-by: Momchil Zhivkov [email protected]

* [pre-commit.ci] pre-commit autoupdate (#928)

updates:
- [github.com/asottile/pyupgrade: v2.37.2 → v2.37.3](asottile/pyupgrade@v2.37.2...v2.37.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Andy <[email protected]>

* vdk-core: add memory properties client (#921)

In case Control Service is not available, for demo purposes, it would be
good to be able to use a simple client so that jobs that still use
properties would work.
To make it clear that it's not recommended for production (after all,
properties are supposed to be used to store state, so an in-memory client
is rather pointless), we write a warning each time write properties is
called.
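
The idea can be sketched roughly as follows. This is only an illustration, not the code added by the commit: the class name `InMemoryPropertiesClient`, the method signatures, and the warning text are all assumptions made for the example.

```python
import logging
from typing import Dict

log = logging.getLogger(__name__)


class InMemoryPropertiesClient:
    """Hypothetical in-memory properties store; all state is lost on process exit."""

    def __init__(self) -> None:
        # Properties are kept per job name in a plain dictionary.
        self._props: Dict[str, Dict] = {}

    def read_properties(self, job_name: str, team_name: str) -> Dict:
        # Return a copy so callers cannot mutate the stored state by accident.
        return dict(self._props.get(job_name, {}))

    def write_properties(self, job_name: str, team_name: str, properties: Dict) -> Dict:
        # Warn on every write: nothing is persisted beyond this process.
        log.warning(
            "Properties for job %s are stored in memory only and will be lost "
            "when the process exits. Not recommended for production.",
            job_name,
        )
        self._props[job_name] = dict(properties)
        return dict(properties)
```

A job could then write `{"last_run": "2022-08-01"}` and read it back within the same process, but not across runs.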

Signed-off-by: Antoni Ivanov <[email protected]>

* vdk-examples: add ingest and anonymize example (#922)

The example was inspired by some talks with potential users. Regardless,
it's a pretty interesting and useful example, so it was added to
the collection of examples we have. We do not have a lot of examples of
how plugins can be used, and this shows they can be pretty powerful.

Signed-off-by: Antoni Ivanov <[email protected]>

* vdk-core: Improve ingestion error logging (#930)

Currently, after ingestion completes, a user is presented
with the following message:
```
Successful uploads:1
Failed uploads:0
ingesting plugin errors:defaultdict(<class 'vdk.internal.builtin_plugins.ingestion.ingester_utils.AtomicCounter'>, {})
```

This is not ideal: printing internal Python representations
in user logs can be misleading, as a user might believe
some error has occurred when none did.

This change fixes this by printing `None` instead when no
errors occur, and printing the dictionary result instead of
the defaultdict representation when errors did occur.
```
Successful uploads: 1
Failed uploads: 0
Ingesting plugin errors: None
```
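
A minimal sketch of the formatting rule described above, assuming the errors are collected in a `defaultdict` keyed by error name. The function name `format_ingestion_summary` is hypothetical, and plain integers stand in for the counter objects used in vdk-core.

```python
from collections import defaultdict
from typing import Mapping, Optional


def format_ingestion_summary(
    successful: int, failed: int, plugin_errors: Optional[Mapping[str, int]]
) -> str:
    # An empty (or missing) mapping is reported as None; otherwise convert to a
    # plain dict so the defaultdict wrapper does not leak into user logs.
    errors = dict(plugin_errors) if plugin_errors else None
    return (
        f"Successful uploads: {successful}\n"
        f"Failed uploads: {failed}\n"
        f"Ingesting plugin errors: {errors}"
    )


print(format_ingestion_summary(1, 0, defaultdict(int)))
```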

Testing done: tested locally

Signed-off-by: Gabriel Georgiev <[email protected]>

* vdk-core,vdk-impala,vdk-lineage,vdk-trino: Support for pluggy 1.0 (#931)

The 1.0 release of `pluggy` introduced a breaking change by
renaming its `callers` module to `_callers`. This has been
fixed by using the `HookCallResult` constant from
`vdk.api.plugin.plugin_registry` instead of
`pluggy.callers._Result` everywhere necessary, and amending
said constant to be either `pluggy.callers._Result` or
`pluggy._callers._Result` dynamically based on the version of
`pluggy` in the current Python env.
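
One way to picture the dynamic lookup is the sketch below. It is an assumption-laden illustration rather than the code in `vdk.api.plugin.plugin_registry`: the helper `first_available` is hypothetical, and it probes import locations instead of comparing version numbers, which is a swapped-in technique. It returns `None` when `pluggy` is not installed or exposes neither location.

```python
import importlib
from typing import Any, Iterable, Optional, Tuple


def first_available(candidates: Iterable[Tuple[str, str]]) -> Optional[Any]:
    """Return the first attribute that can be imported from the candidate list."""
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # The module does not exist in the installed version; try the next one.
            continue
        if hasattr(module, attr_name):
            return getattr(module, attr_name)
    return None


# Prefer the location used by pluggy 1.0, then fall back to the older one.
HookCallResult = first_available(
    [("pluggy._callers", "_Result"), ("pluggy.callers", "_Result")]
)
```

Code that needs the result type can then refer to `HookCallResult` without knowing which module layout is installed.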

Also fixed a test which had `hookwrapper` set to True for some
reason inside its testing plugin.

Testing done: ran tests locally, CICD

Signed-off-by: Gabriel Georgiev <[email protected]>

* control-service: Atomic job cancellation (#860)

Currently, there is a possibility for a job to be set to be cancelled,
and have it complete before the cancellation can complete. This
leads to a 500 error with no explanation.

This change introduces an additional check that the operation
response or its status is not null, and if it is, it logs an appropriate
message and raises an appropriate exception.

Testing done: unit tests

Signed-off-by: Gabriel Georgiev <[email protected]>
Co-authored-by: Miroslav Ivanov <[email protected]>
Co-authored-by: Momchil Z <[email protected]>

* vdk-csv: changes on unit tests

* vdk-csv: changes on no data in database handling and new unit tests

* vdk-impala, vdk-trino: Remove deprecated use of result field (#933)

The _Result object from pluggy features a result field,
which was deprecated in earlier versions in favour of
get_result(), and removed entirely in 1.0. This change
removes references to it which caused an error for a
heartbeat test.

Testing done: CICD

Signed-off-by: Gabriel Georgiev <[email protected]>

* vdk-csv: add helper methods and delete unused imports

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Momchil Z <[email protected]>
Co-authored-by: Antoni Ivanov <[email protected]>
Co-authored-by: Andy <[email protected]>
Co-authored-by: Gabriel Georgiev <[email protected]>
Co-authored-by: Miroslav Ivanov <[email protected]>
dc79362


Versatile Data Kit

Overview

Versatile Data Kit (VDK) is an open source framework that enables anybody with basic SQL or Python knowledge to create their own data pipelines.

Versatile Data Kit enables Data Engineers to develop, deploy, run and manage Data Jobs. A Data Job is a data processing workload and can be written in Python, SQL, or both at the same time. A Data Job enables Data Engineers to implement automated pull ingestion (E in ELT) and batch data transformation (T in ELT) into a database or any type of data storage.

Versatile Data Kit consists of two main components:

  • A Data SDK provides all tools for the automation of data extraction, transformation, and loading, as well as a plugin framework that allows users to extend the framework according to their specific requirements.
  • A Control Service allows users to create, deploy, manage and execute Data Jobs in a Kubernetes runtime environment.

To help solve common data engineering problems, Versatile Data Kit:

  • allows ingestion of data from different sources, including CSV files, JSON objects, data provided by REST API services, etc.;
  • ensures data applications are packaged, versioned, and deployed correctly while dealing with credentials, retries, reconnects, etc.;
  • provides built-in monitoring and smart notification capabilities;
  • tracks both code and data modifications and the relations between them, enabling engineers to troubleshoot faster and providing an easy revert to a stable version.

Data Journey and where VDK fits in

[Figure: Data Journey]

Installation and Getting Started

Install Versatile Data Kit SDK

pip install -U pip setuptools wheel
pip install quickstart-vdk

Note that Versatile Data Kit requires Python 3.7+.

See the Installation page for more details.

Use

# see the help to learn what you can do
vdk --help

Check out the Getting Started page to create and run your first Data Job.

Documentation

Official documentation for Versatile Data Kit can be found here.

Contributing

If you are interested in contributing as a developer, visit CONTRIBUTING.md.

Contacts

Feedback is very welcome via the GitHub site as issues or pull requests.

How to use Versatile Data Kit?

For the full list of resources, go to Community and Resources.

Code of Conduct

Everyone involved in working on the project's source code, or engaging in any issue trackers, Slack channels and mailing lists is expected to follow the Code of Conduct.