Cascading is a feature rich API for defining and executing complex and fault tolerant data processing workflows on various cluster computing platforms. Please see https://github.com/cwensel/cascading for access to all WIP branches.
The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
Generates fake data for big data projects. Have capability to generate medical, industry datasets. File size as well number of files and number of records can be configured