I have a large dataset, which I want to divide into training, validation and test set according to some predefined probability at the beginning (e.g. with 0.8 probability, a tuple is being used for training, 0.1 for validation and 0.1 for testing.)
I though on adding 2 boolean columns, training and validation. Both are dependent on a single random number. If now this random number is below 0.8,the tuple is for training, and if it is >0.9 it is for testing. Otherwise, both columns are false, and therefore it is for validation.
The following query has the problem, that all tuples use the same random ID, meaning all have the same value according to train/validate/test.
update data set
test = (sub.rand >= 0.9),
train = (sub.rand <= 0.8) from (select random() rand) sub;
I want try to make it in one query,because the dataset can be very large and I think, one query would be faster than when splitting it up into multiple ones. (However, any solution for this problem is appreciated).
Thanks for your help.