1

I have a large dataset, which I want to divide into training, validation and test set according to some predefined probability at the beginning (e.g. with 0.8 probability, a tuple is being used for training, 0.1 for validation and 0.1 for testing.)

I though on adding 2 boolean columns, training and validation. Both are dependent on a single random number. If now this random number is below 0.8,the tuple is for training, and if it is >0.9 it is for testing. Otherwise, both columns are false, and therefore it is for validation.

The following query has the problem, that all tuples use the same random ID, meaning all have the same value according to train/validate/test.

update data set 
        test = (sub.rand >= 0.9),
       train = (sub.rand <= 0.8) from (select random() rand) sub;

I want try to make it in one query,because the dataset can be very large and I think, one query would be faster than when splitting it up into multiple ones. (However, any solution for this problem is appreciated).

Thanks for your help.

1 Answer 1

1

Try just

update data set 
    test = random() >= 0.9,
    train = random() <= 0.8;
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.