Postgresql - Add 2 random columns

Question

I have a large dataset, which I want to divide into training, validation and test set according to some predefined probability at the beginning (e.g. with 0.8 probability, a tuple is being used for training, 0.1 for validation and 0.1 for testing.)

I though on adding 2 boolean columns, training and validation. Both are dependent on a single random number. If now this random number is below 0.8,the tuple is for training, and if it is >0.9 it is for testing. Otherwise, both columns are false, and therefore it is for validation.

The following query has the problem, that all tuples use the same random ID, meaning all have the same value according to train/validate/test.

update data set 
        test = (sub.rand >= 0.9),
       train = (sub.rand <= 0.8) from (select random() rand) sub;

I want try to make it in one query,because the dataset can be very large and I think, one query would be faster than when splitting it up into multiple ones. (However, any solution for this problem is appreciated).

Thanks for your help.

klin · Accepted Answer · 2016-11-17 10:46:40Z

1

Try just

update data set 
    test = random() >= 0.9,
    train = random() <= 0.8;

answered Nov 17, 2016 at 10:46

klin

123k15 gold badges239 silver badges262 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

Postgresql - Add 2 random columns

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related