I have problem with setting missing values in data frame. In the first 3 columns there are ID of product, ID of store, and number of week. There are also 28 columns from 4 to 31 corresponding to last 28 days of selling item (last 7 days are days in our week). I want to set the missing values by comparing two records with the same first and second column but different number of weeks.
corrections <- function(x,y){
#the functions changes vector y if the difference between weeks is not greeter than 3
if (x[1]==y[1] && x[2]==y[2] && -(x[3]-y[3])<=3){
t=y[3]-x[3]
t=as.integer(t)
a=x[(4+ (t*7) ):31]
b=y[4:(31- (t*7)) ]
c= a-b
for (i in 1:(28-(t*7))){
if (is.na(c[i]))
{
if (!(is.na(a[i]) && is.na(b[i])))
{
if (is.na(b[i]))
b[i]=a[i]
else
a[i]=b[i]
}
}
}
y[4:(31- t*7)]=b
}
return(y)
}
for (i in 2:(dim(salesTraining)[1]) {
salesTraining[i,]=corrections(salesTraining[i-1,], salesTraining[i,])
}
the loop takes 1 minute for every 1000 records so if my data have 212000 records it will take ~3,5 hours (if it's linear complexity). Is there any error or can I do it better - faster?
Example of data frame:
productID storeID weekInData dailySales1 dailySales2 dailySales3 dailySales4 dailySales5 1 1 1 37 0 0 0 0 0 2 1 1 38 0 0 0 0 0 3 1 1 39 0 0 0 0 0 4 1 1 40 0 NA 0 NA 2 5 1 1 41 NA 0 NA 0 0 6 1 1 42 0 0 0 NA 0 7 1 1 43 0 0 NA 0 NA 8 1 1 44 0 2 1 NA 0 9 1 1 45 NA 0 0 NA 0 10 1 1 46 NA 0 0 NA NA dailySales6 dailySales7 dailySales8 dailySales9 dailySales10 dailySales11 dailySales12 dailySales13 1 NA NA 0 NA 0 0 0 0 2 0 NA NA 0 0 0 0 0 3 0 NA 0 0 0 NA 2 NA 4 0 NA 0 NA 0 NA 0 0 5 0 0 NA 0 0 0 0 0 6 NA 0 NA 0 0 0 0 0 7 0 0 0 2 NA 0 0 0 8 0 NA 0 NA 0 NA 0 1 9 1 0 0 0 0 0 1 0 10 0 0 0 NA 0 NA 0 0 dailySales14 dailySales15 dailySales16 dailySales17 dailySales18 dailySales19 dailySales20 1 0 0 0 0 0 0 0 2 0 0 0 0 5 2 NA 3 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 6 0 0 2 1 0 0 NA 7 0 0 0 0 0 0 1 8 0 0 0 0 0 1 0 9 0 0 -1 0 0 0 0 10 0 0 0 0 0 0 0 dailySales21 dailySales22 dailySales23 dailySales24 dailySales25 dailySales26 dailySales27 1 NA 0 0 0 5 2 0 2 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 4 0 0 0 0 0 0 0 5 0 0 NA 1 0 0 0 6 0 0 0 0 0 0 1 7 0 0 0 0 0 1 0 8 0 0 NA 0 0 0 0 9 NA 0 0 0 NA 0 0 10 0 1 0 0 0 0 0 dailySales28 daysStoreClosed_series daysStoreClosed_target dayOfMonth dayOfYear weekOfYear month 1 0 5 2 23 356 51 12 2 0 6 2 30 363 52 12 3 0 6 1 6 5 1 1 4 0 6 1 13 12 2 1 5 0 6 1 19 18 3 1 6 0 5 1 26 25 4 1 7 0 4 1 2 32 5 2 8 0 4 1 9 39 6 2 9 0 4 1 16 46 7 2 10 0 4 1 23 53 8 2 quarter 1 4 2 4 3 1 4 1 5 1 6 1 7 1 8 1 9 1 10 1
dailySales28
on weekn
matchdailySales21
on weekn+1
? Would it not be better if each row only contained 7 days of data instead? Eventually, I feel a data.frame with daily data would be more appropriate, although it could depend on the type of data analysis you are planning. – flodel May 16 at 16:25