Sign up ×
Stack Overflow is a question and answer site for professional and enthusiast programmers. It's 100% free.

I have a solution to a problem that involves looping, and works, but I feel I am missing something that involves a more efficient implementation. The problem: I have a numeric vector sequence, and want to identify the starting position(s) in another vector of the first vector.

It works like this:

# helper function for matchSequence
# wraps a vector by removing the first n elements and padding end with NAs
wrapVector <- function(x, n) {
    stopifnot(n <= length(x))
    if (n == length(x)) 
        return(rep(NA, n))
    else
        return(c(x[(n+1):length(x)], rep(NA, n)))
}

wrapVector(LETTERS[1:5], 1)
## [1] "B" "C" "D" "E" NA
wrapVector(LETTERS[1:5], 2)
## [1] "C" "D" "E" NA  NA

# returns the starting index positions of the sequence found in a vector
matchSequence <- function(seq, vec) {
    matches <- seq[1] == vec
    if (length(seq) == 1) return(which(matches))
    for (i in 2:length(seq)) {
        matches <- cbind(matches, seq[i] == wrapVector(vec, i - 1))
    }
    which(rowSums(matches) == i)
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence(1:2, myVector)
## [1] 3 7
matchSequence(c(4, 1, 1), myVector)
## [1] 5
matchSequence(1:3, myVector)
## integer(0)

Is there a better way to implement matchSequence()?

Added

"Better" here can mean using more elegant methods I didn't think of, but even better, would mean faster. Try comparing solutions to:

set.seed(100)
myVector2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
matchSequence(c(4, 1, 1), myVector2)
## [1]  12  48  91 120 252 491 499 590 697 771 865

microbenchmark::microbenchmark(matchSequence(c(4, 1, 1), myVector2))
## Unit: microseconds
##                                 expr     min       lq     mean   median       uq     max naval
## matchSequence(c(4, 1, 1), myVector2) 154.346 160.7335 174.4533 166.2635 176.5845 300.453   100
share|improve this question
1  
I tested the answers. Yours is 5x faster than Josh's; and 50x faster than mine and howard's on your example. –  Frank yesterday

5 Answers 5

up vote 6 down vote accepted

Another attempt which I believe is quicker again. This owes its speed to only checking for matches from points in the vector which match the start of the searched-for sequence.

flm <- function(sq, vec) {
  hits <- which(sq[1]==vec)
  out <- hits[
    colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
  ]
  out[!is.na(out)]
}

Benchmark results:

#Unit: relative
#  expr      min       lq     mean   median       uq     max neval
# josh2 2.469769 2.393794 2.181521 2.353438 2.345911 1.51641   100
#    lm 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000   100
share|improve this answer
    
Nice! My Filter approach also checked only from such points, but yours is presumably much faster due to vectorization. –  Frank 23 hours ago

Here's a somewhat different idea:

f <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(apply(mm, 2, all))
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)

f(1:2, myVector)
# [1] 3 7
f(c(4,1,1), myVector)
# [1] 5
f(1:3, myVector)
# integer(0)
share|improve this answer
2  
Very nice use of embed()! As @Frank states, this is slower, but only because of the apply(, , all). Although which(apply(mm, 2, all)) is more elegant, it's much slower. Change that line to which(colSums(matches) == length(vec)) and it becomes 10x faster. –  Ken Benoit yesterday

Another idea:

match_seq2 <- function(s,v){
  n  = length(s)
  nc = length(v)-n+1
  which(
    n == rowsum(
      as.integer(v[ rep(0:(n-1), nc) + rep(1:nc, each=n) ] == s),
      rep(seq(nc),each=n)
    )
  )
}

I tried a tapply version, but it was ~4x as slow.


First idea:

match_seq <- function(s, v) Filter( 
  function(i) all.equal( s, v[i + seq_along(s) - 1] ), 
  which( v == s[1] )
) 

# examples:
my_vec <- c(3, NA, 1, 2, 4, 1, 1, 2)
match_seq(1:2, my_vec)      # 3 7
match_seq(c(4,1,1), my_vec) # 5
match_seq(1:3, my_vec)      # integer(0)

I'm using all.equal instead of identical because the OP wants integer 1:2 to match numeric c(1,2). This approach introduces one more case by allowing for matching against points beyond the end of my_vec (which are NA when indexed):

match_seq(c(1,2,NA), my_vec) # 7

The OP's benchmark

# variant on Josh's, suggested by OP:

f2 <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(colSums(mm)==length(seq))
}

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}

set.seed(100)
my_vec2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
s       <- c(4,1,1)
microbenchmark(
    op = matchSequence(s, my_vec2), 
    josh = f(s, my_vec2), 
    josh2 = f2(s, my_vec2), 
    frank = match_seq(s, my_vec2), 
    frank2 = match_seq2(s, my_vec2), 
    jlh = matchSequence2(s, my_vec2),
    tlm = flm(s, my_vec2),
    unit = "relative", check=my_check)

Results:

Unit: relative
   expr        min         lq       mean     median         uq       max neval
     op   2.740224   2.696656   2.474477   2.579514   2.530742  1.320109   100
   josh  11.869979  11.088008  10.101765  10.632478  10.572857  2.204239   100
  josh2   2.294346   2.180026   2.042395   2.113340   2.108952  1.147430   100
  frank 127.457276 118.557030 107.296884 112.423461 113.092295 39.416594   100
 frank2   6.982603   6.542039   5.944226   6.237626   6.246001  1.671778   100
    jlh 158.573764 146.537297 129.315930 138.543959 137.649221 19.330479   100
    tlm   1.000000   1.000000   1.000000   1.000000   1.000000  1.000000   100

So thelatemails's wins! (Anyone else can feel free to update this benchmark if so inclined.)

share|improve this answer
    
Good point about the numeric types. –  Ken Benoit yesterday
1  
Brilliant stuff! I accepted @thelatemail's answer on the basis of performance and parsimony but really appreciate your analysis and comparisons. Thanks all. –  Ken Benoit 13 hours ago

Here's another way:

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence <- function(seq,vec) {
  n.vec <- length(vec)
  n.seq <- length(seq)
  which(sapply(1:(n.vec-n.seq+1),function(i)all(head(vec[i:n.vec],n.seq)==seq)))
}
matchSequence(1:2,myVector)
# [1] 3 7
matchSequence(c(4,1,1),myVector)
# [1] 5
matchSequence(1:3,myVector)
# integer(0)
share|improve this answer

And a recursive idea:

find_pat = function(pat, x) 
{
    ff = function(.pat, .x, acc = if(length(.pat)) seq_along(.x) else integer(0L)) {
        if(!length(.pat)) return(acc)
        Recall(.pat[-1L], .x, acc[which(.pat[[1L]] == .x[acc])] + 1L)
    }

    return(ff(pat, x) - length(pat))
}
find_pat(1:2, myVector)
#[1] 3 7
find_pat(c(4, 1, 1), myVector)
#[1] 5
find_pat(1:3, myVector)
#integer(0)

And on a benchmark:

all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2), 
                               flm(s, my_vec2), 
                               find_pat(s, my_vec2), 
                               unit = "relative")
#Unit: relative
#                      expr      min       lq   median       uq      max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387   100
#           flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848   100
#      find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000  1.00000   100

Using larger data:

set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC), 
                               flm(PAT, VEC), 
                               find_pat(PAT, VEC), 
                               unit = "relative", times = 20)
#Unit: relative
#                    expr       min       lq    median        uq       max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634    20
#           flm(PAT, VEC)  2.810611  2.51955  2.963352  2.877195  1.728512    20
#      find_pat(PAT, VEC)  1.000000  1.00000  1.000000  1.000000  1.000000    20
share|improve this answer
    
Very impressive! –  Ken Benoit 6 hours ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.