How to index a vector sequence within a vector sequence

Question

I have a solution to a problem that involves looping, and works, but I feel I am missing something that involves a more efficient implementation. The problem: I have a numeric vector sequence, and want to identify the starting position(s) in another vector of the first vector.

It works like this:

# helper function for matchSequence
# wraps a vector by removing the first n elements and padding end with NAs
wrapVector <- function(x, n) {
    stopifnot(n <= length(x))
    if (n == length(x)) 
        return(rep(NA, n))
    else
        return(c(x[(n+1):length(x)], rep(NA, n)))
}

wrapVector(LETTERS[1:5], 1)
## [1] "B" "C" "D" "E" NA
wrapVector(LETTERS[1:5], 2)
## [1] "C" "D" "E" NA  NA

# returns the starting index positions of the sequence found in a vector
matchSequence <- function(seq, vec) {
    matches <- seq[1] == vec
    if (length(seq) == 1) return(which(matches))
    for (i in 2:length(seq)) {
        matches <- cbind(matches, seq[i] == wrapVector(vec, i - 1))
    }
    which(rowSums(matches) == i)
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence(1:2, myVector)
## [1] 3 7
matchSequence(c(4, 1, 1), myVector)
## [1] 5
matchSequence(1:3, myVector)
## integer(0)

Is there a better way to implement matchSequence()?

Added

"Better" here can mean using more elegant methods I didn't think of, but even better, would mean faster. Try comparing solutions to:

set.seed(100)
myVector2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
matchSequence(c(4, 1, 1), myVector2)
## [1]  12  48  91 120 252 491 499 590 697 771 865

microbenchmark::microbenchmark(matchSequence(c(4, 1, 1), myVector2))
## Unit: microseconds
##                                 expr     min       lq     mean   median       uq     max naval
## matchSequence(c(4, 1, 1), myVector2) 154.346 160.7335 174.4533 166.2635 176.5845 300.453   100

I tested the answers. Yours is 5x faster than Josh's; and 50x faster than mine and howard's on your example. — Frank, yesterday

thelatemail · Accepted Answer · 2015-10-09 02:00:33Z

Another attempt which I believe is quicker again. This owes its speed to only checking for matches from points in the vector which match the start of the searched-for sequence.

flm <- function(sq, vec) {
  hits <- which(sq[1]==vec)
  out <- hits[
    colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
  ]
  out[!is.na(out)]
}

Benchmark results:

#Unit: relative
#  expr      min       lq     mean   median       uq     max neval
# josh2 2.469769 2.393794 2.181521 2.353438 2.345911 1.51641   100
#    lm 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000   100

Nice! My Filter approach also checked only from such points, but yours is presumably much faster due to vectorization. — Frank, 23 hours ago

Josh O'Brien · Answer 2 · 2015-10-08 23:47:13Z

up vote 9 down vote

Here's a somewhat different idea:

f <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(apply(mm, 2, all))
}

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)

f(1:2, myVector)
# [1] 3 7
f(c(4,1,1), myVector)
# [1] 5
f(1:3, myVector)
# integer(0)

answered yesterday

Josh O'Brien
85.8k4139238

2

Very nice use of embed()! As @Frank states, this is slower, but only because of the apply(, , all). Although which(apply(mm, 2, all)) is more elegant, it's much slower. Change that line to which(colSums(matches) == length(vec)) and it becomes 10x faster. – Ken Benoit yesterday

add a comment |

Frank · Answer 3 · 2015-10-09 03:15:06Z

Another idea:

match_seq2 <- function(s,v){
  n  = length(s)
  nc = length(v)-n+1
  which(
    n == rowsum(
      as.integer(v[ rep(0:(n-1), nc) + rep(1:nc, each=n) ] == s),
      rep(seq(nc),each=n)
    )
  )
}

I tried a tapply version, but it was ~4x as slow.

First idea:

match_seq <- function(s, v) Filter( 
  function(i) all.equal( s, v[i + seq_along(s) - 1] ), 
  which( v == s[1] )
) 

# examples:
my_vec <- c(3, NA, 1, 2, 4, 1, 1, 2)
match_seq(1:2, my_vec)      # 3 7
match_seq(c(4,1,1), my_vec) # 5
match_seq(1:3, my_vec)      # integer(0)

I'm using all.equal instead of identical because the OP wants integer 1:2 to match numeric c(1,2). This approach introduces one more case by allowing for matching against points beyond the end of my_vec (which are NA when indexed):

match_seq(c(1,2,NA), my_vec) # 7

The OP's benchmark

# variant on Josh's, suggested by OP:

f2 <- function(seq, vec) {
    mm <- t(embed(vec, length(seq))) == rev(seq)  ## relies on recycling of seq
    which(colSums(mm)==length(seq))
}

my_check <- function(values) {
  all(sapply(values[-1], function(x) identical(values[[1]], x)))
}

set.seed(100)
my_vec2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
s       <- c(4,1,1)
microbenchmark(
    op = matchSequence(s, my_vec2), 
    josh = f(s, my_vec2), 
    josh2 = f2(s, my_vec2), 
    frank = match_seq(s, my_vec2), 
    frank2 = match_seq2(s, my_vec2), 
    jlh = matchSequence2(s, my_vec2),
    tlm = flm(s, my_vec2),
    unit = "relative", check=my_check)

Results:

Unit: relative
   expr        min         lq       mean     median         uq       max neval
     op   2.740224   2.696656   2.474477   2.579514   2.530742  1.320109   100
   josh  11.869979  11.088008  10.101765  10.632478  10.572857  2.204239   100
  josh2   2.294346   2.180026   2.042395   2.113340   2.108952  1.147430   100
  frank 127.457276 118.557030 107.296884 112.423461 113.092295 39.416594   100
 frank2   6.982603   6.542039   5.944226   6.237626   6.246001  1.671778   100
    jlh 158.573764 146.537297 129.315930 138.543959 137.649221 19.330479   100
    tlm   1.000000   1.000000   1.000000   1.000000   1.000000  1.000000   100

So thelatemails's wins! (Anyone else can feel free to update this benchmark if so inclined.)

Brilliant stuff! I accepted @thelatemail's answer on the basis of performance and parsimony but really appreciate your analysis and comparisons. Thanks all. — Ken Benoit, 13 hours ago

jlhoward · Answer 4 · 2015-10-09 00:03:41Z

Here's another way:

myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence <- function(seq,vec) {
  n.vec <- length(vec)
  n.seq <- length(seq)
  which(sapply(1:(n.vec-n.seq+1),function(i)all(head(vec[i:n.vec],n.seq)==seq)))
}
matchSequence(1:2,myVector)
# [1] 3 7
matchSequence(c(4,1,1),myVector)
# [1] 5
matchSequence(1:3,myVector)
# integer(0)

alexis_laz · Answer 5 · 2015-10-09 18:13:17Z

And a recursive idea:

find_pat = function(pat, x) 
{
    ff = function(.pat, .x, acc = if(length(.pat)) seq_along(.x) else integer(0L)) {
        if(!length(.pat)) return(acc)
        Recall(.pat[-1L], .x, acc[which(.pat[[1L]] == .x[acc])] + 1L)
    }

    return(ff(pat, x) - length(pat))
}
find_pat(1:2, myVector)
#[1] 3 7
find_pat(c(4, 1, 1), myVector)
#[1] 5
find_pat(1:3, myVector)
#integer(0)

And on a benchmark:

all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2), 
                               flm(s, my_vec2), 
                               find_pat(s, my_vec2), 
                               unit = "relative")
#Unit: relative
#                      expr      min       lq   median       uq      max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387   100
#           flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848   100
#      find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000  1.00000   100

Using larger data:

set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC), 
                               flm(PAT, VEC), 
                               find_pat(PAT, VEC), 
                               unit = "relative", times = 20)
#Unit: relative
#                    expr       min       lq    median        uq       max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634    20
#           flm(PAT, VEC)  2.810611  2.51955  2.963352  2.877195  1.728512    20
#      find_pat(PAT, VEC)  1.000000  1.00000  1.000000  1.000000  1.000000    20

asked	yesterday
viewed	226 times
active	today

current community

your communities

more stack exchange communities

How to index a vector sequence within a vector sequence

5 Answers 5

Your Answer

Not the answer you're looking for? Browse other questions tagged r performance vector or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

How to index a vector sequence within a vector sequence

5 Answers 5

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged r performance vector or ask your own question.

Related

Hot Network Questions