8

Consider these two strings:

string1 <- "GCTCCC...CTCCATGAAGTA...CTTCACATCCGTGT.CCGGCCTGGCCGCGGAGAGCCC"
string_reference <- "GCTCCC...CTCCATGAAGTATTTCTTCACATCCGTGT.CCGGCCTGGCCGCGGAGAGCCC"

How do I easily remove the dots in "string1", but only those dots that are in the same position in "string_reference"?

Expected output:

string1 = "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
  • simple loop stepping through a character at a time...
    – Rob
    Commented Mar 24, 2014 at 22:48

4 Answers

7

I'd just use R's truly vectorised subsetting and logical comparison methods...

# Split the strings
x <- strsplit( c( string1 , string_reference ) , "" )
# Compare and remove dots from string1 when dots also appear in the reference string at the same position
paste( x[[1]][ ! (x[[2]]== "." & x[[1]] == ".") ] , collapse = "" )
#[1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
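
To make the logical mask easier to follow, here is a minimal sketch on two short made-up strings. The names `s`, `ref`, `chars` and `shared_dot` are illustrative only and do not come from the question; the sketch assumes both strings have the same number of characters.

```r
# Toy inputs (hypothetical, not from the question): positions 2 and 5 hold a
# dot in both strings; position 3 holds a dot only in `s`.
s   <- "A..B.C"
ref <- "A.TB.C"

# Split both strings into single characters
chars <- strsplit(c(s, ref), "")

# TRUE where both strings have a dot at the same position
shared_dot <- chars[[1]] == "." & chars[[2]] == "."

# Keep every character of `s` that is not a shared dot
result <- paste(chars[[1]][!shared_dot], collapse = "")
result
# [1] "A.BC"
```

The dot at position 3 survives because the reference has a "T" there, which mirrors the "..." versus "TTT" stretch in the question.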
  • 2
    Simon, I think the user wants to remove the dots that appear in the same position (hence remove the one at 39, and the first set of dots as well). That said, I wouldn't bet my life on it...
    – BrodieG
    Commented Mar 25, 2014 at 0:13
  • But +1 for the simpler use of subsetting.
    – BrodieG
    Commented Mar 25, 2014 at 0:15
  • @BrodieG of course! And actually, that is exactly what my code does, I just posted the result of an old expression up there not what my command actually did. Cheers! Commented Mar 25, 2014 at 8:23
6

Similar to Robert's, but the "vectorized" version:

s1 <- unlist(strsplit(string1, ""))
s2 <- unlist(strsplit(string_reference, ""))
paste0(Filter(Negate(is.na), ifelse(s1 == s2 & s1 == ".", NA, s1)), collapse="")
# [1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"

I put "vectorized" in quotes because the vectorization happens over the characters within a single string, not over the elements of a character vector. This assumes string1 and string_reference each contain exactly one element. With multiple elements you would have to loop over the results of strsplit.
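
If the inputs did hold several strings, the loop over the strsplit results could look roughly like the sketch below. `remove_shared_dots` is a hypothetical helper name, not something from the answers, and the sketch assumes the two vectors have the same length and that paired elements have the same number of characters.

```r
# Hypothetical helper: for each element of `x`, drop the dots that sit at the
# same position as a dot in the matching element of `ref`.
remove_shared_dots <- function(x, ref) {
  # Assumption: inputs are paired element by element and character by character
  stopifnot(length(x) == length(ref), all(nchar(x) == nchar(ref)))
  xs <- strsplit(x, "")
  rs <- strsplit(ref, "")
  # Loop over the elements; the comparison inside is vectorised over characters
  vapply(seq_along(xs), function(i) {
    keep <- !(xs[[i]] == "." & rs[[i]] == ".")
    paste(xs[[i]][keep], collapse = "")
  }, character(1))
}

remove_shared_dots(c("A..B.C", "..GG"), c("A.TB.C", ".TGG"))
# [1] "A.BC" ".GG"
```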

  • Great! paste0 is redundant since collapse = "". Commented Mar 25, 2014 at 3:14
  • @RobertKrzyzanowski, true, although it's not because collapse is "", rather, it's because there is only one vector.
    – BrodieG
    Commented Mar 25, 2014 at 12:51
5

Using intersect to find the positions where both strings have a dot:

cutpos <- do.call(
  intersect,
  sapply(list(string_reference, string1), gregexpr, pattern = ".", fixed = TRUE)
)
paste(strsplit(string1,"",fixed=TRUE)[[1]][-cutpos],collapse="")
#[1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"

A small variation of the above (courtesy of @Arun):

attr(cutpos, 'match.length') <- rep(1L, length(cutpos))
attr(cutpos, 'useBytes') <- TRUE

do.call(paste0, c(regmatches(string1, list(cutpos), invert=TRUE), collapse=""))
## [1] "GCTCCCCTCCATGAAGTA...CTTCACATCCGTGTCCGGCCTGGCCGCGGAGAGCCC"
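
One caveat with the negative-index form `[-cutpos]`: in R, negative indexing with a zero-length vector selects nothing, and gregexpr returns -1 when there is no match, so inputs without any shared dot would not come back unchanged. A minimal sketch of a guard, assuming single-element inputs of equal length (`drop_shared_dots` is a hypothetical name, not part of the answer):

```r
# Hypothetical helper: same idea as the intersect approach, but safe when the
# two strings share no dot positions.
drop_shared_dots <- function(s, ref) {
  # Positions of literal dots in each string; gregexpr() gives -1 for no match
  pos_s   <- gregexpr(".", s,   fixed = TRUE)[[1]]
  pos_ref <- gregexpr(".", ref, fixed = TRUE)[[1]]
  # Discard the -1 "no match" marker before intersecting
  cutpos  <- intersect(pos_s[pos_s > 0], pos_ref[pos_ref > 0])

  chars <- strsplit(s, "", fixed = TRUE)[[1]]
  # setdiff() keeps all positions when cutpos is empty, unlike chars[-cutpos]
  paste(chars[setdiff(seq_along(chars), cutpos)], collapse = "")
}

drop_shared_dots("A..B.C", "A.TB.C")  # shared dots at positions 2 and 5
# [1] "A.BC"
drop_shared_dots("A.BC", "ATBC")      # no shared dots: returned unchanged
# [1] "A.BC"
```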
1

Use:

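# Split both strings into single characters
string1v <- strsplit(string1, "")[[1]]
string_referencev <- strsplit(string_reference, "")[[1]]
# The position-by-position comparison only makes sense for equal lengths
stopifnot(length(string1v) == length(string_referencev))
# Replace each shared dot with an empty string, keep everything else
finalstring <- paste(vapply(seq_along(string1v), function(ind) {
  if (string1v[ind] == '.' && string_referencev[ind] == '.') ''
  else string1v[ind]
}, character(1)), collapse = "")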
