How can I tell if a string repeats itself in Python?

Question

I'm looking for a way to test whether or not a given string repeats itself for the entire string or not.

Examples:

[
    '0045662100456621004566210045662100456621',             # '00456621'
    '0072992700729927007299270072992700729927',             # '00729927'
    '001443001443001443001443001443001443001443',           # '001443'
    '037037037037037037037037037037037037037037037',        # '037'
    '047619047619047619047619047619047619047619',           # '047619'
    '002457002457002457002457002457002457002457',           # '002457'
    '001221001221001221001221001221001221001221',           # '001221'
    '001230012300123001230012300123001230012300123',        # '00123'
    '0013947001394700139470013947001394700139470013947',    # '0013947'
    '001001001001001001001001001001001001001001001001001',  # '001'
    '001406469760900140646976090014064697609',              # '0014064697609'
]

are strings which repeat themselves, and

[
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

are examples of ones that do not.

The repeating sections of the strings I'm given can be quite long, and the strings themselves can be 500 or more characters, so looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow. Multiply that by potentially hundreds of strings and I can't see any intuitive solution.

I've looked into regexes a bit and they seem good for when you know what you're looking for, or at least the length of the pattern you're looking for. Unfortunately, I know neither.

How can I tell if a string is repeating itself and if it is, what the shortest repeating subsequence is?

looping through each character trying to build a pattern then checking the pattern vs the rest of the string seems awful slow - but is it? — Tim Castelijns, Apr 6 at 23:19
possible duplicate of Writing a regex to detect repeat-characters — Avinash Raj, Apr 6 at 23:41
@AvinashRaj The OP is asking about all possible solutions. The question you link to only accepts regex solution. Note that regex may be able to solve the problem but in much more time than necessary. For example an optimal solution (i.e. linear time) would use the suffix tree of the text. You just have to find the longest repeated substring and do some checks on the lengths. — Bakuriu, Apr 7 at 16:28

David Zhang · Accepted Answer · 2015-04-08 13:57:57Z

up vote 90 down vote accepted

Here's a simple solution which appears to beat the currently accepted answer in performance:

def principal_period(s):
    i = (s+s).find(s, 1, -1)
    return None if i == -1 else s[:i]

Here are the results of @zero's revision of @davidism's benchmark, with the above function added:

average performance
principal_period: 0.011421355195050506
tigerhawk: 0.014430824859362344
davidism: 0.02022206339939885
zero: 0.029442115473107475
shashank: 0.04027841852348177 with failures
saksham_varma: 0.04634534486664279

This is based on the (somewhat clever) observation that a string repeats itself if and only if it is equal to a nontrivial rotation of itself. Kudos to @AleksiTorhamo for realizing that we can then recover the principal period from the index of s in (s+s)[1:-1], and for informing me of the optional start and end arguments of Python's string.find.

If you only need to check if a string is periodic, and don't need to extract the principal period, an even simpler/faster solution is possible:

def is_periodic(s):
    return s in (s+s)[1:-1]

edited 2 days ago

answered Apr 7 at 10:58

David Zhang
74448

5

You can extend this to find the shortest repeating subsequence by using .find() or .index() instead of in, eg. (s+s).find(s, 1, -1). – Aleksi Torhamo 2 days ago

6

Also, I think (s+s).find(s, 1, -1) will be (very slightly) faster than (s+s)[1:-1].find(s), at least for larger strings, since the slicing means you have to create another copy of (nearly) the entire string. – Aleksi Torhamo 2 days ago

7

This is brilliant....o_o Nothing more to say. – Shashank 2 days ago

5

It's like if you take a sin or cos wave from a periodic function and shift it to the right. Since it's fully periodic, the waves will eventually match up perfectly...The math parallels to this solution are just phenomenal. :) I wish I could give you +∞ upvotes. – Shashank 2 days ago

3

Guido's recent update to PEP 8 is relevant here: "Be consistent in return statements. Either all return statements in a function should return an expression, or none of them should. If any return statement returns an expression, any return statements where no value is returned should explicitly state this as return None, and an explicit return statement should be present at the end of the function (if reachable)." – Zero Piraeus yesterday

| show 11 more comments

Zero Piraeus · Answer 2 · 2015-04-08 23:20:11Z

Here's a solution using regular expressions.

import re

REPEATER = re.compile(r"(.+?)\1+$")

def repeated(s):
    match = REPEATER.match(s)
    return match.group(1) if match else None

Iterating over the examples in the question:

examples = [
    '0045662100456621004566210045662100456621',
    '0072992700729927007299270072992700729927',
    '001443001443001443001443001443001443001443',
    '037037037037037037037037037037037037037037037',
    '047619047619047619047619047619047619047619',
    '002457002457002457002457002457002457002457',
    '001221001221001221001221001221001221001221',
    '001230012300123001230012300123001230012300123',
    '0013947001394700139470013947001394700139470013947',
    '001001001001001001001001001001001001001001001001001',
    '001406469760900140646976090014064697609',
    '004608294930875576036866359447',
    '00469483568075117370892018779342723',
    '004739336492890995260663507109',
    '001508295625942684766214177978883861236802413273',
    '007518796992481203',
    '0071942446043165467625899280575539568345323741',
    '0434782608695652173913',
    '0344827586206896551724137931',
    '002481389578163771712158808933',
    '002932551319648093841642228739',
    '0035587188612099644128113879',
    '003484320557491289198606271777',
    '00115074798619102416570771',
]

for e in examples:
    sub = repeated(e)
    if sub:
        print("%r: %r" % (e, sub))
    else:
        print("%r does not repeat." % e)

... produces this output:

'0045662100456621004566210045662100456621': '00456621'
'0072992700729927007299270072992700729927': '00729927'
'001443001443001443001443001443001443001443': '001443'
'037037037037037037037037037037037037037037037': '037'
'047619047619047619047619047619047619047619': '047619'
'002457002457002457002457002457002457002457': '002457'
'001221001221001221001221001221001221001221': '001221'
'001230012300123001230012300123001230012300123': '00123'
'0013947001394700139470013947001394700139470013947': '0013947'
'001001001001001001001001001001001001001001001001001': '001'
'001406469760900140646976090014064697609': '0014064697609'
'004608294930875576036866359447' does not repeat.
'00469483568075117370892018779342723' does not repeat.
'004739336492890995260663507109' does not repeat.
'001508295625942684766214177978883861236802413273' does not repeat.
'007518796992481203' does not repeat.
'0071942446043165467625899280575539568345323741' does not repeat.
'0434782608695652173913' does not repeat.
'0344827586206896551724137931' does not repeat.
'002481389578163771712158808933' does not repeat.
'002932551319648093841642228739' does not repeat.
'0035587188612099644128113879' does not repeat.
'003484320557491289198606271777' does not repeat.
'00115074798619102416570771' does not repeat.

The regular expression (.+?)\1+$ is divided into three parts:

(.+?) is a matching group containing at least one (but as few as possible) of any character (because +? is non-greedy).
\1+ checks for at least one repetition of the matching group in the first part.
$ checks for the end of the string, to ensure that there's no extra, non-repeating content after the repeated substrings (and using re.match() ensures that there's no non-repeating text before the repeated substrings).

In Python 3.4 and later, you could drop the $ and use re.fullmatch() instead, or (in any Python at least as far back as 2.3) go the other way and use re.search() with the regex ^(.+?)\1+$, all of which are more down to personal taste than anything else.

I have no idea why this concise two liner is not the highest-voted answer. The other answers are not bad, but this one is far better. (It may use the frequently denigrated regular expression, but I can inspect this far easier than the other much longer answers, which are full of nested blocks, potential typos, off-by-one errors, etc.) Ah well, worse is better I suppose. — Paul Draper, Apr 7 at 20:46
I think there are two main reasons for that: 1) some programmers like math more than they like regexes, and 2) since varying the length and nature of the input strings makes different answers pull ahead on performance, the super-long edge case strings (which may not even appear in the real data) make this solution appear suboptimal. — TigerhawkT3, Apr 7 at 21:26
@PaulDraper: Guess what regex is doing behind the scene? it is parsing the string and stores each element until a posible reapeatition match occurs. Thats the same what OP statet as too slow. so just because it is a 2 liner there isn't any performance win. — Zaibis, yesterday
@Zaibis, I'd normally agree, but this is both the shortest and fastest solution (stackoverflow.com/a/29482936/1212596)....Except for David's, which was posted after I made that comment. I actually like David's approach more (clever!). — Paul Draper, yesterday
@jwg the edge cases in the benchmark concerned (which I helped to develop after @davidism first created it) are artificial, and very unlikely to occur: ten thousand 0s followed by a 1, for example (which is the far outlier on the chart). None of the other edge cases have anything like such poor performances in fact, despite being designed to trip up algorithms like this one. However, David Zhang's answer is clearly superior, which is why I started a bounty to promote it. — Zero Piraeus, 13 hours ago

Shashank · Answer 3 · 2015-04-08 02:41:04Z

You can make the observation that for a string to be considered repeating, its length must be divisible by the length of its repeated sequence. Given that, here is a solution that generates divisors of the length from 1 to n / 2 inclusive, divides the original string into substrings with the length of the divisors, and tests the equality of the result set:

def divisors(n):
    if n > 1: yield 1
    from math import sqrt, floor
    kq = []
    for k in range(2, int(floor(sqrt(n))) + 1):
        q, r = divmod(n, k)
        if r == 0:
            yield k
            kq.append(q)
    while kq:
        yield kq.pop()

def repeats(s):
    l = len(s)
    for d in divisors(l):
        spl = (s[i:i+d] for i in range(0, l, d))
        spl0 = next(spl)
        if all(x == spl0 for x in spl):
            return spl0
    return None

EDIT: In Python 3, the / operator has changed to do float division by default. To get the int division from Python 2, you can use the // operator instead. Thank you to @TigerhawkT3 for bringing this to my attention.

The // operator performs integer division in both Python 2 and Python 3, so I've updated the answer to support both versions. The part where we test to see if all the substrings are equal is now a short-circuiting operation using all and a generator expression.

UPDATE: In response to a change in the original question, the code has now been updated to return the smallest repeating substring if it exists and None if it does not. @godlygeek has suggested using divmod to reduce the number of iterations on the divisors generator, and the code has been updated to match that as well. It now returns all positive divisors of n in ascending order, exclusive of n itself.

This solution is amazing. You could change the divisors method to follow the produce-consumer pattern. So that it yield's results as they are found. Will be a shame if this isn't the highest answer. Everything else is brute force. — JustinDanielson, Apr 6 at 23:33
@JustinDanielson It returns a generator object created from a generator expression, which is an implicit producer :) It will lazy-evaluate the divisors. — Shashank, Apr 6 at 23:34
Ohh. I didn't know that. Well, even better then. :D I understand why you want to avoid sqrt, but you could use n/2 as the upper bound for the divisor range. — JustinDanielson, Apr 6 at 23:36
@JustinDanielson Thanks for the suggestion, the range upper bound is now (n/2) inclusive. — Shashank, Apr 6 at 23:42

13 revs, 4 users 60% · Answer 4 · 2015-04-09 17:25:50Z

Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.

I had to modify some functions to work on Python 3. If you see something wrong, want to add your function, or want to add another test string, let me know.

In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by at least 5x for the large example set.

A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.

Here are the results, including plots made (using the modified code here) using matplotlib and seaborn to show the different distributions:

Corpus 1 (supplied examples - small set)

mean performance:
 0.0006  david_zhang
 0.0017  zero
 0.0026  antti
 0.0029  carpetpython
 0.0034  tigerhawk_2
 0.0059  tigerhawk_1
 0.0062  davidism
 0.0068  saksham
 0.0100  riad
 0.0186  shashank

median performance:
 0.0006  david_zhang
 0.0016  zero
 0.0024  antti
 0.0027  carpetpython
 0.0032  tigerhawk_2
 0.0055  tigerhawk_1
 0.0061  davidism
 0.0068  saksham
 0.0103  riad
 0.0179  shashank

Corpus 1 graph Higher resolution image

Corpus 2 (supplied examples - large set)

mean performance:
 0.0011  david_zhang
 0.0072  antti
 0.0078  zero
 0.0079  carpetpython
 0.0133  davidism
 0.0147  tigerhawk_2
 0.0225  shashank
 0.0243  tigerhawk_1
 0.0350  riad
 0.0578  saksham

median performance:
 0.0007  david_zhang
 0.0036  zero
 0.0044  antti
 0.0048  carpetpython
 0.0072  tigerhawk_2
 0.0086  davidism
 0.0121  tigerhawk_1
 0.0151  riad
 0.0197  shashank
 0.0214  saksham

Corpus 2 graph Higher resolution image

Corpus 3 (edge cases)

mean performance:
 0.0750  david_zhang
 0.0779  carpetpython
 0.0955  antti
 0.3300  tigerhawk_2
 0.4258  tigerhawk_1
 0.4506  davidism
 1.3021  saksham
 1.3510  shashank
 7.1807  zero
11.5788  riad

median performance:
 0.0215  antti
 0.0218  carpetpython
 0.0221  tigerhawk_2
 0.0232  david_zhang
 0.0266  tigerhawk_1
 0.0469  saksham
 0.1325  davidism
 0.1961  zero
 2.1620  shashank
 3.6235  riad

Corpus 3 graph Higher resolution image

The tests and raw results are available here.

I read your other answer and I think you made a great observation that you can split the string into two parts first to avoid computations! :) I'm not sure if my solution returning 'aa' on that one test is 'failing it'. It would only be failing it if the requirement were to return the shortest repeating substring. Either way, nice benchmark, I liked it. — Shashank, Apr 7 at 3:13
@Shashank based on some discussion in chat, the op wanted the shortest repetition if possible. There was no expectation for anyone to know that, but I had to test against something. — davidism, Apr 7 at 3:14
I made an optimization to my code which makes it so that it uses a generator instead of a list comprehension for spl. This should avoid quite a bit of computations and in my opinion will make it a bit faster (though probably still not as fast as your solution which has the neat idea to divide the string repeatedly by 2 and cleverly use str.split). — Shashank, Apr 7 at 4:25
Mine is faster than yours on 342 on those inputs but @zero's algo makes it really long to re-run the test suite :P — Antti Haapala, Apr 7 at 6:31
@GrijeshChauhan the code which creates the chart is linked in the answer, but here it is again :-) It uses pandas, matplotlib and seaborn. — Zero Piraeus, 12 hours ago

TigerhawkT3 · Answer 5 · 2015-04-08 20:43:52Z

up vote 26 down vote

Non-regex solution:

def repeat(string):
    for i in range(1, len(string)//2+1):
        if not len(string)%len(string[0:i]) and string[0:i]*(len(string)//len(string[0:i])) == string:
            return string[0:i]

Faster non-regex solution, thanks to @ThatWeirdo (see comments):

def repeat(string):
    l = len(string)
    for i in range(1, len(string)//2+1):
        if l%i: continue
        s = string[0:i]
        if s*(l//i) == string:
            return s

The above solution is very rarely slower than the original by a few percent, but it's usually a good bit faster - sometimes a whole lot faster. It's still not faster than davidism's for longer strings, and zero's regex solution is superior for short strings. It comes out to the fastest (according to davidism's test on github - see his answer) with strings of about 1000-1500 characters. Regardless, it's reliably second-fastest (or better) in all cases I tested. Thanks, ThatWeirdo.

Test:

print(repeat('009009009'))
print(repeat('254725472547'))
print(repeat('abcdeabcdeabcdeabcde'))
print(repeat('abcdefg'))
print(repeat('09099099909999'))
print(repeat('02589675192'))

Results:

009
2547
abcde
None
None
None

edited 2 days ago

answered Apr 6 at 23:20

TigerhawkT3
1,917310

Isn't this a brute force solution? – JustinDanielson Apr 6 at 23:23

5

@JustinDanielson Yes. But a solution none the less. – Quirliom Apr 6 at 23:37

3

I'm seeing about 1e-5 to 3e-5 seconds for short strings, 3e-5 to 4e-5 seconds for successful long (1000-character) strings, and a little bit under a millisecond for unsuccessful long strings (worst case). So a thousand 1000-character strings would take about a second. Compared to the math answer, this finds matches 10 times faster, but takes 3 times longer to fail. – TigerhawkT3 Apr 6 at 23:54

1

len(string[0:i]) is always equal to i (in this case at least). Replacing these, and also saving len(string) and string[0:i] in variables might speed things up. Also IMO this is a great solution, awesome ;) – ThatWeirdo Apr 7 at 20:25

1

@ThatWeirdo: see the edits. Your suggestions make this solution the fastest (based on davidism's gist on github) for strings of around 1000-1500 characters, and second-fastest for shorter or longer strings. :) – TigerhawkT3 Apr 7 at 21:01

| show 7 more comments

davidism · Answer 6 · 2015-04-08 13:13:06Z

First, halve the string as long as it's a "2 part" duplicate. This reduces the search space if there are an even number of repeats. Then, working forwards to find the smallest repeating string, check if splitting the full string by increasingly larger sub-string results in only empty values. Only sub-strings up to length // 2 need to be tested since anything over that would have no repeats.

def shortest_repeat(orig_value):
    if not orig_value:
        return None

    value = orig_value

    while True:
        len_half = len(value) // 2
        first_half = value[:len_half]

        if first_half != value[len_half:]:
            break

        value = first_half

    len_value = len(value)
    split = value.split

    for i in (i for i in range(1, len_value // 2) if len_value % i == 0):
        if not any(split(value[:i])):
            return value[:i]

    return value if value != orig_value else None

This returns the shortest match or None if there is no match.

Antti Haapala · Answer 7 · 2015-04-09 07:51:22Z

up vote 9 down vote

This version tries only those candidate sequence lengths that are factors of the string length; and uses the * operator to build a full-length string from the candidate sequence:

def get_shortest_repeat(string):
    length = len(string)
    for i in range(1, length // 2 + 1):
        if length % i:  # skip non-factors early
            continue

        candidate = string[:i]
        if string == candidate * (length // i):
            return candidate

    return None

Thanks to TigerhawkT3 for noticing that length // 2 without + 1 would fail to match the abab case.

edited yesterday

answered Apr 7 at 6:13

Antti Haapala
23.2k73369

This solution is indeed virtually identical to my optimized one. I see that you have a range limit of length//2, just like I did - you have to change that to length//2+1 if you want to catch strings that are exactly doubled (e.g. 'aabaab'). – TigerhawkT3 2 days ago

Ooops, you're correct :D – Antti Haapala yesterday

And now they're identical! \o/ I need to pay more attention to optimization in the future, but I'm glad the algorithm itself was sound. – TigerhawkT3 yesterday

@TigerhawkT3 yeah, but David Zhang beat us both with cleverness :D – Antti Haapala yesterday

add a comment |

Saksham Varma · Answer 8 · 2015-04-07 06:39:53Z

up vote 7 down vote

Here's a straight forward solution, without regexes.

For substrings of s starting from zeroth index, of lengths 1 through len(s), check if that substring, substr is the repeated pattern. This check can be performed by concatenating substr with itself ratio times, such that the length of the string thus formed is equal to the length of s. Hence ratio=len(s)/len(substr).

Return when first such substring is found. This would provide the smallest possible substring, if one exists.

def check_repeat(s):
    for i in range(1, len(s)):
        substr = s[:i]
        ratio = len(s)/len(substr)
        if substr * ratio == s:
            print 'Repeating on "%s"' % substr
            return
    print 'Non repeating'

>>> check_repeat('254725472547')
Repeating on "2547"
>>> check_repeat('abcdeabcdeabcdeabcde')
Repeating on "abcde"

edited Apr 7 at 6:39

answered Apr 7 at 1:55

Saksham Varma
4607

Your solution works well, I added it to the benchmark. It's only noticeably slow with long strings that don't repeat. – davidism Apr 7 at 6:01

@davidism Thanks! Yeah for long non repeating strings, it would have to check for substrings of all lengths, which would make it slow. – Saksham Varma Apr 7 at 6:32

Now that I look at this one carefully, it seems to be nearly identical to my originally posted (before any edits) solution, with the only differences being saving the length and the substring. I guess I had a pretty good algorithm. :P – TigerhawkT3 4 hours ago

@TigerhawkT3 Yeah indeed! :) – Saksham Varma 4 hours ago

add a comment |

RiaD · Answer 9 · 2015-04-08 17:44:42Z

up vote 4 down vote

The problem may also be solved in O(n) in worst case with prefix function.

Note, it may be slower in general case(UPD: and is much slower) than other solutions which depend on number of divisors of n, but usually find fails sooner, I think one of bad cases for them will be aaa....aab, where there are n - 1 = 2 * 3 * 5 * 7 ... *p_n - 1 a's

First of all you need to calculate prefix function

def prefix_function(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    return pi

then either there's no answer or the shortest period is

k = len(s) - prefix_function(s[-1])

and you just have to check if k != n and n % k == 0 (if k != n and n % k == 0 then answer is s[:k], else there's no answer

You may check the proof here (in Russian, but online translator will probably do the trick)

def riad(s):
    n = len(s)
    pi = [0] * n
    for i in xrange(1, n):
        j = pi[i - 1]
        while(j > 0 and s[i] != s[j]):
            j = pi[j - 1]
        if (s[i] == s[j]):
            j += 1
        pi[i] = j;
    k = n - pi[-1]
    return s[:k] if (n != k and n % k == 0) else None

edited 2 days ago

answered 2 days ago

RiaD
17.4k52861

Your prefix_function() is not valid Python: you have missing colons on your while and if statements, and && instead of and. After fixing those, it fails with UnboundLocalError: local variable 'i' referenced before assignment because of the line for i in range(i, n):. – Zero Piraeus 2 days ago

1

Zero, thanks, fixed these mistakes – RiaD 2 days ago

Thanks :-) If you can put together a function that uses your prefix_function() to return similar results to the other answers – either the shortest substring or None – I'll include it in a revised benchmark I'm putting together. – Zero Piraeus 2 days ago

@ZeroPiraeus, added – RiaD 2 days ago

Sorry, it doesn't work right now, I'll check – RiaD 2 days ago

| show 2 more comments

CarpetPython · Answer 10 · 2015-04-09 16:21:29Z

up vote 2 down vote

I started with more than eight solutions to this problem. Some were bases on regex (match, findall, split), some of string slicing and testing, and some with string methods (find, count, split). Each had benefits in code clarity, code size, speed and memory consumption. I was going to post my answer here when I noticed that execution speed was ranked as important, so I did more testing and improvement to arrive at this:

def repeating(s):
    size = len(s)
    incr = size % 2 + 1
    for n in xrange(1, size//2+1, incr):
        if size % n == 0:
            if s[:n] * (size//n) == s:
                return s[:n]

This answer seems similar to a few other answers here, but it has a few speed optimisations others have not used:

xrange is a little faster in this application,
if an input string is an odd length, do not check any even length substrings,
by using s[:n] directly, we avoid creating a variable in each loop.

I would be interested to see how this performs in the standard tests with common hardware. I believe it will be well short of David Zhang's excellent algorithm in most tests, but should be quite fast otherwise.

I found this problem to be very counter-intuitive. The solutions I thought would be fast were slow. The solutions that looked slow were fast! It seems that Python's string creation with the multiply operator and string comparisons are highly optimised.

edited yesterday

answered yesterday

CarpetPython
1395

Not bad at all :-) The benchmark runs on Python 3.4 (partly because OP doesn't specify a version and that's what everyone should be using, and partly because it uses the new statistics module), so I had to change your /s to //s and replace xrange() with range() (which behaves like 2.x's xrange() in 3.x). – Zero Piraeus yesterday

Here are the revisions to the benchmark, so you can review my changes, by the way. – Zero Piraeus yesterday

Thanks Zero. That was fast. The results were slightly down on my predictions. I suspect the techniques I used for speed in Python 2.7 are not very effective in Python 3.4. Oh, well - a fun and educational exercise. – CarpetPython yesterday

// in 3.x is integer division (just like the 2.x behaviour of /), while 3.x's / is float division (which I'm sure would be much slower even if it didn't break your solution by causing an attempt to use a float as an index). As mentioned, 3.x's range() is the same thing as 2.x's xrange(); no equivalent of 2.x's range() exists in 3.x. So I don't think that's the cause of any discrepancy between the benchmark and any timings you made. It's probably just that 3.x is slower overall than 2.x (or maybe your machine is faster than mine). – Zero Piraeus yesterday

When I get some time, I shall have a close look at the run-time differences between Python 2 and Python 3. – CarpetPython yesterday

| show 3 more comments

Sopel · Answer 11 · 2015-04-09 12:14:58Z

I think the fastest solution would be to find a biggest common prefix-suffix of the given string. Then the string is made of at least 2 repeats when both of the following conditions are met:

length of the biggest common prefix-suffix is at least half of the length of the string (prefixLength*2 >= stringLength)
length of the prefix (or suffix) has a common divisor with string length. (stringLength % (stringLength - prefixLength) == 0)

Sorry for not giving a Python code, but it shouldn't be hard to rewrite.

#include <iostream>
#include <vector>
#include <string>

std::string biggestPrefixSuffix(const std::string& string) //using Morris-Pratt algorithm
{
    int stringLength = string.size();
    std::vector<int> pi(stringLength + 1);
    int t = -1;
    pi[0] = -1;

    for(int i = 1; i <= stringLength; ++i)
    {
        while((t > -1) && (string[t] != string[i - 1]))
        {
            t = pi[t];
        }
        ++t;
        pi[i] = t;
    }
    return std::string(string, 0, pi.back());
}
int main()
{
    std::string strings[] =
    {

        "0045662100456621004566210045662100456621",             //# '00456621'
        "0072992700729927007299270072992700729927",             //# '00729927'
        "001443001443001443001443001443001443001443",           //# '001443'
        "037037037037037037037037037037037037037037037",        //# '037'
        "047619047619047619047619047619047619047619",           //# '047619'
        "002457002457002457002457002457002457002457",           //# '002457'
        "001221001221001221001221001221001221001221",           //# '001221'
        "001230012300123001230012300123001230012300123",        //# '00123'
        "0013947001394700139470013947001394700139470013947",    //# '0013947'
        "001001001001001001001001001001001001001001001001001",  //# '001'
        "001406469760900140646976090014064697609",              //# '0014064697609'

        "004608294930875576036866359447",
        "00469483568075117370892018779342723",
        "004739336492890995260663507109",
        "001508295625942684766214177978883861236802413273",
        "007518796992481203",
        "0071942446043165467625899280575539568345323741",
        "0434782608695652173913",
        "0344827586206896551724137931",
        "002481389578163771712158808933",
        "002932551319648093841642228739",
        "0035587188612099644128113879",
        "003484320557491289198606271777",
        "00115074798619102416570771",
    };
    for(const auto& s : strings)
    {
        std::string prefixSuffix = biggestPrefixSuffix(s);
        std::cout << s << " --> ";
        if(s.length() % (s.length() - prefixSuffix.length()) == 0 && prefixSuffix.length() * 2 >= s.length())
        {
            std::cout << std::string(s, prefixSuffix.length());
        }
        else
        {
            std::cout << "string does not repeat";
        }
        std::cout << '\n';
    }
    return 0;
}

outputs:

0045662100456621004566210045662100456621 --> 00456621
0072992700729927007299270072992700729927 --> 00729927
001443001443001443001443001443001443001443 --> 001443
037037037037037037037037037037037037037037037 --> 037
047619047619047619047619047619047619047619 --> 047619
002457002457002457002457002457002457002457 --> 002457
001221001221001221001221001221001221001221 --> 001221
001230012300123001230012300123001230012300123 --> 00123
0013947001394700139470013947001394700139470013947 --> 0013947
001001001001001001001001001001001001001001001001001 --> 001
001406469760900140646976090014064697609 --> 0014064697609
004608294930875576036866359447 --> string does not repeat
00469483568075117370892018779342723 --> string does not repeat
004739336492890995260663507109 --> string does not repeat
001508295625942684766214177978883861236802413273 --> string does not repeat
007518796992481203 --> string does not repeat
0071942446043165467625899280575539568345323741 --> string does not repeat
0434782608695652173913 --> string does not repeat
0344827586206896551724137931 --> string does not repeat
002481389578163771712158808933 --> string does not repeat
002932551319648093841642228739 --> string does not repeat
0035587188612099644128113879 --> string does not repeat
003484320557491289198606271777 --> string does not repeat
00115074798619102416570771 --> string does not repeat

@edit I just realized that RiaD may be talking about the same approach.

Yes, I think this may be the same as RiaD's solution (which was extremely slow for many cases) - and, yes, the question was for a Python solution. — TigerhawkT3, yesterday

asked	4 days ago
viewed	6999 times
active	yesterday

current community

your communities

more stack exchange communities

How can I tell if a string repeats itself in Python?

This question has an open bounty worth +150 reputation from Zero Piraeus ending in 5 days.

11 Answers 11

Your Answer

Not the answer you're looking for? Browse other questions tagged python string pattern-matching or ask your own question.

Linked

Hot Network Questions

current community

your communities

more stack exchange communities

How can I tell if a string repeats itself in Python?

This question has an open bounty worth +150 reputation from Zero Piraeus ending in 5 days.

11 Answers 11

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged python string pattern-matching or ask your own question.

Linked

Related

Hot Network Questions