Homework / toy program - character matching

Question

Given a string A = "Hello world" and another string B = "wor", I have to print ("TRUE" or "FALSE") whether the characters in B are present in A. For example, if B = "ollw l" the result should be TRUE; if B = "worlds" the result should be FALSE; if B = "ww", result is FALSE. It has to be done in O(n) time complexity.

My solution is given below. Although this runs in O(n) time I would like some suggestions on how to improve.

int main(void)
{
        char *A = "Hello world";
        char *B = "ollw l";
        int table['z'] = {0};

        while (*A != '\0') {
                table[*A] += 1;
                A++;
        }

        while (*B != '\0') {
                table[*B] -= 1;
                if (table[*B] < 0) {
                        printf("%s\n", "FALSE");
                        return 0;
                }
                B++;
        }
        printf("%s\n", "TRUE");
}

Welcome to Code Review! Please don't forget to delete your question on Stack Overflow so you won't be accused of cross-posting. — 5gon12eder, 11 hours ago
@5gon12eder Yeah, i have flagged my question for deletion there. I can't delete on SO as someone has answered my question. — babon, 11 hours ago
@babon int table['z'] looks weird. what's your assumption about the actual size? — πάντα ῥεῖ, 11 hours ago
@babon "Are you saying it should be int table['z' + 1]?" No, I mean you should cover slots for any possible char value (or even better unsigned char value). — πάντα ῥεῖ, 9 hours ago

5gon12eder · Answer 1 · 2017-01-07 15:54:01Z

Your algorithm looks alright

The algorithmic idea seems correct and fit for meeting the stated requirements. There are also no unreasonable constant factors or excessive resource usage just to meet the required asymptotic complexity. Overall, I think you did the right thing.

Try to nail down the requirements

I'm not entirely sure what your function is supposed to do. Should it only take into account letters or also other characters like white-space and punctuation? Should it be case-sensitive or insensitive? Either choice is reasonable but one has to be made.

The following item is related to this.

Know your range

You're using an array from 0 (inclusive) to z (exclusive). As others have already pointed out in comments, this means an out of bounds access if the character z occurs in your input which is almost certainly a bug. Adding 1 to the array size might seem to fix the issue but I'd suggest to look more closer at the range of valid inputs.

First, you have to define your range of valid characters. This might be the whole range of the char data type. Which is implementation-defined so you should use the CHAR_MIN and CHAR_MAX macros defined in <limits.h>. (Your input actually can never contain the value 0 because you'd treat it as the sentinel value but wasting that one index seems reasonable.)

If your input range is smaller than that (for example, only ASCII characters or only alphanumeric characters), you should do two things.

Verify that the input is in range and do some reasonable error handling if it is not. If you decide to give your function a narrow contract, invalid inputs mean undefined behavior and you can use the assert macro that is only active in debug builds and gain a little performance in release builds.
Subtract the lowest permissible character from the character at hand to get the index in your array.

For the ASCII character set, the range is [0, 128) and [32, 126) for the printable sub-set. (It would probably be more readable to use ' ' (space) and '~' + 1 in code instead of these magic numbers.)

Be warned that the C standard doesn't require that the character set is ASCII and the ASCII math might be wrong for other encodings. Some encodings also have multi-byte characters and those will cause you never-ending nightmares if you wish to handle them correctly.

Beware of integral promotion

If your input can be the full range of the char data type, there is a nasty culprit here. The C standard doesn't specify whether char is signed or unsigned. If your code is used on an implementation where char is signed, the byte 0xff has the numerical value –1 and your index calculation will blow up. The usual way to deal with this is to explicitly cast to unsigned char before using its value.

Consider saving some space – or maybe don't

You're currently using an int to count the frequency of a character. This is likely a reasonable choice but I'd like to challenge you to justify it.

If you want to be on the absolute safe side, a size_t will guarantee that you can even deal with the largest possible array, filled with the same character. If you go for size_t, be aware that it is an unsigned type so you cannot check for “less than zero” after the subtraction but have to check for “equal to zero” before subtracting.

If you want to preserve space (and thus get lower cache usage) a smaller data type will perform better but you'll get undefined behavior if its range is exceeded.

Any decision can be justified but you should be aware of its implications. If your function makes assumptions about the maximum frequency of a character, its documentation should state this limitation.

Separate different concerns into different functions

You have put all code into main. This is obviously not a very modular design and hard to unit-test. I'd recommend that you break your logic into at least three functions.

One function that takes two character arrays and checks whether the former contains all characters from the latter. It doesn't print anything but returns a boolean value.
One function that converts a boolean value into a pointer to a statically allocated character array with a human-readable text. This can be as simple as return value ? "true" : "false";.
One function that obtains (or hard-codes) the input, calls the first function to get the result and then prints the human-readable representation obtained from the second function. (You can leave this logic in main if that's all your program does.)

You have some interesting points regarding the character set issues. +1 for that. — πάντα ῥεῖ, 10 hours ago

πάντα ῥεῖ · Answer 2 · 2017-01-07 16:21:33Z

As mentioned in 5gon12eder's answer the algorithm itself seems to be good and efficient. Setting up a table with the count of characters is a good idea to solve that problem efficiently.
Sorry, I missed to state that with my initial answer.

Though there are a number of points to be considered:

1. Don't make assumptions on the char set used

As I've menitoned in my comment

int table['z'] = {0};

is probably a bad idea:

Don't assume an ASCII character set blindly, there are still others like EBCDIC in use and may screw up your assumptions
If the input contains any other characters than expected you may hit undefined behavior because of the possibility that table is accessed out of range.

The better, and more generic way to adjust the possibly needed table array size is

int table[1 << CHAR_BIT] = {0};

The CHAR_BIT macro is accessible from limits.h.

2. Factor out your functionality to a separate function

This makes your code isolated, reusable and easier to test.

int check_match(char* A, char* B) {
    int table[1 << CHAR_BIT] = {0};
    while (*A != '\0') {
            table[*A] += 1;
            A++;
    }

    while (*B != '\0') {
        table[*B] -= 1;
        if (table[*B] < 0) {
            return 0;
            }
        B++;
    }
    return 1;
}

int main() {
    printf("%s\n",check_match("Hello world","ollw l")?"TRUE":"FALSE");
    printf("%s\n",check_match("Hello world","ww l")?"TRUE":"FALSE");
}

See Live Demo

3. Don't loose pointers initialized from literals

When you are incrementing your char* pointers A and B you are loosing the original literal reference.

That may be unwanted in some more complicated code that needs to refer to the literals later again.

Also make clear that the literals shouldn't be overwritten using the const keyword, since this would lead to undefined behavior:

const char *A = "Hello world";
const char *B = "ollw l";

4. Consider to use the current c standard (C99)

Compiling your (or my proposed) code with the C99 strict options throws some warnings/errors like

prog.c: In function 'check_match':
prog.c:7:22: error: array subscript has type 'char' [-Werror=char-subscripts]
             table[*A] += 1;

You can fix this using a cast to size_t which should be the actual type used for array indexing.

You can check my finally fixed code with all the mentioned points from above here.

I you use 1 << CHAR_BIT, you can still be in trouble on implementations where char is signed, regardless of the encoding. (Which, at least to me, is much more relevant than EBCDIC platforms.) — 5gon12eder, 10 hours ago
@5gon12eder Well, I think that's probably covered in my point 4., isn't it? — πάντα ῥεῖ, 9 hours ago

asked	today
viewed	230 times
active	today

current community

your communities

more stack exchange communities

Homework / toy program - character matching

2 Answers 2

Your algorithm looks alright

Try to nail down the requirements

Know your range

Beware of integral promotion

Consider saving some space – or maybe don't

Separate different concerns into different functions

1. Don't make assumptions on the char set used

2. Factor out your functionality to a separate function

3. Don't loose pointers initialized from literals

4. Consider to use the current c standard (C99)

Your Answer

Not the answer you're looking for? Browse other questions tagged algorithm c strings homework or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Homework / toy program - character matching

2 Answers 2

Your algorithm looks alright

Try to nail down the requirements

Know your range

Beware of integral promotion

Consider saving some space – or maybe don't

Separate different concerns into different functions

1. Don't make assumptions on the char set used

2. Factor out your functionality to a separate function

3. Don't loose pointers initialized from literals

4. Consider to use the current c standard (C99)

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged algorithm c strings homework or ask your own question.

Related

Hot Network Questions