Code Review Stack Exchange is a question and answer site for peer programmer code reviews. It's 100% free, no registration required.

I'm writing a Ruby extension in C. It's a string processing module working on UTF-8 encoded strings only.

One method, full_width_to_ascii!, converts full-width characters to their ASCII equivalents (https://en.wikipedia.org/wiki/Halfwidth_and_fullwidth_forms). Essentially, it subtracts an offset of 0xfee0 from the code point of each full-width character (U+FF01 to U+FF5E). As the bang in the name implies, it modifies the string in place.

For example:

full_width_to_ascii!('Ａ Ｂ Ｃ')
=> 'A B C'

In UTF-8, all full-width characters are encoded in 3 bytes each, while ASCII characters are 1 byte each. So while the length of the resulting string in Unicode code points will always be the same, the new UTF-8 encoded length in bytes may be smaller.
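To make the byte-level shrinkage concrete, here is a minimal standalone sketch in plain C, independent of Ruby. The function name `fullwidth_to_ascii_inplace` is hypothetical and is not the code from my extension; it assumes the buffer holds valid UTF-8 and only handles the U+FF01 to U+FF5E range.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, for illustration only (assumes valid UTF-8 input).
 * Rewrites buf in place and returns the new length in bytes.
 * The write index never overtakes the read index, so no scratch buffer
 * is needed. */
size_t fullwidth_to_ascii_inplace(unsigned char *buf, size_t len)
{
    size_t r = 0; /* read position  */
    size_t w = 0; /* write position */

    assert(buf != NULL);

    while (r < len) {
        /* U+FF01..U+FF5E are encoded as EF BC 81 .. EF BD 9E. */
        if (r + 2 < len && buf[r] == 0xEF &&
            (buf[r + 1] == 0xBC || buf[r + 1] == 0xBD) &&
            (buf[r + 2] & 0xC0) == 0x80) {
            /* Decode the 3-byte sequence; the lead byte EF contributes 0xF000. */
            unsigned cp = 0xF000u |
                          ((buf[r + 1] & 0x3Fu) << 6) |
                          (buf[r + 2] & 0x3Fu);
            if (cp >= 0xFF01u && cp <= 0xFF5Eu) {
                buf[w++] = (unsigned char)(cp - 0xFEE0u); /* 3 bytes become 1 */
                r += 3;
                continue;
            }
        }
        buf[w++] = buf[r++]; /* copy every other byte unchanged */
    }
    return w;
}
```

The returned value is the new byte length, which is what then has to be written back into the Ruby string object.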

I modify the encoded string data (retrieved using StringValueCStr()) and null-terminate it at the new end-point. Finally, I call the following function to reduce the encoded string length:

// from internal.h from MRI
#define STR_NOEMBED      FL_USER1
#define STR_SHARED       FL_USER2 /* = ELTS_SHARED */
#define STR_EMBED_P(str) (!FL_TEST_RAW((str), STR_NOEMBED))

void reduce_encoded_length(VALUE str, long length) {
  if (!STR_EMBED_P(str)) {
    // heap-allocated string: the length is a plain struct field (a long)
    RSTRING(str)->as.heap.len = length;
  } else {
    // embedded string: the length lives in the flag bits
    // see string.c, STR_SET_EMBED_LEN
    RBASIC(str)->flags &= ~RSTRING_EMBED_LEN_MASK;
    RBASIC(str)->flags |= (VALUE)length << RSTRING_EMBED_LEN_SHIFT;
  }
}

Looking at the MRI source, the implementation of strings is surprisingly complicated, but from my understanding so far, the function above should be safe for UTF-8 encoded strings.
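For readers unfamiliar with the embedded case, the clear-then-set pattern on the flags word can be shown in isolation. This is only a sketch: the `DEMO_` constants and function names are illustrative stand-ins I made up for this example (a 5-bit field starting at bit 14), not values copied from ruby.h.

```c
#include <assert.h>

/* Illustrative stand-ins, NOT taken from ruby.h:
 * a 5-bit length field that starts at bit 14 of the flags word. */
#define DEMO_LEN_SHIFT 14
#define DEMO_LEN_MASK  (0x1FUL << DEMO_LEN_SHIFT)

/* Hypothetical helper: store len in the length field and leave
 * every other flag bit untouched. */
unsigned long demo_set_embed_len(unsigned long flags, unsigned long len)
{
    /* A length that does not fit would spill into neighbouring flag bits. */
    assert(len <= (DEMO_LEN_MASK >> DEMO_LEN_SHIFT));

    flags &= ~DEMO_LEN_MASK;         /* clear the old length */
    flags |= len << DEMO_LEN_SHIFT;  /* write the new length */
    return flags;
}

/* Hypothetical helper: read the length field back out. */
unsigned long demo_get_embed_len(unsigned long flags)
{
    return (flags & DEMO_LEN_MASK) >> DEMO_LEN_SHIFT;
}
```

My function has no such range check. Since it only ever shrinks the string, the new length should always fit in the field, but that is an assumption I'd like confirmed.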

Does this seem reasonable? Will it break horribly on older Ruby versions? Across platforms?

See also:

https://github.com/ruby/ruby/blob/32674b167bddc0d737c38f84722986b0f228b44b/string.c
http://patshaughnessy.net/2012/1/4/never-create-ruby-strings-longer-than-23-characters


This question has an open bounty worth +50 reputation from bsa ending in 3 days.

This question has not received enough attention.

I'm hoping for a review from someone who is already strongly familiar with the source code of the MRI, or who is willing to spend some time reading through string.c to understand it and try to find edge cases I've missed.

    
I can't help much with the Ruby side of things, but I'd recommend taking a look at how iconv converts between these character sets. – syb0rg yesterday
