Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

String encoding is dropped in C++ round-trip through String class #988

Open
clauswilke opened this issue Aug 12, 2019 · 10 comments
Open

String encoding is dropped in C++ round-trip through String class #988

clauswilke opened this issue Aug 12, 2019 · 10 comments

Comments

@clauswilke
Copy link

@clauswilke clauswilke commented Aug 12, 2019

Is this behavior as expected? It causes unexpected bugs that only show up on Windows, see e.g. here.

library(Rcpp)

cppFunction('CharacterVector test1(const String &s) {
  CharacterVector v(s);
  return v;
}')

cppFunction('CharacterVector test2(CharacterVector c) {
  String s = c[0];
  CharacterVector v(s);
  return v;
}')

cppFunction('CharacterVector test3(CharacterVector c) {
  CharacterVector v(c);
  return v;
}')

# the following fails on Windows for test1() and test2()
x <- "special char: \u03bc"
test1(x)
#> [1] "special char: μ"
test2(x)
#> [1] "special char: μ"
test3(x)
#> [1] "special char: μ"

# The encoding is lost for functions test1() and test2()
Encoding(x)
#> [1] "UTF-8"
Encoding(test1(x))
#> [1] "unknown"
Encoding(test2(x))
#> [1] "unknown"
Encoding(test3(x))
#> [1] "UTF-8"

Created on 2019-08-11 by the reprex package (v0.3.0)

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Aug 12, 2019

Probably not "intentional". There is some encoding support in String.h ie some constructors have an second argument, Without it, maybe we are dropping infromation on Windows with is known to be difficult. I don't have good Windows access but maybe you try variants of test11() and test2() which call set_encoding() to test this?

@clauswilke
Copy link
Author

@clauswilke clauswilke commented Aug 12, 2019

I don't have access to a Windows machine either, but I see the encoding change from "UTF-8" to "unknown" on OS X, it just doesn't cause any downstream problems.

I looked at the Rcpp code and noticed two things:

  1. In the copy constructor, I would have expected the encoding to just be copied over from other:

    String(const String& other) : data(other.get_sexp()), valid(true), buffer_ready(false), enc(Rf_getCharCE(other.get_sexp())) {

  2. In the assignment operator, encoding isn't set at all:

    inline String& operator=(const String& other) { data = Rcpp_ReplaceObject(data, other.get_sexp()); valid = true; buffer_ready = false; return *this; }

    (In fact, none of the assignment operators set the encoding.)

@coatless
Copy link
Contributor

@coatless coatless commented Aug 12, 2019

I think this was a deliberate design choice for Rcpp::String circa 2015/6, c.f. original ticket #263, PR #310, and PR #529 that set defaults to CE_UTF8.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Aug 12, 2019

@clauswilke Re 1) Do you want to try that modification, maybe in the most careful that we'd enable it with an option? [Change management is hard with 1600+ client packages. Re 2): Unsure, but we had a number of discussions and attempts on that.

@coatless Thanks for digging up those tickets.

With greetings from Iceland...

@clauswilke
Copy link
Author

@clauswilke clauswilke commented Aug 12, 2019

Maybe we talk a little more about what the design exactly should be. I'm not sure I know enough about encoding and R internals to have a strong opinion. However, below I provide another reprex, now purely C++, that I'd argue is confusing to a downstream user. I create a string with defined encoding and then make a copy of that string, and the encoding changes.

#include <Rcpp.h>
using namespace Rcpp;

#include <string>

void output_encoding(const String &s, const std::string &name) {
  Rcout << "String " << name << " has encoding: ";

  cetype_t enc = s.get_encoding();

  switch(enc) {
  case CE_NATIVE:
    Rcout << "native" << std::endl;
    break;
  case CE_UTF8:
    Rcout << "utf8" << std::endl;
    break;
  default:
    Rcout << enc << std::endl;
  }
}

// [[Rcpp::export]]
void test_encoding(){
  String s("abcdabcd", CE_UTF8);
  output_encoding(s, "s");

  String s2(s);
  output_encoding(s2, "s2");

  String s3 = s;
  output_encoding(s3, "s3");
}

/*** R
test_encoding()
*/

Output:

Rcpp::sourceCpp("~/Desktop/test.cpp")
#> 
#> > test_encoding()
#> String s has encoding: utf8
#> String s2 has encoding: native
#> String s3 has encoding: native

Created on 2019-08-12 by the reprex package (v0.3.0)

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Aug 12, 2019

I am with you on things being surprising at times, but as far as I know (and there are areas of R I feel I know more about) the whole charactter is a mess particularly once Windows comes in so in that sense maybe the first recommendation (which we could make more public in the Rcpp FAQ) is to stick to CharacterVector instead. R added a lot of elbow grease to make this work for R, the best may just be to free ride as much as we can. String is fine, but may fall short on some of these corner cases. My $0.02.

@clauswilke
Copy link
Author

@clauswilke clauswilke commented Aug 12, 2019

Is it possible to take a specific element from a CharacterVector and stick it into a new CharacterVector without going through a String intermediate? It's not immediately obvious to me how, but I'm still learning Rcpp and may be missing something.

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Aug 12, 2019

You can. As always, devil in the detail. Got my first attempt wrong, then cheated as usual and looked at the existing unit tests.

> cppFunction("CharacterVector getNth(CharacterVector v, int i) {   ## manual indent 
                              CharacterVector n(1); n[0] = v[i]; return(n); }")
> getNth(LETTERS, 2)
[1] "C"
> 
@clauswilke
Copy link
Author

@clauswilke clauswilke commented Aug 12, 2019

I can confirm this keeps the encoding intact.

Rcpp::cppFunction("CharacterVector getNth(CharacterVector v, int i) {
                              CharacterVector n(1); n[0] = v[i]; return(n); }")
x <- "special char: \u03bc"
getNth(x, 0)
#> [1] "special char: μ"
Encoding(x)
#> [1] "UTF-8"
Encoding(getNth(x, 0))
#> [1] "UTF-8"

Created on 2019-08-12 by the reprex package (v0.3.0)

@eddelbuettel
Copy link
Member

@eddelbuettel eddelbuettel commented Aug 12, 2019

Great! It's a stop-gap.

In theory, there is no reason why String could not do the same.
In practice, it may be a beast to get there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants
You can’t perform that action at this time.