String encoding is dropped in C++ round-trip through String class #988
Comments
|
Probably not "intentional". There is some encoding support in |
|
I don't have access to a Windows machine either, but I see the encoding change from "UTF-8" to "unknown" on OS X, it just doesn't cause any downstream problems. I looked at the Rcpp code and noticed two things:
|
|
@clauswilke Re 1) Do you want to try that modification, maybe in the most careful way, where we'd enable it with an option? (Change management is hard with 1600+ client packages.) Re 2): Unsure, but we have had a number of discussions and attempts on that. @coatless Thanks for digging up those tickets. With greetings from Iceland... |
|
Maybe we should talk a little more about what exactly the design should be. I'm not sure I know enough about encoding and R internals to have a strong opinion. However, below I provide another reprex, now purely C++, that I'd argue is confusing to a downstream user. I create a string with a defined encoding and then make a copy of that string, and the encoding changes.
#include <Rcpp.h>
using namespace Rcpp;
#include <string>
void output_encoding(const String &s, const std::string &name) {
  Rcout << "String " << name << " has encoding: ";
  cetype_t enc = s.get_encoding();
  switch(enc) {
  case CE_NATIVE:
    Rcout << "native" << std::endl;
    break;
  case CE_UTF8:
    Rcout << "utf8" << std::endl;
    break;
  default:
    Rcout << enc << std::endl;
  }
}
// [[Rcpp::export]]
void test_encoding(){
  String s("abcdabcd", CE_UTF8);
  output_encoding(s, "s");
  String s2(s);
  output_encoding(s2, "s2");
  String s3 = s;
  output_encoding(s3, "s3");
}
/*** R
test_encoding()
*/
Output:
Rcpp::sourceCpp("~/Desktop/test.cpp")
#>
#> > test_encoding()
#> String s has encoding: utf8
#> String s2 has encoding: native
#> String s3 has encoding: native
Created on 2019-08-12 by the reprex package (v0.3.0) |
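For readers without an R toolchain at hand, the symptom in the reprex above can be modelled in plain C++. This is only a sketch under stated assumptions: `Enc`, `LossyString`, `TaggedString`, and `demo_copy_behaviour` are hypothetical names invented for illustration, and nothing here is a claim about how Rcpp's `String` is actually implemented. It just shows that a copy constructor which copies the text but not the encoding tag produces the "utf8, then native" pattern, and that carrying the tag along avoids it.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical encoding tag, loosely mirroring CE_NATIVE / CE_UTF8.
enum class Enc { Native, Utf8 };

// Illustrative only: the copy constructor copies the text but resets the tag.
class LossyString {
public:
    LossyString(std::string text, Enc enc) : text_(std::move(text)), enc_(enc) {}
    LossyString(const LossyString& other) : text_(other.text_), enc_(Enc::Native) {}
    Enc encoding() const { return enc_; }
    const std::string& text() const { return text_; }
private:
    std::string text_;
    Enc enc_;
};

// Illustrative only: the copy constructor carries the tag along with the text.
class TaggedString {
public:
    TaggedString(std::string text, Enc enc) : text_(std::move(text)), enc_(enc) {}
    TaggedString(const TaggedString& other) : text_(other.text_), enc_(other.enc_) {}
    Enc encoding() const { return enc_; }
    const std::string& text() const { return text_; }
private:
    std::string text_;
    Enc enc_;
};

// Mirrors the shape of test_encoding() from the reprex, using the toy classes.
void demo_copy_behaviour() {
    LossyString s("abcdabcd", Enc::Utf8);
    LossyString s2(s);      // direct-initialised copy
    LossyString s3 = s;     // copy-initialised copy, same constructor
    assert(s.encoding() == Enc::Utf8);
    assert(s2.encoding() == Enc::Native);  // tag lost on copy
    assert(s3.encoding() == Enc::Native);  // tag lost on copy

    TaggedString t("abcdabcd", Enc::Utf8);
    TaggedString t2 = t;
    assert(t2.encoding() == Enc::Utf8);    // tag preserved
}
```

Whether the real fix belongs in the copy constructor, in assignment, or elsewhere is exactly the design question raised in this thread; the sketch takes no position on that.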
|
I am with you on things being surprising at times, but as far as I know (and there are areas of R I feel I know more about) the whole character encoding area is a mess, particularly once Windows comes in, so in that sense maybe the first recommendation (which we could make more public in the Rcpp FAQ) is to stick to |
|
Is it possible to take a specific element from a |
|
You can. As always, the devil is in the detail. I got my first attempt wrong, then cheated as usual and looked at the existing unit tests.
> cppFunction("CharacterVector getNth(CharacterVector v, int i) { ## manual indent
CharacterVector n(1); n[0] = v[i]; return(n); }")
> getNth(LETTERS, 2)
[1] "C"
> |
|
I can confirm this keeps the encoding intact.
Rcpp::cppFunction("CharacterVector getNth(CharacterVector v, int i) {
CharacterVector n(1); n[0] = v[i]; return(n); }")
x <- "special char: \u03bc"
getNth(x, 0)
#> [1] "special char: μ"
Encoding(x)
#> [1] "UTF-8"
Encoding(getNth(x, 0))
#> [1] "UTF-8"
Created on 2019-08-12 by the reprex package (v0.3.0) |
|
Great! It's a stop-gap. In theory, there is no reason why |
Is this behavior as expected? It causes unexpected bugs that only show up on Windows, see e.g. here.
Created on 2019-08-11 by the reprex package (v0.3.0)