Skip to content

ww898/utf-cpp

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

UTF-8/16/32 C++ library

This is the C++11 template based header only library under Windows/Linux/MacOs to convert UFT-8/16/32 symbols and strings. The library transparently support wchar_t as UTF-16 for Windows and UTF-32 for Linux and MacOs.

UTF-8 and UTF-32 (UCS-32) both support 31 bit wide code points [0‥0x7FFFFFFF]with no restriction. UTF-16 supports only unicode code points [0‥0x10FFFF], where high [0xD800‥0xDBFF] and low [0xDC00‥0xDFFF] surrogate regions are prohibited.

The maximum UTF-16 symbol size is 2 words (4 bytes, both words should be in the surrogate region). UFT-32 (UCS-32) is always 1 word (4 bytes). UTF-8 has the maximum symbol size (see conversion table for details):

  • 4 bytes for unicode code points
  • 6 bytes for 31bit code points
UTF-16 surrogate decoder:
High\Low DC00 DC01 DFFF
D800 010000 010001 0103FF
D801 010400 010401 0107FF
DBFF 10FC00 10FC01 10FFFF

UTF-16 Surrogates

Supported compilers

Tested on following compilers:

Usage example

    // यूनिकोड
    static char const u8s[] = "\xE0\xA4\xAF\xE0\xA5\x82\xE0\xA4\xA8\xE0\xA4\xBF\xE0\xA4\x95\xE0\xA5\x8B\xE0\xA4\xA1";
    using namespace ww898::utf;
    std::u16string u16;
    convz<utf_selector_t<decltype(*u8s)>, utf16>(u8s, std::back_inserter(u16));
    std::u32string u32;
    conv<utf16, utf_selector_t<decltype(u32)::value_type>>(u16.begin(), u16.end(), std::back_inserter(u32));
    std::vector<char> u8;
    convz<utf32, utf8>(u32.data(), std::back_inserter(u8));
    std::wstring uw;
    conv<utf8, utfw>(u8s, u8s + sizeof(u8s), std::back_inserter(uw));
    auto u8r = conv<char>(uw);
    auto u16r = conv<char16_t>(u16);
    auto uwr = convz<wchar_t>(u8s);

    auto u32r = conv<char32_t>(std::string_view(u8r.data(), u8r.size())); // C++17 only

    static_assert(std::is_same<utf_selector<decltype(*u8s)>, utf_selector<decltype(u8)::value_type>>::value, "Fail");
    static_assert(
        std::is_same<utf_selector_t<decltype(u16)::value_type>, utf_selector_t<decltype(uw)::value_type>>::value !=
        std::is_same<utf_selector_t<decltype(u32)::value_type>, utf_selector_t<decltype(uw)::value_type>>::value, "Fail");

UTF-8 Conversion table

UTF-8/32 table