UTF-8 (Unicode Transformation Format, 8-bit) is a character encoding that represents each Unicode code point as a sequence of one to four bytes (the original design allowed up to six, but RFC 3629 restricts it to four). It is backward-compatible with ASCII while still able to represent every Unicode character.
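As a rough illustration of the variable-length scheme, here is a minimal Python sketch. It assumes the RFC 3629 definition of UTF-8, which caps sequences at four bytes and excludes surrogates; the helper name `utf8_length` is hypothetical and only for illustration.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 (per RFC 3629) uses for one code point.

    Hypothetical helper for illustration; not part of any library.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:      # ASCII range: one byte, same value as ASCII
        return 1
    if code_point <= 0x7FF:     # two-byte sequences
        return 2
    if code_point <= 0xFFFF:    # three-byte sequences (rest of the BMP)
        return 3
    return 4                    # supplementary planes


if __name__ == "__main__":
    # Compare the helper against Python's built-in encoder.
    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        assert len(encoded) == utf8_length(ord(ch))
        print(f"U+{ord(ch):04X} -> {encoded.hex(' ')} ({len(encoded)} bytes)")

    # ASCII compatibility: pure ASCII text has identical bytes in both encodings.
    assert "plain ASCII".encode("ascii") == "plain ASCII".encode("utf-8")
```

The comparison against `str.encode` is only a sanity check; real code would normally rely on the built-in encoder rather than a hand-written length table.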
5
votes
0 answers
37 views
Metalsmith plugin for stripping UTF8 BOM from files
I've been developing a metalsmith static site and I came across an issue where Visual Studio was automatically adding a BOM to the pages.
I wrote the following plugin for metalsmith (it needs to be ...
1
vote
1 answer
31 views
Robustly dealing with malformed Unicode files
I'm writing a script that reads UTF-8-encoded XML files and writes parts of those files into a tempfile for further processing.
Sometimes, the input files will have a few malformed characters. ...
8
votes
2 answers
258 views
Validating UTF-8 byte array
I'm writing a validator function that receives a byte[] and checks whether it represents a valid UTF-8 byte sequence, according to this table.
Is my approach ...
7
votes
3 answers
322 views
Functions to escape CSS rules in PHP
Some context
I've been tasked with supplying an escaping function to arbitrary CSS values that are entered through a form. The goals and caveats are:
I know it's bad practice to let users input ...
4
votes
1 answer
126 views
Method to return a string of max length (in bytes vs. characters)
In my (C#) code, I need to generate a string (from a longer string) which, when UTF-8 encoded, is no longer than a given max length (in bytes).
...
3
votes
1 answer
63 views
Replacing Persian and Arabic digits
I'm using this function to replace UTF-8 characters representing numbers in text with 'normal' digits.
I'm wondering if this is optimized code since this is using two ...
1
vote
1 answer
44 views
Avoiding use of .encode() in rss2html
My concern with this code is the excessive use of .encode('utf-8'). Any advice on refining these functions would be very helpful.
rss2html GitHub repo
...
10
votes
2 answers
344 views
Customised Java UTF-8
I have implemented a customized UTF-8 encoding mechanism. The code works fine, but I have a lot of concerns regarding the code.
...
6
votes
4 answers
1k views
Function to convert ISO-8859-1 to UTF-8
I wrote this function last year to convert between the two encodings and just found it. It takes a text buffer and its size, then converts to UTF-8 if there's enough space.
What should be changed to ...
6
votes
3 answers
2k views
Count byte length of string
I am looking for some guidance and optimization pointers for my custom JavaScript function which counts the bytes in a string rather than just chars. The website uses UTF-8 and I am looking to ...
4
votes
2 answers
88 views
Macros to detect UTF-8
I'm working on a program that handles UTF-8 characters. I've made the following macros to detect UTF-8. I've tested them with a few thousand words and they seem to work.
I'll add another one to do ...
5
votes
1 answer
390 views
Better code for converting a char to its UTF-8 percent encoding representation?
This is working code for a URI template (RFC 6570) implementation; when the character to render is not within a specific character set, the code needs to grab the UTF-8 representation of that character ...
4
votes
1 answer
428 views
Please review my UTF-8 character reader function
You may see full code here (note that the link points to the specific commit).
Language is "clean C" (that is, a subset of C89, C99 and C++98 — it is intended to compile under all of these ...