I'm loading a website's HTML using this call:

    NSMutableURLRequest *request = [NSMutableURLRequest requestWithURL:url];
    [request setValue:@"utf-8" forHTTPHeaderField:@"Accept-Encoding"];
    [request setValue:@"text/html" forHTTPHeaderField:@"Accept"];
    [NSURLConnection sendAsynchronousRequest:request
                                       queue:[NSOperationQueue currentQueue]
                           completionHandler:^(NSURLResponse *response, NSData *data, NSError *error) { ... }];

Then, to convert the NSData into an NSString, I need to know the encoding, so I call:

NSString *textEncoding = [response textEncodingName];

from inside the completion block, but it returns nil for websites that don't specify a charset in the "Content-Type" header field.

If I don't know the encoding, [[NSString alloc] initWithData:data encoding:responseEncoding] won't give me readable HTML.

How can I detect the right encoding for websites that don't send a charset in the "Content-Type" header field?

Accepted answer (score 2):

You can try different encodings in turn and see which one yields readable text:

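// NSStringEncoding is an NSUInteger; some constants below (e.g. the UTF-16/UTF-32 variants) do not fit in an int.
static const NSStringEncoding encodingPriority[] = {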
    NSUTF8StringEncoding,
    NSASCIIStringEncoding,
    NSISOLatin1StringEncoding,
    NSISOLatin2StringEncoding,
    NSUnicodeStringEncoding,
    NSWindowsCP1251StringEncoding,
    NSWindowsCP1252StringEncoding,
    NSWindowsCP1253StringEncoding,
    NSWindowsCP1254StringEncoding,
    NSWindowsCP1250StringEncoding,
    NSNEXTSTEPStringEncoding,
    NSJapaneseEUCStringEncoding,
    NSNonLossyASCIIStringEncoding,
    NSShiftJISStringEncoding,          /* kCFStringEncodingDOSJapanese */
    NSISO2022JPStringEncoding,        /* ISO 2022 Japanese encoding for e-mail */
    NSMacOSRomanStringEncoding,
    NSUTF16BigEndianStringEncoding,
    NSUTF16LittleEndianStringEncoding,
    NSUTF32StringEncoding,
    NSUTF32BigEndianStringEncoding,
    NSUTF32LittleEndianStringEncoding
};

#define REQUIRED_HTML_STRING    @"<html"

- (NSString *)htmlStringForUnknownEncodingData:(NSData *)data detectedEncoding:(NSStringEncoding *)detectedEncoding
{
    NSStringEncoding encoding;
    NSString *html;

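    // Use the element count, not the byte size: sizeof(array) alone would run past the end of the array.
    for (NSUInteger i = 0; i < sizeof(encodingPriority) / sizeof(encodingPriority[0]); i++) {
        encoding = encodingPriority[i];

        // Try this encoding; initWithData:encoding: returns nil if the bytes are invalid for it.
        html = [[NSString alloc] initWithData:data encoding:encoding];

        // A wrong encoding can still decode to garbage, so require a known marker in the text.
        if (html && [html rangeOfString:REQUIRED_HTML_STRING options:NSCaseInsensitiveSearch].location != NSNotFound) {
            if (detectedEncoding) {
                *detectedEncoding = encoding;
            }
            return html;
        }
    }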
    }
    return nil;
}

Then, to detect which encoding the HTML in your NSData uses, call:

NSStringEncoding encoding;
NSString *html = [self htmlStringForUnknownEncodingData:data detectedEncoding:&encoding];

if (html)
    NSLog(@"Encoding detected!");
else
    NSLog(@"No encoding detected");
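
On iOS 8 / OS X 10.10 and later, Foundation also has a built-in detector, +[NSString stringEncodingForData:encodingOptions:convertedString:usedLossyConversion:]. It is not part of the approach above; the following is only a minimal sketch of how it might be used for the same job. The helper name DetectHTMLString is made up for illustration, and the list of suggested encodings is an assumption about what your pages are likely to use.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper (not from the answer above): asks Foundation to guess the encoding.
// Returns the decoded string, or nil if no encoding could be determined.
static NSString *DetectHTMLString(NSData *data, NSStringEncoding *detectedEncoding)
{
    // Assumption: these are the encodings we consider most likely.
    // They are passed to the detector as hints, not as a hard restriction.
    NSDictionary *options = @{
        NSStringEncodingDetectionSuggestedEncodingsKey : @[ @(NSUTF8StringEncoding),
                                                            @(NSWindowsCP1252StringEncoding) ],
        NSStringEncodingDetectionAllowLossyKey : @NO
    };

    NSString *html = nil;
    BOOL usedLossyConversion = NO;
    NSStringEncoding encoding = [NSString stringEncodingForData:data
                                                encodingOptions:options
                                                convertedString:&html
                                            usedLossyConversion:&usedLossyConversion];
    if (encoding == 0) {
        return nil; // 0 means the encoding could not be determined
    }
    if (detectedEncoding) {
        *detectedEncoding = encoding;
    }
    return html;
}
```

The detector is heuristic, so its guess can still be wrong for short or ambiguous input; checking the result for an expected marker such as "<html" remains a reasonable sanity check.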

I tried the code from @Kof. In my case, though, the response did report an encoding name: "utf-8". If you pass that string directly, as in [[NSString alloc] initWithData:data encoding:@"utf-8"], you get nil, because the encoding parameter expects an NSStringEncoding (an integer type, NSUInteger), not an NSString. If you call [[NSString alloc] initWithData:data encoding:NSUTF8StringEncoding] instead, you get the decoded string.
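
When the response does carry an encoding name, it can be converted to an NSStringEncoding through Core Foundation rather than hard-coding the constant. The following is a minimal sketch, assuming ARC (for the __bridge cast); the helper name EncodingFromIANAName and its fallback parameter are made up for illustration.

```objective-c
#import <Foundation/Foundation.h>

// Hypothetical helper: maps an IANA charset name (e.g. @"utf-8") to an NSStringEncoding.
// Returns `fallback` when the name is nil, empty, or not recognized.
static NSStringEncoding EncodingFromIANAName(NSString *name, NSStringEncoding fallback)
{
    if (name.length == 0) {
        return fallback;
    }
    CFStringEncoding cfEncoding =
        CFStringConvertIANACharSetNameToEncoding((__bridge CFStringRef)name);
    if (cfEncoding == kCFStringEncodingInvalidId) {
        return fallback;
    }
    return CFStringConvertEncodingToNSStringEncoding(cfEncoding);
}

// Possible use inside the completion handler (falling back to UTF-8 is an assumption):
//   NSStringEncoding enc = EncodingFromIANAName([response textEncodingName], NSUTF8StringEncoding);
//   NSString *html = [[NSString alloc] initWithData:data encoding:enc];
```

If html is still nil after this, the declared charset did not match the bytes, and trying other encodings as in the answer above is the next step.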
