I am new to Unix and would really appreciate your help. I need to parse a string that contains hash-prefixed character codes and convert each code to its equivalent character. Here is an example input:

I see that you#39;re eligible to get ticket for show on your device#44;

The script should output:

I see that you're eligible to get ticket for show on your device,
    
Is this html encoding? –  richard Jan 21 at 21:23
    
yes, it uses the html entity encoding style but without the &. –  pratikch Jan 21 at 21:29
1  
There are tools that can do html decoding, so use sed to convert it to “proper” html encoding, then put it through one of these decoders. –  richard Jan 21 at 21:34
    
How do you get from 146 to ` (grave accent, U+0060)? Or do you mean ’ (right single quotation mark), which is at 146 in the ibm-cp1252 character set for instance? –  Stéphane Chazelas Jan 21 at 22:16
    
After your edit, that still doesn't add up. HTML &#146; is a control character, &#96; is ` (backtick, grave accent), &#39; (apostrophe) is ', and &#8217; (right single quotation mark) is ’. –  Stéphane Chazelas Jan 22 at 12:46

1 Answer

Accepted answer, 1 vote

perl is good for this:

$ str='I see that you#146;re eligible to get ticket for show on your device#44;'
$ perl -pe 's/#(\d+);/chr($1)/ge' <<<"$str"
I see that you’re eligible to get ticket for show on your device,

I had to set my terminal's encoding to WINDOWS-1252 to get that output. In ISO-8859-1, decimal 146 is not a printable character: it falls in the C1 control range.
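
To check what byte 146 means in Windows-1252 without changing the terminal encoding, the byte can be converted to UTF-8. This is only a sketch: it assumes iconv is installed and recognises the charset name WINDOWS-1252, and byte_to_utf8 is a made-up helper name, not part of the original answer.

```shell
# byte_to_utf8 (hypothetical helper): emit one byte, given as a decimal value,
# and convert it from Windows-1252 to UTF-8.
byte_to_utf8() {
  # printf only understands octal escapes, so turn the decimal value into
  # three octal digits first (146 becomes \222).
  printf "\\$(printf '%03o' "$1")" | iconv -f WINDOWS-1252 -t UTF-8
}

# Dump the converted bytes in hex. For 146 these should be e2 80 99,
# the UTF-8 encoding of U+2019 (right single quotation mark).
byte_to_utf8 146 | od -An -tx1
```

On a UTF-8 terminal, running byte_to_utf8 146 on its own should display the quotation mark directly.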


To treat those codes as HTML entities, we'll add the missing ampersand, and decode:

perl -MHTML::Entities -lne 's/(#\d+;)/&$1/g; print decode_entities($_)' <<<"$str"
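
If perl is not an option, the same substitution can be approximated in bash alone. This is a rough sketch under stated assumptions: it needs bash (for [[ =~ ]] and printf -v), the name decode_hash_codes is made up for illustration, and it decodes only ASCII codes 1 to 127, leaving anything else (such as #146;) untouched.

```shell
#!/usr/bin/env bash
# decode_hash_codes (hypothetical helper): replace each "#NN;" in the first
# argument with the matching ASCII character and print the result.
decode_hash_codes() {
  local s=$1 tail='' code piece
  local re='^(.*)#([0-9]{1,7});(.*)$'
  # The greedy (.*) makes every match pick the LAST "#NN;" still in $s, so the
  # string is consumed right to left and decoded text is never rescanned.
  while [[ $s =~ $re ]]; do
    s=${BASH_REMATCH[1]}
    code=$((10#${BASH_REMATCH[2]}))    # force base 10 so "#044;" is not octal
    if (( code >= 1 && code <= 127 )); then
      # Build the character from a three-digit octal escape.
      printf -v piece "\\$(printf '%03o' "$code")"
    else
      piece="#${BASH_REMATCH[2]};"     # outside ASCII: keep the text as-is
    fi
    tail=$piece${BASH_REMATCH[3]}$tail
  done
  printf '%s\n' "$s$tail"
}

decode_hash_codes 'I see that you#39;re eligible to get ticket for show on your device#44;'
```

Codes above 127 are deliberately skipped here because the right character depends on which charset the numbers refer to, as discussed in the comments.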
    
Thank you for your answer, but I can't use anything other than a simple shell script :( –  pratikch Jan 21 at 21:36
    
You need to output the character with code point 146, not byte 146. (This will be a 2-byte sequence in UTF-8.) –  richard Jan 21 at 21:36
    
Yes, I agree, this uses UTF encoding. Sorry if I failed to mention that. –  pratikch Jan 21 at 21:39
    
There's a great number of charsets that have RIGHT SINGLE QUOTATION MARK at 146, not only windows-1252 (all the other windows-125x as well, for instance). Of course, using that character here is wrong; the OP should use the apostrophe ' instead. –  Stéphane Chazelas Jan 21 at 22:07
    
Agreed, I corrected the question. –  pratikch Jan 22 at 4:07
