Querying non-ASCII rows from Postgres

Question

Does [:ascii:] class work in Postgres at all? It is not listed in their help, however I see examples in the web which utilize it.

I have a UTF-8 database, where collation and c_type are en_US.UTF-8, and Postgres version is 9.6.2. When I search for non-ASCII rows like this:

select title from wallabag_entry where title ~ '[^[:ascii:]';

I get both Unicode and non-Unicode symbols (full output is here):

Сталинская правозащитница: мать Меленкова бабушка Настя
Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?
Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев
Как комиссар Крекшин в 1740-е чуть не отменил историю России
Have you heard of Saint Death? Don’t pray to her.
Архаїчна українська мова: перевага чи недолік?
Гренада не их
China’s marriage rate is plummeting because women are choosing autonomy over

What is wrong with this query?

Is it possible that you're getting sentences with Unicode non-breakable spaces? (or any other character that hides in plain view, for that matter) — joanolo, 17 hours ago
You can use a regexp_replace() to mark your non-ASCII chars. See my answer. — joanolo, 17 hours ago
You should always paste the exact result in dba.se. We can't test a graphic for non-ascii characters. we can test the actual result set. This is a poster child for shouldn't be a graphic — Evan Carroll, 16 hours ago

joanolo · Accepted Answer · 2017-03-19 19:08:44Z

To answer your question: [:ascii:] works. You may have some characters in your text that you do not recognize as non-ASCII, yet they're there. They can be something like a non-breakable space, for instance, or any other Unicode space character.

It is not strange to have non-breakable spaces ( ) in texts that you cut-and-paste from a web page, yet you don't notice they're there.

Here is an example to show:

WITH t(t) AS
(
    VALUES 
      ( 'Сталинская правозащитница: мать Меленкова бабушка Настя' ),
      ( 'Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?' ),
      ( 'Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев' ),
      ( 'Как комиссар Крекшин в 1740-е чуть не отменил историю России' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( 'Архаїчна українська мова: перевага чи недолік?' ),
      ( 'Гренада не их' ),
      ( 'China’s marriage rate is plummeting because women are choosing autonomy over ' )

)
SELECT 
    t,  regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM 
    t 
WHERE 
    t ~ '[^[:ascii:]]' ;

That's what you get:

                                       t                                       |                                                                                                 t_marked                                                                                                  
-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Сталинская правозащитница: мать Меленкова бабушка Настя                       | [С][т][а][л][и][н][с][к][а][я] [п][р][а][в][о][з][а][щ][и][т][н][и][ц][а]: [м][а][т][ь] [М][е][л][е][н][к][о][в][а] [б][а][б][у][ш][к][а] [Н][а][с][т][я]
 Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?               | [Д][н][е][в][н][и][к] [Н][К][В][Д][и][с][т][а] [Ш][а][б][а][л][и][н][а]: [З][н][а][е][т] [л][и] [М][о][с][к][в][а] [п][о][л][о][ж][е][н][и][е] [н][а] [ф][р][о][н][т][е]?
 Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев   | [Б][е][г] [п][о] [г][о][р][о][д][у] [и] [п][о][е][з][д][к][а] [н][а] [о][с][л][е]: [к][а][к] [в] [с][р][е][д][н][е][в][е][к][о][в][ь][е] [н][а][к][а][з][ы][в][а][л][и] [п][р][е][л][ю][б][о][д][е][е][в]
 Как комиссар Крекшин в 1740-е чуть не отменил историю России                  | [К][а][к] [к][о][м][и][с][с][а][р] [К][р][е][к][ш][и][н] [в] 1740-[е] [ч][у][т][ь] [н][е] [о][т][м][е][н][и][л] [и][с][т][о][р][и][ю] [Р][о][с][с][и][и]
 Have you heard of Saint Death? Don’t pray to her.                             | Have you heard of Saint Death? Don[’]t pray to her.
 Архаїчна українська мова: перевага чи недолік?                                | [А][р][х][а][ї][ч][н][а] [у][к][р][а][ї][н][с][ь][к][а] [м][о][в][а]: [п][е][р][е][в][а][г][а] [ч][и] [н][е][д][о][л][і][к]?
 Гренада не их                                                                 | [Г][р][е][н][а][д][а] [н][е] [и][х]
 China’s marriage rate is plummeting because women are choosing autonomy over  | China[’]s marriage rate is plummeting because women are choosing autonomy over

You can see from this, that your problem is the right-apostrophe character. ASCII only supports the apostrophe, left-apostrophe and right-apostrophe are typographically correct Unicode extensions

dbfiddle here

You can check it also with previous versions at http://rextester.com/UKIQ48014 (PostgreSQL 9.5) and http://sqlfiddle.com/#!15/4c563/1/0 (PostgreSQL 9.3)

The texts that I guess you think are pure ASCII, and are not:

 WITH t(t) AS
 (
     VALUES 
       ('A fully ASCII text!'),
       ('Have you heard of Saint Death? Don’t pray to her.'),
       ('China’s marriage rate is plummeting because women are choosing autonomy over ')
 )
 SELECT 
    regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
 FROM 
    t 
 WHERE 
    t ~ '[^[:ascii:]]' ;

 | t_marked                                                                        |
 | :------------------------------------------------------------------------------ |
 | Have you heard of Saint Death? Don[’]t pray to her.                             |
 | China[’]s marriage rate is plummeting because women are choosing autonomy over  |

dbfiddle here

These texts are using ’ instead of ' to mark apostrophes.

Check Punctuation: Why is the right single quote (U+2019), and not the semantically distinct apostrophe (U+0027), the preferred apostrophe character in Unicode? ... to see that you're not the first person encountering this problem.

This is a really fantastic answer because it shows you the non-ascii characters. This is how I would have answered this question. — Evan Carroll, 16 hours ago

asked	today
viewed	88 times
active	today

current community

your communities

more stack exchange communities

Querying non-ASCII rows from Postgres

1 Answer 1

Your Answer

Not the answer you're looking for? Browse other questions tagged postgresql utf-8 regular-expression unicode regex or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Querying non-ASCII rows from Postgres

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged postgresql utf-8 regular-expression unicode regex or ask your own question.

Related

Hot Network Questions