Database Administrators Stack Exchange is a question and answer site for database professionals who wish to improve their database skills and learn from others in the community. Join them; it only takes a minute:

Sign up
Here's how it works:
  1. Anybody can ask a question
  2. Anybody can answer
  3. The best answers are voted up and rise to the top

Does [:ascii:] class work in Postgres at all? It is not listed in their help, however I see examples in the web which utilize it.

I have a UTF-8 database, where collation and c_type are en_US.UTF-8, and Postgres version is 9.6.2. When I search for non-ASCII rows like this:

select title from wallabag_entry where title ~ '[^[:ascii:]';

I get both Unicode and non-Unicode symbols (full output is here):

Сталинская правозащитница: мать Меленкова бабушка Настя
Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?
Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев
Как комиссар Крекшин в 1740-е чуть не отменил историю России
Have you heard of Saint Death? Don’t pray to her.
Архаїчна українська мова: перевага чи недолік?
Гренада не их
China’s marriage rate is plummeting because women are choosing autonomy over 

What is wrong with this query?

share|improve this question
    
Is it possible that you're getting sentences with Unicode non-breakable spaces? (or any other character that hides in plain view, for that matter) – joanolo 17 hours ago
    
@joanolo, how to check this? How to see not-plain view? – Suncatcher 17 hours ago
    
You can use a regexp_replace() to mark your non-ASCII chars. See my answer. – joanolo 17 hours ago
    
You should always paste the exact result in dba.se. We can't test a graphic for non-ascii characters. we can test the actual result set. This is a poster child for shouldn't be a graphic – Evan Carroll 16 hours ago
    
@EvanCarroll, sorry. Corrected the sample. – Suncatcher 16 hours ago
up vote 5 down vote accepted

To answer your question: [:ascii:] works. You may have some characters in your text that you do not recognize as non-ASCII, yet they're there. They can be something like a non-breakable space, for instance, or any other Unicode space character.

It is not strange to have non-breakable spaces ( ) in texts that you cut-and-paste from a web page, yet you don't notice they're there.

Here is an example to show:

WITH t(t) AS
(
    VALUES 
      ( 'Сталинская правозащитница: мать Меленкова бабушка Настя' ),
      ( 'Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?' ),
      ( 'Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев' ),
      ( 'Как комиссар Крекшин в 1740-е чуть не отменил историю России' ),
      ( 'Have you heard of Saint Death? Don’t pray to her.' ),
      ( 'Архаїчна українська мова: перевага чи недолік?' ),
      ( 'Гренада не их' ),
      ( 'China’s marriage rate is plummeting because women are choosing autonomy over ' )

)
SELECT 
    t,  regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
FROM 
    t 
WHERE 
    t ~ '[^[:ascii:]]' ;

That's what you get:

                                       t                                       |                                                                                                 t_marked                                                                                                  
-------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Сталинская правозащитница: мать Меленкова бабушка Настя                       | [С][т][а][л][и][н][с][к][а][я] [п][р][а][в][о][з][а][щ][и][т][н][и][ц][а]: [м][а][т][ь] [М][е][л][е][н][к][о][в][а] [б][а][б][у][ш][к][а] [Н][а][с][т][я]
 Дневник НКВДиста Шабалина: Знает ли Москва положение на фронте?               | [Д][н][е][в][н][и][к] [Н][К][В][Д][и][с][т][а] [Ш][а][б][а][л][и][н][а]: [З][н][а][е][т] [л][и] [М][о][с][к][в][а] [п][о][л][о][ж][е][н][и][е] [н][а] [ф][р][о][н][т][е]?
 Бег по городу и поездка на осле: как в средневековье наказывали прелюбодеев   | [Б][е][г] [п][о] [г][о][р][о][д][у] [и] [п][о][е][з][д][к][а] [н][а] [о][с][л][е]: [к][а][к] [в] [с][р][е][д][н][е][в][е][к][о][в][ь][е] [н][а][к][а][з][ы][в][а][л][и] [п][р][е][л][ю][б][о][д][е][е][в]
 Как комиссар Крекшин в 1740-е чуть не отменил историю России                  | [К][а][к] [к][о][м][и][с][с][а][р] [К][р][е][к][ш][и][н] [в] 1740-[е] [ч][у][т][ь] [н][е] [о][т][м][е][н][и][л] [и][с][т][о][р][и][ю] [Р][о][с][с][и][и]
 Have you heard of Saint Death? Don’t pray to her.                             | Have you heard of Saint Death? Don[’]t pray to her.
 Архаїчна українська мова: перевага чи недолік?                                | [А][р][х][а][ї][ч][н][а] [у][к][р][а][ї][н][с][ь][к][а] [м][о][в][а]: [п][е][р][е][в][а][г][а] [ч][и] [н][е][д][о][л][і][к]?
 Гренада не их                                                                 | [Г][р][е][н][а][д][а] [н][е] [и][х]
 China’s marriage rate is plummeting because women are choosing autonomy over  | China[’]s marriage rate is plummeting because women are choosing autonomy over 

You can see from this, that your problem is the right-apostrophe character. ASCII only supports the apostrophe, left-apostrophe and right-apostrophe are typographically correct Unicode extensions

dbfiddle here

You can check it also with previous versions at http://rextester.com/UKIQ48014 (PostgreSQL 9.5) and http://sqlfiddle.com/#!15/4c563/1/0 (PostgreSQL 9.3)


The texts that I guess you think are pure ASCII, and are not:

 WITH t(t) AS
 (
     VALUES 
       ('A fully ASCII text!'),
       ('Have you heard of Saint Death? Don’t pray to her.'),
       ('China’s marriage rate is plummeting because women are choosing autonomy over ')
 )
 SELECT 
    regexp_replace(t, '([^[:ascii:]])', '[\1]', 'g') AS t_marked
 FROM 
    t 
 WHERE 
    t ~ '[^[:ascii:]]' ;
 | t_marked                                                                        |
 | :------------------------------------------------------------------------------ |
 | Have you heard of Saint Death? Don[’]t pray to her.                             |
 | China[’]s marriage rate is plummeting because women are choosing autonomy over  |
 

dbfiddle here

These texts are using instead of ' to mark apostrophes.

Check Punctuation: Why is the right single quote (U+2019), and not the semantically distinct apostrophe (U+0027), the preferred apostrophe character in Unicode? ... to see that you're not the first person encountering this problem.

share|improve this answer
2  
This is a really fantastic answer because it shows you the non-ascii characters. This is how I would have answered this question. – Evan Carroll 16 hours ago
1  
I updated with the OPs example. – Evan Carroll 16 hours ago
1  
Really fantastic and helpful answer! Thanks. – Suncatcher 16 hours ago

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Not the answer you're looking for? Browse other questions tagged or ask your own question.