PostgreSQL - Replace HTML Entities

Question

I have just set about the task of stripping out HTML entities from our database, as we do a lot of crawling and some of the crawlers didn't do this at input time :(

So I started writing a bunch of queries that look like this:

UPDATE nodes SET name=regexp_replace(name, '&#xe0;', 'à', 'g') WHERE name LIKE '%#xe0%';
UPDATE nodes SET name=regexp_replace(name, '&#xe1;', 'á', 'g') WHERE name LIKE '%#xe1%';
UPDATE nodes SET name=regexp_replace(name, '&#xe2;', 'â', 'g') WHERE name LIKE '%#xe2%';

Which is clearly a pretty naive approach. I've been trying to figure out if there is something clever I can do with the decode function; maybe grabbing the html entity by regex like /&#x(..);/, then passing just the %1 part to the ascii decoder, and reconstructing the string...or something...

Shall I just press on with the queries? There will probably only be 40 or so of them.

You'll want to VACUUM aggressively if you're doing this, to avoid huge table bloat. Doing the text processing in a procedural language (PL) is by far the better approach, as @SzymonGuz explains. It's possible in SQL using substring or regexp_matches and a replacement table, but it'll be slow and ugly. – Craig Ringer, Aug 28 '12 at 3:21

Szymon Guz · Accepted Answer · 2012-08-27 19:40:03Z

Write a function in PL/PerlU that uses the HTML::Entities module: http://search.cpan.org/~gaas/HTML-Parser-3.69/lib/HTML/Entities.pm

Of course, you need to have Perl installed and PL/Perl available.

1) First of all, create the procedural language PL/PerlU:

CREATE EXTENSION plperlu;

2) Then create a function like this:

CREATE FUNCTION decode_html_entities(text) RETURNS TEXT AS $$
    use HTML::Entities;
    return decode_entities($_[0]);
$$ LANGUAGE plperlu;

3) Then you can use it like this:

select decode_html_entities('aaabbb&amp;.... asasdasdasd &hellip;');
   decode_html_entities    
---------------------------
 aaabbb&.... asasdasdasd …
(1 row)
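
To apply the function to the table from the question, something along these lines might work. It is only a sketch: it assumes the nodes table and name column shown in the question and the decode_html_entities function defined above, and the LIKE pattern is just a rough pre-filter for "contains an ampersand followed later by a semicolon".

```sql
-- Rewrite only the rows that actually change, to limit the number of dead tuples.
UPDATE nodes
SET    name = decode_html_entities(name)
WHERE  name LIKE '%&%;%'                      -- cheap pre-filter: looks like it may contain an entity
AND    name <> decode_html_entities(name);    -- skip rows the decode would leave untouched

-- Reclaim the space left behind by the rewritten rows, as suggested in the comments.
-- VACUUM cannot run inside a transaction block, so issue it on its own.
VACUUM ANALYZE nodes;
```

Skipping unchanged rows matters because every updated row leaves a dead tuple behind, which is where the table bloat comes from.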

edited Aug 27 '12 at 19:40

answered Aug 27 '12 at 19:30

Thanks, I was hoping not to have to jump into perl, but I guess a pure sql solution would be a little too much to ask for! – lynks Aug 28 '12 at 10:14

Well, it would also be very easy, but much longer than those 2 lines of Perl code. – Szymon Guz Aug 28 '12 at 14:11
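
For comparison, the longer route without Perl could look roughly like the PL/pgSQL sketch below. It follows the idea from the question of capturing the hex digits with a regex and converting them to a character. The function name decode_hex_entities is hypothetical, it handles only hexadecimal numeric references such as &#xe0; (not named entities like &hellip; or decimal references), and it assumes a UTF8-encoded database so that chr() accepts Unicode code points.

```sql
-- Hypothetical helper, not part of PostgreSQL: decodes &#x...; references only.
CREATE OR REPLACE FUNCTION decode_hex_entities(input text) RETURNS text AS $$
DECLARE
    result text := input;
    m      text[];   -- one regex match; m[1] holds the captured hex digits
BEGIN
    FOR m IN
        SELECT DISTINCT regexp_matches(input, '&#x([0-9a-fA-F]{1,6});', 'g')
    LOOP
        -- Pad the digits to 8 hex characters, read them as a 32-bit value,
        -- and turn that code point into a character.
        result := replace(result,
                          '&#x' || m[1] || ';',
                          chr(('x' || lpad(m[1], 8, '0'))::bit(32)::int));
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql IMMUTABLE STRICT;

-- Example call; in a UTF8 database this should give: café déjà
SELECT decode_hex_entities('caf&#xe9; d&#xe9;j&#xe0;');
```

Note that chr() raises an error for code point 0 and for values that are not valid characters, so malformed references in crawled data would need extra handling. That is one reason the Perl module is the more robust choice.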

Quandary · Answer 2 · 2013-01-09 12:09:23Z

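You could use xpath for numeric character references and the five predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;), which HTML and XML encode the same way (HTML-only named entities such as &hellip; are not defined in XML, so the cast to xml fails on them):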

select
  'AT&amp;T' as input,
  (xpath('/z/text()', ('<z>' || 'AT&amp;T' || '</z>')::xml))[1] as output;
