HTML Compressor with regex

Question

I would like to compress a Magento HTML page using some regex, and this is what I have written:

function html_compress($string){

    global $idarray;
    $idarray=array();

    //Replace PRE and TEXTAREA tags
    $search=array(
                    '@(<)\s*?(pre\b[^>]*?)(>)([\s\S]*?)(<)\s*(/\s*?pre\s*?)(>)@',   //Find PRE Tag
                    '@(<)\s*?(textarea\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?textarea\s*?)(>)@' //Find TEXTAREA
                );
    $string=preg_replace_callback($search,
                                    function($m){
                                        $id='<!['.uniqid().']!>';
                                        global $idarray;
                                        $idarray[]=array($id,$m[0]);
                                        return $id;
                                    },
                                    $string
    );

    //Remove blank useless space
    $search = array(
                    '@( |\t|\f)+@', // Shorten multiple whitespace sequences
                    '@(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+@',   //Remove blank lines
                    '@^(\s)+|( |\t|\0|\r\n)+$@' //Trim Lines
                    );
    $replace = array(' ',"\\1",'');
    $string = preg_replace($search, $replace, $string);

    //Replace IE COMMENTS, SCRIPT, STYLE and CDATA tags
    $search=array(
                    '@<!--\[if\s(?:[^<]+|<(?!!\[endif\]-->))*<!\[endif\]-->@',  //Find IE Comments
                    '@(<)\s*?(script\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?script\s*?)(>)@',    //Find SCRIPT Tag
                    '@(<)\s*?(style\b[^>]*?)(>)([\s\S]*?)(<)\s*?(/\s*?style\s*?)(>)@',  //Find STYLE Tag
                    '@(//<!\[CDATA\[([\s\S]*?)//]]>)@', //Find commented CDATA
                    '@(<!\[CDATA\[([\s\S]*?)]]>)@'  //Find CDATA
                );
    $string=preg_replace_callback($search,
                                    function($m){
                                        $id='<!['.uniqid().']!>';
                                        global $idarray;
                                        $idarray[]=array($id,$m[0]);
                                        return $id;
                                    },
                                    $string
    );

    //Remove blank useless space
    $search = array(
                    '@(class|id|value|alt|href|src|style|title)=(\'\s*?\'|"\s*?")@',    //Remove empty attribute
                    '@<!--([\s\S]*?)-->@',  // Strip comments except IE
                    '@[\r\n|\n|\r]@', // Strip break line
                    '@[ |\t|\f]+@', // Shorten multiple whitespace sequences
                    '@(^[\r\n]*|[\r\n]+)[\s\t]*[\r\n]+@', //Remove blank lines
                    '@^(\s)+|( |\t|\0|\r\n)+$@' //Trim Lines
                    );
    $replace = array(' ','',' ',' ',"\\1",'');
    $string = preg_replace($search, $replace, $string);

    //Replace unique id with original tag
    $c=count($idarray);
    for($i=0;$i<$c;$i++){
        $string = str_replace($idarray[$i][0], "\n".$idarray[$i][1]."\n", $string);
    }

    return $string;
}

It works, but I have some concerns:

Does it make sense to explicitly allow for all the \s*? between the parts of a tag and capture them in groups, as in (<)\s*?(style\b[^>]*?)(>)?
Does this script eat resources and considerably delay the page load? Is there any possible optimization?
Is the part that removes the white space redundant?
Is all of this "cacheable"?

Already read (also on different sites), but since I figured this is not an impossible task, I thought I could give it a try - Razorphyn, Dec 20 '14 at 10:47
Why not use an existing minifier for HTML that is proven to work? With regex it will be nigh impossible to catch all the strange corner cases that can crop up. - jessehouwing, Dec 20 '14 at 11:12
@jessehouwing could you give me an example? What I have found either didn't satisfy me or I can't use it on my server... - Razorphyn, Dec 20 '14 at 11:53
You will likely lose a lot more response time to the minification process running on the server. Use GZIP. If you can't determine the HTML beforehand (i.e., it's dynamically generated), then you're probably sod-out-of-luck. Minifying HTML at runtime is going to add significant overhead to your application. - Dan Pantry, Sep 2 '15 at 10:18

Mast · Answer 1 · 2015-09-02 09:22:58Z

See http://stackoverflow.com/a/6225706/736079 for comments on enabling content compression for your HTML pages. That is usually enough to reduce the payload by more than 50%.
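To get a feel for what content compression alone can save, the sizes can be compared directly. The following is only a rough sketch: gzip_savings() is a hypothetical helper name, the sample markup is made up, and it assumes PHP's zlib extension is available (gzencode() comes from it). Real pages will compress less dramatically than this deliberately repetitive sample.

```php
<?php
// Hypothetical helper: reports how much gzip shrinks a given HTML string.
// Assumes the zlib extension is loaded (it provides gzencode()).
function gzip_savings(string $html, int $level = 6): array
{
    $compressed = gzencode($html, $level);
    if ($compressed === false) {
        throw new RuntimeException('gzencode() failed');
    }

    $original = strlen($html);
    $gzipped  = strlen($compressed);

    return [
        'original'  => $original,
        'gzipped'   => $gzipped,
        // Percentage of bytes saved by compression.
        'saved_pct' => $original > 0 ? round(100 * (1 - $gzipped / $original), 1) : 0.0,
    ];
}

// Made-up sample: indented, repetitive markup like a long product list.
$html  = str_repeat("<li class=\"item\">\n    <a href=\"#\">Link</a>\n</li>\n", 200);
$stats = gzip_savings($html);

printf(
    "original: %d bytes, gzipped: %d bytes, saved: %.1f%%\n",
    $stats['original'],
    $stats['gzipped'],
    $stats['saved_pct']
);
```

In production the compression itself would normally be switched on in the web server or in PHP (for example with ob_start('ob_gzhandler') or the zlib.output_compression setting) rather than done by hand like this.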

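You can also make use of output buffering and combine it with the Minify_HTML::minify function:

<?php
// Assumes the Minify library (linked below) is installed and its classes are loadable.
function sanitize_output($content) {
    // The callback must return the processed buffer.
    return Minify_HTML::minify($content);
}

ob_start("sanitize_output");
?>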

See: https://github.com/mrclay/minify/blob/master/min/lib/Minify/HTML.php
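On the "cacheable" part of the question: one option is to store the minified result so the regex work runs only once per distinct page. The sketch below is an assumption-laden illustration, not part of the Minify library: cached_minify() is a hypothetical helper, the cache is a plain directory of files keyed by a SHA-1 of the input, and the stand-in minifier in the demo is a single trivial regex. With the real library, the callable could be ['Minify_HTML', 'minify'].

```php
<?php
// Hypothetical helper: caches minifier output on disk, keyed by a hash of the input.
// $minifier is any callable that takes an HTML string and returns the minified string.
function cached_minify(string $html, callable $minifier, string $cacheDir): string
{
    $file = rtrim($cacheDir, '/\\') . DIRECTORY_SEPARATOR . sha1($html) . '.html';

    // Cache hit: reuse the stored result and skip the minifier entirely.
    if (is_file($file)) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Cache miss: minify once and store the result for later requests.
    $minified = $minifier($html);
    file_put_contents($file, $minified, LOCK_EX);

    return $minified;
}

// Demo with a stand-in minifier that only removes whitespace between tags.
$calls    = 0;
$minifier = function (string $html) use (&$calls): string {
    $calls++;
    return preg_replace('/>\s+</', '><', $html);
};

$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'minify_cache_' . uniqid();
mkdir($dir);

$page   = "<ul>\n  <li>one</li>\n  <li>two</li>\n</ul>";
$first  = cached_minify($page, $minifier, $dir); // runs the minifier
$second = cached_minify($page, $minifier, $dir); // served from the cache

echo $second, "\n";
```

This only saves the minification cost: a dynamically generated page still has to be rendered before it can be hashed, so a full-page cache in front of the application would be needed to avoid that work too.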

This still uses regex at its core, which is not ideal, but it has been tested by a larger audience and looks quite solid (test it to make sure). This isn't limited to PHP either: sometimes the web server itself has plugins that can help you, such as Apache's mod_pagespeed. If you are hosting on IIS, you might be able to use a .NET HttpModule or an ISAPI filter as well.
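As a small illustration of the kind of corner case hand-written regexes run into, two patterns from the question's second clean-up pass put | inside a character class. There it is not alternation but a literal pipe, so pipes in the page text get replaced too. The sample input and the variable names below are made up for the demonstration, and the "fixed" pattern is just one possible correction.

```php
<?php
// Two patterns copied from the question's second clean-up pass.
$stripBreaks = '@[\r\n|\n|\r]@'; // intended: strip line breaks
$shortenWs   = '@[ |\t|\f]+@';   // intended: shorten whitespace runs

// Made-up sample containing literal pipe characters in the text.
$input = "<p>Home | About</p>\n<p>a||b</p>";

$step1 = preg_replace($stripBreaks, ' ', $input);
$step2 = preg_replace($shortenWs, ' ', $step1);
echo $step2, "\n"; // <p>Home About</p> <p>a b</p>  (the pipes are gone)

// One possible fix: list only the whitespace characters, with no pipes.
$fixed = preg_replace('@[ \t\f\r\n]+@', ' ', $input);
echo $fixed, "\n"; // <p>Home | About</p> <p>a||b</p>
```

A menu separator or any text containing | would silently change, which is the sort of thing a widely used minifier is more likely to have caught already.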

asked	1 year ago
viewed	706 times
active	10 months ago
