Sanitizing user input with markdown-like syntax

Question

Since I sanitize all user-provided strings before uploading them to the DB, I wanted to give users the ability to format text the way it works here on Stack Exchange or on WhatsApp:

*word* -> bold
_word_ -> italic

This class contains two functions:

upload(): called when the user uploads a text; it replaces *word* with <b>word</b>, _word_ with <i>word</i> and \n with <br />
download(): called when the user wants to modify the text; it does exactly the opposite, replacing the HTML tags with the custom markers * and _

My questions:

Could this code be considered a real class in the object-oriented sense, or is it just procedural code put into a class?
Would you improve it in any way?
Do you have any suggestions for writing it better?

class txtFormatting {
    private $text;

    function __construct($text)
    {
        $this->text = $text;
    }

    function upload() {
        $this->text = preg_replace('/[ \t]+/', ' ', $this->text); // collapse runs of spaces/tabs into a single space
        $this->text = nl2br($this->text); // insert <br /> before each line break
        $this->text = preg_replace(array("/\r\n/", "/\n\r/", "/\n/", "/\r/"), '', $this->text); // then strip the line breaks themselves
        $this->text = explode(' ', $this->text); //each word becomes a value
        $regexAY =
            [
                '/[*]{1}[a-zA-Z0-9]+[*]{1}/' =>
                    [
                        "pattern" => "*",
                        "openTag" => "<b>",
                        "closeTag" => "</b>"
                    ],
                '/[_]{1}[a-zA-Z0-9]+[_]{1}/' =>
                    [
                        "pattern" => "_",
                        "openTag" => "<i>",
                        "closeTag" => "</i>"
                    ]
            ];

        $newText = [];
        foreach ($this->text as $key => $word) {
            foreach ($regexAY as $regex => $value) {
                if (preg_match($regex, $word)) {
                    $pattern = $regexAY[$regex]["pattern"];
                    $openTag = $regexAY[$regex]["openTag"];
                    $closeTag = $regexAY[$regex]["closeTag"];
                    $word = preg_replace('/\\' .$pattern. '(.*?)\\' .$pattern. '/', $openTag. '$1' .$closeTag, $word); // /\*(.*?)\*/ OR /_(.*?)_/
                }
            }
            if ($word !== '') { array_push($newText, $word); }
        }

        return $this->text = implode(' ', $newText);
    }

    function download() {
        /*function br2nl() {
            return preg_replace('/\<br(\s*)?\/?\>/i', "\n", $this->text); // /\<br(\s*)?\/?\>/i
        }*/
        $this->text = preg_replace('/\<br(\s*)?\/?\>/i', "\n", $this->text);
        $this->text = explode(' ', $this->text);
        $regexAY =
            [
                '/<b>[a-zA-Z0-9]+<\/b>/' =>
                    [
                        "pattern" => ["/<b>/", "/<\/b>/"],
                        "replacement" => "*"
                    ],
                '/<i>[a-zA-Z0-9]+<\/i>/' =>
                    [
                        "pattern" => ["/<i>/", "/<\/i>/"],
                        "replacement" => "_"
                    ]
            ];

        $newText = [];
        foreach ($this->text as $key => $word) {
            foreach ($regexAY as $regex => $value) {
                if (preg_match($regex, $word)) {
                    $word = preg_replace($regexAY[$regex]["pattern"], $regexAY[$regex]["replacement"], $word);
                }
            }
            if ($word !== '') { array_push($newText, $word); }
        }

        return $this->text = implode(' ', $newText);
    }
}


$text = "     This _is_ _just_ _a test_
       *text*
       so     _don't_
       consider       it just   *read*
      it";
$a = new txtFormatting($text);
echo $a->upload()."\n";

$text = "This <i>is</i> <i>just</i> _a test_<br /> <b>text</b><br /> so <i>don't</i><br /> consider it just <b>read</b><br /> it";
$b = new txtFormatting($text);
echo $b->download()."\n";

JanErikGunnar · Accepted Answer · 2017-01-23 14:11:50Z


Method names should reflect what the methods do: call them something like encode (instead of upload) and decode (instead of download).
You should not store encoded information in the database. Consider, for example, that you may want to support other output formats in the future (e.g. a PDF), or change how * is rendered. You would then have all this HTML embedded in your data and would have to decode it. Instead, store the original, unencoded data in the table, return the unencoded data when it is needed for editing, and encode the data only just before you need it in the encoded format. This way there is also no need for a decode (download) method. During user input, you should only have to make sure the data is valid according to business rules.
Based on the example, it seems there is no <i> if _ is around multiple words? (Sorry, I'm no regex expert.)
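
To make the second point concrete, here is a minimal sketch of encoding at output time only. The function name encodeForHtml is an assumption made for illustration, not something from the question, and the patterns keep the question's single-word rule. It escapes the raw text first, so HTML typed by the user is displayed rather than interpreted, and only then adds the tags.

```php
<?php
/**
 * Hypothetical helper (the name is an assumption): turns the raw text
 * stored in the database into HTML right before it is displayed.
 */
function encodeForHtml(string $raw): string
{
    // Escape first, so any HTML typed by the user is shown as plain text.
    $html = htmlspecialchars($raw, ENT_QUOTES, 'UTF-8');

    // *word* -> <b>word</b> and _word_ -> <i>word</i>.
    // Single alphanumeric words only, as in the question.
    $html = preg_replace('/\*([A-Za-z0-9]+)\*/', '<b>$1</b>', $html);
    $html = preg_replace('/_([A-Za-z0-9]+)_/', '<i>$1</i>', $html);

    // Insert <br /> before each line break as the last step.
    return nl2br($html);
}

echo encodeForHtml("This _is_ *bold* <script>\nnext"), "\n";
```

Because the raw text is what gets stored, changing these patterns later (for example to allow several words between the markers) affects every existing row without any data migration.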


Your note about the _ is completely correct, and also applies to * – Vogel612 14 hours ago

Precious tips! I just didn't get what you mean by RAW data; or rather, I have already heard and read something about it, but my searches never turned up anything useful. Can you briefly explain to me what it is? (For the 3rd point: yes, that's correct, at this stage I wanted my regex to allow text formatting just for single words) – brigo 13 hours ago

@brigo Basically, in general, one has the actual, unencoded business data, the data that users input and edit. This is the only data that should be in databases, and it is what I called raw data here (which was probably not the most accurate term). After later loading this data from the database, one would transform it (i.e. encode, escape, truncate, add formatting etc.) depending on the context in which it is used. (Though the data should of course be validated against business and integrity rules before being inserted into the table, i.e. make sure it's not empty etc.) – JanErikGunnar 13 hours ago

@JanErikGunnar OK, now I understand what you mean by raw data: it's the user-provided data, "untouched" by any modification. In this case it would be, for example, "Hi, I'm user n. 43240 and I like ice creams"; right? Isn't it basically the same to save the data with the "more standardized" HTML tags, so that I wouldn't need any decoding when displaying the description to other users, and would need it only in the less common case in which user n. 43240 wants to modify it? (I'm really interested in DB management, so I really appreciate knowing your point of view) – brigo 11 hours ago

I think there is still a little confusion :) Encode = replacing asterisks with HTML etc. You encode when you need the text to be HTML. When the user is editing it, you send what is in the database without any encoding or decoding. You never decode it. – JanErikGunnar 11 hours ago
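
The flow described in these comments could be sketched roughly as follows. Everything here is an assumption made for illustration: the array stands in for a database table, and the function names are hypothetical. The point is only that the stored text is never rewritten; each output context derives its own form from the same raw string.

```php
<?php
// Stand-in for a database table; a real application would use PDO or similar.
$descriptions = [];

// Save: validate against business rules, then store the text exactly as typed.
function saveDescription(array &$store, int $userId, string $raw): void
{
    if (trim($raw) === '') {
        throw new InvalidArgumentException('Description must not be empty.');
    }
    $store[$userId] = $raw;
}

// Edit form: hand back the stored text untouched, so no decode step exists.
function loadForEditing(array $store, int $userId): string
{
    return $store[$userId];
}

// HTML page: escape and encode only now, at output time.
function renderAsHtml(array $store, int $userId): string
{
    $html = htmlspecialchars($store[$userId], ENT_QUOTES, 'UTF-8');
    $html = preg_replace(
        ['/\*([A-Za-z0-9]+)\*/', '/_([A-Za-z0-9]+)_/'],
        ['<b>$1</b>', '<i>$1</i>'],
        $html
    );
    return nl2br($html);
}

// Plain-text context (e.g. an e-mail body): drop the markers instead.
function renderAsPlainText(array $store, int $userId): string
{
    return preg_replace(
        ['/\*([A-Za-z0-9]+)\*/', '/_([A-Za-z0-9]+)_/'],
        '$1',
        $store[$userId]
    );
}

saveDescription($descriptions, 43240, "I like *ice* _creams_");
echo loadForEditing($descriptions, 43240), "\n";    // raw text, as typed
echo renderAsHtml($descriptions, 43240), "\n";      // HTML for display
echo renderAsPlainText($descriptions, 43240), "\n"; // markers removed
```

Note that when the raw text is echoed into a textarea on the edit page it still needs htmlspecialchars, like any other value written into HTML; that is ordinary output escaping, not a decode of stored markup.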
