detect url in caption ? #544

Open
sefidpardazesh opened this issue Jun 21, 2017 · 14 comments
Labels
bug

Comments

@sefidpardazesh sefidpardazesh commented Jun 21, 2017

In the Telegram Bot API, text messages carry entity types that let us detect url, mention, and text_mention. But for a photo or video with a caption, how do we detect a URL or a mention? In other words, how can we use entity types in the caption of a photo or video?

Collaborator

@jacklul jacklul commented Jun 21, 2017

Entities are there only for the case of updating messages (that are either HTML-formatted or use Markdown), so that they can be reformatted properly.

There is no such thing for captions; you will have to write a regex for this...

Author

@sefidpardazesh sefidpardazesh commented Jun 21, 2017

Thanks.
What is the regex for mention and text_mention?
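
For anyone landing here with the same question, a rough sketch of such regexes follows. It is only an illustration: `findCaptionLinksAndMentions()` is a hypothetical helper name and not part of the library, the URL pattern is deliberately simplistic (it will swallow trailing punctuation), and the username shape (5 to 32 letters, digits or underscores) is an assumption about Telegram usernames. A text_mention has no textual marker in the caption, so a regex cannot recover it.

```php
<?php
// Hypothetical helper, not part of php-telegram-bot.
function findCaptionLinksAndMentions(string $caption): array
{
    // Simplified URL pattern: "http://" or "https://" followed by non-whitespace.
    preg_match_all('~https?://\S+~iu', $caption, $urls);

    // Assumed username shape: "@" plus 5-32 letters, digits or underscores,
    // not preceded by a word character (which would look like an e-mail address).
    preg_match_all('/(?<![\w@])@[A-Za-z0-9_]{5,32}\b/u', $caption, $mentions);

    return ['urls' => $urls[0], 'mentions' => $mentions[0]];
}

$found = findCaptionLinksAndMentions('Photo by @some_user, see https://example.com/a?b=1 now');
```

Here `$found['urls']` holds the link and `$found['mentions']` holds `@some_user`; an address like `foo@example.com` is not reported as a mention.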

@KilluaFein KilluaFein commented Jul 13, 2017

Entities are there only for the case of updating messages (that are either HTML-formatted or use Markdown), so that they can be reformatted properly.

@jacklul I'm trying to reformat an edited message, but without success. How can I use the entities to properly reformat?

Collaborator

@jacklul jacklul commented Jul 13, 2017

@KilluaFein proof of concept:

    private function parseEntitiesString($text, $entities)
    {
        $global_incr = 0;
        foreach ($entities as $entity) {
            if ($entity->getType() == 'italic') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '_', $start, 0);
                $text = $this->mb_substr_replace($text, '_', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'bold') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '*', $start, 0);
                $text = $this->mb_substr_replace($text, '*', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'code') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '`', $start, 0);
                $text = $this->mb_substr_replace($text, '`', $end, 0);

                $global_incr = $global_incr + 2;
            } elseif ($entity->getType() == 'pre') {
                $start = $global_incr + $entity->getOffset();
                $end = 3 + $start + $entity->getLength();

                $text = $this->mb_substr_replace($text, '```', $start, 0);
                $text = $this->mb_substr_replace($text, '```', $end, 0);

                $global_incr = $global_incr + 6;
            } elseif ($entity->getType() == 'text_link') {
                $start = $global_incr + $entity->getOffset();
                $end = 1 + $start + $entity->getLength();
                $url = '(' . $entity->getUrl() . ')';

                $text = $this->mb_substr_replace($text, '[', $start, 0);
                $text = $this->mb_substr_replace($text, ']' . $url, $end, 0);

                $global_incr = $global_incr + 2 + mb_strlen($url);
            }
            // Any other entity type (url, mention, ...) needs no markup and is left as-is.
        }

        return $text;
    }

I never managed to make it work in 100% of cases.
Multibyte characters break offsets.

@KilluaFein KilluaFein commented Jul 13, 2017

Multibyte characters break offsets.

Like emoji, right?

and what is mb_substr_replace()?

@KilluaFein KilluaFein commented Jul 13, 2017

offset and length are counted in UTF-16 code units; maybe converting them to UTF-8 positions is a way to solve this?
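
To make the mismatch concrete, here is a small sketch. The helper names `utf16Length()` and `utf16Substr()` are made up for this example; the idea is simply that an emoji outside the Basic Multilingual Plane counts as one character for `mb_substr()` but as two UTF-16 code units for Telegram, and that slicing a UTF-16LE copy of the text sidesteps the difference.

```php
<?php
// Length of a UTF-8 string in UTF-16 code units, the unit Telegram uses for offset/length.
function utf16Length(string $text): int
{
    return intdiv(strlen(mb_convert_encoding($text, 'UTF-16LE', 'UTF-8')), 2);
}

// Cut a slice out of a UTF-8 string using an offset and length given in UTF-16 code units.
function utf16Substr(string $text, int $offset, int $length): string
{
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');

    // Every UTF-16 code unit is exactly 2 bytes, so plain substr() is safe here.
    return mb_convert_encoding(substr($utf16, $offset * 2, $length * 2), 'UTF-8', 'UTF-16LE');
}

$text = "\u{1F600} bold"; // Emoji, space, "bold": the word starts at UTF-16 offset 3.

$by_code_points = mb_substr($text, 3, 4, 'UTF-8'); // "old": shifted by the emoji
$by_utf16_units = utf16Substr($text, 3, 4);        // "bold"
```

The string has 6 characters according to `mb_strlen()` but 7 UTF-16 code units, which is exactly the one-position shift seen in the slices.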

Collaborator

@jacklul jacklul commented Jul 13, 2017

mb_XXX functions are for multi-byte strings (that is what "mb" stands for, I guess).

I spent a lot of time thinking about this and NEVER found a way to get it to work properly.
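
As far as I know, PHP's mbstring extension does not ship an `mb_substr_replace()` of its own, so the method called in the proof of concept is presumably a user-defined helper. A minimal sketch of what such a helper could look like is below; the name `mbSubstrReplace()` and its signature are assumptions made for illustration, and negative positions are not handled.

```php
<?php
// Hypothetical stand-in for the mb_substr_replace() method used above.
// Positions are counted in characters (code points), not bytes.
function mbSubstrReplace(string $string, string $replacement, int $start, int $length = 0): string
{
    return mb_substr($string, 0, $start, 'UTF-8')
        . $replacement
        . mb_substr($string, $start + $length, null, 'UTF-8');
}

$inserted = mbSubstrReplace('héllo', '_', 1);    // insert without removing anything
$replaced = mbSubstrReplace('héllo', 'e', 1, 1); // replace one character
```

With a length of 0 the helper only inserts, which is how the proof of concept uses it to add the Markdown markers.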

@f77 f77 commented Mar 13, 2018

    public static function processEntities (string $_text, array $_message_raw): string
    {
        $preset = [
            'bold'      => '<b>%text</b>',
            'italic'    => '<i>%text</i>',
            'text_link' => '<a href="%url">%text</a>',
            'code'      => '<code>%text</code>',
            'pre'       => '<pre>%text</pre>',
        ];

        if (!isset ($_message_raw['entities']))
        {
            return $_text;
        }

        $iterationText = $_text;
        $globalDiff    = 0;
        foreach ($_message_raw['entities'] as $entity)
        {
            $type   = $entity['type'];
            $offset = $entity['offset'] + $globalDiff;
            $length = $entity['length'];

            $pBefore = \mb_substr ($iterationText, 0, $offset);
            $pText   = \mb_substr ($iterationText, $offset, $length);
            $pAfter  = \mb_substr ($iterationText, ($offset + $length));

            // Note: str_replace() is byte-based, but it is safe on valid UTF-8 strings
            // because a UTF-8 needle can only match on character boundaries.
            if (isset ($preset[$type]))
            {
                // Get pattern from the preset.
                $replacedContent = $preset[$type];

                // Replace the URL first: in the rare case that the entity text itself
                // contains "%url", that literal text is then left alone.
                if (!empty ($entity['url']))
                {
                    $replacedContent = \str_replace ('%url', $entity['url'], $replacedContent);
                }

                // Replace main text.
                $replacedContent = \str_replace ('%text', $pText, $replacedContent);

                $newText       = $pBefore . $replacedContent . $pAfter;
                $globalDiff    += (\mb_strlen ($newText) - \mb_strlen ($iterationText));
                $iterationText = $newText;
            }
        }

        return $iterationText;
    }
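
One caveat with the snippet above: the entity text is put between the HTML tags as-is. If the result is sent back with parse_mode HTML, characters such as <, > and & in the text need to be escaped first. A possible way to fill the templates with escaping is sketched below; `fillHtmlTemplate()` is a hypothetical name, and the text outside the entities would need the same treatment, which this sketch does not cover.

```php
<?php
// Hypothetical helper: fill a template such as '<b>%text</b>' or '<a href="%url">%text</a>'.
function fillHtmlTemplate(string $template, string $text, string $url = ''): string
{
    $flags = ENT_QUOTES | ENT_SUBSTITUTE;

    // strtr() replaces all placeholders in one pass and never rescans inserted text,
    // so a literal "%url" or "%text" inside the user's text is left untouched.
    return strtr($template, [
        '%text' => htmlspecialchars($text, $flags, 'UTF-8'),
        '%url'  => htmlspecialchars($url, $flags, 'UTF-8'),
    ]);
}

$bold = fillHtmlTemplate('<b>%text</b>', '1 < 2 & 3');
$link = fillHtmlTemplate('<a href="%url">%text</a>', 'see %url', 'https://example.com/?a=1&b=2');
```

Using `strtr()` with an array also removes the need to care about the order in which the placeholders are replaced.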
@akalongman akalongman added the bug label May 10, 2018
Member

@akalongman akalongman commented May 10, 2018

@jacklul what is the actual problem? And how can it be reproduced?

Collaborator

@jacklul jacklul commented May 10, 2018

I believe the point of this issue is to have a way to edit and reformat messages using the entities field. Because the message text itself does not contain formatting, we have to use the 'entities' field for that. I never managed to create a function that could parse it and put the formatting back into the message string correctly, because of multibyte strings...

One of the simplest examples would be a button under a message that removes or adds text to the message while keeping the message contents (and that content cannot be obtained or generated in any other way than by grabbing it from the Message object).

@rsdsoft rsdsoft commented Dec 5, 2019

Any news on this issue? Emojis + text formatting using entities info (offset, length)

Member

@noplanman noplanman commented Dec 8, 2019

I have a working version (I think), needs some further testing and then I'll release it 👍

Member

@noplanman noplanman commented Dec 12, 2019

My latest experiment, which I'll pack into a small package when it works 100%.

Try the class below, and use it like:

$entity_decoder = new EntityDecoder($message, 'markdown'); // or 'html'
$decoded_text   = $entity_decoder->decode();

<?php

use Longman\TelegramBot\Entities\Message;
use Longman\TelegramBot\Entities\MessageEntity;

class EntityDecoder
{
    private $entities;
    private $text;
    private $style;
    private $without_cmd;
    private $offset_correction = 0;

    /**
     * @param Message $message     Message object to reconstruct Entities from.
     * @param string  $style       Either 'html' or 'markdown'.
     * @param bool    $without_cmd If the bot command should be included or not.
     */
    public function __construct(Message $message, string $style = 'html', bool $without_cmd = false)
    {
        $this->entities    = $message->getEntities();
        $this->text        = $message->getText($without_cmd);
        $this->style       = $style;
        $this->without_cmd = $without_cmd;
    }

    public function decode(): string
    {
        if (empty($this->entities)) {
            return $this->text;
        }

        $this->fixBotCommandEntity();

        // Reverse entities and start replacing bits from the back, to preserve offset positions.
        foreach (array_reverse($this->entities) as $entity) {
            $this->text = $this->decodeEntity($entity, $this->text);
        }

        return $this->text;
    }

    protected function fixBotCommandEntity(): void
    {
        // First entity would be the bot command, remove if necessary.
        $first_entity = reset($this->entities);
        if ($this->without_cmd && $first_entity->getType() === 'bot_command') {
            $this->offset_correction = ($first_entity->getLength() + 1);
            array_shift($this->entities);
        }
    }

    /**
     * @param MessageEntity $entity
     *
     * @return array
     */
    protected function getOffsetAndLength(MessageEntity $entity): array
    {
        // Telegram counts offset and length in UTF-16 code units, while mb_substr()
        // counts code points, so translate both by walking the text character by character.
        // https://www.php.net/manual/en/function.str-split.php#115703
        $chars = preg_split('/(.)/us', $this->text, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

        $utf16_start = $entity->getOffset() - $this->offset_correction;
        $utf16_end   = $utf16_start + $entity->getLength();

        $offset = 0;
        $length = 0;
        $units  = 0;
        foreach ($chars as $char) {
            if ($units >= $utf16_end) {
                break;
            }
            if ($units < $utf16_start) {
                $offset++;
            } else {
                $length++;
            }
            // 1 unit for most characters, 2 for those stored as a surrogate pair, like Emojis.
            $units += strlen(mb_convert_encoding($char, 'UTF-16LE', 'UTF-8')) / 2;
        }

        return [$offset, $length];
    }

    /**
     * @param string $style
     * @param string $type
     *
     * @return string
     */
    protected function getFiller(string $style, string $type): string
    {
        $fillers = [
            'html'     => [
                'text_mention' => '<a href="tg://user?id=%2$s">%1$s</a>',
                'text_link'    => '<a href="%2$s">%1$s</a>',
                'bold'         => '<b>%s</b>',
                'italic'       => '<i>%s</i>',
                'code'         => '<code>%s</code>',
                'pre'          => '<pre>%s</pre>',
            ],
            'markdown' => [
                'text_mention' => '[%1$s](tg://user?id=%2$s)',
                'text_link'    => '[%1$s](%2$s)',
                'bold'         => '*%s*',
                'italic'       => '_%s_',
                'code'         => '`%s`',
                'pre'          => '```%s```',
            ],
        ];

        return $fillers[$style][$type] ?? '';
    }

    /**
     * Decode an entity into the passed string.
     *
     * @param MessageEntity $entity
     * @param string        $text
     *
     * @return string
     */
    private function decodeEntity(MessageEntity $entity, string $text): string
    {
        [$offset, $length] = $this->getOffsetAndLength($entity);

        $text_bit = $this->getTextBit($entity, $offset, $length);

        // Replace text bit.
        return mb_substr($text, 0, $offset) . $text_bit . mb_substr($text, $offset + $length);
    }

    /**
     * @param MessageEntity $entity
     * @param int           $offset
     * @param int           $length
     *
     * @return string
     */
    private function getTextBit(MessageEntity $entity, $offset, $length)
    {
        $type     = $entity->getType();
        $filler   = $this->getFiller($this->style, $type);
        $text_bit = mb_substr($this->text, $offset, $length);

        switch ($type) {
            case 'text_mention':
                $text_bit = sprintf($filler, $text_bit, $entity->getUser()->getId());
                break;
            case 'text_link':
                $text_bit = sprintf($filler, $text_bit, $entity->getUrl());
                break;
            case 'bold':
            case 'italic':
            case 'code':
            case 'pre':
                $text_bit = sprintf($filler, $text_bit);
                break;
            default:
                break;
        }

        return $text_bit;
    }
}
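
For comparison, a different way to deal with the offsets is to convert the whole text to UTF-16LE once, so that Telegram's offsets become plain byte positions (units times 2). The sketch below is not the class above and not a library API: `entitiesToMarkdown()` is a hypothetical function that works on raw entity arrays with 'type', 'offset' and 'length' keys, assumes the entities do not overlap, and does not escape Markdown special characters.

```php
<?php
// Hypothetical helper working on raw arrays; assumes non-overlapping entities.
function entitiesToMarkdown(string $text, array $entities): string
{
    $wrappers = ['bold' => '*%s*', 'italic' => '_%s_', 'code' => '`%s`'];

    // In UTF-16LE every code unit is 2 bytes, so offsets map directly to byte positions.
    $utf16 = mb_convert_encoding($text, 'UTF-16LE', 'UTF-8');

    // Highest offset first, so earlier offsets stay valid while the string grows.
    usort($entities, function (array $a, array $b): int {
        return $b['offset'] <=> $a['offset'];
    });

    foreach ($entities as $entity) {
        if (!isset($wrappers[$entity['type']])) {
            continue; // Unknown or markup-free entity types are skipped.
        }
        $start  = $entity['offset'] * 2;
        $length = $entity['length'] * 2;

        $inner   = mb_convert_encoding(substr($utf16, $start, $length), 'UTF-8', 'UTF-16LE');
        $wrapped = sprintf($wrappers[$entity['type']], $inner);

        $utf16 = substr_replace($utf16, mb_convert_encoding($wrapped, 'UTF-16LE', 'UTF-8'), $start, $length);
    }

    return mb_convert_encoding($utf16, 'UTF-8', 'UTF-16LE');
}

$text     = "\u{1F600} bold and italic \u{1F600}";
$entities = [
    ['type' => 'bold', 'offset' => 3, 'length' => 4],
    ['type' => 'italic', 'offset' => 12, 'length' => 6],
];
$markdown = entitiesToMarkdown($text, $entities);
```

The offsets in the example are the UTF-16 positions of "bold" and "italic" in that string, with the leading emoji taking two code units.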

@rsdsoft rsdsoft commented Dec 12, 2019

My latest experiment, which I'll pack into a small package when it works 100%.

Tested, and I do not see any problems. Lots of emojis and different formatting work OK at first glance.

7 participants