I fetch a page's HTML source via phpQuery, then extract the following string from a script tag in the head with a PHP regex:

var BASE_DATA = {
userInfo: {
  id: 0,
  userName: 'no-needed',
  avatarUrl: 'no-needed',
  isPgc: false,
  isOwner: false
},
headerInfo: {
  id: 0,
  isPgc: false,
  userName: 'no-needed',
  avatarUrl: 'no-needed',
  isHomePage: false,
  crumbTag: 'no-needed',
  hasBar: true
},
articleInfo: 
{
  title: 'needed',
  content: 'needed',
  groupId: 'needed',
  itemId: 'needed',
  type: 1,
  subInfo: {
    isOriginal: false,
    source: 'needed',
    time: 'needed'
  },
  tagInfo: {
    tags: [{"name":"no-needed 1"},{"name":"no-needed 2"},{"name":"no-needed 3"}],
    groupId: 'no-needed',
    itemId: 'no-needed',
    repin: 0,
  },
  has_extern_link: 0,
  coverImg: 'no-needed'
},
commentInfo:
{
  groupId: 'no-needed',
  itemId: 'no-needed',
  comments_count: 151,
  ban_comment: 0
},};

I want to convert this string to a PHP array, like:

$base_data = array(
'articleInfo' => array(
    'title' => 'needed',
    'content' => 'needed',
    'groupId' => 'needed',
    'itemId' => 'needed',
    'subInfo' => array(
        'source' => 'needed',
        'time' => 'needed',
    ),
));

or

$base_data = array(
'title' => 'needed',
'content' => 'needed',
'groupId' => 'needed',
'itemId' => 'needed',
'subInfo' => array(
    'source' => 'needed',
    'time' => 'needed',
),);

I have already tried several approaches, such as json_decode and extracting the content between the braces with a PHP regex and preg_match_all, but none of them worked well.

I tried two ways.

The first way:

$json = str_ireplace(array('var BASE_DATA =', '};'), array('', '}'), $js);
json_decode($json, true);
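
To see why this first attempt fails, it can help to ask PHP what json_decode objected to. A minimal sketch, assuming a shortened sample string ($js below is a hypothetical stand-in for the scraped script text):

```php
<?php
// Hypothetical, shortened stand-in for the scraped script text.
$js = "var BASE_DATA = {\narticleInfo: {\n  title: 'needed',\n},};";

// The same cleanup as the first attempt above.
$json = str_ireplace(array('var BASE_DATA =', '};'), array('', '}'), $js);

$result = json_decode($json, true);

// json_decode() returns null on failure; json_last_error_msg() describes the failure.
var_dump($result);
echo json_last_error_msg(), "\n";
```

The string is a JavaScript object literal rather than JSON: the keys are unquoted, the strings use single quotes, and there are trailing commas, all of which JSON rejects.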

The second way:

preg_match_all('/\{([^}]+)\}/', $js, $matches);
print_r($matches[1]);

or

preg_match_all('/articleInfo:\s*\{([^}]+)\}/', $script_text, $matches);
print_r($matches[1][0]);

This seems close, but the result is still not right: I would still have to parse the string inside the articleInfo part, which is why I posted this question.

I even considered using the V8 JavaScript engine, but...

Does anyone know a better way to do this?
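
The patterns above use [^}]+, which stops at the first closing brace, so the nested subInfo block cuts the match short. A minimal sketch of one way around that, using a recursive PCRE pattern to keep braces balanced (the sample in $js is a hypothetical, shortened stand-in for the real script text):

```php
<?php
// Hypothetical, shortened stand-in for the scraped script text.
$js = <<<'JS'
var BASE_DATA = {
articleInfo:
{
  title: 'needed',
  subInfo: {
    source: 'needed',
    time: 'needed'
  },
  coverImg: 'no-needed'
},
commentInfo:
{
  comments_count: 151
},};
JS;

// (?1) recurses into capture group 1, so nested { ... } pairs stay balanced.
$pattern = '/articleInfo:\s*(\{(?:[^{}]++|(?1))*\})/';

if (preg_match($pattern, $js, $matches)) {
    $articleInfo = $matches[1]; // the whole block, including subInfo
    echo $articleInfo, "\n";
}
```

This only isolates the articleInfo block; it is still a JavaScript literal, not JSON. It also assumes that no string value contains a { or } character, which would unbalance the match.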

  • If the source string includes the var BASE_DATA = part, you're not going to be able to json_decode that until you strip it off. The trailing comma at the end is also an issue, and it may or may not complain about the very last semi-colon. Not sure on that front how forgiving json_decode is. Commented Apr 16, 2019 at 22:47
  • Yes, I already found this issue, so I tried str_ireplace(array('var BASE_DATA =', '};'), array('', '}'), $js); but I don't know what to do next. I also tried preg_match_all('/\{([^}]+)\}/', $js, $matches); it seems close, but the result is still not right: I would still have to parse the string inside the articleInfo part, which is why I posted this question. Commented Apr 16, 2019 at 22:53
  • "I already tried with many ways" do tell... Commented Apr 16, 2019 at 23:01
  • @miken32 I edited, please check Commented Apr 16, 2019 at 23:12

2 Answers


I had to reformat your JSON which was not valid (checked on https://jsonlint.com/).

I deliberately used multiple str_replace() calls so the process is easier to follow; you can optimize the code below by passing arrays to a single str_replace() call to make several replacements at once.

This works:

<?php

$to_decode = "var BASE_DATA = {
userInfo: {
  id: 0,
  userName: 'no-needed',
  avatarUrl: 'no-needed',
  isPgc: false,
  isOwner: false
},
headerInfo: {
  id: 0,
  isPgc: false,
  userName: 'no-needed',
  avatarUrl: 'no-needed',
  isHomePage: false,
  crumbTag: 'no-needed',
  hasBar: true
},
articleInfo: 
{
  title: 'needed',
  content: 'needed',
  groupId: 'needed',
  itemId: 'needed',
  type: 1,
  subInfo: {
    isOriginal: false,
    source: 'needed',
    time: 'needed'
  },
  tagInfo: {
    tags: [{\"name\":\"no-needed 1\"},{\"name\":\"no-needed 2\"},{\"name\":\"no-needed 3\"}],
    groupId: 'no-needed',
    itemId: 'no-needed',
    repin: 0,
  },
  has_extern_link: 0,
  coverImg: 'no-needed'
},
commentInfo:
{
  groupId: 'no-needed',
  itemId: 'no-needed',
  comments_count: 151,
  ban_comment: 0
},};";

/* Clean JSON and encapsulate in brackets */
$to_decode = str_replace('var BASE_DATA = {', '', $to_decode);
$to_decode = '{'.substr($to_decode, 0, -3).'}';

/* Remove spaces, tabs, new lines, etc. */
$to_decode = str_replace(' ', '', $to_decode);
$to_decode = str_replace("\n", '', $to_decode);
$to_decode = str_replace("\t", '', $to_decode);
$to_decode = str_replace("\r", '', $to_decode);

/* Encapsulate keys with quotes */
$to_decode = preg_replace('/([a-z_]+)\:/ui', '"{$1}":', $to_decode);
$to_decode = str_replace('"{', '"', $to_decode);
$to_decode = str_replace('}"', '"', $to_decode);
$to_decode = str_replace('\'', '"', $to_decode);

/* Remove unnecessary trailing commas */
$to_decode = str_replace(',}', '}', $to_decode);

echo '<pre>';
var_dump(json_decode($to_decode));

Result using print_r:

(I added true/false for clarity; otherwise these only show when using var_dump())

stdClass Object
(
    [userInfo] => stdClass Object
        (
            [id] => 0
            [userName] => no-needed
            [avatarUrl] => no-needed
            [isPgc] => false
            [isOwner] => false
        )

    [headerInfo] => stdClass Object
        (
            [id] => 0
            [isPgc] => false
            [userName] => no-needed
            [avatarUrl] => no-needed
            [isHomePage] => false
            [crumbTag] => no-needed
            [hasBar] => true
        )

    [articleInfo] => stdClass Object
        (
            [title] => needed
            [content] => needed
            [groupId] => needed
            [itemId] => needed
            [type] => 1
            [subInfo] => stdClass Object
                (
                    [isOriginal] => false
                    [source] => needed
                    [time] => needed
                )

            [tagInfo] => stdClass Object
                (
                    [tags] => Array
                        (
                            [0] => stdClass Object
                                (
                                    [name] => no-needed1
                                )

                            [1] => stdClass Object
                                (
                                    [name] => no-needed2
                                )

                            [2] => stdClass Object
                                (
                                    [name] => no-needed3
                                )

                        )

                    [groupId] => no-needed
                    [itemId] => no-needed
                    [repin] => 0
                )

            [has_extern_link] => 0
            [coverImg] => no-needed
        )

    [commentInfo] => stdClass Object
        (
            [groupId] => no-needed
            [itemId] => no-needed
            [comments_count] => 151
            [ban_comment] => 0
        )

)
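
The question asked for a PHP array holding only the needed keys, whereas the dump above shows stdClass objects. A minimal sketch of that last step, assuming the string has already been cleaned into valid JSON ($clean_json is a hypothetical, shortened example, and $wanted is an illustrative name):

```php
<?php
// Hypothetical, shortened example of the already-cleaned JSON string.
$clean_json = '{"articleInfo":{"title":"needed","content":"needed","groupId":"needed",'
    . '"itemId":"needed","type":1,"subInfo":{"isOriginal":false,"source":"needed","time":"needed"}}}';

// Passing true as the second argument returns associative arrays instead of objects.
$data = json_decode($clean_json, true);
$article = $data['articleInfo'];

// Keep only the wanted keys at each level.
$wanted = array('title', 'content', 'groupId', 'itemId');
$base_data = array_intersect_key($article, array_flip($wanted));
$base_data['subInfo'] = array_intersect_key(
    $article['subInfo'],
    array_flip(array('source', 'time'))
);

print_r($base_data);
```

array_intersect_key() keeps the entries of its first argument whose keys also appear in the second, so type and isOriginal are dropped here.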

Thanks @Bruno Leveque for your idea.

I edited your code as follows so that it runs well:

  1. I changed $to_decode = str_replace(' ', '', $to_decode); to $to_decode = preg_replace('/[\n| |\s]{2,}/',' ',$to_decode);, so that any run of two or more whitespace characters becomes a single space, because sometimes the spaces are needed, like: content: '

  2. I added $to_decode = str_replace("'", '"', $to_decode); before the /* Encapsulate keys with quotes */ comment.

  3. I changed $to_decode = preg_replace('/([a-z_]+)\:/ui', '"{$1}":', $to_decode); to $to_decode = preg_replace('/([a-z_]+)\: /ui', '"$1":', $to_decode); (with one extra space after the colon), and commented out //$to_decode = str_replace('"{', '"', $to_decode); and //$to_decode = str_replace('}"', '"', $to_decode);

  4. I added one more line: $to_decode = str_replace(", }", '}', $to_decode);
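
One possible way to combine the four changes above with the first answer's code is sketched below. This is a reconstruction under assumptions, not the exact final code: the sample data is a hypothetical, shortened stand-in, the whitespace pattern is simplified to \s{2,}, and the key pattern uses \s rather than a literal space so that a key followed directly by a newline (like commentInfo in the question) is also quoted.

```php
<?php
// Hypothetical, shortened sample in the same shape as the question's data.
$to_decode = <<<'JS'
var BASE_DATA = {
articleInfo:
{
  title: 'needed title',
  subInfo: {
    source: 'needed',
    time: 'needed'
  },
  tagInfo: {
    tags: [{"name":"no-needed 1"}],
    repin: 0,
  },
  coverImg: 'no-needed'
},
commentInfo:
{
  comments_count: 151,
  ban_comment: 0
},};
JS;

/* Clean JSON and encapsulate in brackets (unchanged from the first answer) */
$to_decode = str_replace('var BASE_DATA = {', '', $to_decode);
$to_decode = '{' . substr($to_decode, 0, -3) . '}';

/* 1. Collapse runs of whitespace instead of deleting every space */
$to_decode = preg_replace('/\s{2,}/', ' ', $to_decode);

/* 2. Convert single quotes to double quotes */
$to_decode = str_replace("'", '"', $to_decode);

/* 3. Quote keys. ASSUMPTION: \s instead of a literal space after the colon */
$to_decode = preg_replace('/([a-z_]+)\:\s/ui', '"$1":', $to_decode);

/* 4. Remove trailing commas, with or without a space before the brace */
$to_decode = str_replace(', }', '}', $to_decode);
$to_decode = str_replace(',}', '}', $to_decode);

$data = json_decode($to_decode, true);
print_r($data['articleInfo']);
```

Because every step is a plain text replacement, values that themselves contain single quotes, or a word followed by a colon and whitespace, would still be rewritten incorrectly.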

My final code was posted as an image (not reproduced here).

@Bruno Leveque could not know the exact content of the "needed" and "no-needed" values, so thank you for the idea.

There seems to be no perfect way...
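
One alternative to chained replacements is to scan the literal character by character and track whether the scanner is inside a string, so that colons, commas, braces and spaces inside values are left alone. The sketch below is only that, a sketch: js_literal_to_json() is a hypothetical helper, not a library function, and it only handles unquoted keys, single-quoted strings and trailing commas (no comments, no \x escapes, no raw newlines inside strings).

```php
<?php
/**
 * Hypothetical helper: convert a simple JavaScript object literal to JSON text.
 */
function js_literal_to_json(string $src): string
{
    $out = '';
    $len = strlen($src);
    $i = 0;
    while ($i < $len) {
        $ch = $src[$i];
        if ($ch === "'" || $ch === '"') {
            // Copy a string literal, re-quoting it with double quotes.
            $out .= '"';
            $i++;
            while ($i < $len && $src[$i] !== $ch) {
                if ($src[$i] === '\\' && $i + 1 < $len) {
                    // Keep escape sequences, except \' which JSON does not define.
                    $out .= ($src[$i + 1] === "'") ? "'" : '\\' . $src[$i + 1];
                    $i += 2;
                } elseif ($src[$i] === '"') {
                    $out .= '\\"'; // a bare " inside a '...' string
                    $i++;
                } else {
                    $out .= $src[$i++];
                }
            }
            $out .= '"';
            $i++; // skip the closing quote
        } elseif (preg_match('/\G\d[\d.eE+\-]*/', $src, $m, 0, $i)) {
            // Numbers are copied unchanged.
            $out .= $m[0];
            $i += strlen($m[0]);
        } elseif (preg_match('/\G[A-Za-z_$][A-Za-z0-9_$]*/', $src, $m, 0, $i)) {
            // Bare word: a JSON keyword stays as is, anything else is taken as a key.
            $word = $m[0];
            $out .= in_array($word, array('true', 'false', 'null'), true) ? $word : '"' . $word . '"';
            $i += strlen($word);
        } elseif ($ch === ',' && preg_match('/\G,\s*[}\]]/', $src, $m, 0, $i)) {
            $i++; // drop a trailing comma before } or ]
        } else {
            $out .= $ch;
            $i++;
        }
    }
    return $out;
}

// Hypothetical sample with awkward values: colons, braces, quotes, a URL.
$js = <<<'JS'
var BASE_DATA = {
articleInfo:
{
  title: 'a title: with {braces}, commas, and a URL http://example.com/',
  content: '<p class="x">it\'s here</p>',
  type: 1,
  subInfo: {
    isOriginal: false,
    source: 'needed',
  },
},};
JS;

// Keep everything from the first { to the last }, dropping "var BASE_DATA =" and ";".
$start = strpos($js, '{');
$literal = substr($js, $start, strrpos($js, '}') - $start + 1);

$base_data = json_decode(js_literal_to_json($literal), true);
print_r($base_data['articleInfo']);
```

Passing true to json_decode() gives nested arrays, so $base_data['articleInfo'] can be used directly or trimmed down to the needed keys.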
