I'm trying to parse JSON records from a huge JSON file (1.9 GB), so I split it into 10 MB chunks (190 files).

To ease the process, I load 80 files at a time and put their lines into a list.

I use this to iterate through the 80 files:

for root, dirs, filenames in os.walk(path):
    for f in filenames:
        function below

and this is the code that builds the full path for each file name and reads the file:

dat = 'C:/Users/User/My Lab/Python/scripts/thesis/data_extractor/review/{file}'.format(file=f)
with open(dat) as data_file:
    for item in data_file:
        if len(item) > 1:
            dict_review.append(item)
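
As a rough sketch of the same loop without a hard-coded path, the directory walk and the file reading can be combined with os.path.join. The function name iter_nonblank_lines and its directory parameter are made up for illustration, and line.strip() is used in place of the len(item) > 1 check, so it also skips whitespace-only lines:

```python
import os


def iter_nonblank_lines(directory):
    """Yield every non-blank line from every file under `directory`.

    Hypothetical helper, not part of the original script.
    """
    for root, _dirs, filenames in os.walk(directory):
        for name in sorted(filenames):
            # os.walk yields bare file names; join with root to get a usable path.
            full_path = os.path.join(root, name)
            with open(full_path) as data_file:
                for line in data_file:
                    if line.strip():
                        yield line
```

Because this is a generator, the lines can be consumed one at a time, or collected with list(iter_nonblank_lines(path)) if a list is really needed.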

After that is done, I iterate over the list and parse each line using json.loads:

data = None
for row in dict_review:
    data = json.loads(row, 'utf-8')
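
One way to find out which row is at fault, sketched here under the assumption that each list entry holds one JSON document, is to catch the parse error per row and keep the repr of the offending line. The name parse_rows is hypothetical, and the 'utf-8' argument is left out because, as far as I know, newer Python versions do not accept it:

```python
import json


def parse_rows(rows):
    """Parse each JSON line; collect failures instead of stopping at the first one.

    Hypothetical helper for illustration only.
    """
    parsed, failures = [], []
    for index, row in enumerate(rows):
        try:
            parsed.append(json.loads(row))
        except ValueError as exc:  # json's decode errors are ValueError subclasses
            # Keep the index and a repr of the row so hidden characters are visible.
            failures.append((index, repr(row), str(exc)))
    return parsed, failures
```

If failures comes back empty, the rows themselves are valid JSON and the exception is being raised somewhere else.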

and that's where the exception happens:

Unexpected error:  <type 'exceptions.TypeError'>
Reason:  expected string or buffer

I tried casting the row to a string with str(row), but it still raises the same exception.

I wonder what I did wrong. Thanks!

SOLVED:

It was my mistake. The JSON was actually parsed properly; the problem occurred when I tried to remove all the funny characters with a regex. I changed

re.sub(r'[^\w]', ' ', data['votes'])
to
re.sub(r'[^\w]', ' ', str(data['votes']))

because re.sub expects a string, so I need to cast the object to a string first.

Thanks!

  • The first part could use glob and be for filename in glob.iglob("/some/path/*.ext"), for example. Commented Mar 26, 2016 at 5:46
  • Also, you posted only parts of your script, which leaves us without enough context. Where does dict_review come from, and what type is it? Is it a dict or a list? And what happens between the second and third code blocks? Commented Mar 26, 2016 at 5:48
  • Try to visualize the content of the row with print('row=%r' % row) before the json.loads... line. I'm sure you will be surprised. Commented Mar 26, 2016 at 6:03
  • @heltonbiker dict_review is a global variable, and it's a list. Everything is fine until it reaches the json.loads() call that user3159253 mentioned, and I found a "\n" inside one of the JSON files. I'm going to check it out now.
    – kenlz
    Commented Mar 26, 2016 at 6:10
  • @user3159253 Yes sir! I was surprised to find invalid JSON! Thanks!
    – kenlz
    Commented Mar 26, 2016 at 6:10

1 Answer

It was my mistake. The JSON was actually parsed properly; the problem occurred when I tried to remove all the funny characters with a regex. I changed

re.sub(r'[^\w]', ' ', data['votes'])
to
re.sub(r'[^\w]', ' ', str(data['votes']))

because re.sub expects a string, so I need to cast the object to a string first.
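
To make the cause concrete, here is a minimal sketch that reproduces the TypeError and then applies the str() fix. The votes value is invented for illustration (the real content of data['votes'] is not shown in the question), and strip_non_word is a hypothetical helper name:

```python
import re


def strip_non_word(value):
    """Replace every non-word character with a space, converting to str first."""
    return re.sub(r'[^\w]', ' ', str(value))


# Hypothetical stand-in for data['votes']; any non-string object behaves the same.
votes = {"funny": 0, "useful": 2}

message = None
try:
    # re.sub only accepts strings (or bytes-like objects) as the text to search.
    re.sub(r'[^\w]', ' ', votes)
except TypeError as exc:
    message = str(exc)

cleaned = strip_non_word(votes)  # works because the value is converted first
```

Note that str() on a dict or list also turns its brackets, quotes, and colons into text, which the regex then replaces with spaces, so it may be cleaner to pick out the individual fields before cleaning them.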
