Revisions to Scanning multiple huge files in Python (follow-up)

Changed text output code formatting

Source Link

edited Dec 9, 2015 at 0:03

11.8k
1
27
59

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

added 102 characters in body

Source Link

edited Dec 8, 2015 at 23:40

JJack_

233
1
7

But due to a memory issue I'm not able to save the res dictionary. I'm sure that different processes have different address spaces and so all of them write to their own local copy of the dictionary. I'm forced to use Manager to share data between processes, obtaining worst performance. ShouldMay I use queuequeue?

But due to a memory issue I'm not able to save the res dictionary. I'm sure that different processes have different address spaces and so all of them write to their own local copy of the dictionary. I'm forced to use Manager to share data between processes, obtaining worst performance. Should I use queue?

But due to a memory issue I'm not able to save the res dictionary. I'm sure that different processes have different address spaces and so all of them write to their own local copy of the dictionary. I'm forced to use Manager to share data between processes, obtaining worst performance. May I use queue?

added 848 characters in body

Source Link

edited Dec 8, 2015 at 23:34

JJack_

233
1
7

I'm trying to implement an algorithm able to search for multiple keys through ten huge files in Python (16 million of rows each one). I've got a sorted file with 62 million of keys, and I'm trying to scan each of the ten files in the datasetsdataset to look for a set key and itstheir respective value.

This is a follow-up code on feedback from Scanning multiple huge files in Python. All files are encoded with UTF-8 and They should contain multiple language.

Here is parta little slice of my sorted key file:

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

Here it is the bash script I use to create sorted keys file.

#! /bin/bash
clear
BASEPATH="/home/process"
mkdir processed
mkdir processed/slice
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 | sort -u -k2 > $BASEPATH/processed/sorted_keys
split -d -l 3000000 processed/sorted_keys processed/slice/slice-
for filename in processed/slice/*; do
    python processing.py $filename
done
rm $BASEPATH/processed/sorted_keys
rm -rf $BASEPATH/processed/slice

For each slice I launch processing.py Here is my working code, with Manager:

I'm trying to implement an algorithm able to search for multiple keys through ten huge files in Python (16 million of rows each one). I've got a sorted file with 62 million of keys, and I'm trying to scan each of the ten files in the datasets to look for a key and its value.

This is a follow-up code on feedback from Scanning multiple huge files in Python.

Here is part of my sorted key file:

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

Here is my working code, with Manager:

I'm trying to implement an algorithm able to search for multiple keys through ten huge files in Python (16 million of rows each one). I've got a sorted file with 62 million of keys, and I'm trying to scan each of the ten files in the dataset to look for a set key and their respective value.

This is a follow-up code on feedback from Scanning multiple huge files in Python. All files are encoded with UTF-8 and They should contain multiple language.

Here is a little slice of my sorted key file:

en Mahesh_Prasad_Varma
en Mahesh_Saheba
en maheshtala
en Maheshtala_College
en Mahesh_Thakur
en Maheshwara_Institute_Of_Technology
en Maheshwar_Hazari
....
en Just_to_Satisfy_You_(song) 1
en Just_to_See_Her 2
en Just_to_See_You_Smile 2
en Just_Tricking 1
en Just_Tricking! 1
en Just_Tryin%27_ta_Live 1
en Just_Until... 1
en Just_Us 1
en Justus 2
en Justus_(album) 2
....
en Zsófia_Polgár 1

en Mahesh_Prasad_Varma 1
en maheshtala 1
en Maheshtala_College 1
en Maheshwara_Institute_Of_Technology 2
en Maheshwar_Hazari 1

Here it is the bash script I use to create sorted keys file.

#! /bin/bash
clear
BASEPATH="/home/process"
mkdir processed
mkdir processed/slice
cat $BASEPATH/dataset/* | cut -d' ' -f1,2 | sort -u -k2 > $BASEPATH/processed/sorted_keys
split -d -l 3000000 processed/sorted_keys processed/slice/slice-
for filename in processed/slice/*; do
    python processing.py $filename
done
rm $BASEPATH/processed/sorted_keys
rm -rf $BASEPATH/processed/slice

For each slice I launch processing.py Here is my working code, with Manager:

Changed wording a little, hopefully to the better

Source Link

edited Dec 8, 2015 at 23:28

holroy

11.8k
1
27
59

Loading

deleted 167 characters in body; edited tags

Source Link

edited Dec 8, 2015 at 23:17

200_success

145.6k
22
190
479

Loading

added 3 characters in body

Source Link

edited Dec 8, 2015 at 23:16

JJack_

233
1
7

Loading

added 335 characters in body

Source Link

edited Dec 8, 2015 at 23:10

JJack_

233
1
7

Loading

added 335 characters in body

Source Link

edited Dec 8, 2015 at 23:05

JJack_

233
1
7

Loading

Source Link

asked Dec 8, 2015 at 22:55

JJack_

233
1
7

Loading

Stack Exchange Network

Return to Question