
I've got 200k CSV files and I need to import them all into a single PostgreSQL table. They hold lists of parameters from various devices, and each CSV's file name contains the device's serial number, which I need in one of the columns for each row.

So to simplify: each file has a few columns of data (no headers). Let's say the columns in each CSV file are Date, Variable, Value, and the file name looks like SERIALNUMBER_and_someOtherStuffIDontNeed.csv

I'm trying to use Cygwin to write a bash script that iterates over the files and does the import for me, but for some reason it won't work, failing with 'syntax error at or near "as"'

Here's my code:

#!/bin/bash
FILELIST=/cygdrive/c/devices/files/*
for INPUT_FILE in $FILELIST
do
psql -U postgres -d devices -c "copy devicelist
(
Date,
Variable,
Value,
SN as CURRENT_LOAD_SOURCE(),
)
from '$INPUT_FILE'
delimiter ',' ;"
done

I'm learning SQL so it might be an obvious mistake, but I can't see it.

Also, I know that in this form I will get the full file name, not just the serial number bit I want, but I can probably handle that somehow later.
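For example, assuming the serial number is everything before the first underscore in the file name, bash parameter expansion might be enough:

SN_FILE="${INPUT_FILE##*/}"   # strip the directory part
SN="${SN_FILE%%_*}"           # keep everything before the first underscore

(SN_FILE and SN are just illustrative names.)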

Please advise.

Thanks.

  • Can't you combine all of your inputs into one correctly formatted file and use a bcp-like utility to load it in one operation? Loading the files one at a time will be painfully slow. In any case, your for loop will fail because the * will expand to an [Arg list too long] error. Use find . | xargs awk ' ...' >> allInOneFile.txt to create your file. Writing this is a small consulting engagement. Good luck.
    – shellter
    Commented Nov 29, 2016 at 17:12
  • Good point @shellter -- there are easy ways to combine multiple text files from the command line.
    – CLAbeel
    Commented Nov 29, 2016 at 17:13
  • 1
    copy ... from program 'sed "s/$/,$INPUT_FILE/" $INPUT_FILE' ...
    – Abelisto
    Commented Nov 29, 2016 at 17:40
  • Thanks for your comments. I'm actually in the process of creating one correctly formatted CSV, but it's taking ages... It's been running since yesterday and I estimate it will take at least another day or so, so that's not really the best solution.
    – Michal
    Commented Nov 30, 2016 at 8:50
  • Don't think that loading the files one record at a time is going to be any faster ;-/ . Unless your files are terabytes in size, or you're still running on a Pentium 386, something is almost certainly wrong with how you are creating your "one correctly formatted csv". Typically, processing takes less time than writing to disk. In a decent environment, 1 GB per minute would be a reasonable benchmark for spinning disks. You could post a new question with your formatting problem, get a fix, and rerun all of your files before your current solution finishes ;-)!? Good luck!!
    – shellter
    Commented Dec 1, 2016 at 14:28

2 Answers


I don't think there is a CURRENT_LOAD_SOURCE() function in Postgres (and a COPY column list can't compute values anyway, hence the syntax error at "as"). A work-around is to leave the name column NULL on COPY and patch it to the desired value just after the COPY. I prefer a shell here-document because that makes quoting inside the SQL body easier. (BTW: for 10K files, the globbing needed to obtain FILELIST might exceed ARG_MAX for the shell ...)


#!/bin/bash

# NB: ls + glob may hit ARG_MAX for very large directories (see below),
# and unquoted $FILELIST breaks on file names containing spaces.
FILELIST="$(ls /tmp/*.c)"

for INPUT_FILE in $FILELIST
do
echo "File: $INPUT_FILE"

psql -U postgres -d devices <<OMG

  -- I have a schema "tmp" for testing purposes
CREATE TABLE IF NOT EXISTS tmp.filelist(name text, content text);

COPY tmp.filelist (content)
FROM '$INPUT_FILE' DELIMITER ',' ;

-- the rows just copied are the only ones where name is still NULL
UPDATE tmp.filelist SET name = '$INPUT_FILE'
WHERE name IS NULL;
OMG

done
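If that ls glob does trip over ARG_MAX, the same loop can stream file names from find instead. A sketch only, reusing the table and psql invocation from above:

find /tmp -name '*.c' -print0 |
while IFS= read -r -d '' INPUT_FILE
do
psql -U postgres -d devices <<OMG
COPY tmp.filelist (content) FROM '$INPUT_FILE' DELIMITER ',' ;
UPDATE tmp.filelist SET name = '$INPUT_FILE' WHERE name IS NULL;
OMG
done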


For anyone interested in an answer: I used a Python script to rename the files and then another script using psycopg2 to connect to the database, doing everything over one connection. It took 10 minutes instead of 10 hours.

Here's the code:

Renaming the files (it also turns out that to import from CSV you need all the rows to be filled, and the information I needed was in the first 4 columns anyway, so I put together a solution that generates whole new CSVs instead of just renaming them):

import os
import csv

path = 'C:/devices/files'

os.chdir(path)
i = 0

for file in os.listdir(path):
    try:
        i += 1

        if i % 10000 == 0:
            # just to see the progress
            print(i)

        # the serial number is the first 8 characters of the file name
        serial_number = file[:8]

        # newline='' keeps the csv module from writing blank lines on Windows
        with open(file, newline='') as src, \
                open('processed_' + file, 'w', newline='') as dst:
            creader = csv.reader(src)
            cwriter = csv.writer(dst)

            for cline in creader:
                # drop columns 5-8, which I don't need
                new_line = [val for col, val in enumerate(cline) if col not in (4, 5, 6, 7)]
                new_line.insert(0, serial_number)
                cwriter.writerow(new_line)

    except Exception:
        print('problem with file: ' + file)

Updating database:

import os
import psycopg2


path = "C:\\devices\\files"
directory_listing = os.listdir(path)

conn = psycopg2.connect("dbname='devices' user='postgres' host='localhost'")
cursor = conn.cursor()
print(len(directory_listing))
i = 100001

while i < 218792:
    current_file = directory_listing[i]
    i += 1
    full_path = "C:/devices/files/" + current_file
    # copy_from streams the whole file into the table in one COPY
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")
    conn.commit()

conn.close()

Don't mind the while loop and the weird numbers; it's just because I was doing it in portions for testing purposes. It can easily be replaced with a for loop.
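For instance, a sketch of the full-directory version, reusing path, directory_listing, cursor and conn from the script above:

for current_file in directory_listing:
    full_path = os.path.join(path, current_file)
    with open(full_path) as f:
        cursor.copy_from(file=f, table='devicelistlive', sep=",")
conn.commit()  # a single commit at the end is enough here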
