I've got a script that attempts to insert a large number of rows into a PostgreSQL table. When I say large, I mean up to 200,000. I'm inserting the data from Python using SQLAlchemy. Each row consists of one unique ID and a number of 0/1 flags.
When I try to insert a small number of rows, it works just fine. I have even inserted around 18,000 without any problems, and I think it only took a handful of seconds.
Lately, I have stepped it up to a much larger data set of around 150,000 records. I had my script print the time it started, and this insert has now been running for 12+ hours, which seems disproportionately long compared to the fast 18,000-row insert. Here is the code I'm using:
import sqlalchemy

sql_engine = sqlalchemy.create_engine("postgresql://database")
meta = sqlalchemy.MetaData(sql_engine)
my_table = sqlalchemy.Table('table_name', meta, autoload=True, autoload_with=sql_engine)

# IDs already present in the table, so they can be skipped
already_inserted = [i for i in sql_engine.execute(sqlalchemy.select([some_column]))]

table_rows = []
for i in summary:
    if i[some_column] not in already_inserted:
        table_rows.append(
            {logic that builds row of 0s and 1s})

if len(table_rows) > 0:
    my_table.insert().execute(table_rows)
Are there any tips for getting this to work? Should I be inserting in smaller chunks? Also, would the inserts go faster if I only inserted the flags that are equal to 1 and left the zeros as NULL?
Try building a single statement and executing it once:

statement = my_table.insert().values(table_rows)
sql_engine.execute(statement)

This produces one bulk INSERT, whereas SQLAlchemy can otherwise fall back to row-by-row inserts even when you pass it a list of rows.
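If a single 150,000-row statement turns out to be too large, a chunked variant may help. This is a minimal sketch, assuming the same sql_engine, my_table, and table_rows from the question; the chunk size is an arbitrary choice you would tune yourself:

CHUNK_SIZE = 10000  # assumed value; adjust for your row width and memory

for start in range(0, len(table_rows), CHUNK_SIZE):
    chunk = table_rows[start:start + CHUNK_SIZE]
    # .values(chunk) renders one multi-row INSERT ... VALUES (...), (...) per chunk
    sql_engine.execute(my_table.insert().values(chunk))

Each iteration sends one INSERT carrying CHUNK_SIZE value tuples, so the server sees a handful of statements instead of 150,000 individual inserts.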