I am trying to create a small application that extracts data from Hadoop via HiveServer2 and writes it to CSV. At the moment it takes about 40 seconds to pull 10,000 records from an hs2 table of 500k rows and 203 columns. I would like to think there is a faster, more efficient way of doing this than the way I am doing it now. Ideally I would like to eventually be able to pull and write 1 million rows per minute, and it doesn't necessarily have to be written to CSV. That probably isn't feasible at my current skill level, but I like to set goals for myself.
import pyhs2
import time
import csv
csv_out = open('data.csv', 'wb')
mywriter = csv.writer(csv_out)

def generic_user():
    return 'xxxxxx'

def password():
    return 'xxxxxxx'

with pyhs2.connect(host='xxxxxxxx',
                   port=10000,
                   authMechanism='PLAIN',
                   user=generic_user(),
                   password=password(),
                   database='xxxxxxxxxx') as conn:
    with conn.cursor() as cur:
        q = raw_input('Enter query: ').replace('csc', 'CSC')
        print q

        # Timer start
        start_time = time.time()

        # Execute query
        cur.execute(q)

        # Write the header row from the result-set schema
        col = [key['columnName'] for key in cur.getSchema()]
        mywriter.writerow(col)

        # Fetch all rows at once and write them out
        records = cur.fetch()
        mywriter.writerows(records)

        pull_time = time.time() - start_time
        print pull_time

csv_out.close()
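One likely bottleneck is that cur.fetch() pulls all rows into memory before anything is written. Fetching in fixed-size batches keeps memory flat and lets you tune the batch size against the network round-trip cost. Below is a minimal, runnable sketch of that loop; the StubCursor is a stand-in for the real hs2 cursor, and it assumes your driver exposes a DB-API-style fetchmany (drivers like impyla and PyHive do; pyhs2 itself only offers fetch/fetchone, so you would adapt the inner call accordingly):

```python
import csv
import io

class StubCursor(object):
    """Stand-in for a HiveServer2 cursor; serves canned rows in batches."""
    def __init__(self, rows):
        self._rows = list(rows)

    def fetchmany(self, size):
        # Return up to `size` rows, or an empty list when exhausted.
        batch, self._rows = self._rows[:size], self._rows[size:]
        return batch

def dump_to_csv(cur, out, columns, batch_size=10000):
    """Stream rows from `cur` into `out` in fixed-size batches."""
    writer = csv.writer(out)
    writer.writerow(columns)            # header row, written once
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        writer.writerows(batch)         # one write call per batch

# Demo with an in-memory buffer instead of a file on disk.
buf = io.StringIO()
cur = StubCursor([(1, 'a'), (2, 'b'), (3, 'c')])
dump_to_csv(cur, buf, ['id', 'val'], batch_size=2)
print(buf.getvalue())
```

With the real connection you would pass the actual cursor and an open file object, and keep batch_size in the tens of thousands so the per-batch overhead stays small relative to the transfer time.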