I'm trying to train a classifier to detect whether a user is genuine based on their typing pattern. I extract four features: press-press time, press-release time, release-press time and release-release time, which are the intervals between consecutive keystrokes. Since the number of keystrokes can vary from session to session (for example, when a user types wrong characters first and then the correct ones), I want to make the dimensionality of the data constant, and I read here about feature hashing.
I have a three-part question:
- How do I handle data whose dimension varies from sample to sample? (Check the last two samples of genuine_user.csv.)
- Have I implemented feature hashing correctly?
- How do I resolve the error "ValueError: setting an array element with a sequence"?
genuine_user.csv (6 samples provided)
id,date,genuine,password,release_codes,pp,pr,rp,rr,ppavg,pravg,rpavg,rravg,total
1,2010-10-18 11:45:54,1,cPc9312,67 16 80 67 105 99 97 98,64 144 144 376 215 193 73,95 191 95 71 71 72 71 70,255 239 215 447 287 264 143,160 48 120 376 216 192 72,172,92,264,169,1279
1,2010-10-18 11:46:13,1,cPc9312,67 16 80 67 105 99 97 98,96 136 120 568 263 216 96,98 183 71 95 71 48 120 96,279 207 215 639 311 336 192,181 24 144 544 240 288 72,213,97,311,213,1591
1,2010-10-18 11:46:18,1,cPc9312,67 16 80 67 105 99 97 98,120 144 168 568 232 240 120,96 192 96 87 71 71 143 95,312 240 255 639 303 383 215,216 48 159 552 232 312 72,227,106,335,227,1687
1,2010-10-18 11:46:22,1,cPc9312,67 16 80 67 105 99 97 98,120 144 136 408 264 208 72,96 168 72 96 72 72 96 72,288 216 232 480 336 304 144,192 48 160 384 264 232 48,193,93,285,189,1424
1,2010-10-20 13:14:23,1,cpc9312,67 80 67 105 99 97 98,176 192 440 296 225 55,120 120 136 104 104 119 128,296 328 544 400 344 183,176 208 408 296 240 64,230,118,349,232,1512
1,2010-10-18 11:46:09,1,cPc9312,67 16 80 67 105 99 97 98 67 16 80 67 105 99 97 98,216 144 168 496 264 359 9 9398 72 136 120 408 256 216 72,120 216 96 72 72 71 18 11 75 160 72 96 72 72 120 96,432 240 240 568 335 377 20 9473 232 208 216 480 328 336 168,312 24 144 496 263 306 2 9462 157 48 144 384 256 264 48,822,89,910,820,12430
(release_codes is a list of the ASCII values of the keystrokes.)
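To make the varying dimensionality concrete, here is a minimal sketch that parses the space-separated columns into lists of integers and reports their lengths. It uses the first and fifth sample rows from genuine_user.csv above; the helper name `parse_list_columns` and the `LIST_COLUMNS` constant are hypothetical and only serve this illustration.

```python
import io

import pandas as pd

# Two rows copied from genuine_user.csv above (the first and the fifth sample).
SAMPLE_CSV = """id,date,genuine,password,release_codes,pp,pr,rp,rr,ppavg,pravg,rpavg,rravg,total
1,2010-10-18 11:45:54,1,cPc9312,67 16 80 67 105 99 97 98,64 144 144 376 215 193 73,95 191 95 71 71 72 71 70,255 239 215 447 287 264 143,160 48 120 376 216 192 72,172,92,264,169,1279
1,2010-10-20 13:14:23,1,cpc9312,67 80 67 105 99 97 98,176 192 440 296 225 55,120 120 136 104 104 119 128,296 328 544 400 344 183,176 208 408 296 240 64,230,118,349,232,1512
"""

# Columns that hold space-separated integers (hypothetical constant name).
LIST_COLUMNS = ["release_codes", "pp", "pr", "rp", "rr"]


def parse_list_columns(frame):
    """Return a copy of frame with each list column parsed into a list of ints."""
    parsed = frame.copy()
    for column in LIST_COLUMNS:
        parsed[column] = parsed[column].apply(
            lambda cell: [int(value) for value in cell.split()]
        )
    return parsed


samples = parse_list_columns(pd.read_csv(io.StringIO(SAMPLE_CSV)))

# Number of values per sample in each list column.
lengths = {column: samples[column].map(len).tolist() for column in LIST_COLUMNS}
print(lengths)
```

The two rows give different lengths for every list column, which is why the samples cannot be stacked into one rectangular array as they are.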
fhash.py
import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import FeatureHasher
from collections import Counter, OrderedDict
from sklearn.svm import OneClassSVM

keystroke_data = pd.read_csv(r'data/genuine_user.csv', header=0)
results = []

for user in keystroke_data.id.unique():
    user_keystroke_data = keystroke_data[keystroke_data['id'] == user]
    no_of_samples = user_keystroke_data.id.count()
    hash_length = len(user_keystroke_data.password.iloc[0])
    # TODO: refine the value of hash_length
    hash_length = pow(hash_length, 2)

    X = user_keystroke_data[['release_codes', 'pp', 'pr', 'rp', 'rr', 'total']]
    y = user_keystroke_data[['genuine']]

    print("==== Before transformation =====")
    print("X type: {}".format(type(X)))
    print("X shape: {}".format(X.shape))
    print("y type: {}".format(type(y)))
    print("y shape: {}".format(y.shape))

    # TODO: sort this out
    hasher = FeatureHasher(n_features=10, input_type='dict', non_negative=False)
    X_transformed = []
    for i in range(no_of_samples):
        temp_X = X.iloc[i]
        temp_list = []
        # rc contains the list of release codes; ignore the last code, as it refers to the "enter" key
        rc = list(map(int, temp_X.release_codes.split()))
        pp = list(map(int, temp_X.pp.split()))
        pr = list(map(int, temp_X.pr.split()))
        rp = list(map(int, temp_X.rp.split()))
        rr = list(map(int, temp_X.rr.split()))
        # One dict per keystroke
        for j in range(0, len(rc) - 1):
            temp_list.append({'rc': rc[j], 'pp': pp[j], 'pr': pr[j], 'rp': rp[j], 'rr': rr[j]})
        X_transformed.append(hasher.transform(temp_list))

    X_transformed = pd.DataFrame(X_transformed)
    with open(r'output.csv', 'w') as file:
        file.write(X_transformed.to_csv())

    print("==== After transformation =====")
    print("X_transformed shape: {}".format(X_transformed.shape))

    X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.4, random_state=0)
    print("X_train type: {}".format(type(X_train)))

    # OC-SVM
    svm_clf = OneClassSVM(kernel='poly')
    svm_clf.fit(X_train, y_train)
    prediction_results = svm_clf.predict(X_test)
    counter = Counter(prediction_results)
    correct_preditions = counter.get(1.0, 0)
    wrong_preditions = counter.get(-1.0, 0)
    results.append({
        'user': user,
        'classifier': 'ocsvm',
        'data': {'correct_preditions': correct_preditions, 'wrong_preditions': wrong_preditions},
    })
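One thing worth checking in the loop above: FeatureHasher.transform returns one row per dict it receives, so passing a list of per-keystroke dicts yields a matrix with one row per keystroke rather than one row per session. A minimal sketch of an alternative, assuming one dict per session with position-tagged keys, is shown below. The key scheme ("pp_0", "pr_3", ...), the helper `session_to_dict` and the choice of n_features=16 are illustrative assumptions, not a recommended configuration; the values come from the first and fifth sample rows.

```python
from sklearn.feature_extraction import FeatureHasher


def session_to_dict(pp, pr):
    """Flatten one session into a single dict with position-tagged keys (hypothetical scheme)."""
    features = {}
    for index, value in enumerate(pp):
        features["pp_%d" % index] = float(value)
    for index, value in enumerate(pr):
        features["pr_%d" % index] = float(value)
    return features


# pp and pr values of the first and fifth sample rows of genuine_user.csv.
sessions = [
    session_to_dict([64, 144, 144, 376, 215, 193, 73], [95, 191, 95, 71, 71, 72, 71, 70]),
    session_to_dict([176, 192, 440, 296, 225, 55], [120, 120, 136, 104, 104, 119, 128]),
]

# n_features=16 is an arbitrary choice for this sketch.
hasher = FeatureHasher(n_features=16, input_type="dict")
X_hashed = hasher.transform(sessions)  # scipy sparse matrix, one row per session
X_dense = X_hashed.toarray()

print(X_hashed.shape)
```

Both sessions end up with the same number of columns even though they contain a different number of keystrokes. With so few output columns, distinct keys can collide in the same bucket, so n_features would need tuning on real data.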
Output
$ python fhash.py
==== Before transformation =====
X type: <class 'pandas.core.frame.DataFrame'>
X shape: (450, 6)
y type: <class 'pandas.core.frame.DataFrame'>
y shape: (450, 1)
==== After transformation =====
X_transformed shape: (450, 1)
X_train type: <class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "fhash.py", line 61, in <module>
    svm_clf.fit(X_train, y_train)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/svm/classes.py", line 1036, in fit
    sample_weight=sample_weight, **params)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/svm/base.py", line 151, in fit
    X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
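The same kind of error can be reproduced without scikit-learn. The sketch below assumes that the cause is a cell holding a whole sequence instead of a single number, which the (450, 1) shape printed above seems to suggest. It asks NumPy for a float64 array built from rows of unequal length; the helper `to_float_matrix` is hypothetical, and the values are the first few pp intervals of two sample rows.

```python
import numpy as np

# Two sessions with a different number of intervals (values taken from the pp column).
ragged_rows = [[64, 144, 144], [176, 192]]


def to_float_matrix(rows):
    """Try to build a rectangular float64 array, as check_array does in the traceback."""
    return np.array(rows, dtype=np.float64)


try:
    to_float_matrix(ragged_rows)
    error_message = None
except ValueError as exc:
    # NumPy cannot place a variable-length sequence into a single float slot.
    error_message = str(exc)

print(error_message)
```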
A user here suggested padding the values, but it is not clear how to do that. Is that the solution? If so, how should I do it?
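For reference, padding could look roughly like the sketch below. It assumes that every session is padded (or truncated) to a fixed length and that 0 is an acceptable fill value; both are assumptions, since a zero interval may be confused with a real measurement and truncation discards keystrokes from long sessions. The helper `pad_sessions` is hypothetical.

```python
import numpy as np


def pad_sessions(sessions, length, fill_value=0.0):
    """Return a (n_sessions, length) float array, padding short rows and truncating long ones."""
    padded = np.full((len(sessions), length), fill_value, dtype=np.float64)
    for row, values in enumerate(sessions):
        trimmed = values[:length]
        padded[row, :len(trimmed)] = trimmed
    return padded


# pp values of the first and fifth sample rows of genuine_user.csv.
pp_sessions = [
    [64, 144, 144, 376, 215, 193, 73],
    [176, 192, 440, 296, 225, 55],
]

pp_padded = pad_sessions(pp_sessions, length=7)
print(pp_padded.shape)
```

The result is a plain rectangular array, so it can be converted to float64 without the error above. The same helper could be applied to pr, rp and rr and the results concatenated column-wise with np.hstack.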