
I'm trying to train a classifier to detect whether a user is genuine based on their typing pattern. I'm extracting 4 features, namely press-press time, press-release time, release-press time and release-release time, which are the intervals between consecutive keystroke events. Since the number of keystrokes can vary between sessions (users may type wrong characters first and then correct them), I wanted to make the dimensionality of the data constant, and I read here about feature hashing.

I have a three-part question:

  1. How do I handle data whose dimension varies from sample to sample? (Check the last two samples of genuine_user.csv.)
  2. Have I implemented feature hashing correctly?
  3. How do I resolve the ValueError: setting an array element with a sequence?

genuine_user.csv (6 samples provided)

id,date,genuine,password,release_codes,pp,pr,rp,rr,ppavg,pravg,rpavg,rravg,total
1,2010-10-18 11:45:54,1,cPc9312,67 16 80 67 105 99 97 98,64 144 144 376 215 193 73,95 191 95 71 71 72 71 70,255 239 215 447 287 264 143,160 48 120 376 216 192 72,172,92,264,169,1279
1,2010-10-18 11:46:13,1,cPc9312,67 16 80 67 105 99 97 98,96 136 120 568 263 216 96,98 183 71 95 71 48 120 96,279 207 215 639 311 336 192,181 24 144 544 240 288 72,213,97,311,213,1591
1,2010-10-18 11:46:18,1,cPc9312,67 16 80 67 105 99 97 98,120 144 168 568 232 240 120,96 192 96 87 71 71 143 95,312 240 255 639 303 383 215,216 48 159 552 232 312 72,227,106,335,227,1687
1,2010-10-18 11:46:22,1,cPc9312,67 16 80 67 105 99 97 98,120 144 136 408 264 208 72,96 168 72 96 72 72 96 72,288 216 232 480 336 304 144,192 48 160 384 264 232 48,193,93,285,189,1424
1,2010-10-20 13:14:23,1,cpc9312,67 80 67 105 99 97 98,176 192 440 296 225 55,120 120 136 104 104 119 128,296 328 544 400 344 183,176 208 408 296 240 64,230,118,349,232,1512
1,2010-10-18 11:46:09,1,cPc9312,67 16 80 67 105 99 97 98 67 16 80 67 105 99 97 98,216 144 168 496 264 359 9 9398 72 136 120 408 256 216 72,120 216 96 72 72 71 18 11 75 160 72 96 72 72 120 96,432 240 240 568 335 377 20 9473 232 208 216 480 328 336 168,312 24 144 496 263 306 2 9462 157 48 144 384 256 264 48,822,89,910,820,12430

(release_codes is a list of the ASCII values of the keystrokes)

fhash.py

import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import FeatureHasher
from sklearn.svm import OneClassSVM

keystroke_data = pd.read_csv(r'data/genuine_user.csv', header=0)
results = []
for user in keystroke_data.id.unique():
    user_keystroke_data = keystroke_data[keystroke_data['id'] == user]
    no_of_samples = user_keystroke_data.id.count()
    hash_length = len(user_keystroke_data.password.iloc[0])
    # TODO: refine the value of hash_length
    hash_length = pow(hash_length, 2)

    X = user_keystroke_data[['release_codes', 'pp', 'pr', 'rp', 'rr', 'total']]
    y = user_keystroke_data[['genuine']]

    print("==== Before transformation =====")
    print("X type: {}".format(type(X)))
    print("X shape: {}".format(X.shape))
    print("y type: {}".format(type(y)))
    print("y shape: {}".format(y.shape))

    # TODO: sort this out
    hasher = FeatureHasher(n_features=10, input_type='dict', non_negative=False)

    X_transformed = []
    for i in range(no_of_samples):
        temp_X = X.iloc[i]
        temp_list = []
        # rc contains the list of release codes; ignore the last code,
        # as it refers to the Enter key
        rc = list(map(int, temp_X.release_codes.split()))
        pp = list(map(int, temp_X.pp.split()))
        pr = list(map(int, temp_X.pr.split()))
        rp = list(map(int, temp_X.rp.split()))
        rr = list(map(int, temp_X.rr.split()))
        for j in range(0, len(rc) - 1):
            temp_list.append({'rc': rc[j], 'pp': pp[j], 'pr': pr[j], 'rp': rp[j], 'rr': rr[j]})
        X_transformed.append(hasher.transform(temp_list))

    X_transformed = pd.DataFrame(X_transformed)
    with open(r'output.csv', 'w') as file:
        file.write(X_transformed.to_csv())

    print("==== After transformation =====")
    print("X_transformed shape: {}".format(X_transformed.shape))

    X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.4, random_state=0)
    print("X_train type: {}".format(type(X_train)))

    # OC-SVM
    svm_clf = OneClassSVM(kernel='poly')
    svm_clf.fit(X_train, y_train)
    prediction_results = svm_clf.predict(X_test)
    counter = Counter(prediction_results)
    correct_predictions = counter.get(1.0, 0)
    wrong_predictions = counter.get(-1.0, 0)
    results.append({'user': user, 'classifier': 'ocsvm',
                    'data': {'correct_predictions': correct_predictions, 'wrong_predictions': wrong_predictions}})

Output

$ python fhash.py
==== Before transformation =====
X type: <class 'pandas.core.frame.DataFrame'>
X shape: (450, 6)
y type: <class 'pandas.core.frame.DataFrame'>
y shape: (450, 1)
==== After transformation =====
X_transformed shape: (450, 1)
X_train type: <class 'pandas.core.frame.DataFrame'>
Traceback (most recent call last):
  File "fhash.py", line 61, in <module>
    svm_clf.fit(X_train, y_train)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/svm/classes.py", line 1036, in fit
    sample_weight=sample_weight, **params)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/svm/base.py", line 151, in fit
    X, y = check_X_y(X, y, dtype=np.float64, order='C', accept_sparse='csr')
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py", line 521, in check_X_y
    ensure_min_features, warn_on_dtype, estimator)
  File "/home/tak/anaconda3/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
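Inspecting X_transformed suggests where the error comes from: every cell of the DataFrame is itself a sparse matrix (one row per keystroke), not a scalar, so I suspect check_X_y cannot build a 2-D float array from it. A minimal check, using the names from my script above:

import scipy.sparse as sp

first_cell = X_transformed.iloc[0, 0]
print(type(first_cell))         # e.g. <class 'scipy.sparse.csr.csr_matrix'>
print(sp.issparse(first_cell))  # True
print(first_cell.shape)         # (number of keystrokes - 1, 10)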

A user here suggested padding the values, but there is no clarity on how to do that. Is padding the solution? If so, how should I do it?
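For context, my current understanding of "padding" is something like the sketch below: extend every sample's interval lists with zeros up to the length of the longest sample, then concatenate them into one flat, fixed-length vector. The zero fill value and the per-sample flattening are my own assumptions, not anything the commenter specified:

import numpy as np

def pad_sample(pp, pr, rp, rr, max_len, fill=0):
    """Pad each interval list with `fill` up to max_len and
    concatenate them into one fixed-length feature vector."""
    padded = []
    for seq in (pp, pr, rp, rr):
        padded.extend(seq + [fill] * (max_len - len(seq)))
    return np.asarray(padded, dtype=float)

# hypothetical usage with two samples of different lengths
samples = [
    ([64, 144], [95, 191], [255, 239], [160, 48]),
    ([96, 136, 120], [98, 183, 71], [279, 207, 215], [181, 24, 144]),
]
max_len = max(len(s[0]) for s in samples)
X_padded = np.vstack([pad_sample(*s, max_len=max_len) for s in samples])
print(X_padded.shape)  # (2, 12) -> 4 features x max_len columns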


As for the first part of your question, I would replace each variable-length feature by a histogram with a fixed number of bins. This histogram is an estimate of the probability density function of the intervals between consecutive keystrokes. In this approach the feature vector is built by concatenating the histogram bin values corresponding to the original features. If each of the 4 variable-length features is histogrammed using n bins, the resulting feature vector has a dimension of 4×n.

In the snippet below, the function hist_from_string transforms a variable-length string containing a sequence of interval values into a histogram of 20 equal-width bins over the range (0, 10000). The call hists_from_data(data[:, 5:9]) applies this transformation to columns 5 through 8 of your dataset and returns an array with 80 columns and as many rows as there are samples.

import numpy as np

# Read the whole CSV as strings; the interval columns hold
# space-separated lists and cannot be parsed as numbers directly.
data = np.loadtxt(r'C:\Users\Antonio\Desktop\genuine_user.csv',
                  dtype=str,
                  delimiter=',',
                  skiprows=1)

def hist_from_string(intervals, bins=np.linspace(0, 10000, 21)):
    """Turn a space-separated string of intervals into a
    20-bin density histogram over the range (0, 10000)."""
    lst = [int(value) for value in intervals.split()]
    h, _ = np.histogram(lst, bins=bins, density=True)
    return h

def hists_from_data(arr):
    """Histogram every interval column of every sample and
    concatenate the results row-wise."""
    feats = []
    for row in arr:
        for intervals in row:
            feats.append(hist_from_string(intervals))
    return np.asarray(feats).reshape(arr.shape[0], -1)

# Columns 5 to 8 are pp, pr, rp and rr; column 2 is the label.
X = hists_from_data(data[:, 5:9])
y = np.asarray(data[:, 2], dtype='int')

Notes:

  • If the default bin edges do not fit your needs, you can pass a different sequence to hist_from_string.
  • You could also augment the feature vector by computing statistics such as the min, max, mean, and range of the different intervals and concatenating them to the four histograms (see the sketch after these notes).
  • The example was originally written for Python 2.7; with dtype=str as above it should also work under Python 3.
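A minimal sketch of that augmentation, reusing data and hists_from_data from the snippet above (the helper names and the particular choice of statistics are mine, not a fixed recipe):

def stats_from_string(intervals):
    """Summary statistics for one space-separated interval string."""
    lst = np.asarray([int(value) for value in intervals.split()], dtype=float)
    return np.array([lst.min(), lst.max(), lst.mean(), lst.max() - lst.min()])

def stats_from_data(arr):
    """min/max/mean/range for every interval column of every sample."""
    feats = []
    for row in arr:
        for intervals in row:
            feats.append(stats_from_string(intervals))
    return np.asarray(feats).reshape(arr.shape[0], -1)

# Concatenate the histograms and the statistics column-wise:
# shape (n_samples, 80 + 16).
X = np.hstack([hists_from_data(data[:, 5:9]), stats_from_data(data[:, 5:9])])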
