I am new to machine learning and am trying to set up a logistic regression model for prediction in Python using scikit-learn. It worked on a small mock dataset, but when I extend the code to a larger dataset I get a ValueError. Here is my code:
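import numpy as np
from sklearn.linear_model import LogisticRegression

# file and file2 refer to the input data and the answers (defined earlier in the script)
inputData = np.genfromtxt(file, skip_header=1, unpack=True)
print "X array shape: ",inputData.shape
inputAnswers = np.genfromtxt(file2, skip_header=1, unpack=True)
print "Y array shape: ",inputAnswers.shape
logreg = LogisticRegression(penalty='l2',C=2.0)
logreg.fit(inputData, inputAnswers)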
The input data file has 149 rows and 231 columns. I'm trying to fit it to the inputAnswers array, which has 149 entries, one for each of the 149 rows of the input data. However, here is the output I receive:
X array shape: (231, 149)
Y array shape: (149,)
Traceback (most recent call last):
  File "LogRegTry_rawData.py", line 26, in <module>
    logreg.fit(inputData, inputAnswers)
  File "[path]", line 676, in fit
    (X.shape[0], y.shape[0]))
ValueError: X and y have incompatible shapes.
X has 231 samples, but y has 149.
I understand what the error means, but I'm not sure why it occurs here or how to fix it. Any help is greatly appreciated. Thank you!
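A minimal sketch of what may be happening, using a small synthetic table in place of the real files (an assumption; the names `text`, `transposed`, and `as_in_file` are illustrative and not from the script above). NumPy documents `unpack=True` as transposing the array that `genfromtxt` returns, which would turn a file with 149 rows and 231 columns into an array of shape (231, 149). The sketch is written for Python 3 and only demonstrates the shape swap on a 5 x 3 table:

```python
import io
import numpy as np

# Synthetic stand-in for the real data file: one header line,
# then 5 rows (samples) of 3 columns (features).
text = "a b c\n" + "\n".join("1 2 3" for _ in range(5))

# unpack=True transposes the result, so rows and columns swap places.
transposed = np.genfromtxt(io.StringIO(text), skip_header=1, unpack=True)

# With the default unpack=False, the array keeps the file's layout.
as_in_file = np.genfromtxt(io.StringIO(text), skip_header=1)

print(transposed.shape)  # (3, 5)
print(as_in_file.shape)  # (5, 3)
```

scikit-learn expects X to have shape (n_samples, n_features), so if this is indeed the cause, either dropping `unpack=True` when reading the input data or passing `inputData.T` to `fit` should make the 149 samples line up with the 149 answers.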