
How is this done? I am using scikit-learn to train an SVM, and my classes are unbalanced. My problem is multiclass and multilabel, so I am using OneVsRestClassifier:

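from sklearn import svm
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Convert the label lists to a binary indicator matrix
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(y_train)

clf = OneVsRestClassifier(svm.SVC(kernel='rbf'))
clf = clf.fit(x, y)
pred = clf.predict(x_test)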

Can I add a 'sample_weight' parameter somewhere to account for the unbalanced classes?


When I add a class_weight dict to the svm I get the error:

ValueError: Class label 2 not present

This is because I have converted my labels to binary using the mlb. However, if I do not convert the labels, I get:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. 

class_weight is a dict, mapping the class labels to the weight: {1: 1, 2: 1, 3: 3...}
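One possible reading of the "Class label 2 not present" error: after binarization, OneVsRestClassifier fits one SVC per label column, and each of those only ever sees the labels 0 and 1. A dict keyed by the original labels (1, 2, 3, ...) therefore cannot match. The sketch below assumes toy data and an arbitrary weight of 3 for the positive class; both are made up for illustration and are not taken from the question.

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import SVC

# Toy data (hypothetical): 6 samples, 2 features, labels drawn from {1, 2, 3}
X = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0], [0.9, 1.0], [0.5, 0.4], [0.4, 0.6]]
y_raw = [[1, 2], [2, 3], [1], [3], [1, 3], [2]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(y_raw)  # shape (6, 3), entries are 0 or 1

# Keys refer to the binary targets of each per-label classifier:
# 0 = "label absent", 1 = "label present". The weight 3 is an assumption.
clf = OneVsRestClassifier(SVC(kernel='rbf', class_weight={0: 1, 1: 3}))
pred = clf.fit(X, Y).predict(X)

print(mlb.classes_)  # the original labels, one per column of Y
print(pred.shape)    # same shape as Y
```

Note that the same dict is applied to every per-label classifier, so this does not give each original label its own weight.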

Here are the details of x and y:

print(X[0])
[ 0.76625633  0.63062721  0.01954162 ...,  1.1767817   0.249034    0.23544988]
print(type(X))
<type 'numpy.ndarray'>

print(y[0])
[1, 2, 3, 4, 5, 6, 7]
print(type(y))
<type 'numpy.ndarray'>

Note that mlb = MultiLabelBinarizer(); y = mlb.fit_transform(y_train) converts y to a binary array.


The suggested answer produces the error:

ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead.

So, the problem reduces to converting the labels (a np.array) to a sparse matrix.

from scipy import sparse
y_sp = sparse.csr_matrix(y) 

This produces the error:

TypeError: no supported conversion for types: (dtype('O'),)

I will open a new query for this.

Chris Parry
  • Could you provide an element of x and y? `print type(x[0]); print x[0]` and `print type(y[0]); print y[0]` – dooms Apr 08 '16 at 21:21
  • Here the y is not binary. See if `mlb.classes_` gives you an array where the value 2 is present. – dooms Apr 09 '16 at 10:48
  • I have tried converting the labels to binary. It produces the error listed above: ValueError: Class label 2 not present (because all the labels are then in binary format). If I do not convert to binary, I get the error: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. – Chris Parry Apr 09 '16 at 11:01

2 Answers


You could use the class_weight parameter of SVC:

class_weight : {dict, ‘balanced’}, optional

Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))

clf = OneVsRestClassifier(svm.SVC(kernel='rbf', class_weight='balanced'))
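To make the quoted formula concrete, here is a minimal pure-Python sketch of n_samples / (n_classes * count) applied to a single binary label column, which is what each classifier inside OneVsRestClassifier would see. The helper name `balanced_weights` and the example column are hypothetical; this re-implements the formula for illustration and is not scikit-learn's own code.

```python
from collections import Counter

def balanced_weights(labels):
    """Return {class: n_samples / (n_classes * count_of_class)}."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# One binarized label column (made up): the label is present in 2 of 8 samples
column = [1, 0, 0, 0, 1, 0, 0, 0]
weights = balanced_weights(column)

# The rare class 1 gets weight 8 / (2 * 2) = 2.0,
# the common class 0 gets weight 8 / (2 * 6), about 0.67
print(weights)
```

Because the weights are derived from each column separately, rarer labels should end up with larger positive-class weights without any dict being supplied.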


Till
  • Thanks, can you give a code example? When I try this I get the error: ValueError: Class label 2 not present, because I have converted my labels to binary. But if I do not convert the labels, I get: ValueError: You appear to be using a legacy multi-label data representation. Sequence of sequences are no longer supported; use a binary array or sparse matrix instead. – Chris Parry Apr 08 '16 at 12:43
  • Could you give me the error message ? Here is a [random example](https://github.com/mstampfer/Equities/blob/2c8e23d4f77c51261fe97ce53cf13d043d9ef8e5/GridSearchParams.py#L16) – Till Apr 08 '16 at 12:48
  • Is this error related to the `class_weight` parameter? Did you have this error before? – Till Apr 08 '16 at 12:51
  • By the way, the error seems to indicate you have only one class in your training set. Is that possible? – Till Apr 08 '16 at 12:54
  • This results in a much longer run-time for me than without using `class_weight='balanced'`. Why would that be, and is it possible to change that? I used `max_iter` to restrict the number of iterations, but the accuracy was even worse than without using balanced classes. – user124384 Oct 15 '18 at 16:50
  • I guess it has to pass through all your data first to get the proportions, and that can take time. A hint: https://stackoverflow.com/questions/30972029/how-does-the-class-weight-parameter-in-scikit-learn-work – Till Oct 16 '18 at 10:10

This code works fine with the 'balanced' value of the class_weight parameter:

>>> from sklearn.preprocessing import MultiLabelBinarizer
>>> from sklearn.svm import SVC
>>> from sklearn.multiclass import OneVsRestClassifier

>>> mlb = MultiLabelBinarizer()
>>> x = [[0,1,1,1],[1,0,0,1]]
>>> y = mlb.fit_transform([['sci-fi', 'thriller'], ['comedy']])

>>> print(y)
[[0 1 1]
 [1 0 0]]
>>> print(mlb.classes_)
['comedy' 'sci-fi' 'thriller']

>>> OneVsRestClassifier(SVC(random_state=0, class_weight='balanced')).fit(x, y).predict(x)
array([[0, 1, 1],
       [1, 0, 0]])
dooms