python - How to reverse sklearn.OneHotEncoder transform to recover original data?


Excuse my lack of knowledge; I have been dabbling in Python for less than a month. I encoded categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work, and I got the predicted output back. Is there a way to reverse the encoding and convert my output back to its original state?

A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care how it works and simply want a quick answer, skip to the bottom.

import numpy as np

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8]
]).T

n_values_

Lines 1763-1786 of the source determine the n_values_ parameter. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume we're using the default, so the following lines execute:

n_samples, n_features = x.shape    # 10, 2
n_values = np.max(x, axis=0) + 1   # [100, 21]
self.n_values_ = n_values
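The three ways of specifying n_values described above can be sketched in plain NumPy. This is only an illustration of the behaviour, not scikit-learn's actual code; the helper name resolve_n_values is made up for this example.

```python
import numpy as np

def resolve_n_values(x, n_values="auto"):
    """Return the number of possible values per feature (hypothetical helper)."""
    n_features = x.shape[1]
    if isinstance(n_values, str) and n_values == "auto":
        # Infer from the data: largest value seen in each column, plus one.
        return np.max(x, axis=0) + 1
    if np.isscalar(n_values):
        # A single integer applies to every feature.
        return np.full(n_features, n_values, dtype=int)
    # Otherwise expect one entry per feature.
    return np.asarray(n_values, dtype=int)

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

auto_values = resolve_n_values(x)              # inferred from the data
fixed_values = resolve_n_values(x, 128)        # same bound for all features
per_feature = resolve_n_values(x, [100, 32])   # one bound per feature
```

With the default, the result depends on the data the encoder was fitted on, which is why the offsets computed later have to be taken from the fitted encoder rather than recomputed.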

feature_indices_

Next, the feature_indices_ parameter is calculated.

n_values = np.hstack([[0], n_values])  # [0, 100, 21]
indices = np.cumsum(n_values)          # [0, 100, 121]
self.feature_indices_ = indices

So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
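To make the role of these offsets concrete, here is a small sketch of how a (feature, value) pair maps to a column in the full, uncompressed layout. The function global_column is a name invented for this illustration.

```python
import numpy as np

n_values = np.array([100, 21])  # taken from the example data above
feature_indices = np.cumsum(np.hstack([[0], n_values]))  # offsets per feature

def global_column(feature, value):
    """Column of `value` for `feature` in the full 121-column layout."""
    return int(feature_indices[feature] + value)

first = global_column(0, 3)    # value 3 in the first feature
second = global_column(1, 5)   # value 5 in the second feature

# Differencing the offsets gives back the per-feature widths.
widths = np.diff(feature_indices)
```

Decoding is the same arithmetic run backwards: subtract the feature's offset from the global column to get the original value.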

Sparse matrix construction

Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.

column_indices = (x + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,
#         55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
#         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>'
#     with 20 stored elements in Compressed Sparse Row format>

Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
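For intuition, the same three arrays can be used to fill an ordinary dense array. The sketch below uses a NumPy array as a stand-in for the sparse matrix; it is not what the encoder does internally, just a way to see where the ones land.

```python
import numpy as np

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = x.shape
indices = np.array([0, 100, 121])  # feature_indices_ for this data

# Same row/column bookkeeping as in the source excerpt.
column_indices = (x + indices[:-1]).ravel()
row_indices = np.repeat(np.arange(n_samples), n_features)

# Dense stand-in for the coo_matrix: put a 1 at every (row, column) pair.
dense = np.zeros((n_samples, indices[-1]))
dense[row_indices, column_indices] = 1.0

ones_per_row = dense.sum(axis=1)              # one 1 per feature in each row
first_row_columns = np.flatnonzero(dense[0])  # columns set for the sample [3, 5]
```

Only 20 of the 1210 entries are non-zero here, which is the reason a sparse format is used in the first place.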

active_features_

Now, if n_values='auto', the sparse CSR matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True; otherwise it is densified before returning.

if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]
    # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
    #        101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]
    # <10x19 sparse matrix of type '<type 'numpy.float64'>'
    #     with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()
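The compression step can be reproduced on a dense array to see what active_features ends up holding. The sketch below is a NumPy-only illustration and additionally checks an observation that is not spelled out in the source: the active features are simply the sorted unique offset values.

```python
import numpy as np

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
offsets = np.array([0, 100])  # feature_indices_[:-1] for this data
n_columns = 121               # feature_indices_[-1]

# Dense version of the full one-hot matrix.
dense = np.zeros((x.shape[0], n_columns))
dense[np.arange(x.shape[0])[:, None], x + offsets] = 1.0

# Keep only the columns that contain at least one 1.
mask = dense.sum(axis=0) != 0
active_features = np.flatnonzero(mask)
compressed = dense[:, active_features]

# The same set of columns, obtained directly from the data.
unique_columns = np.unique(x + offsets)
```

Because the mask is scanned left to right, active_features comes out sorted, which is what the decoding step later relies on.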

Decoding

Now let's work in reverse. We'd like to know how to recover x given the sparse matrix that is returned, along with the OneHotEncoder attributes detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data x.

from sklearn import preprocessing
ohc = preprocessing.OneHotEncoder()  # default params
out = ohc.fit_transform(x)

The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.

out.indices
# array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4,
#        15,  5, 16,  6, 17,  7, 18,  8, 14,  9], dtype=int32)
out = out.sorted_indices()
out.indices
# array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,
#         5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)

We can see that before sorting, the indices are actually reversed along rows. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. 0 corresponds to the 3 in the first column of x, since 3 is the minimum element and was assigned to the first active column. 12 corresponds to the 5 in the second column of x. Since the first column of x occupies 10 distinct columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.

Next we look at active_features_:

ohc.active_features_
# array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
#        101, 103, 105, 107, 108, 112, 115, 119, 120])

Notice that there are 19 elements, which corresponds to the number of distinct values per column in our data (one value, 8, appears twice in the second column). Notice also that these are arranged in order. The features that were in the first column of x are unchanged, and the features in the second column have simply had 100 added to them, which corresponds to ohc.feature_indices_[1].
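Because active_features_ is sorted, a position in that array doubles as a column number in the compressed output. A small check of this, using the values printed above and np.searchsorted (my choice for the illustration, not something the encoder itself calls):

```python
import numpy as np

# Values of ohc.active_features_ as printed above.
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])

# The first row of x is [3, 5]; with offsets [0, 100] its columns
# in the full layout are 3 and 105.
global_columns = np.array([3, 105])

# Position of each global column within the sorted active features,
# i.e. its column number after compression.
compressed = np.searchsorted(active_features, global_columns)

# Indexing back into active_features undoes the compression.
restored = active_features[compressed]
```

The compressed positions match the first two sorted entries of out.indices, [0, 12], shown earlier.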

Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:

import numpy as np
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(x.shape)

This gives us:

array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])

And we can get back the original feature values by subtracting off the offsets from ohc.feature_indices_:

recovered_x = decoded - ohc.feature_indices_[:-1]
# array([[ 3,  5],
#        [10,  1],
#        [15,  3],
#        [33,  7],
#        [54,  8],
#        [55, 12],
#        [78, 15],
#        [79, 19],
#        [80, 20],
#        [99,  8]])

Note that you will need to have the original shape of x, which is simply (n_samples, n_features).
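As a side note, the lookup does not strictly need np.vectorize: indexing one integer array with another does the same thing in a single step. The sketch below hard-codes the arrays printed earlier so that it runs without a fitted encoder; with a real encoder you would use ohc.active_features_, ohc.feature_indices_ and out.sorted_indices().indices instead.

```python
import numpy as np

# Arrays copied from the outputs shown above.
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])
feature_indices = np.array([0, 100, 121])
sorted_indices = np.array([0, 12, 1, 10, 2, 11, 3, 13, 4, 14,
                           5, 15, 6, 16, 7, 17, 8, 18, 9, 14])

n_samples, n_features = 10, 2  # original shape of x

# Integer-array indexing replaces the vectorized lambda.
decoded = active_features[sorted_indices].reshape(n_samples, n_features)

# Subtract each feature's offset to get the original values back.
recovered_x = decoded - feature_indices[:-1]
```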

TL;DR

Given the sklearn.OneHotEncoder instance called ohc, the encoded data (a scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data x with:

recovered_x = (
    np.array([ohc.active_features_[col] for col in out.sorted_indices().indices])
    .reshape(n_samples, n_features)
    - ohc.feature_indices_[:-1]
)
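If the encoder was created with sparse=False, out is a dense array and has no indices attribute. A minimal sketch of the same idea for that case follows; decode_dense is a hypothetical helper, and it assumes every row contains exactly one 1 per feature. The encoded input is built by hand so the example runs without scikit-learn.

```python
import numpy as np

def decode_dense(out_dense, active_features, feature_indices):
    """Recover integer features from a dense one-hot array (hypothetical helper)."""
    n_samples = out_dense.shape[0]
    n_features = len(feature_indices) - 1
    # np.nonzero scans row by row, so columns come out sorted within each row.
    _, cols = np.nonzero(out_dense)
    decoded = np.asarray(active_features)[cols].reshape(n_samples, n_features)
    return decoded - np.asarray(feature_indices)[:-1]

# Build a dense encoding of the example data by hand to exercise the helper.
x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
feature_indices = np.array([0, 100, 121])
global_cols = x + feature_indices[:-1]
active_features = np.unique(global_cols)
out_dense = np.zeros((x.shape[0], active_features.size))
out_dense[np.arange(x.shape[0])[:, None],
          np.searchsorted(active_features, global_cols)] = 1.0

recovered_x = decode_dense(out_dense, active_features, feature_indices)
```

Recent scikit-learn releases also provide OneHotEncoder.inverse_transform, so it is worth checking whether the installed version has it before decoding by hand.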
