python - How to reverse sklearn.OneHotEncoder transform to recover original data?
Excuse my lack of knowledge; I have been dabbling in Python for less than a month. I encoded my categorical data using sklearn.OneHotEncoder and fed it to a random forest classifier. Everything seems to work, and I got my predicted output back. Is there a way to reverse the encoding and convert my output back to its original state?
A good systematic way to figure this out is to start with some test data and work through the sklearn.OneHotEncoder source with it. If you don't much care how it works and simply want a quick answer, skip to the bottom.
import numpy as np

# Test data: 10 samples, 2 integer features (note the transpose).
x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_values_
Lines 1763-1786 determine the n_values_ parameter. It is determined automatically if you set n_values='auto' (the default). Alternatively, you can specify a maximum value for all features (int) or a maximum value per feature (array). Let's assume that we're using the default. Then the following lines execute:
n_samples, n_features = x.shape    # 10, 2
n_values = np.max(x, axis=0) + 1   # [100, 21]
self.n_values_ = n_values
feature_indices_
Next, the feature_indices_ parameter is calculated.
n_values = np.hstack([[0], n_values])   # [0, 100, 21]
indices = np.cumsum(n_values)           # [0, 100, 121]
self.feature_indices_ = indices
So feature_indices_ is merely the cumulative sum of n_values_ with a 0 prepended.
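As a quick check outside the encoder, the two quantities above can be reproduced with plain NumPy. This is only a sketch on the test data, not the library's code; the variable names simply mirror the attribute names.

```python
import numpy as np

# Test data from the walkthrough: 10 samples, 2 integer features.
x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# With n_values='auto', each feature gets (max value + 1) slots.
n_values = np.max(x, axis=0) + 1

# Prepend 0 and take the cumulative sum to get each feature's column offset.
feature_indices = np.cumsum(np.hstack([[0], n_values]))

print(n_values)         # [100  21]
print(feature_indices)  # [  0 100 121]
```

The last entry (121) is the total width of the uncompressed encoding, and the entries before it are the per-feature offsets.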
Sparse matrix construction
Next, a scipy.sparse.coo_matrix is constructed from the data. It is initialized from three arrays: the sparse data (all ones), the row indices, and the column indices.
column_indices = (x + indices[:-1]).ravel()
# array([  3, 105,  10, 101,  15, 103,  33, 107,  54, 108,
#         55, 112,  78, 115,  79, 119,  80, 120,  99, 108])

row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9], dtype=int32)

data = np.ones(n_samples * n_features)
# array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,
#         1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])

out = sparse.coo_matrix((data, (row_indices, column_indices)),
                        shape=(n_samples, indices[-1]),
                        dtype=self.dtype).tocsr()
# <10x121 sparse matrix of type '<type 'numpy.float64'>'
#     with 20 stored elements in Compressed Sparse Row format>
Note that the coo_matrix is immediately converted to a scipy.sparse.csr_matrix. The coo_matrix is used as an intermediate format because it "facilitates fast conversion among sparse formats."
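To try this step without the encoder class, here is a standalone sketch that builds the same matrix directly with SciPy. The offsets are hard-coded from the feature_indices_ values computed earlier, and the dtype is left at SciPy's default rather than taken from self.dtype.

```python
import numpy as np
from scipy import sparse

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = x.shape
indices = np.array([0, 100, 121])  # feature_indices_ for this data

# Shift each feature into its own block of columns, then flatten row by row.
column_indices = (x + indices[:-1]).ravel()
# Each sample contributes one entry per feature.
row_indices = np.repeat(np.arange(n_samples, dtype=np.int32), n_features)
# Every stored entry is a 1.
data = np.ones(n_samples * n_features)

out = sparse.coo_matrix(
    (data, (row_indices, column_indices)),
    shape=(n_samples, int(indices[-1])),
).tocsr()

print(out.shape)  # (10, 121)
print(out.nnz)    # 20
```

Every row holds exactly n_features ones, which is the property the decoding step relies on later.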
active_features_
Now, if n_values='auto', the sparse CSR matrix is compressed down to only the columns with active features. The sparse csr_matrix is returned if sparse=True; otherwise it is densified before returning.
if self.n_values == 'auto':
    mask = np.array(out.sum(axis=0)).ravel() != 0
    active_features = np.where(mask)[0]
    # array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
    #        101, 103, 105, 107, 108, 112, 115, 119, 120])
    out = out[:, active_features]
    # <10x19 sparse matrix of type '<type 'numpy.float64'>'
    #     with 20 stored elements in Compressed Sparse Row format>
    self.active_features_ = active_features

return out if self.sparse else out.toarray()
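The compression step can also be run on its own. The sketch below rebuilds the full-width matrix for the test data and then keeps only the columns that contain at least one entry; the names full and compact are mine, not the library's.

```python
import numpy as np
from scipy import sparse

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
n_samples, n_features = x.shape
offsets = np.array([0, 100])  # feature_indices_[:-1] for this data

# Full-width one-hot matrix (10 x 121), as in the construction step.
rows = np.repeat(np.arange(n_samples), n_features)
cols = (x + offsets).ravel()
full = sparse.coo_matrix(
    (np.ones(rows.size), (rows, cols)), shape=(n_samples, 121)
).tocsr()

# A column is "active" if any sample has a 1 in it.
mask = np.asarray(full.sum(axis=0)).ravel() != 0
active_features = np.where(mask)[0]

# Keep only the active columns.
compact = full[:, active_features]

print(active_features.size)  # 19
print(compact.shape)         # (10, 19)
```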
Decoding
Now let's work in reverse. We'd like to know how to recover x given the sparse matrix that is returned, along with the OneHotEncoder features detailed above. Let's assume we actually ran the code above by instantiating a new OneHotEncoder and running fit_transform on our data x.
from sklearn import preprocessing

ohc = preprocessing.OneHotEncoder()  # default params
out = ohc.fit_transform(x)
The key insight to solving this problem is understanding the relationship between active_features_ and out.indices. For a csr_matrix, the indices array contains the column numbers for each data point. However, these column numbers are not guaranteed to be sorted. To sort them, we can use the sorted_indices method.
out.indices
# array([12,  0, 10,  1, 11,  2, 13,  3, 14,  4,
#        15,  5, 16,  6, 17,  7, 18,  8, 14,  9], dtype=int32)

out = out.sorted_indices()
out.indices
# array([ 0, 12,  1, 10,  2, 11,  3, 13,  4, 14,
#         5, 15,  6, 16,  7, 17,  8, 18,  9, 14], dtype=int32)
We can see that before sorting, the indices are actually reversed within each row. In other words, they are ordered with the last column first and the first column last. This is evident from the first two elements: [12, 0]. The 0 corresponds to the 3 in the first column of x; since 3 is the minimum element, it was assigned to the first active column. The 12 corresponds to the 5 in the second column of x. Since the first column of x occupies 10 distinct encoded columns, the minimum element of the second column (1) gets index 10. The next smallest (3) gets index 11, and the third smallest (5) gets index 12. After sorting, the indices are ordered as we would expect.
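The mapping from a full-width column to its compacted position can be checked by hand with np.searchsorted. This is just an illustration of the arithmetic described above, using the active column list for the test data; it is not how the encoder computes it.

```python
import numpy as np

# Active (non-empty) columns of the full 121-wide encoding, in sorted order.
active_features = np.array([3, 10, 15, 33, 54, 55, 78, 79, 80, 99,
                            101, 103, 105, 107, 108, 112, 115, 119, 120])

# The first row of x is (3, 5); with the offset of 100 the 5 lands in column 105.
first_row_full_columns = np.array([3, 105])

# Position of each full-width column inside the sorted active column list.
compact_columns = np.searchsorted(active_features, first_row_full_columns)

print(compact_columns)  # [ 0 12]
```

These are the same two values, 0 and 12, that appear at the start of out.indices.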
Next, look at active_features_:
ohc.active_features_
# array([  3,  10,  15,  33,  54,  55,  78,  79,  80,  99,
#        101, 103, 105, 107, 108, 112, 115, 119, 120])
Notice that there are 19 elements, which corresponds to the number of distinct elements in our data (one element, 8, was repeated once). Notice also that these are arranged in order. The features that were in the first column of x are unchanged, and the features in the second column have simply had 100 added to them, which corresponds to ohc.feature_indices_[1].
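One way to convince yourself of this is to rebuild the same list with NumPy alone: shift each column of x by its offset and take the sorted unique values. This is a sketch under the assumption that the offsets are [0, 100], as computed for this data.

```python
import numpy as np

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T
feature_offsets = np.array([0, 100])  # feature_indices_[:-1] for this data

# Shift each feature into its own column range, then collect the distinct columns.
shifted = x + feature_offsets
expected_active = np.unique(shifted)

print(expected_active.size)  # 19
print(expected_active[10:])  # [101 103 105 107 108 112 115 119 120]
```

The repeated 8 in the second column collapses into a single column (108), which is why 20 values yield only 19 active features.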
Looking back at out.indices, we can see that the maximum column number is 18, which is one less than the 19 active features in our encoding. A little thought about the relationship here shows that the indices of ohc.active_features_ correspond to the column numbers in out.indices. With this, we can decode:
import numpy as np

# Map each compacted column number back to its full-width column.
decode_columns = np.vectorize(lambda col: ohc.active_features_[col])
decoded = decode_columns(out.indices).reshape(x.shape)
This gives us:
array([[  3, 105],
       [ 10, 101],
       [ 15, 103],
       [ 33, 107],
       [ 54, 108],
       [ 55, 112],
       [ 78, 115],
       [ 79, 119],
       [ 80, 120],
       [ 99, 108]])
And we can get back to the original feature values by subtracting off the offsets from ohc.feature_indices_:
recovered_x = decoded - ohc.feature_indices_[:-1]
# array([[ 3,  5],
#        [10,  1],
#        [15,  3],
#        [33,  7],
#        [54,  8],
#        [55, 12],
#        [78, 15],
#        [79, 19],
#        [80, 20],
#        [99,  8]])
Note that you will need to have the original shape of x, which is simply (n_samples, n_features).
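The whole round trip can be exercised without an encoder instance. In the sketch below, decode_one_hot is a hypothetical helper of my own (not part of scikit-learn), and the encoder state (active_features, feature_indices) is rebuilt with NumPy so the example runs on its own. It replaces np.vectorize with plain fancy indexing, which does the same lookup.

```python
import numpy as np
from scipy import sparse


def decode_one_hot(encoded, active_features, feature_indices, n_features):
    """Recover integer features from a compacted one-hot CSR matrix.

    Assumes every row holds exactly n_features ones, one per feature.
    """
    encoded = sparse.csr_matrix(encoded).sorted_indices()
    # Compacted column number -> full-width column number.
    full_columns = np.asarray(active_features)[encoded.indices]
    # Full-width column number -> original value, by removing each feature's offset.
    return full_columns.reshape(-1, n_features) - np.asarray(feature_indices)[:-1]


x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# Rebuild the encoder state and the compacted encoding by hand.
feature_indices = np.cumsum(np.hstack([[0], x.max(axis=0) + 1]))
shifted = x + feature_indices[:-1]
active_features = np.unique(shifted)
rows = np.repeat(np.arange(x.shape[0]), x.shape[1])
cols = np.searchsorted(active_features, shifted.ravel())
encoded = sparse.csr_matrix(
    (np.ones(rows.size), (rows, cols)),
    shape=(x.shape[0], active_features.size),
)

recovered = decode_one_hot(encoded, active_features, feature_indices, x.shape[1])
print(np.array_equal(recovered, x))  # True
```

The reshape only works because a one-hot row has exactly one entry per feature; a matrix with missing or extra entries per row would need a different approach.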
TL;DR
Given the sklearn.OneHotEncoder instance called ohc, the encoded data (scipy.sparse.csr_matrix) output from ohc.fit_transform or ohc.transform called out, and the shape of the original data (n_samples, n_features), recover the original data x with:
recovered_x = np.array(
    [ohc.active_features_[col] for col in out.sorted_indices().indices]
).reshape(n_samples, n_features) - ohc.feature_indices_[:-1]
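The attributes used above (n_values, active_features_, feature_indices_) belong to older scikit-learn releases and, as far as I know, are no longer available in current ones. If you are on a release that provides OneHotEncoder.inverse_transform (I believe 0.20 and later), the following sketch does the same job without touching the internals. It is an alternative to the manual decoding, not the method walked through above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

x = np.array([
    [3, 10, 15, 33, 54, 55, 78, 79, 80, 99],
    [5, 1, 3, 7, 8, 12, 15, 19, 20, 8],
]).T

# categories="auto" infers the categories of each feature from the data.
ohc = OneHotEncoder(categories="auto")
out = ohc.fit_transform(x)  # sparse matrix, one column per distinct value

# inverse_transform maps the one-hot rows back to the original values.
recovered_x = ohc.inverse_transform(out)

print(np.array_equal(recovered_x, x))  # True
```

Note that this works on the encoded features; a classifier's predicted labels only need decoding if the labels themselves were one-hot encoded.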