How can I one-hot encode in Python?

Question

Approach 1: You can use pandas' pd.get_dummies.

Example 1:

import pandas as pd
s = pd.Series(list('abca'))
pd.get_dummies(s)
Out[]: 
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
3  1.0  0.0  0.0
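The float output above comes from an older pandas release; newer versions may return boolean columns by default. As a minimal sketch, assuming your pandas version supports the dtype argument of get_dummies, the numeric type can be requested explicitly:

```python
import pandas as pd

s = pd.Series(list('abca'))

# Ask for float columns explicitly so the result does not depend on
# the default dtype of the installed pandas version.
dummies = pd.get_dummies(s, dtype=float)
print(dummies)
```

Passing dtype=int instead gives 0/1 integers, which is often what downstream models expect.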

Example 2:

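The following converts a given column into one-hot dummy columns. To encode several columns at once, use prefixes so the resulting dummy columns can be told apart.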

import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c']
})
df
Out[]: 
   A  B
0  a  b
1  b  a
2  a  c

# Get the one-hot encoding of column B
one_hot = pd.get_dummies(df['B'])
# Drop column B, since it is now encoded
df = df.drop('B', axis=1)
# Join the encoded columns back onto the DataFrame
df = df.join(one_hot)
df
Out[]: 
   A  a  b  c
0  a  0  1  0
1  b  1  0  0
2  a  0  0  1
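The prefix idea mentioned above can be sketched as follows. This is only an illustration on the same toy DataFrame: it encodes both columns in one call and relies on the columns, prefix, and dtype arguments of get_dummies.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['a', 'b', 'a'],
    'B': ['b', 'a', 'c'],
})

# Encode both columns at once. Each dummy column is named <prefix>_<value>,
# so the 'a' from column A and the 'a' from column B stay distinct.
encoded = pd.get_dummies(df, columns=['A', 'B'], prefix=['A', 'B'], dtype=int)
print(encoded.columns.tolist())
# ['A_a', 'A_b', 'B_a', 'B_b', 'B_c']
```

Without prefixes, joining the dummies of two columns that share category values would produce clashing column names.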

Approach 2: Use Scikit-learn

Using a OneHotEncoder has the advantage that you can fit on some training data and then transform other data using the same instance. The handle_unknown parameter also gives further control over what the encoder does with unseen data.
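A minimal sketch of that fit-then-transform workflow, assuming a scikit-learn version that accepts string categories directly; the colour values are made up for illustration:

```python
from sklearn.preprocessing import OneHotEncoder

train = [['red'], ['green'], ['blue']]
new_data = [['green'], ['purple']]  # 'purple' never appeared in training

# handle_unknown='ignore' encodes unseen categories as all zeros
# instead of raising an error at transform time.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

# The same fitted instance is reused on the new data.
encoded = enc.transform(new_data).toarray()
print(enc.categories_)
print(encoded)
```

With the default handle_unknown='error', the transform call would raise on 'purple' instead.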

Given a dataset with three features and four samples, we let the encoder find the maximum value per feature and transform the data into a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])   
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
   handle_unknown='error', n_values='auto', sparse=True)
>>> enc.n_values_
array([2, 3, 4])
>>> enc.feature_indices_
array([0, 2, 5, 9], dtype=int32)
>>> enc.transform([[0, 1, 1]]).toarray()
array([[ 1.,  0.,  0.,  1.,  0.,  0.,  1.,  0.,  0.]])
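The transcript above reflects an older scikit-learn release. As far as I know, later releases no longer provide the n_values_ and feature_indices_ attributes and expose the learned categories through categories_ instead. A rough equivalent under that assumption:

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# categories_ holds the distinct values learned for each feature;
# their lengths play the role of the old n_values_ attribute.
n_per_feature = [len(c) for c in enc.categories_]
print(n_per_feature)

# transform returns a sparse matrix by default, hence toarray()
row = enc.transform([[0, 1, 1]]).toarray()
print(row)
```

The encoded row should match the one in the original transcript: two columns for the first feature, three for the second, and four for the third.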

Here is a link to this example: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html

Answer 1