What is SMOTE?

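SMOTE (Synthetic Minority Over-sampling Technique) is an oversampling technique that uses k-nearest neighbors (kNN) to increase the number of minority-class samples in an imbalanced dataset. This can help the model generalize better and produce better results.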
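To make the kNN idea concrete, here is a minimal pure-Python sketch of the interpolation step: pick a minority sample, pick one of its k nearest minority neighbors, and create a new point somewhere on the line segment between them. The function name `smote_sketch` and the toy points are made up for illustration; this is not the imbalanced-learn implementation, only the core idea under simplified assumptions (numeric features, Euclidean distance).

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between
    minority samples and their nearest minority neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of `base` (excluding itself)
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        # Random position on the segment between base and neighbor
        gap = rng.random()
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical toy minority class with two features
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 1.0), (1.2, 2.0)]
new_points = smote_sketch(minority_points, n_new=6, k=2)
print(new_points)
```

Every synthetic point lies between two real minority points, which is why SMOTE adds variety instead of simply duplicating rows.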

When to use SMOTE?

SMOTE is generally used on imbalanced datasets. One example is a rare-disease dataset, where only a small number of people actually have the condition.
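A quick way to see whether a dataset is imbalanced is to look at the share of each class in the labels. The sketch below uses a made-up label list (84 negatives, 16 positives) and a hypothetical helper `class_proportions`; substitute your own label column.

```python
from collections import Counter


def class_proportions(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical labels: 0 = healthy, 1 = has the condition
labels = [0] * 84 + [1] * 16
props = class_proportions(labels)
for cls, share in sorted(props.items()):
    print("class {}: {:.2%}".format(cls, share))
```

If one class is only a small fraction of the data, as in this toy case, oversampling is worth considering.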

How to use SMOTE?

First, import SMOTE.

from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

Next, because only the training data should be oversampled, split the data into train and test sets with the train_test_split function. Be sure to check the sizes of the resulting train and test sets.

# Note: here `train` holds the feature columns and `test` holds the target labels
X_train, X_test, y_train, y_test = train_test_split(train, test, test_size=0.3)

print('Shape of train : {}'.format(X_train.shape))
print('Shape of test : {}'.format(X_test.shape))
print('='*50)
print('Shape of df_train (incl. ID and Class): {}'.format(df_train_2.shape))
print('Shape of df_test (incl. ID): {}'.format(df_test_2.shape))

Shape of train : (431, 26)
Shape of test : (186, 26)
==================================================
Shape of df_train (incl. ID and Class): (617, 28)
Shape of df_test (incl. ID): (5, 27)
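With imbalanced labels, a purely random split can leave the train and test sets with different class ratios. One common remedy is a stratified split, which splits each class separately. Below is a small pure-Python sketch of that idea; `stratified_split` and the toy labels are hypothetical and only illustrate the principle. In scikit-learn, train_test_split accepts a `stratify` argument for the same purpose.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.3, seed=0):
    """Split indices into train/test so each class keeps roughly
    the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 84 negatives followed by 16 positives
labels = [0] * 84 + [1] * 16
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in test_idx), "positives in the test set")
```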

Create a SMOTE object and apply it to the training data.

# Create a SMOTE sampler that resamples each class to 1000 samples
smote = SMOTE(sampling_strategy={0: 1000, 1: 1000})
# df_train_numerical = df_train.drop(['Id', 'EJ', 'Class'], axis=1)

X_smote, y_smote = smote.fit_resample(X_train, y_train)
# df_train_2 is the full dataset before the split; the proportions below use y_train only
print("length of original data is ",len(df_train_2))
print("Proportion of True data in original data is {:.2%}".format(len(y_train[y_train==1])/len(y_train)))
print("Proportion of False data in original data is {:.2%}".format(len(y_train[y_train==0])/len(y_train)))
print("length of oversampled data is ",len(X_smote))
print("Proportion of True data in oversampled data is {:.2%}".format(len(y_smote[y_smote ==1])/len(y_smote)))
print("Proportion of False data in oversampled data is {:.2%}".format(len(y_smote[y_smote ==0])/len(y_smote)))

length of original data is  617
Proportion of True data in original data is 15.78%
Proportion of False data in original data is 84.22%
length of oversampled data is  2000
Proportion of True data in oversampled data is 50.00%
Proportion of False data in oversampled data is 50.00%

After oversampling, each class makes up 50% of the training data, so the class imbalance has been resolved.
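It can help to work out how many of the 2000 rows are synthetic. The sketch below reconstructs the class counts from the numbers printed above (431 training rows, 15.78% positive); the rounding step is an assumption made to recover whole counts from a rounded percentage.

```python
# Values taken from the printed output above
n_train = 431
positive_share = 0.1578

# Recover whole-number class counts (assumes the percentage was rounded)
n_pos = round(positive_share * n_train)
n_neg = n_train - n_pos

# Target counts passed as sampling_strategy
target = {0: 1000, 1: 1000}
synthetic = {0: target[0] - n_neg, 1: target[1] - n_pos}

print("original counts :", {0: n_neg, 1: n_pos})
print("synthetic counts:", synthetic)
```

Note that with this sampling_strategy both classes are oversampled, not just the minority class: the majority class also receives synthetic rows to reach 1000.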

Why use SMOTE?

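The reason to use SMOTE is the same as the reason to balance imbalanced data in general. A model trained on imbalanced data often reports high accuracy while its F1-score is extremely low. Balancing the data with an oversampling technique such as SMOTE can raise accuracy and often produces a noticeable improvement in F1-score.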
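The gap between accuracy and F1-score is easy to reproduce by hand. The sketch below scores a hypothetical model that always predicts the majority class on made-up labels (84 negatives, 16 positives); `accuracy_and_f1` is an illustrative helper, not a library function.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Compute accuracy and the F1-score of the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Hypothetical labels and a model that always predicts class 0
y_true = [0] * 84 + [1] * 16
y_pred = [0] * 100
acc, f1 = accuracy_and_f1(y_true, y_pred)
print("accuracy = {:.2f}, F1 = {:.2f}".format(acc, f1))
```

The model never finds a single positive case, yet its accuracy is 84%. This is why F1-score is the more informative metric on imbalanced data.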