[Machine Learning] Classification with the Titanic data: Decision Tree, Binary Encoding, Entropy
• https://www.kaggle.com/competitions/titanic
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
train = pd.read_csv("data/Titanic/train.csv", index_col="PassengerId")
print(train.shape) #(891, 11)
test = pd.read_csv("data/Titanic/test.csv", index_col="PassengerId")
print(test.shape) #(418, 10)
train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Survived 891 non-null int64
1 Pclass 891 non-null int64
2 Name 891 non-null object
3 Sex 891 non-null object
4 Age 714 non-null float64
5 SibSp 891 non-null int64
6 Parch 891 non-null int64
7 Ticket 891 non-null object
8 Fare 891 non-null float64
9 Cabin 204 non-null object
10 Embarked 889 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB
# Missing values
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 8))
sns.heatmap(train.isnull(), cmap="gray", ax=axes[0, 0])
sns.heatmap(test.isnull(), cmap="gray", ax=axes[0, 1])
sns.barplot(data=train.isnull(), ax=axes[1, 0], errorbar=None)
sns.barplot(data=test.isnull(), ax=axes[1, 1], errorbar=None)
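Besides the plots, the per-column missing counts can be checked numerically with `isnull().sum()`. A minimal sketch on a toy frame (the values here are illustrative, not the real Titanic rows):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the Titanic columns with heavy missingness (Age, Cabin)
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 8.05],
})

# Count of missing values per column
null_counts = df.isnull().sum()
print(null_counts)  # Age 2, Cabin 3, Fare 0
```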

Comparing the frequency of the label by sex shows a clear difference, so we encode sex as a feature. In a sense this can also be seen as label encoding or ordinal encoding (True/False map to 0 and 1).
train["Survived"].value_counts()
>>> 0 549
1 342
sns.countplot(data = train, x="Survived", hue="Sex")

# feature engineering
# binary encoding
train["Gender"] = train["Sex"] == "female"
test["Gender"] = test["Sex"] == "female"
# ML models treat boolean values as numeric (False → 0, True → 1)
display(train["Gender"].head(2))
display(test["Gender"].head(2))
PassengerId
1 False
2 True
Name: Gender, dtype: bool
PassengerId
892 False
893 True
Name: Gender, dtype: bool
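The boolean column can be fed to the model as-is because it is interpreted as 0/1; the explicit conversion would look like this (a tiny sketch with made-up rows):

```python
import pandas as pd

# Illustrative stand-in for the Sex column
sex = pd.Series(["male", "female", "female"], name="Sex")
gender = (sex == "female")          # boolean mask: False/True
print(gender.astype(int).tolist())  # [0, 1, 1] — what the model effectively sees
```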
# the target: the label we need to predict
label_name = "Survived"
# columns used for training and prediction
feature_names = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Gender']
# fill missing values with 0
X_train = train[feature_names].fillna(0)
X_test = test[feature_names].fillna(0)
y_train = train[label_name]
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(criterion = "entropy", random_state=42)
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
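Since the test labels are held back by Kaggle, the model can't be scored locally on `X_test`; a common hedge is cross-validation on the training set. A minimal sketch on synthetic data (the arrays here stand in for `X_train`/`y_train`):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (illustrative only)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = DecisionTreeClassifier(criterion="entropy", random_state=42)
scores = cross_val_score(model, X, y, cv=5)  # 5-fold accuracy
print(scores.mean())
```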

from sklearn.tree import plot_tree
plt.figure(figsize=(12, 6))
plot_tree(model,
          max_depth=4,
          fontsize=12,
          filled=True,
          feature_names=feature_names)
plt.show()

# compute the root node entropy
-((549/891)*np.log2(549/891) + (342/891)*np.log2(342/891))
>>> 0.9607079018756469
# compute the Gini impurity
1 - (549/891) ** 2 - (342/891)**2
>>> 0.4730129578614428
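The two hand calculations above generalize to any vector of class counts; a small numpy sketch (the function names `entropy`/`gini` are mine, not from the post):

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return -np.sum(p * np.log2(p))

def gini(counts):
    """Gini impurity of a list of class counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

# Root node of the Titanic train set: 549 died, 342 survived
print(entropy([549, 342]))  # ≈ 0.9607
print(gini([549, 342]))     # ≈ 0.4730
```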
sns.barplot(x = model.feature_importances_, y=model.feature_names_in_)
