[Machine Learning] Classification with the Titanic data: Feature Engineering (derived variables, one-hot encoding, missing-value replacement), Cross Validation
1. Family size == Parch + SibSp + 1 (self)
train["FamilySize"] = train["Parch"] + train["SibSp"] + 1
test["FamilySize"] = test["Parch"] + test["SibSp"] + 1
display(train[["Parch", "SibSp", "FamilySize"]].sample(3))
display(test[["Parch", "SibSp", "FamilySize"]].sample(3))
2. Gender
train["Gender"] = train["Sex"] == "female"
test["Gender"] = test["Sex"] == "female"
display(train[["Sex", "Gender"]].sample(5))
display(test[["Sex", "Gender"]].sample(5))
3. Title
# train
train["Title"] = train["Name"].str.split(".", expand=True)[0].str.strip()
>>> "Partner, Mr. Austen" -> "Partner, Mr"
train["Title"] = train["Title"].str.split(",", expand=True)[1].str.strip()
>>> "Partner, Mr" -> "Mr"
#test
test["Title"] = test["Name"].str.split(".", expand=True)[0].str.strip()
test["Title"] = test["Title"].str.split(",", expand=True)[1].str.strip()
test[["Name","Title"]].sample(10)
The titles appearing in train and test can differ in both kind and frequency. So, apart from the common frequent titles, the remaining rare titles are collapsed into "etc".
# keep only the titles that appear more than twice
title_count = train["Title"].value_counts()
tc_over = title_count[title_count > 2].index
# train
train["TitleEtc"] = train["Title"]
train.loc[~train["Title"].isin(tc_over), "TitleEtc"] = "etc"
# test
test["TitleEtc"] = test["Title"]
test.loc[~test["Title"].isin(tc_over), "TitleEtc"] = "etc"
# check
set(test["TitleEtc"].unique()) - set(train["TitleEtc"].unique())
💡 One-Hot Encoding with Pandas get_dummies
pd.get_dummies(train[["Fare", "Age", "Embarked", "Cabin_initial"]])
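As a minimal sketch of what `get_dummies` does (a toy frame here, not the actual Titanic data): numeric columns pass through unchanged, while object columns are expanded into one indicator column per category.

```python
import pandas as pd

# Toy frame standing in for the real train data
df = pd.DataFrame({"Fare": [7.25, 71.28], "Embarked": ["S", "C"]})
encoded = pd.get_dummies(df)

# "Fare" is kept as-is; "Embarked" becomes one 0/1 column per category
print(list(encoded.columns))  # ['Fare', 'Embarked_C', 'Embarked_S']
```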
1. A derived variable with missing values replaced by a string
# train
train["Cabin_initial"] = train["Cabin"]
train["Cabin_initial"] = train["Cabin_initial"].fillna("N")
train["Cabin_initial"] = train["Cabin_initial"].str[0]
print(train["Cabin_initial"].unique())
train["Cabin_initial"].sample(5)
# test
test["Cabin_initial"] = test["Cabin"]
test["Cabin_initial"] = test["Cabin_initial"].fillna("N")
test["Cabin_initial"] = test["Cabin_initial"].str[0]
print(test["Cabin_initial"].unique())
test["Cabin_initial"].sample(5)
['N' 'C' 'E' 'G' 'D' 'A' 'B' 'F' 'T']
PassengerId
661    N
280    N
869    N
442    N
542    N
Name: Cabin_initial, dtype: object
['N' 'B' 'E' 'A' 'C' 'D' 'F' 'G']
PassengerId
1087    N
1120    N
1074    D
1117    N
1163    N
Name: Cabin_initial, dtype: object
Only train contains the value "T", and extracting those rows shows there is just one. So we compute the mean fare per cabin and replace "T" with the cabin whose fare is closest to it.
train["Cabin_initial"] = train["Cabin_initial"].replace("T", "A")
# check
set(train["Cabin_initial"].unique()) - set(test["Cabin_initial"].unique())
>>> set()
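The per-cabin fare comparison that motivated the replacement above can be sketched like this (a hypothetical mini-frame stands in for the real data; on the actual Titanic frame the single "T" fare sits closest to deck "A"):

```python
import pandas as pd

# Hypothetical stand-in for the real train frame
train = pd.DataFrame({
    "Cabin_initial": ["A", "A", "B", "T"],
    "Fare": [40.0, 50.0, 100.0, 45.0],
})

# Mean fare per cabin initial, then pick the deck closest to "T"'s mean
fare_by_cabin = train.groupby("Cabin_initial")["Fare"].mean()
closest = (fare_by_cabin.drop("T") - fare_by_cabin["T"]).abs().idxmin()
print(closest)  # "A" for this toy data
```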
2. Filling the Age missing values by gender
age_f = train.loc[train["Sex"] == "female", "Age"].mean()
age_m = train.loc[train["Sex"] == "male", "Age"].mean()
train["Age_fill"] = train["Age"]
train.loc[train["Age_fill"].isnull() & (train["Sex"] == "female"), "Age_fill"] = age_f
train.loc[train["Age_fill"].isnull() & (train["Sex"] == "male"), "Age_fill"] = age_m
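The feature list below includes Age_fill for test as well, and test's Age has the same missing values, so presumably the same fill is applied to test, reusing the means computed on train (toy frames here for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the real train/test frames
train = pd.DataFrame({"Sex": ["female", "male", "female"], "Age": [30.0, 40.0, np.nan]})
test = pd.DataFrame({"Sex": ["male", "female"], "Age": [np.nan, np.nan]})

# Means computed on train only, then reused for test (avoids leakage)
age_f = train.loc[train["Sex"] == "female", "Age"].mean()
age_m = train.loc[train["Sex"] == "male", "Age"].mean()

test["Age_fill"] = test["Age"]
test.loc[test["Age_fill"].isnull() & (test["Sex"] == "female"), "Age_fill"] = age_f
test.loc[test["Age_fill"].isnull() & (test["Sex"] == "male"), "Age_fill"] = age_m
print(test["Age_fill"].tolist())  # [40.0, 30.0]
```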
label_name = "Survived"
feature_names = ['Pclass', 'Fare', 'Embarked',
'FamilySize', 'Gender',
'TitleEtc', 'Cabin_initial',
'Age_fill']
X_train = pd.get_dummies(train[feature_names])
X_test = pd.get_dummies(test[feature_names])
y_train = train[label_name]
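One caveat when encoding train and test separately: a category present in only one frame produces mismatched dummy columns, and the model would then see different inputs. A small sketch of one common fix, reindexing test onto train's columns:

```python
import pandas as pd

# Toy frames where test is missing a category seen in train
X_train = pd.get_dummies(pd.DataFrame({"Embarked": ["S", "C", "Q"]}))
X_test = pd.get_dummies(pd.DataFrame({"Embarked": ["S", "S"]}))

# Reindex test onto train's columns, filling absent categories with 0
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
print(list(X_test.columns))  # same columns as X_train
```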
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier(random_state=42,
                               max_depth=5, max_features=0.9)
cross_validate
# cross_validate
from sklearn.model_selection import cross_validate
y_validate = cross_validate(model, X_train, y_train, cv=5, n_jobs=-1)
pd.DataFrame(y_validate)  # fit time, score time, and test score per fold
cross_val_score
from sklearn.model_selection import cross_val_score
y_valid_score = cross_val_score(model, X_train, y_train, cv=5, n_jobs=-1)
y_valid_score  # score for each fold
>>> array([0.8603352 , 0.79775281, 0.82022472, 0.79213483, 0.85393258])
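The fold scores above are usually summarized as a mean:

```python
import numpy as np

# Fold-level accuracies from the cross_val_score output above
y_valid_score = np.array([0.8603352, 0.79775281, 0.82022472, 0.79213483, 0.85393258])
print(y_valid_score.mean())  # ≈ 0.8249
```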
cross_val_predict
from sklearn.model_selection import cross_val_predict
y_valid_predict = cross_val_predict(model, X_train, y_train, cv=5, n_jobs=-1)
y_valid_predict[:5]  # returns the predictions themselves, so metrics can be computed directly
>>> array([0, 1, 1, 1, 0], dtype=int64)
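Since `cross_val_predict` returns one out-of-fold prediction per row, accuracy can be computed directly by comparing against the labels (toy arrays here for illustration):

```python
import numpy as np

# Hypothetical stand-ins for y_train and the out-of-fold predictions
y_train = np.array([0, 1, 1, 0, 0])
y_valid_predict = np.array([0, 1, 1, 1, 0])

# Fraction of rows where the out-of-fold prediction matches the label
accuracy = (y_train == y_valid_predict).mean()
print(accuracy)  # 0.8
```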
model.fit(X_train, y_train)
y_predict = model.predict(X_test)
import seaborn as sns
sns.barplot(x=model.feature_importances_, y=model.feature_names_in_)