[Machine Learning] Regression on the Bike Sharing Demand data: Feature Engineering (Binning, Log Transform) + Regression Metrics (MAE, MSE, RMSE, RMSLE) + Hyperparameter Tuning (RandomizedSearchCV)

Likelion AISCOOL 7th Cohort Python

by dundunee 2022. 11. 17. 18:38

🚲DATA: Bike Sharing Demand

https://www.kaggle.com/competitions/bike-sharing-demand


datetime: hourly date + timestamp
season: 1 = spring, 2 = summer, 3 = fall, 4 = winter
- In other words, the values are ordered, so the column is already ordinal-encoded.
holiday: whether the day is considered a holiday
workingday: whether the day is neither a weekend nor holiday
weather: 1: Clear, Few clouds, Partly cloudy, Partly cloudy, 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist, 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds, 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
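- In short: 1 is a clear day, 2 is a cloudy or misty day, 3 is a day with light snow or rain, and 4 is a day with heavy rain, heavy snow, or hail.
temp: temperature in Celsius, the ordinary air temperature
atemp: "feels like" temperature in Celsius (apparent temperature)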
humidity: relative humidity
windspeed: wind speed
casual: number of non-registered user rentals initiated
registered: number of registered user rentals initiated
count: number of total rentals

Loading the dataset

import pandas as pd

# display() is available by default in Jupyter notebooks
train = pd.read_csv('data/bike/train.csv')
print(train.shape)  # (10886, 12)
display(train.head(3))

test = pd.read_csv('data/bike/test.csv')
print(test.shape)  # (6493, 9)
display(test.head(3))

# Columns that exist in train but not in test
set(train.columns) - set(test.columns)  # the value we have to predict is count

{'casual', 'count', 'registered'}

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    10886 non-null  object
 1   season      10886 non-null  int64
 2   holiday     10886 non-null  int64
 3   workingday  10886 non-null  int64
 4   weather     10886 non-null  int64
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64
 8   windspeed   10886 non-null  float64
 9   casual      10886 non-null  int64
 10  registered  10886 non-null  int64
 11  count       10886 non-null  int64
dtypes: float64(3), int64(8), object(1)
memory usage: 1020.7+ KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6493 entries, 0 to 6492
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   datetime    6493 non-null   object
 1   season      6493 non-null   int64
 2   holiday     6493 non-null   int64
 3   workingday  6493 non-null   int64
 4   weather     6493 non-null   int64
 5   temp        6493 non-null   float64
 6   atemp       6493 non-null   float64
 7   humidity    6493 non-null   int64
 8   windspeed   6493 non-null   float64
dtypes: float64(3), int64(5), object(1)
memory usage: 456.7+ KB

🛠️ Preprocessing

Let's split the datetime column into year, month, day, hour, minute, and second.

# The datetime column has dtype object, so convert it to a datetime type
train["datetime"] = pd.to_datetime(train["datetime"])
# Create derived variables
train["year"] = train["datetime"].dt.year
train["month"] = train["datetime"].dt.month
train["day"] = train["datetime"].dt.day
train["hour"] = train["datetime"].dt.hour
train["minute"] = train["datetime"].dt.minute
train["second"] = train["datetime"].dt.second

# Apply the same conversion to test
test["datetime"] = pd.to_datetime(test["datetime"])
test["year"] = test["datetime"].dt.year
test["month"] = test["datetime"].dt.month
test["day"] = test["datetime"].dt.day
test["hour"] = test["datetime"].dt.hour
test["minute"] = test["datetime"].dt.minute
test["second"] = test["datetime"].dt.second

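📊 Histograms of the numeric variables

Why plot histograms of the numeric variables ❓

To see the distribution of each numeric variable
- Many models learn more easily when a variable is close to normally distributed, so look at the skewness and kurtosis and consider whether a log transform is worthwhile.
To spot numeric variables that actually behave like categorical ones
- If the bars are not drawn continuously (there are gaps between them), consider converting the variable to a categorical one.
- Likewise, for categorical variables, use value_counts() or nunique() to check whether they would be better treated as continuous.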
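
The same checks can be done numerically instead of by eye. The following is a minimal sketch on synthetic data, not on the Kaggle files: the column names `count_like` and `season_like` are made up for illustration. `skew()` and `kurt()` hint at whether a log transform might help, and `nunique()` flags numeric columns that behave like categories.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic stand-ins (assumptions, not the Kaggle columns)
df = pd.DataFrame({
    "count_like": rng.exponential(scale=100, size=5000),  # right-skewed, like rental counts
    "season_like": rng.integers(1, 5, size=5000),         # only four distinct values
})

# Skewness and kurtosis of the raw column, and skewness after log1p
skew_raw = df["count_like"].skew()
skew_log = np.log1p(df["count_like"]).skew()
kurt_raw = df["count_like"].kurt()
print(f"skew raw={skew_raw:.2f}, after log1p={skew_log:.2f}, kurtosis raw={kurt_raw:.2f}")

# Very few distinct values suggests a numeric column is really categorical
n_unique = df.nunique()
print(n_unique)
print(df["season_like"].value_counts().sort_index())
```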

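Looking at the day column, train contains days 1 to 19 and test contains day 20 onward, so day is effectively the criterion that splits train from test. A model trained only on days 1 to 19 has never seen day 20 or later, so day is unlikely to help and is probably better removed from the features.
For month, train has roughly twice as many values as test, and rentals are low in the cold winter months and high from May to October. However, comparing 2011 with 2012, some months differ by as much as a factor of two. So month looks like a useful feature, but because the level differs by about 2x between the two years, month on its own may not help the model.
season is labeled spring, summer, fall, and winter, but looking at the months it actually holds calendar quarters.
- Check with train.groupby("season")["month"].unique() and train.groupby("season")["count"].describe()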

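✅EDA + RandomForest + Cross Validation + Model Evaluation Metrics

1️⃣ EDA with scatterplots

👉 Why do this ❓ For numeric variables, drawing a scatterplot against the dependent variable, or against another variable that looks related, lets you check for outliers.

To check a single variable for outliers, a boxplot can be used.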
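
By default, seaborn and matplotlib draw boxplot whiskers with the 1.5 × IQR rule, so the same single-variable check can be done numerically. A minimal sketch, assuming those default fences; the helper name `iqr_outlier_mask` and the sample values are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Made-up values; 95 sits far away from the rest
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95])
mask = iqr_outlier_mask(values)
print(values[mask])
```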

1.1 EDA: windspeed

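In the real world, is there a relationship between bike rentals and wind speed?

The histogram showed that many rows have a wind speed of 0.

train[train["windspeed"] == 0].shape → (1313, 18)

import seaborn as sns
sns.scatterplot(data = train, x="windspeed", y="count")

As wind speed increases, the number of rentals decreases.
The gaps between values suggest that wind speed is recorded as specific values or in bins.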

1.2 EDA: humidity

In the real world, is humidity related to the number of rentals?

sns.scatterplot(data = train, x="humidity", y="count")

Humidity and rental count do not appear to be correlated.

1.3 temp-atemp

sns.scatterplot(data = train, x="temp", y="atemp")

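Overall, the two appear to have a strong positive correlation.
The outliers, however, turn out on inspection to be recording errors. Because this is time-series data, one way to replace them is with the average of the previous and next values.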
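
Replacing a suspect reading with the average of its neighbors can be sketched with pandas interpolation. This is an illustration on made-up hourly values, not the author's code: the suspect value is set to NaN and `Series.interpolate()` fills it linearly, which for a single gap equals the mean of the previous and next values.

```python
import numpy as np
import pandas as pd

# Made-up hourly "feels like" temperatures; the third reading looks like a recording error
idx = pd.date_range("2012-08-17 00:00", periods=5, freq=pd.Timedelta(hours=1))
atemp = pd.Series([30.0, 31.0, 12.1, 33.0, 34.0], index=idx)

# Mark the suspect reading as missing, then fill it from its neighbors
cleaned = atemp.copy()
cleaned.iloc[2] = np.nan
cleaned = cleaned.interpolate(method="linear")

print(cleaned)
```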

1.4 weather

train[train["weather"] == 4]

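Only one row has weather == 4, so a barplot by weather shows a value for it that does not match the real world.

💡 Don't just look at the data; relate it to the real world! In the end, we are solving a real-world problem.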

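2️⃣ Modeling: Random Forest

2.1 Feature Engineering: turning numeric variables into categories

year and month are both numeric variables, but year only takes the values 2011 and 2012, so we combine year and month into a single year-month category.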

# Take only the year-month part of the datetime column (e.g. '2011-01')
train["year_date"] = train["datetime"].astype(str).str[:7]
test["year_date"] = test["datetime"].astype(str).str[:7]

# Encode the year-month strings as integer codes
train["year_month_code"] = train["year_date"].astype("category").cat.codes
test["year_month_code"] = test["year_date"].astype("category").cat.codes

📍.cat.codes

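pandas has a data type called categorical ("category") that represents (encodes) categorical data with integers.
A categorical object has two attributes, categories and codes.
- codes holds the integer that each value is mapped to.
A Series holding categorical data has a cat accessor, a special accessor similar to the string accessor Series.str. It gives easy access to categories, codes, and the other categorical methods.
So Series.cat.codes converts the values to numbers, for example '2011-01' → 0.
- Series.cat.categories is accessed the same way and returns the category values.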
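
A small sketch of the accessor on made-up year-month strings. It also shows one caveat: codes are assigned from the sorted categories of each Series, so train and test get matching codes only when both contain the same set of values. Passing a shared `CategoricalDtype` removes that dependency. The variable names here are hypothetical.

```python
import pandas as pd

train_ym = pd.Series(["2011-01", "2011-02", "2011-01", "2012-12"])
test_ym = pd.Series(["2011-02", "2012-12"])  # "2011-01" does not occur here

cat = train_ym.astype("category")
print(list(cat.cat.categories))  # sorted unique values
print(cat.cat.codes.tolist())    # integer code of each row

# Encoding test on its own restarts the codes at 0 for "2011-02"
independent = test_ym.astype("category").cat.codes.tolist()

# A shared dtype keeps the mapping identical across both Series
shared = pd.CategoricalDtype(categories=sorted(train_ym.unique()))
train_codes = train_ym.astype(shared).cat.codes.tolist()
test_codes = test_ym.astype(shared).cat.codes.tolist()
print(independent, train_codes, test_codes)
```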

2.2 Building the dataset

# Label and feature names
label_name = "count"
feature_names = train.columns.tolist()
# Drop the label, the columns missing from test (casual, registered), and unused date parts
feature_names.remove(label_name)
feature_names.remove("datetime")
feature_names.remove("casual")
feature_names.remove("registered")
# feature_names.remove("year")
feature_names.remove("month")
feature_names.remove("day")
feature_names.remove("minute")
feature_names.remove("second")
feature_names.remove("year_date")  # string column; year_month_code is its numeric encoding
feature_names

# Training and prediction datasets
X_train = train[feature_names]
X_test = test[feature_names]
y_train = train[label_name]

2.3 Loading the algorithm

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42, n_jobs=-1)
model

RandomForestRegressor(n_jobs=-1, random_state=42)

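3️⃣ Cross Validation

Cross validation splits the train data into k folds and repeats training and validation across them, so we get an estimate of the model's score (and, when combined with a parameter search, the best estimator and best score) before predicting on the test data.

It is like working through practice exams before the real test.

from sklearn.model_selection import cross_val_predict

# Out-of-fold predictions: each row is predicted by a model that did not train on it
y_valid_predict = cross_val_predict(model, X_train, y_train, cv=5, n_jobs=-1)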
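
To see what `cross_val_predict` returns, here is a minimal sketch on synthetic data (the data, the small `n_estimators`, and the variable names are assumptions made to keep it quick). `cross_val_predict` does not take a `random_state` argument; when reproducible shuffling is wanted, a splitter object such as `KFold` can be passed as `cv`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic regression data standing in for X_train / y_train (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

model = RandomForestRegressor(n_estimators=20, random_state=42)

# Reproducible shuffling is configured on the splitter, not on cross_val_predict
cv = KFold(n_splits=5, shuffle=True, random_state=42)
y_oof = cross_val_predict(model, X, y, cv=cv)

# Every row gets exactly one prediction, made by a model that never trained on it
print(y_oof.shape)
rmse_oof = np.sqrt(np.mean((y - y_oof) ** 2))
print(f"out-of-fold RMSE: {rmse_oof:.2f}")
```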

4️⃣ Model evaluation

4.1 MAE(Mean Absolute Error)

The mean of the absolute differences between the model's predictions and the actual values
Because it uses absolute values, it is the most intuitive metric
The smaller the value, the better

# With pandas
mae = abs(y_train - y_valid_predict).mean()

# With scikit-learn
from sklearn.metrics import mean_absolute_error
mean_absolute_error(y_train, y_valid_predict)

48.45686583248034

4.2 MSE(Mean Squared Error)

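The mean of the squared differences between the model's predictions and the actual values
Because the errors are squared, it is sensitive to outliers
It gauges how far the predictions fall from the actual values
The smaller the value, the better

# With pandas / NumPy
import numpy as np
mse = ((y_train - y_valid_predict) ** 2).mean()
mse = np.square(y_train - y_valid_predict).mean()

# With scikit-learn
from sklearn.metrics import mean_squared_error
mean_squared_error(y_train, y_valid_predict)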

5255.510393476759

4.3 RMSE(Root Mean Squared Error)

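The square root of the mean squared error
Because it converts the metric back to the same unit as the actual values, it is easier to interpret than MSE.
It is more sensitive to outliers than MAE, that is, less robust
Its formula is similar to that of the standard deviation

# With pandas / NumPy (all of these are equivalent)
rmse = (((y_train - y_valid_predict) ** 2).mean()) ** 0.5
rmse = np.sqrt(((y_train - y_valid_predict) ** 2).mean())
rmse = np.sqrt(np.square(y_train - y_valid_predict).mean())
rmse = mse ** 0.5
rmse = np.sqrt(mse)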

72.4948990859133
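
The three metrics so far can be checked by hand on a tiny example. The sketch below uses four made-up values (not the competition data) and compares the NumPy arithmetic with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Tiny made-up example so the arithmetic can be verified by hand
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 30.0, 50.0])

errors = y_true - y_pred         # [-2, 2, 0, -10]
mae = np.abs(errors).mean()      # (2 + 2 + 0 + 10) / 4 = 3.5
mse = np.square(errors).mean()   # (4 + 4 + 0 + 100) / 4 = 27.0
rmse = np.sqrt(mse)              # square root of 27, about 5.196

print(mae, mse, rmse)

# The scikit-learn functions give the same numbers
sk_mae = mean_absolute_error(y_true, y_pred)
sk_mse = mean_squared_error(y_true, y_pred)
print(sk_mae, sk_mse)
```

The single error of 10 contributes 100 of the 108 in the squared sum, which is why MSE and RMSE react to outliers much more strongly than MAE.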

4.4 RMSLE(Root Mean Squared Logarithmic Error)

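Why add 1 before taking the log?
- log(x) is negative when x is less than 1 and is undefined at 0, so 1 is added to keep the argument at 1 or above.
- Otherwise unintended large errors could appear, so 1 is added to the smallest possible value, 0, to keep the log from going negative.
- The minimum count in train is 1, but a predicted value for test could still come out as 0.
RMSLE is almost the same as RMSE; the only difference is that the log of both the predictions and the actual values is taken before computing the error.
Compared with RMSE, taking the log means that errors on small values are penalized more heavily.
- Compare two errors. By absolute error: 1) off by 200 million, 2) off by 1 billion
- By squared error: 1) 4 versus 2) 100 (in units of 100 million, squared)
- Taking the root of the squared error brings it back close to the absolute error. Viewed as a ratio error, however, 1) is wrong by a factor of 2 while 2) is wrong by only 10%.
- Bike rental counts are mostly concentrated at small values, so computing the error after taking the log penalizes errors on small values more than errors on large values.
- "Penalized more" here means that the same absolute error is given more weight when the actual value is small.

# With pandas / NumPy
rmsle = np.sqrt(np.square(np.log(1+y_train) - np.log(1+y_valid_predict)).mean())
rmsle = np.sqrt(np.square(np.log1p(y_train) - np.log1p(y_valid_predict)).mean())

# With scikit-learn
from sklearn.metrics import mean_squared_log_error
mean_squared_log_error(y_train, y_valid_predict) ** 0.5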

0.5103498006704638
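
The comparison between absolute and ratio error can be made concrete. The sketch below assumes specific values that the text does not state: in units of 100 million, case 1 is a true value of 2 predicted as 4, and case 2 is a true value of 100 predicted as 110. The helper name `rmsle` is also just for illustration.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log1p on both sides."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Assumed values, in units of 100 million
case1_true, case1_pred = 2.0, 4.0      # off by 2, i.e. 2x the true value
case2_true, case2_pred = 100.0, 110.0  # off by 10, i.e. 10% of the true value

abs_err = (abs(case1_true - case1_pred), abs(case2_true - case2_pred))
log_err = (rmsle([case1_true], [case1_pred]), rmsle([case2_true], [case2_pred]))

print("absolute error:", abs_err)  # case 2 is larger
print("log error:     ", log_err)  # case 1 is larger

# log1p is defined at 0, so a prediction of 0 is not a problem
zero_case = rmsle([1.0], [0.0])
print(zero_case)  # equals log(2)
```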

5️⃣ Training and prediction

# Train
model.fit(X_train, y_train)

# Visualize the feature importances
sns.barplot(x=model.feature_importances_, y=model.feature_names_in_)

# Predict
y_predict = model.predict(X_test)

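✅Feature Engineering (Log Transform) + RandomForest + Hyperparameter Tuning

1️⃣Feature Engineering: Log Transform

The evaluation metric of this competition is RMSLE. So we take the log of the dependent variable first, train and predict on the log scale, and then convert the predictions back to the original scale.

A log transform is a feature engineering technique mainly used to correct the distribution of a numeric variable when it is far from normal.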
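
A minimal sketch of the round trip on made-up counts: `np.log1p` computes log(1 + x), and `np.expm1` computes exp(x) - 1, which undoes it.

```python
import numpy as np

counts = np.array([0, 1, 16, 145, 977])  # made-up rental counts

logged = np.log1p(counts)    # log(1 + x), safe at 0
restored = np.expm1(logged)  # exp(x) - 1, the inverse of log1p

print(np.round(logged, 3))
print(restored)
```

A model trained on a log1p-transformed target returns predictions on the log scale, so `np.expm1` is the step that brings them back to counts.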

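import numpy as np
import matplotlib.pyplot as plt

# log1p compresses the right tail; expm1 is its inverse and restores the original counts
train["count_log1p"] = np.log1p(train["count"])
train["count_expm1"] = np.expm1(train["count_log1p"])

fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(12,3))
sns.kdeplot(train["count"], ax = axes[0])
sns.kdeplot(train["count_log1p"], ax = axes[1])
sns.kdeplot(train["count_expm1"], ax = axes[2])

train[["count", "count_log1p", "count_expm1"]].describe()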

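2️⃣ Modeling

# dayofweek (Monday=0 ... Sunday=6) is used as a feature below, so derive it from datetime
train["dayofweek"] = train["datetime"].dt.dayofweek
test["dayofweek"] = test["datetime"].dt.dayofweek

# Specify the label and the features to use
label_name = "count_log1p"
feature_names = ['holiday', 'workingday', 'weather', 'temp',
       'atemp', 'humidity', 'windspeed', 'year', 'hour','dayofweek']
feature_names

# Split into training and prediction datasets
X_train = train[feature_names]
X_test = test[feature_names]
y_train = train[label_name]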

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(random_state=42, n_jobs=-1)
model

3️⃣ Hyperparameter tuning: RandomizedSearchCV

# Draw the candidate parameter values at random
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {'max_depth' : np.random.randint(3, 20, 10),
                       'max_features': np.random.uniform(0.7, 1, 10)}

reg = RandomizedSearchCV(model,
                         param_distributions = param_distributions,
                         scoring = 'neg_root_mean_squared_error',
                         n_iter = 10,
                         cv = 5,
                         n_jobs=-1,
                         verbose=2,
                         random_state = 42)
reg.fit(X_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits

reg.best_estimator_
>>> RandomForestRegressor(max_depth=23, max_features=0.8636521987264846, n_jobs=-1, random_state=42)

rmsle = abs(reg.best_score_)
rmsle
>>> 0.48870366876532

pd.DataFrame(reg.cv_results_).sort_values("rank_test_score")
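
In the search above, the candidates are drawn once with NumPy, and the search then samples from those 10 fixed values per parameter. `RandomizedSearchCV` also accepts distribution objects with an `rvs` method, such as those in `scipy.stats`, so each iteration samples directly from the range. The following is a sketch on synthetic data: the ranges mirror the ones above, while the data, `n_estimators=20`, `n_iter=5`, and `cv=3` are assumptions chosen to keep it fast.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for X_train / y_train (assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=42)

param_distributions = {
    "max_depth": randint(3, 20),        # integers in [3, 20)
    "max_features": uniform(0.7, 0.3),  # uniform(loc, scale): floats in [0.7, 1.0]
}

search = RandomizedSearchCV(
    RandomForestRegressor(n_estimators=20, random_state=42),
    param_distributions=param_distributions,
    scoring="neg_root_mean_squared_error",
    n_iter=5,
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # scores are negated, so flip the sign to get the RMSE
```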

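4️⃣ Training and prediction

# Take the best model found by the search, then train and predict
best_model = reg.best_estimator_
y_predict_log = best_model.fit(X_train, y_train).predict(X_test)

# The model was trained on count_log1p, so convert the predictions back to counts
y_predict = np.expm1(y_predict_log)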
