Implementing the Learning Algorithm


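A neural network's learning algorithm is the procedure by which the network updates its own parameters from data. In this post, we walk through the steps of that algorithm and implement them in Python.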

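The Learning Algorithm

Training a neural network proceeds in the following steps.

1. Split the data into training and test sets
2. Build the network
3. Draw a mini-batch
4. Compute the gradient
5. Update the parameters
6. Repeat steps 3-5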

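1. Splitting training and test data

The learning algorithm starts by splitting the data into training data and test data. We split it so that the network can be evaluated objectively, on examples it never saw during training. I like to call the training data the 'mock exam' and the test data the 'CSAT' (Suneung, Korea's college entrance exam).

Before the CSAT, you use mock exams to check how well you have studied, and you change how you study based on the results. In the end, the CSAT gives the final evaluation that decides which university you attend. A neural network works the same way. We use the training data to check how well the network is learning and update the network based on that. Once training is finished, we evaluate the network one final time on the test data.

When splitting the data, test data must never find its way into training. The moment it does, the network is effectively cheating its way to the right answers.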
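To make the "no leakage" rule concrete, here is a minimal sketch of a hold-out split on a made-up toy dataset. The names `toy_x`, `toy_t`, and `n_test` are hypothetical and only serve the illustration; the MNIST loader used in this post already returns the two sets separately.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the split is reproducible

# Hypothetical toy dataset: 10 samples with 3 features each, plus labels.
toy_x = rng.normal(size=(10, 3))
toy_t = rng.integers(0, 2, size=10)

# Shuffle the indices once, then cut them into two disjoint groups.
indices = rng.permutation(len(toy_x))
n_test = 2
test_idx, train_idx = indices[:n_test], indices[n_test:]

toy_x_train, toy_t_train = toy_x[train_idx], toy_t[train_idx]
toy_x_test, toy_t_test = toy_x[test_idx], toy_t[test_idx]

# No index may appear in both sets; otherwise the test score is contaminated.
assert set(train_idx).isdisjoint(test_idx)
print(toy_x_train.shape, toy_x_test.shape)  # (8, 3) (2, 3)
```

Because both sets are cut from one shuffled index array, a sample cannot end up on both sides of the split.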

The code is as follows. (It relies on the code provided with the book 밑바닥부터 시작하는 딥러닝, 'Deep Learning from Scratch'.)

import sys, os
sys.path.append(os.pardir)

from common.functions import *
from common.gradient import numerical_gradient

import numpy as np
from dataset.mnist import load_mnist
from tqdm import tqdm

(x_train, t_train), (x_test, t_test) = load_mnist(normalize = True, one_hot_label = True)

2. Building the network

Next, we need to build the neural network itself. A network is usually implemented as a class. The code is as follows.

class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std = 0.01) -> None:
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size,output_size)
        self.params['b2'] = np.zeros(output_size)


    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1,b2 = self.params['b1'], self.params['b2']

        a1 = x @ W1 + b1
        z1 = sigmoid(a1)

        a2 = z1 @ W2 + b2
        y = softmax(a2)

        return y
    
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y,t)
    

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis = 1)
        t = np.argmax(t, axis = 1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
    
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x,t)

        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W,self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W,self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W,self.params['b2'])

        return grads
        
network = TwoLayerNet(input_size = 784, hidden_size = 50, output_size = 10)
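The class above imports `sigmoid`, `softmax`, and `cross_entropy_error` from the book's `common.functions` module. If you do not have that module at hand, the following is a minimal sketch of what such helpers might look like. These are my own assumed implementations, and the versions in the book's repository may differ in detail. The shape check at the end repeats the same forward pass as `predict` on a hypothetical random batch.

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    """Row-wise softmax; subtracting the max avoids overflow in exp."""
    x = x - np.max(x, axis=-1, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=-1, keepdims=True)

def cross_entropy_error(y, t):
    """Mean cross-entropy for one-hot targets t and predicted probabilities y."""
    if y.ndim == 1:
        y, t = y.reshape(1, -1), t.reshape(1, -1)
    # The small constant keeps log() away from log(0).
    return -np.sum(t * np.log(y + 1e-7)) / y.shape[0]

# Shape check with a hypothetical batch of 5 flattened 28x28 images.
rng = np.random.default_rng(0)
x = rng.random((5, 784))
W1, b1 = 0.01 * rng.standard_normal((784, 50)), np.zeros(50)
W2, b2 = 0.01 * rng.standard_normal((50, 10)), np.zeros(10)

y = softmax(sigmoid(x @ W1 + b1) @ W2 + b2)
print(y.shape)        # (5, 10)
print(y.sum(axis=1))  # each row sums to 1, up to rounding
```

Each row of `y` is a probability distribution over the 10 digit classes, which is why `accuracy` can simply take the `argmax` along axis 1.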

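3. Mini-batches

A 'mini-batch' is a small subset drawn from the training data. Deep learning training sets are usually very large. During training the data is loaded into memory, and if there is too much of it, memory runs out and training cannot proceed. Mini-batches solve this by working on only a small part of the data at a time.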

    # Draw a mini-batch
    train_size = x_train.shape[0]
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]
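One detail worth knowing: `np.random.choice(train_size, batch_size)` samples with replacement by default, so the same index can show up twice in one mini-batch. The sketch below uses NumPy's `Generator.choice` on a tiny hypothetical dataset to compare this with sampling without replacement. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
train_size, batch_size = 10, 4

# Same call shape as np.random.choice(train_size, batch_size):
# replace=True is the default, so an index may appear more than once.
with_repl = rng.choice(train_size, batch_size)

# replace=False guarantees distinct indices within one mini-batch.
without_repl = rng.choice(train_size, batch_size, replace=False)

data = np.arange(train_size) * 10  # stand-in for x_train
batch = data[without_repl]         # fancy indexing picks the sampled rows
print(batch.shape)                 # (4,)
```

With 60,000 MNIST images and a batch size of 100, duplicates within a batch are rare, so either choice works fine for this post's purposes.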

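4. Computing the gradient

In the previous post we looked at loss functions. We use a loss function so that we can measure, and then reduce, the error between the correct answer and the prediction.

The network computes the gradient of the loss function and updates its parameters in the direction that reduces the error. The gradient is built from partial derivatives, approximated here numerically with central differences. The code is as follows.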

def numerical_gradient(f, x):
    h = 1e-4  # 0.0001
    grad = np.zeros_like(x)

    it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
    while not it.finished:
        idx = it.multi_index
        tmp_val = x[idx]
        x[idx] = float(tmp_val) + h
        fxh1 = f(x)  # f(x+h)

        x[idx] = tmp_val - h
        fxh2 = f(x)  # f(x-h)
        grad[idx] = (fxh1 - fxh2) / (2*h)

        x[idx] = tmp_val  # restore the original value
        it.iternext()

    return grad

This function is used as follows.

    # Compute the gradient
    grad = network.numerical_gradient(x_batch, t_batch) # this is slow
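As a quick sanity check of the central-difference idea, we can apply it to a function whose gradient we know by hand. For f(x) = x0^2 + x1^2 the exact gradient is (2*x0, 2*x1). The sketch below uses a compact re-implementation named `central_difference_gradient` (a hypothetical name of mine, using `np.ndindex` instead of `np.nditer`), so it runs without the book's modules.

```python
import numpy as np

def central_difference_gradient(f, x):
    """Approximate the gradient of f at x with central differences."""
    h = 1e-4
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        original = x[idx]
        x[idx] = original + h
        f_plus = f(x)             # f(x + h) along this coordinate
        x[idx] = original - h
        f_minus = f(x)            # f(x - h) along this coordinate
        grad[idx] = (f_plus - f_minus) / (2 * h)
        x[idx] = original         # restore the original value
    return grad

def f(x):
    return np.sum(x ** 2)

# The array must be float: with an int array, adding h would be truncated away.
point = np.array([3.0, 4.0])
grad = central_difference_gradient(f, point)
print(np.allclose(grad, 2 * point))  # True
```

Note that every coordinate costs two evaluations of `f`. That is harmless for two numbers, but it becomes the bottleneck once `f` is the loss of a whole network.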

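5. Updating the parameters

Once we have the gradient, we need to update the parameters. The method used for this is 'gradient descent'. I will explain gradient descent in detail in a later post. In short, you can think of it as moving the parameters by a fixed fraction in the direction opposite to the gradient, so that the loss function is minimized.

In code, it looks like this.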

    # Update the parameters
    for key in ('W1','b1','W2','b2'):
        network.params[key] -= learning_rate * grad[key]

Here learning_rate is the fixed fraction mentioned above, and the minus sign means the direction opposite to the gradient.
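To see the update rule at work outside a network, here is a small sketch that minimizes f(x) = x0^2 + x1^2 with the same `-= learning_rate * grad` step. The analytic gradient 2*x stands in for the numerical one, and the starting point and step count are arbitrary choices for illustration.

```python
import numpy as np

def grad_f(x):
    """Analytic gradient of f(x) = x0^2 + x1^2."""
    return 2 * x

x = np.array([-3.0, 4.0])   # arbitrary starting point
learning_rate = 0.1

for step in range(100):
    # Same form as network.params[key] -= learning_rate * grad[key]
    x -= learning_rate * grad_f(x)

print(np.allclose(x, 0.0, atol=1e-8))  # True
```

With learning_rate = 0.1 each step multiplies x by 0.8, so the point shrinks steadily toward the minimum at the origin. A learning rate that is too large (here, above 1.0) would make the iterates grow instead of shrink.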

6. Wrapping up

From here, repeating the mini-batch, gradient computation, and parameter update steps completes the learning algorithm.

The full code is as follows.

import sys, os
sys.path.append(os.pardir)

from common.functions import *
from common.gradient import numerical_gradient

from dataset.mnist import load_mnist
from tqdm import tqdm

import numpy as np


class TwoLayerNet:
    def __init__(self, input_size, hidden_size, output_size, weight_init_std = 0.01) -> None:
        self.params = {}
        self.params['W1'] = weight_init_std * np.random.randn(input_size, hidden_size)
        self.params['b1'] = np.zeros(hidden_size)
        self.params['W2'] = weight_init_std * np.random.randn(hidden_size,output_size)
        self.params['b2'] = np.zeros(output_size)


    def predict(self, x):
        W1, W2 = self.params['W1'], self.params['W2']
        b1,b2 = self.params['b1'], self.params['b2']

        a1 = x @ W1 + b1
        z1 = sigmoid(a1)

        a2 = z1 @ W2 + b2
        y = softmax(a2)

        return y
    
    def loss(self, x, t):
        y = self.predict(x)
        return cross_entropy_error(y,t)
    

    def accuracy(self, x, t):
        y = self.predict(x)
        y = np.argmax(y, axis = 1)
        t = np.argmax(t, axis = 1)

        accuracy = np.sum(y == t) / float(x.shape[0])
        return accuracy
    
    def numerical_gradient(self, x, t):
        loss_W = lambda W: self.loss(x,t)

        grads = {}
        grads['W1'] = numerical_gradient(loss_W, self.params['W1'])
        grads['b1'] = numerical_gradient(loss_W,self.params['b1'])
        grads['W2'] = numerical_gradient(loss_W,self.params['W2'])
        grads['b2'] = numerical_gradient(loss_W,self.params['b2'])

        return grads
   
   
(x_train, t_train), (x_test, t_test) = load_mnist(normalize = True, one_hot_label = True)

train_loss_list = []
train_acc_list = []
test_acc_list = []

iters_num = 10000
train_size = x_train.shape[0]
batch_size = 100
learning_rate = 0.1


# Number of iterations per epoch (integer division keeps this an int)
iter_per_epoch = max(train_size // batch_size, 1)


network = TwoLayerNet(input_size = 784, hidden_size = 50, output_size = 10)

for i in tqdm(range(iters_num)):
    # Draw a mini-batch
    batch_mask = np.random.choice(train_size, batch_size)
    x_batch = x_train[batch_mask]
    t_batch = t_train[batch_mask]

    # Compute the gradient
    grad = network.numerical_gradient(x_batch, t_batch) # this is slow

    # Update the parameters
    for key in ('W1','b1','W2','b2'):
        network.params[key] -= learning_rate * grad[key]

    # Record the loss
    loss = network.loss(x_batch, t_batch)
    train_loss_list.append(loss)

    if i % iter_per_epoch == 0:
        train_acc = network.accuracy(x_train,t_train)
        test_acc = network.accuracy(x_test,t_test)
        train_acc_list.append(train_acc)
        test_acc_list.append(test_acc)
        print(train_acc, test_acc)
    
    if i % 100 ==0:
        print(loss)

If you run this code, you will see that even a single training iteration takes a very long time. The bottleneck is the gradient computation. To solve this problem we need 'backpropagation', which I will cover in the next post.
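A rough count shows why the numerical gradient is so slow. The central difference evaluates the loss twice for every single parameter, and each evaluation is a full forward pass over the mini-batch. The sketch below simply counts parameters for the network sizes used in this post. It is a back-of-the-envelope estimate, not a benchmark.

```python
input_size, hidden_size, output_size = 784, 50, 10
iters_num = 10000

# W1, b1, W2, b2
n_params = (input_size * hidden_size + hidden_size
            + hidden_size * output_size + output_size)

# Two loss evaluations (f(x+h) and f(x-h)) per parameter, per iteration.
evals_per_iter = 2 * n_params
total_evals = evals_per_iter * iters_num

print(n_params)        # 39760
print(evals_per_iter)  # 79520
print(total_evals)     # 795200000
```

Backpropagation obtains all of these gradients from roughly one forward pass and one backward pass per iteration, which is where the speed-up comes from.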
