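Random Forest Start

BioinformaticsAndMe

Random Forest

: Random Forest is a machine learning method proposed by Breiman (2001) that reduces overfitting by choosing the best split variable from a randomly selected subset of variables

: It builds many decision trees, and the name "Forest" comes from these trees forming a forest

*"Random" refers to the random selection of the features used by the decision trees planted in the forest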

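Advantages of Random Forest

a) Applicable to both classification and regression problems

b) Handles missing values relatively easily

c) Works well on large datasets

d) Mitigates overfitting, in which a model fits the noise in the training data, and thereby improves model accuracy

e) Can select and rank the relatively important variables in a classification model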

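Random Forest procedure

1) Draw a bootstrap sample of size n from the training set

*Bootstrap sampling - repeatedly drawing random samples, with replacement, from the original sample to create many resampled datasets

2) Grow a random forest tree on each bootstrap sample

a) Randomly select m variables from the full set of variables

b) Choose the best classifier (split) among those m variables

c) Split the node into two daughter nodes according to that classifier

#Figure: a random forest being built with randomly selected genes as variables to separate patients who relapsed

#OOB (Out-Of-Bag): the samples left out of a tree's bootstrap sample; the error measured on them indicates how well the trained random forest performs

3) Output the ensemble result of the trees

*Ensemble learning: training many models, here on many resampled subsets of the data, and combining them to build a better-performing model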
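The three ingredients of the procedure above (bootstrap sampling, a random subset of m variables, and combining the trees' outputs) can be sketched in plain Python. This is only an illustration of the mechanics, not the randomForest package's implementation; the helper names, the fixed seed, and the hard-coded tree votes are all assumptions made for the example, and no actual tree is grown.

```python
import random
from collections import Counter

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n) for _ in range(n)]

def out_of_bag(n, sampled):
    """Rows never drawn into the bootstrap sample; these can be used for the OOB error."""
    drawn = set(sampled)
    return [i for i in range(n) if i not in drawn]

def random_feature_subset(features, m, rng):
    """Pick m candidate variables at random for one split."""
    return rng.sample(features, m)

def majority_vote(tree_predictions):
    """Combine the per-tree class predictions into the ensemble prediction."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)  # fixed seed (an assumption) so the sketch is repeatable
n = 150                 # number of rows, matching the size of iris
features = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]

sample = bootstrap_indices(n, rng)
oob = out_of_bag(n, sample)
candidates = random_feature_subset(features, 2, rng)
vote = majority_vote(["setosa", "versicolor", "setosa"])  # hypothetical votes from three trees

print(len(sample), len(oob), candidates, vote)
```

Because the rows are drawn with replacement, some rows appear several times in `sample` and others not at all; for large n roughly 37% of the rows end up out-of-bag.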

Random Forest example

: Example - R and Data Mining by Zhao

# Install and load the R randomForest package
install.packages('randomForest')
library(randomForest)

# Use the iris dataset
data(iris)
str(iris)

'data.frame':	150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

# Randomly split iris into trainData and testData at roughly a 7:3 ratio
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.7, 0.3))
trainData <- iris[ind == 1, ]
testData <- iris[ind == 2, ]

# Fit a random forest with 100 trees
# Two randomly chosen variables are tried at each split
rf <- randomForest(as.factor(Species) ~ ., data = trainData, ntree = 100, proximity = TRUE,
                   importance = TRUE)

# Print the Out-of-Bag (OOB) error rate and confusion matrix
print(rf)
Call:
 randomForest(formula = as.factor(Species) ~ ., data = trainData,      ntree = 100, proximity = TRUE, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 100
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.46%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         40          0         0  0.00000000
versicolor      0         35         3  0.07894737
virginica       0          2        32  0.05882353

# Predictions of the randomForest model for trainData (without newdata, predict returns the out-of-bag predictions)
predict(rf)

1 2 3 4 6 7 8 9 10 11 12 13 15 17 18 19
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
20 21 22 23 24 25 27 30 31 32 33 34 35 37 38 41
setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa setosa
42 43 44 45 46 47 48 49 51 52 54 55 56 57 59 62
setosa setosa setosa setosa setosa setosa setosa setosa versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
63 64 65 67 68 69 70 71 73 75 76 77 78 79 80 82
versicolor versicolor versicolor versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor versicolor virginica versicolor versicolor versicolor
83 84 85 87 88 89 91 93 94 95 96 97 98 99 101 102
versicolor virginica versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor virginica virginica
103 104 105 106 107 108 109 110 112 114 115 118 119 125 126 127
virginica virginica virginica virginica versicolor virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica
128 129 130 132 133 134 136 138 139 141 143 144 145 146 148 150
virginica virginica virginica virginica virginica versicolor virginica virginica virginica virginica virginica virginica virginica virginica virginica virginica

Levels: setosa versicolor virginica

# Cross-tabulate the predictions against the actual trainData labels
table(predict(rf), trainData$Species)

             setosa versicolor virginica
  setosa         40          0         0
  versicolor      0         35         2
  virginica       0          3        32

# Predict testData with the randomForest model fitted on trainData
irisPred <- predict(rf, newdata = testData)
irisPred

         5         14         16         26         28         29         36         39         40         50         53         58         60 
    setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa versicolor versicolor versicolor 
        61         66         72         74         81         86         90         92        100        111        113        116        117 
versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor  virginica  virginica  virginica  virginica 
       120        121        122        123        124        131        135        137        140        142        147        149 
versicolor  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
Levels: setosa versicolor virginica

# Cross-tabulate the predictions against the actual testData labels
table(irisPred, testData$Species)

irisPred     setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         12         1
  virginica       0          0        15

# Variable importance from the randomForest fit
importance(rf)

                setosa versicolor virginica MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length  1.550198   2.904545  5.359329             6.234686         6.863905
Sepal.Width   1.616085   1.750807  1.932415             2.689961         1.582434
Petal.Length  8.829176  11.694845 11.303180            12.970271        30.346106
Petal.Width  10.915855  13.094225 14.000101            15.456179        35.045996

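# Dot plot of variable importance
varImpPlot(rf)

#MeanDecreaseAccuracy: mean decrease in accuracy when the variable's values are permuted

#MeanDecreaseGini: mean decrease in node impurity (Gini index) from splits on the variable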

#Reference

1) https://towardsdatascience.com/understanding-random-forest-58381e0602d2

2) https://wikidocs.net/34086

3) http://rstudio-pubs-static.s3.amazonaws.com/4944_b042d59e3b174ec395bf2a20eab939d3.html

4) https://www.edureka.co/blog/random-forest-classifier/

5) https://www.codingame.com/playgrounds/7163/machine-learning-with-java---part-6-random-forest

6) https://www.researchgate.net/figure/An-example-of-bootstrap-sampling-Since-objects-are-subsampled-with-replacement-some_fig2_322179244

7) https://www.biostars.org/p/86981/

8) https://thebook.io/006723/ch10/03/04/03/

Random Forest End


[TensorFlow] Logistic Regression Start

[TensorFlow] Logistic Regression

: Logistic regression carried out in TensorFlow 2.0

: Uses the handwritten digit (0-9) data from the MNIST database (Modified National Institute of Standards and Technology database)

→ Contains 60,000 training images and 10,000 testing images

→ Each MNIST image is 28x28 pixels, and each pixel holds an intensity value between 0 and 255

: In this example each image is 1) converted to float32, 2) flattened into a 1-D array of 784 features (28x28), and 3) scaled to [0, 1]

# '__future__': lets Python 2 use Python 3 syntax
from __future__ import absolute_import, division, print_function

# Import TensorFlow and NumPy
import tensorflow as tf
import numpy as np

# MNIST dataset parameters
num_classes = 10      # digits 0 through 9
num_features = 784    # 28*28 = 784

# Training parameters
learning_rate = 0.01
training_steps = 1000
batch_size = 256
display_step = 50

# Load the MNIST data
from tensorflow.keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()    # returned as tuples of NumPy arrays

# 1) Convert to float32
x_train, x_test = np.array(x_train, np.float32), np.array(x_test, np.float32)
# 2) Flatten each image into a 1-D array of 28*28 = 784 pixels
x_train, x_test = x_train.reshape([-1, num_features]), x_test.reshape([-1, num_features])
# 3) Divide by 255 to scale the values to [0, 1]
x_train, x_test = x_train / 255., x_test / 255.
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
11493376/11490434 [==============================] - 1s 0us/step

# Shuffle and batch the data with the tf.data API
train_data = tf.data.Dataset.from_tensor_slices((x_train, y_train))
train_data = train_data.repeat().shuffle(5000).batch(batch_size).prefetch(1)

# The weight matrix needs 784*10 weights in total
W = tf.Variable(tf.ones([num_features, num_classes]), name="weight")
b = tf.Variable(tf.zeros([num_classes]), name="bias")

# Logistic regression model: softmax(xW + b)
def logistic_regression(x):
    # Apply the softmax function
    return tf.nn.softmax(tf.matmul(x, W) + b)

# Cross-entropy loss function
def cross_entropy(y_pred, y_true):
    # One-hot encode the integer class labels (e.g. 3 becomes a length-10 vector with a 1 at index 3)
    y_true = tf.one_hot(y_true, depth=num_classes)
    # Clip the predictions to avoid log(0)
    y_pred = tf.clip_by_value(y_pred, 1e-9, 1.)
    # Compute the cross-entropy. reduce_sum has no axis argument here, so it sums over
    # the whole batch and reduce_mean leaves that scalar unchanged
    return tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred)))

# Accuracy metric
def accuracy(y_pred, y_true):
    # The predicted class is the index of the highest score in the prediction vector
    correct_prediction = tf.equal(tf.argmax(y_pred, 1), tf.cast(y_true, tf.int64))
    return tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

# Stochastic Gradient Descent (SGD) optimizer
optimizer = tf.optimizers.SGD(learning_rate)

# Optimization step
def run_optimization(x, y):
    # tf.GradientTape records operations for automatic differentiation
    # (computing the gradient of a computation with respect to its input variables)
    with tf.GradientTape() as g:
        pred = logistic_regression(x)
        loss = cross_entropy(pred, y)

    # Compute the gradients
    gradients = g.gradient(loss, [W, b])

    # Update the weight (W) and bias (b) using the gradients
    optimizer.apply_gradients(zip(gradients, [W, b]))

# Run training for the given number of steps
for step, (batch_x, batch_y) in enumerate(train_data.take(training_steps), 1):
    # Run the optimization defined above to update the weight (W) and bias (b)
    run_optimization(batch_x, batch_y)

    # Every display_step steps (50, 100, 150, ...), print the current loss and accuracy
    if step % display_step == 0:
        pred = logistic_regression(batch_x)
        loss = cross_entropy(pred, batch_y)
        acc = accuracy(pred, batch_y)
        print("step: %i, loss: %f, accuracy: %f" % (step, loss, acc))

step: 50, loss: 666.254395, accuracy: 0.765625
step: 100, loss: 515.914062, accuracy: 0.828125
step: 150, loss: 614.817017, accuracy: 0.832031
step: 200, loss: 620.690918, accuracy: 0.820312
step: 250, loss: 520.586487, accuracy: 0.867188
step: 300, loss: 580.417847, accuracy: 0.871094
step: 350, loss: 484.761322, accuracy: 0.839844
step: 400, loss: 621.492798, accuracy: 0.820312
step: 450, loss: 791.947632, accuracy: 0.820312
step: 500, loss: 616.336060, accuracy: 0.828125
step: 550, loss: 645.748718, accuracy: 0.851562
step: 600, loss: 732.390259, accuracy: 0.761719
step: 650, loss: 631.023621, accuracy: 0.832031
step: 700, loss: 74.979805, accuracy: 0.914062
step: 750, loss: 48.739697, accuracy: 0.941406
step: 800, loss: 70.699936, accuracy: 0.910156
step: 850, loss: 60.746632, accuracy: 0.937500
step: 900, loss: 57.646683, accuracy: 0.937500
step: 950, loss: 95.919128, accuracy: 0.890625
step: 1000, loss: 72.987892, accuracy: 0.929688

# Measure the accuracy of the trained model on the test set
pred = logistic_regression(x_test)
print("Test Accuracy: %f" % accuracy(pred, y_test))
Test Accuracy: 0.913800

# Visualize the predictions
import matplotlib.pyplot as plt

# Predict 5 images with the trained model
n_images = 5
test_images = x_test[:n_images]
predictions = logistic_regression(test_images)

# Show each image together with the model's prediction
for i in range(n_images):
    plt.imshow(np.reshape(test_images[i], [28, 28]), cmap='gray')
    plt.show()
    print("Model prediction: %i" % np.argmax(predictions.numpy()[i]))

Model prediction: 7
Model prediction: 2
Model prediction: 1
Model prediction: 0
Model prediction: 4

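#Note - [TensorFlow] One-hot encoding

: One-hot encoding is a vector representation that puts a 1 at the index of the word to be represented and 0 at every other index

(1) Assign a unique index to each word (integer encoding)

(2) Put a 1 at the index of the word to be represented and 0 at the indices of all other words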
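The two steps above can be sketched in plain Python. The word list and the helper name `one_hot` are made up for illustration; in the TensorFlow code of this post the same idea is applied to the integer digit labels via `tf.one_hot`.

```python
def one_hot(index, depth):
    """Return a list of length `depth` with a 1 at `index` and 0 elsewhere."""
    vector = [0] * depth
    vector[index] = 1
    return vector

# (1) Integer encoding: give every word a unique index (hypothetical vocabulary)
words = ["apple", "banana", "cherry"]
word_to_index = {word: i for i, word in enumerate(words)}

# (2) One-hot encoding: 1 at the word's index, 0 everywhere else
encoded = {word: one_hot(i, len(words)) for word, i in word_to_index.items()}

# The same helper applied to an MNIST-style class label (digit 7 out of 10 classes)
label_vector = one_hot(7, 10)

print(encoded["banana"])
print(label_vector)
```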

#Reference

1) https://github.com/aymericdamien/TensorFlow-Examples/tree/master/tensorflow_v2

2) http://yann.lecun.com/exdb/mnist/

3) https://ko.wikipedia.org/wiki/MNIST_%EB%8D%B0%EC%9D%B4%ED%84%B0%EB%B2%A0%EC%9D%B4%EC%8A%A4

4) https://mmlind.github.io/Simple_1-Layer_Neural_Network_for_MNIST_Handwriting_Recognition/

5) https://wikidocs.net/22647

[TensorFlow] Logistic Regression End


[TensorFlow] Linear Regression Start

[TensorFlow] Linear Regression

: A low-level approach to linear regression in TensorFlow 2.0

# '__future__': lets Python 2 use Python 3 syntax
from __future__ import absolute_import, division, print_function

# Import TensorFlow and NumPy
import tensorflow as tf
import numpy as np
rng = np.random         # handle for random number generation

# Training parameters
learning_rate = 0.01   # learning rate
training_steps = 1000  # total number of training steps
display_step = 50      # interval (in steps) at which progress is printed

# Create the training data (NumPy arrays)
X = np.array([3.3,4.4,5.5,6.71,6.93,4.168,9.779,6.182,7.59,2.167,
              7.042,10.791,5.313,7.997,5.654,9.27,3.1])
Y = np.array([1.7,2.76,2.09,3.19,1.694,1.573,3.366,2.596,2.53,1.221,
              2.827,3.465,1.65,2.904,2.42,2.94,1.3])
n_samples = X.shape[0]   # shape[0] is the number of samples
print(n_samples)
17

# Randomly initialize the weight and bias used at the start of training
W = tf.Variable(rng.randn(), name="weight")   # tf.Variable: creates a TensorFlow variable
b = tf.Variable(rng.randn(), name="bias")     # rng.randn: draws a random number from the standard normal distribution

# Linear regression model (Wx + b)
def linear_regression(x):
    return W * x + b

# Loss function: Mean Square Error (MSE); dividing by 2 * n_samples gives half the usual mean
def mean_square(y_pred, y_true):
    return tf.reduce_sum(tf.pow(y_pred-y_true, 2)) / (2 * n_samples)

# Stochastic Gradient Descent (SGD) optimizer
optimizer = tf.optimizers.SGD(learning_rate)

# Optimization step
def run_optimization():
    # tf.GradientTape records operations for automatic differentiation
    # (computing the gradient of a computation with respect to its input variables)
    with tf.GradientTape() as g:
        pred = linear_regression(X)
        loss = mean_square(pred, Y)

    # Compute the gradients
    gradients = g.gradient(loss, [W, b])

    # Update the weight (W) and bias (b) using the gradients
    optimizer.apply_gradients(zip(gradients, [W, b]))

# Run training for the given number of steps
for step in range(1, training_steps + 1):
    # Run the optimization defined above to update the weight (W) and bias (b)
    run_optimization()

    # Every display_step steps (50, 100, 150, ...), print the current loss and parameters
    if step % display_step == 0:
        pred = linear_regression(X)
        loss = mean_square(pred, Y)
        print("step: %i, loss: %f, W: %f, b: %f" % (step, loss, W.numpy(), b.numpy()))

step: 50, loss: 0.133085, W: 0.117284, b: 1.751286
step: 100, loss: 0.126663, W: 0.125200, b: 1.695165
step: 150, loss: 0.120975, W: 0.132650, b: 1.642350
step: 200, loss: 0.115937, W: 0.139661, b: 1.592647
step: 250, loss: 0.111476, W: 0.146259, b: 1.545872
step: 300, loss: 0.107524, W: 0.152467, b: 1.501854
step: 350, loss: 0.104025, W: 0.158310, b: 1.460430
step: 400, loss: 0.100926, W: 0.163809, b: 1.421446
step: 450, loss: 0.098182, W: 0.168984, b: 1.384759
step: 500, loss: 0.095751, W: 0.173854, b: 1.350234
step: 550, loss: 0.093598, W: 0.178437, b: 1.317743
step: 600, loss: 0.091692, W: 0.182750, b: 1.287166
step: 650, loss: 0.090003, W: 0.186809, b: 1.258391
step: 700, loss: 0.088508, W: 0.190628, b: 1.231312
step: 750, loss: 0.087184, W: 0.194223, b: 1.205827
step: 800, loss: 0.086011, W: 0.197606, b: 1.181845
step: 850, loss: 0.084972, W: 0.200789, b: 1.159276
step: 900, loss: 0.084052, W: 0.203785, b: 1.138036
step: 950, loss: 0.083237, W: 0.206604, b: 1.118049
step: 1000, loss: 0.082516, W: 0.209258, b: 1.099238

# Visualize the linear regression result
import matplotlib.pyplot as plt   # matplotlib: library for visualizing data as charts and plots
plt.plot(X, Y, 'ro', label='Original data')
plt.plot(X, np.array(W * X + b), label='Fitted line')
plt.legend()
plt.show()

#Note - [TensorFlow] Terminology

a) Epoch: one full pass over the entire training data (1 epoch)

b) Step: one update of the weight and bias (1 step)

c) Batch size: the number of examples used in one step
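How the three terms relate can be shown with a little arithmetic. The sketch below is an illustration only; the helper names are made up, and the second set of numbers (60,000 examples, batch size 256, 1,000 steps) is borrowed from the MNIST logistic regression post on this blog.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of steps needed to see every example once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

def epochs_covered(n_steps, n_samples, batch_size):
    """How many passes over the data a given number of steps corresponds to."""
    return n_steps * batch_size / n_samples

# Linear regression example: all 17 examples are used in every step,
# so the batch size is 17 and one step equals one epoch
full_batch_steps = steps_per_epoch(17, 17)

# MNIST example: 60,000 training images with a batch size of 256
mnist_steps = steps_per_epoch(60000, 256)
mnist_epochs = epochs_covered(1000, 60000, 256)

print(full_batch_steps, mnist_steps, round(mnist_epochs, 2))
```

With these numbers, one epoch of MNIST takes 235 steps, and 1,000 steps correspond to a little over 4 epochs.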

#Reference

1) https://github.com/aymericdamien/TensorFlow-Examples/tree/master/tensorflow_v2

2) https://m.blog.naver.com/PostView.nhn?blogId=wideeyed&logNo=221333529176&proxyReferer=https%3A%2F%2Fwww.google.com%2F

3) https://www.tensorflow.org/tutorials/customization/autodiff?hl=ko

[TensorFlow] Linear Regression End


[TensorFlow] Basic Operations Start

[TensorFlow] Basic Operations

: Simple examples of the basic arithmetic operations used in TensorFlow 2.0

# Import the TensorFlow library
from __future__ import print_function   # lets Python 2 use some Python 3 syntax (the print function)
import tensorflow as tf

# Create integer constant tensors with tf.constant()
a = tf.constant(2)
b = tf.constant(3)
c = tf.constant(5)

# Tensor operation functions (operators such as '+' and '-' also work)
add = tf.add(a, b)
sub = tf.subtract(a, b)
mul = tf.multiply(a, b)
div = tf.divide(a, b)

# Print the tensor values (converting each tensor to a NumPy value)
print("add =", add.numpy())
print("sub =", sub.numpy())
print("mul =", mul.numpy())
print("div =", div.numpy())
add = 5
sub = -1
mul = 6
div = 0.6666666666666666

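# Mean (reduce_mean) and sum (reduce_sum), reducing across all elements
# The inputs are integers, so the mean 10/3 is returned as the integer 3
mean = tf.reduce_mean([a, b, c])
sum = tf.reduce_sum([a, b, c])

# Print the tensor values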
print("mean =", mean.numpy())
print("sum =", sum.numpy())
mean = 3
sum = 10

# Create 2-D matrix tensors
matrix1 = tf.constant([[1., 2.], [3., 4.]])   # the trailing '.' makes each number a float (float32 tensor)
matrix2 = tf.constant([[5., 6.], [7., 8.]])

# Matrix multiplication
product = tf.matmul(matrix1, matrix2)
product
<tf.Tensor: id=25, shape=(2, 2), dtype=float32, numpy=
array([[19., 22.],
       [43., 50.]], dtype=float32)>

# Convert the tensor to a NumPy array
product.numpy()
array([[19., 22.],
       [43., 50.]], dtype=float32)

#Note - [TensorFlow] Terminology

a) Scalar: a single number

b) Vector: a one-dimensional array of numbers (scalars)

c) Matrix: a two-dimensional array

d) Tensor: a multi-dimensional array
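The four terms differ only in the number of dimensions. As a rough illustration in plain Python (nested lists standing in for tensors, with a made-up helper `rank` that counts nesting levels):

```python
def rank(value):
    """Count nesting levels: 0 for a scalar, 1 for a vector, 2 for a matrix, and so on."""
    depth = 0
    while isinstance(value, list):
        depth += 1
        if not value:       # stop at an empty list instead of indexing into it
            break
        value = value[0]
    return depth

scalar = 5                                   # a single number
vector = [1, 2, 3]                           # 1-D array of scalars
matrix = [[1.0, 2.0], [3.0, 4.0]]            # 2-D array
tensor = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 3-D array

print(rank(scalar), rank(vector), rank(matrix), rank(tensor))
```

In this sense a scalar, a vector, and a matrix are tensors of 0, 1, and 2 dimensions respectively.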

#Reference

1) https://github.com/aymericdamien/TensorFlow-Examples/tree/master/tensorflow_v2

2) https://medium.com/@manish54.thapliyal/machine-learning-basics-scalars-vectors-matrices-and-tensors-e120ecd0e6f7

[TensorFlow] Basic Operations End


[TensorFlow] String Output Start

[TensorFlow] String Output

: A simple example that prints the string 'hello world' with TensorFlow 2.0

# Import the TensorFlow library
import tensorflow as tf

# Create a Tensor holding the string "hello world" (Tensor: multi-dimensional array)
hello_tensor = tf.constant("hello world")
print(hello_tensor)
tf.Tensor(b'hello world', shape=(), dtype=string)

# To access the Tensor's value, convert it with numpy() and print it
print(hello_tensor.numpy())
b'hello world'

# Convert the value's type (bytes → str)
print(hello_tensor.numpy().decode('utf-8'))
hello world

#Reference

1) https://github.com/aymericdamien/TensorFlow-Examples/tree/master/tensorflow_v2

2) https://colab.research.google.com/notebooks/mlcc/hello_world.ipynb

[TensorFlow] String Output End


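Types of Gradient Descent Start

Types of Gradient Descent

: Gradient descent learning comes in three basic types

*Batch (in gradient descent) - the total number of examples used to compute the gradient in a single iteration

*If the batch is too large, even a single iteration can take a long time to compute

1. Batch Gradient Descent

: Uses all of the training data to compute the gradient of the cost function for every parameter update

: Also called vanilla gradient descent

: Can be unnecessarily inefficient, since every update requires a pass over the whole dataset

2. Stochastic Gradient Descent (SGD)

: Uses a single randomly sampled training example to compute the gradient of the cost function for each parameter update

: Updates the model much more frequently, so improvement can be observed quickly

: Can reduce the chance of getting stuck in local minima

: Makes it relatively harder to judge whether the cost has converged to its minimum

3. Mini Batch Gradient Descent (Mini batch SGD)

: Uses a randomly drawn subset of the data, of fixed size, to compute the gradient of the cost function for each parameter update

: A blend of batch gradient descent and stochastic gradient descent

: Reduces the noise of SGD while being more efficient than using the full batch

: A widely used technique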
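The three types differ only in how many examples feed each gradient computation. The sketch below makes that concrete with one training function whose `batch_size` selects the variant. It is a toy illustration in plain Python; the data (noiseless points on y = 2x + 1), the learning rate, the epoch count, and all function names are assumptions chosen for the example.

```python
import random

def gradient(w, b, xs, ys):
    """Gradient of the mean squared error of y = w*x + b over the given examples."""
    n = len(xs)
    dw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
    db = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
    return dw, db

def train(xs, ys, batch_size, lr=0.01, epochs=200, seed=0):
    """Gradient descent where batch_size picks the variant:
    len(xs) -> batch, 1 -> stochastic, anything in between -> mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    updates = 0
    for _ in range(epochs):
        rng.shuffle(indices)  # random order, so each batch is a random draw
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            dw, db = gradient(w, b, [xs[i] for i in batch], [ys[i] for i in batch])
            w -= lr * dw
            b -= lr * db
            updates += 1
    return w, b, updates

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2 * x + 1 for x in xs]  # toy data lying exactly on y = 2x + 1

batch_result = train(xs, ys, batch_size=len(xs))  # 1 update per epoch
sgd_result = train(xs, ys, batch_size=1)          # 8 updates per epoch
mini_result = train(xs, ys, batch_size=4)         # 2 updates per epoch

for name, (w, b, updates) in [("batch", batch_result), ("sgd", sgd_result), ("mini-batch", mini_result)]:
    print(name, "updates:", updates, "w:", round(w, 3), "b:", round(b, 3))
```

For the same number of epochs, SGD performs eight times as many parameter updates as batch gradient descent on this data, while mini-batch sits in between.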

#Reference

1) https://medium.com/mindorks/an-introduction-to-gradient-descent-7b0c6d9e49f6

2) https://developers.google.com/machine-learning/crash-course

3) https://towardsdatascience.com/gradient-descent-algorithm-and-its-variants-10f652806a3

Types of Gradient Descent End


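Learning rate Start

Learning rate

: The gradient descent algorithm multiplies the gradient by a scalar called the learning rate (or step size) to determine the next point

: The learning rate must be set neither too large nor too small so that a local minimum is reached efficiently

Learning rate too large: the updates overshoot and bounce around erratically, failing to converge to the minimum

Learning rate too small: training takes a very long time and may not reach the minimum within the allotted iterations

To check whether the chosen learning rate works well for gradient descent, plot loss against epoch in a 2-D graph

- loss : cost function

- epoch : number of iterations

low learning rate: the loss decreases roughly linearly and learning is slow

high learning rate: the loss decreases in an exponential-looking curve, learning fast in some stretches and stalling in others

very high learning rate: in some cases the loss actually increases

good learning rate: the desirable learning curve; it has to be found by tuning the learning rate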
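The effect of the learning rate can be reproduced on a one-parameter toy problem. The sketch below runs gradient descent on f(w) = w², whose minimum is at w = 0, with three learning rates. The function, the starting point, and the specific rates are assumptions picked to make the three behaviours visible; they are not taken from the figure in the original post.

```python
def descend(learning_rate, steps=50, w_start=1.0):
    """Run gradient descent on f(w) = w**2 and return the loss after every step."""
    w = w_start
    losses = []
    for _ in range(steps):
        w -= learning_rate * 2 * w   # the gradient of w**2 is 2*w
        losses.append(w ** 2)
    return losses

low = descend(0.001)       # too small: the loss barely moves in 50 steps
good = descend(0.1)        # reasonable: the loss shrinks towards 0
very_high = descend(1.1)   # too large: every step overshoots and the loss grows

print("low:", low[-1])
print("good:", good[-1])
print("very high:", very_high[-1])
```

Plotting each list against the step number gives loss curves of the kind described above.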

#Reference

1) https://en.wikipedia.org/wiki/Learning_rate

2) https://developers.google.com/machine-learning/crash-course

3) https://medium.com/mindorks/an-introduction-to-gradient-descent-7b0c6d9e49f6

4) https://taeu.github.io/cs231n/deeplearning-cs231n-Neural-Networks-3/

Learning rate End


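Gradient descent Start

Gradient descent

: Repeatedly moving in the downhill direction of a function's slope (gradient) until an extremum is reached

: A machine learning algorithm that uses the gradient of a given function to find its minimum (↔ gradient ascent, which seeks the maximum)

: The process of iteratively adjusting parameters to minimize a cost function

: The goal is to find the model's optimal parameters through training

Gradient descent procedure

Step 1. Start from some parameter value

: Choose a starting value (starting point) for the weight w1

: Many algorithms set w1 to 0 or pick a random value

Step 2. Compute the cost function

: Cost function - a function of the weights w that make up the model

: Compute the slope of the curve at the starting point

Step 3. Update the parameter value

: Parameter - a variable to be optimized through training

: W_new = W - learning_rate * dW   (dW: gradient of the cost with respect to W)

Step 4. Repeat

: The steps above are repeated for n iterations, converging toward the minimum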
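The four steps map directly onto a short loop. The sketch below applies them to a made-up cost function, cost(w) = (w - 3)², whose minimum is at w = 3; the cost function, starting value, learning rate, and iteration count are assumptions for illustration.

```python
def cost(w):
    """Toy cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def cost_gradient(w):
    """Derivative of the toy cost function with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w_start=0.0, learning_rate=0.1, n_iterations=100):
    w = w_start                              # Step 1: start from some parameter value
    history = []
    for _ in range(n_iterations):            # Step 4: repeat for n iterations
        dw = cost_gradient(w)                # Step 2: slope of the cost at the current point
        w = w - learning_rate * dw           # Step 3: W_new = W - learning_rate * dW
        history.append(cost(w))
    return w, history

w_final, cost_history = gradient_descent()
print("w:", w_final, "cost:", cost_history[-1])
```

Starting from w = 0, the parameter moves toward 3 and the recorded cost shrinks at every iteration.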

#Reference

1) https://ko.wikipedia.org/wiki/%EA%B2%BD%EC%82%AC_%ED%95%98%EA%B0%95%EB%B2%95

2) https://developers.google.com/machine-learning/crash-course

3) https://www.topcoder.com/blog/gradient-descent-in-machine-learning/

4) https://towardsdatascience.com/machine-learning-fundamentals-via-linear-regression-41a5d11f5220

Gradient descent End


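Machine Learning Linear Regression (ML, Linear regression) Start

Linear Regression

: One of the most basic and widely used machine learning algorithms

: Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a best-fit straight line (the regression line)

: The relationship is assumed to be linear

Y = w1*X1 + w0

= b1*X1 + b0   (the same line written with b for the coefficients)

# w1: slope (weight of feature 1)

# X1: input (feature)

# w0: y-intercept (bias)

# Y: output (label)

Multiple linear regression uses several features to provide a more refined model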
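As a small illustration of the equation, the sketch below evaluates the single-feature model Y = w1*X1 + w0 and its multi-feature extension. The function names and all numeric values are made up for the example; the weights are given rather than learned.

```python
def predict_simple(x1, w1, w0):
    """Simple linear regression: Y = w1*X1 + w0."""
    return w1 * x1 + w0

def predict_multiple(features, weights, bias):
    """Multiple linear regression: Y = w1*X1 + w2*X2 + ... + w0."""
    return sum(w * x for w, x in zip(weights, features)) + bias

# One feature: slope 2.0 and y-intercept 1.0
y_simple = predict_simple(3.0, w1=2.0, w0=1.0)

# Two features, each with its own weight, plus the same bias
y_multiple = predict_multiple([3.0, 4.0], weights=[2.0, 0.5], bias=1.0)

print(y_simple, y_multiple)
```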

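Training and loss

: Training a model means determining good weights and bias from labeled data

: In supervised learning, the machine learning algorithm examines many examples and builds a model that minimizes loss

: Loss is the penalty for a bad prediction (= a number indicating how wrong the model's prediction was)

: If the model's prediction is perfect the loss is 0; otherwise the loss is greater

: The goal of training is to find a set of weights and bias that has low loss, on average, across all examples

(from Google Machine Learning Crash Course)

Mean Square Error (MSE)

: Mean square error (MSE) is the average of the squared differences between the actual data points and the predictions

: Because the differences are squared, predictions farther from the actual value are penalized more heavily

: MSE is commonly used in machine learning, but it is not the best loss function for every situation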
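A short numeric sketch shows both properties of MSE: it is 0 for perfect predictions, and it grows with the square of the error. The values below are made up for illustration.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual values and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [1.0, 2.0, 3.0]

perfect = mean_squared_error(y_true, [1.0, 2.0, 3.0])     # every prediction is exact
small_miss = mean_squared_error(y_true, [1.0, 2.0, 4.0])  # last prediction is off by 1
large_miss = mean_squared_error(y_true, [1.0, 2.0, 6.0])  # last prediction is off by 3

print(perfect, small_miss, large_miss)
```

Tripling the size of the single error (from 1 to 3) multiplies the MSE by nine, which is the sense in which larger misses are penalized more heavily.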

#Reference

1) https://acadgild.com/blog/linear-regression

2) https://developers.google.com/machine-learning/crash-course

3) https://sebastianraschka.com/faq/docs/closed-form-vs-gd.html

4) http://jacob-yo.net/tag/udemy/

5) https://towardsdatascience.com/supervised-learning-basics-of-linear-regression-1cbab48d0eba

Machine Learning Linear Regression (ML, Linear regression) End


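Machine Learning Glossary Start

1. Feature and Label

a) Feature

-A feature is an input variable (the x variable in simple linear regression)

-A simple machine learning model uses a single feature

-A complex machine learning model can use millions of features

b) Label

-A label is the thing being predicted (the y variable in simple linear regression)

-The label is chosen according to what you want to know, such as presence of cancer or salary

2. Training and Inference

a) Training

-Training means creating, or learning, a model

*Model: a formula that defines the relationship between features and label

-The model gradually learns the relationship between features and label

b) Inference

-Inference means applying the trained model to unlabeled examples

-The trained model is used to predict labels

3. Regression and Classification

a) Regression model

-Predicts labels with continuous values

-Examples: temperature, body weight

b) Classification model

-Predicts labels with discrete values

-Examples: presence of cancer, race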

#Reference

1) https://www.coursera.org/learn/machine-learning

2) https://developers.google.com/machine-learning/crash-course

3) https://thenewstack.io/machine-learning-linear-regression-mere-mortals/

4) https://datawhatnow.com/pseudo-labeling-semi-supervised-learning/

5) https://towardsdatascience.com/regression-or-classification-linear-or-logistic-f093e8757b9c

Machine Learning Glossary End
