[TensorFlow] RNN-Based Text Classification

2019. 12. 13. 16:47


BioinformaticsAndMe

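[TensorFlow] Text Classification Using an RNN

: The process of training a text-classification neural network based on an RNN (recurrent neural network) in TensorFlow 2.0

*RNN (Recurrent Neural Network) - an artificial neural network architecture with recurrent connections, in which a neuron's output is fed back in as input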
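
The recurrence can be illustrated with a minimal sketch, assuming a single scalar hidden state and made-up weights (`w_x`, `w_h`, and `b` are illustrative values, not parameters of the model trained in this post). At each step the previous output `h_prev` is fed back in together with the new input.

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a toy scalar RNN: the new state mixes the input with the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell, carrying the state forward step by step."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states

# Only the first input is non-zero, yet later states stay non-zero
# because each step reuses the previous output.
states = run_rnn([1.0, 0.0, 0.0])
print(states)
```

The LSTM layer used later replaces this single `tanh` update with gated updates, but the idea of carrying state from one step to the next is the same.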

: Uses the movie review data from the IMDB Large Movie Review Dataset

a) Data consisting of review text and a label, where a positive review is marked 1 and a negative review 0

b) Contains 25,000 training examples and 25,000 testing examples

1. Loading Python libraries

: Load the Python libraries used for text classification

# '__future__' : makes Python 3 behavior available under Python 2
from __future__ import absolute_import, division, print_function, unicode_literals

# Import TensorFlow and the TensorFlow Datasets library
import tensorflow_datasets as tfds
import tensorflow as tf

# Import matplotlib for visualization
import matplotlib.pyplot as plt

def plot_graphs(history, string):
  # Plot a training metric and its validation counterpart for each epoch
  plt.plot(history.history[string])
  plt.plot(history.history['val_'+string])
  plt.xlabel("Epochs")
  plt.ylabel(string)
  plt.legend([string, 'val_'+string])
  plt.show()

2. Setting up the input pipeline

: Set up the input data and pipeline for training

# Download the dataset
dataset, info = tfds.load('imdb_reviews/subwords8k', with_info=True,
                          as_supervised=True)
train_dataset, test_dataset = dataset['train'], dataset['test']
Downloading and preparing dataset imdb_reviews (80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/subwords8k/0.1.0...
Dl Completed...
1/|/100% 1/1 [00:04<00:00, 4.94s/ url]
Dl Size...
80/|/100% 80/80 [00:04<00:00, 16.31 MiB/s]

Dataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/subwords8k/0.1.0. Subsequent calls will reuse this data.

# The dataset info holds the text encoder; print the vocabulary size (the number of subword tokens the encoder recognizes)
encoder = info.features['text'].encoder
print ('Vocabulary size: {}'.format(encoder.vocab_size))
Vocabulary size: 8185

# Encode 'Hello TensorFlow.' and decode it back
sample_string = 'Hello TensorFlow.'

encoded_string = encoder.encode(sample_string)
print ('Encoded string is {}'.format(encoded_string))

original_string = encoder.decode(encoded_string)
print ('The original string: "{}"'.format(original_string))
Encoded string is [4025, 222, 6307, 2327, 4043, 2120, 7975]
The original string: "Hello TensorFlow."

# Use an assert statement to check that decoding restores the original string
assert original_string == sample_string

# Inspect the subword behind each encoded index
for index in encoded_string:
  print ('{} ----> {}'.format(index, encoder.decode([index])))
4025 ----> Hell
222 ----> o 
6307 ----> Ten
2327 ----> sor
4043 ----> Fl
2120 ----> ow
7975 ----> .
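
The split into subwords can be mimicked with a toy sketch. It assumes a greedy longest-match rule and a vocabulary holding only the seven pieces and IDs printed above; `TOY_VOCAB`, `greedy_encode`, and `decode` are hypothetical names, and the real encoder has 8185 entries and more involved matching rules.

```python
# Toy vocabulary: only the pieces and IDs from the output above (an assumption
# made for illustration, not the encoder's full vocabulary).
TOY_VOCAB = {'Hell': 4025, 'o ': 222, 'Ten': 6307, 'sor': 2327,
             'Fl': 4043, 'ow': 2120, '.': 7975}

def greedy_encode(text, vocab):
    """Split text into the longest matching vocabulary pieces, left to right."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink it.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError('no vocabulary piece matches at position {}'.format(pos))
    return ids

def decode(ids, vocab):
    """Map IDs back to their pieces and concatenate them."""
    id_to_piece = {index: piece for piece, index in vocab.items()}
    return ''.join(id_to_piece[index] for index in ids)

toy_ids = greedy_encode('Hello TensorFlow.', TOY_VOCAB)
print(toy_ids)
print(decode(toy_ids, TOY_VOCAB))
```

Because every piece is a plain substring, concatenating the decoded pieces restores the original text, which is the property the assert above checks for the real encoder.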

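# Prepare the data for training
BUFFER_SIZE = 10000        # Buffer size: number of examples held in the shuffle buffer
BATCH_SIZE = 64            # Batch size: number of examples used in one training step

# padded_batch zero-pads the reviews in each batch to a common length
# (output_shapes is a deprecated property; tf.compat.v1.data.get_output_shapes(dataset) is its replacement)
train_dataset = train_dataset.shuffle(BUFFER_SIZE)
train_dataset = train_dataset.padded_batch(BATCH_SIZE, train_dataset.output_shapes)

test_dataset = test_dataset.padded_batch(BATCH_SIZE, test_dataset.output_shapes)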
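
What padded batching does can be sketched in plain Python, without TensorFlow. The helper `padded_batches` and the toy token lists are hypothetical. The sketch assumes each batch is padded with zeros to the length of its own longest sequence, which is what happens when the padded shape leaves the sequence dimension unspecified.

```python
def padded_batches(sequences, batch_size, pad_value=0):
    """Group sequences into batches, padding each batch to its own longest length."""
    batches = []
    for start in range(0, len(sequences), batch_size):
        group = sequences[start:start + batch_size]
        width = max(len(seq) for seq in group)
        batches.append([seq + [pad_value] * (width - len(seq)) for seq in group])
    return batches

# Four toy "reviews" of different lengths, already encoded as integers.
toy_reviews = [[5, 8, 2], [7], [3, 9], [4, 1, 6, 2, 8]]
for batch in padded_batches(toy_reviews, batch_size=2):
    print(batch)
```

The two batches end up with different widths (3 and 5), so short reviews are not padded to the length of the longest review in the whole dataset.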

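3. Building the model

: Build an RNN model that processes its input and output as sequences, passing state recurrently from one step to the next

# The Embedding layer performs word embedding, turning text tokens into numbers: it stores one vector per token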
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
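
As a sanity check on the size of this model, the trainable parameter count can be worked out by hand. The sketch below assumes the standard Keras parameterization (an LSTM has four gates, each with input weights, recurrent weights, and a bias) and that Bidirectional concatenates the forward and backward outputs, which is its default merge mode. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per unit.
    return input_dim * units + units

vocab_size = 8185                          # from encoder.vocab_size above
embedding = embedding_params(vocab_size, 64)
bi_lstm = 2 * lstm_params(64, 64)          # forward and backward LSTM
dense_hidden = dense_params(2 * 64, 64)    # input is the concatenated 128 values
dense_out = dense_params(64, 1)
total = embedding + bi_lstm + dense_hidden + dense_out

print(embedding, bi_lstm, dense_hidden, dense_out, total)
```

Most of the parameters sit in the embedding table, so the vocabulary size dominates the model size.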

# Compile the Keras model to configure the training process
model.compile(loss='binary_crossentropy',
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
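
The binary cross-entropy loss named in the compile step can be computed by hand for a few predictions. This is a minimal sketch of the textbook formula, with a hypothetical helper and a clipping constant `eps` chosen for illustration. It is not the Keras implementation itself.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over paired labels (0 or 1) and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Same labels, once with confident correct predictions and once with confident wrong ones.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

Confidently wrong predictions are penalized far more heavily than confidently right ones are rewarded, which is why the loss can rise even while accuracy stays roughly flat.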

4. Training the model

# model.fit trains the model and prints the loss and accuracy metrics as it runs
history = model.fit(train_dataset, epochs=10,
                    validation_data=test_dataset, 
                    validation_steps=30)
Epoch 1/10
391/391 [==============================] - 937s 2s/step - loss: 0.6472 - accuracy: 0.6030 - val_loss: 0.0000e+00 - val_accuracy: 0.0000e+00
Epoch 2/10
391/391 [==============================] - 994s 3s/step - loss: 0.3484 - accuracy: 0.8593 - val_loss: 0.3369 - val_accuracy: 0.8505
Epoch 3/10
391/391 [==============================] - 988s 3s/step - loss: 0.2492 - accuracy: 0.9074 - val_loss: 0.3119 - val_accuracy: 0.8698
Epoch 4/10
391/391 [==============================] - 986s 3s/step - loss: 0.2091 - accuracy: 0.9244 - val_loss: 0.3093 - val_accuracy: 0.8693
Epoch 5/10
391/391 [==============================] - 987s 3s/step - loss: 0.1793 - accuracy: 0.9371 - val_loss: 0.3262 - val_accuracy: 0.8693
Epoch 6/10
391/391 [==============================] - 988s 3s/step - loss: 0.1595 - accuracy: 0.9454 - val_loss: 0.3674 - val_accuracy: 0.8708
Epoch 7/10
391/391 [==============================] - 996s 3s/step - loss: 0.1450 - accuracy: 0.9516 - val_loss: 0.3684 - val_accuracy: 0.8672
Epoch 8/10
391/391 [==============================] - 987s 3s/step - loss: 0.1432 - accuracy: 0.9529 - val_loss: 0.4392 - val_accuracy: 0.7958
Epoch 9/10
391/391 [==============================] - 982s 3s/step - loss: 0.1345 - accuracy: 0.9560 - val_loss: 0.3832 - val_accuracy: 0.8542
Epoch 10/10
391/391 [==============================] - 990s 3s/step - loss: 0.1104 - accuracy: 0.9657 - val_loss: 0.4369 - val_accuracy: 0.8547
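
The step counts in this log follow from the dataset and batch sizes. The arithmetic below uses only numbers already given in this post (25,000 training examples, a batch size of 64, and validation_steps=30). The variable names are illustrative.

```python
import math

TRAIN_EXAMPLES = 25000
BATCH_SIZE = 64
VALIDATION_STEPS = 30

# Number of batches needed to see every training example once.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
# The final batch holds whatever is left over.
last_batch_size = TRAIN_EXAMPLES - (steps_per_epoch - 1) * BATCH_SIZE
# Examples used for the per-epoch validation metrics.
validation_examples = VALIDATION_STEPS * BATCH_SIZE

print(steps_per_epoch, last_batch_size, validation_examples)
```

So the `391/391` in each epoch line is the number of training batches, and the `val_loss` and `val_accuracy` figures are computed on 1,920 test reviews rather than on all 25,000.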

# Evaluate accuracy on the test set with model.evaluate
test_loss, test_acc = model.evaluate(test_dataset)

print('Test Loss: {}'.format(test_loss))
print('Test Accuracy: {}'.format(test_acc))
391/391 [==============================] - 192s 491ms/step - loss: 0.4446 - accuracy: 0.8577
Test Loss: 0.44456175211674115
Test Accuracy: 0.8576800227165222

# Padding - making all samples the same length so that they can be used as model input
# Usually zeros are appended to bring samples of different lengths to a common length
def pad_to_size(vec, size):
  zeros = [0] * (size - len(vec))
  vec.extend(zeros)
  return (vec)

# Prediction function (a prediction of 0.5 or higher means a positive review; below 0.5, a negative review)
def sample_predict(sentence, pad):
  # Encode the sentence passed in as an argument
  encoded_sample_pred_text = encoder.encode(sentence)

  if pad:
    encoded_sample_pred_text = pad_to_size(encoded_sample_pred_text, 64)
  encoded_sample_pred_text = tf.cast(encoded_sample_pred_text, tf.float32)
  predictions = model.predict(tf.expand_dims(encoded_sample_pred_text, 0))

  return (predictions)

# Classify the sample text without padding
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=False)
print (predictions)
[[0.5275577]]        #Positive

# Classify the sample text with padding
sample_pred_text = ('The movie was cool. The animation and the graphics '
                    'were out of this world. I would recommend this movie.')
predictions = sample_predict(sample_pred_text, pad=True)
print (predictions)
[[0.57966185]]        #Positive
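
Turning these probabilities into labels is a simple threshold. The sketch below hard-codes the two values printed above and uses a hypothetical helper `to_label`. It assumes the 0.5 cut-off described for the prediction function.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a sentiment label using the given cut-off."""
    return 'positive' if probability >= threshold else 'negative'

# model.predict returns a nested array of shape (1, 1), so index twice.
unpadded_predictions = [[0.5275577]]
padded_predictions = [[0.57966185]]

unpadded_label = to_label(unpadded_predictions[0][0])
padded_label = to_label(padded_predictions[0][0])
print(unpadded_label, padded_label)
```

Both values sit only slightly above 0.5, so the model labels this review positive but with low confidence.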

# Plot the model's accuracy
plot_graphs(history, 'accuracy')

# Plot the model's loss
plot_graphs(history, 'loss')

#Reference

1) https://www.tensorflow.org/tutorials/text/text_classification_rnn

2) https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_classification_rnn.ipynb#scrollTo=hw86wWS4YgR2

3) https://wikidocs.net/24586

4) https://towardsdatascience.com/implementation-of-rnn-lstm-and-gru-a4250bf6c090

5) https://wikidocs.net/32105
