Bag-of-Words

Posted on Mar 29, 2023

Steps

Split each sentence into words.
- Suppose we have the following sentences.
- “John really really loves this movie”, “Jane really likes this song”
- Extracting the unique words from the two sentences gives the following vocabulary.
- {“John” “really” “loves” “this” “movie” “Jane” “likes” “song”}
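Convert each word into a one-hot vector.
- For example, John becomes [1 0 0 0 0 0 0 0] and Jane becomes [0 0 0 0 0 1 0 0].
- Any two distinct words are then at Euclidean distance $\sqrt 2$ with cosine similarity 0, so every pair of words is equally dissimilar.
- Summing the one-hot vectors of the words in each sentence turns the earlier sentences into the following vectors.
  - Sentence 1: [1 2 1 1 1 0 0 0]
  - Sentence 2: [0 1 0 1 0 1 1 1]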
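
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: whitespace tokenization, vocabulary order by first appearance, and the helper names `one_hot`, `bag_of_words`, `euclidean`, and `cosine` are all assumptions made for this example.

```python
import math

sentences = [
    "John really really loves this movie",
    "Jane really likes this song",
]

# Build the vocabulary in order of first appearance.
# Splitting on whitespace is an assumption of this sketch.
vocab = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocab:
            vocab.append(word)

index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a vector with a single 1 at the word's vocabulary index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec


def bag_of_words(sentence):
    """Sum the one-hot vectors of every word in the sentence."""
    vec = [0] * len(vocab)
    for word in sentence.split():
        vec[index[word]] += 1
    return vec


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


john, jane = one_hot("John"), one_hot("Jane")
bow = [bag_of_words(s) for s in sentences]

print(vocab)
print(john, jane)
print(euclidean(john, jane), cosine(john, jane))
print(bow)
```

Because each bag-of-words vector is just a sum of one-hot vectors, word order is discarded and only the counts survive.
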
Let us use this representation for Naive Bayes classification.

$$ c_{MAP}=\argmax_{c\in C}P(c|d) $$

$$ =\argmax_{c\in C}\frac{P(d|c)P(c)}{P(d)} $$

$$ =\argmax_{c\in C}P(d|c)P(c) $$

$$ P(d|c)P(c)=P(w_1, w_2, \ldots, w_n|c)P(c) \rightarrow P(c)\prod_{w_i\in W}P(w_i|c) $$
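
The arrow marks the naive assumption that the words are conditionally independent given the class, which is what lets the joint likelihood factor into a product. One common way to estimate the two factors is by relative frequency without smoothing; this is an assumption here, as are the symbols $N_c$, $N$, $\mathrm{count}$, and $V$, none of which are defined above:

$$ \hat P(c)=\frac{N_c}{N}, \qquad \hat P(w_i|c)=\frac{\mathrm{count}(w_i, c)}{\sum_{w\in V}\mathrm{count}(w, c)} $$

Here $N_c$ is the number of training documents in class $c$, $N$ is the total number of training documents, $\mathrm{count}(w, c)$ is how many times word $w$ occurs in the documents of class $c$, and $V$ is the vocabulary.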

Suppose we have the following documents.

| Split | Doc ID | Document | Class |
|---|---|---|---|
| Train | 1 | Image recognition uses convolutional neural networks | CV |
| Train | 2 | Transformer can be used for image classification task | CV |
| Train | 3 | Language modeling uses transformer | NLP |
| Train | 4 | Document classification task is language task | NLP |
| Test | 5 | Classification task uses transformer | ? |

$$ P(c_{CV}|d_5)\propto\frac12 \times \frac1{14} \times \frac1{14} \times\frac1{14} \times \frac1{14} $$

$$ P(c_{NLP}|d_5)\propto\frac12 \times \frac1{10} \times \frac2{10} \times\frac1{10} \times \frac1{10} $$
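
The two scores can be checked with a short Python sketch. It assumes lowercased, whitespace-split tokens and unsmoothed relative-frequency estimates, which reproduce the fractions above; the names `tokenize` and `score` are hypothetical helpers introduced only for this example.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Image recognition uses convolutional neural networks", "CV"),
    ("Transformer can be used for image classification task", "CV"),
    ("Language modeling uses transformer", "NLP"),
    ("Document classification task is language task", "NLP"),
]
test_doc = "Classification task uses transformer"


def tokenize(text):
    # Lowercasing and whitespace splitting are assumptions of this sketch.
    return text.lower().split()


# Count documents per class (for the prior) and words per class (for the likelihood).
doc_counts = Counter(label for _, label in train)
word_counts = {label: Counter() for label in doc_counts}
for text, label in train:
    word_counts[label].update(tokenize(text))


def score(text, label):
    """Return P(c) * prod P(w_i | c) with unsmoothed relative frequencies."""
    total = sum(word_counts[label].values())
    result = Fraction(doc_counts[label], len(train))
    for word in tokenize(text):
        result *= Fraction(word_counts[label][word], total)
    return result


scores = {label: score(test_doc, label) for label in doc_counts}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(label, value, float(value))
print("predicted class:", prediction)
```

Under these assumptions the NLP score (1/10000) is larger than the CV score (1/76832), so document 5 would be assigned to NLP. Without smoothing, a test word that never occurs in a class drives that class's score to zero.
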