기댓값 최대화 알고리즘 문서 원본 보기

{{위키데이터 속성 추적}}
{{머신 러닝 막대}}
[[파일:EM Clustering of Old Faithful data.gif|섬네일|[[올드페이스풀 간헐천]] 분화 데이터. 데이터가 정규분포 [[혼합 모델]]에서 관측되었다고 가정하고, 기댓값 최대화 [[클러스터 분석]]을 이용해 매개변수들을 추정하는 과정이다. 기댓값 최대화 과정이 반복됨에 따라 매개변수들이 수렴하고 두 개의 군집이 나타나는 것을 볼 수 있다.]]
'''기댓값 최대화 알고리즘'''({{lang|en|expectation-maximization algorithm}}, 약자 EM 알고리즘)은 관측되지 않는 잠재변수에 의존하는 확률 모델에서 [[최대가능도]]({{lang|en|maximum likelihood}})나 [[최대사후확률]]({{lang|en|maximum a posteriori}}, 약자 MAP)을 갖는 모수의 추정값을 찾는 반복적인 알고리즘이다.
EM 알고리즘은 모수에 관한 추정값으로 [[가능도#정의|로그가능도]]({{lang|en|log likelihood}})의 기댓값을 계산하는 기댓값 (E) 단계와 이 기댓값을 최대화하는 모수 추정값들을 구하는 최대화 (M) 단계를 번갈아가면서 적용한다.
최대화 단계에서 계산한 변수값은 다음 기댓값 단계의 추정값으로 쓰인다.

== 역사 ==
기댓값 최대화 알고리즘은 1977년 아서 뎀스터, 낸 라드와 도널드 루빈에 의해 제시되었다. 하지만 특수한 경우에서의 기댓값 최대화 알고리즘은 이미 몇몇 저자들에 의해 연구되고 있었고, 특히 롤프 선드버그는 페르 마틴-뢰프, 앤더스 마틴-뢰트와 함께 [[지수족]]에 한정된 기댓값 최대화 알고리즘에 관한 연구를 한 바 있다. 1977년 뎀스터-라드-루빈 논문은 이 방법을 일반화하여 [[지수족]]뿐만 아니라 다양한 확률분포에 한해서도 알고리즘이 수렴함을 밝혔다. 왕립통계협회저널에 게재된 이 논문은 그 혁신성으로 인해 롤프 선드버그를 비롯한 여럿에게 극찬을 받았으며, 기댓값 최대화 알고리즘을 통계분석의 중요한 도구로 자리매김하게 하였다.

이후 뎀스터-라드-루빈 논문의 알고리즘 수렴성 분석에 오류가 발견되었고, 1983년 C.F. 제프 우에 의해 옳은 증명이 제시되었다. 제프 우의 증명은 기댓값 최대화 알고리즘이 [[지수족]]뿐만 아니라 다른 확률분포에도 성립함을 보였다.

== 서론 ==
기댓값 최대화 알고리즘은 통계 모델의 수식을 정확히 풀 수 없을 때 [[최대가능도]]를 구하는 데 사용된다. 이런 통계 모델은 대개 잠재변수와 관측할 수 없는 모수, 관측된 데이터로 구성되어 있고, 몇몇 관측값들이 누락되어 있거나 관측되지 않은 데이터가 존재한다고 가정할 수도 있다. 예를 들면, [[혼합 모델]]에서 매 관측값이 그에 상응하는 잠재변수 혹은 관측되지 않은 값을 가지고 있다고 가정할 수 있다.

[[최대가능도]]는 관측되지 않은 모든 변수 각각에 대하여 [[가능도#정의|로그가능도]]를 미분한 식을 풀어서 구할 수 있다. 하지만 잠재변수를 포함하는 통계 모델의 수식들은 매우 복잡하여 이와 같은 방법으로는 풀 수 없다. 각각의 변수에 관해 미분한 식은 서로 얽혀있어 한 수식의 해를 구하기 위해서는 다른 식의 해를 구해야 하는데, 이 수식의 해 또한 이전 식의 변수를 포함하기 때문이다. 따라서 이 경우 정확한 해를 계산하는 것은 불가능하다.

하지만 기댓값 최대화 알고리즘을 이용하면 이 문제의 해를 수치적으로 계산할 수 있다. 우선 서로 얽혀있는 두 식과 각각의 변수들을 가정해보자. 두 식은 얽혀있기 때문에 한 식의 변수는 다른 식의 변수일 수 있다. 첫번째 식의 변수값을 임의로 잡은 후 이 변수값으로 두번째 식의 변수값을 추정한다. 그렇게 구한 변수값으로 다시 첫번째 식의 변수값을 추정하고 이를 반복한다. 특정한 상황에서 충분히 많은 반복을 거치면 각각의 변수들이 일정한 값으로 수렴한다는 것을 보일 수 있다. 또한, 이 값에서 [[가능도#정의|로그가능도]]의 미분은 0이 됨을 보일 수 있고, 이는 이 값이 [[극점]]이거나 [[안장점]]임을 의미한다. 일반적으로 [[가능도#정의|로그가능도]]는 여러 개의 [[극점]]을 가질 수 있으므로 기댓값 최대화 알고리즘이 반드시 [[최댓값]]을 구할 수는 없다. 어떤 가능도는 [[특이점]], 즉 무의미한 극점을 갖기도 한다. 예를 들어, [[혼합 모델]]에서 한 성분의 평균 매개변수를 관측된 하나의 데이터로 놓고 분산 변수를 0으로 설정할 때 발생하는 특이점이 기댓값 최대화 알고리즘으로 구해지는 해가 될 수 있다.

== 방식 ==
관측할 수 있는 확률변수 <math>\mathbf{X}</math>와 관측할 수 없는 확률변수 <math>\mathbf{Z}</math>, 그리고 모수 <math>\boldsymbol{\theta}</math>가 있을 때, <math>(\mathbf{X}, \mathbf{Z})</math>에 대한 확률분포는 <math>L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})</math>로 주어져 있다. 이 때 [[최대가능도]]를 계산하고 싶은 경우, 최대화하려는 [[가능도]] 함수는 <math>L(\boldsymbol{\theta}; \mathbf{X}) = p(\mathbf{X} |\boldsymbol{\theta}) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol{\theta})</math>로 정의할 수 있다.

이 수식의 계산비용은 잠재변수 <math>\mathbf{Z}</math>가 취할 수 있는 값의 수에 비례하고, <math>\mathbf{Z}</math>의 차원이 증가할수록 취할 수 있는 값의 수는 지수적으로 증가하기 때문에 이 수식을 계산하기는 매우 어렵다.

기댓값 최대화 알고리즘은 다음 두 단계를 반복하여 가능도 함수 <math>L(\boldsymbol{\theta}; \mathbf{X})</math>의 [[최대가능도]]를 구한다.

'''기댓값 (E) 단계'''에서는 <math>\boldsymbol{\theta}^{(t)}</math>가 주어지고 새로운 <math>\boldsymbol{\theta}</math>를 사용할 때 가능도의 [[기댓값]] <math>Q</math>를 정의한다. 이 때 기댓값을 취하는 확률분포는 <math>\mathbf{X}</math>, <math>\boldsymbol{\theta}^{(t)}</math>가 주어졌을 때 <math>\mathbf{Z}</math>의 조건부 분포이다.
:<math>Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z})  \right] = \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X}, \boldsymbol{\theta}^{(t)}) \log L(\boldsymbol\theta;\mathbf{X},\mathbf{Z})</math>

'''최대화 (M) 단계'''에서는 <math>Q</math>를 최대화하는 새로운 모수 <math>\boldsymbol\theta^{(t+1)}</math>를 계산한다.
:<math>\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}} \ Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \, </math>

기댓값 최대화 알고리즘은 다양한 모델에 적용 가능하지만, 적용되는 모델은 일반적으로 다음 성질을 만족한다.

# 관측값 <math>\mathbf{X}</math>는 이산적일 수도 [[연속]]적일 수도 있다.
# 누락값/잠재변수 <math>\mathbf{Z}</math>는 이산적이고, 각각의 관측값은 그에 대응하는 하나의 잠재변수를 가진다.
# 모수들은 연속적이고 두 종류로 나뉘는데, 이는 관측값들과 관련한 모수 혹은 잠재변수 각각의 값에 대응하는 모수이다.

모수 <math>\boldsymbol{\theta}</math>가 주어졌을 때, 잠재변수 <math>\mathbf{Z}</math>에 관한 가능도 함수를 최대화함으로써 <math>\mathbf{Z}</math>의 값을 어렵지 않게 추정할 수 있고, 반대로 잠재변수 <math>\mathbf{Z}</math>의 값이 주어졌을 때, 모수에 관한 가능도 함수를 최대화함으로써 모수 <math>\boldsymbol{\theta}</math>의 값을 추정할 수 있다. 모수와 잠재변수 중 하나의 값을 알면 다른 값도 쉽게 알 수 있다는 것이 기댓값 최대화 알고리즘의 핵심이다. 다음은 기댓값 최대화 알고리즘을 간략히 요약한 것이다.

# 모수 <math>\boldsymbol{\theta}</math>를 임의의 값으로 설정한다.
# 주어진 모수 값에 관한 잠재변수 <math>\mathbf{Z}</math>값을 추정한다.
# 2단계에서 얻은 잠재변수 값을 이용해 모수 <math>\boldsymbol{\theta}</math>값을 다시 추정한다.
# 모수 <math>\boldsymbol{\theta}</math>값과 잠재변수 <math>\mathbf{Z}</math>값이 수렴할 때까지 2, 3단계를 반복한다.

이 알고리즘은 [[비용함수]]의 극대점으로 단조 수렴한다.

기댓값 최대화 알고리즘은 극대점을 찾기 때문에, 구한 값이 실제 최대점은 아닐 수 있다. 이를 피하기 위해 여러 개의 임의 초기값에 대해 계산하는 기법이나 [[담금질 기법]] 등이 같이 사용될 수 있다.

== 성질 ==
비록 EM을 반복하는 것이 관측된 데이터(한계) 기능도 함수를 증가시키지만 그 순열이 [[최대가능도방법|최대가능도]]로 수렴한다는 증거는 없다. [[다항 분포]]에서는, EM 알고리즘이 초기값에 따라 관측된 데이터 가능도 함수의 [[극대값]]으로 수렴할 수도 있다. [[경사 하강법]](서로 다른 초기 추정량 <math>\boldsymbol\theta^{(t)}</math>에서 시작하는 방법)이나 [[담금질 기법]] 등 수 많은 체험적이거나 메타 발견적 학습 방법이 있다.

EM은 가능도가 [[지수족]]일 때 특히 유용하다: E 단계는 충분 통계량({{lang|en|Sufficient statistic}})의 기대값들의 합이 되며, M 단계는 선형 함수를 최대화하는 것을 포함한다. 이런 경우에는, 각각의 단계에서 선드버그 공식을 이용하여 닫힌 해를 도출하는 것이 가능하다.<ref name="Sundberg1974">{{저널 인용
 |last=Sundberg |first=Rolf
 |title=Maximum likelihood theory for incomplete data from an exponential family
 |journal=Scandinavian Journal of Statistics
 |volume=1 |year=1974 |issue=2 |pages=49–58
 |jstor=4615553 | mr = 381110
}}</ref><ref name="Sundberg1971">
Rolf Sundberg. 1971. ''Maximum likelihood theory and applications for distributions generated when observing a function of an exponential family variable''. Dissertation, Institute for Mathematical Statistics, Stockholm University.</ref><ref name="Sundberg1976">{{저널 인용
 |last=Sundberg |first=Rolf
 |year=1976
 |title=An iterative method for solution of the likelihood equations for incomplete data from exponential families
 |journal=Communications in Statistics – Simulation and Computation
 |volume=5 |issue=1 |pages=55–64
 |doi=10.1080/03610917608812007 |mr=443190
}}</ref><ref>G. Kulldorff. 1961.'' Contributions to the theory of estimation from grouped and partially grouped samples''. Almqvist & Wiksell.</ref><ref name="Martin-Löf1963">Anders Martin-Löf. 1963. "Utvärdering av livslängder i subnanosekundsområdet" ("Evaluation of sub-nanosecond lifetimes"). ("Sundberg formula")</ref><ref name="Martin-Löf1966">
Per Martin-Löf. 1966. ''Statistics from the point of view of statistical mechanics''. Lecture notes, Mathematical Institute, Aarhus University. ("Sundberg formula" credited to Anders Martin-Löf).</ref><ref name="Martin-Löf1970">Per Martin-Löf. 1970. ''Statistika Modeller (Statistical Models): Anteckningar från seminarier läsåret 1969–1970 (Notes from seminars in the academic year 1969-1970), with the assistance of Rolf Sundberg.'' Stockholm University. ("Sundberg formula")</ref><!-- * Martin-Löf, P. "Exact tests, confidence regions and estimates", with a discussion by [[A. W. F. Edwards]], [[George A. Barnard|G. A. Barnard]], D. A. Sprott, O. Barndorff-Nielsen, [[D. Basu]] and [[Rasch model|G. Rasch]]. ''Proceedings of Conference on Foundational Questions in Statistical Inference'' (Aarhus, 1973), pp. 121–138. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974. --><ref name="Martin-Löf1974a">
Martin-Löf, P. The notion of redundancy and its use as a quantitative measure of the deviation between a statistical hypothesis and a set of observational data. With a discussion by F. Abildgård, Arthur P. Dempster|A. P. Dempster, D. Basu, D. R. Cox, A. W. F. Edwards, D. A. Sprott, George A. Barnard|G. A. Barnard, O. Barndorff-Nielsen, J. D. Kalbfleisch and Rasch model|G. Rasch and a reply by the author. ''Proceedings of Conference on Foundational Questions in Statistical Inference'' (Aarhus, 1973), pp. 1–42. Memoirs, No. 1, Dept. Theoret. Statist., Inst. Math., Univ. Aarhus, Aarhus, 1974.</ref><ref name="Martin-Löf1974b">
Martin-Löf, Per The notion of redundancy and its use as a quantitative measure of the discrepancy between a statistical hypothesis and a set of observational data. ''Scand. J. Statist.'' 1 (1974), no. 1, 3–18.</ref>

EM 방법은 뎀스터-라드-루빈 논문에서 [[베이즈 추론]]에서 [[최대 사후 확률]]을 계산하기 위해 변형되었다.

[[경사 하강법]], [[켤레기울기법]], 또는 가우스-뉴턴 방법의 변형 등과 같은 [[최대가능도방법|최대가능도]]를 찾는 여러 방법들이 존재한다. EM과 달리, 이러한 방법들은 일반적으로 가능도 함수의 첫 번째와 두 번째 미분의 계산을 필요로 한다.

== 올바름 증명 ==

기댓값 최대화 알고리즘은 <math>\log p(\mathbf{X}|\boldsymbol\theta)</math>을 직접적으로 향상시키기 보다 <math>Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})</math>를 향상시키는 일이다.
여기에서 증명하는 것은 <math>Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})</math>를 향상시키는 <math>\boldsymbol\theta</math>를 찾으면 <math>\log p(\mathbf{X}|\boldsymbol\theta)</math>가 향상된다는 것이다.<ref name="Little1987">{{서적 인용| last1 = Little | first1 = Roderick J.A. | author1-link = | last2 = Rubin | first2 = Donald B. | title = Statistical Analysis with Missing Data | url = https://archive.org/details/statisticalanaly00litt | series = Wiley Series in Probability and Mathematical Statistics | year = 1987 | publisher = John Wiley & Sons | location = New York | isbn = 0-471-80254-9 | pages = [https://archive.org/details/statisticalanaly00litt/page/n145 134]–136}}</ref>

확률이 0이 아닌 <math>p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta)</math>를 만족하는 어떤 <math>\mathbf{Z}</math>에 대해서, 다음과 같이 이야기할 수 있다:
::<math>
\log p(\mathbf{X}|\boldsymbol\theta) = \log p(\mathbf{X},\mathbf{Z}|\boldsymbol\theta) - \log p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta) \,.
</math>
<math>p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)})</math>를 <math>\mathbf{Z}</math>에 대해서 합한 값(혹은 적분한 값)을 양변에 곱할 수 있다. 좌변은 상수인 기댓값이므로:
::<math>
\begin{align}
\log p(\mathbf{X}|\boldsymbol\theta) &
= \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{X},\mathbf{Z}|\boldsymbol\theta)
- \sum_{\mathbf{Z}} p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}) \log p(\mathbf{Z}|\mathbf{X},\boldsymbol\theta) \\
& = Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \,,
\end{align}
</math>
여기에서, <math>H(\boldsymbol\theta|\boldsymbol\theta^{(t)})</math>는 음수의 합(negated sum)으로 정의된다.
마지막 식은 <math>\boldsymbol\theta = \boldsymbol\theta^{(t)}</math>인 경우를 포함해서 모든 <math>\boldsymbol\theta</math>에 대해서도 성립한다.
::<math>
\log p(\mathbf{X}|\boldsymbol\theta^{(t)})
= Q(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)}) + H(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)}) \,,
</math>
이전의 식으로부터 이 마지막 식을 빼면,
::<math>
\log p(\mathbf{X}|\boldsymbol\theta) - \log p(\mathbf{X}|\boldsymbol\theta^{(t)})
= Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)})
 + H(\boldsymbol\theta|\boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)}) \,,
</math>
이다.
하지만, 깁스 [[부등식]](Gibbs' inequality)에 의해서 <math>H(\boldsymbol\theta|\boldsymbol\theta^{(t)}) \ge H(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)})</math>이다.
그러므로, 우리는 다음 결론에 도달할 수 있다:
::<math>
\log p(\mathbf{X}|\boldsymbol\theta) - \log p(\mathbf{X}|\boldsymbol\theta^{(t)})
\ge Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)}) \,.
</math>
그러므로, <math>Q(\boldsymbol\theta^{(t)}|\boldsymbol\theta^{(t)})</math>에 비해서 <math>Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})</math>가 커지도록 <math>\boldsymbol\theta</math>을 선택하면, 최소한 그 이상으로 <math>\log p(\mathbf{X}|\boldsymbol\theta^{(t)})</math>에 비해서 <math>\log p(\mathbf{X}|\boldsymbol\theta)</math>가 향상될 것이다.

== 대체 설명 ==
상황에 따라서는, 기댓값 최대화 알고리즘을 다음의 두 최대화 단계로 보는 것이 편리하다.<ref name="neal1999">{{저널 인용|last1=Neal |first=Radford |last2=Hinton |first2=Geoffrey |title=A view of the EM algorithm that justifies incremental, sparse, and other variants |journal=Learning in Graphical Models |editor=Michael I. Jordan |pages= 355–368 | publisher= MIT Press |location=Cambridge, MA |year=1999 |isbn=0-262-60032-3 |url=ftp://ftp.cs.toronto.edu/pub/radford/emk.pdf |accessdate=2009-03-22}}</ref><ref name="hastie2001">{{서적 인용|last1=Hastie|first1=Trevor|last2=Tibshirani|first2=Robert|last3=Friedman|first3=Jerome |year=2001 |title=The Elements of Statistical Learning |url=https://archive.org/details/elementsstatisti00hast_237|isbn=0-387-95284-5 |publisher=Springer |location=New York |chapter=8.5 The EM algorithm |pages=[https://archive.org/details/elementsstatisti00hast_237/page/n229 236]–243}}</ref> ''q''는 미관측 데이터 ''z''를 나타내는 임의의 확률 분포, ''p''<sub>''Z''|''X''</sub>(· |''x'';''θ'')는 관측 데이터 ''x''가 주어졌을 때의 조건부 확률 분포, ''H''를 [[엔트로피]], 그리고 ''D''<sub>KL</sub>를 [[쿨백-라이블러 발산]]이라고 할 때, 다음 식을 고려한다.
:<math>F(q,\theta) = \operatorname{E}_q [ \log L (\theta ; x,Z) ] + H(q) = -D_{\mathrm{KL}}\big(q \big\| p_{Z|X}(\cdot|x;\theta ) \big) + \log L(\theta;x) </math>

그러면 기댓값 최대화에서의 단계는 다음과 같이 나타낼 수 있다:
:'''기댓값 단계''': ''F''를 최대화하는 ''q''를 찾는다 :
::<math> q^{(t)} = \underset{q}\operatorname{arg\,max} \ F(q,\theta^{(t)}) </math>
:'''최대화 단계''': ''F''를 최대화하는 ''θ''를 찾는다:
::<math> \theta^{(t+1)} = \underset{\theta}\operatorname{arg\,max} \ F(q^{(t)},\theta) </math>

== 적용 ==
기댓값 최대화 알고리즘은 [[기계 학습]]과 [[컴퓨터 비전]]의 [[데이터 클러스터링]]에 자주 사용된다. [[자연 언어 처리]] 분야에는 확률적 [[문맥 자유 문법]]({{lang|en|probabilistic context-free grammar}})의 자율 유도를 위한 두 가지 중요한 알고리즘의 예로 바움-웰치 알고리즘({{lang|en|Baum-Welch algorithm}})과 안밖 알고리즘({{lang|en|inside-outside algorithm}})이 있다.

[[심리측정학]]에서 기댓값 최대화 알고리즘은 [[문항 반응 이론]] 모델의 아이템 매개변수와 잠재 능력의 추정을 위해 거의 없어서는 안된다.

기댓값 최대화 알고리즘은 누락된 데이터를 처리하고 미확인 변수를 관찰하는 능력이 있어 포트폴리오 리스크({{lang|en|portfolio risk}})의 가격을 정하고 관리하는 유용한 도구가 되고 있다.

기댓값 최대화 알고리즘은 또한 [[의학 영상]] 복원, 특히 [[양전자 방출 단층촬영]]과 단일광자방출단층활영({{lang|en|single photon emission computed tomography}})에 널리 쓰인다.

== 기댓값 최대화 알고리즘의 필터링과 평활화 ==
[[칼만 필터]]는 일반적으로 온라인 상태 추정에 사용되고 최소 분산 평활화는 오프라인 또는 배치({{lang|en|batch}}) 상태 추정에 사용된다. 그러나 이러한 최소 분산 해법들을 위해서는 상태공간 모델의 매개변수 추정이 필요하다. 기댓값 최대화 알고리즘은 절리 상태({{lang|en|joint state}})와 모수 추정 문제를 푸는데 사용될 수 있다.

기댓값 최대화 알고리즘의 필터링과 평활화는 아래의 두 단계를 반복함으로써 일어난다.

;기댓값 단계(E step)
: 현재 모수 추정으로 설계된 칼만 필터 또는 최소 분산 평활화를 적용하여 갱신된 상태 추정을 얻는다.

;최대화 단계(M step)
: 최대가능도 계산과 필터 또는 평활화 된 상태 추정을 사용해서 갱신된 매개변수 추정을 얻는다.

[[칼만 필터]] 또는 최소 분산 평활화가 하나의 입력을 받고 하나의 출력을 하는 시스템에서 잡음 섞인 측정을 한다고 가정하자. [[최대가능도]] 계산으로부터 갱신된 측정 잡음 분산 추정을 얻을 수 있다.
:<math>\hat{\sigma}^{2}_v = \frac{1}{N} \sum_{k=1}^N {(z_k-\hat{x}_{k})}^{2}</math>

이 때 <math>\hat{x}_k</math>는 N 스칼라 측정인 <math>{z_k}</math>으로부터 필터 또는 평활화에 의해 계산된 스칼라 출력 추정이다. 비슷하게 1차 자기회귀 과정을 위해, 갱신된 과정 잡음 분산 추정이 다음과 같이 계산된다.
:<math>\hat{\sigma}^{2}_w =   \frac{1}{N} \sum_{k=1}^N {(\hat{x}_{k+1}-\hat{F}\hat{{x}}_{k})}^{2}</math>

이 때 <math>\hat{x}_k</math>와 <math>\hat{x}_{k+1}</math>가 필터 또는 평활화에 의해 계산된 스칼라 상태 추정들이다. 갱신된 모델 계수 추정은 다음을 통해서 얻어진다.
:<math>\hat{F} = \frac{\sum_{k=1}^N (\hat{x}_{k+1}-\hat{F} \hat{x}_k)}{\sum_{k=1}^N \hat{x}_k^{2}} </math>.

위와 같은 모수 추정의 수렴에 대해서 잘 연구되어 있다.<ref>{{저널 인용
|last = Einicke
|first =  G.A.
|last2 = Malos
|first2 = J.T.
|last3 = Reid
|first3 =  D.C.
|last4 = Hainsworth
|first4 = D.W.
|title = Riccati Equation and EM Algorithm Convergence for Inertial Navigation Alignment
|journal = IEEE Trans. Signal Processing
|volume = 57
|issue = 1
|pages = 370–375
|date=January 2009
|postscript = <!--None-->
|doi = 10.1109/TSP.2008.2007090
}}</ref><ref>{{저널 인용
|last = Einicke
|first =  G.A.
|last2 = Falco
|first2 =  G.
|last3 = Malos
|first3 = J.T.
|title = EM Algorithm State Matrix Estimation for Navigation
|journal = IEEE Signal Processing Letters
|volume = 17
|issue = 5
|pages = 437–440
|date=May 2010
|postscript = <!--None-->
|doi = 10.1109/LSP.2010.2043151
|bibcode = 2010ISPL...17..437E }}</ref><ref>{{저널 인용
|last = Einicke
|first =  G.A.
|last2 = Falco
|first2 =  G.
|last3 = Dunn
|first3 = M.T.
|last4 = Reid
|first4 = D.C.
|title = Iterative Smoother-Based Variance Estimation
|journal = IEEE Signal Processing Letters
|volume = 19
|issue = 5
|pages = 275–278
|date=May 2012
|postscript = <!--None-->
|bibcode = 2012ISPL...19..275E
|doi = 10.1109/LSP.2012.2190278
}}</ref>

== 변종 ==
기댓값 최대화 알고리즘의 때때로 느린 수렴을 가속화 하기 위해서 몇 가지 방법이 제시되어 왔다. [[켤레기울기법]]을 사용하는 것과 수정된 [[뉴턴의 방법]] 기법들이 그 예다.<ref>{{저널 인용|first1=Mortaza |last1=Jamshidian |first2=Robert I. |last2=Jennrich|title=Acceleration of the EM Algorithm by using Quasi-Newton Methods |year=1997 |journal=Journal of the Royal Statistical Society, Series B |volume=59 |issue=2 |pages=569–587 |doi=10.1111/1467-9868.00083 |mr=1452026 }}</ref> 추가적으로 기댓값 최대화 알고리즘은 속박 추정 기법들과 함께 쓰인다.

'''조건부 기댓값 최대화({{lang|en|Expectation conditional maximization}}, 약자 ECM)'''는 각각의 M 단계를 각각의 매개변수 <math>\theta_i</math>가 개별적으로 최대화 되고 조건부로 다른 매개변수들은 고정되는 조건부 최대화 (CM) 단계의 배열로 대체하였다.<ref>{{저널 인용|last1=Meng |first1=Xiao-Li |last2=Rubin |first2=Donald B. |title=Maximum likelihood estimation via the ECM algorithm: A general framework |url=https://archive.org/details/sim_biometrika_1993-06_80_2/page/n16 |year=1993 |journal=Biometrika |volume=80 |issue=2 |pages=267–278 |doi=10.1093/biomet/80.2.267 |mr=1243503}}</ref>

이 아이디어는 후에 [[#대체 설명|대체 설명]]에서 E 단계와 M 단계 모두에 대해서 손실 함수 ''F''의 증가를 찾는 '''일반화된 기댓값 최대화({{lang|en|generalized expectation maximization}}, 약자 GEM)''' 알고리즘으로 확장된다.<ref name="neal1999"/>

기댓값 최대화 알고리즘을 '''MM''' (Majorize/Minimize 또는 Minorize/Maximize, 문맥에 따라서) 알고리즘<ref>Hunter DR and Lange K (2004), [http://www.stat.psu.edu/~dhunter/papers/mmtutorial.pdf A Tutorial on MM Algorithms] {{웹아카이브|url=https://web.archive.org/web/20110720080204/http://www.stat.psu.edu/~dhunter/papers/mmtutorial.pdf}}, The American Statistician, 58: 30-37</ref>의 하위 분류로 간주하는 것도 또한 가능하고 그러므로 더욱 일반적인 경우에 대해 개발된 어떤 기구를 사용하는 것이 가능하다.

=== α-EM 알고리즘 ===
기댓값 최대화 알고리즘에 사용되는 Q-함수는 로그가능도를 기반으로 하고 있다. 그러므로, log-EM 알고리즘으로 간주된다. 로그가능도의 사용은 α-로그가능도 비율로 일반화 될 수 있다. 그러면, 관찰된 데이터의 α-로그가능도 비율은 그것의 Q-함수와 α-발산을 사용한 등식으로 정확하게 표현된다. 이 Q-함수를 얻는 것은 일반화 된 E 단계이다. 그것의 최대화는 일반화된 M 단계이다. 이 두 단계를 α-EM 알고리즘
<ref>
{{저널 인용
|last=Matsuyama |first=Yasuo
|title=The α-EM algorithm: Surrogate likelihood maximization using α-logarithmic information measures
|journal=IEEE Transactions on Information Theory
|volume=49 | year=2003 |pages=692–706 |issue=3
|doi=10.1109/TIT.2002.808105
}}
</ref>
이라고 부르며 log-EM 알고리즘이 이에 하위 분류로 포함된다. 따라서, 야스오 마츠야마({{lang|en|Yasuo Matsuyama}})의 α-EM 알고리즘은 정확히 log-EM 알고리즘의 일반화이다. 기울기 또는 Hessian 행렬의 계산이 필요 없다. α-EM은 알맞은 α를 선택함으로써 log-EM 알고리즘 보다 더 빠르게 수렴한다. α-EM 알고리즘은 은닉 마르코프 모델 추정 알고리즘 α-HMM<ref>
{{저널 인용
|last=Matsuyama |first=Yasuo
|title=Hidden Markov model estimation based on alpha-EM algorithm: Discrete and continuous alpha-HMMs
|journal=International Joint Conference on Neural Networks
| year=2011 |pages=808–816
}}
</ref>의 더 빠른 버전으로 이어졌다.

== 변분 베이즈 근사법과의 관계 ==
기댓값 최대화 알고리즘은 부분적으로 베이즈가 아닌 최대가능도방법이다. 최종 결과는 (베이즈 방식의) 잠재변수와 ''θ'' ([[최대가능도]] 또는 사후 최빈값({{lang|en|posterior mode}}) 중 하나)의 점추정에 대한 [[확률분포]]로 나타난다. 잠재변수처럼 ''θ''에 대한 확률분포로 나타나는 완전한 베이즈 방식의 버전을 생각해 볼 수 있다. 추론을 위한 베이즈 방식의 접근법은 간단하게 ''θ''을 다른 잠재변수와 같이 취급한다. 이 패러다임에선 E 단계와 M 단계의 구분이 없어진다. 만약 위에서 말한 변분 베이즈 근사법과 같이 인수분해된 Q 근사값을 사용하면, 각각의 잠재변수(''θ''가 포함된)에 대해서 반복하고 한 번에 하나씩 최적화할 수 있다. 그러면 ''k''가 잠재변수의 개수라고 할 때, 한 번 반복할 때마다 ''k'' 단계들이 있다. 그래픽 모델({{lang|en|graphical models}})에서 앞의 방법은 각각의 변수의 새로운 ''Q''를 마르코프 블랭킷({{lang|en|Markov blanket}})에만 의존시킴으로써 쉽게 만들 수 있어서, 국소 메시지 전달({{lang|en|message passing}})이 효율적인 추론으로 사용될 수 있다.

== 기하학적 해석 ==
정보 [[기하학]]({{lang|en|information geometry}})에서, E 단계와 M 단계는 이중 [[아핀 접속]]의 사영({{lang|en|projection}})으로 해석되고, e-연결({{lang|en|e-connection}})과 m-연결({{lang|en|m-connection}})로 불린다. [[쿨백-라이블러 발산]] 또한 이 용어들을 사용해서 해석될 수 있다.
== 예제 1: 정규분포 혼합 모델 ==

[[파일:Em old faithful.gif|섬네일|240px|[[올드페이스풀 간헐천]] 데이터에 가우시안 [[혼합 모델]]에 맞추는 EM 알고리즘을 보여주는 애니메이션. 알고리즘은 난수로 만들어진 초기값부터 시작해서 수렴해 가는 과정을 보여준다. ]]

두 개의 <math>d</math>차원 [[다변량 정규분포]]가 [[혼합 모델|혼합]]되어 있는 경우, 보통 이 모델에서 <math>d</math>차원 값을 수집하는 것은 가능하지만 그 값이 두 분포 중 어느 분포에서 나온 것인지는 알 수 없는 경우가 많다. 이 문단에서는 이러한 경우에 대한 기댓값 최대화 알고리즘을 유도한다.

전체 <math>n</math>개의 점을 수집하였을 때, <math>i</math>번째 수집되는 값은 <math>(x_i, z_i)</math>로 나타낼 수 있다. 여기에서 <math>x_i</math>는 <math>d</math>차원 점을 가리키는 관측 가능한 변수, 그리고 <math>z_i</math>는 두 개의 정규분포 중 어느 분포에서 수집되었는지를 가리키는 관측 불가능한 변수이다.

<math>i</math>번째 값이 두 분포 중 어느 분포에서 수집되는지에 대한 확률은 <math>i</math>와 독립적이기 때문에, <math>P(z_i)</math>는 <math>i</math>에 관계없는 매개변수로 표현할 수 있다. 이 값을 <math>P(z_i=1) = \tau_1, P(z_i=2) = \tau_2</math>인 벡터 <math>\boldsymbol\tau = (\tau_1, \tau_2)</math>로 표기하도록 한다(<math>\tau_1 + \tau_2 = 1</math>).

마지막으로, 두 다변량 정규분포를 각각 <math>\mathcal{N}_d(\boldsymbol\mu_1, \Sigma_1)</math>, <math>\mathcal{N}_d(\boldsymbol\mu_2, \Sigma_2)</math>로 표기하기로 한다.
즉,
:<math>X_i |(Z_i = 1) \sim \mathcal{N}_d(\boldsymbol{\mu}_1,\Sigma_1)</math>이고 <math>X_i |(Z_i = 2) \sim \mathcal{N}_d(\boldsymbol{\mu}_2,\Sigma_2)</math>이다.

그렇다면 이 모델에서 매개변수는 <math>\boldsymbol{\theta} = (\boldsymbol{\mu}_1, \boldsymbol{\mu}_2, \Sigma_1, \Sigma_2, \boldsymbol\tau)</math>가 된다.

이제 [[가능도]] 함수 <math>L(\boldsymbol\theta;\mathbf{x},\mathbf{z})</math>를 다음과 같이 표현할 수 있다:
불완전한 데이터 가능도 함수는
:<math>L(\theta;\mathbf{x},\mathbf{z}) = P(\mathbf{x},\mathbf{z}| \theta) = \prod_{i=1}^n \sum_{j=1}^2 \tau_j \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j)</math>이고,

완전한 데이터 가능도 함수는
:<math>L(\boldsymbol\theta;\mathbf{x},\mathbf{z}) = P(\mathbf{x},\mathbf{z} \vert \boldsymbol\theta) = \prod_{i=1}^n  \sum_{j=1}^2  \mathbb{I}(z_i=j) \ \tau_j \ f(\mathbf{x}_i;\boldsymbol{\mu}_j,\Sigma_j) </math>이다.

여기에서 <math>\mathbb{I}</math>는 [[지시 함수]]이고, <math>f</math>는 다변량 정규분포의 [[확률 밀도 함수]]이다. 여기에서 <math>f</math>를 풀어서 쓰면 다음과 같이 정리된다.
:<math>L(\boldsymbol\theta;\mathbf{x},\mathbf{z}) = \exp \left\{ \sum_{i=1}^n \sum_{j=1}^2 \mathbb{I}(z_i=j) \big[ \log \tau_j -\tfrac{1}{2} \log |\Sigma_j| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{n}{2} \log(2\pi) \big] \right\} </math>

=== 기댓값 단계(E step) ===

모수 <math>\boldsymbol\theta^{(t)}</math>에 대하여 <math>z_i</math>에 대한 확률분포 <math>\operatorname{P}(Z_i=j | X_i=\mathbf{x}_i ;\boldsymbol\theta^{(t)})</math>는 다음과 같이 계산한다.
:<math>T_{j,i}^{(t)} := \operatorname{P}(Z_i=j | X_i=\mathbf{x}_i ;\boldsymbol\theta^{(t)}) = \frac{\tau_j^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_j^{(t)},\Sigma_j^{(t)})}{\tau_1^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_1^{(t)},\Sigma_1^{(t)}) + \tau_2^{(t)} \ f(\mathbf{x}_i;\boldsymbol{\mu}_2^{(t)},\Sigma_2^{(t)})} </math>
그러면 기댓값 단계의 <math>Q</math> 함수는 다음과 같이 계산된다.
:<math>\begin{align}Q(\boldsymbol\theta|\boldsymbol\theta^{(t)})
&= \operatorname{E} [\log L(\boldsymbol\theta;\mathbf{x},\mathbf{Z}) ] \\
&= \operatorname{E} [\log \prod_{i=1}^{n}L(\boldsymbol\theta;\mathbf{x}_i,\mathbf{z}_i) ] \\
&= \operatorname{E} [\sum_{i=1}^n \log L(\boldsymbol\theta;\mathbf{x}_i,\mathbf{z}_i) ] \\
&= \sum_{i=1}^n\operatorname{E} [\log L(\boldsymbol\theta;\mathbf{x}_i,\mathbf{z}_i) ] \\
&= \sum_{i=1}^n \sum_{j=1}^2 T_{j,i}^{(t)} \big[ \log \tau_j  -\tfrac{1}{2} \log \Sigma_j -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_j)^\top\Sigma_j^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_j) -\tfrac{1}{2} \log(2\pi) \big]
\end{align}</math>

=== 최대화 단계(M step) ===
이제 최대화 단계를 계산한다. <math>Q</math> 함수는 <math>\tau_j</math>에 대한 식과 <math>(\boldsymbol{\mu}_j,\Sigma_j)</math>에 대한 식의 선형 결합이기 때문에, <math>\tau_j</math>와 <math>(\boldsymbol{\mu}_j,\Sigma_j)</math>에 대해 독립적으로 최대화를 계산하는 것이 가능하다.

먼저 모수 <math>\boldsymbol{\tau}^{(t+1)}</math>는 다음과 같이 유도된다.
:<math>\begin{align}\boldsymbol{\tau}^{(t+1)}
&= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}}\  Q(\theta | \theta^{(t)} ) \\
&= \underset{\boldsymbol{\tau}} {\operatorname{arg\,max}} \ \left\{ \left[  \sum_{i=1}^n T_{1,i}^{(t)} \right] \log \tau_1 + \left[  \sum_{i=1}^n T_{2,i}^{(t)} \right] \log \tau_2  \right\}
\end{align}</math>

이 식은 [[이항분포]]에서의 [[최대가능도]]를 계산하는 경우와 동일하며, 따라서 다음과 같이 계산할 수 있다.
:<math>\tau^{(t+1)}_j = \frac{\sum_{i=1}^n T_{j,i}^{(t)}}{\sum_{i=1}^n (T_{1,i}^{(t)} + T_{2,i}^{(t)} ) } = \frac{1}{n} \sum_{i=1}^n T_{j,i}^{(t)}</math>

다음으로 <math>(\boldsymbol{\mu}_1^{(t+1)}, \Sigma_1^{(t+1)})</math>에 대한 식은 다음과 같다.
:<math>\begin{align}(\boldsymbol{\mu}_1^{(t+1)},\Sigma_1^{(t+1)})
&= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\  Q(\theta | \theta^{(t)} ) \\
&= \underset{\boldsymbol{\mu}_1,\Sigma_1} {\operatorname{arg\,max}}\  \sum_{i=1}^n T_{1,i}^{(t)} \left\{ -\tfrac{1}{2} \log |\Sigma_1| -\tfrac{1}{2}(\mathbf{x}_i-\boldsymbol{\mu}_1)^\top\Sigma_1^{-1} (\mathbf{x}_i-\boldsymbol{\mu}_1) \right\}
\end{align}</math>

이 식은 다변량 정규분포에서의 최대가능도를 계산하는 식과 유사하며, 다음과 같이 계산할 수 있다.
:<math>\boldsymbol{\mu}_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} \mathbf{x}_i}{\sum_{i=1}^n T_{1,i}^{(t)}} </math>
:<math>\Sigma_1^{(t+1)} = \frac{\sum_{i=1}^n T_{1,i}^{(t)} (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)}) (\mathbf{x}_i - \boldsymbol{\mu}_1^{(t+1)})^\top }{\sum_{i=1}^n T_{1,i}^{(t)}} </math>
<math>(\boldsymbol{\mu}_2^{(t+1)}, \Sigma_2^{(t+1)})</math>도 같은 방식으로 구할 수 있다.

=== 알고리즘의 종료 ===

반복과정은 각 과정에서 가능도 사이의 증가폭이 사전에 정의된 기준점 아래로 내려오면 종료한다.

즉, 사전에 정의된 기준점 <math>\epsilon</math>에 대해서, <math>\log L(\theta^{t};\mathbf{x},\mathbf{Z}) \leq \log L(\theta^{(t-1)};\mathbf{x},\mathbf{Z}) + \epsilon</math>이면 종료한다.

=== 일반화 ===

앞서 소개한 알고리즘은 두 개 이상의 다변량 정규 분포를 혼합한 경우에도 적용이 가능하다.

== 예제 2: 베르누이 혼합 모델 ==

동일한 크기의 이미지 <math>N</math>개가 주어졌을 때, 이를 <math>\mathbf{x_n} = (x_{n,1}, ..., x_{n,D})^T, n \in \{1, ..., N \}</math>라 하자. 각 이미지는 <math>D</math>개의 픽셀로 이루어져 있고 각 픽셀은 <math>0</math> 혹은 <math>1</math>의 값을 가진다. 즉, <math>x_{n,i} \in \{ 0, 1 \}</math>이다. 또한 각각의 픽셀이 서로에 대해 [[조건부 확률|조건부 독립]]적이라 가정하면, 성분 <math>k</math>의 모든 이미지에 관한 픽셀 <math>i</math>의 확률분포는 모수 <math>0 \leq \mu_{k, i} \leq 1</math>를 갖는 베르누이 분포로 모델링 할 수 있다.

[[혼합 모델]]이 <math>K</math>개의 군집(cluster)을 가진다고 가정할 때, 어떤 이미지가 군집 <math>i</math>에 속해 있을 확률을 <math>\mathbf{\pi_i}, i \in \{1, ..., K\}, \sum_{i = 1}^K \pi_i = 1</math>이라 하자. <math>\mathbf{\pi_i}</math>는 [[혼합 모델|혼합 계수]](mixing coefficient)라 불린다.

이 모델은 어떤 이미지가 특정 군집에 속할 [[확률변수]]와 그 군집에서의 픽셀 분포 변수를 모수로 갖는다. 즉, <math>\theta = (\mathbf{\mu}, \mathbf{\pi})</math>이고,

<math>\mathbf{\mu} := (\mathbf{\mu_1} \; \mathbf{\mu_2} \;... \;\mathbf{\mu_K} ) = \left( \begin{array}{cccc} \mu_{1, 1} & \mu_{2, 1} & ... & \mu_{K, 1} \\ \mu_{1, 2} & \mu_{2, 2} & ... & \mu_{K, 2} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{1, D} & \mu_{2, D} & ... & \mu_{K, D} \\ \end{array} \right), \mathbf{\pi} := ( \pi_1, ..., \pi_K )^T</math>이다.

하나의 트레이닝 이미지(training image) <math>\mathbf{x}</math>에 관한 [[가능도]] 함수는 다음과 같다.

<math>
\begin{align}
	\,\text{Pr}(\mathbf{x} | \theta) = \,\text{Pr}(\mathbf{x} | \mathbf{\mu}, \mathbf{\pi}) = \sum_{k = 1}^K \pi_k \,\text{Pr}(\mathbf{x}|\mathbf{\mu_k})
\end{align}</math>

이 때 <math>\mathbf{x}</math>가 군집 <math>k</math>에 의해 생성되었을 확률은 다음과 같다.

<math>
\begin{align}
	\,\text{Pr}(\mathbf{x}|\mathbf{\mu_k}) = \prod_{i = 1}^D \mu_{k, i}^{x_i} (1 - \mu_{k, i})^{1 - x_i}.
\end{align}</math>

<math>K</math>-차원 잠재변수 <math>\mathbf{z_i}, i \in \{1, ..., N \}</math>을 정의함으로써 주어진 이미지가 어떤 군집에 속할 확률을 모델링 할 수 있다. <math>\mathbf{z_i}</math>가 <math>1</math>-of-<math>K</math> 표기법으로 나타내어졌다고 하자. 즉,

<math>\mathbf{z_i} := (z_{i, 1}, ..., z_{i, K})^T</math>에 대하여, <math>z_{i, j} \in \{0, 1\}, i \in \{ 1, ..., N \}, j \in \{ 1, ..., K \}</math>이고, <math>\sum_{j = 1}^{K} z_{i, j} = 1</math>가 성립한다.

<math>\mathbf{z_i}</math>의 분포가 <math>\,\text{Pr}(z_{i, j} = 1) = \pi_j</math>로 주어져 있다고 가정하자. 그러면 다음이 성립한다.

<math>
\begin{align}
	\,\text{Pr}(\mathbf{z_n} | \mathbf{\pi}) = \prod_{i = 1}^K \pi_i^{z_{n, i}}.
\end{align}</math>

또한, <math>\,\text{Pr}(\mathbf{x_n} | z_{n, k} = 1) = \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})</math>라 하면,

<math>
\begin{align}
	\,\text{Pr}(\mathbf{x_n} | \mathbf{z_n}, \mathbf{\mu}, \mathbf{\pi}) = \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}}.
\end{align}</math>이다.

잠재변수 <math>\mathbf{z_i}</math>들을 포함하는 집합 <math>\mathbf{Z} := \{ \mathbf{z_1}, ..., \mathbf{z_N} \}</math>를 정의하고 위 수식을 사용하면,

<math>
\begin{align}
	\,\text{Pr}(\mathbf{Z}|\mathbf{\pi}) = \prod_{n = 1}^N \,\text{Pr}(\mathbf{z_n}|\mathbf{\pi}) = \prod_{n = 1}^N \prod_{k = 1}^K \pi_k^{z_{n, k}}.
\end{align}</math>이 성립하고,

트레이닝 이미지의 집합 <math>\mathbf{X} := \{ \mathbf{x_1}, ..., \mathbf{x_N} \}</math>를 정의하면 <math>\mathbf{Z}</math>에 관한 <math>\mathbf{X}</math>의 조건부 확률분포를 다음과 같이 유도할 수 있다.

<math>
\begin{align}
	\,\text{Pr}(\mathbf{X}|\mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) = \prod_{n = 1}^N \,\text{Pr}(\mathbf{x_n}|\mathbf{z_n},\mathbf{\mu},\mathbf{\pi}) = \prod_{n = 1}^N \prod_{k = 1}^K \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_k})^{z_{n, k}} = \prod_{n = 1}^N \prod_{k = 1}^K \left( \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}}.
\end{align}</math>

이제 위 수식들과 확률의 곱셈법칙을 이용하면 완전한 데이터 가능도({{lang|en|complete data likelihood}})를 구할 수 있고,

<math>
\begin{align}
\,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) = \,\text{Pr}(\mathbf{X} | \mathbf{Z}, \mathbf{\mu}, \mathbf{\pi}) \,\text{Pr}(\mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) = \prod_{n = 1}^N \prod_{k = 1}^K \left( \pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} \right)^{z_{n, k}}.
	\end{align}</math>

이에 로그함수를 취하면 완전한 데이터 로그가능도({{lang|en|complete data log-likelihood}}) <math>\mathcal{L}(\theta)</math>가 유도된다.

<math>
\begin{align}
	\mathcal{L}(\theta) = \ln \,\text{Pr}(\mathbf{X}, \mathbf{Z}| \mathbf{\mu}, \mathbf{\pi}) = \sum_{n = 1}^N \sum_{k = 1}^K z_{n, k} \left( \ln \pi_k + \sum_{i = 1}^D x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right).
\end{align}</math>

=== 기댓값 단계(E step) ===
<br />
기댓값 단계에서 잠재변수의 값을 업데이트 하기 위해 <math>\mathbf{Z}</math>의 확률분포를 <math>\,\text{Pr}(\mathbf{Z} | \mathbf{X}, \mathbf{\theta})</math>이라 하자. 이 분포는 정확히 풀 수 없으므로 값을 추정해야 하는데, 이를 위해 현재 <math>z_{n, k}</math>의 값을 이 변수의 기댓값으로 대체한다.

<math>
\begin{align}
z_{n, k}^\text{new} \leftarrow \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}, \mathbf{\pi}}[z_{n, k}] = \sum_{z_{n, k}} \,\text{Pr}(z_{n,k} | \mathbf{x_n}, \mathbf{\mu}, \mathbf{\pi}) \, z_{n,k} = \frac{\pi_k \,\text{Pr}(\mathbf{x_n} |\mathbf{\mu_k})}{\sum_{m = 1}^K \pi_m \,\text{Pr}(\mathbf{x_n} | \mathbf{\mu_m})} = \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}.
\end{align}</math>

이와 같이 <math>z_{n,k}</math>의 값을 업데이트 하면 잠재변수 <math>\mathbf{z_{n}}^\text{new}</math>가 더 이상 <math>1</math>-of-<math>K</math>식으로 나타내어지지 않는다는 사실을 알 수 있다. 이는 이미지 <math>\mathbf{x_{n}}</math>이 하나의 군집이 아니라 여러 개의 군집에 부분적으로 속한다는 것을 의미한다.

=== 최대화 단계(M step) ===
<br />
이 모델의 모수들(혼합 계수 및 군집별 픽셀 분포 변수)을 최대화하기 위해 다음 식을 이용한다.

<math>\mathbf{\theta}^\text{new} \leftarrow \underset{\mathbf{\theta}}{\text{arg max }} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right]</math>

<math>
\begin{align}
	\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \sum_{k = 1}^K \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, k} \right] \left( \ln \pi_k + \sum_{i = 1}^D x_{n, i} \ln \mu_{k, i} + (1 - x_{n, i}) \ln (1 - \mu_{k, i}) \right).
\end{align}</math>

위 식을 <math>\mathbf{\mu_k}</math>에 관해 최대화하기 위해 미분하여 <math>0</math>이 되는 값을 찾는다.

<math>
\begin{align}
\frac{\partial}{\partial \mu_{m, j}} \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] &= \sum_{n = 1}^N \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \mathbf{\mu}^\text{old}, \mathbf{\pi}^\text{old}} \left[ z_{n, m} \right] \left( \frac{x_{n, j}}{\mu_{m, j}} - \frac{1 - x_{n, j}}{1 - \mu_{m, j}} \right) \\
&= \sum_{n = 1}^N z_{n, m}^\text{new} \frac{x_{n, j} - \mu_{m, j}}{\mu_{m, j} (1 - \mu_{m, j})} = 0 \Leftrightarrow \\
\mu_{m, j} &= \frac{1}{N_m} \sum_{n = 1}^N x_{n, j} z_{n, m}^\text{new},
\end{align}</math>

이 때, <math>N_m = \sum_{n = 1}^N z_{n, m}^\text{new}</math>는 군집 <math>m</math>에 실질적으로 속한 이미지의 개수이다.

군집 <math>m</math>에서의 픽셀 분포의 매개변수 <math>\mathbf{\mu_m}</math>는 다음과 같이 나타낼 수 있다.

<math>
\mathbf{\mu_m} = \mathbf{\bar{x}_m}, \mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m}^\text{new} \mathbf{x_n}</math>

<math>\mathbf{\bar{x}_m}</math>는 군집 <math>m</math>에 (실질적으로) 속한 이미지들의 [[평균|가중 평균]]을 의미한다.

<math>\mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right]</math>를 혼합 계수 <math>\mathbf{\pi}</math>와 조건 <math>\sum_{k = 1}^K \pi_k = 1</math>에 관해 최대화하기 위해 [[라그랑주 승수법]](Lagrange Multiplier)을 이용한다.

<math>
\Lambda(\theta, \lambda) := \mathbb{E}_{\mathbf{Z} | \mathbf{X}, \theta^\text{old}} \left[ \mathcal{L}(\theta) \right] + \lambda \left( \sum_{k = 1}^K \pi_k - 1 \right)</math>

최댓값은 다음과 같이 구할 수 있다.

<math>
\begin{align}
\frac{\partial}{\partial \pi_{m}} \Lambda(\theta, \lambda) = \frac{1}{\pi_m} \sum_{n = 1}^N z_{n,m}^\text{new} + \lambda = 0 \Leftrightarrow \\
\pi_m = -\frac{N_m}{\lambda},
\end{align}</math>

<math>
\begin{align}
\frac{\partial}{\partial \lambda} \Lambda(\theta, \lambda) = \sum_{k = 1}^K \pi_k - 1 = 0 \Leftrightarrow \\
\sum_{k = 1}^K \pi_k = 1.
\end{align}</math>

위 두 결과를 종합하면 <math>\lambda = - \sum_{k = 1}^K N_k = - N</math>이고,

<math>\pi_m = -\frac{N_m}{\lambda} = \frac{N_m}{N}</math>이다.

=== 요약 ===
베르누이 혼합 모델의 기댓값(E) 및 최대화(M) 단계의 수식들은 다음과 같다.

'''기댓값 단계(E step):'''
<br />
<math>z_{n, k} \leftarrow \frac{\pi_k \prod_{i = 1}^D \mu_{k, i}^{x_{n, i}} (1 - \mu_{k, i})^{1 - x_{n, i}} }{\sum_{m = 1}^K \pi_m \prod_{i = 1}^D \mu_{m, i}^{x_{n, i}} (1 - \mu_{m, i})^{1 - x_{n, i}}}</math>

<br />
'''최대화 단계(M step):'''
<br />
<math>\mathbf{\mu_m} \leftarrow \mathbf{\bar{x}_m},</math>
<math>\pi_m \leftarrow \frac{N_m}{N},</math>

이 때, <math>\mathbf{\bar{x}_m} = \frac{1}{N_m} \sum_{n = 1}^N z_{n, m} \mathbf{x_n}</math>이고, <math>N_m = \sum_{n = 1}^N z_{n, m}</math>이다.

== EM의 대체 알고리즘 ==
EM은 극값으로만 수렴하고 일반적으로 수렴율에 대한 범위가 없다. 고차원에서 잘 작동하지 않고 극값의 수가 다수가 존재할 가능성이 있다. 따라서 특히 고차원 상에서 다른 학습 방법을 사용할 필요가 있다. 이 때 모멘트-기반 접근 방식이나 스펙트럼 기법이 더 좋은 효과를 보장한다. 모멘트-기반 접근은<ref name="anandkumar2014tensor">{{저널 인용|last1=Anandkumar|first1=Animashree|last2=Ge|first2=Rong|last3=Hsu|first3=Daniel|last4=Kakade|first4=Sham M|last5=Telgarsky|first5=Matus|title=Tensor decompositions for learning latent variable models|journal=The Journal of Machine Learning Research|volume=15|number=1|pages=2773–2832|year=2014}}</ref> 확률 모형의 매개변수를 학습하기 위한 것으로 최근 들어 사용이 많아지기 시작했는데, 이는 EM이 극값만을 보장해주는 것에 반해 모멘트-기반 접근은 최대, 최솟값을 보장하기 때문이다. 혼합 모형, [[은닉 마르코프 모델]]<ref name=anandkumar2012method>{{저널 인용| title=A method of moments for mixture models and hidden Markov models | url=https://archive.org/details/arxiv-1203.0683 | last1=Anandkumar | first1=Animashree|last2=Hsu|first2=Daniel|last3=Kakade|first3=Sham M| journal=arXiv preprint arXiv:1203.0683| year=2012}}</ref>, 공동체 모델 등으로부터 학습을 보장하는 알고리즘들이 유도될 수 있다.<ref name="anandkumar2014tensor_community">{{저널 인용|title=A tensor approach to learning mixed membership community models|last1=Anandkumar|first1=Animashree|last2=Ge|first2=Rong|last3=Hsu|first3=Daniel|last4=Kakade|first4=Sham M|journal=The Journal of Machine Learning Research|volume=15|number=1|pages=2239–2312|year=2014}}</ref> 이 스펙트럼 방법들은 잘못된 극값을 찾아내지 않고, 모수의 참값을 특정 제약 조건 하에서 일관적으로 도출시킨다.

== 같이 보기 ==
* [[주성분 분석]]
== 참조 ==
EM 알고리즘은 MM 알고리즘<ref>{{웹 인용|last=Lange|first=Kenneth|title=The MM Algorithm|url=http://www.stat.berkeley.edu/~aldous/Colloq/lange-talk.pdf}}</ref>의 특수한 형태로 볼 수 있다.

== 각주 ==
{{각주}}

== 추가 자료 ==
* Robert Hogg, Joseph McKean and Allen Craig. ''Introduction to Mathematical Statistics''. pp.&nbsp;359–364. Upper Saddle River, NJ: Pearson Prentice Hall, 2005.
* [http://www.inference.phy.cam.ac.uk/mackay/itila/ The on-line textbook: Information Theory, Inference, and Learning Algorithms], by David J.C. MacKay includes simple examples of the EM algorithm such as clustering using the soft ''k''-means algorithm, and emphasizes the variational view of the EM algorithm, as described in Chapter 33.7 of version 7.2 (fourth edition).
* {{저널 인용 | id = {{citeseerx|10.1.1.9.9735}} | title = The Expectation Maximization Algorithm | first = Frank | last = Dellaert | postscript = , }} gives an easier explanation of EM algorithm in terms of lowerbound maximization.
* {{서적 인용
  | last1 = Bishop   | first1 = Christopher M.
  | title = Pattern Recognition and Machine Learning
  | url = https://archive.org/details/patternrecogniti0000bish   | year = 2006
  | publisher = Springer
  | ref = CITEREFBishop2006
  | isbn = 0-387-31073-8
  }}
* {{서적 인용|doi=10.1561/2000000034 | title = Theory and Use of the EM Algorithm | author = M. R. Gupta and Y. Chen | year = 2010}} A well-written short book on EM, including detailed derivation of EM for GMMs, HMMs, and Dirichlet.
* {{저널 인용 | id = {{citeseerx|10.1.1.28.613}} | title = A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models | first = Jeff | last = Bilmes | postscript = , }} includes a simplified derivation of the EM equations for Gaussian Mixtures and Gaussian Mixture Hidden Markov Models.
* [http://www.cse.buffalo.edu/faculty/mbeal/papers/beal03.pdf Variational Algorithms for Approximate Bayesian Inference], by M. J. Beal includes comparisons of EM to Variational Bayesian EM and derivations of several models including Variational Bayesian HMMs ([http://www.cse.buffalo.edu/faculty/mbeal/thesis/index.html chapters]).
* [http://www.seanborman.com/publications/EM_algorithm.pdf The Expectation Maximization Algorithm: A short tutorial], A self-contained derivation of the EM Algorithm by Sean Borman.
* [http://pages.cs.wisc.edu/~jerryzhu/cs838/EM.pdf The EM Algorithm], by Xiaojin Zhu.
* [http://arxiv.org/pdf/1105.1476.pdf EM algorithm and variants: an informal tutorial] by Alexis Roche. A concise and very clear description of EM and many interesting variants.
* {{서적 인용
 | author = Einicke, G.A.
 | year = 2012
 | title = Smoothing, Filtering and Prediction: Estimating the Past, Present and Future
 | publisher = Intech
 | location = Rijeka, Croatia
 | isbn = 978-953-307-752-9
 | url = http://www.intechopen.com/books/smoothing-filtering-and-prediction-estimating-the-past-present-and-future}}

[[분류:기계 학습 알고리즘]]
[[분류:추정 이론]]
[[분류:추정법]]
[[분류:최적화 알고리즘]]
[[분류:클러스터 분석 알고리즘]]
[[분류:통계 알고리즘]]