L-BFGS 문서 원본 보기

{{위키데이터 속성 추적}}
'''L-BFGS (혹은 Limited-memory BFGS,''' '''LM-BFGS''') 준-뉴턴 방식 ([[:en:Quasi-Newton_method|quasi-Newton methods]])의 최적화 알고리즘이다. 제한된 컴퓨터 메모리를 이용하여 기존 BFGS ([[:en:Broyden–Fletcher–Goldfarb–Shanno_algorithm|Broyden–Fletcher–Goldfarb–Shanno algorithm]]) 알고리즘을 속도면에서 개선한 알고리즘이다. (결과는 근사값). 본 알고리즘은 Adam과 함께 머신 러닝<ref name="malouf">{{콘퍼런스 인용|url=https://dl.acm.org/citation.cfm?id=1118871|title=A comparison of algorithms for maximum entropy parameter estimation|last1=Malouf|first1=Robert|date=2002|book-title=Proceedings of the Sixth Conference on Natural Language Learning (CoNLL-2002)|pages=49–55|doi=10.3115/1118853.1118871|doi-access=free}}</ref><ref name="owlqn">{{서적 인용|title=Proceedings of the 24th International Conference on Machine Learning|last1=Andrew|first1=Galen|last2=Gao|first2=Jianfeng|year=2007|chapter=Scalable training of L₁-regularized log-linear models|doi=10.1145/1273496.1273501|isbn=9781595937933|s2cid=5853259|chapter-url=http://research.microsoft.com/apps/pubs/default.aspx?id=78900}}</ref>에 있어서 널리 사용되는 파라메터 추정 알고리즘이다. 본 알고리즘의 목적함수는 <math>f(\mathbf{x})</math>의 최소값을 찾는 것이며, 여기서, <math>\mathbf{x}</math>는 unconstrained value의 real-vector이고, <math>f</math>는 미분가능한 스칼라 함수이다.

원래의 BFGS와 같이, L-BFGS 알고리즘은 variable space의 검색을 조정하는 추정된 inverse [[:en:Hessian_matrix|Hessian matrix]]를 이용한다. 하지만 BFGS는 inverse Hessian에서 추정된 전체 <math>n\times n</math> 을 이용하고, L-BFGS는 단지 몇개의 벡터만을 저장한다. 비록 일부의 벡터이지만 이런 일부의 벡터 역시 암시적으로 대표성을 지니고 있다. (주로 홀수개의 벡터를 지정하고 10 이하. 문헌에서 봤는데 이유는 기억이 가물가물.. ) 여기서 ''n은'' 문제에서 변수의 갯수를 의미한다. 선형 메모리의 요구사항으로 인해, L-BFGS는 많은 변수를 가진 최적화 문제에 적합하다고 알려져 있다. inverse Hessian '''H'''''<sub>k</sub>''  대신에 L-BFGS는 과거 m update의 기록과 gradient ∇''f''('''x''') 이용한다. 일반적으로 과거 기록인 m 의 크기는 주로 10 이하를 선정한다.

<!--문서는 Wikipedia의 영어판에 게재된 [[:en:Limited-memory_BFGS|Limited-memory BFGS]] 문서의 한국어 번역이며, 필자의 판단으로 일부 내용은 의역하였음.-->

== 알고리즘 (Algorithm) ==
알고리즘은 최적 변수의 초기 추정값, <math>\mathbf{x}_0</math>,과 함께 시작하고, 반복적으로 더 나은 추정값 <math>\mathbf{x}_1,\mathbf{x}_2,\ldots</math>을 평가하는 과정으로 진행된다. 목적 함수의 미분, <math>g_k:=\nabla f(\mathbf{x}_k)</math>, steepest descent의 방향을 정하고, 이차 미분 함수인 Hessian 메트릭스의 추정을 구성하는데 아주 중요한 요소이다.

L-BFGS는 다른 quasi-Newton 알고리즘과 많은 특징을 공유하지만, 메트릭스-벡터의 곱, <math>d_k=-H_k g_k</math>, 이 수행되는 방식에서 매우 다르다. 여기서, <math>d_k</math>는 대략적인 뉴튼의 방향이고,  <math>g_k</math>는 현재의 gradient, <math>H_k</math>는 Hessian 메트릭스의 역행렬이다. 이러한 방향 벡터를 정하기 위한 업데이트의 기록을 이용하는 많은 접근방식이 있으며, 여기서는 소위 "two loop recursion"<ref name=":0">{{저널 인용|title=The solution of non linear finite element equations|journal=International Journal for Numerical Methods in Engineering|last1=Matthies|first1=H.|last2=Strang|first2=G.|year=1979|volume=14|issue=11|pages=1613–1626|bibcode=1979IJNME..14.1613M|doi=10.1002/nme.1620141104}}</ref><ref name=":1">{{저널 인용|title=Updating Quasi-Newton Matrices with Limited Storage|journal=Mathematics of Computation|last=Nocedal|first=J.|year=1980|volume=35|issue=151|pages=773–782|doi=10.1090/S0025-5718-1980-0572855-7|doi-access=free}}</ref>이라고 불리는 일반적인 접근법 (알고리즘)을 소개한다.

k번째 iteration에서의 위치와 gradient를 <math>x_k</math>, <math>g_k\equiv\nabla f(x_k)</math>라 하자. 여기서 <math>f</math>는 최소화되어야 함수이고, 모든 벡터들은 컬럼 벡터이다. 마지막 {{mvar|m}} 업데이트는 저장했다고 가정하자.

:<math>s_k = (x_{k+1} - x_{k})</math>,
:<math>y_k = (g_{k+1} - g_k)</math>.

<math>\rho_k = \frac{1}{y^{\top}_k s_k} </math> 이고,  <math>H^0_k</math>는 inverse Hessian의 초기 가정이라고 하자. BFGS를 기반으로 inverse Hessian은 아래와 같다.

:
:<math>H_{k+1} = (I-\rho_k s_k y_k^\top)H_k(I-\rho_k y_k s_k^\top) + \rho_k s_k s_k^\top. </math>
:
하나의 고정된 {{mvar|k}}에서 벡터의 시퀀스를 <math>q_{k-m},\ldots,q_k</math> as <math>q_k:=g_k</math> 와 <math>q_i:=(I-\rho_i y_i s_i^\top)q_{i+1}</math> 로 정의하자. 그때, <math>q_{i+1}</math>로부터 <math>q_i</math>를 계산하는 번복 알고리즘은 <math>\alpha_i := \rho_i s_i^\top q_{i+1}</math> 로 정의하고, 다음과 같이 <math>q_i=(q_{i+1}-\alpha_i y_i)</math>를 정한다.

또 하나의 벡터의 시퀀스 <math>z_{k-m},\ldots,z_k</math>는 <math>z_i:=H_iq_i</math>로 정한다. <math>z_{k-m}=H_k^0 q_{k-m}</math>, <math>\beta_i:=\rho_i y_i^\top z_i</math>, <math>z_{i+1}=z_i + (\alpha_i - \beta_i)s_i</math>를 결정하는 또하나의 recursive 알고리즘이 있다.  <math>z_k</math>의 값은 이제 오르막 방향이며, 다음과 같이 내리막 방향을 계산할 수 있다.
: <math>\begin{array}{l}
q = g_k\\
\mathtt{For}\ i=k-1, k-2, \ldots, k-m\\
\qquad \alpha_i = \rho_i s^\top_i q\\
\qquad q = (q - \alpha_i y_i)\\
\gamma_k = \frac{s_{k - 1}^{\top} y_{k - 1}}{y_{k - 1}^{\top} y_{k - 1}} \\
H^0_k=  \gamma_k I\\
z = H^0_k q\\
\mathtt{For}\ i= (k-m, k-m+1, \ldots, k-1)\\
\qquad \beta_i = \rho_i y^\top_i z\\
\qquad z = z + s_i (\alpha_i - \beta_i)\\
z = -z
\end{array}</math>
:
상기의 공식은 최소화 문제 <math>z = - H_k g_k</math>를 위한 방향 검색을 제공한다. 최대화 문제에 대해서는, {{mvar|z}}의 부호를 변경해야 한다. 초기 inverse Hessian <math>H^0_k</math> 근사값은 대각 행렬값이나, 효율적인 계산 방식인 identity 메트릭스의 배수로 선택한다.

초기 메트릭스 <math>\gamma_k</math>의 스케일은 방향 검색이 잘 되도록 배율되어야 하고, 단위 스텝의 길이가 대부분의 iteration에서 적용가능하여야 한다. [[:en:Wolfe_conditions|Wolfe line search]]는 curvature조건을 만족하고, BFGS의 업데이트를 안정시키기 위해 사용된다.

어떤 소프트웨어에서는 [[:en:Backtracking_line_search|backtracking line search]]를 이용하지만, 이것은 선택된 스텝에 의해 curvature 조건 <math>y_k^{\top} s_k > 0</math> 만족되어 져야 함을 보증하지 못한다. 어떤 알고리즘 구현에서는 <math>y_k^{\top} s_k</math>가 음수이거나, zero (영)에 가까우면 건너 뛰는 방식을 취하지만, 이런 접근법은 일반적으로 권장하지는 않는다. 왜냐하면 너무 자주 건너 뛰므로써 중요한 곡면 정보를 파악하기 힘들고, 적절한 Hessian 추정을 못할 수 있기 때문이다.

이러한 two loop update는 단지 inverse Hessian 잘 작동한다. 다른 방식으로 inverse Hessian 추정하는 직접 근사 Hessian <math>B_k</math><ref>{{저널 인용|title=Representations of Quasi-Newton Matrices and their use in Limited Memory Methods|journal=Mathematical Programming|last1=Byrd|first1=R. H.|last2=Nocedal|first2=J.|year=1994|volume=63|issue=4|pages=129–156|doi=10.1007/BF01582063|last3=Schnabel|first3=R. B.|s2cid=5581219}}</ref>를 이용하는 방식도 개발되어 있다.

== 변형된 알고리즘들 (Variants) ==
BFGS와 L-BFGS는 구속조건들 없이 Smooth 함수를 최적화하도록 설계되었기 때문에, L-BFGS 알고리즘은 미분이 되지 않은 함수나 구속들을 포함하는 함수에 적용하기 위해서는 수정이 필요하다. 가장 인기있는 변형된 알고리즘은 active-set방식이라 불린다. 이 아이디어는 현재 iterate의 작은 근사값으로 구속했을 때, 함수와 구속을 단순화 할 수 있다는 것이다.

=== L-BFGS-B ===
L-BFGS-B 알고리즘은 simple box 구속들을 다루기 위해서 확장된 방식이다. 단순화된 구속은 하한과 상한 경계가 있는 {{math|''l<sub>i</sub>'' ≤ ''x<sub>i</sub>'' ≤ ''u<sub>i</sub>''}} 형식이며, 각 {{mvar|x<sub>i</sub>}}를 위해, 양 방향 혹은 한쪽 방향의 경계는 제거되어 질 수도 있다<ref name="LBFGSB1">{{저널 인용|title=A Limited Memory Algorithm for Bound Constrained Optimization|journal=[[SIAM Journal on Scientific Computing|SIAM J. Sci. Comput.]]|last1=Byrd|first1=R. H.|last2=Lu|first2=P.|url=https://digital.library.unt.edu/ark:/67531/metadc666315/|year=1995|volume=16|issue=5|pages=1190–1208|doi=10.1137/0916069|last3=Nocedal|first3=J.|last4=Zhu|first4=C.}}</ref><ref name="algo778">{{저널 인용|title=L-BFGS-B: Algorithm 778: L-BFGS-B, FORTRAN routines for large scale bound constrained optimization|journal=ACM Transactions on Mathematical Software|last1=Zhu|first1=C.|last2=Byrd|first2=Richard H.|year=1997|volume=23|issue=4|pages=550–560|doi=10.1145/279232.279236|last3=Lu|first3=Peihuang|last4=Nocedal|first4=Jorge|s2cid=207228122}}</ref> . 이 방법은 (단순 gradient 방식을 이용하는) 각 스텝에서 고정과 자유 변수들을 결정하고, 높은 정확도을 얻기 위해서 자유 변수에 L-BFGS를 이용하고, 과정을 반복함으로써 동작한다.

=== OWL-QN ===
'''Orthant-wise limited-memory quasi-Newton''' ('''OWL-QN''')는 [[Taxicab geometry|<math>\ell_1</math>]]-[[Regularization (mathematics)|regularized]] 모델을 피팅하고, inherent sparsity 모델을 이용하기 위해 변형된 L-BFGS알고리즘이다<ref name="owlqn" />. 아래의 형태의 함수를 최소화 (최적화)한다.
: <math>f(\vec x) = g(\vec x) + C \|\vec x\|_1</math>

여기서 <math>g</math>는 미분가능한 convex 손실 함수이다. 이 방식은 하나의 active-set 방식이며, 변수들의 각 요소의 부호를 평가(추정, 예측)하고, 같은 부호를 가지기 위해 서브 시퀀스 스텝을 제약한다. 일단 부호가 고정되면, 미분이 가능하지 않은 <math> \|\vec x\|_1</math> term은 하나의 smooth linear term 이 된다. 이런 smooth linear term은 L-BFGS에 의해 이용될 수 있다. 하나의 L-BFGS스텝 이후, 이 방식은 부호를 변경하기 위한 일부 변수들을 허용하고, 과정을 반복한다.

=== O-LBFGS ===
Schraudolph 둥은 BFGS와 L-BFGS에 대한 온라인 근사법을 제안했다<ref>{{콘퍼런스 인용|title=A stochastic quasi-Newton method for online convex optimization|first1=N.|last1=Schraudolph|first2=J.|last2=Yu|first3=S.|last3=Günter|conference=AISTATS|year=2007}}</ref>. [[:en:Stochastic_gradient_descent|Stochastic gradient descent (SGD)]]와 유사하게, 이 방법은 에러 함수와 각 iteration에서의 전체 데이터 셋에 대한 randomly drawn subset의 gradient를 평가함으로써, 계산의 복잡성을 감소시키기 위해 사용될 수 있다. BFGS 혹은 O-BFGS의 온라인 근사법은 반드시 수렴하지는 않지만<ref>{{저널 인용|title=RES: Regularized Stochastic BFGS Algorithm|journal=IEEE Transactions on Signal Processing|last1=Mokhtari|first1=A.|last2=Ribeiro|first2=A.|year=2014|volume=62|pages=6089–6104|arxiv=1401.7625|bibcode=2014ITSP...62.6089M|doi=10.1109/TSP.2014.2357775|number=23|citeseerx=10.1.1.756.3003|s2cid=15214938}}</ref>, O-LBFGS는 거의 글로벌하게 수렴함<ref>{{저널 인용|title=Global convergence of online limited memory BFGS|journal=Journal of Machine Learning Research|last1=Mokhtari|first1=A.|last2=Ribeiro|first2=A.|url=http://www.jmlr.org/papers/volume16/mokhtari15a/mokhtari15a.pdf|year=2015|volume=16|pages=3151–3181}}</ref>을 보여준다.

== 각주 ==
{{각주}}

== 읽을거리 ==
* {{저널 인용|title=On the Limited Memory Method for Large Scale Optimization|journal=Mathematical Programming B|last1=Liu|first=D. C.|last2=Nocedal|first2=J.|url=http://www.ece.northwestern.edu/~nocedal/PSfiles/limited-memory.ps.gz|year=1989|volume=45|issue=3|pages=503–528|doi=10.1007/BF01589116|citeseerx=10.1.1.110.6443|s2cid=5681609}}
* {{웹 인용|url=http://aria42.com/blog/2014/12/understanding-lbfgs|title=Numerical Optimization: Understanding L-BFGS|last1=Haghighi|first1=Aria|date=2 Dec 2014}}
* {{서적 인용|title=Conjugate Gradient Algorithms in Nonconvex Optimization|last=Pytlak|first=Radoslaw|year=2009|publisher=Springer|pages=159–190|chapter=Limited Memory Quasi-Newton Algorithms|isbn=978-3-540-85633-7|chapter-url=https://www.google.com/books/edition/Conjugate_Gradient_Algorithms_in_Nonconv/RhRkaDPmwVoC?hl=en&gbpv=1&pg=PA159}}

[[분류:최적화 알고리즘]]