取火之路: 用 Jensen-Shannon divergence 計算相似度

今天認識到了 Jensen-Shannon divergence，它似乎是一個常用於計算兩個機率分佈間之相似程度的方法。

where

and

根據維基百科的條目，它又被稱為 information radius (IRad) 或 total divergence to the average。儘管是筆者自己的猜測，但我想後者名字的由來應該與 M = 1/2(P+Q) 這個平均值脫不了干係。

2009 年 PNAS 刊了一篇充分運用 Jensen-Shannon divergence 方法以比較多種基因體的文章。對於想運用 Jensen-Shannon divergence 以計算 DNA 序列間相似程度的話，裡頭的公式與步驟應已經夠詳細與清楚了。

標題：Alignment-free genome comparison with featurefrequency profiles (FFP) and optimal resolutions

另外，筆者發現 R 有一個名為 textcat 的套件，裡頭含有可以計算 Jensen-Shannon divergence 的函式 ─ textcat_xdist()。這個函式的優點在於，給定 n-gram profiles，它還可以計算 Kullback-Leibler I-divergence, Kullback-Leibler J-divergence, the sum of the absolute differences in n-gram log frequencies, 以及 cosine dissimilarity 等等。

雖然有現成的 R 套件可以使用，但也不需要高興的太早。依照筆者這幾天的經驗，資料量太大(一千個維度以上) 的情況就開始可能會有「記憶體不足」的錯誤發生，差不多的意思就是這函式報廢了。

所以最好還是從頭了解，弄清楚每個算式，別擔心把手弄髒。如此一來最壞的情況下也有辦法實作演算法，把任務完成。

提醒

若要得到距離，記得將 Jensen-Shannon divergence 開平方根

The Jensen-Shannon divergence is not a distance (as it does not obey the triangle inequality), but its square root is. ─ "What Makes a Query Difficult?" Carmel et al., ACM, 2006.

取火之路

Navigation

用 Jensen-Shannon divergence 計算相似度

0 意見:

張貼留言

Popular Posts

Blog Archive

Categories