取火之路: 值得注目的 affinity propagation clustering

近來有個問題在腦中逐漸成形，出現了將 DNA 序列做分群的需求，因此著手尋找有用的套件。結果我找到了 APCluster，一個基於 affinity propagation (AP) clustering 的 R 套件。什麼是 affinity propagation clustering？它很強大嗎？查閱了一下原始出處，才發現這個演算法竟然是發表於 2007 年的《Science》。

標題："Clustering by Passing Messages Between Data Points"

《科學》裡的核心概念

在開始談 AP clustering 之前，先聊聊知名且常被使用的 K-centers clustering 演算法。K-centers 演算法會從一些隨機取得的 "exemplars" (被當做 center 的 data points)開始，並且迭代調整更新每個集合以減少整體的 suqared error。這個方法對於初始選擇的 exemplars 很敏感，所以通常需要更換初始選擇並且重跑許多次以獲得好的解答。然而這樣做通常也只有在 clusters 的數目少且恰好初始時選的種子 (seeds) 夠好才會趨近好的解答。

他們的方法中的一個創舉是，「同時考慮全部的資料點為可能的 exemplars」！藉由點與點之間的遞迴地交換資訊 ─ "responsibilities" 與 "availavilities"，讓各個節點能夠知道誰是最適合自己的 exemplar 以及自己有多大程度適合當 exemplar。

真是妙極了。

若想要知道較多實作的細節，不妨查閱原始論文。如果您偏好分析程式碼， AP cluster 也有 Matlab 的實作版本。在程式裡頭的註解檔，有交待了演算法的整體概念，比如：

Each cluster is represented by a data point called a cluster center, and the method searches for clusters so as to maximize a fitness function called net similarity. The method is iterative and stops after maxits iterations.

For N data points, there may be as many as N^2-N pair-wise similarities (note that the similarity of data point i to k need not be equal to the similarity of data point k to i). These may be passed to APCLUSTER in an NxN matrix s, where s(i,k) is the similarity of point i to point k.

APCluster：R 套件

AP clustering 演算法也被實作成了 R 的套件，發表在 2011 年的 Bioinformatics。

標題：APCluster: an R package for affinity propagation clustering.

其實這篇論文才是我的主角。想了想後，發現序列不能以有序數值表示，代表內建的 negDistMat( ) 在分群序列上頭是沒有意義的。若參考作者們在分析蛋白質 coiled coil 所採取的方式，還得詳細了解 Carsten et al. 在 2011 的研究工作才行 (似乎不脫離 BLAST pair-wise alignment 的計算分數)。

程式擁有其他內建的函式，能將資料視覺化

參考文件
APCluster 的參考文件有很多，底下僅列舉一些：

APCluster - An R Package for Affinity Propagation Clustering
http://www.bioinf.jku.at/software/apcluster/
R Graphical Manual : Affinity Propagation Package
http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=apcluster:apcluster-package
apcluster in R CRAN
http://cran.r-project.org/web/packages/apcluster/index.html

取火之路

Navigation

值得注目的 affinity propagation clustering

0 意見:

張貼留言

Popular Posts

Blog Archive

Categories