Clustering
CS 4333 Data Mining - IMD 1
Cluster Analysis
Imelda Atastina
CS 4333 Data Mining
Definition
Group objects into the same cluster when they are "similar" (related or close to each other), and place them in different clusters when they are "different".
Inter-cluster distance: maximize
Intra-cluster distance: minimize
Uses
Understanding: learn the characteristics of objects that fall into the same group. Examples: proteins that share the same function, stocks whose prices fluctuate in the same way.
Summarization: reduce the data so that only a smaller dataset has to be processed.
What Is Not Clustering
Classification (supervised classification): class label information is available.
Simple segmentation: for example, dividing student registrations into groups by their ranking on the entrance exam.
Query results: the grouping follows an external specification.
Graph partitioning
Types of Clustering
Hierarchical vs. partitional
Exclusive vs. overlapping vs. fuzzy
Complete vs. partial
Types of Clusters
Well-separated
Prototype-based
Graph-based
Density-based
Shared-property
K-Means
K-means is a partitional clustering algorithm.
Each cluster is associated with a center point (centroid).
Each point is assigned to the cluster with the nearest centroid.
The number of clusters, K, must be chosen in advance.
The basic K-means algorithm is very simple.
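K-Means Algorithm
1. Select K points as the initial centroids.
2. Repeat
3. Assign each point to the cluster with the nearest centroid.
4. Recompute the centroid of each cluster.
5. Until the centroids no longer change.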
Example
Instance X Y
1 1.0 1.5
2 1.0 4.5
3 2.0 1.5
4 2.0 3.5
5 3.0 2.5
6 5.0 6.0
Example (cont'd)
Choose K = 2.
Choose instance 1 (1.0, 1.5) as the initial centroid of cluster 1 and instance 3 (2.0, 1.5) as the initial centroid of cluster 2.
Compute the (Euclidean) distance from each point to the chosen centroids:
Instance C1 C2
1 0 1
2 3 3.16
3 1 0
4 2.24 2
5 2.24 1.41
6 6.02 5.41
Example (cont'd)
Cluster 1 contains instances 1, 2. Cluster 2 contains instances 3, 4, 5, 6.
Recompute the centroid of each cluster: C1 = (1, 3) and C2 = (3, 3.375).
Then compute the distance from each instance to the new centroids:
Instance C1 C2
1 1.5 2.74
2 1.5 2.29
3 1.8 2.125
4 1.12 1.01
5 2.06 0.875
6 5 3.3
Example (cont'd)
Cluster 1 contains instances 1, 2, 3. Cluster 2 contains instances 4, 5, 6.
Recompute the centroids: C1 = (1.33, 2.5) and C2 = (3.33, 4).
Compute the distance from each point to the new centroids, and so on.
Stop when C1 and C2 no longer change (or when the change falls below a chosen threshold).
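
The iterations of this example can be reproduced with a short script. This is a minimal sketch, not the course's own implementation: it assumes Euclidean distance, the six instances and the two initial centroids from the example, and a hypothetical helper named `kmeans`.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Basic K-means with Euclidean distance.

    Returns the final labels, the final centroids, and a history of
    (labels, centroids) recorded after each update step.
    """
    labels, history = None, []
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep the centroid of an empty cluster
        centroids = updated
        history.append((labels, centroids))
    return labels, centroids, history

# Instances 1-6 from the example table.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]
# Initial centroids: instance 1 and instance 3.
labels, centroids, history = kmeans(points, [points[0], points[2]])
for step, (lab, cen) in enumerate(history, start=1):
    print(step, lab, [tuple(round(v, 2) for v in c) for c in cen])
```

Under these assumptions the first two entries of `history` match the centroids computed by hand, (1, 3) and (3, 3.375), then (1.33, 2.5) and (3.33, 4). The loop keeps iterating after that until no instance changes cluster, which is the stopping rule stated in the example.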
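SSE
SSE = Sum of Squared Errors
It is used to decide which clustering result is better when different centroid initializations are tried.

$$\mathrm{SSE} = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2$$

$$c_i = \frac{1}{m_i} \sum_{x \in C_i} x$$

Here $C_i$ is the i-th cluster, $c_i$ is its centroid, and $m_i$ is the number of points in $C_i$.
•The smaller the SSE, the better.
•One way to reduce the SSE is to increase K.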
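
As an illustration of the SSE definition, the sketch below evaluates it for a few partitions of the six instances from the K-means example. The function names `centroid` and `sse` and the choice of partitions are assumptions made for this illustration only.

```python
import math

def centroid(cluster):
    """Mean of the points in one cluster (c_i in the SSE formula)."""
    m = len(cluster)
    return tuple(sum(coord) / m for coord in zip(*cluster))

def sse(clusters):
    """Sum over clusters of squared Euclidean distances to the centroid."""
    total = 0.0
    for cluster in clusters:
        c = centroid(cluster)
        total += sum(math.dist(c, x) ** 2 for x in cluster)
    return total

# Instances 1-6 from the K-means example.
points = [(1.0, 1.5), (1.0, 4.5), (2.0, 1.5), (2.0, 3.5), (3.0, 2.5), (5.0, 6.0)]

split_a = [points[:3], points[3:]]    # {1,2,3} and {4,5,6}, K = 2
split_b = [points[:5], points[5:]]    # {1,...,5} and {6},   K = 2
singletons = [[p] for p in points]    # every point alone,   K = 6

print(round(sse(split_a), 2), round(sse(split_b), 2), sse(singletons))
```

With K = 2 the two partitions give SSE values of about 17.83 and 9.6, so the second is preferred. With K = 6 the SSE drops to 0, which shows why a lower SSE obtained only by raising K does not by itself mean a better clustering.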
Agglomerative Hierarchical Clustering
Produces nested clusters that can be represented as a hierarchy tree.
The result can also be drawn as a dendrogram.
[Figure: nested clusters of points 1 to 6 and the corresponding dendrogram; leaf order 1, 3, 2, 5, 4, 6; merge heights from 0 to 0.2]
Types of Hierarchical Clustering
Agglomerative: start with every point as its own cluster, then merge clusters step by step until only one cluster remains.
Divisive: the reverse of agglomerative. Start with all points in a single cluster, then split step by step until every cluster contains a single point.
Basic Agglomerative Hierarchical Clustering Algorithm
1. Compute the proximity matrix, if necessary.
2. Repeat
3. Merge the closest two clusters.
4. Update the proximity matrix to reflect the proximity between the new cluster and the original clusters.
5. Until only one cluster remains.
* The key question is how to compute the distance between two clusters.
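
The five steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function name `agglomerative`, the `combine` parameter, and the one-dimensional toy data are hypothetical and not taken from the slides. Passing `min` or `max` as `combine` updates the proximity matrix the way MIN (single link) and MAX (complete link) do; group average would also need the cluster sizes and is not covered here.

```python
def agglomerative(matrix, combine=min):
    """Agglomerative clustering on a square distance matrix.

    Returns the list of merges as (cluster_a, cluster_b, height).
    """
    n = len(matrix)
    clusters = [(i,) for i in range(n)]
    # Step 1: proximity "matrix" keyed by unordered pairs of clusters.
    prox = {
        frozenset((clusters[i], clusters[j])): matrix[i][j]
        for i in range(n) for j in range(i + 1, n)
    }
    merges = []
    while len(clusters) > 1:                  # steps 2 and 5
        pair = min(prox, key=prox.get)        # step 3: closest two clusters
        a, b = sorted(pair)
        height = prox[pair]
        merged = tuple(sorted(a + b))
        clusters = [c for c in clusters if c not in (a, b)]
        # Step 4: drop entries of a and b, add entries for the merged cluster.
        updated = {k: v for k, v in prox.items() if a not in k and b not in k}
        for c in clusters:
            updated[frozenset((merged, c))] = combine(
                prox[frozenset((a, c))], prox[frozenset((b, c))]
            )
        prox = updated
        clusters.append(merged)
        merges.append((a, b, height))
    return merges

# Hypothetical toy data: five points on a line (indices 0-4).
xs = [0, 1, 5, 7, 20]
toy = [[abs(a - b) for b in xs] for a in xs]

single = agglomerative(toy, combine=min)
complete = agglomerative(toy, combine=max)
print([h for _, _, h in single])
print([h for _, _, h in complete])
```

Both runs merge the same pairs on this toy data, but the merge heights differ because the distance between two clusters is defined differently.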
How to Define Inter-Cluster Similarity
[Figure: points p1 to p5 and their proximity matrix; which similarity should be used between two clusters?]
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Example
Given the following data table and the matrix of pairwise distances:
Point X Y
1 0.40 0.53
2 0.22 0.38
3 0.35 0.32
4 0.26 0.19
5 0.08 0.41
6 0.45 0.30
Point 1 2 3 4 5 6
1 0.00 0.24 0.22 0.37 0.34 0.23
2 0.24 0.00 0.15 0.20 0.14 0.25
3 0.22 0.15 0.00 0.15 0.28 0.11
4 0.37 0.20 0.15 0.00 0.29 0.22
5 0.34 0.14 0.28 0.29 0.00 0.39
6 0.23 0.25 0.11 0.22 0.39 0.00
Hierarchical Clustering: MIN
[Figure: nested clusters and dendrogram for MIN; dendrogram leaf order 3, 6, 2, 5, 4, 1; merge heights from 0 to 0.2]
Example: computing the distance between clusters (MIN)
Dist({3,6},{2,5}) = min(dist(3,2), dist(3,5), dist(6,2), dist(6,5)) = min(0.15, 0.28, 0.25, 0.39) = 0.15
Hierarchical Clustering: MAX
[Figure: nested clusters and dendrogram for MAX; dendrogram leaf order 3, 6, 4, 1, 2, 5; merge heights from 0 to 0.4]
Example: computing the distance between clusters (MAX)
Dist({3,6},{4}) = max(dist(3,4), dist(6,4)) = max(0.15, 0.22) = 0.22
Dist({3,6},{2,5}) = max(dist(3,2), dist(3,5), dist(6,2), dist(6,5)) = max(0.15, 0.28, 0.25, 0.39) = 0.39
Dist({3,6},{1}) = max(dist(3,1), dist(6,1)) = max(0.22, 0.23) = 0.23
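
The MIN and MAX distances in these examples can be checked by looking each pair up in the proximity matrix, which avoids pairing a value with the wrong points. The helper names `dist` and `cluster_distance` below are assumptions for this sketch; the matrix is the one given in the example.

```python
# Proximity matrix from the example (rows and columns are points 1..6).
D = [
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
]

def dist(p, q):
    """Distance between points p and q, using the 1-based labels of the example."""
    return D[p - 1][q - 1]

def cluster_distance(a, b, linkage):
    """Apply min (MIN, single link) or max (MAX, complete link) over all cross pairs."""
    return linkage(dist(p, q) for p in a for q in b)

print(cluster_distance([3, 6], [2, 5], min))  # MIN example
print(cluster_distance([3, 6], [4], max))     # MAX examples
print(cluster_distance([3, 6], [2, 5], max))
print(cluster_distance([3, 6], [1], max))
```

Of the three MAX distances, the one between {3,6} and {4} is the smallest, which is consistent with point 4 sitting next to 3 and 6 in the MAX dendrogram.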
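Cluster Similarity: Group Average
The proximity of two clusters is the average of the pairwise proximities between points taken from the two different clusters:

$$\mathrm{proximity}(Cluster_i, Cluster_j) = \frac{\sum_{p_i \in Cluster_i,\; p_j \in Cluster_j} \mathrm{proximity}(p_i, p_j)}{|Cluster_i| \times |Cluster_j|}$$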
Hierarchical Clustering: Group Average
[Figure: nested clusters and dendrogram for Group Average; dendrogram leaf order 3, 6, 4, 1, 2, 5; merge heights from 0 to 0.25]
Example: computing the distance between clusters (Group Average)
dist({3,6,4},{1}) = (0.22 + 0.37 + 0.23) / (3 × 1) ≈ 0.27
dist({2,5},{1}) = (0.24 + 0.34) / (2 × 1) = 0.29
dist({3,6,4},{2,5}) = (0.15 + 0.28 + 0.25 + 0.39 + 0.20 + 0.29) / (3 × 2) = 0.26
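
The group average computation can be sketched the same way. The names `dist` and `group_average` are assumptions for this illustration, and the values are computed from the rounded matrix entries of the example, so the last digit may differ from results based on unrounded distances.

```python
from itertools import combinations

# Proximity matrix from the example (rows and columns are points 1..6).
D = [
    [0.00, 0.24, 0.22, 0.37, 0.34, 0.23],
    [0.24, 0.00, 0.15, 0.20, 0.14, 0.25],
    [0.22, 0.15, 0.00, 0.15, 0.28, 0.11],
    [0.37, 0.20, 0.15, 0.00, 0.29, 0.22],
    [0.34, 0.14, 0.28, 0.29, 0.00, 0.39],
    [0.23, 0.25, 0.11, 0.22, 0.39, 0.00],
]

def dist(p, q):
    """Distance between points p and q, using the 1-based labels of the example."""
    return D[p - 1][q - 1]

def group_average(a, b):
    """Average of all pairwise distances between clusters a and b."""
    return sum(dist(p, q) for p in a for q in b) / (len(a) * len(b))

clusters = [[3, 6, 4], [2, 5], [1]]
for a, b in combinations(clusters, 2):
    print(a, b, round(group_average(a, b), 2))

# The pair with the smallest group average is merged next.
closest = min(combinations(clusters, 2), key=lambda ab: group_average(*ab))
print(closest)
```

With the rounded entries the first distance evaluates to 0.82 / 3, about 0.273. The smallest of the three is the distance between {3,6,4} and {2,5}, so those two clusters are merged next and point 1 joins last.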
Hierarchical Clustering: Group Average
A compromise between MIN and MAX (single linkage and complete linkage).
Strength: less susceptible to noise and outliers.
Limitation: biased towards globular clusters.
Density-Based Clustering
Parameters:
Eps: the maximum radius of a point's neighborhood.
MinPts: the minimum number of points required within the Eps-neighborhood of a point.
Eps-neighborhood: $N_{Eps}(p) = \{ q \in D \mid dist(p, q) \le Eps \}$
Directly density-reachable: a point p is directly density-reachable from a point q with respect to Eps and MinPts if
1) $p \in N_{Eps}(q)$, and
2) q satisfies the core point condition, $|N_{Eps}(q)| \ge MinPts$.
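
These definitions translate almost directly into code. The sketch below is illustrative only: the function names and the six sample points are hypothetical, distances are Euclidean, and a point counts as a member of its own Eps-neighborhood because dist(p, p) = 0.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): all points of the dataset within distance eps of p."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, q, eps, min_pts):
    """Core point condition: |N_Eps(q)| >= MinPts."""
    return len(eps_neighborhood(points, q, eps)) >= min_pts

def directly_density_reachable(points, p, q, eps, min_pts):
    """True if p lies in N_Eps(q) and q is a core point."""
    return math.dist(p, q) <= eps and is_core(points, q, eps, min_pts)

# Hypothetical data: a small dense group around the origin plus one point at its edge.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5), (1.4, 0.0)]
eps, min_pts = 1.0, 5

edge, inner = (1.4, 0.0), (0.5, 0.0)
print(directly_density_reachable(points, edge, inner, eps, min_pts))
print(directly_density_reachable(points, inner, edge, eps, min_pts))
```

The two calls give different answers because the relation is not symmetric: the edge point is within Eps of a core point, but it is not a core point itself, so nothing is directly density-reachable from it.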
[Figure: points p and q, with MinPts = 5 and Eps = 1 cm]
Density-Based Clustering
Density-reachable: a point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$.
Density-connected: a point p is density-connected to a point q with respect to Eps and MinPts if there is a point o such that both p and q are density-reachable from o with respect to Eps and MinPts.
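[Figure: a chain q, p1, p illustrating density-reachability, and points p and q both reachable from o illustrating density-connectivity]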
DBSCAN: Density-Based Spatial Clustering of Applications with Noise
A cluster is defined as a maximal set of density-connected points.
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN Algorithm
Arbitrarily select a point p.
Retrieve all points density-reachable from p with respect to Eps and MinPts.
If p is a core point, a cluster is formed.
If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed.
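
The procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the function names, the use of -1 as the noise label, Euclidean distance, and the sample points are all choices made for this sketch.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (the point itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # not a core point; may become a border point later
            continue
        cluster += 1                   # i is a core point: a new cluster is formed
        labels[i] = cluster
        seeds = list(neighbors)
        while seeds:                   # collect everything density-reachable from i
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels

# Hypothetical data: a border point, two dense squares, and an isolated point.
points = [
    (2.2, 1.0),
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),
    (5.0, 5.0),
]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

In this sketch the first point is visited before any cluster exists and is provisionally marked as noise. It is relabeled as a border point once a core point that has it in its Eps-neighborhood is expanded, while the isolated point stays labeled as noise.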