LAPORAN AKHIR PENELITIAN DOKTOR BARU DANA ITS 2020
Transcript of LAPORAN AKHIR PENELITIAN DOKTOR BARU DANA ITS 2020
i
LAPORAN AKHIR
PENELITIAN DOKTOR BARU
DANA ITS 2020
ModifiedEnsembleLearninguntukPrediksiEksistensiInfeksiViruspadaIsolatedDioxyriboNucleidAcid(DNA)
Tim Peneliti :
Dr. Berlian Al Kindhi (Dept. Teknik Elektro Otomasi/ Fak. Vokasi)
Prof. Dr. Mauridhi Hery Purnomo (Dept. Teknik Komputer/FTEIC)
Ir. Joko Susila (Dept. Teknik Elektro Otomasi/Fak. Vokasi)
FTSPK)
DIREKTORAT RISET DAN PENGABDIAN KEPADA MASYARAKAT
INSTITUT TEKNOLOGI SEPULUH NOPEMBER
SURABAYA
2020
Sesuai Surat Perjanjian Pelaksanaan Penelitian No: 853/PKS/ITS/2020
Daftar Isi
Daftar Isi ........................................................................................................................................................ 2
Daftar Tabel ..................................................................................................................................................... 3
Daftar Gambar ................................................................................................................................................. 4
Daftar Lampiran ............................................................................................................................................. 5
BAB I RINGKASAN .................................................................................................................................. 13
BAB II HASIL PENELITIAN ..................................................................................................................... 14
2.1. Teori Penunjang ............................................................................................................................ 14
2.2 Peta Jalan Penelitian ...................................................................................................................... 19
2.3. Tahapan Penelitian .................................................................................................................. 22
2.4. Hasil yang telah dicapai ........................................................................................................... 24
2.5. DNA COVID-19 ..................................................................................................................... 26
BAB III STATUS LUARAN ....................................................................................................................... 27
BAB V KENDALA PELAKSANAAN PENELITIAN .............................................................................. 28
BAB VI RENCANA TAHAPAN SELANJUTNYA .................................................................................. 29
BAB VII DAFTAR PUSTAKA .................................................................................................................. 30
BAB VIII LAMPIRAN ................................................................................................................................ 31
Daftar Tabel
Tabel 1. State of the art dalam lingkup penelitian yang dikembangkan ............................................... 15 Tabel 2. Peta jalan penelitian pada lingkup penelitian yang sebidang yang pernah dilakukan ......... 20
Daftar Gambar
Gambar 1. Maturity level di penelitian bidang DNA khususnya pada DNA HCV ................... 14 Gambar 2. Rencana kegiatan riset selama satu tahun ............................................................... 22
Daftar Lampiran
Lampiran 1. Tabel Daftar Luaran ................................................................................................................ 32 Lampiran 2. Luaran Wajib Jurnal Internasional .......................................................................................... 34 Lampiran 3. Luaran tambahan draft paper international conference ........................................................... 35 Lampiran 4. Draft Paten ............................................................................................................................... 36
13
BAB I RINGKASAN
Penyakit Tropis adalah penyakit yang hanya terjadi pada daerah tropis dan atau ekivalensi
dengan peluang kemunculannya lebih besar terjadi pada daerah tropis (Airlangga, 2012). Bakteri pembawa
penyakit tersebut mencakup agen infeksi yang multi resistance dan atau transibility (mudah menular).
Hepatitis C Virus (HCV) merupakan salah satu jenis penyakit yang peluang penularannya mayoritas di
daerah tropis (penyakit tropis). Saat ini belum ada vaksin yang secara mutlak dapat digunakan untuk
mencegah Hepatitis C karena virus ini secara genetik amat variatif (subtype genome) dan memiliki
angka mutasi tinggi, sehingga memungkinkan generasi virus yang beraneka ragam. Menurut WHO,
angka kematian akibat infeksi HCV cukup tinggi, yaitu mencapai 399 ribu jiwa per tahun. Indonesia
merupakan salah satu negara yang memiliki jumlah pasien terinfeksi HCV tertinggi di Asia. Penyakit
ini sebagian besar menjangkit di daerah tropis namun tidak menutup kemungkinan terdapat carier agent
yang mampu menularkan penyakit hingga ke berbagai benua. Pada penelitian ini, diusulkan sebuah
perancangan dataset yang sesuai untuk analisa DNA serta usulan metode ensemble learning yang sesuai.
Namun sebelum melakukan proses clustering dan prediksi, perlu dilakukan normalisasi dengan semantic
similarity. Data sampel yang digunakan adalah 1000 data isolated DNA yang terdiri dari 500 isolated
DNA homo sapiens yang positif terinfeksi HCV dan 500 isolated DNA homo sapiens negatif HCV.
1000 data tersebut dihitung jarak kemiripannya menggunakan metode Edit Levensthein Distance. Hasil
dari penghitungan Edit Levensthein Distance kemudian dimasukkan ke dalam matriks sebagai variabel.
Matriks tersebut adalah input data pada proses prediksi menggunakan Ensenmble learning.
Pada Penelitian ini kami mengusulkan sebuah pendekatan memprediksi adanya suatu virus
dalam DNA manusia dengan Machine Learning. Dari hasil penelitian ini, diharapkan mampu
memberikan analisa perubahan genetik DNA khususnya pada DNA yang terinfeksi HCV (sebagai
data sampel) dan hasilnya dapat dimanfaatkan oleh dunia kedokteran sebagai evaluasi pembuatan vaksin.
14
BAB II HASIL PENELITIAN
2.1. Teori Penunjang
Saat ini belum ada vaksin atau penawar yang dapat mengobati infeksi HCV secara mutlak, walaupun
berbagai penelitian telah dan sedang dilakukan untuk mendapatkan vaksin tersebut. Penelitian tersebut
dilakukan baik dari segi biologi kedokteran maupun bioinformatika. Penelitian ini menggabungkan analisa
uji dari dua keilmuan tersebut dengan menggunakan data isolated DNA dan teknologi machine learning.
Gambar 1. Maturity level di penelitian bidang DNA khususnya pada DNA HCV
Tingkatan penelitian pada bidang DNA HCV atau yang disebut dengan maturity level dapat diamati
pada Gambar 1, dimana penelitian yang dilakukan ini menuju ke arah vaksin (maturity level 3). Untuk
dapat menemukan vaksin beberapa tahap penelitian pendahuluan dilakukan yaitu dengan melakukan
analisa clustering dan prediksi eksitensi HCV dalam suatu isolated DNA, dalam arti lain, penelitian ini
adalah pendahuluan yang menuju ke arah maturity level 3.
15
Tabel 1. State of the art dalam lingkup penelitian yang dikembangkan
Topik No Judul Hasil
DNA HCV
1 Sandra Iurecia et. al., “Epitope-driven DNA vaccine design employing immunoinformatics against B-cell lymphoma: A biotech's challenge” ,2011
Menghasilkan model vaksin kanker berbasis epitope DNA yang dapat memungkinkan membangun plasmid dari beberapa epitope imunogenetik
2 Hemiyanti Emmy, “Biologi Molekul Virus”, Pasca Sarjana Universitas Padjajaran, 2012
Penjelasan mengenai pola mutasi dari molecular virus, tempat hidupnya dan pola perkembang biakannya.
3 Grey Rebecca R., et al., “Evolutionary analysis of hepatitis C virus , 2013
Menganalisis dua urutan HCV subgenomic diperoleh dari individu yang terinfeksi di 1953, yang merupakan bukti genetik tertua infeksi HCV. Metodenya adalah dengan memasangkan keragaman genetik antara dua sekuens sehingga menunjukkan substansial periode penularan HCV sebelum tahun 1950-an, dan masuknya virus tersebut dalam evolusi analisis memberikan perkiraan baru dari nenek moyang HCV di Amerika Serikat. Memperkirakan bahwa saat awal mula munculnya HCV subtipe 1b di Amerika Serikat terjadi sekitar tahun 1901 (1874- 1926), yang berarti perkiraan ini konsisten dengan perkiraan sebelumnya. Namun, analisis ini memberikan hasil CI yang tinggi daripada yang dilaporkan sebelumnya untuk subtipe 1b yang menggunakan dua wilayah subgenomik (1905-1965 dan 1806-1959;). Selain itu hasil penelitian ini mencerminkan informasi meningkat diperoleh dari menggunakan seluruh genom urutan referensi dan dari masuknya dua urutan primer yaitu pada tahun 1953 .
4 Takayakagi Toshiaki, “Modeling chronic hepatitis B or C virus infection during antiviral therapy using an analogy to enzyme kinetics:
Model dasar untuk virus hepatitis B kronis (HBV) atau virus Hepatitis C (HCV) selama terapi memungkinkan kita untuk menganalisis kinetika virus jangka pendek. Namun, model ini tidak berguna untuk menganalisis jangka panjang
16
Topik No Judul Hasil Long-term viral dynamics without
rebound and oscillation”(2013)
kinetika virus. Oeh karena itu, pada penelitian ini diusulkan model baru yang diperoleh dengan memperkenalkan Michaelis-Menten kinetika ke dalam model dasar. Model baru dapat menunjukkan kinetika virus jangka panjang tanpa Rebound dan osilasi, tidak seperti model dasar. Nilai parameter K dalam model baru analog dengan Michaelis adalah konstan dan diprediksi menjadi kurang dari sekitar 1.010 / ml.
Infrastrukt ur Sistem Pakar DNA Analisis
1 Shabut et.al., “An
intelligent mobile-
enabledd expert system
for tuberculosis disease in
real time” (2018)
Suatu expert system untuk mendiagnosa penyakit tuberculosis dengan melakukan analisa gejala-gejala secara langsung berbasis aplikasi mobile.
DNA Semantic Similarity
1 Fredonnet Julie, “Dynamic PDMS inking for DNA patterning by soft lithography”(2013)
Pencetakan microcontact (LCP) digunakan sebagai teknik pola untuk menghasilkan DNA microarray sederhana, cepat dan biaya-efektif.
2 Mika Göös,et al., “Search methods for tile sets in patterned DNA self- assembly”(2014)
Pattern self Assembly Tile set Synthesis (PATS), yang muncul dalam teori terstruktur DNA self- assembly, adalah untuk menentukan satu set coloured tiles, mulai dari struktur benih berbatasan, hingga merakit diri untuk pola warna persegi panjang yang diberikan. Tugas mencari minimum ukuran tile set dikenal NP-keras. Penelitian mengeksplorasi beberapa teknik pencarian yang lengkap dan tidak lengkap untuk menemukan minimal tile set dan juga menilai keandalan solusi yang diperoleh sesuai dengan Tile kinetik Assembly Model.
3 Fernau Henning, et al.,, “Pattern matching with variables: A multivariate complexity
Dalam DNA pattern matching terdapat banyak parameter masalah antara lain: jumlah variabel, panjang w, panjang kata-kata menggantikan variabel, jumlah kejadian per variabel,
17
Topik No Judul Hasil analysis”(2015) kardinalitas alfabet terminal dan untuk semua
kemungkinan kombinasi dari parameter (dan varian yang dijelaskan sebelumnya), penelitian ini menjawab pertanyaan apakah ada masalah atau tidak pada NP-lengkap jika parameter ini dibatasi oleh konstanta. Hasil dari penelitian menunjukkan bahwa pemberian konstanta akan memudahkan analisis DNA namun dengan adanya konstanta juga akan menurunkan tingkat sensitivitas terhadap mutasi.
Pengelom pokan DNA
1 Yilmas Kaya, Murat Uyar, “A hybrid decision support system based on rough set and extreme learning machine for diagnosis of hepatitis disease”(2013)
Mengusulkan diagnosis penyakit hepatitis menggunakan metode Rough Set dan Extreme Learning Machine (RS-ELM) dalam sebuah kumpulan data diagnosa. Hasil penelitian menunjukkan bahwa model RS-ELM 100% telah cukup sukses dibandingkan dengan metode lainnya dalam literatur
2 Boeka veselva,” Clustering approaches for dealing with multiple DNA microarray datasets”(2014)
Menggabungkan empat algoritma clusterring untuk menangani multiple gene expression matrik pada DNA Microarray. Metode cluestering tersebut adalah dua unsupervised technique berbasis integrasi informasi dan dua supervised technique yaitu menggabungkan Particle Swarm Optimization dan k-means. Hasilnya Pendekatan MapReduce Clusterring melebihi tiga algoritma pengelompokan lainnya. Selain itu, versi FCA-ditingkatkan memungkinkan untuk menganalisis lebih lanjut partisi diproduksi dan untuk mengekstrak wawasan biologis yang berharga dari data.
3 Abolfazl Doostparast Torshizi, “A new cluster validity measure based on general type-2 fuzzy sets:
Meneliti pendekatan baru di bidang General Type-2 Fuzzy Sets (GT2 FS) dan aplikasi yang dikembangkan. Pada penelitian ini telah dianalisis ukuran kesamaan berdasarkan jarak yang
Application in gene expression data clustering”(2014)
melebihi pendekatan yang ada dan mencakup sebagian besar kekurangan penelitian sebelumnya. Setelah pengujian pada beberapa dataset buatan
18
Topik No Judul Hasil dengan berbagai jumlah outlier, dengan
menggunakan tiga gen nyata ekspresi dataset dan memverivikasi kualitas terhadap sejenis pendekatan baik secara visual dan komputasi. Percobaan ini terbukti akurasi dan presisi dari metode yang telah dikembangkan.
4 Dios Fransisco.,et al., “DNA clustering and genome complexity”(2014)
Mengelompokkan DNA kompleks berdasarkan sepuluh elemen genome manusia.
5 Jamal Ade, et al., “Scalability of DNA Sequence Database on Low-End Cluster using Hadoop(2014)
Skalabilitas data sequence DNA pada world gen bank untuk di akses dan di kelompokkan menggunakan hadoop. Data diambil dari NJBI kemudian di buat sebuah arsitektur jaringan untuk clustering server data
6 Dzung Dinh Nguyen, “Towards hybrid clustering approach to data classification: Multiple kernels based interval-valued Fuzzy C- Means algorithms”(2015)
Kelemahan dari Fuzzy C-Means adalah pengelompokan dapat melibatkan berbagai fitur masukan menunjukkan dampak yang berbeda pada hasil yang diperoleh. Penelitian ini mengusulkan metode baru dari Fuzzy C-Means yaitu, komposit kernel dibangun dengan memetakan setiap fitur masukan ke ruang kernel individu dan linear menggabungkan kernel ini dengan bobot dioptimalkan dari kernel yang sesuai.
Prediksi DNA
1 Wang Hongfei, et.al., “Evaluation of an artificial neural network to ascertain why there is a high incidence of hepatitis B in the Chinese population after vaccination”(2013)
Menerapkan artificial neural network untuk menganalisa kenapa angka infeksi HBV tinggi setelah vaksin. Hasil dari neural network menunjukkan tidak ada hubungannya antara tingginya infeksi dengan vaksin.
2 Sasitorn Plakumonthon, “Computational prediction of hybridization patterns between hepatitisC viral genome and human microRNAs”(2014)
Penelitian ini mengambil beberapa human RNA (MiRNA) untuk dibandingkan dengan beberapa primer dan di prediksi apakah RNA tersebut ada kemungkinan mengidap HCV (Sasitorn Plakunmonthon, Nattanan Panjaworayan T- Thienprasert, Kritsada Khongnomnana, Yong Poovorawanc, Sunchai Payungporna, 2014)
19
Topik No Judul Hasil 3 T. Feng,et.al., “A medical
cost estimation with fuzzy neural network of acute hepatitis patients in emergencyroom”(2015)
Menerapkan FNN (Fuzzy Neural Network) untuk memprediksi biaya seorang pasien hepatitis, dengan menggunakan neuron acak yang diambil berdasarkan pasien hepatitis yang ada sebanyak 110. Hasil penelitian ini menunjukkan bahwa akurasi prediksi total biaya yang dibutuhkan oleh pasien mencapai 90%. (T. Feng, T. S. Li , P. Kuo, 2015).
4 Neelam Goel,et.al., “An improved method for splice site prediction in DNA sequences using support vector machines (2015)
Melakukan prediksi pre-messenger-RNA (pre- mRNA), untuk menentukan manakah splicing yang intron (dibuang) dan exon (bergabung) untuk berbagai tujuan ahli. Mengusulkan perbaikan, dengan menggabungkan dua metode yaitu SVM dan Markov Model (Neelam Goel, Shailendra Singh, Trilok Chand Aseri, 2015 ).
2.2 Peta Jalan Penelitian
Pada Tabel 1. Dapat diamati peta jalan riset, kebaruan dan ringkasan hasil riset yang telah dilakukan sebelumnya sehingga tergambar
riset yang diusulkan telah memiliki model/purwarupa yang telah memenuhi konsep sebagai produk/teknologi/model. Seluruh penelitian yang
disebutkan dalam matriks tabel telah dipublikasikan baik di jurnal internasional, jurnal nasional maupun seminar internasional. Kolom
matriks yang ditandai dengan warna kuning menunjukkan rencana penelitian kami tiga tahun ke depan yaitu mengenai DNA HCV.
Tabel 2. Peta jalan penelitian pada lingkup penelitian yang sebidang yang pernah dilakukan
Topik
Penelitian
Capaian
sampai 2011
2012 2013 2014 2015 2016 2017 2018 2019 2020 2020-2028
Pattern
matching and
detection
Analisa voice spectrum pada
pasien laryngectomised
dengan atau tanpa electro
larynx
(*terapan dan industri)
Vaksin
Modelling
untuk
Hepatitis C
Virus
(HCV)
Deteksi kondisi abnormal
pada pankreas beta cell
penyebab diabetes
menggunakan citra iris
(*terapan dan industri)
Identifikasi malaria pada thick blood
film berbasis genetic programming
(*terapan dan industri)
CT lung image filtering
berbasis max-tree method
(*terapan)
Analisis
metode
pattern
matching
yang sesuai
untuk DNA
(*terapan)
Modifikasi
hamming
untuk DNA
similarity
(*terapan)
Medical
clustering
Implementasi fungsi
Weighted kernel untuk
klasifikasi hyperspectral
image berbasis support
vector machine
(*terapan)
Klasifikasi
Osteoarthritis
classification
menggunakan self
organizing map
berbasis gabor
kernel dan contrast-
Classification pasien diabetes
retinopathy menggunakan
support vector machine (SVM)
berbasis citra retina digital
(*terapan dan industri)
Clustering DNA
berdasarkannilaikedekatan
dengan primer HCV.
(*terapan dan industri)
Clutering
DNA
berbasis
deep
learning
ntuk
menganalis
20
limited adaptive
histogram
equalization
(*terapan dan
industri)
a pola HCV
Analisa klasifikasi lokasi
aktifitas dominan pada otak
berbasis K-means
(*terapan)
Klasifikasi cyst dan
tumor lesion using
Support Vector
Machine
berdasarkan citra
dental panoramic
(*terapan dan
industri)
Prediction
modelling
Implementasi
recurrent
neural network
untuk time
series
forecasting
(*terapan)
Rekonstruksi citra 3D rahang
bawah berdasar fitur 2D pada citra
Xray gigi
(*terapan dan industri)
EEG Signal Identification based
on root mean square and
average power spectrum by
usingbackPropagation
(*terapan dan industri)
Hubungan antara
Electromyography Signal Of
Neck Muscle dan sinyal suara
manusia untuk mengontrol
Electrolarynx
(*terapan)
Prediksi
HCV
berbasis
SVM
(*terapan
dan
industri)
Prediksi
eksistensi
HCV pada
isolated
DNA dari
berbagai
negara di
21
seluruh
dunia.
Biomedika
Modelling dan
analisis
Analisa
dosis
vaksin
Hepatitis B
Virus
(HBV)
Prevalensi
HBV
dengan
molekular
karakteristi
k pada
wanita
hamil
Mutasi HBV pada Pre-S1
dan Pre S2
HCV dan
HBV pada
transgender
Analisa genetik Hepatitis A
Virus pada siswa tingkat
SMP
Analisa HCV pada pasien
di RSUD dr. Soetomo pada
sisi genotip
2.3. Tahapan Penelitian
Metode penelitian yang direncanakan terbagi dalam dua kelompok pekerjaan selama satu tahun,
yaitu :
Gambar 2. Rencana kegiatan riset selama satu tahun
Pembangunan tahap pertama akan berfokus pada klasifikasi kecenderungan sequence tersebut
hasil dari pengenalan pola pada pleriminary penelitian. Tahap terakhir adalah analisa trend
mutasi DNA tersebut menggunakan sistem prediksi sebagai acuan terhadap pemodelan vaksin.
22
Yangtelahdikerjakan
Tahap Tahapkedua
PengumpulandataisolatesDNA
Pengumpulandataisolatehomosapiens
learningsystemantara DNAhomosapiensdan
HCV
Pengumpulandataprimer Normalisasidatasethomosapiens K-NNLearningsystem
Normalisasidatabase Pembangunanmetode adaptivefuzzyuntukklasifikasi
MultiLayerPerceptonLearningSystem
PembanggunansistempengolahDNAdata
Pembangunanmetode assemblyclustering
Integrasihasil pengelompokandan
prediksi
Perancanganinfrastrukturuntukmachinelearning
Rekomendasiprimeryangelgible
Pembelajaranpolaberbasisdeeplearning
PrediksieksistensiHCV (virus)padaisolatedDNAdariberbagainegaradidunia
AnalisapolaprimerHCV padaDNAmanusianormal
23
Pada tahun pertama penelitian, modul utama yang dibangun adalah sebagai berikut:
1. Modul Primer Modul primer berfungsi untuk mengolah data primer, yang nantinya pada proses pengolahan data
berfungsi sebagai pattern yang dibandingkan. Data primer beserta tahun ditemukannya akan disimpan
di dalam database.
2. Modul Isolate Modul Isolate berfungsi untuk mengolah data isolate. Data isolate DNA tidak tersimpan pada
database melainkan disimpan pada file .txt. Hal ini dikarenakan untuk mengurangi beban kerja
database. Ketika sistem akan melakukan pengenalan pola, modul ini akan memanggil dan
membaca file .txt isolate. Kemudian modul ini akan memecah sequence DNA sepanjang primer
yang akan dibandingkan.
3. Modul Normalisasi Primer Modul normalisasi primer adalah modul yang berfungsi untuk melakukan normalisasi pada karakter
primer. Pada tahap ini, perlu adanya normalisasi pada primer agar penghitungan pengenalan pola
seimbang.
4. Modul Pengenalan pola Modul utama dari seluruh sistem adalah modul hamming, modul ini memiliki looping kondisi yang
berlapis. Proses pengenalan pola ini terdapat dua input yaitu sequence DNA sebagai data yang
dibandingkan dan primer sebagai patter.
5. Modul Treshold Modul terakhir yang dibutuhkan pada penelitian tahap 1 adalah modul treshold. Modul ini berfungsi
untuk menggabungkan hasil penghitungan modul pengenalan pola dan modul normalisasi primer.
Kemudian dari hasil penghitungan tersebut akan dianalisa sequence mana yang memenuhi nilai
treshold. Hanya sequence yang memenuhi nilai treshold yang akan disimpan di database, prosedur
penyimpanan tersebut tetap harus menyertakan atribut yang menempel pada sequence tersebut
diantaranya adalah kode primer, kode isolate, index sequence, tahun primer dan isolate, dan
sebagainya.
Pada Hepatitis, pada setiap isolate DNA terdiri dari 9000 hingga 15.000 sequence
DNA. Data isolate tersebut akan bertambah terus seiring dengan bertambahnya pasien yang
terinfeksi Hepatitis. Selain itu, dalam satu jenis virus Hepatitis, jumlah sequence primer
Hepatitis juga akan terus bertambah seiring dengan
24
terjadinya mutasi dari virus tersebut, setiap mutasi akan dikelompokkan ke dalam subtype
genome. Beberapa metode supervise learning akan diujikan pada tahun pertama penelitian
dengan tujuan untuk menjembatani perbedaan primer yang berdampak pada tidak berhasilnya
suatu primer mendeteksi adanya HCV pada isolated DNA di seluruh dunia. Hasil dari
penelitian tahun pertama adalah pengujian metode pattern matching yang tepat untuk
mengolah data DNA. Tahap kedua penelitian ini adalah melakukan normalisasi data dan proses
clustering. Seperti yang telah dijelaskan pada bab sebelumnya bahwa sebelum melakukan proses
clustering terlebih dahulu data dinormalisasi dan disiapkan sesuai format pada metode
clustering.
Tahapan penelitian meliputi:
- Menghitung nilai kedekatan masing-masing data isolate DNA dengan masing- masing data
primer Hepatitis
- Menghitung nilai kedekatan tahun ditemukan isolate DNA dengan masing-masing data
primer Hepatitis.
- Melakukan perbandingan kedekatan masing-masing node (hasil perhitungan pada proses
sebelumnya)
- Melakukan proses clustering.
- Analisa hasil clustering.
2.4. Hasil yang telah dicapai
Penelitian ini telah berjalan hampir empat tahun dengan hasil yang cukup signifikan, kami
terus mempelajari pola-pola mutasi dari suatu yang mengifeksi pada manusia sehingga mengubah
susunan pola DNA homo sapiens tersebut. Beberapa kemajuan yang telah kami capai telah kami
publikasikan baik di jurnal maupun konferensi interansional, yaitu:
1. Manajaemen big data DNA dengan membandingkan metode string matching (Kindhi &
Arief, 2015)
2. Perancangan inrastruktur bank DNA berbasis cloud server dan database terintegrasi (Kindhi,
Arief, & Purnomo, Prototype Infrastructure Cloud Expert System DNA Analysis
(CESDA) as the Basis of Sustainability DNA Software Improvement in Indonesia, 2017)
25
3. Modifikasi metode hamming untuk pengenalan semantic similairty pada susunan nukleotida DNA yang terinfeksi virus (Kindhi, Hendrawan, Purwitasari, Arief, & Purnomo,
4. 2017)
5. Mengusulkan suatu metode hybrid clustering yang dapat menghasilkan 3 analisa
sekaligus dalam satu kali proses clustering DNA (Kindhi, Sardjono, Purnomo, &
Verkerke, 2019)
6. Mengusulkan mengoptimasi metode Support Vector Machine (SVM) untuk memprediksi
eksistensi virus di dalam DNA (Kindhi, Arief, & Purnomo, Optimasi Support Vector
Machine (SVM) untuk memprediksi adanya mutasi pada DNA Hepatitis C Virus, 2018).
7. Mengusulkan metode Recurrent Neural Network Back Pro Pagation Through Time
(RNN_BPPT) untuk mendeteksi time series pola perubahan mutasi dari suatu virsus yang
melekat pada DNA (Kindhi, Sardjono, & Purnomo, Prediction of DNA Hepatitis C Virus
based on Recurrent Neural Network-Back Propagation Through Time (RNN-BPTT), 2019).
Pada penelitian ini kami terus memperbaiki hasil yang telah dicapai, dengan menguji coba
metode lain dalam machine learning, dan bila perlu melakukan modifikasi pada metode
tersebut agar dapat menghasilkan analisa yang maksimal.
26
2.5. DNA COVID-19 Metode prediksi mutasi suatu virus selama ini diteliti dengan pendekatan biologi
kedokteran dengan percobaan di laboratorium dengan menguji larutan PCR ke dalam sampel DNA. Pada penelitian ini kami mengusulkan sebuah pendekatan baru berbasis machine learning melalui analisa pola urutan DNA yang telah tercata dalam file isolated DNA.
Pada penelitian ini, kami melakukan uji coba penerapan machine learning pada data set COVID-19. Data set yang kami gunakan diperoleh dari Bank DNA Dunia yaitu NCBI sebanyak 20 isoalted DNA positif terinfeksi COVID-19, 20 isolated DNA positif terinfeksi MERS, dan 20 islated DNA yang positif terinfeski SARS. Total nukleotida yang dibandingkan adalah kurang lebih 500.000 nukleotida
27
BAB III STATUS LUARAN
Status tercapainya luaran wajib yang dijanjikan dan luaran tambahan (jika ada). Uraian status luaran dapat diamati pada tabel berikut:
No Luaran Judul Status
1 Paper jurnal Q2 Optimization of Machine Learning Algorithms for Predicting Infected COVID19 in Isolated DNA
Publish di IJIIES Japan
2 Paper internasional
konferensi
Ensemle learning for DNA mutation
predictions
Draft siap submit
3 Paten Perangkat cerdas untuk analisis DNA Telah dikoreksi oleh
reviewer ITS dalam
tahap perbaikan dan
resubmit ke ITS
28
BAB V KENDALA PELAKSANAAN PENELITIAN
Kendala selama melaksanakan penelitian adalah terbatasnya data DNA yang positif COVID-19 dari berbagai negara, pada bank DNA dunia sebagian besar data DNA yang telah didaftarkan adalah DNA dari negara China dan Amerika Serikat sehingga menyulitkan proses analisa.
29
BAB VI RENCANA TAHAPAN SELANJUTNYA
Rencana penyelesaian penelitian dan rencana untuk mencapai luaran yang dijanjikan adalah
memenuhi luaran tambahan yaitu manuscript pada conference internasional dan menyelesaikan paten yang telah mendapat masukan dari tim reviewer paten di ITS.
30
BAB VII DAFTAR PUSTAKA
Airlangga, U. (2012). Penyakit Tropis Ilmu Ilmiah Dasar. Surabaya: Universitas Airlangga. WHO.
(2017). Global Hepatitis Report 2017. Geneva: Licence: CC BY-NC-SA 3.0 IGO.
Juniastuti et al. (2014). High Rate of Seronegative HCV infection in HIV-Positive Patients.
Biomedical Reports, 2, 79-84.
Bin Liu; Shanyi Wang; Qiwen Dong; Shumin Li; Xuan Liu. (2016). Identification of DNA- Binding
Proteins by Combining Auto-Cross Covariance Transformation and Ensemble Learning. IEEE
Transactions on NanoBioscience, 15(4), 328-334.
Kindhi, B. A., & Arief, S. T. (2015). Pattern Matching Performance Comparison as Big Data Analysis
Recomendations for Hepatitis C Virus (HCV) Sequence DNA. IEEE International Conference
of Artificial Intelligence and Modelling System (pp. 155-160). Kinabalu, Malaysia: IEEE.
Kindhi, B. A., Hendrawan, A., Purwitasari, D., Arief, S. T., & Purnomo, M. H. (2017). Distance- based
Pattern Matching of DNA Sequences for Evaluating Primary Mutation. IEEE International
Conference of ICITESE (p. 200). Jogjakarta: IEEE.
Kindhi, B. A., Arief, S. T., & Purnomo, M. H. (2017). Prototype Infrastructure Cloud Expert System
DNA Analysis (CESDA) as the Basis of Sustainability DNA Software Improvement in
Indonesia. IEEE International Conference of European Modelling Symposium (EMS).
Manchester, United Kingdom: IEEE.
Kindhi, B. A., Sardjono, T. A., Purnomo, M., & Verkerke, G. (2019). Hybrid K-Means, Fuzzy C-Means,
and Hierarchical Clustering (KFHC) for DNA Hepatitis C Virus (HCV) Trend Mutation Analysis.
Expert System with Application (ESwA),, 122.
Kindhi, B. A., Sardjono, T. A., & Purnomo, M. H. (2019). Prediction of DNA Hepatitis C Virus based on
Recurrent Neural Network-Back Propagation Through Time (RNN-BPTT). 3rd IEEE
ICAMIMIA. Batu.
Kindhi, B. A., Arief, S. T., & Purnomo, M. H. (2018). Optimasi Support Vector Machine (SVM) untuk
memprediksi adanya mutasi pada DNA Hepatitis C Virus. Journal of JNTETI, 7(3).
31
BAB VIII LAMPIRAN
32
Lampiran 1. Tabel Daftar Luaran
LAMPIRAN 1 Tabel Daftar Luaran Program : Penelitian Doktor Baru Nama Ketua Tim : Berlian Al Kindhi Judul : Modified Ensemble Learning untuk Prediksi Eksistensi Infeksi Virus pasa Isolated Dioxyribo Nucleid Acid (DNA)
1.Artikel Jurnal
No Judul Artikel Nama Jurnal Status Kemajuan*)
1 Optimization of Machine Learning Algorithms for Predicting Infected COVID19 in Isolated DNA
International Journal of Intelligent Engineering Systems (http://www.inass.org/Volume2020.html )
Published
*) Status kemajuan: Persiapan, submitted, under review, accepted, published
2. Artikel Konferensi No Judul Artikel Nama Konferensi (Nama
Penyelenggara, Tempat, Tanggal)
Status Kemajuan*)
1 Ensemble learning for DNA mutation prediction
- Siap submit
*) Status kemajuan: Persiapan, submitted, under review, accepted, presented
3. Paten
No Judul Usulan Paten Status Kemajuan 1 Parangkat Cerdas untuk Analisis DNA Sudah di revisi reviewer
paten ITS dan siap resubmit ke ITS
*) Status kemajuan: Persiapan, submitted, under review
4. Buku
No Judul Buku (Rencana) Penerbit Status Kemajuan*)
*) Status kemajuan: Persiapan, under review, published
33
5. Hasil Lain
No Nama Output Detail Output Status Kemajuan*)
*) Status kemajuan: cantumkan status kemajuan sesuai kondisi saat ini
1.Disertasi/Tesis/Tugas Akhir/PKM yang dihasilkan
No Nama Mahasiswa NRP Judul Status*)
*) Status kemajuan: cantumkan lulus dan tahun kelulusan atau in progress
34
Lampiran 2. Luaran Wajib Jurnal Internasional
Received: May 12, 2020. Revised: June 5, 2020. 423
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
Optimization of Machine Learning Algorithms for Predicting Infected COVID-
19 in Isolated DNA
Berlian Al Kindhi1*
1Institut Teknologi Sepuluh Nopember, Indonesia * Corresponding author’s Email: [email protected]
Abstract: The stipulation of the COVID-19 (Corona Virus Disease 2019) as a global pandemic by the WHO (World Health Organization) made a number of countries lockdown. Countries like Italy, Denmark, China, and Ireland have taken lockdown steps to prevent this disease from spreading and taking many lives. COVID-19, SARS (Severe Acute Respiratory Syndrome), and MERS (Middle-East Respiratory Syndrome) are viral infections in the respiratory tract that can be fatal. SARS first became an epidemic in China in 2002, while MERS first appeared in the Middle East in 2012. At the end of 2019, a new disease appeared in China called COVID-19. These three viruses are still in the same family so they have very similar nucleotide sequences. The tested COVID-19 primer was able to adhere well with a similarity level of more than 70% in all DNA SARS and MERS isolates tested. To distinguish DNA samples between MERS, SARS, and COVID-19 using the basic local alignment sequence nucleotide approach alone is not enough. We propose an optimization of machine learning methods to predict the COVID-19, the optimization method depends on the method we improved. In Discriminant Analysis, we use Wilks Lamda's approach and change Linear into Diagonal Discriminant Matrix. In the Decision Tree method, we make optimization by making gain formulation to minimize the entropy value to get more information on the result. We optimized K-NN with add weighted distance optimization, and in SVM we try several kernels and optimize the hyperplane with SRM (Structural Risk Minimization) approach to looking for the best result. Besides that, in preparation for input features, we use Edit Levenshtein Method with the calculation of the optimum similarity from each DNA sequence. The results of our test, optimization of the Decision Tree method produces an accuracy of 98.3%, optimization of Discriminant Analysis 98.3%, and optimization of SVM and KNN 100% respectively. We also find a fact in the DNA Alignment process, when the primer being compared is 'R', the nucleotides in the COVID-19 sample data are always 'A' and this approach from the bioinformatic side can be used as analytical material in the medical world.
Keywords: COVID-19, Discriminant analysis, K-NN, Decision Tree, SVM, DNA.
1. Introduction
Since the discovery of a new type of Coronavirus at the end of 2019 which call COVID-19, the number of infected patients has increased significantly by March 2020. US reports the largest number of deaths worldwide, followed by Italy. This study conducts trials and analysis of the proximity of MERS, COVID-19, and SARS in terms of DNA nucleotide patterns that can be used as decision support in biomedical research. The incubation period is the time needed by germs to multiply in a person's body to cause complaints. In other words, the incubation
period is the time span between the occurrence of infection and the appearance of symptoms [1]. Although the viruses COVID-19, SARS, and MERS are from the same family of viruses, namely coronavirus, these three diseases have different incubation periods, for SARS disease is 1–14 days (average 4-5 days). The incubation period for MERS disease is 2–14 days (average 5 days), while the incubation period for COVID-19 is 1–14 days, with an average of 5 days.
These three diseases can cause fever, cough, sore throat, nasal congestion, weakness, headaches, and muscle aches. If it gets worse, the symptoms of the
Received: May 12, 2020. Revised: June 5, 2020. 424
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
three can resemble pneumonia. The big difference between these three diseases is that COVID-19 is rarely accompanied by colds and digestive complaints, such as bowel movements, nausea, and vomiting. The spread of coronavirus from animals to humans is actually very rare, but this is what happened to COVID-19, SARS, and MERS. Humans can get the coronavirus through direct contact with animals infected with this virus. This method of transmission is called zoonotic transmission [2].
SARS is known to be transmitted from mongoose to humans and MERS is transmitted from humped camels. While in COVID-19, there are allegations that the animal that first transmitted the disease to humans was a bat. A person can become infected with the Coronavirus if he inhales a splash of saliva released by a COVID-19 sufferer when sneezing or coughing. Not only that, but transmission can also occur if someone holds an object that has been contaminated with COVID-19 saliva splashes and then holds the nose or mouth without washing hands first. SARS and COVID-19 are known to spread more easily from human to human than MERS [3]. And when compared with SARS, the transmission of COVID-19 from human to human is easier and faster. So far, the death rate from COVID-19 is not higher than SARS and MERS. The SARS death rate reaches 10%, while MERS reaches 37%. However, the transmission of COVID-19 which is faster than SARS and MERS cause the number of sufferers of this disease to increase sharply in a short time. So far, there is no proven drug that is effective in dealing with COVID-19 [4]. Several antiviral drugs, such as oseltamivir, cloroquine, lopinavir, and ritonavir, have been tried to be given to COVID-19 patients while continuing to be studied. Whereas in SARS and MERS, administration of lopinavir, ritonavir, and the latest broad-spectrum antiviral drug called Remdesivir has been proven effective as a treatment. In patients with Coronavirus infection with severe symptoms, in addition to antiviral drugs, they also need to get fluid therapy (infusion), oxygen, antibiotics, and other medicines according to symptoms that appear. Patients with COVID-19 also need to be treated in the hospital so that their condition can be monitored and not transmit the infection to others [5].
In this study, we compared the similarity patterns of the SARS and MERS nucleotide structures with COVID-19 to determine the similarity of the nucleotides with the bioinformatic approach. The data we used consisted of 20 COVID-19 DNA samples, 20 SARS DNA samples, 20 MERS DNA samples, and primers from COVID-19. The three types of DNA samples tested have a short enough
distance or in other words have a high enough similarity value when compared to the Primary COVID-19. So if we detect the presence of a coronavirus simply by matching a DNA sample with a COVID-19 primer, then all DNA samples, both SARS and MERS, will be detected as COVID-19. Apart from biomedical, if it is discussed from the perspective of bioinformatics, the process of string similarity alone or the basic sequence alignment is not enough to prove that the DNA sample includes Covid-19 because SARS and MERS still have close kinship values.
Therefore, it is necessary to add a machine learning method to study the distance pattern of each DNA sample so that it can be known and predicted where the DNA infected with COVID-19 really is. We optimize the four machine learning methods, namely Decision Tree, Discriminant Analysis, K-NN, and SVM. The optimization process of each machine learning method varies according to the need to get the best prediction results. Good input features will provide predictive analysis of machine learning with good results. For the DNA Alignment process we use the Edit Levenshtein algorithm with the addition of a DNA sequence normalization filter that meets the positive minimum limit and has the greatest similarity to the primers being compared as an input feature. We describe the optimization process in each method in Chapter 3, while we present the analysis of the results and the discussion in Chapter 4, and Chapter 5 contains conclusions from the results of our research. The results of the study show that the optimization of machine learning method is very helpful in predicting DNA samples by producing accuracy values above 98% for all methods that have been optimized, that were not able to be done in the previous string similarity process.
2. Literature study
DNA alignment is a method for analyzing the sequence of a DNA sample by aligning the sequence with another sequence. In bioinformatics, the nucleotide alignment method can also be said with the character comparison method. In one isolated file DNA can consist of tens of thousands of nucleotide sequences. In large numbers, the process of finding patterns in a sample will require significant time, therefore the speed of an algorithm in determining patterns is an important factor. Research before comparing the performance of the Brute Force, Knuth-Morris-Pratt, and Boyer Moore algorithms to find patterns in isolated DNA [6]. In the process of finding DNA patterns, there are millions of sequences that are compared, so the speed and accuracy of an
Received: May 12, 2020. Revised: June 5, 2020. 425
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
algorithm in finding these patterns is a major factor. In addition, the length of the primary characters that are not always the same can also provide different distance measurement results, one of the solution problems is by adding the normalization method to the Hamming algorithm so that the comparison process between primers can be balanced [6].
Decision Tree method is often used to determine a problem with multilevel consideration factors [7]. A condition can be chosen based on the selection of previous conditions and continues to flow until the final decision. This method can help provide a decision on the number of hospital costs to be paid by a patient by looking at the background factors of the patient [8]. Decision tree is one of the strong data mining that can be used to understand the factors that influence health condition decisions. Decision trees can be used to design factors in an urban environment that can affect health outcomes [10]. Previous research used a decision tree learning algorithm called classification and regression tree (CART) for CAD diagnosis as an alternative to the currently available diagnostic methods [10]. In machine learning, sometimes a problem occurs because of an unbalanced data set, this can be overcome by applying ensemble learning. Decision Tree method can be used as an initial classification in the ensemble learning method [11].
Beside decision tree method, machine learning methods that are also often compared are discriminant analysis and SVM [13]. Discriminant Analysis can be applied as a kernel for discrete cross-models to reduce the loss in some cases on quantization [13]. Linear Discriminant Analysis (LDA) can be used to classify patterns, this technique is often used to detect illness early in the data set being tested [14]. However, LDA sometimes cannot provide a good classification if it meets data that are matrices covariant and unseparated linear [15]. Problems in this LDA model can be overcome with a new model approach called Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis (FLDA-Lsp) [16]. Linear Discriminant Analysis is also able to classify the bent of a cell based on bispectral invariant features and the results of this classification can be analyzed in more detail by combining the SVM method [17]. For speaker recognition, Discriminant Analysis can be used by make optimization in Kernel Discriminant Analysis (KDA) in higher dimension [18]. In addition to a linear approach, to solve unstructured Covariance matrices is by applying Vanishing Non-Linear Discriminant Analysis (VNDA), this method is able to solve the ratio of trace problems on limited polynomials data [19].
KNN is one of the supervised machine learning methods that are able to solve various problems flexibly [21]. KNN can also be easily combined with other machine learning methods such as SVM, string distance, and neural network [21]. The results of the KNN classification process can increase significantly if at the time of comparison the pattern is given two paired criteria [23]. To determine the node on the KNN sometimes use the average value of the data, the disadvantage of this method is that it cannot determine the really good variable [23]. One solution to this problem is to choose sparse group features as candidates for relevant classes [24]. KNN algorithm is also able to recognize patterns in high-resolution images by calculating the similarity distance around the pixels being compared [25].
Support Vector Machine (SVM) is a supervised machine learning algorithm that is able to solve both classification and regression problems [26]. The way SVM works are to maximize the Hyperplane limit (maximum Hyperplane margin) [27]. There are a number of possible hyperplane choices for a data set, to get the best results from SVM is to determine the maximum Hyperplane [28]. Hyperplane with maximum margins will give better generalization to the classification method [30]. Hyperplane in SVM is not always linear, this model can be in the form of a quadratic curve, or Gaussian in accordance with the kernel that is applied to the data classification process [30].
3. Optimization of machine learning
algorithms
3.1 DNA alignment
Sample data from this study totaled 60 isolated DNA consisting of isolated positive DNA infected with COVID-19, MERS, and SARS each of them is 20 samples. All data is taken from the world gene bank [31]. For comparison, we use published Primary COVID-19 data [32, 33]. In one isolated DNA complete gene COVID-19, MERS, and SARS, consisting of 20,000 to 30,000 nucleotide sequences, this number is far more than the other isolated DNA in our previous study [6]. All samples will be compared with each primer, with a total of about 18,000,000 nucleotide comparison processes.
The process of comparing DNA alignment with COVID-19 primers using the Levenshtein distance Edit method. Each isolated DNA will be cut into pieces as long as the number of primary characters and then compared to the primer, calculated the distance of its proximity then shifts again to the next nucleotide. An isolated DNA is said to be positive for
Received: May 12, 2020. Revised: June 5, 2020. 426
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
a primary virus or bacterium if the similarity level of the nucleotide fragment reaches greater than 70% [34]. In this process, all isolated DNA tested at least one sequence has a similarity greater than 70% in the forward primer, so it can be said that all of the samples are Covid-19. SARS and MERS are indeed still in one group with Covid-19, which is a Coronavirus group, so it has a similar pattern. Therefore, a further predictive analysis process needs to be carried out. 𝑑𝑖𝑠𝑡𝑎,𝑏(𝑖, 𝑗) =
{
max(𝑖, 𝑗) , 𝑖𝑓 min(𝑖, 𝑗) = 0
min{
𝑑𝑖𝑠𝑡𝑎,𝑏(𝑖, 𝑗) + 1
𝑑𝑖𝑠𝑡𝑎,𝑏(𝑖, 𝑗 − 1) + 1
𝑑𝑖𝑠𝑡𝑎,𝑏(𝑖 − 1, 𝑗 − 1) + 1(𝑎𝑖≠𝑏𝑗)
}
}
(1)
𝑆𝑖𝑚𝑎,𝑏 =
𝑑𝑖𝑠𝑡𝑎,𝑏(𝑖,𝑗)
𝑛 × 100% (2)
𝑉𝑎𝑟(𝑥,𝑏) = {𝑆𝑖𝑚𝑎,𝑏 , 𝑖𝑓 max (𝑆𝑖𝑚𝑎,𝑏) ≥ 70
0 , 𝑖𝑓 max (𝑆𝑖𝑚𝑎,𝑏) ≤ 70} (3)
Eq. (1) is an algorithm to calculate the distance
between the sequence of DNA slice (a) to the Primer (b), while i is the character index of a and j is the character index of b. Then from the results of distance calculation, the similarity percentage will be calculated as in Eq (2). The variable n is the amount or length of the DNA slice being compared, so the percentage of similarity is calculated by dividing the resulting distance value by the number of characters multiplied by 100%. Eq. (3) is an explanation of how we fill the value of the variable independence in the matrix that we build as input features machine learnings.
We conducted various simulations to change the data from the comparison results so that it could be used as an appropriate input feature for machine learning. From some simulation results, the right simulation model in our opinion is to use primers as each input feature. Then every sequence comparison that produces higher similarity from 70% will be entered into the application database. From all data stored in the database, one comparison result that has the highest similarity value on each isolated DNA (has the shortest distance) to a primary will be selected as an input feature. If there is one isolated DNA that does not have a similarity level greater than 70% in a particular primer, then the dataset will be written 0. The amount of training data is the amount of isolated DNA compared to 60 and the number of
input features is eight (the number of primers compared), for the target output, there are three classes namely 0 for COVID-19, 1 for SARS, and 2 for MERS.
3.2 Decision tree optimization
The first Machine Learning algorithm that we tried is the Decision Tree. Decision trees use a hierarchical structure for supervised learning. The process of the decision tree starts from the root node to the leaf node which is done recursively. Where each branching states a condition that must be met and at each end of the tree states the class of data.
We use the Entropy concept which is used to measure "how informative" a node (which is usually called how good it is). Entropy (S) = 0, if all the examples in S are in the same class. Entropy (S) = 1, if the number of examples positive and the number of negative examples in S is the same. 0 <Entropy (S) <1, if the number of positive and negative examples in S is not the same. S is the case dataset and k is the number of S partitions, while 𝑝𝑗 is the probability obtained from Sum (Yes / values more than 70%) divided by Total Cases. k is the number of input features being selected, and P is the condition of the input feature. The Entropy algorithm can be analyze in Eq. (4-5). After getting the entropy value, the attribute selection is done with the largest information gain value.
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑆) = − ∑ 𝑝𝑗𝑙𝑜𝑔2
𝑘𝑗=1 𝑝𝑗 (4)
which can be applied to this case study: 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑆) = −(𝑃𝑐𝑜𝑣19𝑙𝑜𝑔2𝑝𝑐𝑜𝑣19 +
𝑃𝑠𝑎𝑟𝑙𝑜𝑔2𝑝𝑠𝑎𝑟 + 𝑃𝑚𝑒𝑟𝑙𝑜𝑔2𝑝𝑚𝑒𝑟 (5)
So the Gain (A) value in this case study can be calculated with:
𝐺𝑎𝑖𝑛 (𝐴) = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑆)
− ∑|𝑆𝑖|
|𝑆|𝑘𝑖=1 × 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 (𝑆𝑖) (6)
In Eq. (6), S is the sample data space used for
training. Variable A is the number of attributes, |Si| is the number of samples for values V and |S| is the sum of all sample data, both of which have absolute values. Whereas Entropy (Si) is entropy for samples that have a value of i. From the application of the formula (6), it can be concluded that the greater the information gain we get, the greater the entropy value that we delete. Because the main purpose of applying this
Received: May 12, 2020. Revised: June 5, 2020. 427
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
gain is to get an entropy value close to 0 or equal to 0.
3.3 Discriminant analysis optimization
The next method that we tested was a discriminant analysis. In our case study, the dataset tested will be divided into three classes, so it cannot use the linear discriminant analysis method. Then we do the discriminant analysis optimization process by forming an optimal discriminant function with several assumptions about the data used. These assumptions include the data on our independent variables, the multivariate normal distribution and the similarity of variance-covariance matrices between groups. In the preparation of discriminant functions, there are two methods that can be used, namely simultaneous estimation and stepwise estimation. The general model of discriminant analysis is a linear combination of data that can be observed in Eq. (7). �� and 𝑥 are two vectors whose distances are measured using the diagonal discriminant method. To find out the independent variables that can discriminate against a group we use Wilks Lambda method as in Eq. (8).
𝑆𝑗𝑘 = 𝑎 + �� 𝑗 ∙ 𝑥 𝑖𝑘 +⋯+ �� 𝑛 ∙ 𝑥 𝑛𝑘 (7)
To find out which independent variables can be discriminated against:
λ =det(𝐴)
det (𝐴+𝐵)=
|∑ ∑ (𝑥𝑖𝑗−��𝑖)(𝑥𝑖𝑗−��𝑖)′𝑛𝑖𝑗=1
𝑘𝑖=1 |
|∑ ∑ (𝑥𝑖𝑗−��)(𝑥𝑖𝑗−��)′𝑛𝑖𝑗=1
𝑘𝑖=1 |
(8)
In this case study, because there are three groups,
so the linear model is converted into a diagonal model. With a diagonal matrix 𝐷𝑖𝑐𝑟 = 𝑑𝑖𝑎𝑔(𝑎1, … , 𝑎2) and
a vector for this dataset become 𝑣𝑒𝑐 = [𝑥1⋮𝑥𝑛] . So
vector operations can be observed in Eq. (9).
𝐷𝑖𝑐𝑟𝑣𝑒𝑐 = 𝑑𝑖𝑎𝑔(𝑦1, … , 𝑦𝑛) [
𝑥1⋮𝑥𝑛]
= [𝑦1
⋱𝑦𝑛
] [
𝑥1⋮𝑥𝑛] = [
𝑦1𝑥1⋮
𝑦𝑛𝑥𝑛] (9)
In the process of optimization, we tested several kernel analysis including linear, multiple, and diagonal. The test results show that the diagonal discriminant analysis gives the best results compared to other kernels in this case study.
3.4 K-NN optimization
The K-Nearest Neighbor algorithm uses Neighborhood Classification as the predicted value of the new instance value. In this case, the variables we use are independent variables (variables that are not related to each other) so it can be said that these variables are input features. To calculate the distance between nodes and surrounding neighbors we use the Euclidean distance algorithm, we add weighted distance optimization between one node and another [35]. The kNN optimization algorithm can be observed in Eqs. (10)-(12), where L is the data set to be grouped.
𝐿 = {(𝑦𝑖 , 𝑥𝑖), 𝑖 = 1,… , 𝑛𝐿 (10) 𝑑(𝑥, 𝑥(1)) = 𝑚𝑖𝑛𝑖(𝑑𝑖𝑠𝑡𝑎,𝑏(𝑥, 𝑥𝑖)) with distance: 𝑑𝑖𝑠𝑡𝑎,𝑏 = √(𝑥𝑏 − 𝑥𝑎)
2 + (𝑦𝑏 − 𝑦𝑎)2
= (∑ (𝑥𝑖𝑎 − 𝑥𝑗𝑎)2𝑏
𝑎=1 )1
2 node turn into the class by weighted
�� = 𝑚𝑎𝑥𝑟(∑ 𝑤(𝐼)𝑘𝑖=1 𝐼(𝑦(𝑖) = 𝑟)) (11)
�� is the max value of a node to the neighbor value
compared whether the node has a similarity to the neighbor. rom our test results analysis, the amount of K that we determined also determines the results of the classification. The number of output classes produced can be influenced by the number of distance neighbors or the specified number of K. It can be observed a pattern that by using an odd K, our test results produce a more precise predictive value, the K we use in this study is 1.
3.5 SVM optimization
Support Vector Machine (SVM) is a learning system that uses hypothetical spaces in the form of linear functions in a high-dimensional feature space, trained with learning algorithms based on optimization theory by implementing learning bias derived from statistical learning theory. To classify data that cannot be separated linearly the SVM formula must be modified because no solution will be found. Therefore, the two bounding fields must be changed so that they are more flexible (for certain conditions) by adding the variable 𝑆𝑖 (𝑆𝑖 ≥0, ∀𝑖: 𝑆𝑖 = 0 if 𝑥𝑖 is classified correctly) to be 𝑥𝑖𝑤 +𝑏 ≥ 1 − 𝑆𝑖 for class 1 and 𝑥𝑖𝑤 + 𝑏 ≤ − 1 + 𝑆𝑖 for
Received: May 12, 2020. Revised: June 5, 2020. 428
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
class 2. Finding the best separator field by adding the variable 𝑆𝑖 is often also called the soft margin hyperplane. In this study we use a Gaussian kernel that can be optimized as in Eqs. (12)-(13).
𝑘(𝑥𝑖 , 𝑥𝑗) = exp ( −𝛾‖𝑥𝑖, 𝑥𝑗‖)2 (12)
Can be applied for 𝛾 = 0 , if the parameter is
different then 𝛾 =1
(2𝜎2) and the hyperplane
optimization become:
𝑦𝑖(�� ∙ 𝑥1 − 𝑏) ≥ 1, 𝑓𝑜𝑟 𝑖 = 1,… , 𝑛
[1
2∑max (0.1 − 𝑦𝑖(�� ∙ 𝑥1 − 𝑏))
𝑛
𝑖=1
] + 𝛾‖�� ‖2
𝑚𝑖𝑛1
2|𝑤|2 + 𝐶 (∑ 𝑆𝑖
𝑛𝑖=1 ) or
𝑠. 𝑡. 𝑦𝑖(𝑤. 𝑥𝑖 + 𝑏) ≥ 1 − 𝑆𝑖 or 𝑆𝑖 ≥0 (13)
C is the parameter that determines the large
selection and the data value is determined by the user. This optimization process follows the rules of Structural Risk Minimization (SRM). SRM principle is finding a subset of space. The hypothesis is chosen so that the upper limit is the actual risk by using that subset minimized. SRM aims to minimize actual risk by minimizing error in training data. In this study, minimizing 1
2|𝑤|2 are equivalent to minimizing VC
dimension and minimize 𝐶(∑ 𝑆𝑖𝑛𝑖=1 ) means
minimizing error in training data [36].
4. Result and discussion In the string similarity process, the results of
matching the character of each primer to each isolated DNA tested give varying degrees of similarity. What's interesting about this study is that all isolated DNA tested both SARS, MERS, and COVID-19 all yield a similarity percentage of higher than 69% at least in one of the COVID-19 primers compared, so it can be said that the sample is positive for COVID-19. Table 1 shows the number of sequences that have a higher similarity percentage of 69% in each primer. It can be observed that the sequence tends to have a high similarity value in the forward primary, but some also have a high similarity value in the primary refers.
Below is a piece of positive DNA COVID-19 accession code LR757996.1 on index 15850, MERS accession code MG923468.1 on index 1858, SARS accession code NC_004718 on index 15798. On the MERS DNA, there is one insert command that is
adding T nucleotides blue characters) to get the shortest distance. Primer : GTGARATGGTCATGTGTGGCGG
COVID-19 : GTGAAATGGTCATGTGTGGCGG
MERS : GTGACATTGTCAGGTGTGGGGG
SARS : GTGAGATGGTCATGTGTGGCGG
Through DNA alignment above, it can be
observed that the distance difference lies in the nucleotide R, where R is one component of RNA that can be transformed into nucleotides A, T, G, C. From the observations above, that the changes are not always specific to certain nucleotides. But from our deeper observation, from 20 COVID-19 samples, all of them turned into nucleotide A (Adenine). It can be concluded that the pattern of COVID-19 tends to be A, as in some of the alignment examples below:
Primer : GTGARATGGTCATGTGTGGCGG
LC528232.1 : GTGAAATGGTCATGTGTGGCGG
LC528233.1 : GTGAAATGGTCATGTGTGGCGG
LR757995.1 : GTGAAATGGTCATGTGTGGCGG
LR757996.1 : GTGAAATGGTCATGTGTGGCGG
LR757997.1 : GTGAAATGGTCATGTGTGGCGG
MN908947.3 : GTGAAATGGTCATGTGTGGCGG
MN996531.1 : GTGAAATGGTCATGTGTGGCGG
MN994468.1 : GTGAAATGGTCATGTGTGGCGG
Table 1 describes the number of sequences that
have a percentage similarity of ≥ 70% with respect to each primer. This amount is cumulative of all isolated DNA grouped according to the type of virus that infected it. Indeed, the COVID-19 DNA sample has the greatest number of similar sequences because what is tested is the COVID-19 primer. However, SARS also has a sequence of similarity above 70% in some primers, and MERS, although only two types of primers, can still be said to have a high degree of similarity to primers COVID-19.
In the Decision Tree algorithm, to decide on an isolated DNA including the type of which virus are quite difficult because the percentage of similarity in COVID-19 and SARS is almost the same, therefore this Decision Tree algorithm needs to add an entropy approach to measure how informative the value is given from the measurement results similarity distance in the previous process, this process also decide the value of gini index. The optimization process of Decision Tree algorithm can be observed in Fig. 1, the result of this system show that maximum split is tree using maximum deviance reduction method.
Received: May 12, 2020. Revised: June 5, 2020. 429
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
Table 1. The number of DNA Sequence having similarity level ≥ 70% in each of the COVID-19 primer tested
Primer COV
ID-19
ME
RS
SA
RS
5’-TGGGGYTTTACRGGTAAC
CT-3’(Forward) 80 0 98
5’-AACRCGCTTAACAAAGCA
CTC-3’(Reverse) 26 0 0
5’-TAATCAGACAAGGAACTG
ATTA-3’(Forward) 141 0 151
5’-CGAAGGTGTGACTTCCAT
G-3’(Reverse) 14 28 0
5′- GTGARATGGTCATGTGTG
GCGG-3’ (Forward) 90 2 86
5’- CARATGTTAAASACACTA
TTAGCATA-‘3 (Reverse) 0 0 0
5’- ACAGGTACGTTAATAGTT
AATAGCGT-3’ (Forward) 122 0 126
5’- ATATTGCAGCAGTACGCA
CACA-3’ (Reverse) 41 0 1
Figure. 1 Optimization process of decision Tree algorithm
Discriminant analysis algorithms usually use a
linear approach to determine which node belongs in which class. However, because the class needed in this study amounted to three, so the linear discriminant analysis is less precise in solving problems. We add the Wilks Lambda method to determine the independence variable used as input features of discriminant analysis. The test results using the Optimize Discriminant Analysis algorithm produce an accuracy rate of more than 98%. The optimization process of Discriminant Analysis algorithm can be observed in Fig. 2.
In K-NN algorithm, determining the value of K,
Figure. 2 Optimization process of discriminant analysis
Figure. 3 Optimization process of K-NN algorithm
Figure. 4 Optimization process of SVM algorithm
if the sum of our classifications is even then we better use even K values, and vice versa if our total classifications are odd then we better use even K values because if it is not so, there is a possibility that we will not be optimal results from testing. In this study, we use K = 1, which is choosing 1 neighbor who have high proximity values with the node that we are comparing. The training result of K-NN can be observe in Fig. 3.
Similar to the Discriminant Analysis Algorithm, the SVM algorithm also provides results with low accuracy in linear SVM. The advantage of SVM, this method has a kernel that can be adjusted to the needs-
Received: May 12, 2020. Revised: June 5, 2020. 430
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
Figure. 5 Confusion matrix results of (the order of images from left to right) Opt. decision tree, Opt. discriminant
analysis, Opt. K-NN, and Opt. SVM
Table 2. Sensitivity, specificity, precision (Positive Predictive Value), and negative predictive value (NPV) values for each class
Algorithm Class Sensitivity Specificity Precision NPV
Opt. Decision Tree COVID-19 0.950 1.000 1.000 0.976 SARS 1.000 1.000 1.000 1.000 MERS 1.000 0.975 0.952 1.000
Opt. Discriminant Analysis COVID-19 0.950 1.000 1.000 0.976 SARS 1.000 0.975 0.952 1.000 MERS 1.000 1.000 1.000 1.000
Opt. K-Nearest Neighbors COVID-19 1.000 1.000 1.000 1.000 SARS 1.000 1.000 1.000 1.000 MERS 1.000 1.000 1.000 1.000
Opt. Support Vector Machine COVID-19 1.000 1.000 1.000 1.000 SARS 1.000 1.000 1.000 1.000 MERS 1.000 1.000 1.000 1.000
based on input data or the number of output classes desired. In the SVM optimization process, we tested several kernels to produce the best hyperplane. The optimization process to get the best hyperplane uses the SRM principle and considers the actual risk factor. Kernel testing can be observed in Fig. 4.
The validation process in this study uses the Cross-Validation approach with K as many as 10. In each of our tested optimization methods, we divided the data into two groups, 90% for training data and 10% for test data. Then our application will form the composition of the data randomly 10 times to test its accuracy. The test results showed that the optimization of the Decision Tree algorithm and Discriminant analysis each resulted in 1 data error prediction. In the Decision Tree, one data that should be DNA infected with COVID-19 is predicted to be DNA infected with MERS. Whereas in Discriminant Analysis, one data which should be DNA infected with COVID-19 is predicted to be DNA infected with SARS. Comparison of these data uses the COVID-19 primer, but instead, the error data is found in COVID-19, while MERS and SARS can be predicted well, this shows that the pattern on COVID-19 is still changing more.
In the SVM and K-NN algorithms, each data can be predicted well and produces 0 prediction errors.
The confusion matrix of the four optimization algorithms can be observed in Fig. 5. From the confusion matrix in Fig. 5, the sensitivity, specificity, precision/Positive Prediction Value (PPV), and Negative Prediction Value (NPV) values can be calculated. Sensitivity values are obtained from correctly predicted data values divided by the amount of correct data in real conditions.
The specificity value is obtained from dividing correctly predicted data not the class divided by real data that is not the class. Then Precision is obtained from all data that is in the class and correctly predicted divided by the amount of true data that is predicted correctly and incorrectly. Calculating the value of sensitivity, specificity, precision, and NPV on a multi-class matrix is different from the calculation of a two-class matrix basically. In this case, when calculating the sensitivity value for the COVID-19 class, the MERS and SARS data will be considered True Negative (TN) data, as well as the calculations on MERS, and SARS. In Table 2, the sensitivity value for the COVID-19 class using the Decision Tree optimization algorithm is 0.95. A COVID-19 data is predicted to be wrong into MERS data, which causes the specificity and precision values in the MERS class to be imperfect. Whereas
Received: May 12, 2020. Revised: June 5, 2020. 431
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
Fig. 6. The accuracy value of each class uses decision
tree, discriminant analysis, K-NN, and SVM with optimization methods
in the Discriminant Analysis Optimization method, one member of the COVID-19 class is predicted to be wrong in the SARS class and also results in an imperfect precision and specificity value. For the K-NN and SVM Optimization methods, they can correctly predict data into each class. Fig. 6. is the accuracy value of each class for the tested methods.
5. Conclusion
The similarity in DNA structure between COVID-19, MERS, and SARS is one of the obstacles in predicting samples that are actually infected with COVID-19. The DNA alignment method with Primer produces a positive value of COVID-19 in all MERS and SARS samples. Machine learning methods can help the prediction process by observing changes in the pattern of DNA alignment that are included as input features. The results of predictions show Optimization of SVM and KNN are able to predict 100% correctly, while optimization of Discriminant Analysis and Decision Tree produces an accuracy of 98.3%. The prediction error is precisely in the COVID-19 sample data, even though the Primer tested was the COVID-19 primer. This shows that the composition of DNA in COVID-19 samples is still diverse and there is a possibility that mutations will continue to occur. In the process of DNA alignment between COVID-19 Primer and isolated DNA samples, we analyzed that when tested with certain primers containing RNA 'R', the sequence in isolated DNA infected COVID-19 always becomes 'A'
Conflicts of Interest
The authors declare no conflict of interest
Author Contributions
Berlian Al Kindhi in this study contributed to the entire process of machine learning and data set processing and writing paper.
Acknowledgments
This research is partially funded by the Institut Teknologi Sepuluh Nopember for research grants No. 853/PKS/ ITS/2020.
References
[1] T. Lupia, S. Scabini, S. M. Pinna, G. D. Perri, F. G. Rosa and S. Corcione, “2019 novel coronavirus (2019-nCoV) outbreak: A new challenge”, Journal of Global Antimicrobial Resistance, Vol. 21, pp. 22-27, 2020.
[2] M. A. Shereen, S. Khan, A. Kazmi, N. Bashir and R. Siddique, “COVID-19 infection: Origin, transmission, and characteristics of human coronaviruses”, Journal of Advanced Research, Vol. 24, pp. 91-98, 2020.
[3] J. A. Al-Tawfiq and P. Gautret, “Asymptomatic Middle East Respiratory Syndrome Coronavirus (MERS-CoV) infection: Extent and implications for infection control: A systematic review”, Travel Medicine and Infectious Disease, Vol. 27, pp. 27-32, 2019.
[4] P. B. Tim Smith, P. Jennifer Bushek and P. Tony Prosser, “COVID-19 Drug Therapy – Potential Options, Clinical Drug Information”, Clinical Drug Information, Clinical Solutions, 2020.
[5] C. Sohrabi, Z. Alsafi, N. O'Neill, M. Khan, A. Kerwan, A. Al-Jabir, C. Iosifidis and R. Agha, “World Health Organization declares global emergency: A review of the 2019 novel coronavirus (COVID-19)”, International Journal of Surgery, Vol. 76, pp. 71-76, 2020.
[6] B. A. Kindhi and T. A. Sardjono, “Pattern Matching Performance Comparisons as Big Data Analysis Recommendations for Hepatitis C Virus (HCV) Sequence DNA”, In: Proc. of the 3rd International Conference on Artificial Intelligence, Modelling and Simulation (AIMS), Kota Kinabalu, Malaysia, 2015.
[7] B. A. Kindhi, T. A. Sardjono, M. H. Purnomo and G. J. Verkerke, “Hybrid K-means, fuzzy C-means, and hierarchical clustering for DNA hepatitis C virus trend mutation analysis”, Expert Systems with Applications, Vol. 121, pp. 373-381, 2019.
[8] N. E. I. Karabadji, I. Khelf, H. Seridi, S. Aridhi, D. Remond, and W. Dhifli, “A data sampling and attribute selection strategy for improving
Received: May 12, 2020. Revised: June 5, 2020. 432
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
decision tree construction”, Expert Systems with Applications, Vol. 129, pp. 84-96, 2019.
[9] G. A. Kundakçi, M. Yılmaz, and M. KaanSözmen, “Determination of the costs of falls in the older people according to the decision tree model”, Archives of Gerontology and Geriatrics, Vol. 87, p. 104007, 2020.
[10] A. C. Hillar, L. C. Donna, B. Charles, and M. M. Heather, “Using decision trees to understand the influence of individual- and neighborhood-level factors on urban diabetes and asthma”, Health & Place, Vol. 58, p. 102119, 2019.
[11] M. M.Ghiasi, S. Zendehboudi, and A. A. Mohsenipour, “Decision tree-based diagnosis of coronary artery disease: CART model”, Computer Methods and Programs in Biomedicine, Vol. 192, p. 105400, 2020.
[12] J. Obregon, A. Kim, and J.-Y. Jung, “RuleCOSI: Combination and simplification of production rules from boosted decision trees for imbalanced classification”, Expert Systems with Applications, Vol. 126, pp. 64-82, 2019.
[13] C. L. M. Morais, K. M. G. Lima, and F. L. Martin, “Uncertainty estimation and misclassification probability for classification models based on discriminant analysis and support vector machines”, Analytica Chimica Acta, Vol. 1063, pp. 40-46, 2019.
[14] Y. R. Y. Fang, “Supervised discrete cross-modal hashing based on kernel discriminant analysis”, Pattern Recognition, Vol. 98, No. 1, p. 107062, 2020.
[15] S. Yang, J. Bian, Z. Sun, L. Wang, H. Zhu, H. Xiong, and Y. Li, “Early Detection of Disease Using Electronic Health Records and Fisher’s Wishart Discriminant Analysis”, Procedia Computer Science, Vol. 140, No. 1, pp. 393-402, 2018.
[16] Z. Jing, G. Wang, S. Zhang, and C. Qiu, “Building Tianjin driving cycle based on linear discriminant analysis”, Transportation Research Part D: Transport and Environment, Vol. 53, pp. 78-87, 2017.
[17] Q. Ye, L. Fu, Z. Zhang, H. Zhao, and M. Naiem, “Lp- and Ls-Norm Distance Based Robust Linear Discriminant Analysis”, Neural Networks, Vol. 105, No. 1, pp. 393-404, 2018.
[18] V. C. K. Al-Dulaimi, K. Nguyen, J. Banks, and I. Tomeo-Reyes, “Benchmarking HEp-2 specimen cells classification using linear discriminant analysis on higher order spectra features of cell shape”, Pattern Recognition Letters, Vol. 1251, pp. 534-541, 2019.
[19] R. K. Das, A. B. Manam, and S. R. M. Prasanna, “Exploring kernel discriminant analysis for
speaker verification with limited test data”, Pattern Recognition Letters, Vol. 98, pp. 26-31, 2017.
[20] Y. Shao, G. Gao, and C. Wang, “Nonlinear discriminant analysis based on vanishing component analysis”, Neurocomputing, Vol. 218, pp. 172-184, 2016.
[21] S. B. Chen, Y. L. Xu, C. H. Q. Ding, and B. Luo, “A Nonnegative Locally Linear KNN model for image recognition”, Pattern Recognition, Vol. 83, pp. 78-90, 2018.
[22] J. Xiao, “SVM and KNN ensemble learning for traffic incident detection”, Physica A: Statistical Mechanics and its Applications, Vol. 517, pp. 29-35, 2019.
[23] G. Bhattacharya, K. Ghosh, and A. S. Chowdhury, “Granger Causality Driven AHP for Feature Weighted kNN”, Pattern Recognition, Vol. 66, p. 4250436, 2017.
[24] J. N. Myhre, K. Ø. Mikalsen, S. Løkse, and R. Jenssen, “Robust clustering using a kNN mode seeking ensemble”, Pattern Recognition, Vol. 76, pp. 491-505, 2018.
[25] C. D. S. Zheng, “A group lasso based sparse KNN classifier”, Pattern Recognition Letters, Vol. 131, pp. 227-233, 2020.
[26] N. Liu, X. Xu, Y. Li, and A. Zhu, “Sparse representation based image super-resolution on the KNN based dictionaries”, Optics & Laser Technology, Vol. 110, pp. 135-144, 2019.
[27] B. Lin, X. Wei, and Z. Junjie, “Automatic recognition and classification of multi-channel microseismic waveform based on DCNN and SVM”, Computers & Geosciences, Vol. 123, pp. 111-120, 2019.
[28] T. I. Dhamecha, A. Noore, R. Singh, and M. Vatsa, “Between-subclass piece-wise linear solutions in large scale kernel SVM learning”, Pattern Recognition, Vol. 95, pp. 173-190, 2019.
[29] D. Zhang, L. Jiao, X. Bai, S. Wang, and B. Hou, “A robust semi-supervised SVM via ensemble learning”, Applied Soft Computing, Vol. 65, pp. 632-643, 2018.
[30] R. Sundar and M. Punniyamoorthy, “Performance enhanced Boosted SVM for Imbalanced datasets”, Applied Soft Computing, Vol. 83, p. 105601, 2019.
[31] U. Khan, L. Schmidt-Thieme, and A. Nanopoulos, “Collaborative SVM classification in scale-free peer-to-peer networks”, Expert Systems with Applications, Vol. 691, pp. 74-86, 2017.
[32] Gene Bank, 10 2 2020. [Online]. Available: https://www.ncbi.nlm.nih.gov/genbank/sars-cov-2-seqs/.
Received: May 12, 2020. Revised: June 5, 2020. 433
International Journal of Intelligent Engineering and Systems, Vol.13, No.4, 2020 DOI: 10.22266/ijies2020.0831.37
[33] J.-M. Kim, Y.-S. Chung, H. J. Jo, N.-J. Lee, M. S. Kim, S. H. Woo, S. Park, J. W. Kim, H. M. Kim, and M.-G. Han, “Identification of Coronavirus Isolated from a Patient in Korea with COVID-19”, Osong Public Health and Research Perspectives, Vol. 11, No. 1, pp. 3-7, 2020.
[34] LKS Faculty of Medicine, School of Public Health, Hongkong University, “Detection of 2019 novel coronavirus (2019-nCoV) in suspected human cases by RT-PCR”, https://www.who.int/docs/default-source/coronaviruse/peiris-protocol-16-1-20.pdf?sfvrsn=af1aac73_4, Hongkong, 2020.
[35] S. Carson and D. Robertson, Manipulation and Expression of Recombinant DNA, chapter: III Expression, Detection, and Purification of Recombinant Proteins from Bacteria, Elsevier Academic Press, California, pp. 130–168, 2006.
[36] K. Hechenbichler and K. Schliep, “Weighted k-Nearest-Neighbor Techniques and Ordinal Classification”, Sonderforschungsbereich, Vol. 386, p. 399, 2004.
[37] E. E. Osuna, R. Freund, and F. Girosi, “Support vector machines; training and applications”, A. I. Memos No. 1602, CBCL Memos No. 144, Artificial Intelligence laboratory, Massachusetts Institute of Technology, 1997.
Lampiran 3. Luaran tambahan draft paper international conference
59 60
I
4 5
9
12
Page 1 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 1
1 2 3 Hepatitis C Virus Detection in Human DNA 6 using Random Sequence Patterns and Machine 7 8 Learning 10 11
B. Al Kindhi, T. A. Sardjono, Member, IEEE, M. Amin, G. J. Verkerke, M. H. Purnomo, and M. A.
13 Wiering., Senior Member, IEEE 14 15 16 often called DNA micro-array analysis [1]. DNA analysis can 17 Abstract—The Hepatitis C Virus (HCV) is a virus that has 18 one of the highest mutation rates in the world. One way to know
about HCV infection is to analyze sequence patterns in isolated 19 DNA. In this paper, we propose a novel method for computing 20 input features from the DNA sequences by using reasonably small 21 random DNA sequences and the Levenshtein distance. The
22 Levenshtein distance is used to compute how much a randomly
23 generated pattern is present in a human DNA sequence. This 24 results in an input vector of a particular length based on the
number of generated random DNA sequences. Our method is 25 compared to the use of HCV primers, which are sub sequences 26 known to often occur in DNA sequences of humans infected by 27 HCV. Four machine learning algorithms are compared on their 28 ability to use the novel input features for classifying the 29 presence of HCV in isolated DNA data. We compare Multi-Layer 30 Perceptrons, K-Nearest Neighbors, Logistic Regression, and the
Mean Centroid classifier to detect HCV-positive isolated DNA. 31 The dataset consists of thousand isolated DNA sequences for HCV 32 infected and for non-infected people. The results show that the use 33 of many and reasonably long randomly generated DNA sequences
34 leads in general to similar or higher accuracies compared to using 35 HCV primers and leads to almost perfect classification scores. 36 Furthermore, the results show that the multi-layer perceptron 37 and logistic regression classifier obtain the best results from the
four machine learning algorithms we compared. 38 39 Index Terms— DNA Prediction, DNA Sequence Matching, HCV 40 Classification, Machine Learning 41 42 43 I. INTRODUCTION 44 SOLATED DNA sequences can consist of tens of 45 thousands of sequence proteins, and its data processing is 46 47 48 49 50 51 52 53 54 55 56 57 58
be used as a decision support system for many problems such as diagnosing a cancer disease [2] or detecting the presence of bacteria or viruses [3]. One commonly used technique is classifying the DNA sequence [4],[5],[6], where a sequencing process is usually the first step.
Feature selection [7], exon prediction [8], and gene struc- ture prediction [9], [10] have been applied to handle DNA prediction problems. Furthermore, for DNA prediction, fuzzy learning methods have been used to classify protein-based or nucleotide sequence patterns [11], [12]. Other research explains that classifying the Hepatitis B Virus (HBV) can be based on search indexes on the internet so that every sequence that has a pattern similarity with positive HBV sequences can be decided positively infected [13]. This technique is based on a semantic similarity metric, but may suffer from the problem that DNA sequences can mutate over time. Another prediction technique is to simulate the sequence of nucleotides [14] and perform an analysis based on the outcomes of these simulations [15],[16], [17]. In addition to sequence-based pre- diction, DNA prediction can use epigenetic methylation [18], for example the search for a methylated region can be based on one of the nucleotides, which may occur with mutations and cause a disease [19]. One commonly used method for predicting DNA is the splice site technique [20] and using DNA binding proteins [21], that is, by slashing the DNA in the proper sequence and predicting based on the sequence [22], [23].
In bio-medicine, primers are often used as predictors of the presence of a virus or bacteria in isolated DNA [24], [25], [26]. The primer pattern may be replicated in certain index sequences. Primers can be found by the Polymerization Chain Reaction (PCR) method. A useful primer is a primer that is present in all the positive isolated DNA. Hence, quite a lot of research aims to extract a short primer that can be used to detect the presence of a virus in isolated DNA [27], [28], [29].
In bioinformatics, primers are used as comparison param- eters in pattern matching processes for isolated DNA. The results of this pattern matching process can be improved upon by using machine learning methods. In this paper, we use four machine learning methods for our experiments: a
B. Al Kindhi is with Department of Electrical Automation Engineering, Institut Teknologi Sepuluh Nopember, Indonesia (e-mail: [email protected]). (Corresponding author).
T. A. Sardjono is with Department of Biomedical Engineering, Institut Teknologi Sepuluh Nopember, Indonesia (e-mail: [email protected]).
M. Amin is with Institute of Tropical Disease, Airlangga University, Indonesia
G. J. Verkerke is with Department of Rehabilitation Medicine, University of Groningen, Netherlands (e-mail: [email protected]).
M. H. Purnomo is with Department of Computer Engineering, Institut Teknologi Sepuluh Nopember, Indonesia (e-mail: [email protected]).
M. A. Wiering is with Department of Artificial Intelligence, University of Groningen, Netherlands (e-mail: [email protected]).
59 60
Page 2 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 2
1 2 Multi-Layer Perceptron (MLP), K-Nearest Neighbors (K-NN), 3 Logistic Regression, and a Mean Centroid classifier to map 4 the extracted input features to the target class (HCV infected, 5 not HCV infected). The main originality of our paper is 6 how we compute informative input features that help the 7 classification algorithms to obtain high accuracies. Our method 8 combines the generation of a number of random DNA sub- 9 sequences and the Levenshtein distance (edit distance). The 10 edit distance is a state-of-the-art method to compute distances 11 between strings and is used here to derive information about 12 how present each randomly generated DNA sequence is in 13 the whole isolated human DNA sequence. Our method can 14 use shorter or longer sequences and in the experiments we 15 used random DNA sequences of lengths 25, 100, and 200. 16 Furthermore, the number of sub sequences is a parameter, for 17 which we tried 37 and 100. The advantage of our proposed 18 method is its simplicity, as it does not require an analysis 19 to obtain primers. Furthermore, our method can be used for 20 many different DNA classification problems. Our method has 21 some similarities to using random projections, in which feature 22 vectors are transformed to other features by using dot products 23 with random vectors [16]. 24 The data samples used were isolated DNA Hepatitis C Virus 25 (HCV) data and Homo Sapiens data and as comparative 26 patterns we used HCV primers and random DNA sequences. 27 We are primarily interested to study how these patterns can 28 be effectively used by the machine learning algorithms for 29 classifying the presence of HCV in DNA sequences. 30 Outline of this paper. In Section 2, the used machine 31 learning techniques are shortly described. Section 3 describes 32 the dataset and our method of extracting input features from 33 very long DNA sequences. In Section 4, we show the results 34 and discuss them. Finally, Section 5 concludes this paper and 35 gives some recommendations for future work 36 37 38 II. MACHINE LEARNING ALGORITHMS 39 In this section, we will present a description of the used 40 machine learning algorithms. The four supervised learning 41 methods use a dataset of known HCV infected DNA and not 42 infected DNA to train a classification model. Because there 43 are two classes: the DNA sequence is infected or not, this is 44 a binary classification problem. 45 The K-Nearest Neighbor (K-NN) classifier is a classification 46 method that classifies new examples based on their distances 47 to the training examples in feature space [30]. The classi- 48 fication process is done by first determining the K closest 49 examples [17] and then the majority class of these K examples 50 is used to classify the unseen test example [31]. The K-NN 51 classifier is simple and only requires to set K, the number of 52 used neighbors, and a choice of distance function for which 53 often the Euclidean distance is used. 54 Another supervised learning method that is very often used 55 is the Multi-Layer Perceptron (MLP). An MLP consists of two 56 or more layers, including an input layer, several hidden layers, 57 and one output layer. Each layer has a number of processing 58
units connected to the next layer with adaptive connections (weights). The weights between layers are trained based on the error between targets and predicted outputs [32]. After training on a dataset for multiple epochs, the MLP can output classifications on unseen examples.
Logistic regression is a supervised learning method that uses multivariate analysis to predict dependent variables based on independent variables that have been determined [33]. In logistic regression, the dependent variable is the dichotomy variable (categorical). In this study, the dichotomy variable is binary logistic, because the dependent variable consists of two categories: positive and negative HCV. Other types of dependent variables are multinomial logistic (for more than two dependent variables) and ordinal logistic (for more than two dependent variables in the form of ranking) [34]. Logistic regression is a good alternative learning method if the assumption of having a multivariate normal distribution on the free variables is not fulfilled [35]. This assumption is not always fulfilled, because the independent variables may consist of continuous variables and categorical variables [36], [37].
The idea underlying the Mean Centroid classifier is to com- pute a mean (centroid) vector for all input vectors belonging to each class. In our case, the positive centroid is obtained by calculating the average of the input vectors of all positive data and vice versa for the negative class. For a novel example, the method computes the Euclidean distances to each centroid. Then the class belonging to the centroid with the minimum distance will be used as predicted class for the test example.
III. DATA SETS AND FEATURE EXTRACTION
HCV is one of the tropical diseases that undergoes a genetic mutation cycle (molecular clock) quickly and is easily trans- mitted. HCV is a type of RNA virus. RNA is a molecule in the body that is responsible as a messenger code for the formation of new DNA proteins. If the RNA is infected with the virus, the DNA will also change. The cell’s natural property is to defend itself by altering new DNA code patterns through changing RNA patterns. The probability of DNA changes may also vary depending on natural adaptations or current treatments. For example, viruses are resistant to the same drug, because the DNA patterns have changed (as a survival adaptation), forming a new genome sub-type again and so on. In general, differences between DNA and RNA can be seen: DNA contains polymers that are longer than RNA.
DNA isolation is the process of purifying DNA from cell samples using a combination of physical and chemical methods [38]. The sample is obtained by breaking the walls of cells or tissues to be isolated by DNA, such as red blood cells, bacterial cell cultures, and animal or plant tissue. The first DNA isolation was done in 1869 by Friedrich Miescher [38]. The datasets used in this study are obtained from the World Gene Bank [39]. The dataset that we used from the World Gene Bank is from the nucleotide category with the keyword Hepatitis C Virus and we also used the general Homo Sapiens dataset. Some examples of the isolated DNA data code that
59 60
Page 3 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 3
1 2 we use are AB426117.1, AB828704.1, and AY956468.1. We 3 used 1000 isolated DNA sequences consisting of 500 4 isolated DNA sequences that were positively infected with 5 HCV and 500 negative sequences. 6 We converted the isolated data into the FASTA format and 7 saved the sequences into text files. The FASTA format is a 8 text-based format to represent the sequence of nucleotides or 9 protein sequences represented by a letter code. These files are 10 then processed by feature extraction methods to compute a 11 fixed number of input features from each sequence. For this, 12 five feature extraction methods are used: using 37 known HCV 13 primers, using 37 random DNA sequences of length 25, 100, 14 or 200, and using 100 random DNA sequences of length 200. 15 A primer is a key character sequence used by medical 16 experts to detect the presence of a virus or bacteria in isolated 17 DNA. The primer must be complementary to the sequence to 18 be amplified, especially in the Polymerization Chain Reaction 19 (PCR) [40]. If a primer can be attached to a nucleotide 20 sequence, then the sequence can be said to be positive to 21 the primer. Due to the high mutation rates in HCV, HCV 22 tends to have dozens of primary types. In total, we use 37 23 primers where each primer has different character lengths, 24 ranging from 15 to 50 characters. The primers obtained with 25 PCR methods are from the Institute of Tropical Disease, 26 Airlangga University and we also used some primers from 27 other research [41], [42]. 28 For our novel method, we use a random number generator 29 for generating random sequences consisting of four types of 30 nucleotides, which are Adenine (A), Guanine (G), Cytosine 31 (C) and Thymine (T). In this study, the lengths of the gener- 32 ated random sequences are 25, 100, and 200 characters. We 33 generate random DNA sequences with 3 different lengths to 34 get more insight in the usefulness of random DNA sequences 35 and also to obtain different and more results. Because we used 36 37 sequences for the HCV primers, we will also generate 37 37 sequences for the random DNA sequences with lengths 25 and 38 100. For the random DNA sequence with length 200, we will 39 also compare using 37 and 100 sequences. 40 In Figure 1, we show the feature extraction process. First, 41 the method begins with reading the FASTA file according to 42 the sequence of the IDs in the database. One FASTA file 43 can consist of 10,000 to 15,000 nucleotide sequences and each 44 file will be compared with each primer or random DNA 45 sequence. Each primer has different character lengths and the 46 DNA similarity matching process will also vary according to 47 the length of the character string of the comparison pattern. 48 After the process of calculating the distance between one file 49 and one pattern is finished, the process of calculating distances 50 will be repeated with the next pattern until the last pattern is 51 compared with the isolated DNA. 52 We use the Levenshtein distance to calculate the presence of 53 each primer or random DNA sub sequence in a DNA sequence. 54 The Levenshtein distance computes a distance between two 55 strings s and t by counting the minimum amount of deletions, 56 insertions, or changes to make the strings the same. Suppose 57 the primer pattern is a and the DNA sequence is b. The pattern 58
a can occur at many places in the DNA sequence b. We perform the matching process of pattern a with a part of sequence b by shifting a window over sequence b. For each window, we compute the Levenshtein distance between pattern a and the sub sequence in the window. From all these distances, we compute the minimum computed distance, which denotes how present pattern a is in sequence b.
By using the Levenshtein distance, the use of all primers lead to 37 input features for the machine learning algorithms. The computation of the Levenshtein distance is illustrated in Figure 2, which was taken from Isolated DNA with the code: EU255978. The pattern taken was a sequence starting from index 4924, namely TTTGACTCGTCTGTCCTCTG. The primary comparison is TTTGACTCAACCGTCACTGA. The Levenshtein distance that is calculated is 6, which uses substitutions at positions 9, 10, and 12. Then "C" is inserted at position 15, a substitution is performed at position 17 and the last position is deleted, so there are a total of 6 changes, which is the Levenshtein distance between the two sequences.
Fig. 1. Data Set Preprocessing Procedure
59 60
Page 4 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 4
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Fig. 2. Example of a Levenshtein distance computation matrix in isolated 24 DNA 25 26 The comparison results of one FASTA file with one pattern 27 produces thousands of sequences with different distances, 28 from which the smallest distance is taken as input feature 29 30 31 for that pattern. Because we use 37 patterns (or 100 in one 32 experiment), we end up with 37 (or 100) input features for the 33 classifiers. 34 The primer with the pattern TTTGACTCAACCGT- 35 CACTGA is a primer that has a close proximity to all isolated 36 DNA sequences which are positive HCV infected, as 37 evidenced by its smallest average distance compared to other 38 primers. If an isolated DNA sequence consists of around 12,000 39 characters, then in the experiment with random DNA 40 sequences with length 200 there are around 10 billion charac- 41 ter comparisons for the whole dataset (300,000 comparisons
A. Hyperparameter Selection We first optimized the different hyperparameters of the four
machine learning algorithms using preliminary experiments. For the Multi-Layer Perceptron (MLP), the number of input
neurons is 37 (or 100) according to the number of patterns used. The target outputs in our dataset are 0 or 1. For the MLP, we use a single output neuron with a sigmoid activation function and the binary cross entropy as loss function. To convert its output to a Boolean decision, we used a threshold of 0.5. We did some testing with the number of hidden units, we tried using 5, 10, and 15 units and tried 250, 500 and 1000 training epochs. The results of these preliminary experiments showed that the best setting uses 10 hidden units and 250 training epochs.
For K-NN, also 37 (or 100) values of the shortest distances are used as input vector. We selected the Euclidean distance function and tested the K-NN with different values for K, ranging from 1 to 9, and from all tests, the best value was obtained if the value of K was 1, or 3, and for the experiments we selected K = 3.
For Logistic Regression, the dependent variable y is a value of 0 or 1, where 1 means positive and 0 means negative. Therefore, we used binomial logistic regression. Significant Omnibus test values should be below 0.05 as we use the trust level 95%. The Omnibus test with the number of independent variables yields a significance value lower than 0.05. The independent variables are able to explain 97% of the dependent variable as shown by the Square Nagelkerke value of 0.97.
The Mean Centroid method does not have any hyperparam- eters except for the choice of distance function, for which we selected the Euclidean distance.
B. Results We used five different evaluation metrics: we computed the
sensitivity (recall), specificity, precision, accuracy and the F1- score. For this we first compute the number of true positives (T P), true negatives (T N ), false positives (FP) and false negatives (FN ) on the test data. Then the metrics can be computed in Equations (1.a), (1.b), (1.c), (1.d) and (1.e) :
42 between one pattern and a DNA sequence, for 1000 DNA files, 43 and for 37 sequences). 44
𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 = *+
*+,*-
*-
(1.a)
45 IV. RESULT AND DISCUSSION 46 We performed experiments with 4 machine learning algo- 47 rithms and 5 methods for computing input features (using the 48 primers, using 37 random sequences with lengths 25, 100, 200,
𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 = (1.b) *-,1-
𝑃𝑟𝑒𝑐𝑖𝑠𝑖𝑜𝑛=*+ (1.c)
*+,1+
49 and using 100 random sequences of length 200). This results 50 in 20 different experiments. For each experiment, we use 10- 51 fold cross validation. This means that the thousand isolated
𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 = *+,*-
*+,*-,1+,1- <∙>?@ABCBDE∙+F?GBABH@
(1.d)
52 DNA data examples (with 500 positive and 500 negative 53 instances) are divided randomly into 10 groups, where each
𝐹1− 𝑆𝑐𝑜𝑟𝑒 = (1.e) >?@ABCBDE,+F?GBABH@
54 group consists of 100 isolated DNA examples. Then, for each 55 fold 100 isolated DNA examples are used as train data. 56 57 58
The results of all experiments can be found in Table I. For all classifiers, using the primers or randomly generated sequences with length 200 lead to the best results. The use of 100 sequences with randomly generated sequences of length
59 60
Page 5 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 5
1 2 3 4 TABLE I 5 SENSI TI VYTY, SPECIFI CIT Y, PRECISIO N, AND F1 - SCO RE OF EAC H LEA RNI N G METHOD WITH 5 FEA T UR E 6 EXTR AC TION A LG ORI T HMS : PH= PRIM ER HCV; 25 DR= 25 LEN GT H DNA RANDOM, 200 DR= 100 LENGTH DNA 7 RANDOM , 200 DR= 200 LENGTH DNA RANDOM, 100 -200 DR= 100 INPUT FEATURES WITH 200 LENGTH DNA RANDOM
8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
Fig. 3. Mean accuracies of the MLP, K-NN, Logistic Regression, and Mean Centroid methods using 10-fold cross validation with standard deviations
Learning Methods Comparator Pattern
PH
Sensitivy
0.990
Specificity
1.000
Precision
1.000
F1
0.995
25 DR 0.848 0.838 0.840 0.844 MLP 100 DR 0940 0.992 0.992 0.965
200 DR 0.996 0.990 0.990 0.993 100 – 200 DR 0.992 0.998 0.998 0.995
PH 0.964 1.000 1.000 0.982 25 DR 0.908 0.488 0.639 0.750
K-NN 100 DR 0.912 0.996 0.996 0.952 200 DR 1.000 0.972 0.973 0.986 100 – 200 DR 0.978 0.996 0.996 0.987 PH 0.996 1.000 1.000 0.998 25 DR 0.770 0.784 0.781 0.775 Logistic Regression 100 DR 0.918 0.962 0.960 0.939 200 DR 0.988 1.000 1.000 0.994 100 – 200 DR 0.992 0.998 0.998 0.995
PH 0.926 1.000 1.000 0.962
25 DR 0.760 0.832 0.819 0.788
Mean Centroid 100 DR 0.918 0.988 0.987 0.951
200 DR 0.916 0.980 0.979 0.946
100 – 200 DR 0.998 0.968 0.969 0.983
59 60
Page 6 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 6
1 2 200 leads to a small increase in performance compared to 3 using 37 of these sequences. The best performing classifier 4 is in general the multi-layer perceptron, closely followed by 5 logistic regression. The K-NN and Mean Centroid classifiers 6 perform worse, but still obtain decent scores. 7 The accuracies of each learning method can be observed 8 in Figure 2. Of all methods, the MLP obtains in general the 9 best performance and obtains accuracies around 99.5%, 10 meaning that there is only one error on average in 200 test 11 examples. The next highest performances are obtained with 12 Logistic Regression, although it obtains the highest accuracy 13 of 99.8% when combined with the 37 primers. The lowest 14 performance of the four tested methods is obtained with the 15 Mean Centroid method, although it still obtains quite high 16 accuracies with a mean accuracy at 96.3% when using the 17 primers and 98.3% for the 100-200 DR dataset. 18 19 C. Discussion 20 Due to the high mutation rate of HCV, more HCV subtypes 21 are produced in the world and this leads to an increase of 22 the number of primers. Not all of these primers can be useful 23 to recognize HCV in isolation, because the pattern of HCV 24 mutation may differ from region to region. This causes a 25 problem in determining which primer is most appropriate for 26 analyzing the isolated DNA. 27 In this study we used machine learning algorithms that 28 combined multiple known primers. This is a more robust and 29 effective method, but requires the use of known primers, which 30 may not be available for all kinds of diseases. Therefore, the 31 use of randomly generated DNA sequences in combination 32 with the Levenshtein distance is a much simpler method for 33 computing input features for machine learning algorithms. 34 Although no a-priori known primers are needed for this 35 method, the results show that this method is as effective and 36 in some cases even more effective than the use of primers. 37 The obtained accuracies are in general very high. The best 38 methods obtain accuracies of 99.5%, which shows the 39 effectiveness of using the machine learning algorithms for 40 detecting HCV infection in human DNA. 41 42 V. CONCLUSION AND FUTURE WORK 43 This paper proposed the use of randomly generated DNA 44 sequences to detect HCV infection in human isolated DNA. 45 Four machine learning algorithms were compared when us- 46 ing known primer sequences and the randomly generated 47 sequences. The results show that longer randomly generated 48 DNA sequences lead to comparable or better results compared 49 to the use of primers. The best classifiers are the multi-layer 50 perceptron followed by a logistic regression classifier. These 51 methods obtain near-optimal accuracies (∼ 99.5%) when using 52 the best input features. 53 Although the methods obtain very good performances, we 54 have not yet analyzed the correlation between single random 55 sequences and the presence of HCV infection. In future work, 56 we want to have a better understanding of which sequences 57 have most predictive value. Furthermore, although the results 58
for this dataset are very promising, there are many other DNA sequence classification problems. Therefore, in future work, we want to analyze the performance of the proposed methods on different problems.
ACKNOWLEDGEMENT This research is partially funded by the Indonesian Ministry of
Technology and Higher Education under WCU Program, managed by Institut Teknologi Bandung.
REFERENCES
[1] K.-C. Wong, C. Peng, and Y. Li, “Evolving Transcription Factor Binding Site Models From Protein Binding Microarray Data,” IEEE Transactions on Cybernetics, vol. 47, no. 2, pp. 415–424, 2017.
[2] E. Sanchez, W. Peng, C. Toro, C. Sanin, M. Grana, E. Szczerbickid,E. Carrasco, F. Guijarro, and L. Brualla, “Decisional DNA for modeling and reuse of experiential clinical assessments in breast cancer diagnosis and treatment,” Neurocomputing, vol. 146, pp. 308–318, 2014. [Online]. Available: https://doi.org/10.1016/j.neucom.2014.06.032
[3] S. Plakunmonthon, N. P. T-Thienprasert, K. Khongnomnan, Y. Poovorawan, and S. Payungporn, “Computational prediction of hybridization patterns between hepatitis C viral genome and human microRNAs,” Journal of Computational Science, vol. 5, no. 3, pp. 327–331, 2014. [Online]. Available: https://doi.org/10.1016/j.jocs.2013.12.005
[4] W. Wang, J. Liu, Y. Xiong, L. Zhu, and X. Zhou, “Analysis and classification of DNA-binding sites in single-stranded and double- stranded DNA-binding proteins using protein information,” IET Systems Biology, vol. 8, no. 4, pp. 176 –183, 2014. [Online]. Available: 10.1049/iet-syb.2013.0048
[5] R. N. G. Naguib, H. A. M. Sakim, M. S. Lakshmi, V. Wadehra, T. W. J. Lennard, J. Bhatavdekar, and G. V. Sherbet, “DNA ploidy and cell cycle distribution of breast cancer aspirate cells measured by image cytometry and analyzed by artificial neural networks for their prognostic significance,” IEEE Transactions on Information Technology in Biomedicine, vol. 3, no. 1, pp. 61–69, 1999.
[6] U. Maulik and D. Chakraborty, “Fuzzy Preference Based Feature Selection and Semisupervised SVM for Cancer Classification,” IEEE Transactions on NanoBioscience, vol. 13, no. 2, pp. 152–160, 2014. [Online]. Available: 10.1109/TNB.2014.2312132
[7] A. A. Raweh, M. Nassef, and A. Badr, “A Hybridized Feature Selection and Extraction Approach for Enhancing Cancer Prediction Based on DNA Methylation,” IEEE Access, vol. 6, pp. 15 212–15 223, 2018.
[8] S. Putluri, Z. U. Rahman, and S. Y. Fathima, “Cloud-based adaptive exon prediction for DNA analysis,” IET Journals & Magazines, vol. 5, no. 1, pp. 25–30, 2018.
[9] N. Yu, X. Guo, F. Gu, and Y. Pan, “Signalign: An Ontology of DNA as Signal for Comparative Gene Structure Prediction Using Information- Coding-and-Processing Techniques,” IEEE Transactions on NanoBio- science, vol. 15, no. 2, pp. 119–130, 2016.
[10] X. Ma, J. Guo, H.-D. Liu, J.-M. Xie, and X. Sun, “Sequence-Based Prediction of DNA-Binding Residues in Proteins with Conservation and Correlation Information,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 6, pp. 1766–1775, 2012.
[11] B. Park, J. Im, N. Tuvshinjargal, W. Lee, and K. Han, “Sequence- based prediction of protein-binding sites in DNA: Comparative study of two SVM models,” Computer Methods and Programs in Biomedicine, vol. 117, no. 2, pp. 158–167, 2014. [Online]. Available: 10.1016/j.cmpb.2014.07.009
[12] B. A. Kindhi, T. A. Sardjono, M. H. Purnomo, and G. J. Verkerke, “Hybrid K-means, fuzzy C-means, and hierarchical clustering for DNA hepatitis C virus trend mutation analysis,” Expert Systems with Applications, vol. 121, pp. 373–381, 2019. [Online]. Available: https://doi.org/10.1016/j.eswa.2018.12.019
[13] Y. Zhang, Y. Zhao, X. Lin, A. Wang, and D. Che, “Modeling for the prediction of Hepatitis B incidence based on integrated online search indexes,” Informatics in Medicine Unlocked, vol. 10, pp. 143–148, 2018. [Online]. Available: doi.org/10.1016/j.imu.2018.01.004
59 60
[14] R. Elber, A. Roitberg, C. Simmerling, R. Goldstein, H. Lia,
G. Verkhivker, C. Keasar, J. Zhang, and A. Ulitsky, “MOIL: A program for simulations of macromolecules,” Computer Physics Communications, vol. 91, no. 1-3, pp. 159–189, 1995. [Online]. Available: 10.1016/0010-4655(95)00047-J
[15] Y. Zhao, G. Fu, J. Wang, M. Guo, and G. Yu, “Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing,” Genomics, 2018.
59 60
9
12
15
40
53
Page 7 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 7
1 2 [16] D. Fradklin and D. Madigan, “Experiments with Random Projections
for Machine Learning,” in The Ninth ACM SIGKDD International 3 Conference on Knowledge Discovery and Data Mining , New York, 4 USA, 2003, pp. 517–522. 5 [17] F. Martíneza, P. Frías, M. D. Pérez-Godoy, and A. J. Rivera, “Dealing
with seasonality by narrowing the training set in time series 6 forecasting with kNN,” Expert Systems with Applications, vol. 103, pp. 7 38–48, 2018. 8 [18] I. Saif, Y. Kasmi, M. M. Ennaji, and K. Allali, “Prediction of DNA
methylation in the promoter of gene suppressor tumor,” Gene, vol. 651, pp. 166–173, 2018. [Online]. Available:
10 10.1016/j.gene.2018.01.082 11 [19] X. Zhou, Z. Li, Z. Dai, and X. Zou, “Prediction of methylation CpGs
and their methylation degrees in human DNA sequences,” Computers in Biology and Medicine, vol. 42, no. 4, pp. 408–413, 2012. [Online].
13 Available: 10.1016/j.compbiomed.2011.12.008 14 [20] N. Goel, S. Singh, and T. C. Aseri, “An Improved Method for Splice
Site Prediction in DNA Sequences Using Support Vector Machines,” Procedia Computer Science, vol. 57, pp. 358–367, 2015.
16 [21] W. Leyi, T. Jijun, and Z. Quan, “Local-DPP: An improved DNA-binding 17 protein prediction method by exploring local evolutionary 18 information,” Information Sciences, vol. 384, pp. 135–144, 2017.
[Online]. Available: 10.1016/j.ins.2016.06.026 19 [22] J. N.Andersen, R. L. Vecchio, N. Kannan, J. Gergel, A. F. Neuwal, 20 and N. K. Tonks, “Computational analysis of protein tyrosine 21 phosphatases: practical guide to bioinformatics and data resources,”
Methods, vol. 35, pp. 90–114, 2005. 22 [23] O. Nikdelfaz and S. Jalili, “Disease genes prediction by HMM based 23 PU-learning using gene expression profiles,” Journal of Biomedical
24 Informatics, vol. 81, pp. 102–111, 2018. [24] M. Redrejo-Rodríguez, C. D.Ordóñez, M. Berjón-Otero, J. Moreno-
25 González, C. Aparicio-Maldonado, P. Forterre, M. Salas, 26 and M. Krupovic, “Primer-Independent DNA Synthesis by a Family 27 B DNA Polymerase from Self-Replicating Mobile Genetic
Elements,” Cell Reports, vol. 21, no. 6, pp. 1574–1587, 2017. 28 [Online]. Available: 10.1016/j.celrep.2017.10.039 29 [25] M. A. Pacheco, A. S. Cepeda, R. Bernotiene, I. A. Lotta, N. E. 30 Matta, G. Valkiu nas, and A. A.Escalante, “Primers targeting
mitochondrial genes of avian haemosporidians: PCR detection and 31 differential DNA amplification of parasites belonging to different 32 genera,” International Journal for Parasitology, 2018. [Online]. 33 Available: 10.1016/j.ijpara.2018.02.003
[26] S. Dorn-In, R. Bassitta, K. Schwaiger, J. Bauer, and C. S. Hölzel, “Specific amplification of bacterial DNA by optimized so-called uni-
35 versal bacterial primers in samples rich of plant DNA,” Journal of 36 Microbiological Methods, vol. 113, pp. 50–56, 2015. 37 [27] L. Zhang, Y.-H. Wu, J. Li, W. Li, B. Liu, and G. Wu, “A self-
probing primer PCR method for detection of very short DNA fragments,” 38 Analytical Biochemistry, vol. 514, pp. 55–63, 2016. 39 [28] W. Jiang, S. Yue, S. He, C. Chen, S. Liu, H. Jiang, H. Tong, X.
Liu, J. Wang, F. Zhang, H. Sun, M. Li, and C. Wang, “New design of probe and central-homo primer pairs to improve TaqManTM PCR
41 accuracy for HBV detection,” Journal of Virological Methods, vol. 42 254, pp. 25–30, 2018. 43 [29] I. S. Weisberg and I. M. Jacobson, “Primer on Hepatitis C Virus
Resistance to Direct-Acting Antiviral Treatment: A Practical Approach 44 for the Treating Physician,” Clinics in Liver Disease, vol. 21, no. 4, pp. 45 659–672, 2017. 46 [30] S. Zhang, D. Cheng, Z. Deng, M. Zong, and X. Deng, “A novel
kNN algorithm with data-driven k parameter computation,” Pattern 47 Recognition Letters, 2017. [Online]. Available: 10.1016/j.patrec.2017. 48 09.036 49 [31] S. Bing, H. Lixin, and Y. Hon, “Adaptive clustering algorithm based
on kNN and density,” Pattern Recognition Letters, vol. 104, pp. 37– 50 44, 2018. 51 [32] A. Rezai, P. Keshavarzi, and R. Mahdiye, “A novel MLP network imple- 52 mentation in CMOL technology,” International Journal of Engineering
Science and Technology, vol. 17, no. 3, pp. 165–172, 2014. [33] M. Ekström, P. Esseen, B. Westerlund, A. Grafström, B. Jonsson,
54 and G. Ståhl, “Logistic regression for clustered data from 55 environmental monitoring programs,” Ecological Informatics, vol. 43,
56 pp. 165–173, 2018. [34] M. Rausch and M. Zehetleitner, “Should metacognition be measured
57 by logistic regression?” Consciousness and Cognition, vol. 49, pp. 291– 58 312, 2017.
[35] G. Ertas, “Detection of high GS risk group prostate tumors by diffusion tensor imaging and logistic regression modelling,” Magnetic Resonance Imaging, vol. 50, pp. 125–133, 2018.
[36] J. Shang, M. Chen, H. Ji, D. Zhou, H. Zhang, and M. Li, “Dominant trend based logistic regression for fault diagnosis in nonstationary
processes,” Control Engineering Practice, vol. 66, pp. 156–168, 2017. [37] E. Park and J. Wang, “Active learning for penalized logistic regression via sequential experimental design,” Neurocomputing, vol. 222, pp. 183–
190, 2017. [38] R. Dahm, “Discovering DNA: Friedich Miescher and the early years of
mucleic acid research”, Human Genetics, vol. 122, no. 6, pp. 565-581. 2008. [online]. Available: 10.1007/s00439-007-0433-0
[39] World Gene Bank, “Isolated DNA Samples”, 2017, [online]. Available: https://www.ncbi.nlm/nih.gov/gene.
[40] S. Carson and D. Robertson, “Expression, Detection, and Purification of recombinant Proteins from Bacteria”, in Manipulation and Expression of Recombinant DNA. Academic. Academic Press, December 2005, pp. 130-168.
[41] J. Bukh, R.H. Purcell, and R. H. Miller, “Importance of primer selection for the detection of Hepatitis C Virus RNA with the polymerase chain reaction assay”, Proceedings of the National Academy of Science, vol. 89, no.1, pp.187-191, 1992. [online]. Available: 10.1073/pnas.89.1.187
[42] N. Anggorowati, Y. Yano, D. S. Heriyanto, H. T. Rinonce, T. Utsumi, D. P. Mulya, Y.W. Subronto, and Y. Hayashi, “ Clinical and Virogical Characteristic of Hepatitis B or C Virus Co-infection with HIV in Indonesian patients”, Journal of Medical Virology, vol. 84, no. 6, pp. 857-865, 2012. [online]. Available: 10.1002/jmv.23293
Berlian Al Kindhi, receive the Ph.D. degree in Electrical Engineering Department, with Bioinformatics subject, in Institut Teknologi Sepuluh Nopember, Indonesia. She is currently an assistant profesor and machine learning researcher in Institut Teknologi Sepuluh Nopember. Her current research interest include artificial intelligence, deep learning, image processing, and bioinformatics.
Tri Arief Sardjono, receive the Ph.D. degree in Biomedical Engineering Department, University Medical Center Groningen, University of Groningen, The Netherland. He is currently an associate profesor and biomedical engineering researcher in Institut Teknologi Sepuluh Nopember. His current research interest include biomedical products.
M. Amin, receive the master degree in Microbiology Department, Medical Faculty, University of Airlangga, Indonesia. He is currently a microbiology and DNA researcher in Laboratorium Hapititis, Tropical Disease Center, Indonesia. His current research interest include Hepatitis mutation and vaccine.
G. J. Verkerke, is professor in BioMedical Product Development at both the University Medical Center Groningen and University of Twente. He is technical-scientific director of SPRINT, a Dutch Center of Research Excellence, focusing on mobility of elderly. Research interests are biomechanics and the methodical design of medical products, including teamwork± how to work together with many different disciplines in an efficient and effective way. He is also Director of the Groningen Biomedical Engineering master’s curriculum.
M. H. Purnomo, is a profesor in Biomedical Engineering, Institut Teknologi Sepuluh Nopember, Indonesia. He is a member of national university assesment committee. His research interest are artificial intelligence, machine learning, biomedical engineering, and smart system.
M. A. Wiering, Preceive Ph.D. degree in the Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (IDSIA) in the area of reinforcement learning. After his graduation in 1999, he did a short post-doc in the Intelligent Systems Group from the University of
34
59 60
Page 8 of 8 > REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 8
1 2 Amsterdam. He became assistant professor in 2000 at Utrecht 3 University, and joined ALICE in Groningen in 2007 as tenure
4 tracker. 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58
Lampiran 4. Draft Paten
1
Deskripsi
PERANGKAT ELEKTRONIK PINTAR (INTELLIGENT ELECTRONIC
DEVICE /IED) SEBAGAI PENGELOMPOKAN DAN ANALISIS DATA
5 DNA VIRUS HEPATITIS C
Bidang Teknik Invensi
Invensi ini berhubungan dengan perangkat elektronik
pintar yang dapat menghitung tingkat kemiripan pada susunan
10 nukleotida DNA dengan virus Hepatitis C primer, menganalisa
pola perubahan nukleotida virus Hepatitis C serta memberikan
rekomendasi arah trend mutasi virus Hepatitis C.
Latar Belakang Invensi
15 Virus Hepatitis C adalah jenis virus RNA. RNA
merupakan molecular dalam tubuh yang bertanggung jawab
sebagai pembawa pesan kode untuk pembentukan protein DNA
baru. Jika RNA terinfeksi virus, akibatnya DNA yang
terbentuk juga akan berubah. Sifat mutasi alami sel adalah
20 mempertahankan diri supaya tetap hidup dengan mengubah pola
kode DNA yang baru melalui pola RNA. Probabilitas perubahan
yang terjadi padad DNA juga dapat berubah-ubah tergantung
adaptasi alami atau pengobatan yang dilakukan saat itu.
Sebagai contoh satu virus atau penyakit dapat sudah resis-
25 ten terhadap antibiotik yang sama karena pola kode dari
DNA tersebut telah berubah, pada virus Hepatitis C,
peristiwa ini adalah pembentukan subtype genome baru.
Pola perkembang-biakan dan tingkat mutasi yang semakin
cepat pada virus Hepatitis C merupakan salah satu permasa-
30 lahan pada dunia kedokteran. Hingga saat ini belum ada obat
atau vaksin yang secara mutlak dapat menyembuhkan pasien
yang terinfeks. Meskipun penelitian baik dengan pendekatan
biologi kedokteran maupun bioinformatika yang menuju
penemuan vaksin tersebut sedang dilakukan. Namun hingga
saat ini, kemajuan penelitian pada sisi bioinformatika
2
masih dalam tahap analisa dan evaluasi mutasi DNA yang
terinfeksi HCV.
Pada satu file isolated DNA homo sapiens, terdapat se-
5 kitar 9000 hinga 15.000 urutan nukleotida. Pada pengolahan
machine learning, setidaknya dibutuhkan 500 file isolated
DNA sebagai data masukan. Sedang proses analisa atau
pengolahan data DNA tersebut dapat dilakukan untuk berbagai
tujuan. Jutaan urutan nukleotida menjadikan proses
10 analisa membutuhkan cukup waktu.
Perangkat ini mampu memberikan tiga keluaran analisa
yang berbeda dalam satu kali proses machine learning. Yaitu
nilai kedekatan suatu isolated DNA terhadap suatu primer,
analisa tren primer HCV, dan runutan infeksi HCV terhadap
15 isolated DNA. Mengingat jumlah data nukleotida DNA yang
mencapai jutaan, maka proses satu kali pembelajaran mesin
yang dapat menghasilkan tiga analisa sekaligus ini
merupakan salah satu terobosan baru dari sisi efisiensi dan
efektifitas waktu pada bidang analisis DNA.
20 Hasil dari pengolahan pembelajaran mesin ini juga
memiliki tingkat akurasi lebih baik jika dibandingkan
dengan tujuh metode pengelompokan yang sering digunakan
lainnya. Metode pengelompokan lain yang dibandingkan adalah
Decision Tree, Apriori, Support Vector Machine (SVM), Ex-
25 pectation-Maximitation (EM) Algorithm, k-Nearest Neighbor
(k-NN), Classification and Regression Trees CART), metode
Bayes, dan k-Means. Dari Sembilan metode pengelompokan dan
klasifikasi yang dibandingkan, terdapat tiga metode yang
memiliki nilai paling unggu dan salah satunya adanya metode
30 yang kami usulkan dalam paten ini. Ketiga metode tersebut
adalah metode SVM, EM Algorithm, dan metode yang kami
usulkan. Nilai akurasi metode yang kami usulkan
mencapai0,4% lebih presisi dari metode SVM dan 1,6% lebih
presisi dari EM Algorithm.
3
Ketika semua fitur dibuat secara online, spesifikasi
perangkat keras yang diperlukan juga harus diselaraskan
dengan persyaratan perangkat lunak. Dibutuhkan Server Data
5 yang dapat terus tumbuh seiring dengan meningkatnya
isolated DNA yang terdaftar ke dalam basis data, dan di
mana ketika basis data sedang melakukan pemeliharaan,
aplikasi akan tetap dapat menyediakan layanan selama
permintaan pengguna alih-alih mengakses basis data. Untuk
10 mengakses data DNA yang bisa mencapai jutaan, tentu saja
membutuhkan waktu lama dalam proses analisis. Proses
analisis dapat dilakukan dengan cepat jika ada beberapa PC
yang memproses perangkat lunak secara bersamaan, sehingga
hasil analisis dapat diketahui dengan cepat.
15
Ringkasan Invensi
Invensi ini berhubungan dengan perangkat yang
berupa basis data DNA dan perangkat server untuk pengolah
20 dan penganalisa data DNA. Dalam hal ini data DNA yang
disimpan ke dalam Bank DNA merupakan DNA homo sapiens normal
dan yang terinfeksi HCV, serta primer dari penyakit HCV
tersebut. Komponen-komponen dari perangkat ini berupa
desain basis data untuk penyimpan DNA, server untuk meneri-
25 ma permintaan pengolahaan data (OLAP) atau perubahan data
(OLTP). Serta satu unit perangkat untuk Graphic User
Interface (GUI). Ke-dua server ini akan saling dihubungkan
dengan melakukan tugas masing-masing sesuai dengan desain
yang telah diimplementasikan.
30 Pengolah data DNA dan bank DNA ini memiliki kecepatan
analisa yang lebih cepat jika dibandingkan dengan pengolah
data DNA yang tidak berbayar, dikarenakan pengolah data DNA
yang dibangun memiliki server yang berbeda sesuai tugasnya,
sehingga proses pengolahan data lebih ringan.
4
Uraian Singkat Gambar
Gambar 1 adalah desain sistem OLAP dan OLTP yang dibangun
untuk pengolahan dan penerimaan data.
Gambar 2 adalah hirarki server yang disebut dengan three
5 tier server dimana penyimpanan data aplikasi dan data
DNA dilakukan secara terpisah.
Gambar 3 adalah perangkat pararel computing yang terdiri
dari komputer master sebagai pembagi tugas dan
computer slave sebagai penerima tugas/proses
10
Uraian Lengkap Invensi
Invensi ini berhubungan perangkat elektronik pintar
yang melakukan normalisasi data isolated DNA dan disimpan
15 menjadi DNA bank data, selain itu perangkat ini mampu
memberikan analisa yang dapat membantu kebutuhan medis dalam
mempelajari pola kemiripan suatu sequence DNA terhadap
primer.
Perangkat elektronik pintar pintar ini terdiri dari dua
20 komputer inti sebagai master dan empat komputer pembantu
sebagai slave nodes dalam pararel computing.
Perangkat Elektronik Pintar yang diusulkan tidak hanya
mampu melakukan OLTP (Online Transaction Processing),
tetapi juga mampu melakukan OLAP (Online Analytical Pro-
25 cessing), sehingga ketika ada pengguna yang memasukkan data
baik data primer maupun isolated DNA, sistem dapat dengan
cepat melakukan pembaruan analisis transaksi dan
memperbarui data di OLAP. Transaksi dari OLTP tersebut
antara lain pemasukan, perubahan, penghapusan, maupun peng-
30 aturan trigger database oleh admin, yang memungkinkan
dilakukan oleh banyak pengguna dalam waktu yang bersamaan,
sehingga OLTP juga berperan mengatur session pada masing-
masing pengguna yang sedang bertransaksi. Pada sistem yang
diusulkan OLAP dan OLTP di lakukan pada server yang
5
berbeda, hal ini bertujuan untuk menghindari adanya
deadlock ketika sistem sedang melayani transaksi input-
output data, dan disisi lain user lain melakukan permintaan
analisa data yang cukup besar.
5 Model desain database client-server yang diusulkan
adalah Basis Data Tiga Tingkat. Database dan aplikasi
dimasukkan ke dalam server terpisah. Sehingga ketika basis
data berkembang pesat dan membutuhkan ruang baru, aplikasi
tidak akan terganggu karena berada di tempat yang terpisah.
10 Sebaliknya, ketika aplikasi membutuhkan perbaikan modul,
basis data tidak akan terganggu dan proses backup data
tetap dapat berjalan.
Perangkat komputasi parallel adalah kondisi di mana
server ada lebih dari satu PC dan bersama-sama melakukan
15 tugas analisis dengan membagi sistem kerja. Model komputasi
paralel mampu meredam kinerja pemrosesan data semantik dan
membuat perangkat keras dari server lebih tahan lama karena
tidak dipaksa untuk melakukan tugas yang berat. Model
komputasi paralel yang digunakan dalam penelitian ini
20 adalah Hadoop. Beberapa studi pendahulu menggunakan Hadoop
untuk mengelola data besar karena sistem hirarkis setiap
node dapat diatur sesuai kebutuhan. Seperti pada Gambar
3.10., PC utama yang bertugas menerima perintah dan
mengirim hasil pengerjaan yang disebut Master Node dan PC
25 lain yang membantu Master Node dalam melakukan tugas
disebut Slave Node. Pada Hadoop, karena data semakin besar,
Node Slave dapat ditambahkan lagi tanpa harus mematikan
atau membangun kembali sistem yang telah berjalan.
Dalam dunia industri, perangkat pintar ini dapat
30 dimanfaatkan sebagai dasar dalam penentuan mutasi. Selain
itu, perangkat ini juga dapat digunakan menjadi bahan
pertimbangan dalam pembuatan vaksin pada industri obat-
obatan.
6
Klaim
1. Perangkat elektronik pintar analisis DNA yang
diusulkan terdiri dari perangkat server dan aplikasi
5 cerdas
2. Perangkat server seperti pada klaim 1 dari tiga
server yaitu server Online Analytical Processing
(OLAP), server Online Transaction Processing (OLTP),
dan server Data Warehouse (DW).
10 3. Membagi tugas server menggunakan sistem Hadoop, dimana
satu computer bertugas sebagai Master dan empat
komputer lainnya sebagai komputer yang menerima
pekerjaan dari master node atau disebut juga dengan
slaves node.
15 4. Proses normalisasi data set DNA yang diperoleh dari
Gen Bank Dunia dilakukan dengan metode penghitungan
jarak perbedaan antara nukleotida pada isolated DNA
dengan menambahkan satu jarak untuk setiap aksi yang
dilakukan
20 5. Perangkat Elektronik Pintar melakukan proses analisa
DNA dengan metode hybrid yaitu dengan menggabungkan
tiga kelebihan metode pengelompokan dalam satu proses.
6. Perangkat Elektronik Pintar sebagai analisa DNA mampu
memberikan tiga hasil analisa sekaligus dalam suatu
25 proses pengolahan, yaitu pertama yang diperoleh ada-
lah analisa tren mutasi dengan melihat jumlah sequence
node yang tergabung dalam centroid (primer). Hasil ke-
dua yang diperoleh adalah analisa primer mampu
mengenali adanya virus/bakteri pada hampir semua data
30 isolated DNA yang diujikan. Hasil ke-tiga adalah
menganalisa analisa runutan keterkaitan suatu primer
dengan isolated DNA yang berasal dari berbagai negara
di dunia.
7
Abstrak
PERANGKAT ELEKTRONIK PINTAR (INTELLIGENT ELECTRONIC
DEVICE /IED) SEBAGAI PENGELOMPOKAN DAN ANALISIS DATA
DNA HEPATITIS C VIRUS (HCV)
5 Invensi ini berhubungan dengan teknologi cerdas yang
terdiri dari perangkat server dan aplikasi pembelajaran
mesin. Perangkat ini berfungsi untuk melakukan analisis DNA
yang mampu menghasilkan analisa yang dibutuhkan pada jutaan
nukleotida DNA.
10 Peralatan server terdiri dari server untuk aplikasi
dan database. Perangkat Database dibagi menjadi database
untuk aplikasi dan penyimpanan data, sedangkan untuk
pengolahannya menggunakan pengolahan data warehouse. Untuk
15 aplikasi di bagi menjadi pernormaliasi data masukan dan
pengolahan data untuk mendapatkan analisa.
Data yang di inputkan ke dalam aplikasi adalah data
primer penyakit dan data isolated DNA. Inputan data
tersebut di simpan ke dalam database sebagai bank data.
20 Kemudian sistem akan melakukan normalisasi terhadap seluruh
data yang dimasukkan. Seluruh data hasil normalisasi akan
menjadi masukan dalam proses mesin pembelajaran dan
menghasilkan tiga analisa yang berbeda yaitu tren mutasi,
kecenderungan primer, dan runutan keterkaitan primer dengan
25 isolated DNA.
8
OLTP Applilcation
OLAP Application
Data Warehouse (DW)
Server OLAP
Server DW
User Presentation Layer
Application LayerDatabase Layer
Gambar 1.
User 1
User 2
User n..
ApplicationServer
Database Server
User Application Layer
Application Layer Database Layer
First Tier Second Tier Third Tier
Gambar 2
9
Task Tracker
Job Tracker
Name Node
Data Node
Task Tracker 1 Task Tracker 2 Task Tracker N..
Data Node 1 Data Node 2 Data Node N..
HDFSLayer
Map ReduceLayer
Master NodeSlave Node 1 Slave Node 2 Slave Node N..
Gambar 3