Source: diyahpuspitaningrum.net/2016a_Java_and_WEKA_Programming.pdf
Modul Pemrograman Java dengan WEKA (Java Programming with WEKA Module)
PUBLIC LECTURE MATERIAL IN INFORMATICS, DATA MINING SERIES:
JAVA AND WEKA PROGRAMMING
FACULTY OF ENGINEERING
UNIVERSITAS BENGKULU, DECEMBER 2016
PRESENTERS:
DR. DIYAH PUSPITANINGRUM, S.T., M.KOM.
KURNIA ANGGRIANI, S.T., M.KOM.
TABLE OF CONTENTS
1. Java Programming
2. Java GUI with WindowBuilder
3. Introduction to WEKA (the WEKA GUI)
4. Java Programming with WEKA
1 JAVA PROGRAMMING

Java is an object-oriented programming language similar to C++ and Smalltalk. Java is platform-neutral: it does not depend on any particular platform and follows the WORA principle (Write Once, Run Anywhere).
The software we use for programming here is the Eclipse IDE and Java SE 8u60. Eclipse is an open-source community that aims to produce an open programming platform. Eclipse consists of an extensible framework and tools for building and managing software from initial development through release. The Eclipse platform is backed by a large ecosystem of technology vendors, innovative start-ups, universities, research institutions, and individuals.
The steps are:
1. Run the "Java SE 8u60" installer and follow its steps through Finish.
2. Then launch Eclipse: locate "eclipse.exe" in the eclipse folder and choose Run as administrator (create a desktop shortcut first so it is quicker to launch next time).
Figure 1. Opening Eclipse from the eclipse folder
Figure 2. Opening Eclipse via the shortcut
Figure 3. Creating a shortcut
Figure 4. Waiting for Eclipse to open
Figure 5. Choosing the workspace location. Once Eclipse opens, the next step is to choose the folder where our projects will be stored. Then click OK.
Figure 6. Creating a new project. Next we move to the Eclipse menus. Since we are creating a new project, choose File > New > Project… (fill in the project name and the other fields).
Figure 7. Creating a package. After creating the project, create a package: right-click the "src" folder inside the project folder, then New > Package… (fill in the package name and the other fields).
Figure 8. Creating a class
After creating the package, create a class: right-click the package folder, then New > Class. This class will contain our program.
Figure 9. Creating an object
Fill in the class/object name and choose its method stubs. Here I name the class "objeklaptop", because this time the laptop is the object, and it has several attributes. This class declares the laptop attributes that we will fill in later.

Script:

package laptop; // this class lives in the package "laptop"

public class objeklaptop { // the class name is written here automatically; other classes use this name to refer to it when needed
    String merk;     // laptop attribute "merk" (brand), of type String
    int ram;         // laptop attribute "ram", of type int
    String procesor; // laptop attribute "procesor", of type String
    int harga;       // laptop attribute "harga" (price), of type int

    void infolaptop() { // prints the data filled in by the class "inputlaptop"
        System.out.println("Merk     : " + merk);     // prints the value of "merk"
        System.out.println("RAM      : " + ram);      // prints the value of "ram"
        System.out.println("Processor: " + procesor); // prints the value of "procesor"
        System.out.println("Harga    : " + harga);    // prints the value of "harga"
    }
}
Figure 10. Creating the input class
Create another class, this time "inputlaptop". In this class the laptop is still the object; the class supplies the attribute values that go into the class "objeklaptop".

Script:

package laptop; // this class lives in the package "laptop"

import java.util.Scanner;

public class inputlaptop { // the class name is written here automatically; other classes use this name to refer to it when needed

    public static void main(String[] args) {
        objeklaptop jenis = new objeklaptop(); // an instance of "objeklaptop" named "jenis"
        jenis.merk = "ASUS";      // value for the "merk" field of "objeklaptop"
        jenis.harga = 5000000;    // value for the "harga" field
        jenis.procesor = "intel"; // value for the "procesor" field
        jenis.ram = 2;            // value for the "ram" field
        jenis.infolaptop();       // print the values set above

        Scanner masuk = new Scanner(System.in); // read user input after the program runs
        System.out.println(" ");
        System.out.println("Processor: ");      // prompt
        jenis.procesor = masuk.nextLine();      // input goes into "procesor"
        System.out.println("Merk : ");
        jenis.merk = masuk.nextLine();          // input goes into "merk"
        System.out.println("RAM : ");
        jenis.ram = masuk.nextInt();            // input goes into "ram"
        System.out.println("Harga : ");
        jenis.harga = masuk.nextInt();          // input goes into "harga"
        jenis.infolaptop();

        Scanner baru = new Scanner(System.in);  // second round of input
        System.out.println(" ");
        System.out.println("Processor: ");
        jenis.procesor = baru.nextLine();
        System.out.println("Merk : ");
        jenis.merk = baru.nextLine();
        System.out.println("RAM : ");
        jenis.ram = baru.nextInt();
        System.out.println("Harga : ");
        jenis.harga = baru.nextInt();
        jenis.infolaptop();
    }
}
Figure 11. Run output. When the program is run, the result looks like the figure above: Merk, RAM, and so on appear as written in the class "inputlaptop", the values shown in blue are typed in directly after the program starts, and the resulting output appears below them. Source: Saras Noya, "Program Berorientasi Object berbasis java dengan software eclipse", UNNES.
THE "HELLO WORLD" PROGRAM
A simpler program ("Hello World") is built as follows.
First, start the Eclipse IDE.
You will land on the Eclipse welcome page, which looks roughly like this.
From the welcome page we are ready to create a new project. The steps for creating a new program in Eclipse are:
1. Click the workbench icon in Eclipse.
2. Once inside the workbench, Eclipse looks roughly like this.
3. Next, create a new project via the menu bar: File > New > Project…
4. In the wizard choose Java Project, then click Next >.
5. A dialog appears for setting the project name, its location, and the JRE version to use. Click Next >.
6. On the Java Settings dialog, click Finish.
7. If a dialog like this appears, choose Yes to switch the Eclipse editor to the Java perspective.
8. Once the project is created, it appears in the project list on the left.
9. Next, create a package, i.e. a folder structure that keeps the .java files we will create organized. Expand the project > right-click the 'src' folder > New > Package.
10. Enter the package name that will become the folder structure, then click Finish. For example, com.helloworld.main means we create the folder structure com/helloworld/main, which will hold the .java files.
11. After creating the package, create a class, i.e. a .java file that will contain the Java code to run as a program. Right-click the package you just created > New > Class.
12. A dialog appears for naming the file to be created. Don't forget to tick the 'public static void main(String[] args)' option so the file is executed when we run the project. Then click Finish.
13. The view then changes to something like this, indicating that the .java file has been created and is ready to receive Java code.
14. Type the following code to print the text 'Hello World!', then press Ctrl + S to save the file.
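The code in the screenshot is not reproduced in this text; it is the standard Hello World class. A minimal version, assuming the class was named HelloWorld in step 12, looks like this (to match step 10, add "package com.helloworld.main;" as the first line):

```java
// Assumes the class was named HelloWorld when it was created in step 12.
public class HelloWorld {
    // This method runs when the project is started as a Java Application.
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
```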
15. Once saved, run the program by right-clicking the project > Run As > Java Application.
Note that the Console view shows the text 'Hello World!', meaning the program ran successfully.
And that is how you build a Java program with Eclipse. Very simple.
2 JAVA GUI PROGRAMMING WITH WINDOWBUILDER
Building a form with Java, Eclipse, and WindowBuilder

This time we will build a Java program with Eclipse (I am using Eclipse Juno). For those who do not know it yet, Eclipse is an IDE: software for writing, compiling, and testing Java programs (Eclipse actually supports other languages too, such as C++ and PHP). Like NetBeans (another IDE), Eclipse is free, but in my personal experience Eclipse is faster at running programs. Eclipse Juno also ships with WindowBuilder Pro, a plugin for designing GUI forms by drag and drop, as in NetBeans. The advantage of WindowBuilder Pro is that the code produced by dragging and dropping is translated directly into Java and none of it is hidden (in NetBeans, not all of the form-design code is shown in the code window).
To build a form, follow these steps.
1. Open Eclipse.
2. Create a new project: File >> New >> Java Project.
3. Enter a project name, e.g. test, then click Finish.
4. Right-click the new project >> New >> Other.
5. Click the small triangle next to WindowBuilder >> Swing Designer >> JFrame >> Next >> fill in the package name and the JFrame form name as desired >> click Finish.
6. The new form has been created; by default the Source view is shown.
7. To switch to the design view, click the Design tab.
8. Next we will change the frame's icon. First create a new folder to hold the icon image: right-click the src folder >> New >> Folder >> make sure src is selected, enter the folder name you want >> click Finish.
9. Open Windows Explorer, right-click the image to use as the icon, and choose Copy.
10. Back in Eclipse, select the new folder, right-click, and Paste.
11. Go to the Design view in Eclipse and click the title area of the frame on the canvas (the region between the icon and the close button) >> open the Properties view >> iconImage, click the ... button >> select the "Classpath resource …" radio button >> pick the image file >> OK.
12. Change the frame title: in the Properties view choose title and replace it with the name you want.
13. To run, use the Run menu >> Run, or press Ctrl + F11.
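WindowBuilder writes the design directly into the .java source, so the result of steps 5-13 is ordinary Swing code. The generated file is roughly equivalent to this hand-written sketch; the class name, title text, and icon path are illustrative assumptions, not the wizard's exact output:

```java
import java.awt.EventQueue;
import javax.swing.ImageIcon;
import javax.swing.JFrame;

public class MainFrame extends JFrame {

    public MainFrame() {
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        setTitle("My First Form"); // step 12: the "title" property
        // Step 11: load the icon as a classpath resource. "/images/icon.png"
        // stands for whatever folder and file name you created under src.
        java.net.URL iconUrl = MainFrame.class.getResource("/images/icon.png");
        if (iconUrl != null) {
            setIconImage(new ImageIcon(iconUrl).getImage());
        }
        setSize(400, 300);
    }

    public static void main(String[] args) {
        // Swing components should be created on the event dispatch thread.
        EventQueue.invokeLater(() -> new MainFrame().setVisible(true));
    }
}
```

Because nothing is hidden, any property set in the Properties view (title, iconImage, size) can equally well be edited by hand in this code.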
Tutorial: Building a Calculator with Eclipse
A calculator is a computing device people use to help with everyday arithmetic. The calculators we usually use are hardware calculators: we cannot see the process going on inside them, we only see the result appear, faster than our own manual calculation. This time I will explain how to build a calculator in Java with Eclipse, including what happens inside it.
1. First open Eclipse (install the Java JDK beforehand).
2. Create a Java project: File - New - Project; let's name it "Kalkulator".
3. Create a package to hold the main entry point (Main.java); let's name it calculator: File - New - Package.
4. Create the main class, e.g. Main.java, inside the calculator package:
File - New - Class
5. Put this script in Main.java:
package calculator;
import view.CalculatorView;
public class Main {
public static void main(String[] args) {
new CalculatorView().setVisible(true);
}
}
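The Main class above launches view.CalculatorView, and the view in turn delegates the arithmetic to a class model.CalculatorModel whose listing is not included in this module. Below is a minimal sketch consistent with the calls the view makes later (setOperand on the display string, setOperator with codes 1..6 for + - * / % and 1/x, process, getResult, setResult); it is a hypothetical reconstruction, not the original class:

```java
// Belongs in the package "model"; the package declaration is omitted here
// so the snippet stands alone. Hypothetical reconstruction of CalculatorModel.
public class CalculatorModel {
    private double result;   // left operand / last computed result
    private double operand;  // the number currently being typed
    private int operator;    // 1:+ 2:- 3:* 4:/ 5:% 6:1/x (codes used by the view)

    public void setOperand(String s) {
        // The view passes the display string built from the digit buttons.
        operand = s.isEmpty() ? 0 : Double.parseDouble(s);
    }

    public void setOperator(int opt) {
        result = operand; // remember the left operand before the next digits arrive
        operator = opt;
    }

    public void setResult(double r) { result = r; }

    public double getResult() { return result; }

    public void process() {
        switch (operator) {
            case 1: result = result + operand; break;
            case 2: result = result - operand; break;
            case 3: result = result * operand; break;
            case 4: result = result / operand; break;
            case 5: result = result % operand; break;
            case 6: result = 1 / operand;      break;
            default: result = operand;         // "=" pressed with no operator
        }
        operator = 0;
    }
}
```

For example, entering 12, pressing +, entering 30, and pressing = would call setOperand("12"), setOperator(1), setOperand("30"), process(), after which getResult() is 42.0.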
6. Create a package named view to hold the class CalculatorView.java: File - New - Package, as when creating the package above.
7. Create the main class of the view package, e.g. CalculatorView.java:
File - New - Class, as when creating the class above.
8. Put this script in CalculatorView.java:
package view;
import java.text.DecimalFormat;
import model.CalculatorModel;
public class CalculatorView extends javax.swing.JFrame {
private static final long serialVersionUID = 1L;
public CalculatorView() {
initComponents();
}
CalculatorModel model = new CalculatorModel();
String operand="";
public void getOperand(javax.swing.JButton button){
operand+=button.getText();
model.setOperand(operand);
resultLabel.setText(operand);
}
private void getOperator(int opt){
model.setOperator(opt);
operand="";
}
private void process(){
DecimalFormat df = new DecimalFormat("#,###.########");
model.process();
operand = "";
resultLabel.setText(df.format(model.getResult())+"");
}
private void initComponents() {
jPanel1 = new javax.swing.JPanel();
resultLabel = new javax.swing.JLabel();
jPanel2 = new javax.swing.JPanel();
button7 = new javax.swing.JButton();
button4 = new javax.swing.JButton();
button1 = new javax.swing.JButton();
buttonKoma = new javax.swing.JButton();
button11 = new javax.swing.JButton();
button12 = new javax.swing.JButton();
button2 = new javax.swing.JButton();
button3 = new javax.swing.JButton();
button5 = new javax.swing.JButton();
button6 = new javax.swing.JButton();
button8 = new javax.swing.JButton();
button9 = new javax.swing.JButton();
jPanel3 = new javax.swing.JPanel();
buttonBagi = new javax.swing.JButton();
buttonKali = new javax.swing.JButton();
buttonKurang = new javax.swing.JButton();
buttonTambah = new javax.swing.JButton();
buttonAC = new javax.swing.JButton();
buttonModulus = new javax.swing.JButton();
buttonSeper = new javax.swing.JButton();
buttonSamaDengan = new javax.swing.JButton();
setDefaultCloseOperation(javax.swing.WindowConstants.EXIT_ON_CLOSE);
setTitle("Kalkulator Java with Eclipse");
jPanel1.setBackground(new java.awt.Color(150, 250, 200));
resultLabel.setBackground(new java.awt.Color(255, 255, 200));
resultLabel.setFont(new java.awt.Font("Microsoft Sans Serif", 0, 36));
resultLabel.setHorizontalAlignment(javax.swing.SwingConstants.RIGHT);
resultLabel.setText("0");
javax.swing.GroupLayout jPanel1Layout = new javax.swing.GroupLayout(jPanel1);
jPanel1.setLayout(jPanel1Layout);
jPanel1Layout.setHorizontalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addContainerGap()
.addComponent(resultLabel, javax.swing.GroupLayout.DEFAULT_SIZE, 268,
Short.MAX_VALUE)
.addContainerGap())
);
jPanel1Layout.setVerticalGroup(
jPanel1Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel1Layout.createSequentialGroup()
.addContainerGap()
.addComponent(resultLabel, javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addContainerGap())
);
jPanel2.setBackground(new java.awt.Color(150, 250, 200));
button7.setText("7");
button7.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button7ActionPerformed(evt);
}
});
button4.setText("4");
button4.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button4ActionPerformed(evt);
}
});
button1.setText("1");
button1.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button1ActionPerformed(evt);
}
});
buttonKoma.setText(".");
buttonKoma.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonKomaActionPerformed(evt);
}
});
button11.setText("0");
button11.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button11ActionPerformed(evt);
}
});
button12.setText("C");
button12.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button12ActionPerformed(evt);
}
});
button2.setText("2");
button2.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button2ActionPerformed(evt);
}
});
button3.setText("3");
button3.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button3ActionPerformed(evt);
}
});
button5.setText("5");
button5.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button5ActionPerformed(evt);
}
});
button6.setText("6");
button6.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button6ActionPerformed(evt);
}
});
button8.setText("8");
button8.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button8ActionPerformed(evt);
}
});
button9.setText("9");
button9.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
button9ActionPerformed(evt);
}
});
javax.swing.GroupLayout jPanel2Layout = new javax.swing.GroupLayout(jPanel2);
jPanel2.setLayout(jPanel2Layout);
jPanel2Layout.setHorizontalGroup(
jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel2Layout.createSequentialGroup()
.addContainerGap()
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel2Layout.createSequentialGroup()
.addComponent(button7, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button8, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button9, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel2Layout.createSequentialGroup()
.addComponent(button4, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button5, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button6, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel2Layout.createSequentialGroup()
.addComponent(button1, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button2, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button3, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel2Layout.createSequentialGroup()
.addComponent(buttonKoma, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button11, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(button12, javax.swing.GroupLayout.PREFERRED_SIZE, 42,
javax.swing.GroupLayout.PREFERRED_SIZE)))
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE))
);
jPanel2Layout.setVerticalGroup(
jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel2Layout.createSequentialGroup()
.addContainerGap()
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(button7, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button8, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button9, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(button4, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button5, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button6, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(button1, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button2, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button3, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel2Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(buttonKoma, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button11, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(button12, javax.swing.GroupLayout.PREFERRED_SIZE, 35,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE))
);
jPanel3.setBackground(new java.awt.Color(150, 250, 200));
buttonBagi.setText("/");
buttonBagi.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonBagiActionPerformed(evt);
}
});
buttonKali.setText("*");
buttonKali.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonKaliActionPerformed(evt);
}
});
buttonKurang.setText("-");
buttonKurang.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonKurangActionPerformed(evt);
}
});
buttonTambah.setText("+");
buttonTambah.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonTambahActionPerformed(evt);
}
});
buttonAC.setText("AC");
buttonAC.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonACActionPerformed(evt);
}
});
buttonModulus.setText("%");
buttonModulus.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonModulusActionPerformed(evt);
}
});
buttonSeper.setText("1/x");
buttonSeper.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonSeperActionPerformed(evt);
}
});
buttonSamaDengan.setText("=");
buttonSamaDengan.addActionListener(new java.awt.event.ActionListener() {
public void actionPerformed(java.awt.event.ActionEvent evt) {
buttonSamaDenganActionPerformed(evt);
}
});
javax.swing.GroupLayout jPanel3Layout = new javax.swing.GroupLayout(jPanel3);
jPanel3.setLayout(jPanel3Layout);
jPanel3Layout.setHorizontalGroup(
jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel3Layout.createSequentialGroup()
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel3Layout.createSequentialGroup()
.addComponent(buttonKurang, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(buttonSeper, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel3Layout.createSequentialGroup()
.addComponent(buttonTambah, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(buttonSamaDengan, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.TRAILING, false)
.addGroup(javax.swing.GroupLayout.Alignment.LEADING,
jPanel3Layout.createSequentialGroup()
.addComponent(buttonKali, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(buttonModulus, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addGroup(javax.swing.GroupLayout.Alignment.LEADING,
jPanel3Layout.createSequentialGroup()
.addComponent(buttonBagi, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addGap(18, 18, 18)
.addComponent(buttonAC, javax.swing.GroupLayout.PREFERRED_SIZE, 45,
javax.swing.GroupLayout.PREFERRED_SIZE))))
.addContainerGap())
);
jPanel3Layout.setVerticalGroup(
jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(jPanel3Layout.createSequentialGroup()
.addContainerGap()
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(buttonBagi, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(buttonAC, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.UNRELATED)
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(buttonKali, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(buttonModulus, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(buttonKurang, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(buttonSeper, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(jPanel3Layout.createParallelGroup(javax.swing.GroupLayout.Alignment.BASELINE)
.addComponent(buttonTambah, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addComponent(buttonSamaDengan, javax.swing.GroupLayout.PREFERRED_SIZE, 33,
javax.swing.GroupLayout.PREFERRED_SIZE))
.addContainerGap(14, Short.MAX_VALUE))
);
javax.swing.GroupLayout layout = new javax.swing.GroupLayout(getContentPane());
getContentPane().setLayout(layout);
layout.setHorizontalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addContainerGap()
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addGroup(layout.createSequentialGroup()
.addComponent(jPanel2, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addComponent(jPanel3, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE)))
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE))
);
layout.setVerticalGroup(
layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING)
.addGroup(layout.createSequentialGroup()
.addContainerGap()
.addComponent(jPanel1, javax.swing.GroupLayout.PREFERRED_SIZE, 60,
javax.swing.GroupLayout.PREFERRED_SIZE)
.addPreferredGap(javax.swing.LayoutStyle.ComponentPlacement.RELATED)
.addGroup(layout.createParallelGroup(javax.swing.GroupLayout.Alignment.LEADING, false)
.addComponent(jPanel3, javax.swing.GroupLayout.DEFAULT_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE)
.addComponent(jPanel2, javax.swing.GroupLayout.PREFERRED_SIZE,
javax.swing.GroupLayout.DEFAULT_SIZE, javax.swing.GroupLayout.PREFERRED_SIZE))
.addContainerGap(javax.swing.GroupLayout.DEFAULT_SIZE, Short.MAX_VALUE))
);
pack();
}
private void button1ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button1ActionPerformed
getOperand(button1);
}
private void button2ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button2ActionPerformed
getOperand(button2);
}
private void button3ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button3ActionPerformed
getOperand(button3);
}
private void button4ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button4ActionPerformed
getOperand(button4);
}
private void button5ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button5ActionPerformed
getOperand(button5);
}
private void button6ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button6ActionPerformed
getOperand(button6);
}
private void button7ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button7ActionPerformed
getOperand(button7);
}
private void button8ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button8ActionPerformed
getOperand(button8);
}
private void button9ActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_button9ActionPerformed
getOperand(button9);
}
private void buttonTambahActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_buttonTambahActionPerformed
getOperator(1);
}
private void buttonKurangActionPerformed(java.awt.event.ActionEvent evt) {//GEN-FIRST:event_buttonKurangActionPerformed
getOperator(2);
}
ModulPemrogramanJavadenganWEKA Page38
private void buttonKaliActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonKaliActionPerformed
getOperator(3);
}
private void buttonBagiActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonBagiActionPerformed
getOperator(4);
}
private void buttonModulusActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonModulusActionPerformed
getOperator(5);
}
private void buttonSeperActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonSeperActionPerformed
getOperator(6);
}
private void buttonSamaDenganActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonSamaDenganActionPerformed
process();
}
private void button11ActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_button11ActionPerformed
getOperand(button11);
}
private void buttonKomaActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonKomaActionPerformed
getOperand(buttonKoma);
}
private void button12ActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_button12ActionPerformed
if(operand.length()>1){
operand = operand.substring(0, operand.length()‐1);
model.setOperand(operand);
resultLabel.setText(operand);
}else{
operand = "";
model.setOperand(operand);
ModulPemrogramanJavadenganWEKA Page39
resultLabel.setText("0");
}
}
private void buttonACActionPerformed(java.awt.event.ActionEvent evt) {//GEN‐
FIRST:event_buttonACActionPerformed
operand = "";
model.setOperator(0);
model.setResult(0);
resultLabel.setText("0");
}
private javax.swing.JButton button1;
private javax.swing.JButton button11;
private javax.swing.JButton button12;
private javax.swing.JButton button2;
private javax.swing.JButton button3;
private javax.swing.JButton button4;
private javax.swing.JButton button5;
private javax.swing.JButton button6;
private javax.swing.JButton button7;
private javax.swing.JButton button8;
private javax.swing.JButton button9;
private javax.swing.JButton buttonAC;
private javax.swing.JButton buttonBagi;
private javax.swing.JButton buttonKali;
private javax.swing.JButton buttonKoma;
private javax.swing.JButton buttonKurang;
private javax.swing.JButton buttonModulus;
private javax.swing.JButton buttonSamaDengan;
private javax.swing.JButton buttonSeper;
private javax.swing.JButton buttonTambah;
private javax.swing.JPanel jPanel1;
private javax.swing.JPanel jPanel2;
private javax.swing.JPanel jPanel3;
private javax.swing.JLabel resultLabel;
}
8. Create a package named model to hold the model class, in the same way as the package was created above: File - New - Package.
9. Create the class CalculatorModel.java inside the model package, again via File - New - Class.
10. Fill CalculatorModel.java with the following code:
package model;

public class CalculatorModel {
    int operator = 0;
    double operand1 = 0;
    double operand2 = 0;
    double result = 0;

    public void setOperand(String opr) {
        if (!opr.equals("")) {
            if (operator == 0) {
                operand1 = Double.valueOf(opr);
            } else {
                operand2 = Double.valueOf(opr);
            }
        }
    }

    public void setOperator(int operator) {
        this.operator = operator;
    }

    public double getResult() {
        return result;
    }

    public void setResult(double hasil) {
        this.result = hasil;
    }

    public void process() {
        switch (operator) {
            case 1: result = operand1 + operand2; break;
            case 2: result = operand1 - operand2; break;
            case 3: result = operand1 * operand2; break;
            case 4: result = operand1 / operand2; break;
            case 5: result = operand1 % operand2; break;
            case 6: result = 1 / operand1; break;
            default: result = operand1;
        }
        operand1 = result;
    }

    public void setOperand1(double operand1) {
        this.operand1 = operand1;
    }

    public void setOperand2(double operand2) {
        this.operand2 = operand2;
    }

    public void clear() {
        setOperand1(0);
        setOperand2(0);
        setResult(0);
        setOperator(0);
    }
}

11. Run the program by clicking the green triangle (Run) button at the top of the toolbar, and the result looks like the following.
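The CalculatorModel above can also be exercised without the GUI. The following is a minimal sketch (a hypothetical CalculatorDemo class, not part of the module; the model is copied inline so the file compiles on its own) that drives the model the same way the button handlers do:

```java
// Console sketch of the CalculatorModel above. "2 + 3 =" mirrors the
// button flow: operand -> operator -> operand -> process().
public class CalculatorDemo {
    static class CalculatorModel {
        int operator = 0;
        double operand1 = 0, operand2 = 0, result = 0;

        void setOperand(String opr) {
            if (!opr.equals("")) {
                if (operator == 0) operand1 = Double.valueOf(opr);
                else operand2 = Double.valueOf(opr);
            }
        }
        void setOperator(int operator) { this.operator = operator; }
        double getResult() { return result; }

        void process() {
            switch (operator) {
                case 1: result = operand1 + operand2; break;
                case 2: result = operand1 - operand2; break;
                case 3: result = operand1 * operand2; break;
                case 4: result = operand1 / operand2; break;
                case 5: result = operand1 % operand2; break;
                case 6: result = 1 / operand1; break;
                default: result = operand1;
            }
            operand1 = result;
        }
    }

    public static double evaluate(String a, int op, String b) {
        CalculatorModel m = new CalculatorModel();
        m.setOperand(a);   // stored in operand1 (operator is still 0)
        m.setOperator(op); // 1 = "+", as in buttonTambahActionPerformed
        m.setOperand(b);   // now stored in operand2
        m.process();
        return m.getResult();
    }

    public static void main(String[] args) {
        System.out.println(evaluate("2", 1, "3")); // 5.0
    }
}
```

Note how process() copies result back into operand1, which is what lets the real calculator chain operations after pressing "=".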
Example: A GUI-Based Java Project

Figure: the GUI of the example form

Program code:

import javax.swing.*;   // Swing components
import java.awt.*;
import java.awt.event.*;

public class Form extends JFrame { // inherits from JFrame

    public Form() {
        // CREATE THE OBJECTS
        JPanel panel1 = new JPanel();            // panel 1
        JPanel panel2 = new JPanel();            // panel 2
        Container con = this.getContentPane();   // content container

        // create the radio buttons
        final JRadioButton rbAnggota1 = new JRadioButton("Anggota Satu");
        final JRadioButton rbAnggota2 = new JRadioButton("Anggota Dua");
        final JRadioButton rbAnggota3 = new JRadioButton("Anggota Tiga");
        final JRadioButton rbAnggota4 = new JRadioButton("Anggota Empat");

        // create the button group
        ButtonGroup radioBgroup = new ButtonGroup();

        // create the labels
        JLabel lblNIM = new JLabel("NIM ");
        final JLabel lblNama = new JLabel("Nama ");
        final JLabel lblJK = new JLabel("Jenis Kelamin ");

        // create the text fields
        final JTextField txtNIM = new JTextField(5);
        final JTextField txtNama = new JTextField(5);
        final JTextField txtJK = new JTextField(5);

        final JButton cmdTampil = new JButton("Tampil");
        final JButton cmdKosong = new JButton("Kosongkan");
        final JButton cmdExit = new JButton("Keluar");

        // layout configuration
        con.setLayout(new GridLayout(1, 2));
        // panel1.setLayout(new GridLayout(3, 3, 2, 5));
        panel2.setLayout(new GridLayout(6, 3, 2, 5));
        panel1.setBorder(BorderFactory.createTitledBorder("Anggota"));
        panel2.setBorder(BorderFactory.createTitledBorder("Data"));

        // add the panels to the window
        con.add(panel1);
        con.add(panel2);

        // register the radio buttons as one group
        radioBgroup.add(rbAnggota1);
        radioBgroup.add(rbAnggota2);
        radioBgroup.add(rbAnggota3);
        radioBgroup.add(rbAnggota4);

        // add components to panel 1
        panel1.add(rbAnggota1);
        panel1.add(rbAnggota2);
        panel1.add(rbAnggota3);
        panel1.add(rbAnggota4);

        // add components to panel 2
        panel2.add(lblNIM);
        panel2.add(txtNIM);
        panel2.add(lblNama);
        panel2.add(txtNama);
        panel2.add(lblJK);
        panel2.add(txtJK);
        panel2.add(cmdTampil);
        panel2.add(cmdKosong);
        panel2.add(cmdExit);

        // attach event handling to the command buttons
        cmdTampil.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                if (rbAnggota1.isSelected()) {
                    txtNIM.setText("07.11.1382");
                    txtNama.setText("Syarief Hidayatulloh");
                    txtJK.setText("laki-laki");
                }
                if (rbAnggota2.isSelected()) {
                    txtNIM.setText("07.11.1356");
                    txtNama.setText("Arif W Nugroho");
                    txtJK.setText("laki-laki");
                }
                if (rbAnggota3.isSelected()) {
                    txtNIM.setText("07.11.1420");
                    txtNama.setText("Galuh Ristyanto");
                    txtJK.setText("laki-laki");
                }
                if (rbAnggota4.isSelected()) {
                    txtNIM.setText("07.11.1385");
                    txtNama.setText("Yuni Ardita Sari Dewi");
                    txtJK.setText("Perempuan");
                }
            }
        });
        cmdKosong.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                txtNIM.setText(" ");
                txtNama.setText(" ");
                txtJK.setText(" ");
            }
        });
        cmdExit.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent ae) {
                System.exit(0); // clean exit
            }
        });

        // show the window
        this.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
        this.setLocation(40, 120);
        this.setSize(520, 230);
        this.setVisible(true);
    }

    public static void main(String[] args) {
        new Form();
    }
}

source: http://mtsox.wordpress.com/2008/10/22/contoh-java-gui/
3 I N T R O D U C T I O N   T O   W E K A ( W E K A   G U I )
http://www.cs.waikato.ac.nz/ml/weka/
Fundamentals of Data Mining
Definition of Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data. It is also referred to as knowledge mining from data, knowledge extraction, data archaeology, and data dredging.

Applications of Data Mining:
Business Intelligence applications
Insurance
Banking
Medicine
Retail/Marketing etc.
Functionalities of Data Mining:
These functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into 2 categories:
Descriptive
Predictive
The following are the functionalities of data mining:
Concept/class description (characterization and discrimination):
Generalize, summarize, and contrast data characteristics.

Mining frequent patterns, associations, and correlations:
Frequent patterns are patterns (such as item sets, subsequences, or substructures) that appear frequently in a data set.

Classification and prediction:
Construct models that describe and distinguish classes or concepts for future prediction; prediction also estimates unknown or missing numerical values.

Cluster analysis:
The class label is unknown; group data to form new classes, maximizing intra-class similarity and minimizing inter-class similarity.

Outlier analysis:
An outlier is a data object that does not comply with the general behavior of the data. It is often noise or an exception, but quite useful in fraud detection and rare-event analysis.
Introduction to WEKA
WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine
learning software written in Java, developed at the University of Waikato, New Zealand.
WEKA is an open source application that is freely available under the GNU general
public license agreement. Originally written in C, the WEKA application has been
completely rewritten in Java and is compatible with almost every computing platform. It
is user friendly with a graphical interface that allows for quick set up and operation.
WEKA operates on the premise that the user data is available as a flat file or relation. This means that each data object is described by a fixed number of attributes, usually of a specific type: nominal (alphanumeric) or numeric values. The WEKA application gives novice users a tool to identify hidden information in databases and file systems, with simple-to-use options and visual interfaces.
The WEKA workbench contains a collection of visualization tools and algorithms for data
analysis and predictive modeling, together with graphical user interfaces for easy access
to this functionality.
The original version was primarily designed as a tool for analyzing data from agricultural
domains, but the more recent fully Java‐based version (WEKA 3), for which
development started in 1997, is now used in many different application areas, in
particular for educational purposes and research.
ADVANTAGES OF WEKA
The obvious advantage of a package like WEKA is that a whole range of data preparation, feature selection, and data mining algorithms is integrated. This means that only one data format is needed, and trying out and comparing different approaches becomes easy. Further advantages are:

Portability: fully implemented in the Java programming language, WEKA runs on almost any modern computing platform.
A comprehensive collection of data preprocessing and modeling techniques.
Ease of use, thanks to its graphical user interfaces.
WEKA supports several standard data mining tasks, more specifically, data preprocessing,
clustering, classification, regression, visualization, and feature selection. All of WEKA's
techniques are predicated on the assumption that the data is available as a single flat file or
relation, where each data point is described by a fixed number of attributes (normally,
numeric or nominal attributes, but some other attribute types are also supported). WEKA
provides access to SQL databases using Java Database Connectivity and can process the
result returned by a database query.
It is not capable of multi-relational data mining, but there is separate software for converting a collection of linked database tables into a single table that is suitable for processing using WEKA. Another important area that is currently not covered by the algorithms included in the WEKA distribution is sequence modeling.

The Attribute-Relation File Format (ARFF) is the text file format used by WEKA to store data in a database. The ARFF file contains two sections: the header and the data section. The first line
of the header tells us the relation name. Then there is the list of the attributes
(@attribute...). Each attribute is associated with a unique name and a type. The latter
describes the kind of data contained in the variable and what values it can have. The
variables types are: numeric, nominal, string and date. The class attribute is by default the
last one of the list. In the header section there can also be some comment lines, identified
with a '%' at the beginning, which can describe the database content or give the reader
information about the author. After that there is the data itself (@data), each line stores the
attribute of a single entry separated by a comma. WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through the component-based Knowledge Flow interface and from the command line. There is also the
Experimenter, which allows the systematic comparison of the predictive performance of
WEKA's machine learning algorithms on a collection of datasets.
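As a concrete illustration of the format just described, here is a minimal hand-written ARFF file (a made-up toy relation, not one of the module's datasets):

```text
% A toy customer relation (comment lines start with '%')
@relation customers

@attribute age      numeric
@attribute region   {inner_city, rural, suburban, town}
@attribute name     string
@attribute pep      {YES, NO}

@data
48, inner_city, 'Ani',  YES
27, town,       'Budi', NO
```

The last attribute (pep) acts as the class by default, and each @data line lists the attribute values of one instance, comma-separated, in the order they were declared.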
Launching WEKA
The WEKA GUI Chooser window is used to launch WEKA’s graphical environments. At
the bottom of the window are four buttons:
1. Simple CLI. Provides a simple command‐line interface that allows direct execution of
WEKA commands for operating systems that do not provide their own command line
Interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting statistical tests between learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the
Explorer but with a drag‐and‐drop interface. One advantage is that it supports
incremental learning.
If you launch WEKA from a terminal window, some text begins scrolling in the terminal.
Ignore this text unless something goes wrong, in which case it can help in tracking down the
cause. This User Manual focuses on using the Explorer but does not explain the individual
data preprocessing tools and learning algorithms in WEKA. For more information on the
various filters and learning methods in WEKA, see the book Data Mining (Witten and Frank,
2005).
The WEKA Explorer Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer
is first started only the first tab is active; the others are grayed out. This is because it is
necessary to open (and potentially pre‐process) a dataset before starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens on which the respective actions can be performed. The bottom area of the window (including the status box, the log button, and the WEKA bird) stays visible regardless of which section you are in.
In WEKA, a dataset is represented by the class weka.core.Instances; each row in it is an instance with a number of attributes (fields). The domain of an attribute can be:

Nominal: orange, apple, papaya
Numeric: integers and fractions
String: enclosed in quotation marks
Date: dates
Relational
Example Dataset
Example: Iris.arff
The Iris.arff data (detail)
Data: weather.arff
Answer: follow the example shown below.
GUI WEKA: PREPROCESSING
Data Preprocessing in WEKA This exercise illustrates some of the basic data preprocessing operations that can be
performed using WEKA. The sample data set used for this example is the "bank data" available in comma‐separated format (bank‐data.csv).
The data contains the following fields:
id : a unique identification number
age : age of customer in years (numeric)
sex : MALE / FEMALE
region : inner_city / rural / suburban / town
income : income of customer (numeric)
married : is the customer married (YES/NO)
children : number of children (numeric)
car : does the customer own a car (YES/NO)
save_acct : does the customer have a saving account (YES/NO)
current_acct : does the customer have a current account (YES/NO)
mortgage : does the customer have a mortgage (YES/NO)
pep : did the customer buy a PEP (Personal Equity Plan) after the last mailing (YES/NO)
Loading the Data In addition to the native ARFF data file format, WEKA has the capability to read in ".csv"
format files. This is fortunate since many databases or spreadsheet applications can save or export data into flat files in this format. As can be seen in the sample data file, the first row contains the attribute names (separated by commas) followed by each data row with attribute values listed in the same order (also separated by commas). In fact, once loaded into WEKA, the data set can be saved into ARFF format.
In this example, we load the data set into WEKA and perform a series of operations using WEKA's preprocessing filters. While all of these operations can be performed from the command line, we use the GUI interface of the WEKA Explorer.
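When WEKA reads a .csv file it must infer a type for each column. The sketch below is an illustrative, standard-library-only helper (not WEKA's actual CSV loader): a column whose values all parse as numbers is treated as numeric, anything else as nominal.

```java
import java.util.Arrays;
import java.util.List;

// Sketch of CSV type sniffing in the spirit of what WEKA does on load:
// a column is "numeric" if every value parses as a double, else "nominal".
public class CsvTypeSniffer {
    public static String[] sniff(List<String[]> rows) {
        int cols = rows.get(0).length;
        String[] types = new String[cols];
        for (int c = 0; c < cols; c++) {
            boolean numeric = true;
            for (String[] row : rows) {
                try { Double.parseDouble(row[c].trim()); }
                catch (NumberFormatException e) { numeric = false; break; }
            }
            types[c] = numeric ? "numeric" : "nominal";
        }
        return types;
    }

    public static void main(String[] args) {
        // Two hypothetical bank-data rows: age, sex, income.
        List<String[]> data = Arrays.asList(
            new String[] {"48", "FEMALE", "17546.0"},
            new String[] {"40", "MALE",   "30085.1"});
        System.out.println(Arrays.toString(sniff(data))); // [numeric, nominal, numeric]
    }
}
```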
Initially (in the Preprocess tab) click "open" and navigate to the directory containing the data file (.csv or .arff). In this case we will open the above data file. This is shown in Figure p1.
Figure p1
Once the data is loaded, WEKA will recognize the attributes and during the scan of the
data will compute some basic statistics on each attribute. The left panel in Figure p2 shows the list of recognized attributes, while the top panels indicate the names of the base relation (or table) and the current working relation (which are the same initially).
Figure p2
Clicking on any attribute in the left panel will show the basic statistics on that attribute.
For categorical attributes, the frequency for each attribute value is shown, while for continuous attributes we can obtain min, max, mean, standard deviation, etc. As an example, see Figures p3 and p4 below which show the results of selecting the "age" and "married" attributes, respectively.
Figure p3
Figure p4
Note that the visualization in the right bottom panel is a form of cross‐tabulation across
two attributes. For example, in Figure p4 above, the default visualization panel cross‐tabulates "married" with the "pep" attribute (by default the second attribute is the last column of the data file). You can select another attribute using the drop down list.
Selecting or Filtering Attributes
In our sample data file, each record is uniquely identified by a customer id (the "id" attribute). We need to remove this attribute before the data mining step. We can do this by (1) simply selecting the attribute and clicking the "Remove" button, as shown in Figure p5 (WEKA 3.6.2), or
Figure p5
(2) using the attribute filters in WEKA. In the "Filter" panel, click on the "Choose" button. This will show a popup window with a list of available filters. Scroll down the list and select the "weka.filters.unsupervised.attribute.Remove" filter as shown in Figure p6.
Figure p6
Next, click on the text box immediately to the right of the "Choose" button. In the resulting dialog box, enter the index of the attribute to be filtered out (this can be a range or a comma-separated list). In this case, we enter 1, which is the index of the "id" attribute (see the left panel). Make sure that the "invertSelection" option is set to false (otherwise everything except attribute 1 will be filtered). Then click "OK" (see Figure p7). Now, in the filter box you will see "Remove -R 1" (see Figure p8).
Figure p7
Figure p8
Click the "Apply" button to apply this filter to the data. This will remove the "id"
attribute and create a new working relation (whose name now includes the details of the filter that was applied). The result is depicted in Figure p9:
Figure p9
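The effect of the Remove filter on the records themselves is easy to picture: drop the field at the given 1-based index from every row. The helper below is an illustrative, standard-library sketch, not the WEKA filter:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Mimics the record-level effect of WEKA's "Remove -R 1":
// drop the 1-based column at the given index from a comma-separated row.
public class RemoveColumn {
    public static String removeAttribute(String row, int index1Based) {
        List<String> fields = new ArrayList<>(Arrays.asList(row.split(",")));
        fields.remove(index1Based - 1);
        return String.join(",", fields);
    }

    public static void main(String[] args) {
        // The first field plays the role of the "id" attribute.
        System.out.println(removeAttribute("ID12101,48,FEMALE,INNER_CITY", 1));
        // 48,FEMALE,INNER_CITY
    }
}
```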
It is possible now to apply additional filters to the new working relation. In this example,
however, we will save our intermediate results as separate data files and treat each step as a separate WEKA session. To save the new working relation as an ARFF file, click on save button in the top panel. Here, as shown in the "save" dialog box (see Figure p10), we will save the new relation in the file "bank‐data‐R1.arff".
Figure p10
Figure p11 shows the top portion of the new generated ARFF file (in text editor).
Figure p11
Note that in the new data set, the "id" attribute and all the corresponding values in the
records have been removed. Also, note that WEKA has automatically determined the correct types and values associated with the attributes, as listed in the Attributes section of the ARFF file.
Discretization
Some techniques, such as association rule mining, can only be performed on categorical data. This requires performing discretization on numeric or continuous attributes. There are 3 such attributes in this data set: "age", "income", and "children". In the case of the
"children" attribute, the range of possible values is only 0, 1, 2, and 3. In this case, we have opted to keep all of these values in the data. This means we can simply discretize by removing the keyword "numeric" as the type for the "children" attribute in the ARFF file and replacing it with the set of discrete values. We do this directly in our text editor, as seen in Figure p12. In this case, we have saved the resulting relation in a separate file, "bank-data2.arff".
Figure p12
We will rely on WEKA to perform discretization on the "age" and "income" attributes. In
this example, we divide each of these into 3 bins (intervals). The WEKA discretization filter can divide the ranges blindly or use various statistical techniques to automatically determine the best way of partitioning the data. In this case, we will perform simple binning.
First we will load our filtered data set into WEKA by opening the file "bank-data2.arff". The "open" dialog box is depicted in Figure p13.
Figure p13
If we select the "children" attribute in this new data set, we see that it is now a
categorical attribute with four possible discrete values. This is depicted in Figure p14.
Figure p14
Now, once again we activate the Filter dialog box, but this time, we will select "weka.filters.unsupervised.attribute.Discretize" from the list (see Figure p15).
Figure p15
Next, to change the defaults for this filter, click on the box immediately to the right of
the "Choose" button. This will open the Discretize Filter dialog box. We enter the index of the attribute to be discretized; in this case we enter 1, corresponding to the attribute "age". We also enter 3 as the number of bins (note that it is possible to discretize more than one attribute at the same time by using a list of attribute indices). Since we are doing simple
binning, all of the other available options are set to "false". The dialog box is depicted in Figure p16. Clicking on "More" will give you details of each parameter.
Figure p16
Click "Apply" in the Filter panel. This will result in a new working relation with the
selected attribute partitioned into 3 bins (see Figure p17). To examine the results, we save the new working relation in the file "bank‐data3.arff" as depicted in Figure p18.
Figure p17
Figure p18
Let us now examine the new data set using our text editor. The top portion of the data
is shown in Figure p18. You can observe that WEKA has assigned its own labels to each of the value ranges for the discretized attribute. For example, the lower range in the "age" attribute is labeled "(-inf-34.333333]" (enclosed in single quotes and escape characters), while the middle range is labeled "(34.333333-50.666667]", and so on. These labels now also appear in the data records where the original age value was in the corresponding range.
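Simple (equal-width) binning just splits [min, max] into k intervals of the same width. The sketch below is an illustrative computation (not WEKA's Discretize implementation); it reproduces the cut points seen in the labels above, assuming the "age" values in this data run from 18 to 67:

```java
// Equal-width binning: split [min, max] into k intervals of equal width
// and return the k-1 interior cut points.
public class EqualWidthBinning {
    public static double[] cutPoints(double min, double max, int k) {
        double width = (max - min) / k;
        double[] cuts = new double[k - 1];
        for (int i = 1; i < k; i++) cuts[i - 1] = min + i * width;
        return cuts;
    }

    public static void main(String[] args) {
        // With min=18, max=67 and 3 bins, the cut points are roughly
        // 34.333333 and 50.666667, matching the labels (-inf-34.333333],
        // (34.333333-50.666667], (50.666667-inf).
        for (double c : cutPoints(18, 67, 3)) System.out.println(c);
    }
}
```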
Next, we apply the same process to discretize the "income" attribute into 3 bins. Again, WEKA automatically performs the binning and replaces the values in the "income" column with the appropriate automatically generated labels. We save the new file as "bank-data3.arff", replacing the older version.
Clearly, the WEKA labels, while readable, leave much to be desired as far as naming conventions go. We will thus use the global search/replace functions in text editor to replace these labels with more succinct and readable ones.
Replace all of the WEKA‐assigned labels of “age” and “income” attributes. Note that the attribute section (the top part) of the arff file must be adjusted accordingly.
Figure p19 shows the final result of the transformation and the newly assigned labels for these attribute values.
Figure p19
We now also change the relation name in the ARFF file to "bank‐data‐final" and save
the file as "bank-data-final.arff". You may try with a different number of bins. There is also a parameter for equal-frequency binning; check it out.

Missing Values
1. Open the file "bank-data.arff".
2. Check whether there are any missing values in any attribute.
3. Edit the data to create some missing values.
4. Delete some data in the "region" (nominal) and "children" (numeric) attributes. Click the "OK" button when finished.
5. Make a note of the label with the maximum count in the "region" attribute and the mean of the "children" attribute.
6. Choose the "ReplaceMissingValues" filter (weka.filters.unsupervised.attribute.ReplaceMissingValues), then click the Apply button.
7. Look into the data. How did those missing values get replaced?
8. Edit "bank-data.arff" with a text editor. Make some data missing by replacing it with '?' (try both nominal and numeric attributes). Save to "bank-data-missing.arff".
9. Load "bank-data-missing.arff" into WEKA and observe the data and attribute information.
10. Replace the missing values using the same procedure as before.

Reference:
http://maya.cs.depaul.edu/classes/ect584/WEKA/preprocess.html ‐‐ WEKA 3.4.1
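The rule that ReplaceMissingValues applies in step 6 can be sketched in plain Java: missing numeric values get the column mean, missing nominal values get the most frequent label. This is an illustrative helper, not WEKA's implementation ('?' marks a missing value, as in ARFF):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of WEKA's ReplaceMissingValues on a single attribute column:
// numeric -> mean of the present values, nominal -> modal label.
public class ReplaceMissing {
    public static List<String> fillNumeric(List<String> col) {
        double sum = 0;
        int n = 0;
        for (String v : col) if (!v.equals("?")) { sum += Double.parseDouble(v); n++; }
        String mean = String.valueOf(sum / n);
        List<String> out = new ArrayList<>();
        for (String v : col) out.add(v.equals("?") ? mean : v);
        return out;
    }

    public static List<String> fillNominal(List<String> col) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : col) if (!v.equals("?")) counts.merge(v, 1, Integer::sum);
        String mode = Collections.max(counts.entrySet(),
                Map.Entry.comparingByValue()).getKey();
        List<String> out = new ArrayList<>();
        for (String v : col) out.add(v.equals("?") ? mode : v);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(fillNumeric(Arrays.asList("1", "?", "3")));
        System.out.println(fillNominal(Arrays.asList("town", "?", "town", "rural")));
    }
}
```

Comparing the filled-in values with the mean and max-count label you noted in step 5 confirms what the filter did.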
GUI WEKA: CONNECTING TO A DATABASE
Opening Windows Databases in Weka
This documentation is superseded by the Wiki article "Windows Databases".

Outdated documentation: a common query we get from our users is how to open a Microsoft Access database in the WEKA Explorer. This page is intended as a guide to help you achieve this. It is a complicated process and we cannot guarantee that it will work for you. The process described makes use of the JDBC-ODBC bridge that is part of Sun's JRE 1.3, but is still experimental.

Important note: there is an incompatibility problem with WEKA version 3-2-0 and earlier that means it will fail. This problem has been fixed in versions 3-2-1 and 3-3 or greater. You will know the problem exists if you get the error 'Resultset is closed'. The easiest and most recommended fix is to upgrade to a newer version of WEKA. In versions below 3-2-1 the problem can be fixed by replacing the InstanceQuery.class file in WEKA with the patched version (a patched source file is also available). To install the patch, you will need to overwrite the files that are already present. One way of doing this is to extract the weka.jar file; if, for example, you extracted the contents of the .jar into the directory c:\weka-3-2, you can overwrite the InstanceQuery.class file sitting in c:\weka-3-2\weka\experiment\. To launch the WEKA Explorer, open a command prompt and execute the following command:
java -cp c:\weka-3-2\ weka.gui.explorer.Explorer
Note: There is an easier way to get your database table into WEKA: simply save it as a CSV file. The WEKA Explorer is able to load these files directly.
The following instructions are for Weka 3.2 upwards and Windows 2000. Under other Windows versions there may be slight differences.
Extracting .jar files
.jar files are Java archives that can be extracted by tools such as some versions of WinZip, or by Sun's jar tool.
Step 1: Create a User DSN

Note: make sure your database is not open in another application before following the steps below.
1. Go to the Control Panel.
2. Choose Administrative Tools.
3. Choose Data Sources (ODBC).
4. On the User DSN tab, choose Add...
5. Choose the Microsoft Access driver and click Finish. At this point you could of course choose another driver instead for a different type of database.
6. Give the source a name by typing it into the Data Source Name field.
7. In the Database section, choose Select...
8. Browse to find your database file, select it, and click OK.
9. Click OK to finalize your DSN.
10. Your DSN should now be listed in the User Data Sources list.
Step 2: Set up the DatabaseUtils.props file

You will need to create a file called DatabaseUtils.props. This file already exists under the path /weka/experiment/ in the weka.jar file that is part of the WEKA download. This file needs to be recognized when the Explorer starts. You can achieve this by making sure it is in the working directory, or by replacing the version that already exists in the /weka/experiment directory. A way of achieving the second alternative is to extract the contents of weka.jar and set your CLASSPATH to point to the directory where /weka resides rather than to the .jar file (as mentioned above).
The file is a text file that needs to contain the following lines:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver jdbcURL=jdbc:odbc:dbname
where dbname is the name you gave the user DSN. (This can also be changed once the Explorer is running.)
Step 3: Open the database

1. Start up the WEKA Explorer. If you want to be sure that the DatabaseUtils.props file is in the current path, you can open a command prompt window, change to the directory where the DatabaseUtils.props file is located, make sure your CLASSPATH environment variable is set correctly (or set it with the -cp option to java), and launch the Explorer with the following command:

java weka.gui.explorer.Explorer

2. Choose Open DB...
3. Edit the query field to read 'select * from tablename', where tablename is the name of
the database table you want to read, or you could put a more complicated SQL query here instead.
4. The databaseURL should read 'jdbc:odbc:dbname' where dbname is the name you gave the user DSN.
5. Click OK
At this point the data should be read from the database.
How to connect a MySQL database with WEKA

This tutorial explains how to make a connection between a MySQL database and WEKA.

Prerequisites:
1. MySQL already installed on a Windows system, listening on port number 3306.
2. The mysql-connector-java-5.0.8-bin.jar download.
3. A Windows operating system (OS).
Follow these steps to download and install weka ‐ mysql database connection on your Windows machine.
Step 1: Find jdbc jar file: in my case jar path is ==>C:\Users\Reddy\Java‐JDBC\JavaJdbctest\mysql‐connector‐java‐5.1.36‐bin.jar (Download from following link: https://dev.mysql.com/downloads/connector/j/5.0.html)
Step 2: Copy mysql‐connector‐java‐5.1.36‐bin.jar file to => C:\Program Files\Weka‐3‐6
Now go to your Weka installation folder, locate the RunWeka.ini file, and modify it with a Notepad/WordPad editor. "RunWeka.ini" is by default located at => C:\Program Files\Weka-3-7\RunWeka.ini
Find the last line that reads cp=%CLASSPATH%; and replace it with:
cp=%CLASSPATH%;C:\Program Files\Weka-3-6\mysql-connector-java-5.1.36-bin.jar
Figure 1.1: Setting the class path
Save changes to RunWeka.ini file
Now go to Weka GUI Chooser > Tools > SqlViewer (Ctrl+S)
Figure 1.2
Replace the default URL with jdbc:mysql://localhost:3306/test (Figure 1.3)
Figure 1.3
Enter your username: root (by default) and the password that you have already set for the MySQL user (Figure 1.4)
Figure 1.4
Now click on the Connect button
GUI WEKA: CLASSIFICATION
Training a Classification Dataset
The following guide is based on WEKA version 3.4.1. Additional resources on WEKA, including sample data sets, can be found on the official WEKA Web site.
This example illustrates the use of the C4.5 (J48) classifier in WEKA. The sample data set used for this example, unless otherwise indicated, is the bank data available in comma-separated format (bank-data.csv). This document assumes that appropriate data preprocessing has been performed; in this case the ID field has been removed. Since the C4.5 algorithm can handle numeric attributes, there is no need to discretize any of the attributes. For the purposes of this example, however, the "Children" attribute has been converted into a categorical attribute with values "YES" or "NO".
WEKA has implementations of numerous classification and prediction algorithms. The basic ideas behind using all of these are similar. In this example we will use the modified version of the bank data to classify new instances using the C4.5 algorithm (note that the C4.5 is implemented in WEKA by the classifier class: weka.classifiers.trees.J48). The modified (and smaller) version of the bank data can be found in the file "bank.arff" and the new unclassified instances are in the file "bank‐new.arff".
As usual, we begin by loading the data into WEKA, as seen in Figure 20:
Figure 20
Next, we select the "Classify" tab and click the "Choose" button to select the J48
classifier, as depicted in Figures 21-a and 21-b. Note that J48 (an implementation of the C4.5 algorithm) does not require discretization of numeric attributes, in contrast to the ID3 algorithm from which C4.5 evolved.
Figure 21-a
Figure 21-b
Now, we can specify the various parameters. These can be specified by clicking in the text box to the right of the "Choose" button, as depicted in Figure 22. In this example we accept the default values. The default version does perform some pruning (using the subtree raising approach), but does not perform error pruning. The selected parameters are depicted in Figure 22.
Figure 22
Under the "Test options" in the main panel we select 10‐fold cross‐validation as our evaluation approach. Since we do not have separate evaluation data set, this is necessary to get a reasonable idea of accuracy of the generated model. We now click "Start" to generate the model. The ASCII version of the tree as well as evaluation statistics will appear in the eight panel when the model construction is completed (see Figure 23).
Figure 23
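To make the mechanics concrete, the fold bookkeeping behind 10-fold cross-validation can be sketched in plain Java. This is an illustration only, not WEKA's implementation (WEKA additionally stratifies the folds so class proportions are preserved); the fold assignment here is simple round-robin.

```java
// Sketch of the bookkeeping behind k-fold cross-validation: the data
// is split into k folds; each fold serves as the test set once while
// the other k-1 folds are used for training.
public class CrossValidation {
    // Round-robin fold assignment for n instances into k folds.
    public static int[] assignFolds(int n, int k) {
        int[] fold = new int[n];
        for (int i = 0; i < n; i++) fold[i] = i % k;
        return fold;
    }

    public static void main(String[] args) {
        int n = 600, k = 10;                 // e.g. the 600-instance bank data
        int[] fold = assignFolds(n, k);
        int[] sizes = new int[k];
        for (int f : fold) sizes[f]++;
        // Every instance is tested exactly once, in its own fold;
        // here each of the 10 folds holds 60 test instances.
        System.out.println("fold sizes: " + k + " x " + sizes[0]);
    }
}
```

The reported accuracy is the average over the 10 train/test rounds, which is why it approximates performance on unseen data even without a separate evaluation set.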
We can view this information in a separate window by right clicking the last result set (inside the "Result list" panel on the left) and selecting "View in separate window" from the pop‐up menu. These steps and the resulting window containing the classification results are depicted in Figures 24‐a and 24‐b.
Figure 24-a
Figure 24-b
Note that the classification accuracy of our model is only about 69%. This may indicate that more work is needed (either in preprocessing or in selecting the correct parameters for classification) before building another model. In this example, however, we will continue with this model despite its inaccuracy.
WEKA also lets us view a graphical rendition of the classification tree. This can be done by right-clicking the last result set (as before) and selecting "Visualize tree" from the pop-up menu. The tree for this example is depicted in Figure 25. Note that by resizing the window and selecting various menu items from inside the tree view (using the right mouse button), we can adjust the tree view to make it more readable.
Figure 25
We will now use our model to classify the new instances. A portion of the new instances ARFF file is depicted in Figure 26. Note that the attribute section is identical to the training data (bank data we used for building our model). However, in the data section, the value of the "pep" attribute is "?" (or unknown).
Figure 26
In the main panel, under "Test options" click the "Supplied test set" radio button, and then click the "Set..." button. This will pop up a window which allows you to open the file containing test instances, as in Figures 27‐a and 27‐b.
Figure 27-a
Figure 27-b
In this case, we open the file "bank-new.arff" and, upon returning to the main window, we click the "Start" button. This once again generates the model from our training data, but this time it applies the model to the new unclassified instances in the "bank-new.arff" file in order to predict the value of the "pep" attribute. The result is depicted in Figure 28. Note that the summary of the results in the right panel does not show any statistics. This is because in our test instances the value of the class attribute ("pep") was left as "?", so WEKA has no actual values to which it can compare the predicted values of the new instances.
Figure 28
Of course, in this example we are interested in knowing how our model managed to
classify the new instances. To do so we need to create a file containing all the new instances along with their predicted class value resulting from the application of the model. Doing this is much simpler using the command line version of WEKA classifier application. However, it is possible to do so in the GUI version using an "indirect" approach, as follows.
First, right‐click the most recent result set in the left "Result list" panel. In the resulting pop‐up window select the menu item "Visualize classifier errors". This brings up a separate window containing a two‐dimensional graph. These steps and the resulting window are shown in Figures 28 and 29.
Figure 29
For now, we are not interested in what this graph represents. Rather, we would like to "save" the classification results from which the graph is generated. In the new window, we click on the "Save" button and save the result as the file: "bank‐predicted.arff", as shown in Figure 30.
Figure 30
This file contains a copy of the new instances along with an additional column for the predicted value of "pep". The top portion of the file can be seen in Figure 31.
Figure 31
Note that two attributes have been added to the original new instances data:
"Instance_number" and "predictedpep". These correspond to new columns in the data portion. The "predictedpep" value for each new instance is the last value before "?" which the actual "pep" class value. For example, the predicted value of the "pep" attribute for instance 0 is "YES" according to our model, while the predicted class value for instance 4 is "NO".
Using the Command Line (Recommended)
While the GUI version of WEKA is nice for visualizing the results and setting the parameters using forms, when it comes to building a classification (or predictions) model and then applying it to new instances, the most direct and flexible approach is to use the command line. In fact, you can use the GUI to create the list of parameters (for example in case of the J48 class) and then use those parameters in the command line.
In the main WEKA interface, click "Simple CLI" button to start the command line interface. The main command for generating the classification model as we did above is:
java weka.classifiers.trees.J48 -C 0.25 -M 2 -t directory-path\bank.arff -d directory-path \bank.model
The options ‐C 0.25 and ‐M 2 in the above command are the same options that we selected for J48 classifier in the previous GUI example (see Figure 22). The ‐t option in the command specifies that the next string is the full directory path to the training file (in this case "bank.arff"). In the above command directory‐path should be replaced with the full directory path where the training file resides. Finally, the ‐d option specifies the name (and location) where the model will be stored. After executing this command inside the "Simple CLI" interface, you should see the tree and stats about the model in the top window (See Figure 32).
Figure 32
Based on the above command, our classification model has been stored in the file "bank.model" and placed in the directory we specified. We can now apply this model to the
new instances. The advantage of building a model and storing it is that it can be applied at any time to different sets of unclassified instances. The command for doing so is:
java weka.classifiers.trees.J48 -p 9 -l directory-path\bank.model -T directory-path \bank-new.arff
In the above command, the option -p 9 indicates that we want to predict a value for attribute number 9 (which is "pep"). The -l option specifies the directory path and name of the model file (this is what was created in the previous step). Finally, the -T option specifies the name (and path) of the test data. In our example, the test data is our new instances file "bank-new.arff".
This command results in a 4-column output similar to the following:

0 YES 0.75 ?
1 NO 0.7272727272727273 ?
2 YES 0.95 ?
3 YES 0.8813559322033898 ?
4 NO 0.8421052631578947 ?
The first column is the instance number assigned to the new instances in "bank-new.arff" by WEKA. The 2nd column is the predicted value of the "pep" attribute for the new instance. The 3rd column is the confidence (prediction accuracy) for that instance. Finally, the 4th column is the actual "pep" value in the test data (in this case, we did not have a value for "pep" in "bank-new.arff", thus this value is "?"). For example, in the above output, the predicted value of "pep" in instance 2 is "YES" with a confidence of 95%. A portion of the final result is depicted in Figure 33.
Figure 33
The above output is preferable to the output derived from the GUI version of WEKA. First, this is a more direct approach that allows us to save the classification model; the model can then be applied to new instances later without having to be regenerated. Secondly (and more importantly), in contrast to the final output of the GUI version, here we have independent confidence (accuracy) values for each of the new instances. This means that we can focus only on those predictions in which we are more confident. For example, in the above output, we could filter out any instance whose predicted value has a confidence of less than 85%.
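As a sketch of that filtering step, the following plain-Java snippet (an illustration only, not part of WEKA) parses the 4-column output shown above and keeps only the predictions at or above a confidence threshold:

```java
import java.util.*;

// Keep only predictions whose confidence meets a threshold, parsing
// the 4-column output (instance, predicted class, confidence, actual).
public class FilterPredictions {
    public static List<String> filter(String[] lines, double minConf) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] cols = line.trim().split("\\s+");
            if (cols.length < 3) continue;          // skip malformed lines
            double conf = Double.parseDouble(cols[2]);
            if (conf >= minConf) kept.add(line);
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] output = {
            "0 YES 0.75 ?",
            "1 NO 0.7272727272727273 ?",
            "2 YES 0.95 ?",
            "3 YES 0.8813559322033898 ?",
            "4 NO 0.8421052631578947 ?"
        };
        // With an 0.85 cut-off only instances 2 and 3 remain.
        for (String line : filter(output, 0.85)) System.out.println(line);
    }
}
```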
Reference: http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/classify.html
Testing Classification Results (Making predictions on new data using Weka)
Once we have learned a model, it can be used to classify new, unseen data. These notes describe the process of doing this both graphically and from the command line.
First, the file with the cases to predict needs to have the same structure as the file used to learn the model. The difference is that the value of the class attribute is "?" for all instances (question marks represent missing values in Weka). For example, assuming that we have learnt a decision tree using the diabetes dataset included with Weka, the following file will be used to predict the 5 cases included in the arff file:
@relation pima_diabetes
@attribute 'preg' real
@attribute 'plas' real
@attribute 'pres' real
@attribute 'skin' real
@attribute 'insu' real
@attribute 'mass' real
@attribute 'pedi' real
@attribute 'age' real
@attribute 'class' { tested_negative, tested_positive}
@data
6,148,72,35,0,33.6,0.627,50,?
1,85,66,29,0,26.6,0.351,31,?
8,183,64,0,0,23.3,0.672,32,?
1,89,66,23,94,28.1,0.167,21,?
0,137,40,35,168,43.1,2.288,33,?
To test the training results in Weka on data that has never been shown to the classifier, the 'class' field in the test dataset's arff file must be set to a question mark, exactly as in the example above. Weka will fill it in automatically when run.
Using Weka’s Explorer First, we load the saved model with the right click menu on the “Result list” panel:
In the “Test Options”, we have to select “Supplied test set”, and once the file is loaded we select “No class” from the list of attributes.
Then, clicking “More Options”, a new window opens and we choose PlainText from ‘Output predictions’.
Finally, we need to right click in the model and run “Re‐evaluate model on current test set”.
The results are shown in the "Classifier output" panel, under "Predictions on test data". The "predicted" column contains tested_positive or tested_negative for each of the lines in the test file.
Using the command line
It is explained in the following link:
http://weka.wikispaces.com/Making+predictions
An example using our data:
java weka.classifiers.trees.J48 -T diabetes2.arff -l j48.model -p 0
You need to add the weka.jar file to the CLASSPATH environment variable (or use -cp), and the 'bin' directory of your Java installation to the PATH variable. The output should look like this:
Reference:
Daniel Rodriguez, Making Predictions on New Data Using Weka, University of Alcala
GUI WEKA: CLUSTERING
K‐Means Clustering in WEKA
The following guide is based on WEKA version 3.4.1. Additional resources on WEKA, including sample data sets, can be found on the official WEKA Web site.
This example illustrates the use of k-means clustering with WEKA. The sample data set used for this example is based on the "bank data" available in comma-separated format (bank-data.csv). This document assumes that appropriate data preprocessing has been performed. In this case a version of the initial data set has been created in which the ID field has been removed and the "children" attribute has been converted to categorical (this, however, is not necessary for clustering).
The resulting data file is "bank.arff" and includes 600 instances. As an illustration of performing clustering in WEKA, we will use its implementation of the K-means algorithm to cluster the customers in this bank data set and to characterize the resulting customer segments.
Figure 34 shows the main WEKA Explorer interface with the data file loaded.
Some implementations of K‐means only allow numerical values for attributes. In that case, it may be necessary to convert the data set into the standard spreadsheet format and convert categorical attributes to binary. It may also be necessary to normalize the values of
attributes that are measured on substantially different scales (e.g., "age" and "income"). While WEKA provides filters to accomplish all of these preprocessing tasks, they are not necessary for clustering in WEKA. This is because the WEKA SimpleKMeans algorithm automatically handles a mixture of categorical and numerical attributes. Furthermore, the algorithm automatically normalizes numerical attributes when doing distance computations. The WEKA SimpleKMeans algorithm uses the Euclidean distance measure to compute distances between instances and clusters.
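The distance computation just described can be sketched in plain Java. This is an illustration of min-max normalization to [0,1] followed by Euclidean distance, not WEKA's own code, and the attribute ranges in main are made up:

```java
// Sketch of a normalized distance computation: numeric attributes are
// min-max scaled to [0,1] before the Euclidean distance is taken, so
// attributes like "age" and "income" contribute on equal footing.
public class NormalizedDistance {
    public static double distance(double[] a, double[] b,
                                  double[] min, double[] max) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double range = max[i] - min[i];
            double na = (a[i] - min[i]) / range;  // normalize to [0,1]
            double nb = (b[i] - min[i]) / range;
            sum += (na - nb) * (na - nb);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Two customers: (age, income); the attribute ranges are hypothetical.
        double[] min = {18, 5000}, max = {68, 65000};
        double[] c1 = {38, 28500}, c2 = {43, 31500};
        System.out.printf("distance = %.4f%n", distance(c1, c2, min, max));
    }
}
```

Without the normalization step, the raw income difference (3000) would completely swamp the age difference (5); after scaling, both differences are comparable fractions of their attribute ranges.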
To perform clustering, select the "Cluster" tab in the Explorer and click on the "Choose" button. This results in a drop down list of available clustering algorithms. In this case we select "SimpleKMeans". Next, click on the text box to the right of the "Choose" button to get the pop‐up window shown in Figure 35, for editing the clustering parameter.
In the pop-up window we enter 6 as the number of clusters (instead of the default value of 2) and we leave the value of "seed" as is. The seed value is used in generating a random number which is, in turn, used for making the initial assignment of instances to clusters. Note that, in general, K-means is quite sensitive to how clusters are initially assigned. Thus, it is often necessary to try different values and evaluate the results.
Once the options have been specified, we can run the clustering algorithm. Here we make sure that in the "Cluster Mode" panel, the "Use training set" option is selected, and we click "Start". We can right click the result set in the "Result list" panel and view the results of clustering in a separate window. This process and the resulting window are shown in Figures 36 and 37.
The result window shows the centroid of each cluster as well as statistics on the number and percentage of instances assigned to different clusters. Cluster centroids are the mean vectors for each cluster (so, each dimension value in the centroid represents the mean value for that dimension in the cluster). Thus, centroids can be used to characterize the clusters. For example, the centroid for cluster 1 shows that this is a segment of cases representing middle-aged to young (approx. 38) females living in the inner city with an average income of approx. $28,500, who are married with one child, etc. Furthermore, this group has on average said YES to the PEP product.
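Since a centroid is simply the per-dimension mean of a cluster's members, it can be computed as follows (a plain-Java illustration; the member values are hypothetical, not taken from the bank data):

```java
// A cluster centroid is the mean vector of the instances assigned to
// the cluster: one mean per (numeric) dimension.
public class Centroid {
    public static double[] centroid(double[][] members) {
        int dims = members[0].length;
        double[] mean = new double[dims];
        for (double[] row : members)
            for (int d = 0; d < dims; d++) mean[d] += row[d];
        for (int d = 0; d < dims; d++) mean[d] /= members.length;
        return mean;
    }

    public static void main(String[] args) {
        // Three cluster members: (age, income); values are made up.
        double[][] cluster = {{36, 27000}, {38, 28500}, {40, 30000}};
        double[] c = centroid(cluster);
        System.out.printf("centroid = (%.1f, %.1f)%n", c[0], c[1]);  // (38.0, 28500.0)
    }
}
```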
Another way of understanding the characteristics of each cluster is through visualization. We can do this by right-clicking the result set on the left "Result list" panel and selecting "Visualize cluster assignments". This pops up the visualization window as shown in Figure 38.
You can choose the cluster number and any of the other attributes for each of the three different dimensions available (x‐axis, y‐axis, and color). Different combinations of choices will result in a visual rendering of different relationships within each cluster. In the above example, we have chosen the cluster number as the x‐axis, the instance number (assigned by WEKA) as the y‐axis, and the "sex" attribute as the color dimension. This will result in a visualization of the distribution of males and females in each cluster. For instance, you can note that clusters 2 and 3 are dominated by males, while clusters 4 and 5 are dominated by females. In this case, by changing the color dimension to other attributes, we can see their distribution within each of the clusters.
Finally, we may be interested in saving the resulting data set which included each instance along with its assigned cluster. To do so, we click the "Save" button in the visualization window and save the result as the file "bank‐kmeans.arff". The top portion of this file is depicted in Figure 39.
Note that in addition to the "instance_number" attribute, WEKA has also added a "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value. By doing some simple manipulation to this data set, we can easily convert it to a more usable form for additional analysis or processing. For example, here we have converted this data set to comma-separated format and sorted the result by clusters. Furthermore, we have added the ID field from the original data set (before sorting). The results of these steps can be seen in the file "bank-kmeans.csv".
Reference:
http://facweb.cs.depaul.edu/mobasher/classes/ect584/weka/index.html
GUI WEKA: ASSOCIATIONS
Association Rule Mining with WEKA
Apriori works with categorical values only. Therefore, if a dataset contains numeric attributes, they need to be converted into nominal attributes before applying the Apriori algorithm; hence, data preprocessing must be performed. Repeat homework 2 (Data Preprocessing) if you don't know how to deal with numeric-to-nominal conversion.
weather.nominal.arff
1. Load weather.nominal.arff into a text editor and analyze the attribute types and values.
2. Is this dataset appropriate for association rule mining? If not, modify it. You may use WEKA's "Preprocessing" capability.
3. Apply the Apriori algorithm to the dataset.
   a. Go to the Associate tab.
   b. Choose Apriori as the associator.
   c. Accept all default values. You may click on the More button to see the synopsis for the different parameters.
   d. Click on the Start button to run.
4. Study the output in the right panel. It should look something similar to the following:
Apriori
=======
Minimum support: 0.15
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:
Size of set of large itemsets L(1): 12
Size of set of large itemsets L(2): 47
Size of set of large itemsets L(3): 39
Size of set of large itemsets L(4): 6

Best rules found:
1. outlook=overcast 4 ==> play=yes 4 conf:(1)
2. temperature=cool 4 ==> humidity=normal 4 conf:(1)
3. humidity=normal windy=FALSE 4 ==> play=yes 4 conf:(1)
4. outlook=sunny play=no 3 ==> humidity=high 3 conf:(1)
5. outlook=sunny humidity=high 3 ==> play=no 3 conf:(1)
6. outlook=rainy play=yes 3 ==> windy=FALSE 3 conf:(1)
7. outlook=rainy windy=FALSE 3 ==> play=yes 3 conf:(1)
8. temperature=cool play=yes 3 ==> humidity=normal 3 conf:(1)
9. outlook=sunny temperature=hot 2 ==> humidity=high 2 conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2 conf:(1)
5. Can you explain what the output says?
6. Try varying the values of the parameters, for example minimum support, minimum confidence and the number of rules.
7. What do you find?
WEKA's Apriori (ref: web.mac.com)
The default values for the number of rules, the decrease for minimum support (delta factor) and minimum confidence are 10, 0.05 and 0.9. Rule support is the proportion of examples covered by both the LHS and RHS, while confidence is the proportion of examples covered by the LHS that are also covered by the RHS. So if a rule's LHS and RHS together cover 50% of the cases, the rule has 0.5 support; if the LHS of a rule covers 200 cases and of these the RHS covers 50 cases, then the confidence is 0.25.
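These definitions translate directly into code. Below is a minimal plain-Java sketch (not WEKA's implementation) that computes support and confidence for a single rule over a list of transactions; the grocery transactions are made-up examples:

```java
import java.util.*;

// Support    = fraction of transactions containing LHS ∪ RHS.
// Confidence = support(LHS ∪ RHS) / support(LHS).
public class RuleMetrics {
    public static double support(List<Set<String>> txns, Set<String> items) {
        long covered = txns.stream().filter(t -> t.containsAll(items)).count();
        return (double) covered / txns.size();
    }

    public static double confidence(List<Set<String>> txns,
                                    Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return support(txns, both) / support(txns, lhs);
    }

    public static void main(String[] args) {
        List<Set<String>> txns = List.of(
            Set.of("HotDogs", "Buns", "Ketchup"), Set.of("HotDogs", "Buns"),
            Set.of("HotDogs", "Coke", "Chips"),   Set.of("Chips", "Coke"),
            Set.of("Chips", "Ketchup"),           Set.of("HotDogs", "Coke", "Chips"));
        // HotDogs ==> Coke: LHS covers 4 of 6, LHS ∪ RHS covers 2 of 6,
        // so support = 2/6 ≈ 0.333 and confidence = (2/6)/(4/6) = 0.5.
        System.out.printf("support = %.3f, confidence = %.3f%n",
            support(txns, Set.of("HotDogs", "Coke")),
            confidence(txns, Set.of("HotDogs"), Set.of("Coke")));
    }
}
```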
With default settings Apriori tries to generate 10 rules by starting with a minimum support of 100%, iteratively decreasing support by the delta factor until minimum non‐zero support is reached or the required number of rules with at least minimum confidence has been generated. If we examine Weka's output, a Minimum support of 0.15 indicates the minimum support reached in order to generate the 10 rules with the specified minimum metric, here confidence of 0.9. The item set sizes generated are displayed; e.g. there are 6 four‐item sets having the required minimum support. By default rules are sorted by confidence and any ties are broken based on support. The number preceding ==> indicates the number of cases covered by the LHS and the value following the rule is the number of cases covered by the RHS. The value in parenthesis is the rule's confidence.
bank.arff
1. Load bank.arff into a text editor and analyze the attribute types and values.
2. Is this dataset appropriate for association rule mining? If not, modify it. You may use WEKA's "Preprocessing" capability.
3. Apply the Apriori algorithm to the dataset.
4. Study the output in the right panel.
5. Check the output from various different sets of parameters.
6. Is it what you expected?
market-basket.arff
1. Perform similar steps against market-basket.arff.
Association Rule Mining
1. Trace the results of using the Apriori algorithm on the grocery shop with a support threshold of 33.34% and a confidence threshold of 60%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated and highlight the strong ones, sorted by confidence.
Transaction ID   Items
T1               HotDogs, Buns, Ketchup
T2               HotDogs, Buns
T3               HotDogs, Coke, Chips
T4               Chips, Coke
T5               Chips, Ketchup
T6               HotDogs, Coke, Chips
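To check a hand trace like the one asked for above, the frequent itemsets can be enumerated by brute force. This plain-Java sketch tests every subset of the item universe, which is only practical for tiny examples; real Apriori instead prunes candidates level by level. Note that a 33.34% support threshold over 6 transactions means a count of at least 3.

```java
import java.util.*;

// Brute-force frequent-itemset miner: test every subset of the item
// universe against a minimum-support count. Only feasible for tiny
// item universes, but handy for verifying a hand-traced Apriori run.
public class FrequentItemsets {
    public static List<Set<String>> mine(List<Set<String>> txns, int minCount) {
        TreeSet<String> u = new TreeSet<>();
        for (Set<String> t : txns) u.addAll(t);
        List<String> universe = new ArrayList<>(u);
        List<Set<String>> frequent = new ArrayList<>();
        for (int mask = 1; mask < (1 << universe.size()); mask++) {
            Set<String> cand = new HashSet<>();
            for (int i = 0; i < universe.size(); i++)
                if ((mask & (1 << i)) != 0) cand.add(universe.get(i));
            long count = txns.stream().filter(t -> t.containsAll(cand)).count();
            if (count >= minCount) frequent.add(cand);
        }
        return frequent;
    }

    public static void main(String[] args) {
        List<Set<String>> txns = List.of(
            Set.of("HotDogs", "Buns", "Ketchup"), Set.of("HotDogs", "Buns"),
            Set.of("HotDogs", "Coke", "Chips"),   Set.of("Chips", "Coke"),
            Set.of("Chips", "Ketchup"),           Set.of("HotDogs", "Coke", "Chips"));
        // Frequent itemsets at count >= 3: {HotDogs}, {Coke}, {Chips}, {Coke, Chips}.
        for (Set<String> s : mine(txns, 3)) System.out.println(s);
    }
}
```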
2. Trace the results of using the Apriori algorithm on the computer shop with a support threshold of 70% and a confidence threshold of 80%. Show the candidate and frequent itemsets for each database scan. Enumerate all the final frequent itemsets. Also indicate the association rules that are generated and highlight the strong ones, sorted by confidence.
Transaction ID   Items
T1               Tri-pod, Lens, bag
T2               Camera, Lens, bag
T3               Camera, Tri-pod, Lens, Memorycard
T4               Camera, Tri-pod, Lens, bag
T5               Lens, Memorycard, bag
3. Describe the importance of the support and confidence thresholds in finding association rules. What should their most appropriate values be?
GUI WEKA: SELECTING ATTRIBUTES
Feature Selection in Weka Many feature selection techniques are supported in Weka.
A good place to get started exploring feature selection in Weka is in the Weka Explorer. 1. Open the Weka GUI Chooser. 2. Click the “Explorer” button to launch the Explorer. 3. Open the Pima Indians dataset. 4. Click the “Select attributes” tab to access the feature selection methods.
Weka Feature Selection
Feature selection is divided into two parts: the Attribute Evaluator and the Search Method.
Each section has multiple techniques from which to choose.
The attribute evaluator is the technique by which each attribute in your dataset (also called a column or feature) is evaluated in the context of the output variable (e.g. the class). The search method is the technique by which to try or navigate different combinations of attributes in the dataset in order to arrive at a short list of chosen features.
Some Attribute Evaluator techniques require the use of specific Search Methods. For example, the CorrelationAttributeEval technique used in the next section can only be used with a Ranker Search Method, that evaluates each attribute and lists the results in a rank order. When selecting different Attribute Evaluators, the interface may ask you to change the Search Method to something compatible with the chosen technique.
Weka Feature Selection Alert Both the Attribute Evaluator and Search Method techniques can be configured. Once
chosen, click on the name of the technique to get access to its configuration details.
Weka Feature Selection Configuration Click the “More” button to get more documentation on the feature selection technique
and configuration parameters. Hover your mouse cursor over a configuration parameter to get a tooltip containing more details.
Now that we know how to access feature selection techniques in Weka, let’s take a look at how to use some popular methods on our chosen standard dataset.
Correlation Based Feature Selection
A popular technique for selecting the most relevant attributes in your dataset is to use correlation.
Correlation is more formally referred to as Pearson’s correlation coefficient in statistics. You can calculate the correlation between each attribute and the output variable and
select only those attributes that have a moderate‐to‐high positive or negative correlation (close to ‐1 or 1) and drop those attributes with a low correlation (value close to zero).
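A minimal plain-Java implementation of Pearson's correlation coefficient is shown below; the attribute and 0/1-coded class values in main are made up for illustration and are not the Pima data:

```java
// Pearson's correlation coefficient between an attribute column x and
// the (numerically coded) output variable y: r lies in [-1, 1].
public class Pearson {
    public static double correlation(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i]; sy += y[i];
            sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
        }
        double cov = sxy - sx * sy / n;                 // n * covariance
        double vx = sxx - sx * sx / n, vy = syy - sy * sy / n;
        return cov / Math.sqrt(vx * vy);
    }

    public static void main(String[] args) {
        double[] attr = {1, 2, 3, 4, 5};
        double[] cls  = {0, 0, 1, 1, 1};   // 0/1-coded class labels
        System.out.printf("r = %.3f%n", correlation(attr, cls));
    }
}
```

An attribute whose |r| is close to 1 would be kept under the rule above, while one with r near 0 would be dropped.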
Weka supports correlation based feature selection with the CorrelationAttributeEval technique that requires use of a Ranker search method.
Running this on our Pima Indians dataset suggests that one attribute (plas) has the highest correlation with the output class. It also suggests a host of attributes with some modest correlation (mass, age, preg). If we use 0.2 as our cut‐off for relevant attributes, then the remaining attributes could possibly be removed (pedi, insu, skin and pres).
Information Gain Based Feature Selection
Another popular feature selection technique is to calculate the information gain.
You can calculate the information gain (based on entropy) for each attribute for the output variable. Entropy values vary from 0 (no information) to 1 (maximum information).
Those attributes that contribute more information will have a higher information gain value and can be selected, whereas those that do not add much information will have a lower score and can be removed.
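Both quantities can be computed by hand. Below is a plain-Java sketch (an illustration, not WEKA's InfoGainAttributeEval) using the well-known weather data class counts: 9 yes / 5 no overall, with the outlook attribute splitting them into sunny [2,3], overcast [4,0] and rainy [3,2]:

```java
// Entropy of a class-count distribution, and the information gain of
// splitting on an attribute (entropy before the split minus the
// weighted entropy of the child partitions).
public class InfoGain {
    public static double entropy(int[] counts) {
        int total = 0;
        for (int c : counts) total += c;
        double h = 0;
        for (int c : counts) {
            if (c == 0) continue;                  // 0 * log(0) taken as 0
            double p = (double) c / total;
            h -= p * Math.log(p) / Math.log(2);    // log base 2
        }
        return h;
    }

    public static double gain(int[] parent, int[][] children) {
        int total = 0;
        for (int c : parent) total += c;
        double after = 0;
        for (int[] child : children) {
            int n = 0;
            for (int c : child) n += c;
            after += (double) n / total * entropy(child);
        }
        return entropy(parent) - after;
    }

    public static void main(String[] args) {
        int[] play = {9, 5};                           // weather data: 9 yes, 5 no
        int[][] byOutlook = {{2, 3}, {4, 0}, {3, 2}};  // sunny, overcast, rainy
        System.out.printf("entropy = %.3f, gain(outlook) = %.3f%n",
            entropy(play), gain(play, byOutlook));     // ≈ 0.940 and 0.247
    }
}
```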
Weka supports feature selection via information gain using the InfoGainAttributeEval Attribute Evaluator. Like the correlation technique above, the Ranker Search Method must be used.
Running this technique on our Pima Indians dataset we can see that one attribute contributes more information than all of the others (plas). If we use an arbitrary cutoff of 0.05, then we would also select the mass, age and insu attributes and drop the rest from our dataset.
Learner Based Feature Selection A popular feature selection technique is to use a generic but powerful learning
algorithm and evaluate the performance of the algorithm on the dataset with different subsets of attributes selected.
The subset that results in the best performance is taken as the selected subset. The algorithm used to evaluate the subsets does not have to be the algorithm that you intend to use to model your problem, but it should be generally quick to train and powerful, like a decision tree method.
In Weka this type of feature selection is supported by the WrapperSubsetEval technique and must use a GreedyStepwise or BestFirst Search Method. The latter, BestFirst, is preferred if you can spare the compute time.
1. First select the "WrapperSubsetEval" technique.
2. Click on the name "WrapperSubsetEval" to open the configuration for the method.
3. Click the "Choose" button for the "classifier" and change it to J48 under "trees".
4. Click "OK" to accept the configuration.
5. Change the "Search Method" to "BestFirst".
6. Click the "Start" button to evaluate the features.
Running this feature selection technique on the Pima Indians dataset selects 4 of the 8 input variables: plas, pres, mass and age.
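The greedy wrapper search can be sketched without WEKA. In the snippet below the score function is a made-up stand-in for "train a classifier on this attribute subset and measure its accuracy"; only the search loop mirrors GreedyStepwise-style forward selection:

```java
import java.util.*;
import java.util.function.ToDoubleFunction;

// Greedy forward selection: start with an empty subset, repeatedly add
// the single attribute that most improves the evaluation score, and
// stop when no addition helps. In a real wrapper the scorer trains and
// evaluates a classifier on the candidate subset.
public class GreedyForwardSelection {
    public static Set<Integer> select(int numAttrs,
                                      ToDoubleFunction<Set<Integer>> score) {
        Set<Integer> chosen = new HashSet<>();
        double best = score.applyAsDouble(chosen);
        while (true) {
            int bestAttr = -1;
            double bestScore = best;
            for (int a = 0; a < numAttrs; a++) {
                if (chosen.contains(a)) continue;
                Set<Integer> cand = new HashSet<>(chosen);
                cand.add(a);
                double s = score.applyAsDouble(cand);
                if (s > bestScore) { bestScore = s; bestAttr = a; }
            }
            if (bestAttr < 0) return chosen;   // no attribute improves the score
            chosen.add(bestAttr);
            best = bestScore;
        }
    }

    public static void main(String[] args) {
        // Toy scorer: pretend attributes 1 and 6 are the informative ones;
        // reward them and lightly penalize subset size.
        Set<Integer> useful = Set.of(1, 6);
        Set<Integer> picked = select(8, s ->
            s.stream().filter(useful::contains).count() - 0.1 * s.size());
        System.out.println("selected: " + new TreeSet<>(picked));  // [1, 6]
    }
}
```

The size penalty is what makes the search stop: once every informative attribute is in, adding anything else lowers the score, which is analogous to a wrapper halting when accuracy stops improving.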
Select Attributes in Weka Looking back over the three techniques, we can see some overlap in the selected
features (e.g. plas), but also differences. It is a good idea to evaluate a number of different “views” of your machine learning
dataset. A view of your dataset is nothing more than a subset of features selected by a given feature selection technique. It is a copy of your dataset that you can easily make in Weka.
For example, taking the results from the last feature selection technique, let's say we wanted to create a view of the Pima Indians dataset with only the following attributes: plas, pres, mass and age:
1. Click the "Preprocess" tab.
2. In the "Attributes" selection, tick all but the plas, pres, mass, age and class attributes.
3. Click the "Remove" button.
4. Click the "Save" button and enter a filename.
You now have a new view of your dataset to explore.
What Feature Selection Techniques To Use You cannot know which views of your data will produce the most accurate models. Therefore, it is a good idea to try a number of different feature selection techniques on
your data and in turn create many different views of your data. Select a good generic technique, like a decision tree, and build a model for each view of
your data. Compare the results to get an idea of which view of your data results in the best
performance. This will give you an idea of the view or more specifically features that best expose the structure of your problem to learning algorithms in general.
Summary
In this post you discovered the importance of feature selection and how to use feature selection on your data with Weka.
Specifically, you learned:
1. How to perform feature selection using correlation.
2. How to perform feature selection using information gain.
3. How to perform feature selection by training a model on different subsets of features.
Attribute Selection

Attribute selection searches through all possible combinations of attributes in the data and finds which subset of attributes works best for prediction [1]. Attribute selection methods contain two parts: a search method, such as best-first, forward selection, random, exhaustive, genetic algorithm, or ranking, and an evaluation method, such as correlation-based, wrapper, information gain, or chi-squared. The attribute selection mechanism is very flexible: WEKA allows (almost) arbitrary combinations of the two methods [4]. For this exercise you will use weather data from the “weather.arff” file. To begin an attribute selection, click the ‘Select attributes’ tab.
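As a sketch of how a search method is paired with an evaluation method in code (weka.jar on the classpath; the weather.arff path is an assumption), here information gain is combined with the Ranker search:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankerDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/weather.arff"); // path is an assumption
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selection = new AttributeSelection();
        selection.setEvaluator(new InfoGainAttributeEval()); // evaluation method
        selection.setSearch(new Ranker());                   // search method
        selection.SelectAttributes(data);

        // Prints the ranked attributes, like the output of the Select attributes tab
        System.out.println(selection.toResultsString());
    }
}
```

Swapping in a different evaluator or search object is all it takes to try another combination.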
GUI WEKA: VISUALIZATION
There are a number of ways in which you can use Weka to visualize your data. The main GUI will show a histogram of the attribute distributions for a single selected attribute at a time; by default this is the class attribute. Note that the individual colors indicate the individual classes (the Iris dataset has 3). If you move the mouse over the histogram, it will show you the ranges and how many samples fall in each range. The VISUALIZE ALL button lets you bring up a screen showing all distributions at once, as in the picture below. Take some time to look at the image and what it tells you about the attributes.
There is also a tab called VISUALIZE. Clicking on it will open the scatterplots for all attribute pairs:
From these scatterplots, we can infer a number of interesting things. For example, in the picture above we can see that in some plots the clusters (for now, think of clusters as collections of points that are physically close to each other on the screen) and the different colors correspond to each other, such as in the plots for the class/(any attribute) pairs and the petalwidth/petallength attribute pair, whereas for other pairs (sepalwidth/sepallength, for example) it is much harder to separate the clusters by color. By default, the colors indicate the different classes; in this case we used red and two shades of blue. Left-clicking on any of the highlighted class names towards the bottom of the screenshot allows you to set your own color for the classes. Also, by default, the color is used in conjunction with the class attribute, but it can be useful to color the other attributes as well. For example, changing the color to the fourth attribute by clicking on the arrow next to the bar that currently reads Color: class (Num) and selecting petalwidth enables us to observe even more about the data, for example which range of attribute values (indicated by different colors) tends to go along with which class in the class/sepallength attribute pair.
Visualizing
WEKA’s visualization section allows you to visualize 2D plots of the current relation.
8.1 The scatter plot matrix
When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, colour coded according to the currently selected class. It is possible to change the size of each individual 2D plot and the point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the attribute used to colour the plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to subsample the data. Note that changes will only come into effect once the Update button has been pressed.
8.2 Selecting an individual 2D scatter plot
When you click on a cell in the scatter plot matrix, this will bring up a separate window with a visualization of the scatter plot you selected. (We described above how to visualize particular results in a separate window—for example, classifier errors—the same visualization controls are used here.)
Data points are plotted in the main area of the window. At the top are two drop‐down list buttons for selecting the axes to plot. The one on the left shows which attribute is used for the x‐axis; the one on the right shows which is used for the y‐axis.
Beneath the x‐axis selector is a drop‐down list for choosing the colour scheme. This allows you to colour the points based on the attribute selected. Below the plot area, a legend describes what values the colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on them and making an appropriate selection in the window that pops up.
To the right of the plot area is a series of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of the attribute. These values are randomly scattered vertically to help you see concentrations of points. You can choose what axes are used in the main graph by clicking on these strips. Left‐clicking an attribute strip changes the x‐axis to that attribute, whereas right‐clicking changes the y‐axis. The ‘X’ and ‘Y’ written beside the strips shows what the current axes are (‘B’ is used for ‘both X and Y’).
Above the attribute strips is a slider labelled Jitter, which is a random displacement given to all points in the plot. Dragging it to the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a million instances at the same point would look no different to just a single lonely instance.
8.3 Selecting Instances
There may be situations where it is helpful to select a subset of the data using the visualization tool. (A special case of this is the UserClassifier in the Classify panel, which lets you build your own classifier by interactively selecting instances.)
Below the y‐axis selector button is a drop‐down list button for choosing a selection method. A group of data points can be selected in four ways:
1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add vertices to the polygon, right-click to complete it. The polygon will always be closed off by connecting the first point to the last.
4. Polyline. You can build a polyline that distinguishes the points on one side from those on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as opposed to a polygon, which is always closed).
Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point, clicking the Submit button removes all instances from the plot except those within the grey selection area. Clicking on the Clear button erases the selected area without affecting the graph.
Once any points have been removed from the graph, the Submit button changes to a Reset button. This button undoes all previous removals and returns you to the original graph with all points included. Finally, clicking the Save button allows you to save the currently visible instances to a new ARFF file.
GUI WEKA: CREDIT RISK MANAGEMENT
Reference: Bharath Annamaneni. Data Mining Lab Record, JNTU WORLD [www.jntuworld.com]
Description of German Credit Data
Credit Risk Assessment
Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is of crucial importance. You have to develop a system to help a loan officer decide whether the credit of a customer is good or bad. A bank’s business rules regarding loans must consider two opposing factors. On the one hand, a bank wants to make as many loans as possible; interest on these loans is the bank’s profit source. On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the bank. The bank’s loan policy must involve a compromise: not too strict and not too lenient. To do the assignment, you first and foremost need some knowledge about the world of credit.
You can acquire such knowledge in a number of ways.
1. Knowledge engineering: Find a loan officer who is willing to talk. Interview her and try to represent her knowledge in a number of ways.
2. Books: Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate this knowledge from text form to production rule form.
3. Common sense: Imagine yourself as a loan officer and make up reasonable rules which can be used to judge the creditworthiness of a loan applicant.
4. Case histories: Find records of actual cases where competent loan officers correctly judged when, and when not, to approve a loan application.
The German Credit Data
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such data set, consisting of 1000 actual cases collected in Germany. In spite of the fact that the data is German, you should probably make use of it for this assignment (unless you really can consult a real loan officer!). There are 20 attributes used in judging a loan applicant (i.e., 7 numerical attributes and 13 categorical or nominal attributes). The goal is to classify the applicant into one of two categories: good or bad.
The attributes present in the German credit data are:
1. Checking_Status
2. Duration
3. Credit_history
4. Purpose
5. Credit_amount
6. Savings_status
7. Employment
8. Installment_Commitment
9. Personal_status
10. Other_parties
11. Residence_since
12. Property_Magnitude
13. Age
14. Other_payment_plans
15. Housing
16. Existing_credits
17. Job
18. Num_dependents
19. Own_telephone
20. Foreign_worker
21. Class
Tasks (Turn in your answers to the following tasks)
1. List all the categorical (or nominal) attributes and the real valued attributes separately.
Answer: The following are the categorical (or nominal) attributes:
1. Checking_Status
2. Credit_history
3. Purpose
4. Savings_status
5. Employment
6. Personal_status
7. Other_parties
8. Property_Magnitude
9. Other_payment_plans
10. Housing
11. Job
12. Own_telephone
13. Foreign_worker
The following are the numerical attributes:
1. Duration
2. Credit_amount
3. Installment_Commitment
4. Residence_since
5. Age
6. Existing_credits
7. Num_dependents
2. What attributes do you think might be crucial in making the credit assessment? Come up with some simple rules in plain English using your selected attributes.
Answer:
The following attributes may be crucial in making the credit assessment:
1. Credit_amount
2. Age
3. Job
4. Savings_status
5. Existing_credits
6. Installment_commitment
7. Property_magnitude
3. One type of model that you can create is a decision tree. Train a decision tree using the complete dataset as the training data. Report the model obtained after training.
Answer: We created a decision tree by using the J48 technique with the complete dataset as the training data. The following model was obtained after training.
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: german_credit
Instances: 1000
Attributes: 21
  checking_status, duration, credit_history, purpose, credit_amount,
  savings_status, employment, installment_commitment, personal_status,
  other_parties, residence_since, property_magnitude, age,
  other_payment_plans, housing, existing_credits, job, num_dependents,
  own_telephone, foreign_worker, class
Test mode: evaluate on training data

=== Classifier model (full training set) ===
J48 pruned tree
------------------
Number of Leaves: 103
Size of the tree: 140

Time taken to build model: 0.08 seconds

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %
Kappa statistic                        0.6251
Mean absolute error                    0.2312
Root mean squared error                0.34
Relative absolute error               55.0377 %
Root relative squared error           74.2015 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    93.3 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.956    0.38     0.854      0.956   0.902      0.857     good
               0.62     0.044    0.857      0.62    0.72       0.857     bad
Weighted Avg.  0.855    0.279    0.855      0.855   0.847      0.857

=== Confusion Matrix ===
   a   b   <-- classified as
 669  31 |   a = good
 114 186 |   b = bad
4. Suppose you use your above model, trained on the complete dataset, to classify credit as good/bad for each of the examples in the dataset. What % of examples can you classify correctly? (This is also called testing on the training set.) Why do you think you cannot get 100% training accuracy?
Answer: If we use the above model, trained on the complete dataset, to classify credit as good/bad for each example in that dataset, we cannot get 100% training accuracy; only 85.5% of the examples are classified correctly. Because the tree is pruned, it does not memorize every training example, so some instances are misclassified even on the training set.
5. Is testing on the training set as you did above a good idea? Why or why not?
Answer: No. Testing on the training set is not a good idea, because the model has already seen those examples, so the accuracy estimate is optimistic. An independent test set or cross-validation gives a more honest estimate of performance.
6. One approach for solving the problem encountered in the previous question is to use cross-validation. Describe briefly what cross-validation is. Train a decision tree again using cross-validation and report your results. Does accuracy increase or decrease? Why?
Answer: In cross-validation, the classifier is evaluated by splitting the data into a number of folds; each fold is held out in turn for testing while the remaining folds are used for training. The number of folds is entered in the folds text field. In the Classify tab, select the cross-validation option with fold size 2 and press Start; then change the fold size to 5 and press Start; then change the fold size to 10 and press Start.
i) Fold Size‐10
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      705      70.5 %
Incorrectly Classified Instances    295      29.5 %
Kappa statistic                        0.2467
Mean absolute error                    0.3467
Root mean squared error                0.4796
Relative absolute error               82.5233 %
Root relative squared error          104.6565 %
Coverage of cases (0.95 level)        92.8 %
Mean rel. region size (0.95 level)    91.7 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.84     0.61     0.763      0.84    0.799      0.639     good
               0.39     0.16     0.511      0.39    0.442      0.639     bad
Weighted Avg.  0.705    0.475    0.687      0.705   0.692      0.639

=== Confusion Matrix ===
   a   b   <-- classified as
 588 112 |   a = good
 183 117 |   b = bad
ii) Fold Size‐5
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      733      73.3 %
Incorrectly Classified Instances    267      26.7 %
Kappa statistic                        0.3264
Mean absolute error                    0.3293
Root mean squared error                0.4579
Relative absolute error               78.3705 %
Root relative squared error           99.914 %
Coverage of cases (0.95 level)        94.7 %
Mean rel. region size (0.95 level)    93 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.851    0.543    0.785      0.851   0.817      0.685     good
               0.457    0.149    0.568      0.457   0.506      0.685     bad
Weighted Avg.  0.733    0.425    0.72       0.733   0.724      0.685

=== Confusion Matrix ===
   a   b   <-- classified as
 596 104 |   a = good
 163 137 |   b = bad
iii) Fold Size‐2
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances      721      72.1 %
Incorrectly Classified Instances    279      27.9 %
Kappa statistic                        0.2443
Mean absolute error                    0.3407
Root mean squared error                0.4669
Relative absolute error               81.0491 %
Root relative squared error          101.8806 %
Coverage of cases (0.95 level)        92.8 %
Mean rel. region size (0.95 level)    91.3 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.891    0.677    0.755      0.891   0.817      0.662     good
               0.323    0.109    0.561      0.323   0.41       0.662     bad
Weighted Avg.  0.721    0.506    0.696      0.721   0.695      0.662

=== Confusion Matrix ===
   a   b   <-- classified as
 624  76 |   a = good
 203  97 |   b = bad

Note: From these observations, accuracy is highest with 5 folds (73.3%) and lowest with 10 folds (70.5%).
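The fold-size experiment above can also be scripted. A sketch, assuming weka.jar on the classpath and a local copy of the German credit data at the hypothetical path data/credit-g.arff:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        // Evaluate J48 with 2, 5 and 10 folds, as done in the GUI above
        for (int folds : new int[] {2, 5, 10}) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, folds, new Random(1));
            System.out.printf("%2d folds: %.1f %% correct%n", folds, eval.pctCorrect());
        }
    }
}
```

The exact percentages depend on the random seed used to partition the folds, so they may differ slightly from the GUI results.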
7. Check to see if the data shows a bias against “foreign workers” or “personal_status”. One way to do this is to remove these attributes from the dataset and see if the decision tree created in those cases is significantly different from the full-dataset case, which you have already done. Did removing these attributes have any significant effect? Discuss.
Answer: We use the Preprocess tab in the Weka GUI Explorer to remove the “Foreign_worker” and “Personal_status” attributes one by one. In the Classify tab, select the Use training set option and press Start. With each attribute removed from the dataset, we can see the change in accuracy compared to the full dataset.
i) If Foreign_worker is removed
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      859      85.9 %
Incorrectly Classified Instances    141      14.1 %
Kappa statistic                        0.6377
Mean absolute error                    0.2233
Root mean squared error                0.3341
Relative absolute error               53.1347 %
Root relative squared error           72.9074 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    91.9 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.954    0.363    0.86       0.954   0.905      0.867     good
               0.637    0.046    0.857      0.637   0.73       0.867     bad
Weighted Avg.  0.859    0.268    0.859      0.859   0.852      0.867

=== Confusion Matrix ===
   a   b   <-- classified as
 668  32 |   a = good
 109 191 |   b = bad
ii) If Personal_status is removed

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      866      86.6 %
Incorrectly Classified Instances    134      13.4 %
Kappa statistic                        0.6582
Mean absolute error                    0.2162
Root mean squared error                0.3288
Relative absolute error               51.4483 %
Root relative squared error           71.7411 %
Coverage of cases (0.95 level)       100 %
Mean rel. region size (0.95 level)    91.7 %
Total Number of Instances           1000

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.954    0.34     0.868      0.954   0.909      0.868     good
               0.66     0.046    0.861      0.66    0.747      0.868     bad
Weighted Avg.  0.866    0.252    0.866      0.866   0.86       0.868

=== Confusion Matrix ===
   a   b   <-- classified as
 668  32 |   a = good
 102 198 |   b = bad

Note: With these observations we see that removing either attribute changes the training-set accuracy only slightly; in fact, accuracy rises from 85.5% to 85.9% without “Foreign_worker” and to 86.6% without “Personal_status”, so neither attribute appears essential for this classification.
8. Another question might be: do you really need to input so many attributes to get good results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 and 21. Try out some combinations. (You removed two attributes in problem 7. Remember to reload the ARFF data file to get all the attributes back before you start selecting the ones you want.)
Answer: We use the Preprocess tab in the Weka GUI Explorer to remove the 2nd attribute (Duration). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      841      84.1 %
Incorrectly Classified Instances    159      15.9 %

=== Confusion Matrix ===
   a   b   <-- classified as
 647  53 |   a = good
 106 194 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Next we remove the 3rd attribute (Credit_history). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      839      83.9 %
Incorrectly Classified Instances    161      16.1 %

=== Confusion Matrix ===
   a   b   <-- classified as
 645  55 |   a = good
 106 194 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Next we remove the 5th attribute (Credit_amount). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      864      86.4 %
Incorrectly Classified Instances    136      13.6 %

=== Confusion Matrix ===
   a   b   <-- classified as
 675  25 |   a = good
 111 189 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Next we remove the 7th attribute (Employment). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      858      85.8 %
Incorrectly Classified Instances    142      14.2 %

=== Confusion Matrix ===
   a   b   <-- classified as
 670  30 |   a = good
 112 188 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Next we remove the 10th attribute (Other_parties). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.
Time taken to build model: 0.05 seconds

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      845      84.5 %
Incorrectly Classified Instances    155      15.5 %

=== Confusion Matrix ===
   a   b   <-- classified as
 663  37 |   a = good
 118 182 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Next we remove the 17th attribute (Job). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      859      85.9 %
Incorrectly Classified Instances    141      14.1 %

=== Confusion Matrix ===
   a   b   <-- classified as
 675  25 |   a = good
 116 184 |   b = bad
To restore the removed attribute, press Undo in the Preprocess tab. Finally we remove the 21st attribute (Class). In the Classify tab, select the Use training set option and press Start. With this attribute removed, we can see the change in accuracy compared to the full dataset.

=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      963      96.3 %
Incorrectly Classified Instances     37       3.7 %

=== Confusion Matrix ===
   a   b   <-- classified as
 963   0 |   a = yes
  37   0 |   b = no
Note: With these observations we see that when the 3rd attribute (Credit_history) is removed, the accuracy (83.9%) decreases, so this attribute is important for classification. When the 2nd and 10th attributes are removed, the accuracy (about 84%) stays roughly the same, so we can remove either of them. When the 7th and 17th attributes are removed, the accuracy (about 86%) also stays roughly the same, so we can remove either of them. If we remove the 5th or the 21st attribute the accuracy increases, so these attributes may not be needed for the classification.
9. Sometimes the cost of rejecting an applicant who actually has good credit might be higher than accepting an applicant who has bad credit. Instead of counting the misclassifications equally in both cases, give a higher cost to the first case (say, cost 5) and a lower cost to the second case, by using a cost matrix in Weka. Train your decision tree and report the decision tree and cross-validation results. Are they significantly different from the results obtained in problem 6?
Answer: In the Weka GUI Explorer, select the Classify tab and the Use training set option. Press the Choose button and select J48 as the decision tree technique. Press the More options button to open the classifier evaluation options window, select cost-sensitive evaluation, and press the Set button to open the Cost Matrix Editor. Change the number of classes to 2 and press Resize to get a 2x2 cost matrix. Change the value at location (0,1) to 5; the modified cost matrix is as follows.
0.0 5.0
1.0 0.0
Then close the Cost Matrix Editor, press OK, and press the Start button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %

=== Confusion Matrix ===
   a   b   <-- classified as
 669  31 |   a = good
 114 186 |   b = bad
Note: From this observation we see that, of 700 good customers, 669 were classified as good and 31 were misclassified as bad. Of 300 bad customers, 186 were classified as bad and 114 were misclassified as good.
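The same cost-sensitive evaluation can be done in code. A sketch under the same assumptions (weka.jar on the classpath; data/credit-g.arff is a hypothetical path):

```java
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        // Cost 5 for classifying a good customer as bad, cost 1 for the reverse
        CostMatrix costs = CostMatrix.parseMatlab("[0.0 5.0; 1.0 0.0]");

        J48 tree = new J48();
        tree.buildClassifier(data);

        // An Evaluation built with a cost matrix reports cost-weighted results
        Evaluation eval = new Evaluation(data, costs);
        eval.evaluateModel(tree, data);
        System.out.println(eval.toSummaryString());
        System.out.println("Total cost: " + eval.totalCost());
    }
}
```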
10. Do you think it is a good idea to prefer simple decision trees instead of long, complex decision trees? How does the complexity of a decision tree relate to the bias of the model?
Answer: It is a good idea to prefer simple decision trees instead of complex ones. A simpler (smaller) tree corresponds to a model with higher bias and lower variance: it is less likely to overfit the training data and tends to generalize better to new cases.
11. You can make your decision trees simpler by pruning the nodes. One approach is to use reduced error pruning. Explain this idea briefly. Try reduced error pruning when training your decision trees using cross-validation, and report the decision trees you obtain. Also report your accuracy using the pruned model. Does your accuracy increase?
Answer: We can make our decision tree simpler by pruning nodes. Reduced error pruning holds out part of the training data and removes any subtree whose removal does not increase the error on the held-out set. In the Weka GUI Explorer, select the Classify tab and the Use training set option. Then
press the Choose button and select J48 as the decision tree technique. Click on the text “J48 -C 0.25 -M 2” beside the Choose button to open the Generic Object Editor. Set the reducedErrorPruning property to True, press OK, and then press the Start button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      786      78.6 %
Incorrectly Classified Instances    214      21.4 %

=== Confusion Matrix ===
   a   b   <-- classified as
 662  38 |   a = good
 176 124 |   b = bad
Using the pruned model, the training-set accuracy decreased (from 85.5% to 78.6%). By pruning nodes, however, we obtain a simpler decision tree.
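Reduced error pruning can also be switched on from Java. A sketch, with the same hypothetical data path as before:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PruningDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        // Same as setting reducedErrorPruning=true in the Generic Object Editor
        J48 tree = new J48();
        tree.setReducedErrorPruning(true);
        tree.buildClassifier(data);

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(tree, data);
        System.out.println("Leaves: " + tree.measureNumLeaves());
        System.out.println("Training accuracy: " + eval.pctCorrect() + " %");
    }
}
```

Comparing the leaf count with and without pruning makes the simplification visible directly.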
12. How can you convert a decision tree into “if-then-else rules”? Make up your own small decision tree consisting of 2-3 levels and convert it into a set of rules. There also exist different classifiers that output the model in the form of rules; one such classifier in Weka is rules.PART. Train this model and report the set of rules obtained. Sometimes just one attribute can be good enough to make the decision, yes, just one! Can you predict what attribute that might be in this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of J48, PART, and OneR.
Answer: Sample Decision Tree of 2‐3 levels.
Converting Decision tree into a set of rules is as follows.
Rule 1: If age = youth AND student = yes THEN buys_computer = yes
Rule 2: If age = youth AND student = no THEN buys_computer = no
Rule 3: If age = middle_aged THEN buys_computer = yes
Rule 4: If age = senior AND credit_rating = excellent THEN buys_computer = yes
Rule 5: If age = senior AND credit_rating = fair THEN buys_computer = no
In the Weka GUI Explorer, select the Classify tab and the Use training set option. There also exist different classifiers that output the model in the form of rules; such classifiers in Weka are “PART” and “OneR”. Go to Choose, select Rules > PART, and press the Start button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      897      89.7 %
Incorrectly Classified Instances    103      10.3 %

=== Confusion Matrix ===
   a   b   <-- classified as
 653  47 |   a = good
  56 244 |   b = bad
Then go to Choose, select Rules > OneR, and press the Start button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      742      74.2 %
Incorrectly Classified Instances    258      25.8 %

=== Confusion Matrix ===
   a   b   <-- classified as
 642  58 |   a = good
 200 100 |   b = bad
Then go to Choose, select Trees > J48, and press the Start button.
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances      855      85.5 %
Incorrectly Classified Instances    145      14.5 %

=== Confusion Matrix ===
   a   b   <-- classified as
 669  31 |   a = good
 114 186 |   b = bad
Note: From these observations, the classifiers rank as follows: 1. PART (89.7%), 2. J48 (85.5%), 3. OneR (74.2%).
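The three classifiers can also be compared in one small program. A sketch, again assuming weka.jar on the classpath and a hypothetical data/credit-g.arff:

```java
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.OneR;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RuleComparisonDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("data/credit-g.arff"); // hypothetical path
        data.setClassIndex(data.numAttributes() - 1);

        // Train each classifier on the full dataset and evaluate on the training set
        Classifier[] models = { new PART(), new J48(), new OneR() };
        for (Classifier model : models) {
            model.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(model, data);
            System.out.printf("%s: %.1f %% correct on the training set%n",
                    model.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}
```

Printing a trained model (System.out.println(model)) shows the rule set or tree it learned.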
4 P E M R O G R A M A N J A V A D E N G A N W E K A
Weka with Java (Eclipse), Getting Started
1) Make sure you’ve downloaded Weka.
2) Create a new project in Eclipse. Find Java Build Path -> Libraries, either during project creation or afterwards under “Package Explorer” -> right-click the project -> Properties.
3) “Add External JARs…” and select the weka.jar from your download.
4) Create a class file under the “src” folder. This code is taken pretty much line for line
from weka.wikispaces.
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Attribute;
import weka.core.FastVector;
import weka.core.Instance;
import weka.core.Instances;

public class Driver {
    public static void main(String[] args) throws Exception {
        // Declare two numeric attributes
        Attribute Attribute1 = new Attribute("firstNumeric");
        Attribute Attribute2 = new Attribute("secondNumeric");

        // Declare a nominal attribute along with its values
        FastVector fvNominalVal = new FastVector(3);
        fvNominalVal.addElement("blue");
        fvNominalVal.addElement("gray");
        fvNominalVal.addElement("black");
        Attribute Attribute3 = new Attribute("aNominal", fvNominalVal);
        // Declare the class attribute along with its values
        FastVector fvClassVal = new FastVector(2);
        fvClassVal.addElement("positive");
        fvClassVal.addElement("negative");
        Attribute ClassAttribute = new Attribute("theClass", fvClassVal);

        // Declare the feature vector
        FastVector fvWekaAttributes = new FastVector(4);
        fvWekaAttributes.addElement(Attribute1);
        fvWekaAttributes.addElement(Attribute2);
        fvWekaAttributes.addElement(Attribute3);
        fvWekaAttributes.addElement(ClassAttribute);

        // Create an empty training set
        Instances isTrainingSet = new Instances("Rel", fvWekaAttributes, 10);
        // Set class index
        isTrainingSet.setClassIndex(3);

        // Create the instance
        Instance iExample = new Instance(4);
        iExample.setValue((Attribute) fvWekaAttributes.elementAt(0), 1.0);
        iExample.setValue((Attribute) fvWekaAttributes.elementAt(1), 0.5);
        iExample.setValue((Attribute) fvWekaAttributes.elementAt(2), "gray");
        iExample.setValue((Attribute) fvWekaAttributes.elementAt(3), "positive");

        // Add the instance
        isTrainingSet.add(iExample);

        Classifier cModel = (Classifier) new NaiveBayes();
        cModel.buildClassifier(isTrainingSet);

        // Test the model
        Evaluation eTest = new Evaluation(isTrainingSet);
        eTest.evaluateModel(cModel, isTrainingSet);

        // Print the result à la Weka explorer:
        String strSummary = eTest.toSummaryString();
        System.out.println(strSummary);

        // Get the confusion matrix
        double[][] cmMatrix = eTest.confusionMatrix();
        for (int row_i = 0; row_i < cmMatrix.length; row_i++) {
            for (int col_i = 0; col_i < cmMatrix.length; col_i++) {
                System.out.print(cmMatrix[row_i][col_i]);
                System.out.print("|");
            }
            System.out.println();
        }
    }
}
A Simple Machine Learning Example in Java
This is a "Hello World" example of machine learning in Java. It simply gives you a taste of machine learning in Java.
Environment Java 1.6+ and Eclipse
Step 1: Download the Weka library
Download page: http://www.cs.waikato.ac.nz/ml/weka/snapshots/weka_snapshots.html
Download stable.XX.zip, unzip the file, and add weka.jar to your Java project's build path in Eclipse.
Step 2: Prepare Data
Create a text file "weather.txt" in the following format:

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
This dataset comes with the Weka download package; it is located at "/data/weather.numeric.arff". The usual file extension is "arff", but a plain "txt" file works just as well.
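As a quick sanity check on the data above, the class distribution of the 14 rows can be computed with a few lines of plain Java. This is a dependency-free sketch, not Weka's ARFF parser; the class name ArffClassCount and its method are made up for illustration:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ArffClassCount {
    // Count the class distribution (last comma-separated column) of @data lines.
    static Map<String, Integer> classCounts(String[] dataLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : dataLines) {
            String[] cols = line.split(",");
            counts.merge(cols[cols.length - 1].trim(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] data = {
            "sunny,85,85,FALSE,no", "sunny,80,90,TRUE,no",
            "overcast,83,86,FALSE,yes", "rainy,70,96,FALSE,yes",
            "rainy,68,80,FALSE,yes", "rainy,65,70,TRUE,no",
            "overcast,64,65,TRUE,yes", "sunny,72,95,FALSE,no",
            "sunny,69,70,FALSE,yes", "rainy,75,80,FALSE,yes",
            "sunny,75,70,TRUE,yes", "overcast,72,90,TRUE,yes",
            "overcast,81,75,FALSE,yes", "rainy,71,91,TRUE,no"
        };
        Map<String, Integer> c = classCounts(data);
        System.out.println(c.get("yes") + " yes, " + c.get("no") + " no");
        // prints: 9 yes, 5 no
    }
}
```

The 9/5 split of the play attribute is worth remembering; it shows up in the Explorer's class distribution bar for this dataset.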
Step 3: Training and Testing by Using Weka
This code example uses a set of classifiers provided by Weka. It trains a model on the given dataset and tests it using 10-fold cross-validation. Each classifier will be explained later, as it is a more complicated topic.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.evaluation.NominalPrediction;
import weka.classifiers.rules.DecisionTable;
import weka.classifiers.rules.PART;
import weka.classifiers.trees.DecisionStump;
import weka.classifiers.trees.J48;
import weka.core.FastVector;
import weka.core.Instances;

public class WekaTest {
    public static BufferedReader readDataFile(String filename) {
        BufferedReader inputReader = null;
        try {
            inputReader = new BufferedReader(new FileReader(filename));
        } catch (FileNotFoundException ex) {
            System.err.println("File not found: " + filename);
        }
        return inputReader;
    }

    public static Evaluation classify(Classifier model, Instances trainingSet,
                                      Instances testingSet) throws Exception {
        Evaluation evaluation = new Evaluation(trainingSet);
        model.buildClassifier(trainingSet);
        evaluation.evaluateModel(model, testingSet);
        return evaluation;
    }

    public static double calculateAccuracy(FastVector predictions) {
        double correct = 0;
        for (int i = 0; i < predictions.size(); i++) {
            NominalPrediction np = (NominalPrediction) predictions.elementAt(i);
            if (np.predicted() == np.actual()) {
                correct++;
            }
        }
        return 100 * correct / predictions.size();
    }

    public static Instances[][] crossValidationSplit(Instances data, int numberOfFolds) {
        Instances[][] split = new Instances[2][numberOfFolds];
        for (int i = 0; i < numberOfFolds; i++) {
            split[0][i] = data.trainCV(numberOfFolds, i);
            split[1][i] = data.testCV(numberOfFolds, i);
        }
        return split;
    }

    public static void main(String[] args) throws Exception {
        BufferedReader datafile = readDataFile("weather.txt");
        Instances data = new Instances(datafile);
        data.setClassIndex(data.numAttributes() - 1);

        // Do 10-fold cross validation
        Instances[][] split = crossValidationSplit(data, 10);

        // Separate split into training and testing arrays
        Instances[] trainingSplits = split[0];
        Instances[] testingSplits = split[1];

        // Use a set of classifiers
        Classifier[] models = {
            new J48(),           // a decision tree
            new PART(),
            new DecisionTable(), // decision table majority classifier
            new DecisionStump()  // one-level decision tree
        };

        // Run for each model
        for (int j = 0; j < models.length; j++) {
            // Collect every group of predictions for the current model in a FastVector
            FastVector predictions = new FastVector();

            // For each training-testing split pair, train and test the classifier
            for (int i = 0; i < trainingSplits.length; i++) {
                Evaluation validation = classify(models[j], trainingSplits[i], testingSplits[i]);
                predictions.appendElements(validation.predictions());

                // Uncomment to see the summary for each training-testing pair.
                //System.out.println(models[j].toString());
            }

            // Calculate the overall accuracy of the current classifier on all splits
            double accuracy = calculateAccuracy(predictions);

            // Print the current classifier's name and accuracy
            System.out.println("Accuracy of " + models[j].getClass().getSimpleName() + ": "
                    + String.format("%.2f%%", accuracy)
                    + "\n---------------------------------");
        }
    }
}
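The accuracy figure computed by calculateAccuracy above is simply the fraction of predictions whose predicted label equals the actual label. The same computation in a dependency-free sketch (the class and method names here are illustrative, not part of Weka):

```java
public class AccuracySketch {
    // Percentage of positions where predicted equals actual,
    // mirroring what calculateAccuracy does with NominalPrediction objects.
    static double accuracy(double[] actual, double[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) correct++;
        }
        return 100.0 * correct / actual.length;
    }

    public static void main(String[] args) {
        double[] actual    = {0, 1, 1, 0, 1};
        double[] predicted = {0, 1, 0, 0, 1};
        System.out.printf("%.2f%%%n", accuracy(actual, predicted)); // 80.00%
    }
}
```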
The package view of your project should look like the following (project name: java-machine-learning-example).
References:
1. http://www.cs.umb.edu/~ding/history/480_697_spring_2013/homework/WekaJavaAPITutorial.pdf
2. http://www.cs.ru.nl/P.Lucas/teaching/DM/weka.pdf
How to run Weka from Eclipse Weka is a collection of machine learning algorithms for data mining tasks. The
algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre‐processing, classification, regression, clustering, association rules, and visualization. It is also well‐suited for developing new machine learning schemes.
A possible way of doing this is:
1. From the Weka installation, unzip weka-src.jar (or use the CVS download) into a folder named Weka.
2. Put this folder in the Eclipse workspace.
3. Create a new project called Weka (Eclipse will detect the files).
Look at http://www.cs.waikato.ac.nz/~ml/weka/eclipse_and_weka/ for step‐by‐step description with screen snap shots.
Reference: How to run Weka from Eclipse ‐ Java Tips.html
Use WEKA in your Java code
The most common components you might want to use are:
- Instances - your data
- Filter - for preprocessing the data
- Classifier/Clusterer - built on the processed data
- Evaluating - how good is the classifier/clusterer?
- Attribute selection - removing irrelevant attributes from your data
The following sections explain how to use them in your own code. A link to an example class can be found at the end of this page, under the Links section. The classifiers and filters always list their options in the Javadoc API (book, stable, developer version) specification.
You might also want to check out the Weka Examples collection, containing examples for the different versions of Weka. Another, more comprehensive, source of information is the chapter Using the API of the Weka manual for the stable‐3.6 and developer version (snapshots and releases later than 09/08/2009).
Instances
ARFF File
Pre 3.5.5 and 3.4.x
Reading from an ARFF file is straightforward:
import weka.core.Instances;
import java.io.BufferedReader;
import java.io.FileReader;
...
BufferedReader reader = new BufferedReader(
        new FileReader("/some/where/data.arff"));
Instances data = new Instances(reader);
reader.close();
// setting class attribute
data.setClassIndex(data.numAttributes() - 1);
The class index indicates the target attribute used for classification. By default, in an ARFF file, it is the last attribute, which explains why it's set to numAttributes() - 1. You must set it if your instances are used as a parameter of a Weka function (e.g., weka.classifiers.Classifier.buildClassifier(data)).
3.5.5 and newer
The DataSource class is not limited to ARFF files. It can also read CSV files and other formats (basically all file formats that Weka can import via its converters).
import weka.core.converters.ConverterUtils.DataSource;
...
DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
// setting class attribute if the data format does not provide this information
// For example, the XRFF format saves the class attribute information as well
if (data.classIndex() == -1)
    data.setClassIndex(data.numAttributes() - 1);
Database
Reading from databases is slightly more complicated, but still very easy. First, you'll have to modify your DatabaseUtils.props file to reflect your database connection. Suppose you want to connect to a MySQL server that is running on the local machine on the default port 3306. The MySQL JDBC driver is called Connector/J (the driver class is org.gjt.mm.mysql.Driver). The database where your target data resides is called some_database. Since you're only reading, you can use the default user nobody without a password. Your props file must contain the following lines:
jdbcDriver=org.gjt.mm.mysql.Driver jdbcURL=jdbc:mysql://localhost:3306/some_database
Secondly, your Java code needs to look like this to load the data from the database:
import weka.core.Instances;
import weka.experiment.InstanceQuery;
...
InstanceQuery query = new InstanceQuery();
query.setUsername("nobody");
query.setPassword("");
query.setQuery("select * from whatsoever");
// You can declare that your data set is sparse
// query.setSparseData(true);
Instances data = query.retrieveInstances();
Notes:
Don't forget to add the JDBC driver to your CLASSPATH.
For MS Access, you must use the JDBC‐ODBC‐bridge that is part of a JDK. The Windows databases article explains how to do this.
InstanceQuery automatically converts VARCHAR database columns to NOMINAL attributes, and long TEXT database columns to STRING attributes. So if you use InstanceQuery to do text mining against text that appears in a VARCHAR column, Weka will regard such text as nominal values. Thus it will fail to tokenize and mine that text. Use the NominalToString or StringToNominal filter (package weka.filters.unsupervised.attribute) to convert the attributes into the correct type.
Option handling
Weka schemes that implement the weka.core.OptionHandler interface, such as classifiers, clusterers, and filters, offer the following methods for setting and retrieving options:
void setOptions(String[] options)
String[] getOptions()
There are several ways of setting the options:
Manually creating a String array:
String[] options = new String[2];
options[0] = "-R";
options[1] = "1";
Using a single command‐line string and using the splitOptions method of the
weka.core.Utils class to turn it into an array:
String[] options = weka.core.Utils.splitOptions("-R 1");
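splitOptions essentially tokenizes on whitespace while keeping double-quoted sections together, so that a nested option string stays one token. The following is a rough, simplified sketch of that behavior in plain Java (Weka's real implementation also handles backslash escapes, which are omitted here; the class name SplitSketch is made up):

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Split a command line on whitespace, treating double-quoted
    // sections as single tokens (simplified vs. weka.core.Utils.splitOptions).
    static List<String> split(String cmd) {
        List<String> tokens = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char ch : cmd.toCharArray()) {
            if (ch == '"') {
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(ch) && !inQuotes) {
                if (cur.length() > 0) { tokens.add(cur.toString()); cur.setLength(0); }
            } else {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) tokens.add(cur.toString());
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(split("-C 1.0 -K \"PolyKernel -E 1.0\""));
        // [-C, 1.0, -K, PolyKernel -E 1.0]
    }
}
```

Note how the quoted kernel specification survives as a single array element, which is exactly what nested schemes such as SMO's kernel option rely on.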
Using the OptionsToCode.java class to automatically turn a command line into code. This is especially handy if the command line contains nested classes that have their own options, such as kernels for SMO:

java OptionsToCode weka.classifiers.functions.SMO
will generate output like this:
// create new instance of scheme
weka.classifiers.functions.SMO scheme = new weka.classifiers.functions.SMO();
// set options
scheme.setOptions(weka.core.Utils.splitOptions(
        "-C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 "
        + "-K \"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0\""));
Also, the OptionTree.java tool allows you to view a nested options string, e.g., one used at the command line, as a tree. This can help you spot nesting errors.
Filter
A filter has two different properties:
- supervised or unsupervised - either takes the class attribute into account or not
- attribute- or instance-based - e.g., removing a certain attribute or removing instances that meet a certain condition
Most filters implement the OptionHandler interface, which means you can set the options via a String array, rather than setting them each manually via set‐methods. For example, if you want to remove the first attribute of a dataset, you need this filter
weka.filters.unsupervised.attribute.Remove
with this option
-R 1
If you have an Instances object, called data, you can create and apply the filter like this:

import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;
...
String[] options = new String[2];
options[0] = "-R";            // "range"
options[1] = "1";             // first attribute
Remove remove = new Remove(); // new instance of filter
remove.setOptions(options);   // set options
remove.setInputFormat(data);  // inform filter about dataset **AFTER** setting options
Instances newData = Filter.useFilter(data, remove); // apply filter
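Range options such as "-R 1" accept comma-separated values and spans like "1-3,5". A minimal sketch of how such a 1-based range string expands into indices; this is an illustration only, not Weka's weka.core.Range class, and it omits the "first" and "last" keywords that the real class also accepts:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeSketch {
    // Expand a 1-based range string like "1-3,5" into a list of indices.
    static List<Integer> expand(String range) {
        List<Integer> out = new ArrayList<>();
        for (String part : range.split(",")) {
            String[] ends = part.split("-");
            int lo = Integer.parseInt(ends[0].trim());
            int hi = ends.length > 1 ? Integer.parseInt(ends[1].trim()) : lo;
            for (int i = lo; i <= hi; i++) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("1-3,5")); // [1, 2, 3, 5]
    }
}
```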
Filtering on-the-fly
The FilteredClassifier meta-classifier is an easy way of filtering data on the fly. It removes the necessity of filtering the data before the classifier can be trained. Also, the data need not be passed through the trained filter again at prediction time. The following is an example of using this meta-classifier with the Remove filter and J48 for getting rid of a numeric ID attribute in the data:
import weka.classifiers.meta.FilteredClassifier;
import weka.classifiers.trees.J48;
import weka.filters.unsupervised.attribute.Remove;
...
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
// filter
Remove rm = new Remove();
rm.setAttributeIndices("1"); // remove 1st attribute
// classifier
J48 j48 = new J48();
j48.setUnpruned(true);       // using an unpruned J48
// meta-classifier
FilteredClassifier fc = new FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
// train and make predictions
fc.buildClassifier(train);
for (int i = 0; i < test.numInstances(); i++) {
    double pred = fc.classifyInstance(test.instance(i));
    System.out.print("ID: " + test.instance(i).value(0));
    System.out.print(", actual: "
            + test.classAttribute().value((int) test.instance(i).classValue()));
    System.out.println(", predicted: "
            + test.classAttribute().value((int) pred));
}

Other handy meta-schemes in Weka:
- weka.clusterers.FilteredClusterer (since 3.5.4)
- weka.associations.FilteredAssociator (since 3.5.6)
Batch filtering
On the command line, you can enable a second input/output pair (via -r and -s) with the -b option, in order to process the second file with the same filter setup as the first one. This is necessary if you're using attribute selection or standardization - otherwise you end up with incompatible datasets. In code this is done fairly easily: you initialize the filter only once, via the setInputFormat(Instances) method with the training set, and then apply the filter subsequently to the training set and the test set. The following example shows how to apply the Standardize filter to a train and a test set.
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
Standardize filter = new Standardize();
filter.setInputFormat(train); // initializing the filter once with the training set
Instances newTrain = Filter.useFilter(train, filter); // configures the filter on the train instances and returns the filtered set
Instances newTest = Filter.useFilter(test, filter);   // create new test set
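The reason the filter is initialized only once is that standardization must use the training set's statistics for both sets; recomputing mean and standard deviation on the test set would make the two datasets incompatible. The underlying arithmetic in plain Java (a sketch with made-up names, not the Standardize filter itself):

```java
public class StandardizeSketch {
    // Standardize values using the mean/stddev of the TRAINING data only,
    // mirroring why the filter is initialized once with the training set.
    static double[] standardize(double[] values, double mean, double std) {
        double[] out = new double[values.length];
        for (int i = 0; i < values.length; i++) out[i] = (values[i] - mean) / std;
        return out;
    }

    static double mean(double[] v) {
        double s = 0;
        for (double x : v) s += x;
        return s / v.length;
    }

    static double std(double[] v, double mean) {
        double s = 0;
        for (double x : v) s += (x - mean) * (x - mean);
        return Math.sqrt(s / v.length);
    }

    public static void main(String[] args) {
        double[] train = {2, 4, 6, 8};
        double[] test  = {5, 10};
        double m = mean(train), sd = std(train, m); // m = 5.0
        // Both sets are transformed with the SAME train statistics:
        double[] newTrain = standardize(train, m, sd);
        double[] newTest  = standardize(test, m, sd);
        System.out.println(newTest[0]); // 0.0 (5 equals the train mean)
    }
}
```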
Calling conventions
The setInputFormat(Instances) method always has to be the last call before the filter is applied, e.g., with Filter.useFilter(Instances, Filter). Why? First, it is the convention for using filters and, second, lots of filters generate the header of the output format in the setInputFormat(Instances) method with the currently set options (setting options after this call no longer has any effect).
Classification
The necessary classes can be found in this package: weka.classifiers

Building a Classifier
Batch
A Weka classifier is rather simple to train on a given dataset. E.g., we can train an unpruned C4.5 tree on a given dataset data. The training is done via the buildClassifier(Instances) method.
import weka.classifiers.trees.J48;
...
String[] options = new String[1];
options[0] = "-U";          // unpruned tree
J48 tree = new J48();       // new instance of tree
tree.setOptions(options);   // set the options
tree.buildClassifier(data); // build classifier
Incremental
Classifiers implementing the weka.classifiers.UpdateableClassifier interface can be trained incrementally. This conserves memory, since the data doesn't have to be loaded into memory all at once. See the Javadoc of this interface to see which classifiers implement it.

The actual process of training an incremental classifier is fairly simple:
1. Call buildClassifier(Instances) with the structure of the dataset (may or may not contain any actual data rows).
2. Subsequently call the updateClassifier(Instance) method to feed the classifier new weka.core.Instance objects, one by one.
Here is an example using data from a weka.core.converters.ArffLoader to train weka.classifiers.bayes.NaiveBayesUpdateable:
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();
structure.setClassIndex(structure.numAttributes() - 1);

// train NaiveBayes
NaiveBayesUpdateable nb = new NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
    nb.updateClassifier(current);
A working example is IncrementalClassifier.java.
Evaluating
Cross-validation
If you only have a training set and no test set, you might want to evaluate the classifier by using 10 times 10-fold cross-validation. This can be easily done via the Evaluation class. Here we seed the random selection of our folds for the CV with 1. Check out the Evaluation class for more information about the statistics it produces.
import weka.classifiers.Evaluation;
import java.util.Random;
...
Evaluation eval = new Evaluation(newData);
eval.crossValidateModel(tree, newData, 10, new Random(1));
Note: The classifier (in our example tree) should not be trained when handed over to the crossValidateModel method. Why? If the classifier does not abide by the Weka convention that a classifier must be re-initialized every time the buildClassifier method is called (in other words, subsequent calls to the buildClassifier method always produce the same model), you will get inconsistent and worthless results. The crossValidateModel method takes care of training and evaluating the classifier (it creates a copy of the original classifier that you hand over, for each run of the cross-validation).
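For intuition, k-fold cross-validation partitions the n instances into k disjoint test folds of near-equal size; a common convention gives the first n mod k folds one extra instance. This is a sketch of that convention, not necessarily Weka's exact trainCV/testCV implementation (check their Javadoc for the precise behavior):

```java
import java.util.Arrays;

public class FoldSizes {
    // Partition n instances into k test folds of near-equal size:
    // the first (n % k) folds receive one extra instance.
    static int[] testFoldSizes(int n, int k) {
        int[] sizes = new int[k];
        for (int fold = 0; fold < k; fold++) {
            sizes[fold] = n / k + (fold < n % k ? 1 : 0);
        }
        return sizes;
    }

    public static void main(String[] args) {
        // 14 weather instances, 10 folds: four folds of 2, six folds of 1.
        System.out.println(Arrays.toString(testFoldSizes(14, 10)));
        // [2, 2, 2, 2, 1, 1, 1, 1, 1, 1]
    }
}
```

Every instance appears in exactly one test fold, so the fold sizes always sum to n.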
Train/test set
In case you have a dedicated test set, you can train the classifier and then evaluate it on this test set. In the following example, a J48 is instantiated, trained and then evaluated. Some statistics are printed to stdout:
import weka.core.Instances;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
...
Instances train = ... // from somewhere
Instances test = ...  // from somewhere
// train classifier
Classifier cls = new J48();
cls.buildClassifier(train);
// evaluate classifier and print some statistics
Evaluation eval = new Evaluation(train);
eval.evaluateModel(cls, test);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
Statistics
Some methods for retrieving the results from the evaluation:
- nominal class
  o correct() - number of correctly classified instances (see also incorrect())
  o pctCorrect() - percentage of correctly classified instances (see also pctIncorrect())
  o kappa() - Kappa statistic
- numeric class
  o correlationCoefficient() - correlation coefficient
- general
  o meanAbsoluteError() - the mean absolute error
  o rootMeanSquaredError() - the root mean squared error
  o unclassified() - number of unclassified instances
  o pctUnclassified() - percentage of unclassified instances
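Two of these statistics, pctCorrect() and kappa(), can be computed directly from a confusion matrix: pctCorrect is the trace over the total count, and kappa corrects the observed agreement for the agreement expected by chance. A dependency-free sketch (the class ConfusionStats is illustrative, not Weka's Evaluation):

```java
public class ConfusionStats {
    // pctCorrect: trace of the confusion matrix over the total count.
    static double pctCorrect(double[][] cm) {
        double correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++)
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];
            }
        return 100.0 * correct / total;
    }

    // Cohen's kappa: (observed - chance agreement) / (1 - chance agreement).
    static double kappa(double[][] cm) {
        int n = cm.length;
        double total = 0, po = 0, pe = 0;
        double[] rowSum = new double[n], colSum = new double[n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                total += cm[i][j];
                rowSum[i] += cm[i][j];
                colSum[j] += cm[i][j];
            }
        for (int i = 0; i < n; i++) {
            po += cm[i][i] / total;
            pe += (rowSum[i] / total) * (colSum[i] / total);
        }
        return (po - pe) / (1 - pe);
    }

    public static void main(String[] args) {
        double[][] cm = { {40, 10}, {5, 45} }; // rows: actual, cols: predicted
        System.out.printf("%.1f%% correct, kappa = %.2f%n",
                          pctCorrect(cm), kappa(cm));
        // 85.0% correct, kappa = 0.70
    }
}
```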
If you want to have the exact same behavior as from the command line, use this call:
import weka.classifiers.trees.J48;
import weka.classifiers.Evaluation;
...
String[] options = new String[2];
options[0] = "-t";
options[1] = "/some/where/somefile.arff";
System.out.println(Evaluation.evaluateModel(new J48(), options));
ROC curves/AUC Since Weka 3.5.1, you can also generate ROC curves/AUC with the predictions Weka recorded during testing. You can access these predictions via the predictions() method of the Evaluation class. See the Generating ROC curve article for a full example of how to generate ROC curves.
Classifying instances
In case you have an unlabeled dataset that you want to classify with your newly trained classifier, you can use the following code snippet. It loads the file /some/where/unlabeled.arff, uses the previously built classifier tree to label the instances, and saves the labeled data as /some/where/labeled.arff.
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import weka.core.Instances;
...
// load unlabeled data
Instances unlabeled = new Instances(
        new BufferedReader(
                new FileReader("/some/where/unlabeled.arff")));
// set class attribute
unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
// create copy
Instances labeled = new Instances(unlabeled);
// label instances
for (int i = 0; i < unlabeled.numInstances(); i++) {
    double clsLabel = tree.classifyInstance(unlabeled.instance(i));
    labeled.instance(i).setClassValue(clsLabel);
}
// save labeled data
BufferedWriter writer = new BufferedWriter(
        new FileWriter("/some/where/labeled.arff"));
writer.write(labeled.toString());
writer.newLine();
writer.flush();
writer.close();

Note on nominal classes:
If you're interested in the distribution over all the classes, use the method distributionForInstance(Instance). This method returns a double array with the probability for each class.
The returned double value from classifyInstance (or the index in the array returned by distributionForInstance) is just the index for the string values in the attribute. That is, if you want the string representation for the class label returned above clsLabel, then you can print it like this:
System.out.println(clsLabel + " -> " + unlabeled.classAttribute().value((int) clsLabel));
Clustering
Clustering is similar to classification. The necessary classes can be found in this package: weka.clusterers
Building a Clusterer
Batch
A clusterer is built in much the same way as a classifier, but using the buildClusterer(Instances) method instead of buildClassifier(Instances). The following code snippet shows how to build an EM clusterer with a maximum of 100 iterations.
import weka.clusterers.EM;
...
String[] options = new String[2];
options[0] = "-I";              // max. iterations
options[1] = "100";
EM clusterer = new EM();        // new instance of clusterer
clusterer.setOptions(options);  // set the options
clusterer.buildClusterer(data); // build the clusterer
Incremental
Clusterers implementing the weka.clusterers.UpdateableClusterer interface can be trained incrementally (available since version 3.5.4). This conserves memory, since the data doesn't have to be loaded into memory all at once. See the Javadoc for this interface to see which clusterers implement it.

The actual process of training an incremental clusterer is fairly simple:
1. Call buildClusterer(Instances) with the structure of the dataset (may or may not contain any actual data rows).
2. Subsequently call the updateClusterer(Instance) method to feed the clusterer new weka.core.Instance objects, one by one.
3. Call updateFinished() after all Instance objects have been processed, for the clusterer to perform additional computations.
Here is an example using data from a weka.core.converters.ArffLoader to train weka.clusterers.Cobweb:
// load data
ArffLoader loader = new ArffLoader();
loader.setFile(new File("/some/where/data.arff"));
Instances structure = loader.getStructure();

// train Cobweb
Cobweb cw = new Cobweb();
cw.buildClusterer(structure);
Instance current;
while ((current = loader.getNextInstance(structure)) != null)
    cw.updateClusterer(current);
cw.updateFinished();
A working example is IncrementalClusterer.java.
Evaluating
For evaluating a clusterer, you can use the ClusterEvaluation class. In this example, the number of clusters found is written to output:

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.Clusterer;
...
ClusterEvaluation eval = new ClusterEvaluation();
Clusterer clusterer = new EM();  // new clusterer instance, default options
clusterer.buildClusterer(data);  // build clusterer
eval.setClusterer(clusterer);    // the clusterer to evaluate
eval.evaluateClusterer(newData); // data to evaluate the clusterer on
System.out.println("# of clusters: " + eval.getNumClusters()); // output # of clusters
Or, in the case of density-based clusterers, you can cross-validate the clusterer (note: with MakeDensityBasedClusterer you can turn any clusterer into a density-based one):
import weka.clusterers.ClusterEvaluation;
import weka.clusterers.DensityBasedClusterer;
import weka.core.Instances;
import java.util.Random;
...
Instances data = ... // from somewhere
DensityBasedClusterer clusterer = new ... // the clusterer to evaluate
double logLikelyhood = ClusterEvaluation.crossValidateModel( // cross-validate
        clusterer, data, 10, // with 10 folds
        new Random(1));      // and random number generator with seed 1
Or, if you want the same behavior/print‐out from command line, use this call:
import weka.clusterers.EM;
import weka.clusterers.ClusterEvaluation;
...
String[] options = new String[2];
options[0] = "-t";
options[1] = "/some/where/somefile.arff";
System.out.println(ClusterEvaluation.evaluateClusterer(new EM(), options));
Clustering instances
The only difference with regard to classification is the method name. Instead of classifyInstance(Instance), it is now clusterInstance(Instance). The method for obtaining the distribution is still the same, i.e., distributionForInstance(Instance).
Classes to clusters evaluation
If your data contains a class attribute and you want to check how well the generated clusters fit the classes, you can perform a so-called classes-to-clusters evaluation. The Weka Explorer offers this functionality, and it's quite easy to implement. These are the necessary steps (complete source code: ClassesToClusters.java):
load the data and set the class attribute
Instances data = new Instances(
        new BufferedReader(
                new FileReader("/some/where/file.arff")));
data.setClassIndex(data.numAttributes() - 1);
generate the class‐less data to train the clusterer with
weka.filters.unsupervised.attribute.Remove filter =
        new weka.filters.unsupervised.attribute.Remove();
filter.setAttributeIndices("" + (data.classIndex() + 1));
filter.setInputFormat(data);
Instances dataClusterer = Filter.useFilter(data, filter);
train the clusterer, e.g., EM
EM clusterer = new EM();
// set further options for EM, if necessary...
clusterer.buildClusterer(dataClusterer);
evaluate the clusterer with the data still containing the class attribute
ClusterEvaluation eval = new ClusterEvaluation();
eval.setClusterer(clusterer);
eval.evaluateClusterer(data);
print the results of the evaluation to stdout
System.out.println(eval.clusterResultsToString());
Attribute selection
There is no real need to use the attribute selection classes directly in your own code, since there are already a meta-classifier and a filter available for applying attribute selection, but the low-level approach is still listed for the sake of completeness. The following examples all use CfsSubsetEval and GreedyStepwise (backwards). The code listed below is taken from AttributeSelectionTest.java.
Meta-classifier
The following meta-classifier performs a preprocessing step of attribute selection before the data gets presented to the base classifier (in the example here, this is J48).

Instances data = ... // from somewhere
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);
// 10-fold cross-validation
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());
Filter
The filter approach is straightforward: after setting up the filter, one just filters the data through the filter and obtains the reduced dataset.
Instances data = ... // from somewhere
AttributeSelection filter = new AttributeSelection(); // package weka.filters.supervised.attribute!
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
filter.setEvaluator(eval);
filter.setSearch(search);
filter.setInputFormat(data);
// generate new data
Instances newData = Filter.useFilter(data, filter);
System.out.println(newData);
Low-level
If neither the meta-classifier nor the filter approach is suitable for your purposes, you can use the attribute selection classes themselves.
Instances data = ... // from somewhere
AttributeSelection attsel = new AttributeSelection(); // package weka.attributeSelection!
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);
attsel.SelectAttributes(data);
// obtain the attribute indices that were selected
int[] indices = attsel.selectedAttributes();
System.out.println(Utils.arrayToString(indices));
Note on randomization
Most machine learning schemes, like classifiers and clusterers, are susceptible to the ordering of the data. Using a different seed for randomizing the data will most likely produce a different result. For example, the Explorer, or a classifier/clusterer run from the command line, uses only a seeded java.util.Random number generator, whereas weka.core.Instances.getRandomNumberGenerator(int) (which WekaDemo.java uses) also takes the data into account for seeding. Unless one runs 10-fold cross-validation 10 times and averages the results, one will most likely get different results.
Use Weka in your Java code 2
There are three files: 1mn.arff, Weka_Use.java, and Weka_ManageInstances.java.
1. The first one is just an example of an ARFF file.
2. The second, based on Use Weka in your Java Code, uses only the methods seen on this page.
3. The last one contains some new methods to build and filter the Instances.
Weka_Use.java
Based on Use Weka in your Java Code; uses five methods:
1. test, to try the four other methods.
2. buildInstancesP/N, to build an Instances from another one, selecting the rows by percent or number.
3. learning, to create the classifier.
4. evaluation, used to evaluate the test Instances given with the classifier previously built.
Weka_ManageInstances.java
Contains some methods to build and filter the given Instances.

1. Converting a String into an ARFF file.
instancesFromString

/**
 * Convert a String which represents an ARFF file into an Instances.
 * @param arff String which represents an ARFF file.
 * @return The Instances from the ARFF String.
 * @throws IOException
 */
public static Instances instancesFromString(String arff) throws IOException {
    StringReader reader = new StringReader(arff);
    Instances insts = new Instances(reader);
    if (insts.classIndex() == -1)
        insts.setClassIndex(insts.numAttributes() - 1);
    return insts;
}
2. Columns Selection (Attributes Selection).
attributSelection
/**
 * Select some attributes from a given Instances.
 * @param data An Instances of the data.
 * @param option String which represents the attributes to remove:
 *               "1-4" | "28" | "1-70,45,68-72" | "" | ...
 * @return The new Instances of data without the undesired attributes.
 * @throws Exception
 */
public static Instances attributSelection(Instances data, String option) throws Exception {
    String[] options = new String[2];
    options[0] = "-R";
    options[1] = option;
    Remove remove = new Remove();
    remove.setOptions(options);
    remove.setInputFormat(data);
    Instances newData = Filter.useFilter(data, remove);
    if (newData.classIndex() == -1)
        newData.setClassIndex(newData.numAttributes() - 1);
    return newData;
}
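The "-R" option of the Remove filter takes a 1-based range string such as "1-4" or "1-70,45,68-72". As a sanity check, the hypothetical helper below (expand is our own name, not a Weka API; the real parsing is done inside Weka) expands such a string so you can see exactly which attribute indices the option would remove:

```java
import java.util.ArrayList;
import java.util.List;

public class RangeOptionDemo {
    // Expand a "-R"-style range string like "1-3,7" into the list of
    // 1-based indices it covers. Illustrative only; Weka parses ranges itself.
    public static List<Integer> expand(String option) {
        List<Integer> out = new ArrayList<>();
        if (option.isEmpty()) return out;
        for (String part : option.split(",")) {
            String[] bounds = part.split("-");
            int lo = Integer.parseInt(bounds[0]);
            int hi = Integer.parseInt(bounds[bounds.length - 1]);
            for (int i = lo; i <= hi; i++) out.add(i);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(expand("1-3,7")); // prints "[1, 2, 3, 7]"
    }
}
```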
3. Row Selection (Instance Selection).

3.1 Based on percent or number.
percentSelection
/**
 * Choose some lines of data by indicating between which percents
 * to select the rows.
 * @param data An Instances of the data.
 * @param start Percent indicating the first line of the selection.
 * @param end Percent indicating the last line of the selection.
 * @return The new Instances of data with only the desired rows.
 */
public static Instances percentSelection(Instances data, double start, double end) {
    if (end < start) {
        double temp = start;
        start = end;
        end = temp;
    }
    int to_start = (int) Math.round(data.numInstances() * start);
    int to_end = Math.max((int) Math.round(data.numInstances() * end) - to_start, 1);
    Instances newData = new Instances(data, to_start, to_end);
    if (newData.classIndex() == -1)
        newData.setClassIndex(newData.numAttributes() - 1);
    return newData;
}
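The index arithmetic of percentSelection can be checked on its own, without Weka. This small sketch (class and method names are ours) maps a percent range onto the (first row, row count) pair that the Instances copy-constructor receives:

```java
public class PercentRangeDemo {
    // Reproduce percentSelection's arithmetic: map a percent range
    // [start, end] over a dataset with n rows to {firstRow, rowCount}.
    public static int[] range(int n, double start, double end) {
        if (end < start) { double t = start; start = end; end = t; }
        int toStart = (int) Math.round(n * start);
        int toEnd = Math.max((int) Math.round(n * end) - toStart, 1);
        return new int[] { toStart, toEnd };
    }

    public static void main(String[] args) {
        // The middle half of a 100-row dataset: start at row 25, take 50 rows.
        int[] r = range(100, 0.25, 0.75);
        System.out.println(r[0] + " " + r[1]); // prints "25 50"
    }
}
```

Note that swapped bounds are tolerated (they are silently reordered), and at least one row is always selected because of the Math.max(..., 1).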
rowNumberSelection

/**
 * Select some lines of data by indicating between which line numbers
 * to choose the rows.
 * @param data An Instances of the data.
 * @param start Line number indicating the first line of the selection.
 * @param end Line number indicating the last line of the selection.
 * @return The new Instances of data with only the desired rows.
 */
public static Instances rowNumberSelection(Instances data, int start, int end) {
    if (end < start) {
        int temp = start;
        start = end;
        end = temp;
    }
    Instances newData = new Instances(data, start, end - start);
    if (newData.classIndex() == -1)
        newData.setClassIndex(newData.numAttributes() - 1);
    return newData;
}
3.2 Based on filter.
operatorSelection
/**
 * Pick up the lines whose attribute number attribute_index is
 * [ '>', '<', '=' ] than the value.
 * Ex: data = operatorSelection(data, 4, '>', -0.3);
 * keeps every line whose value in column number 4 is greater than -0.3.
 * @param data An Instances of the data.
 * @param attribute_index The index of the attribute column to compare.
 * @param operator Used to choose how to compare: '>' | '<' | '='
 * @param value The value used to compare.
 * @return The new Instances of data with only the desired rows.
 * @throws Exception
 */
public static Instances operatorSelection(Instances data, int attribute_index,
        char operator, double value) throws Exception {
    RemoveWithValues filter = new RemoveWithValues();
    if (attribute_index > data.numAttributes())
        attribute_index = data.numAttributes();
    int current = 0;
    double epsilon = 0.001;
    String[] options = new String[4];
    switch (operator) {
        case '>':
            options = new String[4];
            value += epsilon;
            break;
        case '<':
            options = new String[5];
            options[current++] = "-V";
            break;
        case '=':
            // >=
            // options = new String[4];
            // options[current++] = "-C";
            // options[current++] = String.valueOf(attribute_index);
            // options[current++] = "-S";
            // options[current++] = String.valueOf(value);
            // filter.setOptions(options);
            // filter.setInputFormat(data);
            // data = Filter.useFilter(data, filter);
            // <=
            current = 0;
            options = new String[5];
            options[current++] = "-V";
            value += epsilon;
            break;
        default:
            System.out.println("ERROR: Weka_ManageInstance, operatorSelection, unknown operator.");
            return data;
    }
    options[current++] = "-C";
    options[current++] = String.valueOf(attribute_index);
    options[current++] = "-S";
    options[current++] = String.valueOf(value);
    filter.setOptions(options);
    filter.setInputFormat(data);
    Instances newData = Filter.useFilter(data, filter);
    if (newData.classIndex() == -1)
        newData.setClassIndex(newData.numAttributes() - 1);
    return newData;
}
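The role of the epsilon nudge above can be sketched in isolation. This is only a rough model of the intended comparison semantics (the class, the method kept, and the simplified conditions are ours, not RemoveWithValues itself, which works through its -S split point and -V inversion): shifting the threshold by a small epsilon turns a "greater or equal" split into a strict "greater than" test.

```java
public class ThresholdDemo {
    // Simplified model of operatorSelection's comparison: a row survives
    // the filter when its cell value falls on the kept side of the
    // (possibly epsilon-shifted) threshold.
    public static boolean kept(double cell, double value, char op) {
        double epsilon = 0.001;
        switch (op) {
            case '>': return cell >= value + epsilon; // epsilon makes this strict
            case '<': return cell < value;
            default:  return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(kept(-0.2, -0.3, '>')); // prints "true"
        System.out.println(kept(-0.3, -0.3, '>')); // prints "false": equal is excluded
    }
}
```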
3.3 To avoid redundancy.
differentNextSelection
/**
 * Delete every line followed by a row with the same values.
 * Uses equalsInstance().
 * @param data An Instances of the data.
 * @return The new Instances of data with only the desired rows.
 */
public static Instances differentNextSelection(Instances data) {
    Instances newData = data;
    for (int index = newData.numInstances() - 1; index > 0; index--) {
        Instance inst = newData.instance(index);
        Instance next = newData.instance(index - 1);
        if (equalsInstance(inst, next))
            newData.delete(index);
    }
    return newData;
}
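The same backwards-walking dedup pattern works on any list, which makes it easy to test without Weka. A minimal sketch (our own class and helper; rows are plain double[] instead of Instance, compared with Arrays.equals instead of equalsInstance):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class DedupDemo {
    // Mirror differentNextSelection on a plain list: walk backwards and
    // drop any element equal to the one right before it, keeping one copy.
    public static List<double[]> dropConsecutiveDuplicates(List<double[]> rows) {
        List<double[]> out = new ArrayList<>(rows);
        for (int i = out.size() - 1; i > 0; i--) {
            if (Arrays.equals(out.get(i), out.get(i - 1)))
                out.remove(i);
        }
        return out;
    }

    public static void main(String[] args) {
        List<double[]> rows = new ArrayList<>(Arrays.asList(
            new double[]{1, 2}, new double[]{1, 2}, new double[]{3, 4}));
        System.out.println(dropConsecutiveDuplicates(rows).size()); // prints "2"
    }
}
```

Walking backwards matters: deleting by index while iterating forward would skip the element after each deletion.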
4. To add lines.
concatInstances
/**
 * Return a concatenation of the given Instances.
 * @param inst1 First Instances (head).
 * @param inst2 Second Instances to add (tail).
 * @return ( inst1 ^ inst2 )
 */
public static Instances concatInstances(Instances inst1, Instances inst2) {
    ArrayList<Instance> instAL = new ArrayList<Instance>();
    for (int i = 0; i < inst2.numInstances(); i++)
        instAL.add(inst2.instance(i));
    for (int i = 0; i < instAL.size(); i++)
        inst1.add(instAL.get(i));
    return inst1;
}
addDifferentWithPrevious

/**
 * Add the Instance at the end of the Instances if the last one is different.
 * @param data An Instances of the data.
 * @param inst The Instance to add.
 * @return The new Instances of data with only the desired rows.
 */
public static Instances addDifferentWithPrevious(Instances data, Instance inst) {
    Instances newData = data;
    if (!equalsInstance(data.instance(data.numInstances() - 1), inst))
        newData.add(inst);
    return newData;
}
addDifferentWithAll
/**
 * Add the Instance at the end of the Instances if all instances are different.
 * @param data An Instances of the data.
 * @param inst The Instance to add.
 * @return The new Instances of data with only the desired rows.
 */
public static Instances addDifferentWithAll(Instances data, Instance inst) {
    Instances newData = data;
    for (int i = 0; i < newData.numInstances(); i++) {
        if (equalsInstance(data.instance(i), inst))
            return newData;
    }
    newData.add(inst);
    return newData;
}
addDifferentWithAll_dontCareOfLastAtt

/**
 * Add a new Instance to the learning Instances, replacing an older one
 * which has the same values but a different prediction/last attribute.
 * @param data The learning Instances.
 * @param inst The new instance that replaces an older prediction.
 * @param addEvenIfNoSimilar Add inst to data even if there is no similar instance (not only replace).
 * @return The new learning Instances.
 */
public static Instances addDifferentWithAll_dontCareOfLastAtt(Instances data, Instance inst,
        boolean addEvenIfNoSimilar) {
    Instances newData = data;
    int i = indexOfSame_dontCareOfLastAtt(newData, inst);
    if (i != -1) {
        newData.delete(i);
        newData.add(inst);
    } else if (addEvenIfNoSimilar)
        newData.add(inst);
    return newData;
}
5. To delete.
deleteInstance
/**
 * Delete every Instance inst of data.
 * @param data
 * @param inst
 * @return
 */
public static Instances deleteInstance(Instances data, Instance inst) {
    Instances newData = data;
    int i = 0;
    while (i != -1 && data.numInstances() > 2) {
        i = indexOfInstance(newData, inst);
        if (i != -1)
            newData.delete(i);
    }
    return newData;
}
deleteClosestInstance

/**
 * Return an Instances without the closest Instance of inst.
 * @param data
 * @param inst
 * @param valueMinMax ArrayList of the min and max values taken by each Attribute of data.
 * @return The new data without the closest Instance of inst.
 */
public static Instances deleteClosestInstance(Instances data, Instance inst,
        ArrayList<Double> valueMinMax) {
    Instances newData = data;
    Instance instToDel = Weka_ManageInstances.getClosestInstance(newData, inst, valueMinMax);
    newData = Weka_ManageInstances.deleteInstance(newData, instToDel);
    return newData;
}
deleteClosestInstance
/**
 * Return an Instances without the numberToDel closest Instances of inst.
 * @param data
 * @param inst
 * @param valueMinMax ArrayList of the min and max values taken by each Attribute of data.
 * @param numberToDel The number of Instances of data close to inst to delete.
 * @return The new data without the numberToDel closest Instances of inst.
 */
public static Instances deleteClosestInstance(Instances data, Instance inst,
        ArrayList<Double> valueMinMax, int numberToDel) {
    Instances newData = data;
    for (int i = 0; i < numberToDel && newData.numInstances() > numberToDel + 5; i++) {
        Instance instToDel = Weka_ManageInstances.getClosestInstance(newData, inst, valueMinMax);
        newData = Weka_ManageInstances.deleteInstance(newData, instToDel);
    }
    return newData;
}
Programmatic Use
Introduction
This tutorial shows how to use Weka (build a feature vector, train a classifier, test a classifier, use a classifier) directly from Java code. It is not intended to replace the Explorer/Experimenter GUIs, which offer the visualization and engineering tools required to set up and debug machine learning experiments. Using Weka from code is, however, useful for embedding a classifier in a larger program and for creating a training/testing loop that can serve as a regression test for machine learning capabilities.
Step 1: Express the problem with features

This step corresponds to the engineering task needed to write an .arff file. Let's put all our features in a weka.core.FastVector. Each feature is contained in a weka.core.Attribute object.
Here, we have two numeric features, one nominal feature (blue, gray, black) and a
nominal class (positive, negative).
// Declare two numeric attributes
Attribute Attribute1 = new Attribute("firstNumeric");
Attribute Attribute2 = new Attribute("secondNumeric");

// Declare a nominal attribute along with its values
FastVector fvNominalVal = new FastVector(3);
fvNominalVal.addElement("blue");
fvNominalVal.addElement("gray");
fvNominalVal.addElement("black");
Attribute Attribute3 = new Attribute("aNominal", fvNominalVal);

// Declare the class attribute along with its values
FastVector fvClassVal = new FastVector(2);
fvClassVal.addElement("positive");
fvClassVal.addElement("negative");
Attribute ClassAttribute = new Attribute("theClass", fvClassVal);

// Declare the feature vector
FastVector fvWekaAttributes = new FastVector(4);
fvWekaAttributes.addElement(Attribute1);
fvWekaAttributes.addElement(Attribute2);
fvWekaAttributes.addElement(Attribute3);
fvWekaAttributes.addElement(ClassAttribute);
Step 2: Train a Classifier
Training requires 1) having a training set of instances and 2) choosing a classifier. Let's first create an empty training set (weka.core.Instances). We name the relation "Rel". The attribute prototype is declared using the vector from step 1. We give an initial set capacity of 10. We also declare that the class attribute is the fourth one in the vector (see step 1).

// Create an empty training set
Instances isTrainingSet = new Instances("Rel", fvWekaAttributes, 10);

// Set class index
isTrainingSet.setClassIndex(3);
Now, let’s fill the training set with one instance (weka.core.Instance):
// Create the instance
Instance iExample = new DenseInstance(4);
iExample.setValue((Attribute) fvWekaAttributes.elementAt(0), 1.0);
iExample.setValue((Attribute) fvWekaAttributes.elementAt(1), 0.5);
iExample.setValue((Attribute) fvWekaAttributes.elementAt(2), "gray");
iExample.setValue((Attribute) fvWekaAttributes.elementAt(3), "positive");

// Add the instance
isTrainingSet.add(iExample);
Finally, choose a classifier (weka.classifiers.Classifier) and create the model. Let's, for example, create a naive Bayes classifier (weka.classifiers.bayes.NaiveBayes):
// Create a naive Bayes classifier
Classifier cModel = (Classifier) new NaiveBayes();
cModel.buildClassifier(isTrainingSet);
Step 3: Test the classifier
Now that we have created and trained a classifier, let's test it. To do so, we need an evaluation module (weka.classifiers.Evaluation), to which we feed a testing set (see step 2, since the testing set is built like the training set).
// Test the model
Evaluation eTest = new Evaluation(isTrainingSet);
eTest.evaluateModel(cModel, isTestingSet);
The evaluation module can output a bunch of statistics:
// Print the result à la Weka Explorer:
String strSummary = eTest.toSummaryString();
System.out.println(strSummary);

// Get the confusion matrix
double[][] cmMatrix = eTest.confusionMatrix();
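A confusion matrix like cmMatrix holds counts indexed as [actual class][predicted class], so summary statistics such as accuracy fall out of simple arithmetic. The sketch below (our own class; the 2x2 counts are made-up illustration data, not Weka output) computes accuracy as the diagonal sum over the total:

```java
public class ConfusionDemo {
    // Accuracy from a confusion matrix cm[actual][predicted]:
    // the sum of the diagonal (correct predictions) over the total count.
    public static double accuracy(double[][] cm) {
        double correct = 0, total = 0;
        for (int i = 0; i < cm.length; i++)
            for (int j = 0; j < cm[i].length; j++) {
                total += cm[i][j];
                if (i == j) correct += cm[i][j];
            }
        return correct / total;
    }

    public static void main(String[] args) {
        // Hypothetical 2-class counts: 40+45 correct out of 100 instances.
        double[][] cm = { {40, 10}, {5, 45} };
        System.out.println(accuracy(cm)); // prints "0.85"
    }
}
```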
Step 4: Use the classifier
For real-world applications, the actual use of the classifier is the ultimate goal. Here is the simplest way to achieve that. Let's say we've built an instance (named iUse) as explained in step 2:
// Specify that the instance belongs to the training set
// in order to inherit from the set description
iUse.setDataset(isTrainingSet);

// Get the likelihood of each class
// fDistribution[0] is the probability of being "positive"
// fDistribution[1] is the probability of being "negative"
double[] fDistribution = cModel.distributionForInstance(iUse);
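To turn such a class distribution into a single predicted label, one picks the index of the largest probability (this is what a hard classification amounts to). A self-contained sketch (our own class; the probabilities are made-up example values):

```java
public class ArgmaxDemo {
    // Return the index of the largest value in the class distribution,
    // i.e. the predicted class index.
    public static int argmax(double[] dist) {
        int best = 0;
        for (int i = 1; i < dist.length; i++)
            if (dist[i] > dist[best]) best = i;
        return best;
    }

    public static void main(String[] args) {
        // e.g. P("positive") = 0.8, P("negative") = 0.2
        double[] fDistribution = {0.8, 0.2};
        System.out.println(argmax(fDistribution)); // prints "0", i.e. "positive"
    }
}
```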
Conclusion and More Information
This tutorial has shown the basic way to train, test, and use a classifier programmatically in Weka. The code shown was neither compiled nor tested, since it requires being part of a real classification problem. For complete and compilable examples, please check Balie, an open-source NLP software package that uses Weka for language identification and sentence boundary recognition tasks.