Classification with shuffling - Printable Version

Classification with shuffling - PythonNewbie - Nov-11-2017

Hello all,

This is my first post here, and I hope to find some help.

I am trying to reproduce the results of an example. The example isn't provided in full, so I had to write some parts myself with my limited knowledge of Python. In it, a function reads the file "seeds.tsv" and returns the data and labels as follows (the function is defined in a separate file called "load.py"):

import numpy as np


def load_dataset(dataset_name):
    '''
    data,labels = load_dataset(dataset_name)

    Load a given dataset

    Returns
    -------
    data : numpy ndarray
    labels : numpy ndarray of str
    '''
    data = []
    labels = []
    with open('{0}.tsv'.format(dataset_name)) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    data = np.array(data)
    labels = np.array(labels)
    return data, labels
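
For readers without the file, here is a minimal sketch of the same parsing logic run on a tiny tab-separated file. The helper name load_rows, the file name toy.tsv and the two rows are made up for illustration and are not taken from seeds.tsv; the sketch also skips blank lines, which the function above does not.

```python
import os
import tempfile

import numpy as np


def load_rows(path):
    """Parse a tab-separated file: float features, then a string label in the last column."""
    data, labels = [], []
    with open(path) as ifile:
        for line in ifile:
            tokens = line.strip().split('\t')
            if len(tokens) < 2:
                continue  # skip blank or label-only lines
            data.append([float(tk) for tk in tokens[:-1]])
            labels.append(tokens[-1])
    return np.array(data), np.array(labels)


# Two made-up rows with two features each (illustrative values only)
sample = "15.26\t14.84\tKama\n17.63\t15.98\tRosa\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.tsv")
    with open(path, "w") as f:
        f.write(sample)
    toy_data, toy_labels = load_rows(path)

print(toy_data.shape)   # one row per line, one column per feature
print(toy_labels)       # labels come back as a NumPy array of strings
```

Because both return values are NumPy arrays, they can be indexed with a boolean mask or with a list of shuffled indices.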

After reading the file, I used k-fold cross-validation with the nearest neighbor algorithm as follows:

from load import load_dataset
import numpy as np
import random

feature_names = ['area',
                 'perimeter',
                 'compactness',
                 'length of kernel',
                 'width of kernel',
                 'asymmetry coefficient',
                 'length of kernel groove']

data, lables = load_dataset('seeds')
"""
rndInx = random.sample(range(len(lables)), len(lables))

data = data[rndInx]
lables = lables[rndInx]

print(lables)
"""
#print(lables.shape)

#This function returns the distance between two points in N-dimensional space
def distance(f1, f2):
    return np.sum((f1 - f2)**2)


#10-fold cross validation

fold = 10 #number of folds and blocks in each fold
elem = int(len(lables)/fold)#number of elements in each block

error = 0.0
for fi in range(fold):
    nearestLable = []
    training = np.ones(len(lables), bool)
    training[fi*fold: fi*fold + elem] = False
    testing = ~ training
    data_tr = data[training]
    data_ts = data[testing]
    labels_tr = lables[training]
    labels_ts = lables[testing]
    for x_ts in data_ts:
        dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
    nearest = dists.argmin()
    nearestLable.append(labels_tr[nearest])
    error += np.sum(nearestLable != labels_ts)

print("\n\nThe accuracy of the nearest neighbor"
      " \nclassifier using %i-fold cross "
      "\nvalidation is: %1.2f" %(fold, (1-(error/len(lables)))))

When I run the above code without randomizing the data for 10-fold cross-validation, I get an accuracy of ~0.86 (it should be 0.88, as reported in the original example). But when I randomize the data using the random indices rndInx (lines 15-18 in the second code segment), I get an accuracy of 0.38, and I am not sure why. The original data is ordered, in the sense that examples of the same class are placed contiguously. When I use 70-fold cross-validation, I get an accuracy of 0.98.
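
One thing that can be checked without the data is which rows each fold actually tests. The sketch below counts how often each index falls into a test block, once with blocks starting at fi*fold (as in the loop above) and once with blocks starting at fi*elem. The size of 210 rows is an assumption for illustration, and covered is a hypothetical helper, not part of the original example.

```python
import numpy as np

n, fold = 210, 10      # 210 rows is an assumed dataset size, for illustration only
elem = n // fold       # 21 elements per block


def covered(step):
    """Count how often each index lands in a test block when block fi starts at fi*step."""
    counts = np.zeros(n, dtype=int)
    for fi in range(fold):
        counts[fi * step: fi * step + elem] += 1
    return counts


as_posted = covered(fold)   # blocks start at fi*fold, as in the loop above
by_block = covered(elem)    # blocks start at fi*elem

for name, counts in [("fi*fold", as_posted), ("fi*elem", by_block)]:
    print(name,
          "never tested:", int((counts == 0).sum()),
          "tested more than once:", int((counts > 1).sum()))
```

Under that assumption, the fi*fold version leaves 99 of the 210 indices untested and tests many of the others more than once, while the fi*elem version tests every index exactly once.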

Am I doing something wrong?
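
For comparison, here is a sketch of how the loop might look if the intent is one prediction per test row and non-overlapping test blocks. It is not the original example's code: nn_cross_val and predicted are names chosen here, the data is a synthetic stand-in (three well-separated clusters), and the accuracy is divided by the number of rows actually tested rather than by the full length.

```python
import numpy as np


def distance(f1, f2):
    # Squared Euclidean distance; sufficient for ranking neighbours
    return np.sum((f1 - f2) ** 2)


def nn_cross_val(data, labels, fold=10):
    """Accuracy of a 1-nearest-neighbour classifier under block-wise cross-validation."""
    elem = len(labels) // fold   # block size; leftover rows are never tested
    errors = 0
    for fi in range(fold):
        training = np.ones(len(labels), bool)
        training[fi * elem:(fi + 1) * elem] = False   # block fi is the test set
        testing = ~training
        data_tr, labels_tr = data[training], labels[training]
        predicted = []
        for x_ts in data[testing]:
            dists = np.array([distance(x_ts, y_tr) for y_tr in data_tr])
            predicted.append(labels_tr[dists.argmin()])   # one prediction per test row
        errors += np.sum(np.array(predicted) != labels[testing])
    return 1.0 - errors / (fold * elem)


# Synthetic stand-in for the real data: three well-separated 2-D clusters
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
data = np.vstack([c + rng.normal(scale=0.5, size=(70, 2)) for c in centers])
labels = np.repeat(np.array(['a', 'b', 'c']), 70)

perm = rng.permutation(len(labels))   # shuffle rows and labels with the same indices
acc_ordered = nn_cross_val(data, labels)
acc_shuffled = nn_cross_val(data[perm], labels[perm])
print(acc_ordered, acc_shuffled)
```

On data this cleanly separated, the ordered and shuffled runs should agree, so a large gap between the two on real data points at the fold bookkeeping rather than at the classifier.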

Thanks in advance

RE: Classification with shuffling - PythonNewbie - Nov-12-2017

Could anyone comment on this, please? I realize it needs some machine learning background.