Advice for large dataset classification

Hi everyone,

I am an undergraduate student taking a graduate-level course, Pattern Recognition, and I need some advice on my project. I am given a database of leaves: 3032 samples in total, belonging to 47 different classes, with 2003 features per sample. The features are: rectangularity, aspect ratio, mean hue, eccentricity, convexity, 999 FFT magnitude features along the x-axis, and 999 FFT magnitude features along the y-axis. I am using Matlab.

First of all, I know that I need dimensionality reduction. I applied Fisher's Linear Discriminant Analysis to maximize the between-class variance and minimize the within-class variance, but I received a warning that the matrix may be ill-conditioned. I suspect this is because the within-class scatter matrix is 2003x2003 while it is estimated from only 3032 samples, so it is close to singular. So I decided to apply PCA first and then try LDA again.
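
For concreteness, here is roughly the pipeline I have in mind. This is only a sketch: X (the 3032x2003 feature matrix), labels, and k are my placeholder names, and I am assuming the built-in pca and classify functions.

    % X: 3032x2003 feature matrix, labels: 3032x1 class vector (placeholders)
    [coeff, score] = pca(X);        % pca centers X and returns the projected scores
    k = 100;                        % number of principal components to keep
    Xred = score(:, 1:k);           % reduced-dimension data
    % resubstitution check with the built-in linear classifier
    pred = classify(Xred, Xred, labels, 'linear');
    err  = mean(pred ~= labels);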

Some experienced PhD students said that the dataset should be representable with at most 10 dimensions, but I tried different numbers of principal components and ran Matlab's built-in linear classifier on each. Here are the results:

[attached: classification error rates for different numbers of principal components]

It appears that 400 PCs give the best result; 100 is also acceptable, but 10 is certainly bad. Do I really need 100 features, or should I do something extra?
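
In case it matters, here is roughly how I ran that sweep (again a sketch with my own variable names; I used a holdout split so that the PCA is fitted on the training part only, since resubstitution error tends to favor larger k):

    ks = [10 50 100 200 400];                          % PC counts to try
    cv  = cvpartition(labels, 'HoldOut', 0.3);         % stratified 70/30 split
    Xtr = X(training(cv), :);  ytr = labels(training(cv));
    Xte = X(test(cv), :);      yte = labels(test(cv));
    [coeff, score, ~, ~, ~, mu] = pca(Xtr);            % fit PCA on training data only
    for i = 1:numel(ks)
        k   = ks(i);
        Ptr = score(:, 1:k);                           % training data in PC space
        Pte = bsxfun(@minus, Xte, mu) * coeff(:, 1:k); % project test data the same way
        pred = classify(Pte, Ptr, ytr, 'linear');
        fprintf('k = %3d   error = %.3f\n', k, mean(pred ~= yte));
    end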

Later I ran the LDA code downloaded from https://www.mathworks.com/matlabcentral/fileexchange/29673-lda--linear-discriminant-analysis (explained at https://matlabdatamining.blogspot.com.tr/2010/12/linear-discriminant-analysis-lda.html). I am not sure whether this code performs dimensionality reduction, though. It gives me linear scores, and when I classify based on them, I get exactly the same result as Matlab's built-in linear classifier. Does that mean Matlab's linear classifier performs some optimal transformation as well? Would it be better if I could somehow reduce the dimensionality while transforming the data, and how can I do that?
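
To make the question concrete: what I would like is an explicit projection, something like the sketch below (my own code, not the File Exchange version; it assumes Xred and labels from above and keeps at most C-1 = 46 discriminant directions):

    classesU = unique(labels);
    mu = mean(Xred);
    d  = size(Xred, 2);
    Sw = zeros(d);  Sb = zeros(d);                    % within/between-class scatter
    for c = 1:numel(classesU)
        Xc  = Xred(labels == classesU(c), :);
        mc  = mean(Xc);
        Xc0 = bsxfun(@minus, Xc, mc);
        Sw  = Sw + Xc0' * Xc0;
        Sb  = Sb + size(Xc, 1) * (mc - mu)' * (mc - mu);
    end
    [V, D] = eig(Sb, Sw);                             % generalized eigenproblem
    [~, order] = sort(diag(D), 'descend');
    W = V(:, order(1:min(numel(classesU) - 1, d)));   % top discriminant directions
    Xlda = Xred * W;                                  % at most 46 dimensions

Is something like this the right way to combine the transformation with the dimensionality reduction?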

Finally, I wanted to try Support Vector Machines, since I have heard that "most of the time" they perform better than other classifiers. I used a function named multisvm, which simply loops over all the classes in a one-vs-rest fashion, labeling the current class 1 and all others 0, and calls Matlab's built-in svmtrain function (https://www.mathworks.com/matlabcentral/fileexchange/39352-multi-class-svm/content/multisvm.m).
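
As I understand it, the loop is essentially doing the following (my paraphrase of the idea, not the exact File Exchange code; it assumes the class labels are positive integers and uses the older built-in svmtrain/svmclassify pair):

    classesU = unique(ytr);
    models   = cell(numel(classesU), 1);
    for c = 1:numel(classesU)
        ybin = double(ytr == classesU(c));            % current class = 1, rest = 0
        models{c} = svmtrain(Xtr, ybin, 'kernel_function', 'linear');
    end
    % prediction: ask each binary model whether the sample belongs to its class
    pred = zeros(size(Xte, 1), 1);
    for c = 1:numel(classesU)
        hit = svmclassify(models{c}, Xte) == 1;
        pred(hit & pred == 0) = classesU(c);          % first model that claims a sample wins
    end
    % samples claimed by no model are left as 0 (unclassified) here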

The results were much worse than I expected. I also tried an RBF kernel with different sigma values, but the error rates were still much higher than the 10% I obtained with the linear classifier.
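
Inside the same loop, the RBF runs only change the training call, along these lines (the sigma and box-constraint values shown are just examples from the range I tried):

    models{c} = svmtrain(Xtr, ybin, 'kernel_function', 'rbf', ...
                         'rbf_sigma', 5, 'boxconstraint', 1);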

Again, do you think SVM is simply much worse in this particular case, or am I doing something wrong?

I am open to new ideas as well.

Thanks in advance.
Serkan