June 04, 2004

Dirty Data Done Dirt Cheap

I have to confess to feeling a bit stupid. I have been struggling with MATLAB for weeks now, trying to get it to read in my data files so I can automate my analyses. My data is in a tab-delimited file and looks something like:

	Doc1	Doc2	Doc3	Doc4
Doc1	100	76	18	91
Doc2	76	100	22	35
Doc3	18	22	100	65
Doc4	91	34	65	100

This is not too dissimilar from the labelled diagram that is part of the MATLAB documentation on data importing. Except that, if you look at the table below it, which describes which functions to use, there is no function listed with an example similar to their labelled diagram. Early on I thought I should be able to use dlmread, which allows you to specify a starting row and column, or a range. My idea was just to use a range that excluded the troublesome non-numeric labels. No matter what I did, though, I could not get it to work. It was frustrating, because I could paste the data into the Import Wizard, which handled it fine. I wrote to people, I researched on the web, and I tried all sorts of things.

Eventually, I came full circle back to dlmread and experimented by making a small data file with unrelated data in it. That worked fine. So I then copied half of one of my data tables into the test file and tried that. That also worked fine. I copied the whole data table into the test file and used dlmread on it. It worked fine! What was the difference between the two supposedly identical data files, other than their filenames? When I uncovered the answer to that, I kicked myself. My data files were generated years ago and stored on my Mac OS 9-based laptop. My laptop and the data have since migrated to Apple's swoopy BSD-based UNIX goodness, and that's the environment that MATLAB runs under. So... Have you guessed the problem? Yes, it was line endings! The data files had the original Mac line endings (carriage returns) and MATLAB wanted UNIX line endings (linefeeds). D'oh! It just goes to reaffirm that the things you don't see can really hurt you.
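
The line-ending fix described above can be sketched directly in MATLAB. This is only a minimal illustration, assuming the file is small enough to read into memory in one go; the sample contents and the variable names (macFile, unixFile, converted) are hypothetical and not taken from the original data files.

```matlab
% Control characters used below
cr  = char(13);   % carriage return: classic Mac OS line ending
lf  = char(10);   % linefeed: UNIX line ending
tab = char(9);

% Build a tiny CR-terminated sample file (hypothetical data)
macFile  = [tempname '.txt'];
unixFile = [tempname '.txt'];
fid = fopen(macFile, 'w');
fwrite(fid, [tab 'Doc1' tab 'Doc2' cr ...
             'Doc1' tab '100' tab '76' cr ...
             'Doc2' tab '76' tab '100' cr], 'char');
fclose(fid);

% Read the whole file and normalize the line endings.
% CRLF pairs are handled first so they do not turn into two linefeeds.
txt = fileread(macFile);
txt = strrep(txt, [cr lf], lf);
txt = strrep(txt, cr, lf);

% Write the converted text back out without any further translation
fid = fopen(unixFile, 'w');
fwrite(fid, txt, 'char');
fclose(fid);

% Read the result back to confirm no carriage returns are left
converted = fileread(unixFile);
noCarriageReturns = ~any(converted == cr);

% Clean up the temporary files
delete(macFile);
delete(unixFile);
```

The converted file is the one that would then be handed to dlmread. A text editor or a command-line tool that rewrites line endings would do the same job; the point is only that the bytes at the end of each line have to change.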

Once that was solved, work proceeded apace, as I was now able to finish automating the whole comparison process from start to finish:

function [Anal1Raw, Anal2Raw, Anal1MDS, Anal2MDS, fit] = ...
      processEinCiteData(firstFile, secondFile, runName, labels)


% Read in the similarity matrices from the two data files
Anal1Raw = dlmread(firstFile, '\t', 1, 1);
Anal2Raw = dlmread(secondFile, '\t', 1, 1);

% Set up default document name labels if we didn't get any
if nargin < 4
    labels = {'g4c', 'pp1', 'pp2', 'msc', 'pl1', 'pl2', 'pl3', 'sp1', 'sp2', 'ac1', 'ac2', 'bws'};
    if nargin < 3
        runName = '';
    end
end

% Set up labels for the filenames
fileName1 = regexprep(firstFile, '\..*$', '');
fileName2 = regexprep(secondFile, '\..*$', '');

% Convert the similarity data (0-100, 100 = identical) into
% dissimilarities (0 = identical) for use in MDS
Anal1Raw = abs(100 - Anal1Raw);
Anal2Raw = abs(100 - Anal2Raw);

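% Calculate the MDS and prepare a diagram showing the
% clusterings for the first data file. Note that the
% eigenvalue plot is drawn first and is then replaced in
% figure 1 by the second plot call.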
[Anal1MDS, eigvals] = cmdscale(Anal1Raw);
figure(1);
plot(1:length(eigvals),eigvals,'bo-');
graph2d.constantline(0,'LineStyle',':','Color',[.7 .7 .7]);
axis([1,length(eigvals),min(eigvals),max(eigvals)*1.1]);
xlabel('Eigenvalue number');
ylabel('Eigenvalue');
plot(Anal1MDS(:,1),Anal1MDS(:,2),'bo', 'MarkerFaceColor', 'b', 'MarkerSize', 10);
axis(max(max(abs(Anal1MDS))) * [-1.1,1.1,-1.1,1.1]); axis('square');
text(Anal1MDS(:,1)+1.5,Anal1MDS(:,2),labels,'HorizontalAlignment','left');
hx = graph2d.constantline(0,'LineStyle','-','Color',[.7 .7 .7]);
hx = changedependvar(hx,'x');
hy = graph2d.constantline(0,'LineStyle','-','Color',[.7 .7 .7]);
hy = changedependvar(hy,'y');
title(['\fontname{lucida}\fontsize{18}' fileName1 ' MDS']);
xlabel(['\fontname{lucida}\fontsize{14}' runName ' on ' date], 'FontWeight', 'bold');

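% Calculate the MDS and prepare a diagram showing the
% clusterings for the second data file. Note that the
% eigenvalue plot is drawn first and is then replaced in
% figure 2 by the second plot call.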
[Anal2MDS, eigvals] = cmdscale(Anal2Raw);
figure(2);
plot(1:length(eigvals),eigvals,'rd-');
graph2d.constantline(0,'LineStyle',':','Color',[.7 .7 .7]);
axis([1,length(eigvals),min(eigvals),max(eigvals)*1.1]);
xlabel('Eigenvalue number');
ylabel('Eigenvalue');
plot(Anal2MDS(:,1),Anal2MDS(:,2),'rd', 'MarkerFaceColor', 'r', 'MarkerSize', 10);
axis(max(max(abs(Anal2MDS))) * [-1.1,1.1,-1.1,1.1]); axis('square');
text(Anal2MDS(:,1)+1.5,Anal2MDS(:,2),labels,'HorizontalAlignment','left');
hx = graph2d.constantline(0,'LineStyle','-','Color',[.7 .7 .7]);
hx = changedependvar(hx,'x');
hy = graph2d.constantline(0,'LineStyle','-','Color',[.7 .7 .7]);
hy = changedependvar(hy,'y');
title(['\fontname{lucida}\fontsize{18}' fileName2 ' MDS']);
xlabel(['\fontname{lucida}\fontsize{14}' runName ' on ' date], 'FontWeight', 'bold');

% Apply Procrustes to the two MDS results to map them 
% into the same vector space and prepare a plot of the 
% result
[fit, Z, transform] = procrustes(Anal1MDS, Anal2MDS);
figure(3);
plot(Anal1MDS(:,1), Anal1MDS(:,2), 'bo','MarkerFaceColor', 'b', 'MarkerSize', 10);
hold on
plot(Z(:,1), Z(:,2), 'rd', 'MarkerFaceColor', 'r', 'MarkerSize', 10);
hold off
text(Anal1MDS(:,1)+1.5,Anal1MDS(:,2), labels, 'Color', 'b');
text(Z(:,1)+1.5,Z(:,2),labels, 'Color', 'r');
xlabel(['\fontname{lucida}\fontsize{14}' runName ' on ' date], 'FontWeight', 'bold');
ylabel(['\fontname{lucida}\fontsize{14}' 'fit = ' num2str(fit, '%2.4f')], 'FontWeight', 'bold');
titleStr = ['\fontname{lucida}\fontsize{18}' fileName1 ...
        ' compared to ' fileName2];
title(titleStr, 'HorizontalAlignment', 'center', ...
    'VerticalAlignment', 'bottom');
legend({firstFile, secondFile}, 4);
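
The conversion step in the function can be checked on a small table. The sketch below is an illustration only: it uses the sample table from the top of this post, with the Doc2/Doc4 entry set to 35 in both positions (an assumption, since the sample shows 35 in one place and 34 in the other). As far as I understand it, classical MDS expects a symmetric dissimilarity matrix with zeros on the diagonal, so those are the two properties tested. The variable names are hypothetical.

```matlab
% Sample similarity table (0-100, 100 = identical).
% Assumption: Doc2/Doc4 is 35 in both positions so the table is symmetric.
S = [100  76  18  91;
      76 100  22  35;
      18  22 100  65;
      91  35  65 100];

% Same conversion as in processEinCiteData: similarity -> dissimilarity
D = abs(100 - S);

% Properties a dissimilarity matrix should have before going into MDS
isSymmetric     = isequal(D, D');
hasZeroDiagonal = all(diag(D) == 0);
```

If isSymmetric came back false for a real data file, that would be worth investigating before trusting any MDS diagram built from it.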

At the end, I had a quantitative number, the degree of fit, between two diagrams after applying the Procrustes rotation to them. Finally! On a whim, I fed in the same data table as both arguments to my comparison program. That is, I compared the same data file to itself. My hypothesis was that the resultant degree of fit should be either 0 or 1 (depending on how the fit was measured). Much to my surprise, no matter which data file I used, the result was never 0 or 1. My previous Procrustes analysis code was taken from some sample code in the MATLAB documentation and looked like:

[D,Z] = procrustes(Anal1aMDS, Anal2aMDS(:,1:2));

That last bit in () is MATLAB indexing that keeps only the first two columns of the second MDS result, which, being a novice to MATLAB, I didn't realize. So, in fact, my two diagrams weren't the same, which is why I wasn't getting a 100% degree of fit. I do not want to say how long it took me to narrow that down. Once I did, though, it looked like I was basically set, and I was able to quickly produce some comparisons between my "weird" half-baked metric and the cosine normalization one. One small step for EinKind.
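
The self-comparison check can be reproduced on made-up coordinates. This is a hedged sketch, not the original analysis: X is a hypothetical 5-by-3 configuration standing in for an MDS result, and the code assumes the Statistics Toolbox is available for procrustes. The first output of procrustes is a dissimilarity, so a configuration compared with itself should come out at (or extremely close to) 0, while comparing it with a copy that has lost a column should not.

```matlab
% Hypothetical 3-D configuration standing in for an MDS result
X = [0 0 0;
     1 0 0;
     0 2 0;
     0 0 3;
     1 1 1];

% Comparing a configuration with itself: dissimilarity should be ~0
dSame = procrustes(X, X);

% X(:,1:2) is indexing: all rows, first two columns only.
% The third dimension is thrown away, so the fit is no longer perfect.
Xtrunc = X(:, 1:2);
dTrunc = procrustes(X, Xtrunc);

fprintf('self fit: %g, truncated fit: %g\n', dSame, dTrunc);
```

Running a check like this first makes it easier to tell a genuine difference between two data sets from one introduced by the code itself.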

This is a delayed entry from May 12th, 2004.

Posted by Michelle at June 4, 2004 04:44 PM