Topic: How to make the computation time of a fuzzy string comparison algo much FASTER?
Replies: 0

 Rossella Posts: 17 Registered: 11/28/11
How to make the computation time of a fuzzy string comparison algo much FASTER?
Posted: Nov 4, 2013 10:37 AM

Hi,

I need to make the following code much faster. It computes a string similarity score: given an input string, the code scores it against a list of about 40000 strings, which currently takes about 40 seconds. I need it to take a few seconds at most.

The similarity score combines two measures: the Damerau-Levenshtein distance (which takes roughly 1/3 of the computation time) and the Dice coefficient (which takes roughly 2/3).
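To make the two measures concrete, here are hand-worked values for two toy pairs (standard textbook examples, not output from the code below):

```matlab
% Damerau-Levenshtein: 'abcd' -> 'acbd' is a single adjacent transposition,
% so the distance is 1 and the normalized similarity is 1 - 1/4 = 0.75
%
% Dice on bigrams: 'night' -> {'ni','ig','gh','ht'}, 'nacht' -> {'na','ac','ch','ht'};
% they share 1 bigram out of 4 each, so the coefficient is 2*1/(4+4) = 0.25
```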

Any idea?
Thanks in advance for any help.

Rossella

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% MAIN FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% MySTRING = string that the user wants to find possible matches for
% TargetList = cell array of 40000 strings (one per row)
[DamLev, DiceCoeff] = ComputeStringDistance(MySTRING,TargetList);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% ComputeStringDistance FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function [NormalizedDistance, DiceCoefficient] = ComputeStringDistance(key, TargetList)
% ComputeStringDistance calculates the normalized Damerau-Levenshtein
% distance and the Dice coefficient between key and every string in TargetList

Nk = numel(key);
NormalizedDistance = cellfun(@(x) ComputeLevDamDistance(x, key, Nk), TargetList);

words1 = GetSingleWords(key);
% for each word, find all bigrams
bigrams1 = cellfun(@GetBigrams, words1, 'UniformOutput', 0);
% concatenate them into one cell array of bigrams
big_key = [bigrams1{:}];
N_big_key = numel(big_key);

DiceCoefficient = cellfun(@(x) FastDiceCoefficient(x, big_key, N_big_key),TargetList);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% ComputeLevDamDistance FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function NormalizedDistance = ComputeLevDamDistance(Target, key, Nk)

Nt = numel(Target);

% compute the Damerau-Levenshtein distance
d = zeros(Nt+1, Nk+1);
% initialize distance matrix: rows index Target, columns index key
d(1,:) = 0:Nk;
d(:,1) = 0:Nt;
for i = 2:Nt+1
    for j = 2:Nk+1
        if key(j-1) == Target(i-1)
            cost = 0;
        else
            cost = 1;
        end

        d(i,j) = min([ ...
            d(i, j-1) + 1, ...     % deletion
            d(i-1, j) + 1, ...     % insertion
            d(i-1, j-1) + cost ... % substitution
            ]);

        if i > 2 && j > 2 && Target(i-1) == key(j-2) && Target(i-2) == key(j-1)
            % transposition of two adjacent characters
            d(i,j) = min(d(i,j), d(i-2, j-2) + 1);
        end
    end
end

% normalize the distance to a similarity in [0, 1]
NormalizedDistance = 1 - d(Nt+1, Nk+1) / max(Nt, Nk);
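As a hand-worked sanity check for this function (no transpositions are involved in this pair, so the result is the plain Levenshtein count):

```matlab
% 'kitten' vs 'sitting': 3 edits (2 substitutions + 1 insertion),
% so the normalized similarity is 1 - 3/7, about 0.571
score = ComputeLevDamDistance('sitting', 'kitten', numel('kitten'));
```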

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% GetSingleWords FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function words = GetSingleWords(s)

% split s into words at each whitespace character
space_s = [0 regexp(s, '\s') length(s)+1];
words = cell(1, length(space_s) - 1);
for w = 1:length(space_s)-1
    words{w} = s(space_s(w)+1:space_s(w+1)-1);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% GetBigrams FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function bigram = GetBigrams(s)

% return the overlapping character bigrams of s;
% a single-character word is returned as-is
if numel(s) < 2
    bigram = {s};
    return
end
bigram = cell(1, numel(s)-1);
for i = 1:length(s)-1
    bigram{i} = s(i:i+1);
end
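Two hand-checked calls showing both the normal case and the single-character fallback:

```matlab
% GetBigrams('abc') returns {'ab', 'bc'}
% GetBigrams('a')   returns {'a'}      (single-char word kept whole)
```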

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% FastDiceCoefficient FUNCTION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

function DiceCoefficient = FastDiceCoefficient(Target, big_key, N_big_key)

words2 = GetSingleWords(Target);

% for each word, find all bigrams
bigrams2 = cellfun(@GetBigrams, words2, 'UniformOutput', 0);

% concatenate them into one cell array of bigrams
big2 = [bigrams2{:}];

% Dice coefficient: 2*|intersection| / (|A| + |B|); counting the matches
% in both directions supplies the factor of two
DiceCoefficient = sum([ismember(big2, big_key), ismember(big_key, big2)]) / (N_big_key + numel(big2));
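A hand-checked call of this function (single-word strings, so the word split is trivial):

```matlab
big_key = GetBigrams('hello');               % {'he','el','ll','lo'}
d = FastDiceCoefficient('help', big_key, 4); % 4/7, about 0.571
```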