Search All of the Math Forum:
Views expressed in these public forums are not endorsed by
NCTM or The Math Forum.



How to make the computation time of a fuzzy string comparison algo much FASTER?
Posted:
Nov 4, 2013 10:37 AM


Hi,
I need to make the following code much faster. It basically computes a string similarity score. The problem is that, given a string as input, I need to compute the similarity score between the input string and a list of about 40000 strings, which takes about 40 seconds. I need it instead to take max a few seconds.
The similarity score is computed by combining two different measures: one is the Damerau?Levenshtein distance (which takes approximately 1/3 of the computation time) and the Dice Coefficient (which takes about 2/3 of the computation time).
Any idea? Thanks in advance for any help.
Rossella
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% MAIN FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% MySTRING = string that the user wants to find the possible matches for % TargetList = cellarray with 40000 rows with a list of strings [DamLev, DiceCoeff] = ComputeStringDistance(MySTRING,TargetList);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% ComputeStringDistance FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function [NormalizedDistance, DiceCoefficient] = ComputeStringDistance(MySTRING, TargetList) % ComputeStringDistance calculates the DamerauLevenshtein distance and the DICE COEFFICIENT
Nk = numel(key); NormalizedDistance = cellfun(@(x) ComputeLevDamDistance(x, key, Nk), TargetList);
words1 = GetSingleWords(key); % for each word, find all bigrams bigrams1 = cellfun(@GetBigrams, words1, 'UniformOutput',0); % put them together on one string big_key = []; for b = 1:numel(bigrams1) big_key = [big_key, bigrams1{b}]; end N_big_key = numel(big_key);
DiceCoefficient = cellfun(@(x) FastDiceCoefficient(x, big_key, N_big_key),TargetList);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% ComputeLevDamDistance FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function NormalizedDistance = ComputeLevDamDistance(Target, key, Nk)
Nt = numel(Target);
% compute demerau lev. disdtance d = zeros(Nt+1,Nk+1); % initialize distance matrix d(1,:) = 0:Nk; d(:,1) = 0:Nt; for i=2:Nt+1 for j=2:Nk+1 if key(j1) == Target(i1) cost = 0; else cost = 1; end
d(i,j)=min([ ... d(i, j1) + 1, ... % deletion d(i1, j) + 1, ... % insertion d(i1,j1) + cost ... % substitution ]);
if (i1)>1 && (j1)>1 && i1 <= Nk && j1 <= Nt && key((i1))==Target((j1)1) && key((i1)1) == Target(j1) % transposition d(i,j) = min([d(i,j), d(i1,j1)]); end end end
% normalize the distance NormalizedDistance = 1  (d(Nt+1,Nk+1)/max(Nt, Nk));
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% GetSingleWords FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function words = GetSingleWords(s)
space_s = [0 regexp(s,'\s') length(s)+1]; words = cell(1,length(space_s)  1); for w = 1:length(space_s)1 words{w} = s(space_s(w)+1:space_s(w+1)1); end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% GetBigrams FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function bigram = GetBigrams(s)
bigram{1} = s(1:end); for i = 1:length(s)1 bigram{i} = s(i:i+1); end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% FastDiceCoefficient FUNCTION %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function DiceCoefficient = FastDiceCoefficient( Target, big_key, N_big_key)
words2 = GetSingleWords(Target);
% for each word, find all bigrams bigrams2 = cellfun(@GetBigrams, words2, 'UniformOutput',0);
% put them together on one cell big2 = []; for b = 1:numel(bigrams2) big2 = [big2, bigrams2{b}]; end
DiceCoefficient = sum([ismember(big2,big_key), ismember(big_key,big2)])/(N_big_key + numel(big2));



