1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
/** * warning performs a cartesian product * @param list of NameAddress objects * @param threshold - how similar must an entry be to be considered duplicate? * @return */ public Collection findAllDuplicates(List list, int threshold) { List duplicates = new ArrayList(); // removed iteration over list, thanks to Jeremy Weiskotten NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]); boolean dupes = false; // outer loop - for each NameAddress for (int i = 0; i< entries.length; i++) { if (entries[i] == null) { continue; } // are there duplicates of entries[i]? not yet. dupes = false; // inner loop - compare entry against all subsequent ones. for (int k = i+1; k < entries.length; k++) { if (entries[k] == null) { continue; } if (this.rateSimilarity(entries[i], entries[k]) >= threshold) { dupes = true; duplicates.add(entries[k]); // remove this from further comparisons entries[k] = null; } } if (dupes == true) { duplicates.add(entries[i]); } } return duplicates; }
Refactorings
No refactoring yet !
Jeremy Weiskotten
January 17, 2008, January 17, 2008 19:42, permalink
One thing you can do easily is replace the loop to populate the "entries" array.
1
NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]);
getopenid.com/garretokelly
January 17, 2008, January 17, 2008 22:00, permalink
Thanks Jeremy... toArray() runs slightly faster than the iterator as well as being easier to read!
frznn
January 19, 2008, January 19, 2008 03:01, permalink
Maybe without the "continue" istruction the code is clearer.
I don't know what else you can do...^^
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33
public Collection findAllDuplicates(List list, int threshold) { List duplicates = new ArrayList(); // removed iteration over list, thanks to Jeremy Weiskotten NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]); boolean dupes = false; // outer loop - for each NameAddress for (int i = 0; i< entries.length; i++) { if (entries[i] != null) { for (int k = i+1; k < entries.length; k++) { if (entries[k] != null) { if (this.rateSimilarity(entries[i], entries[k]) >= threshold) { dupes = true; duplicates.add(entries[k]); // remove this from further comparisons entries[k] = null; } } } if (dupes == true) { duplicates.add(entries[i]); } } } return duplicates; }
armandino.myopenid.com
January 19, 2008, January 19, 2008 04:33, permalink
I haven't tested this properly but I think it should work.
java
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
public Collection findAllDuplicates(List<NameAddress> list, int threshold) {
Collection<NameAddress> duplicates = new HashSet<NameAddress>();
boolean[] isDuplicate = new boolean[list.size()];
for(int i = 0; i < list.size(); i++) {
for(int k = i+1; k < list.size() && !isDuplicate[i]; k++) {
if(!isDuplicate[k] && rateSimilarity(list.get(i), list.get(k)) >= threshold) {
duplicates.add(list.get(k));
isDuplicate[k] = true;
}
}
}
return duplicates;
}
Gavin
January 23, 2008, January 23, 2008 11:40, permalink
Slight improvements on previous version, again haven't tested this.
No need for isDuplicate array, also pull out the list entries into local variables to prevent doing lookup multiple times.
Note that this also removes the !isDuplicate[i] check in the second loop, as that meant you might not be checking all pairs. For example assume the similarity measure checked changed charaters, if only 1 character was different it was over the threshold (i,e. a dupe) and you had the following entries
abcde
abcce
abccc
Then abcce would be marked as a dupe (of abcde) but the next check for dupes of abcce wouldn't happen so abccc wouldn't be marked as a dupe. It is up to you what you want the behaviour to be, you can change it by just adding a !duplicates.contains(baseEntry) condition to the second loop.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
public Collection findAllDuplicates(List<NameAddress> list, int threshold) { Collection<NameAddress> duplicates = new HashSet<NameAddress>(); for(int i = 0; i < list.size(); i++) { NameAddress baseEntry = list.get(i); for(int k = i+1; k < list.size(); k++) { NameAddress checkedEntry = list.get(k); if(!duplicates.contains(checkedEntry) && rateSimilarity(baseEntry, checkedEntry) >= threshold) { duplicates.add(checkedEntry); } } } return duplicates; }
Marco Valtas
January 25, 2008, January 25, 2008 00:30, permalink
I think you can remove the nested loop doing a sort before find the duplicates, then going in one step using a Set data structure to keep the duplicates.
The gain is that the Set structure will take care of future duplicates and you avoid the double trip in the Array. This assuming you can sort your items.
sort before
1 2 3 4 5 6 7 8
java.util.Arrays.sort(cArray); for(int i = 0; i < cArray.length - 1; i++) { if(cArray[i].equals(cArray[i + 1])) { dupes.add(cArray[i]); } }
Given a list of NameAddress objects, I want to compare each one to all the others.
The outer and inner loops here look a bit clumsy, could it be done more elegantly?
Speed is important but legibility moreso.