Avatar

Given a list of NameAddress objects, I want to compare each one to all the others.

The outer and inner loops here look a bit clumsy, could it be done more elegantly?
Speed is important but legibility moreso.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
    /**
     * warning performs a cartesian product
     * @param list of NameAddress objects
     * @param threshold - how similar must an entry be to be considered duplicate?
     * @return
     */
    public Collection findAllDuplicates(List list, int threshold) {
    	List duplicates = new ArrayList();

        // removed iteration over list, thanks to Jeremy Weiskotten
        NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]);

    	
    	boolean dupes = false;
    	
    	// outer loop - for each NameAddress
    	for (int i = 0; i< entries.length; i++) {
    		
    		if (entries[i] == null) {
    			continue;
    		}
    		// are there duplicates of entries[i]? not yet.
    		dupes = false; 
    		
    		// inner loop - compare entry against all subsequent ones.
    		for (int k = i+1; k < entries.length; k++) {
    			if (entries[k] == null) {
    				continue;
    			}
    			
    			if (this.rateSimilarity(entries[i], entries[k]) >= threshold) {
    				dupes = true;
    				duplicates.add(entries[k]);
    				// remove this from further comparisons
    				entries[k] = null;
    			}
    		}
    		if (dupes == true) {
    			duplicates.add(entries[i]);
    		}
    	}
    	
    	return duplicates;
    }

Refactorings

No refactoring yet !

5170ca260dbd2cdfd5a887a4dba7636f

Jeremy Weiskotten

January 17, 2008, January 17, 2008 19:42, permalink

1 rating. Login to rate!

One thing you can do easily is replace the loop to populate the "entries" array.

1
NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]);
Avatar

getopenid.com/garretokelly

January 17, 2008, January 17, 2008 22:00, permalink

No rating. Login to rate!

Thanks Jeremy... toArray() runs slightly faster than the iterator as well as being easier to read!

Avatar

frznn

January 19, 2008, January 19, 2008 03:01, permalink

No rating. Login to rate!

Maybe without the "continue" istruction the code is clearer.
I don't know what else you can do...^^

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
    public Collection findAllDuplicates(List list, int threshold) {
    	List duplicates = new ArrayList();

        // removed iteration over list, thanks to Jeremy Weiskotten
        NameAddress[] entries = (NameAddress[]) list.toArray(new NameAddress[list.size()]);

    	
    	boolean dupes = false;
    	
    	// outer loop - for each NameAddress
    	for (int i = 0; i< entries.length; i++) {
    	    if (entries[i] != null) {
    			
    		for (int k = i+1; k < entries.length; k++) {
    		    if (entries[k] != null) {
    			
    			if (this.rateSimilarity(entries[i], entries[k]) >= threshold) {
    				dupes = true;
    				duplicates.add(entries[k]);
    				// remove this from further comparisons
    				entries[k] = null;
    			}
                    }

    		}
    		if (dupes == true) {
    			duplicates.add(entries[i]);
    		}
            }
    	}
    	
    	return duplicates;
    }
2155c92be66863c4634778bf522efe14

armandino.myopenid.com

January 19, 2008, January 19, 2008 04:33, permalink

No rating. Login to rate!

I haven't tested this properly but I think it should work.

java

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
    public Collection findAllDuplicates(List<NameAddress> list, int threshold) {
        Collection<NameAddress> duplicates = new HashSet<NameAddress>();
        boolean[] isDuplicate = new boolean[list.size()];
        
        for(int i = 0; i < list.size(); i++) {
            for(int k = i+1; k < list.size() && !isDuplicate[i]; k++) {
                if(!isDuplicate[k] && rateSimilarity(list.get(i), list.get(k)) >= threshold) {
                    duplicates.add(list.get(k));
                    isDuplicate[k] = true;
                }
            }
        }
        
        return duplicates;
    }
Avatar

Gavin

January 23, 2008, January 23, 2008 11:40, permalink

No rating. Login to rate!

Slight improvements on previous version, again haven't tested this.

No need for isDuplicate array, also pull out the list entries into local variables to prevent doing lookup multiple times.

Note that this also removes the !isDuplicate[i] check in the second loop, as that meant you might not be checking all pairs. For example assume the similarity measure checked changed charaters, if only 1 character was different it was over the threshold (i,e. a dupe) and you had the following entries

abcde
abcce
abccc

Then abcce would be marked as a dupe (of abcde) but the next check for dupes of abcce wouldn't happen so abccc wouldn't be marked as a dupe. It is up to you what you want the behaviour to be, you can change it by just adding a !duplicates.contains(baseEntry) condition to the second loop.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
    public Collection findAllDuplicates(List<NameAddress> list, int threshold) {
        Collection<NameAddress> duplicates = new HashSet<NameAddress>();
        
        for(int i = 0; i < list.size(); i++) {
            NameAddress baseEntry = list.get(i);
            for(int k = i+1; k < list.size(); k++) {
                NameAddress checkedEntry = list.get(k);
                if(!duplicates.contains(checkedEntry) && rateSimilarity(baseEntry, checkedEntry) >= threshold) {
                    duplicates.add(checkedEntry);
                }
            }
        }
        
        return duplicates;
    }
B8d457d2c39911ea4c74ba7d66b9c3f7

Marco Valtas

January 25, 2008, January 25, 2008 00:30, permalink

No rating. Login to rate!

I think you can remove the nested loop doing a sort before find the duplicates, then going in one step using a Set data structure to keep the duplicates.

The gain is that the Set structure will take care of future duplicates and you avoid the double trip in the Array. This assuming you can sort your items.

sort before

1
2
3
4
5
6
7
8
java.util.Arrays.sort(cArray);

for(int i = 0; i < cArray.length - 1; i++) {
	if(cArray[i].equals(cArray[i + 1])) {
		dupes.add(cArray[i]);
	}
}

Your refactoring





Format Copy from initial code

or Cancel