Pages

Thursday, June 2, 2011

ContainsAny method from apache commons

There was and there is now such method in Apache Commons Collections to define is there any single intersection between two collections. In the version 2.* it was:


  public static <E> boolean containsAny1(Collection<E> collection, Collection<E> elements) {
    for (E e : elements) {
      if (collection.contains(e)) {
        return true;
      }
    }
    return false;
  }
I have added "1" to the name to distinguish.

And in version 3.* it was optimized and now looks like this:

  public static <E> boolean containsAny2(final Collection<E> coll1, final Collection<E> coll2) {
    if (coll1.size() < coll2.size()) {
      for (Iterator<E> it = coll1.iterator(); it.hasNext();) {
        if (coll2.contains(it.next())) {
          return true;
        }
      }
    } else {
      for (Iterator<E> it = coll2.iterator(); it.hasNext();) {
        if (coll1.contains(it.next())) {
          return true;
        }
      }
    }
    return false;
  }
Here is a benchmark, of whether this optimization was worth doing (omitting methods themselves):


  private static final int ITERATIONS = 100000;
  private static final int MAX = 100;
  private static final int MULTIPLIER = 100;

  public static void main(String[] args) throws InstantiationException, IllegalAccessException {
    doWithCollections(ArrayList.class, ArrayList.class);
    doWithCollections(HashSet.class, HashSet.class);
    doWithCollections(HashSet.class, ArrayList.class);
    doWithCollections(HashSet.class, LinkedList.class);
    doWithCollections(LinkedList.class, LinkedList.class);
    doWithCollections(LinkedList.class, HashSet.class);
  }

  @SuppressWarnings("unchecked")
  public static void doWithCollections(Class<?> colType1, Class<?> colType2)
      throws InstantiationException, IllegalAccessException {
    Random r = new Random();
    Collection<Integer> test1, test2;
    long time1 = 0, time2 = 0, start;
    for (int i = 0; i < ITERATIONS; i++) {
      int size1 = r.nextInt(MAX) * MULTIPLIER;
      test1 = (Collection<Integer>) colType1.newInstance();
      for (int j = 0; j < size1; j++) {
        test1.add(r.nextInt(MAX * MULTIPLIER / 2));
      }

      int size2 = r.nextInt(MAX) * MULTIPLIER;

      test2 = (Collection<Integer>) colType2.newInstance();
      for (int j = 0; j < size2; j++) {
        test2.add(r.nextInt(MAX * MULTIPLIER / 2));
      }

      if (r.nextBoolean()) {
        start = System.nanoTime();
        containsAny1(test1, test2);
        time1 += System.nanoTime() - start;

        start = System.nanoTime();
        containsAny2(test1, test2);
        time2 += System.nanoTime() - start;
      } else {
        start = System.nanoTime();
        containsAny2(test1, test2);
        time2 += System.nanoTime() - start;

        start = System.nanoTime();
        containsAny1(test1, test2);
        time1 += System.nanoTime() - start;
      }
    }

    System.out.println(time1 / 1000000 + " milliseconds for first version with " + colType1.getSimpleName()
        + " and " + colType2.getSimpleName());

    System.out.println(time2 / 1000000 + " milliseconds for second version with " + colType1.getSimpleName()
        + " and " + colType2.getSimpleName());
  }
And here are the results:

764 milliseconds for first version with ArrayList and ArrayList
727 milliseconds for second version with ArrayList and ArrayList

As it was expected there is a minor optimization. I tried to make a test where there is no strong advantage in length of the Collections between one and another method. Of cause if we always supply very big ArrayList as the first and very small as the second  with no intersections the performance gain from the "optimized" version of method will be significant. The following code:

  public static void main(String[] args) throws InstantiationException, IllegalAccessException {
    List<Integer> first = new ArrayList<Integer>(), second = new ArrayList<Integer>();
    for (int i = 0; i < 1000000; i++) {
      first.add(i);
    }
    for (int i = 10000000; i < 1000010; i++) {
      second.add(i);
    }

    long start = System.nanoTime();
    containsAny1(first, second);
    System.out.println(System.nanoTime() - start);

    start = System.nanoTime();
    containsAny2(first, second);
    System.out.println(System.nanoTime() - start);
  }
Will give us:

213048
8011

So we have two orders of performance increase, but my goal is to test with equal  and uniform conditions regarding size, but for different types of the collections: lists, sets and their combinations. So forget about the result above ;-)

Next result from the original benchmark is the following:

186 milliseconds for first version with HashSet and HashSet
63 milliseconds for second version with HashSet and HashSet

Almost three times performance increase! HashSets are in favor... Let's look what with combinations:

204 milliseconds for first version with HashSet and ArrayList
531 milliseconds for second version with HashSet and ArrayList

Stop.. more than two times performance loss... this optimization is not as good as it seemed from the very beginning. Lets combine HashSet with the kind of list that is very fast to iterate through, I mean LinkedList:

186 milliseconds for first version with HashSet and LinkedList
1346 milliseconds for second version with HashSet and LinkedList

What? One order of performance loss? And what with two LinkedLists?

1791 milliseconds for first version with LinkedList and LinkedList
1800 milliseconds for second version with LinkedList and LinkedList

No significant difference. Lets be honest and combine in the opposite order the HashSet with LinkedList:

1797 milliseconds for first version with LinkedList and HashSet
1302 milliseconds for second version with LinkedList and HashSet

The performance increase is significant, but not as great as what we have already seen. So maybe the good decision is to provide overloaded methods for sets (contains operation is almost always fast on them) and lists (iterating over them is faster than checking whether they contain element - it requires full iteration in most of the cases):

  public static <E> boolean containsAny(Set<E> collection, List<E> elements) {
    for (E e : elements) {
      if (collection.contains(e)) {
        return true;
      }
    }
    return false;
  }

  public static <E> boolean containsAny(List<E> collection, Set<E> elements) {
    for (E c : collection) {
      if (elements.contains(c)) {
        return true;
      }
    }
    return false;
  }

No comments:

Post a Comment