Software Programming

Kunuk Nykjaer

Running time analysis – hashset used for running time optimization – C# example

leave a comment »


Imagine you want to find the list of distinct numbers of all possible duplicate numbers
from two lists with unknown length and unknown content.
You want to do it as fastest as possible.

e.g.
List a = [5,2,2,6,4,3,3]
List b = [2,2,2,1,3,4,3,5]
You should find [2,3,4,5]

How would you do it?

Do you use the usual nested for loops iteration?
If you don’t mind using some extra memory you could use the hashset trick.
See the code below for the examples.

If you worry about the memory you could sort the lists
and iterate them which takes in total O(nlogn) + O(mlogm). That is not covered here.

The hashset technique takes O(n+m) which is faster.

If you don’t know that a hashset is here are some wiki readings:
http://en.wikipedia.org/wiki/Hash_table
http://en.wikipedia.org/wiki/Set_(abstract_data_type)

The code shows two styles: one that runs O(n*m) and one that runs O(n+m) with different implementations.
I will use C# Linq also to show you how it can perform bad when you don’t use it correct (ver2 vs. ver4 in the code)

Source code and demo.
Program.cs

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

/// <summary>
/// Author: Kunuk Nykjaer
/// </summary>
public class Program
{
    const int Lengh = 1000;
    const int Range = 10000;
    static readonly Random Random = new Random();
    static readonly List<int> List1 = new List<int>();
    static readonly List<int> List2 = new List<int>();
    
    public static void Main(string[] args)
    {
        for (int i = 0; i < Lengh; i++)
            List1.Add(Random.Next(Range));
        for (int i = 0; i < Lengh; i++)
            List2.Add(Random.Next(Range));

        Console.WriteLine("n and m has length: {0}", Lengh);

        RunningTime1();
        RunningTime2();
        RunningTime3();
        RunningTime4();
        RunningTime5();

        Console.WriteLine("press.."); Console.ReadKey();
    }
      
    private static void RunningTime1()
    {
        Console.WriteLine("1.  O(n*m)  Classic nested for loops");
        var sw = new Stopwatch();
        sw.Start(); //O(1)
        var equals = new HashSet<int>();
        foreach (var i in List1) //O(n)
            foreach (var j in List2) //O(m)
                if (i == j) //O(1)
                    equals.Add(i); //O(1)

        sw.Stop();

        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime2()
    {
        Console.WriteLine("2.  O(n*m)  Linq version quck style");
        var sw = new Stopwatch();
        sw.Start();

        // from... +    Distinct 
        //O(n*m)   +   O((n+m)/2)    )
        var equals = (from i in List1 from j in List2 where i == j select i).
            Distinct().ToList();

        sw.Stop();

        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime3()
    {
        Console.WriteLine("3.  O(n+m)  Linq");        
        var sw = new Stopwatch();
        sw.Start();

        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)

        foreach (var i in List1) //O(n)
            set.Add(i); //O(1)

        foreach (int i in List2.Where(set.Contains)) //O(m)
            equals.Add(i); //O(1)

        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime4()
    {
        Console.WriteLine("4.  O(n+m) foreach loop");
        
        var sw = new Stopwatch();
        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)
        sw.Start();
        foreach (var i in List1) //O(n)
            set.Add(i);
        foreach (int i in List2) //O(m)
            if (set.Contains(i)) //O(1)
                equals.Add(i); //O(1)

        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime5()
    {
        Console.WriteLine("5.  O(n+m) for loop");
        
        var sw = new Stopwatch();
        sw.Start();

        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)

        int count1 = List1.Count; //O(1)
        for (int i = 0; i < count1; i++) //O(n)
            set.Add(List1[i]); //O(1)

        int count2 = List2.Count; //O(1)
        for (int i = 0; i < count2; i++) //O(m)
        {
            int j = List2[i]; //O(1)
            if (set.Contains(j)) //O(1)
                equals.Add(j); //O(1)
        }
        
        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
}

O(n*m) and O(n+m)

n and m has length: 10000
1.  O(n*m)  Classic nested for loops
Elapsed msec: 1582, equals len: 900
2.  O(n*m)  Linq version quck style
Elapsed msec: 8446, equals len: 900
3.  O(n+m)  Linq
Elapsed msec: 8, equals len: 900
4.  O(n+m) foreach loop
Elapsed msec: 1, equals len: 900
5.  O(n+m) for loop
Elapsed msec: 1, equals len: 900

Ver2 is a bad implementation using Linq. Linq can cost you running time if you use it poorly.

O(n*m) and O(n+m)

n and m has length: 20000
1.  O(n*m)  Classic nested for loops
Elapsed msec: 6285, equals len: 3234
2.  O(n*m)  Linq version quck style
Elapsed msec: 34932, equals len: 3234
3.  O(n+m)  Linq
Elapsed msec: 10, equals len: 3234
4.  O(n+m) foreach loop
Elapsed msec: 4, equals len: 3234
5.  O(n+m) for loop
Elapsed msec: 3, equals len: 3234

The poor version takes 30 sec. and the hashset trick are maximum 10 msec.

Here is a test with larger list, the O(n*m) versions were to slow to include.
O(n+m) with large list

n and m has length: 1000000
3.  O(n+m)  Linq
Elapsed msec: 505, equals len: 90594
4.  O(n+m) foreach loop
Elapsed msec: 537, equals len: 90594
5.  O(n+m) for loop
Elapsed msec: 499, equals len: 90594

Even for large list, the time is below 1 sec.
For nested loops it would have taken 12 days!
(source: Algorithm Design book by Jon Kleinberg and Éva Tardos, chapter 2, O(n*n) where n=1.000.000)

The lesson to take is:
When you are about to code something using nested for loops, first see if the list is large, if yes consider using this trick if fast running time is important.

Written by kunuk Nykjaer

March 4, 2012 at 12:52 am

Leave a comment