Software Programming

Kunuk Nykjaer

Get best k items from n items as fast as possible



SortedList2 – data structure example using C#

Reference: Selection algorithm

data structure

Recently I needed something like a SortedSet or SortedDictionary that supports multiple identical values.
I explored the BCL but could not find the data structure I needed.

The scenario:
I have a dataset of size n from which I want the k best items.
This is an alternative to using a selection algorithm.
A selection algorithm is also much faster than the naive approach.

I will use Big O notation (beginner's guide).

A sorted data structure should have the following operations (Binary search tree):

Insert(item) -> O(log n)
Exists(item) -> O(log n)
Remove(item) -> O(log n)
Max -> O(1)
Min -> O(1)
Count -> O(1)

None of SortedList, SortedSet or SortedDictionary supports both identical values and the listed operations.
The C5 collections library has the TreeBag data structure, which can be used for value types.

Naive version
Sort the data and take the k best items.
The worst case is O(n * log n).
The best case is Ω(n), for a sort that can exploit an already sorted dataset.

What if we already have the best k items after the first k iterations?
Inserting the first k items takes O(k * log k).

Checking for the max item takes O(1).
For the remaining n - k iterations, checking whether there is a better item takes O((n - k) * 1) = O(n - k).

In the best case this gives: Ω(k * log k + (n - k)).

For k << n
that is Ω(n).

In the average case, for randomly distributed data where k << n, the running time is bounded by:
O(n * log k).
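
The bounds above can be written out as a short worked derivation. The constants c_1, c_2 and c_3 are introduced here only for illustration, and the cost model is an assumption about the approach described (one O(1) comparison against the current max per remaining item, plus one O(log k) remove and insert whenever the item is better), not a measurement.

```latex
\begin{align*}
T_{\text{best}}(n,k)
  &= \underbrace{c_1\, k \log k}_{\text{insert the first } k \text{ items}}
   + \underbrace{c_2\,(n-k)}_{\text{one comparison per remaining item}}
   = O(n) \quad \text{for } k \ll n, \\
T_{\text{worst}}(n,k)
  &= c_1\, k \log k
   + \underbrace{(c_2 + c_3 \log k)\,(n-k)}_{\text{every remaining item replaces the max}}
   = O(n \log k).
\end{align*}
```

The worst case corresponds to input where every new item is better than the current max, so no item can be skipped after the O(1) check.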

I will implement a data structure, SortedList2, which supports multiple identical comparable values,
and test its running time against a naive implementation.

It is built on the SortedSet and Dictionary structures.

The best item in this example is defined as: smallest even number.
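
The approach described above (hold at most k items, and compare each new item with the current max before inserting) can be sketched on plain integers. This is a minimal sketch and not the attached SortedList2: the names TopKSketch and SmallestK are hypothetical, and "best" is simplified to "smallest" instead of the smallest-even-number rule.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the attached source code.
public static class TopKSketch
{
    // Returns the k smallest values in ascending order, duplicates included.
    public static List<int> SmallestK(IEnumerable<int> items, int k)
    {
        var result = new List<int>();
        if (items == null || k <= 0) return result;

        var set = new SortedSet<int>();           // distinct values, kept sorted
        var counts = new Dictionary<int, int>();  // value -> number of copies held
        var held = 0;                             // items held, duplicates included

        foreach (var x in items)
        {
            if (held == k)
            {
                var max = set.Max;                // current worst of the best k
                if (x >= max) continue;           // not better: skip without inserting

                // Evict one copy of the current worst to make room.
                var left = counts[max] - 1;
                if (left == 0)
                {
                    counts.Remove(max);
                    set.Remove(max);
                }
                else counts[max] = left;
                held--;
            }

            // Insert x, or bump its count if the value is already held.
            int c;
            if (counts.TryGetValue(x, out c)) counts[x] = c + 1;
            else
            {
                counts[x] = 1;
                set.Add(x);
            }
            held++;
        }

        // Expand the counted values back into a flat, sorted list.
        foreach (var v in set)
            for (var i = 0; i < counts[v]; i++) result.Add(v);

        return result;
    }
}
```

For example, TopKSketch.SmallestK(new[] { 5, 2, 8, 2, 1, 9, 2 }, 4) returns 1, 2, 2, 2. Keeping a count per distinct value is one way to allow duplicates on top of a SortedSet; SortedList2 in the source code below keeps a LinkedList of objects per value instead.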

Test cases

best case input

random input

worst case input

The results show how SortedList2 performs for various k values versus the naive version.

To avoid the worst case input you can run the data through a randomizer filter, which takes O(n).
Then the running time would be similar to that of the random input (it’s implemented in the attached source code).
When k < 10% of n, SortedList2 performs better.

n = 1.000.000
k = 5

Data distribution: best case

SortedList2 Elapsed: 556 msec.
UId: 003                Comparer: 0             Name: n0
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 004                Comparer: 2             Name: n1
UId: 005                Comparer: 4             Name: n2

Naive Elapsed: 2707 msec.
UId: 003                Comparer: 0             Name: n0
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 004                Comparer: 2             Name: n1
UId: 005                Comparer: 4             Name: n2


I assume the Naive version runs fast because the data is already sorted (CPU branch prediction).
On this input the OrderBy seems to run faster than its O(n * log n) bound would suggest.


Data distribution: random

SortedList2 Elapsed: 523 msec.
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 773997             Comparer: 2             Name: n773994
UId: 142607             Comparer: 6             Name: n142604
UId: 757235             Comparer: 6             Name: n757232

Naive Elapsed: 8483 msec.
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 773997             Comparer: 2             Name: n773994
UId: 142607             Comparer: 6             Name: n142604
UId: 757235             Comparer: 6             Name: n757232


I ran this multiple times and the results were similar.
The Naive version is clearly slower here.


Data distribution: worst case

SortedList2 Elapsed: 3269 msec.
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 1000002            Comparer: 2             Name: n999999
UId: 1000001            Comparer: 4             Name: n999998
UId: 1000000            Comparer: 6             Name: n999997

Naive Elapsed: 2967 msec.
UId: 001                Comparer: 2             Name: duplicate
UId: 002                Comparer: 2             Name: duplicate
UId: 1000002            Comparer: 2             Name: n999999
UId: 1000001            Comparer: 4             Name: n999998
UId: 1000000            Comparer: 6             Name: n999997


Here the Naive version is the faster one for the worst case input.
I assume the Naive version runs fast because the data is reverse sorted.
On this input the OrderBy seems to run faster than its O(n * log n) bound would suggest.


n = 1.000.000
k = 100.000

Data distribution: best case

SortedList2 Elapsed: 1768 msec.
Naive Elapsed: 2675 msec.


Data distribution: random

SortedList2 Elapsed: 6364 msec.
Naive Elapsed: 6064 msec.


Data distribution: worst case

SortedList2 Elapsed: 16478 msec.
Naive Elapsed: 2590 msec.


Conclusion

If you want something fast for k << n, then SortedList2 (or a selection algorithm) is a better option than the naive approach.

Source code

Program.cs

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.Linq;
using System.Threading;

namespace Datastructure
{
    public class Program
    {
        static readonly Action<object> CW = Console.WriteLine;
        const int MaxSize = 5;
        const int N = 2 * 100 * 1000;

        public static void Main(string[] args)
        {
            Thread.CurrentThread.CurrentCulture = CultureInfo.GetCultureInfo("en-US");

            var stopwatch = new Stopwatch();
            stopwatch.Start();

            Run();

            stopwatch.Stop();

            CWF("\nSec: {0}\nPress a key to exit", stopwatch.Elapsed.ToString());
            Console.ReadKey();
        }

        static void CWF(string s, params object[] a)
        {
            Console.WriteLine(s, a);
        }

        static void Run()
        {
            var comparer = new ObjComparer<Comparer>();
            
            var rand = new Random();
            var datas = new List<IObj>();
            
            for (var i = 0; i < N; i++)
            {
                //datas.Add(new Obj { Comparer = new Comparer(i * 2), Name = "n" + i }); // best case
                datas.Add(new Obj { Comparer = new Comparer(rand.Next(1, N)), Name = "n" + i }); // random
                //datas.Add(new Obj { Comparer = new Comparer((N - i) * 2), Name = "n" + i }); // worst case
            }

            const bool displayList = true;

            // --- Run sortedlist2
            var sw = new Stopwatch();
            sw.Start();

            var sorted = new SortedList2(comparer);

            //sorted.AddAll(datas, MaxSize, false); // method 1
            foreach (var i in datas) sorted.Add(i, MaxSize); // method 2

            var result = sorted.GetAll();
            sw.Stop();
            CWF("SortedList2 Elapsed: {0} msec.", sw.ElapsedMilliseconds);
            if (displayList) foreach (var i in result) CW(i);            

            // --- Run naive
            sw = new Stopwatch();
            sw.Start();

            datas.Sort(ObjComparer<IObj>.DoCompare);
            result = datas.Take(MaxSize).ToList(); // method 1
            //result = datas.OrderBy(i => i.Comparer).Take(MaxSize).ToList(); // method 2
            
            sw.Stop();

            CWF("\nNaive Elapsed: {0} msec.", sw.ElapsedMilliseconds);
            if (displayList) foreach (var i in result) CW(i);


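            // --- Run selection algo
            // Note: with method 1 above, datas.Sort has already sorted datas in place,
            // so the selection algo runs on sorted input here.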
            sw = new Stopwatch();
            sw.Start();

            var s = new Selection { List = datas, K = MaxSize };
            s.Algo();
            result = s.GetAll();

            sw.Stop();

            CWF("\nSelection algo Elapsed: {0} msec.", sw.ElapsedMilliseconds);
            if (displayList) foreach (var i in result) CW(i);
        }

    }

    public class Comparer : IComparable
    {
        public Comparer(int i) { Value = i; }

        public long Value { get; set; }
        public override int GetHashCode() { return this.Value.GetHashCode(); }
        public override bool Equals(object obj)
        {
            var other = obj as Comparer;
            // Compare the values directly; equal hash codes alone do not imply equality.
            return other != null && this.Value == other.Value;
        }
        public override string ToString()
        {
            return string.Format("{0}", Value.ToString());
        }

        /// <summary>
        /// Comparison algo is implemented here
        /// 
        /// Even is best
        /// If both or none are even then smallest is best
        /// </summary>
        /// <param name="obj"></param>
        /// <returns></returns>
        public int CompareTo(object obj)
        {
            var other = obj as Comparer;
            if (other == null) return -1;

            var a = (this.Value & 1) == 0; // is even?
            var b = (other.Value & 1) == 0; // is even?

            if (a && !b) return -1; // this is even, other is not
            if (!a && b) return 1; // this is not even, other is

            return this.Value.CompareTo(other.Value);
        }
    }

    public class Obj : AObj, IObj
    {
        // Insert your custom properties here
        public string Name { get; set; }
        public override string ToString()
        {
            return string.Format("UId: {0:000} \t\tComparer: {1} \t\tName: {2}",
                Uid, Comparer, Name);
        }

        public override int GetHashCode()
        {
            return Comparer.GetHashCode();
        }

        public override bool Equals(object obj)
        {
            var other = obj as IObj;
            // Two objects are equal when their Comparer values are equal.
            return other != null && object.Equals(this.Comparer, other.Comparer);
        }
    }

    public interface IObj : IComparable
    {
        string Name { get; set; }
        Comparer Comparer { get; set; }
    }

    public abstract class AObj : IComparable
    {
        private static int _counter;
        public virtual int Uid { get; private set; }
        protected AObj() { Uid = ++_counter; }

        public Comparer Comparer { get; set; }

        public int CompareTo(object obj)
        {
            var other = obj as AObj;
            if (other == null) return -1;

            return ObjComparer<IObj>.DoCompare(this.Comparer, other.Comparer);
        }
    }

    /// <summary>
    /// Thread safe
    /// </summary>
    public class SortedList2
    {
        private readonly object _lock = new object();

        private int _count;
        private readonly Dictionary<Comparer, LinkedList<IObj>> _lookup =
            new Dictionary<Comparer, LinkedList<IObj>>();
        private readonly SortedSet<Comparer> _set;
        private readonly IComparer<Comparer> _comparer;

        public SortedList2(IComparer<Comparer> comparer)
        {
            _comparer = comparer;
            _set = new SortedSet<Comparer>(comparer);
        }

        // O(log n)
        public bool Add(IObj i)
        {
            return this.Add(i, long.MaxValue);
        }

        // O(log n)
        public bool Add(IObj i, long k)
        {
            lock (_lock)
            {
                if (i == null || k <= 0) return false;

                Comparer val = i.Comparer;

                if (_count < k) _count++;
                else
                {
                    Comparer max = _set.Max;
                    if (_comparer.Compare(val, max) >= 0) return false; // Don't add

                    // Remove old
                    this.Remove(max);
                }

                if (_set.Contains(val))
                {
                    _lookup[val].AddLast(i); // Append
                }
                else
                {
                    // Insert new
                    _set.Add(val);

                    var ps = new LinkedList<IObj>();
                    ps.AddLast(i);
                    _lookup.Add(val, ps);
                }

                return true;
            }
        }

        public void AddAll(List<IObj> objs, bool randomizeFirst = false)
        {
            AddAll(objs, int.MaxValue, randomizeFirst);
        }

        public void AddAll(List<IObj> objs, int k, bool randomizeFirst = false)
        {
            if (randomizeFirst)
            {
                var list = objs;

                #region maintain input order                
                //list = new List<IObj>();
                //list.AddRange(objs);
                #endregion 

                Randomize(list);
                foreach (var i in list) Add(i, k);
            }
            else foreach (var i in objs) Add(i, k);
        }

        // http://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle
        private static void Randomize(IList<IObj> list)
        {
            var rand = new Random();
            var n = list.Count;
            for (var i = 0; i < n - 1; i++)
            {
                // Pick j from the not yet shuffled range [i, n);
                // picking from [0, n) on every step gives a biased shuffle.
                var j = rand.Next(i, n);
                var tmp = list[i];
                list[i] = list[j];
                list[j] = tmp;
            }
        }

        // O(n)
        public List<IObj> GetAll()
        {
            lock (_lock)
            {
                var all = new List<IObj>();
                var dists = _set.ToList();
                foreach (var dist in dists) all.AddRange(_lookup[dist]);
                return all;
            }
        }

        public int Count
        {
            get
            {
                lock (_lock) return _count;
            }
        }

        // O(log n)
        public bool Remove(IObj i)
        {
            lock (_lock)
            {
                if (i == null) return false;
                var isRemoved = this.Remove(i.Comparer);
                if (isRemoved) _count--;

                return isRemoved;
            }
        }

        // O(log n)
        public bool Remove(Comparer val)
        {
            lock (_lock)
            {
                return this.RemoveHelper(val);
            }
        }

        // O(log n)
        private bool RemoveHelper(Comparer val)
        {
            if (_set.Contains(val))
            {
                var bag = _lookup[val];
                bag.RemoveLast(); // O(1)
                if (bag.Count == 0)
                {
                    _lookup.Remove(val); // O(1)
                    _set.Remove(val); // O(log n)
                }

                return true;
            }
            return false;
        }
    }

    public class ObjComparer<T> : IComparer<T> where T : IComparable
    {
        public int Compare(T a, T b)
        {
            return DoCompare(a, b);
        }
        public static int DoCompare<U>(U a, U b) where U : IComparable
        {
            return a.CompareTo(b); // ascending
            //return b.CompareTo(a); // descending
        }
    }


    // http://en.wikipedia.org/wiki/Selection_algorithm
    public class Selection
    {
        public List<IObj> List = new List<IObj>();
        public int K = 1;

        public List<IObj> GetAll()
        {
            return List.Take(K).ToList();
        }

        /*
        function select(list[1..n], k)
            for i from 1 to k
                minIndex = i
                minValue = list[i]
                for j from i+1 to n
                    if list[j] < minValue
                        minIndex = j
                        minValue = list[j]
                swap list[i] and list[minIndex]
            return list[k]
        */
        public void Algo()
        {
            var n = List.Count;
            for (int i = 0; i < K; i++)
            {
                var minIndex = i;
                var minValue = List[i];
                for (int j = i + 1; j < n; j++)
                {
                    if (List[j].CompareTo(minValue) < 0)
                    {
                        minIndex = j;
                        minValue = List[j];
                    }
                }
                Swap(i, minIndex);
            }
        }

        void Swap(int i, int j)
        {
            var tmp = List[i];
            List[i] = List[j];
            List[j] = tmp;
        }
    }
}

Written by kunuk Nykjaer

February 23, 2013 at 2:18 pm

Posted in Algorithm, Csharp

