Software Programming

Kunuk Nykjaer

Archive for the ‘Algorithm’ Category

Interview-question: Find two largest elements in array of size n – C# example

leave a comment »

Design an algorithm to find the two largest elements in a sequence of n numbers.
Number of comparisons need to be O(n) + O(log n)

Answer:
You use a tournament sort.
Here’s an example in C#

Program.cs

using System;
using System.Collections.Generic;
using System.Linq;

/// <summary>
/// Author: Kunuk Nykjaer
/// Tournament Sort example
/// http://en.wikipedia.org/wiki/Tournament_sort
/// </summary>
class Program
{
    class P
    {
        public int Strength { get; set; }
        public List<int> Opponents { get; private set; }
        public P()
        {
            Opponents = new List<int>();
        }
    }

    // Return winner of the match and remember opponent
    // O(1)
    static P Play(P p1, P p2)
    {
        if (p1.Strength > p2.Strength)
        {
            p1.Opponents.Add(p2.Strength);
            return p1;
        }
        p2.Opponents.Add(p1.Strength);
        return p2;
    }

    // Runtime O(n) + O(logn)
    static void Main(string[] args)
    {
        // Test data setup, even number list
        var playerStrengths = new[] { 1, 2, 5, 4, 9, 7, 8, 7, 5, 4, 1, 0, 1, 4, 2, 3 };
        var players = playerStrengths.Select(i => new P { Strength = i }).ToList();

        // O(n)
        while (players.Count > 1)
        {
            var nextRound = new List<P>();
            for (int i = 0; i < players.Count - 1; i += 2)
            {
                P winner = Play(players[i], players[i + 1]);
                nextRound.Add(winner); // add winner of the match
            }
            players = nextRound;
        }

        // O(1)
        var tournamentWinner = players.First();
        int first = tournamentWinner.Strength;
        int second = -1;

        // O(logn)            
        foreach (int i in tournamentWinner.Opponents)
            if (i > second)// update                
                second = i;

        Console.WriteLine("First: {0}, Second: {1}", first, second);
        Console.WriteLine("press key..."); Console.ReadKey();
    }
}

Output is:
First: 9, Second: 8

Reference:
http://stackoverflow.com/questions/10591112/interview-algorithm-find-two-largest-elements-in-array-of-size-n

http://en.wikipedia.org/wiki/Tournament_sort

http://stackoverflow.com/questions/3628718/find-the-2nd-largest-element-in-an-array-with-minimum-of-comparisom

Written by kunuk Nykjaer

May 25, 2012 at 3:42 pm

Posted in Algorithm, Csharp

Tagged with

Interview-question: How to find all permutations of a given word in a given text? – Java example

leave a comment »

Write a function (in Java) to find all permutations of a given word that appear in a given text. For example, for word abc and text abcxyaxbcayxycab the function should return abc, bca, cab.

Answer:
To find a permutation of a string you can use number theory.

There is a method where you can calculate a hash of a string using prime numbers. Every permutation of the same string will give the same hash value. All other string combination which is not a permutation will give some other hash value.

The hash-value is calculated by c1 * p1 + c2 * p2 + … + cn * pn
where ci is a unique value for the current char in the string and where pi is a unique prime number value for the ci char.

Here is the implementation.
Main.java

// Author: Kunuk Nykjaer
public class Main {
	final static int[] primes = new int[] { 2, 3, 5, 7, 11, 13, 17, 
		19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 
		73, 79, 83, 89, 97, 101, 103, 107, 109, 113 };
	
	public static void main(String[] args) {		
		final char[] text = "abcxaaabbbccyaxbcayaaaxycab"
			.toCharArray();			
		int hash = val(new char[]{'a','b','c'});					
		for (int i = 0; i < text.length - 2; i++) {
			char[] _123 = new char[]{text[i],text[i+1],text[i+2]};			
			if(val(_123)==hash){
				System.out.println(new String(_123) );		
			}
		}
	}	
	static int p(char c) {
		return primes[(int)c - (int)'a'];
	}	
	static int val(char[] cs) {
		return 
		p(cs[0])*(int)cs[0] + p(cs[1])*(int)cs[1] + p(cs[2])*(int)cs[2];
	}
}

The output is: abc bca cab

Reference:
http://stackoverflow.com/questions/10727118/how-to-find-all-permutations-of-a-given-word-in-a-given-text

Written by kunuk Nykjaer

May 25, 2012 at 1:20 pm

Posted in Algorithm, Java

Tagged with

Interview-question: Getting top 100 URL from a large log file – C# example

leave a comment »

We have a fairly large log file, about 5GB. Each line of the log file contains an url which a user has visited on our site. We want to figure out what’s the most popular 100 urls visited by our users. How to do it?

Answer:
Because the log file is fairly large you should read the log-file using a stream-reader.
Don’t read it all in the memory.
I would expect it is feasible to have the number of possible distinct links in the memory while we work on the log-file since it is links within a website.

// Pseudo
Hashmap map<url,count>
while(log file has nextline){
    url = nextline in logfile
    add url to map and update count
}

List list 
foreach(m in map){
    add m to list         
}

sort the list by count value
take top n from the list

The runtime is O(n) + O(m*log(m)) where n is the size of the log-file in lines and where the m is number of distinct found links.

Here’s a C# implementation of the pseudo-code. An actual file-reader and a log-file is not provided.
A simple emulation of reading a log-file using a list in the memory is provided instead.

The algorithm uses a hashmap to store the found links. A sorting algorithm founds the top 100 links afterward. A simple data container data-structure is used for the sorting algorithm.

The memory complexity is dependent on expected distinct links. The hashmap must be able to contain the found distinct links, else this algorithm won’t work.

Program.cs

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

// Author: Kunuk Nykjaer
public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
           stopwatch.ElapsedMilliseconds));
    }
}

Reference:
http://stackoverflow.com/questions/10733983/getting-top-100-url-from-a-log-file

Written by kunuk Nykjaer

May 25, 2012 at 1:01 pm

Posted in Algorithm, Csharp

Tagged with

How does the Amazon recommendation system work? – Analyze the algorithm and make a prototype that visualizes the algorithm

leave a comment »

I think the recommendation systems are interesting.

I decided that I wanted to learn how the Amazon recommendation works in theory and afterwards implement a demo-site that visualized the algorithm. My implementation is not identical as the Amazon’s version but it follows the same principle. My main focus is to illustrate and visualize the concept. I used a item-to-item matrix table for simplicity.

Reference reading materials:
(1) http://www.cs.umd.edu/~samir/498/Amazon-Recommendations.pdf
(2) Amazon Recommendation Patent
(3) http://stackoverflow.com/questions/2323768/how-does-the-amazon-recommendation-feature-work
(4) http://maya.cs.depaul.edu/mobasher/papers/ewmf04-web/node1.html
(5) http://blog.echen.me/2011/02/15/an-overview-of-item-to-item-collaborative-filtering-with-amazons-recommendation-system/

I recommend reading (1). This is only 5 pages long and easy to read. Amazon’s algorithm is based on item-to-item filtering. In short: Amazon developed their own recommendation system based on items rather than users, because there are fewer items than users (scalability). The list of visited items by any user is stored in a item-to-item matrix table. The recommendation algorithm are calculated by using the Cosine Similarity function on the vectors from the matrix table.

I recommend reading (2) for technical insights and design overview of how Amazon has implemented it.

For the demo-site I wrote a short abstract. The demo-site can be seen here.
http://jory.dk/AreaRecommendation/.
The demo project can be downloaded here

Abstract: 
How does the Amazon recommendation works? 

This is about visualizing the item to item collaborations filtering mechanism using a item-to-item matrix table.
The item-to-item matrix, the vectors and the calculated data values are displayed.

There are n different items and the item recommendation can display up to m items.

There are implemented different item-to-item neighborhood functions. 
A simple max count of seen neighbor items, the Cosine Similarity and the Jaccard Index.

A tracker keeps track of visited items for any user and is saved to a matrix table.
To make it simple only the relation between previous and current viewed item are tracked in this example.

Design

The demo-site has two pages: Home and item page. The item page shows specific information regarding the viewed item and the recommendation is displayed here.

There are 5 different components

  • Multiple view – shows multiple components, all the items are displayed here
  • User view – shows specific information about the current user in the session
  • Item view – shows detailed information about the current item
  • Recommendation view – shows recommended items based on the current item
  • Data view – visualizes the data structure used by the recommendation algorithm

Interactions


This shows how the tracker collects the data from the users in to the matrix table. (The illustrated tracking method is a simplified version. You could also iterate the viewed items when a user view a new item to save all the item-to-item relation for the viewed items).

When a new visitor user3 sees the item A,
the recommendation system founds out the closest match are the Items B and C.

The general idea of the Amazon recommendation engine is to locate item vectors which are similar in pattern for the current viewed item vector.

e.g. A vector with pattern [1,1,1,0,0,0,0] is more similar to vector [0,1,1,0,1,0,0] than to vector [0,0,0,1,0,1,1].

I think Amazon uses Boolean values for the vectors and the structure of the matrix table is different where I think they use something like n*m matrix table where n is users and m is items. You can read more about it with the given reference links.

Written by kunuk Nykjaer

March 4, 2012 at 6:43 pm

Running time analysis – hashset used for running time optimization – C# example

leave a comment »

Imagine you want to find the list of distinct numbers of all possible duplicate numbers
from two lists with unknown length and unknown content.
You want to do it as fastest as possible.

e.g.
List a = [5,2,2,6,4,3,3]
List b = [2,2,2,1,3,4,3,5]
You should find [2,3,4,5]

How would you do it?

Do you use the usual nested for loops iteration?
If you don’t mind using some extra memory you could use the hashset trick.
See the code below for the examples.

If you worry about the memory you could sort the lists
and iterate them which takes in total O(nlogn) + O(mlogm). That is not covered here.

The hashset technique takes O(n+m) which is faster.

If you don’t know that a hashset is here are some wiki readings:
http://en.wikipedia.org/wiki/Hash_table
http://en.wikipedia.org/wiki/Set_(abstract_data_type)

The code shows two styles: one that runs O(n*m) and one that runs O(n+m) with different implementations.
I will use C# Linq also to show you how it can perform bad when you don’t use it correct (ver2 vs. ver4 in the code)

Source code and demo.
Program.cs

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

/// <summary>
/// Author: Kunuk Nykjaer
/// </summary>
public class Program
{
    const int Lengh = 1000;
    const int Range = 10000;
    static readonly Random Random = new Random();
    static readonly List<int> List1 = new List<int>();
    static readonly List<int> List2 = new List<int>();
    
    public static void Main(string[] args)
    {
        for (int i = 0; i < Lengh; i++)
            List1.Add(Random.Next(Range));
        for (int i = 0; i < Lengh; i++)
            List2.Add(Random.Next(Range));

        Console.WriteLine("n and m has length: {0}", Lengh);

        RunningTime1();
        RunningTime2();
        RunningTime3();
        RunningTime4();
        RunningTime5();

        Console.WriteLine("press.."); Console.ReadKey();
    }
      
    private static void RunningTime1()
    {
        Console.WriteLine("1.  O(n*m)  Classic nested for loops");
        var sw = new Stopwatch();
        sw.Start(); //O(1)
        var equals = new HashSet<int>();
        foreach (var i in List1) //O(n)
            foreach (var j in List2) //O(m)
                if (i == j) //O(1)
                    equals.Add(i); //O(1)

        sw.Stop();

        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime2()
    {
        Console.WriteLine("2.  O(n*m)  Linq version quck style");
        var sw = new Stopwatch();
        sw.Start();

        // from... +    Distinct 
        //O(n*m)   +   O((n+m)/2)    )
        var equals = (from i in List1 from j in List2 where i == j select i).
            Distinct().ToList();

        sw.Stop();

        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime3()
    {
        Console.WriteLine("3.  O(n+m)  Linq");        
        var sw = new Stopwatch();
        sw.Start();

        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)

        foreach (var i in List1) //O(n)
            set.Add(i); //O(1)

        foreach (int i in List2.Where(set.Contains)) //O(m)
            equals.Add(i); //O(1)

        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime4()
    {
        Console.WriteLine("4.  O(n+m) foreach loop");
        
        var sw = new Stopwatch();
        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)
        sw.Start();
        foreach (var i in List1) //O(n)
            set.Add(i);
        foreach (int i in List2) //O(m)
            if (set.Contains(i)) //O(1)
                equals.Add(i); //O(1)

        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
    
    private static void RunningTime5()
    {
        Console.WriteLine("5.  O(n+m) for loop");
        
        var sw = new Stopwatch();
        sw.Start();

        var equals = new HashSet<int>(); //O(1)
        var set = new HashSet<int>(); //O(1)

        int count1 = List1.Count; //O(1)
        for (int i = 0; i < count1; i++) //O(n)
            set.Add(List1[i]); //O(1)

        int count2 = List2.Count; //O(1)
        for (int i = 0; i < count2; i++) //O(m)
        {
            int j = List2[i]; //O(1)
            if (set.Contains(j)) //O(1)
                equals.Add(j); //O(1)
        }
        
        sw.Stop();
        Console.WriteLine("Elapsed msec: {0}, equals len: {1}", 
            sw.ElapsedMilliseconds, equals.Count);
    }
}

O(n*m) and O(n+m)

n and m has length: 10000
1.  O(n*m)  Classic nested for loops
Elapsed msec: 1582, equals len: 900
2.  O(n*m)  Linq version quck style
Elapsed msec: 8446, equals len: 900
3.  O(n+m)  Linq
Elapsed msec: 8, equals len: 900
4.  O(n+m) foreach loop
Elapsed msec: 1, equals len: 900
5.  O(n+m) for loop
Elapsed msec: 1, equals len: 900

Ver2 is a bad implementation using Linq. Linq can cost you running time if you use it poorly.

O(n*m) and O(n+m)

n and m has length: 20000
1.  O(n*m)  Classic nested for loops
Elapsed msec: 6285, equals len: 3234
2.  O(n*m)  Linq version quck style
Elapsed msec: 34932, equals len: 3234
3.  O(n+m)  Linq
Elapsed msec: 10, equals len: 3234
4.  O(n+m) foreach loop
Elapsed msec: 4, equals len: 3234
5.  O(n+m) for loop
Elapsed msec: 3, equals len: 3234

The poor version takes 30 sec. and the hashset trick are maximum 10 msec.

Here is a test with larger list, the O(n*m) versions were to slow to include.
O(n+m) with large list

n and m has length: 1000000
3.  O(n+m)  Linq
Elapsed msec: 505, equals len: 90594
4.  O(n+m) foreach loop
Elapsed msec: 537, equals len: 90594
5.  O(n+m) for loop
Elapsed msec: 499, equals len: 90594

Even for large list, the time is below 1 sec.
For nested loops it would have taken 12 days!
(source: Algorithm Design book by Jon Kleinberg and Éva Tardos, chapter 2, O(n*n) where n=1.000.000)

The lesson to take is:
When you are about to code something using nested for loops, first see if the list is large, if yes consider using this trick if fast running time is important.

Written by kunuk Nykjaer

March 4, 2012 at 12:52 am

Follow

Get every new post delivered to your Inbox.