Getting Top 100 URLs from a Log File

Question

Getting Top 100 URLs from a Log File

In one of my friends, the following question was asked. Can someone tell me how to solve it?

We have a rather large log file, about 5 GB. Each line of the log file contains the URL that the user visited on our site. We want to find out which top 100 URLs our users visited. How to do it?

+5

algorithm data-structures

Animesh Porwal May 24 '12 at 8:42

source share

3 answers

Haozhun · Answer 1 · 2012-05-24T08:50:37+0000

If we have more than 10 GB of RAM, just do it directly with hashmap.

Otherwise, split it into multiple files using a hash function. Then process each file and get the top 5. With "top 5" for each file, it will be easy to get a common top 5.

. , . . , top5.

Thomash · Answer 2 · 2012-05-24T09:22:40+0000

URL- ( , , ), URL-, (, URL- ).

O (n * Log (n)).

3 ( 5 N) :

     File1 File2 File3
url1   5     0     5
url2   0     5     5
url3   5     5     0
url4   5     0     0
url5   0     5     0
url6   0     0     5
url7   4     4     4

url7 3 , .

Kunukn · Answer 3 · 2012-05-24T09:47:33+0000

, . . , , .

// Pseudo
Hashmap map<url,count>
while(log file has nextline){
    url = nextline in logfile
    add url to map and update count
}

List list 
foreach(m in map){
    add m to list         
}

sort the list by count value
take top n from the list

- O (n) + O (m * log (m)), n - m - .

# . . .

The algorithm uses a hash map to store found links. After that, the sorting algorithm sets the top 100 links. The sorting algorithm uses a simple data container data structure.

The complexity of the memory depends on the expected individual links. The hashmap should be able to contain the found individual links, otherwise this algorithm will not work.

// Implementation
using System;
using System.Collections.Generic;
using System.Linq;

public class Program
{
    public static void Main(string[] args)
    {
        RunLinkCount();
        Console.WriteLine("press a key to exit"); 
        Console.ReadKey();
    }

    class LinkData : IComparable
    {
        public string Url { get; set; }
        public int Count { get; set; }
        public int CompareTo(object obj)
        {
            var other = obj as LinkData;
            int i = other == null ? 0 : other.Count;
            return i.CompareTo(this.Count);
        }
    }

    static void RunLinkCount()
    {
        // Data setup
        var urls = new List<string>();
        var rand = new Random();
        const int loglength = 500000;
        // Emulate the log-file
        for (int i = 0; i < loglength; i++)
        {
            urls.Add(string.Format("http://{0}.com", rand.Next(1000)
                 .ToString("x")));
        }

        // Hashmap memory must be allocated 
        // to contain distinct number of urls
        var lookup = new Dictionary<string, int>();

        var stopwatch = new System.Diagnostics.Stopwatch();
        stopwatch.Start();

        // Algo-time
        // O(n) where n is log line count
        foreach (var url in urls) // Emulate stream reader, readline
        {
            if (lookup.ContainsKey(url))
            {
                int i = lookup[url];
                lookup[url] = i + 1;
            }
            else
            {
                lookup.Add(url, 1);
            }
        }

        // O(m) where m is number of distinct urls
        var list = lookup.Select(i => new LinkData 
             { Url = i.Key, Count = i.Value }).ToList();
        // O(mlogm)
        list.Sort();
        // O(m)
        var top = list.Take(100).ToList(); // top urls

        stopwatch.Stop();
        // End Algo-time

        // Show result
        // O(1)
        foreach (var i in top)
        {
            Console.WriteLine("Url: {0}, Count: {1}", i.Url, i.Count);
        }

        Console.WriteLine(string.Format("Time elapsed msec: {0}",
           stopwatch.ElapsedMilliseconds));
    }
}

Edit: This answer has been updated based on comments.

Added: runtime and memory complexity analysis.
added: pseudocode
added: explain how we manage a fairly large log file

Getting Top 100 URLs from a Log File

More articles: