Q: You are given a logfile with multiple requests in the following form:
Timestamp Customer_ID Page_ID
Assume that the logfile is sorted on timestamp. Find a way of generating the most common 3-page sequence. A 3-page sequence is a sequence of pages that a user visits one after the other.
T1 C1 P1
T2 C2 P2
T3 C1 P2
T4 C2 P1
T5 C1 P1
Here, the most frequent 3-page sequence is P1->P2->P1. C2 is a corner case because C2 visits only 2 pages; I asked about this and was told to discard such sequences.
I gave 2 solutions:
Straightforward hashmap: HashMap<Sequence, Integer> in Java, where the key encodes a run of page visits. E.g. key(1->2->3) = "1->2->3", a delimited string; concatenating raw ids as 123 would be ambiguous once page ids have multiple digits (pages 12,3 vs 1,23 would collide).
O(n) time: O(n) to separate page visits by customer, O(n) to walk every possible window and increment its count in the hashmap (O(n - k) windows, strictly), and finally O(n) to find the maximum. Space is O(n) in all scenarios. Here n is the total number of logfile entries, and k is the sequence length.
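A minimal sketch of the hashmap approach, using the example log above. The `Entry` record and method names are illustrative, not from the original post; a delimited string key is used so multi-digit page ids cannot collide.

```java
import java.util.*;

public class TopSequence {
    // Hypothetical log entry shape: (timestamp, customerId, pageId).
    record Entry(long ts, String customer, String page) {}

    // Returns the most common 3-page sequence as "P1->P2->P1", or null if none.
    static String mostCommonSequence(List<Entry> log) {
        // Pass 1: group page visits per customer, preserving order
        // (the log is already sorted on timestamp).
        Map<String, List<String>> visits = new LinkedHashMap<>();
        for (Entry e : log) {
            visits.computeIfAbsent(e.customer(), c -> new ArrayList<>()).add(e.page());
        }
        // Pass 2: slide a window of 3 over each customer's visits; customers
        // with fewer than 3 visits contribute nothing (the C2 corner case).
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (List<String> pages : visits.values()) {
            for (int i = 0; i + 3 <= pages.size(); i++) {
                // Delimited key avoids collisions between e.g. (1,23) and (12,3).
                String key = pages.get(i) + "->" + pages.get(i + 1) + "->" + pages.get(i + 2);
                int c = counts.merge(key, 1, Integer::sum);
                if (c > bestCount) { bestCount = c; best = key; }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Entry> log = List.of(
            new Entry(1, "C1", "P1"), new Entry(2, "C2", "P2"),
            new Entry(3, "C1", "P2"), new Entry(4, "C2", "P1"),
            new Entry(5, "C1", "P1"));
        System.out.println(mostCommonSequence(log)); // prints P1->P2->P1
    }
}
```

Tracking the running maximum while counting folds the final "find the maximum" pass into the counting pass, though asymptotically it stays O(n).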
Compressed trie of page sequences: much more scalable and much less memory in a real-world scenario, at the cost of slightly more time to get the maximum. The leaves store the count of each sequence. It supports variable sequence lengths (k), and uses much less memory in practice because roughly 90% of users visit home first, then probably search/detail/cart/checkout, so prefixes are shared. To get the most popular (max-occurring) page sequence, keep a running maximum count. Each trie node keeps a HashMap from page id to child node for fast child retrieval (the keys are page ids, not characters, so we cannot just keep a 26-sized array).
Time complexity analysis: building the trie is O(n). It takes O(n) to parse the entries for all users, and in the worst case (all sequences unique) one node is created per page id; real-world input is faster because repeated sequences reuse existing nodes. Finding the maximum costs O(n) but needs no extra traversal: every time a sequence's count is incremented, compare it against the maximum encountered so far. Here n is the number of entries in the logfile.
Memory complexity: O(n) for the trie (again, much better in a real-world scenario thanks to shared prefixes).
All of this assumes the page sequence length k is a constant.
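A minimal sketch of the trie with the running maximum, assuming sequences have already been split out per customer (as in the first solution). Class and field names are illustrative; each node uses a HashMap from page id to child so children can be looked up by key.

```java
import java.util.*;

public class SequenceTrie {
    static class Node {
        // Children keyed by page id; a HashMap replaces the fixed
        // 26-entry array usable for character tries.
        Map<String, Node> children = new HashMap<>();
        int count; // times the sequence ending at this node was seen
    }

    final Node root = new Node();
    String best = null;
    int bestCount = 0;

    // Insert one k-page sequence and update the running maximum, so no
    // separate traversal is needed to read the answer back.
    void insert(List<String> seq) {
        Node cur = root;
        for (String page : seq) {
            cur = cur.children.computeIfAbsent(page, p -> new Node());
        }
        cur.count++;
        if (cur.count > bestCount) {
            bestCount = cur.count;
            best = String.join("->", seq);
        }
    }

    public static void main(String[] args) {
        SequenceTrie t = new SequenceTrie();
        // Sequences already extracted per customer from the log:
        t.insert(List.of("P1", "P2", "P1"));
        System.out.println(t.best); // prints P1->P2->P1
    }
}
```

Shared prefixes (e.g. everyone starting at home) occupy a single path near the root, which is where the real-world memory savings over one hashmap key per distinct sequence come from.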
I implemented the 2nd one. I was rejected, and the reason given after the interview was that I had issues understanding data structures. The interviewers agreed with my preference for the trie during the interview, but the final result was relayed to me by HR.