Denis's Blog: February 2010

Saturday, February 13, 2010

3 Idiots and Five Point Someone

I recently watched the Bollywood comedy 3 Idiots, directed by Rajkumar Hirani. The movie is heartbreaking as well as wonderfully entertaining when it comes to the story. Though it is said to be an adaptation of the novel "Five Point Someone – What not to do at IIT" by Chetan Bhagat, I feel the book is quite different from the movie. Still, you may find the book interesting too, and you may have had the same kind of experiences that appear in it. Yeah, that's the beauty of this story.

The book can be found here.

Tuesday, February 9, 2010

Association Rule Mining with Extended Vertical Format Data Mining

My final year project team and I, at the Department of Computer Science and Engineering, University of Moratuwa, conducted research on a better alternative to the Apriori algorithm, and on demonstrating the efficiency gain using a dataset. Under the supervision of Dr. Shehan Perera, we analyzed the Apriori algorithm and came up with an implementation that is intended to be more efficient than its predecessors.

Current databases are very large, reaching terabytes and petabytes, and the trend is towards further growth. With this explosive growth, the scalability of data mining techniques becomes particularly important. Finding association rules therefore requires efficient, scalable algorithms that solve the problem within a reasonable time.
Large companies have accumulated data on their customers, suppliers, products and services for decades. Thanks to the rapid development of e-commerce, a Web start-up can turn into a huge enterprise in a matter of months rather than years, and as a consequence its databases grow just as rapidly. Data mining, also called knowledge discovery in databases, provides organizations with tools to analyze large collections of information and find trends, patterns and relationships that can help in making strategic decisions.

Traditionally, data analysis algorithms assumed that the input database contained a relatively small number of records. Modern databases, however, are too large to fit entirely in main memory, and retrieving data from a hard drive is considerably slower than accessing data located in memory. For data mining methods to be effective on very large databases, they must therefore be highly scalable. An algorithm is called scalable if, for a fixed amount of main memory, its execution time increases linearly with the number of records in the input database.

Recently, researchers have focused their efforts on scalable algorithms for data mining in very large data sets. This post describes an efficient and scalable frequent itemset mining method based on the Apriori algorithm.

The Apriori algorithm was proposed for mining frequent itemsets for Boolean association rules. It operates on databases of transactions to learn the association rules. Apriori is a base algorithm proposed by R. Agrawal and R. Srikant in 1994; much research has built on it, and improvements have been suggested both for the general case and for specific subsets of the applicable data. Because of the huge amount of data mined in present-day applications, even a small performance gain in the algorithm results in a considerable gain in throughput. Some enhancements to the Apriori algorithm sacrifice accuracy for a better response time. Sampling is the simplest example, where accuracy is lost in favor of performance.
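To make the sampling trade-off concrete, here is a minimal Python sketch. It is only an illustration, not the implementation from our project: the function names `support` and `sampled_support`, the `fraction` and `seed` parameters, and the toy grocery transactions are all made up for this example. It compares the exact support of an itemset with an estimate computed from a random sample of the transactions.

```python
import random

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def sampled_support(itemset, transactions, fraction, seed=0):
    """Estimate support from a random sample instead of a full scan.

    `fraction` and `seed` are illustrative parameters, not part of any
    standard API. A smaller fraction is faster but less accurate.
    """
    rng = random.Random(seed)
    size = max(1, int(len(transactions) * fraction))
    sample = rng.sample(transactions, size)
    return support(itemset, sample)

# Toy data, invented for this example.
transactions = [frozenset(t) for t in [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
    {"eggs"},
]]

exact = support({"bread", "milk"}, transactions)  # 3 of 6 transactions
estimate = sampled_support({"bread", "milk"}, transactions, fraction=0.5)
```

The estimate only scans half of the transactions, so it can differ from the exact value. An itemset that is frequent in the full database may look infrequent in the sample, which is where the loss of accuracy comes from.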

Hash-based techniques, transaction reduction, partitioning, dynamic itemset counting, and multilevel and multidimensional association rules are some of the other common enhancements proposed to improve the efficiency of the Apriori algorithm.
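As one example from this list, transaction reduction relies on the observation that a transaction containing no frequent k-itemset cannot contain any frequent (k+1)-itemset, so it can be left out of later scans. The sketch below is a simplified illustration under that assumption; the helper name `reduce_transactions` and the sample data are hypothetical.

```python
def reduce_transactions(transactions, frequent_k_itemsets):
    """Keep only transactions that contain at least one frequent k-itemset.

    Transactions without any frequent k-itemset cannot contribute to
    frequent (k+1)-itemsets, so later passes can skip them.
    """
    return [t for t in transactions
            if any(itemset <= t for itemset in frequent_k_itemsets)]

# Toy data, invented for this example.
transactions = [frozenset(t) for t in [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"eggs", "jam"},
]]

# Assume a first pass found only these 1-itemsets to be frequent.
frequent_1 = [frozenset({"bread"}), frozenset({"milk"})]

reduced = reduce_transactions(transactions, frequent_1)
```

Here the transaction with only eggs and jam is dropped, so the next pass has one less transaction to scan.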

The Apriori algorithm generates candidate sets and tests them to find the frequent itemsets, significantly reducing the size of the candidate sets along the way. Algorithms such as Frequent-pattern growth (FP-Growth) mine frequent itemsets without candidate generation. Both families of algorithms have their own advantages and disadvantages. Many hybrid algorithms have been proposed, and are still being researched, to suit the general case or, more often, a particular case specialized for a given dataset.
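The generate-and-test idea can be sketched in a few lines of Python. This is a plain, textbook-style version of Apriori written for illustration, not our improved implementation. The function name `apriori`, the `min_count` parameter (an absolute support threshold), and the sample baskets are assumptions made for this example.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for itemsets found in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items with one pass over the data.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)

    k = 2
    while current:
        # Generate: join frequent (k-1)-itemsets into k-item candidates.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) != k:
                continue
            # Prune: every (k-1)-subset of a candidate must be frequent.
            if all(frozenset(sub) in current
                   for sub in combinations(union, k - 1)):
                candidates.add(union)

        # Test: one pass over the data to count the surviving candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent

# Toy data, invented for this example.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "eggs"},
    {"bread", "milk", "eggs"},
]
result = apriori(baskets, min_count=2)
```

With these baskets, all four single items and three pairs (bread and milk, bread and butter, milk and eggs) come out as frequent. Both possible three-item candidates are pruned before counting because one of their pairs is not frequent, which is how the candidate sets stay small.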

Happy new year Sri Lanka!

For the past four years, I have usually spent New Year's Eve at my boarding place with my colleagues, getting ready for exams in the upcoming month of January. (Though it's not official, January seems to be the month for final semester exams at my university :)).
Though we were away from our homes, we somehow got the chance to celebrate the new year with a few traditional new year foods at our boarding place.

In Sri Lanka, the new year normally falls on April 13th or 14th every year. Food is an essential part of New Year festivities in Sri Lanka. Sinhalese food is very rich in nutrition. People prepare sweetmeats such as mung kavum, konda kavum and unduvel. There is also an old tradition of preparing Kiri Bhath (milk rice).
Even so, Sri Lankans also celebrate New Year's Day on January 1st.

Enterprise mashups powered solutions for banking services

I came across an article about a real-world implementation of mashup technology in banking services. It explains how enterprise mashups bridge the information gap in financial services.