Abstract:
This technical paper studies data mining algorithms in cloud computing. Cloud computing is an environment created on the user’s machine from online applications that are stored in the cloud and run through a web browser. It is therefore essential to manage users’ data efficiently. The topics discussed are why data mining is required, a theoretical study of two algorithms (K-means and Apriori), their explanation using flow charts and pseudocode, their implementation in a programming language, and a comparison of the two algorithms on the basis of time complexity.
Keywords: clusters, data sets, item, centroid, distance, converge, frequent item sets, candidates.
Introduction:
Data mining in cloud computing applications is the retrieval of data from a huge collection of data sets; it is the process of converting that collection into a human-understandable form. By capturing a database and repurposing it, data mining can reveal insightful information. Cloud computing applications provide on-demand service. They offer end users unlimited data storage and therefore require the mining of massive amounts of data. Various techniques and logics help to extract this information and reveal it.
Here, two algorithms for data retrieval are described: K-means clustering and Apriori. The two techniques use different strategies for winnowing a database. K-means was proposed by Stuart Lloyd in 1957 and is therefore also known as Lloyd’s algorithm; the term "k-means" was first used by James MacQueen in 1967. K-means is one of the simplest ways to divide a data set into clusters: it calculates the Euclidean distance between each item and the centroids, starting from initial centroids, assigns each item to its nearest centroid, and recomputes the centroids until they converge. This process is repeated for each group, so that the whole database is sorted into clusters. In Apriori, the basic concept is to mine frequent item sets efficiently. Every subset of a frequent item set must also be frequent. The generation of frequent item sets is based on the relationships among attributes.
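The K-means procedure described above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this paper: the function names `euclidean` and `kmeans`, the random choice of initial centroids, the `max_iter` limit and the `seed` parameter are all assumptions made for the example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means sketch; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: the initial centroids are k items picked at random.
    centroids = rng.sample(points, k)
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each item joins the cluster of its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # converged: no item changed cluster
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels
```

For example, calling `kmeans(points, 2)` on a list of two-dimensional tuples returns the two centroids and, for each item, the index of the cluster it was assigned to.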
Conclusion:
In K-means, we first determine the initial centroids and then calculate the distance between each item and each centroid using the Euclidean distance formula. For a two-dimensional item, each distance calculation involves two subtractions, one summation, two multiplications, and one square root; the distances are then compared and the centroids recalculated. In Apriori, candidates are generated only from item sets that are frequent, which gives better performance, but the cost rises when a huge candidate set is produced, and so the complexity increases.
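The candidate generation mentioned above for Apriori can be illustrated with a short Python sketch. It is written under stated assumptions and is not the implementation evaluated in this paper: the function name `apriori` is hypothetical, and `min_support` is assumed to be an absolute transaction count rather than a fraction.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {item set: support count} for every frequent item set."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        # Count the transactions containing each candidate; keep the frequent ones.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_support}

    # Level 1: frequent single items.
    frequent = count({frozenset([i]) for t in transactions for i in t})
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unite pairs of frequent (k-1)-item sets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        frequent = count(candidates)
        result.update(frequent)
        k += 1
    return result
```

The prune step is where the property that every subset of a frequent item set is frequent pays off: candidates with an infrequent subset are discarded before the database is scanned. The join step still compares every pair of frequent item sets, which is where the cost grows when the candidate set becomes large.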