Clustering with Multi-Viewpoint based Similarity Measure (2011)

Abstract:

All clustering methods have to assume some cluster relationship among the data objects that they are applied on. Similarity between a pair of objects can be defined either explicitly or implicitly. In this paper, we introduce a novel multi-viewpoint based similarity measure and two related clustering methods. The major difference between a traditional dissimilarity/similarity measure and ours is that the former uses only a single viewpoint, which is the origin, while the latter utilizes many different viewpoints, which are objects assumed to not be in the same cluster with the two objects being measured. Using multiple viewpoints, more informative assessment of similarity could be achieved. Theoretical analysis and empirical study are conducted to support this claim. Two criterion functions for document clustering are proposed based on this new measure. We compare them with several well-known clustering algorithms that use other popular similarity measures on various document collections to verify the advantages of our proposal.
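The contrast drawn in the abstract can be sketched in a few lines. This is only an illustration, under the assumption that the multi-viewpoint similarity of two documents is the average, over all viewpoints, of the dot product of the two documents as seen from each viewpoint. The function names and the plain-list vector representation are hypothetical and not taken from the project source code.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sub(a: Vector, b: Vector) -> list:
    """Component-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]


def origin_similarity(di: Vector, dj: Vector) -> float:
    """Single-viewpoint similarity: the two documents seen from the origin."""
    return dot(di, dj)


def multi_viewpoint_similarity(di: Vector, dj: Vector,
                               viewpoints: Sequence[Vector]) -> float:
    """Average of (di - dh) . (dj - dh) over every viewpoint dh.

    Assumption: the viewpoints are documents taken to lie outside the
    cluster that contains di and dj.
    """
    if not viewpoints:
        raise ValueError("at least one viewpoint is required")
    total = sum(dot(sub(di, dh), sub(dj, dh)) for dh in viewpoints)
    return total / len(viewpoints)
```

With the origin as the only viewpoint, the multi-viewpoint value reduces to the ordinary dot product, which matches the abstract's description of the traditional measure as the single-viewpoint case.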

Category:

          Microsoft ASP.NET Based Web Application.

Objective:

  • Clustering is a division of data into groups of similar objects.
  • Representing the data by fewer clusters necessarily loses certain fine details, but achieves simplification.
  • Similar documents are grouped together in a cluster if their cosine similarity is at least a specified threshold (equivalently, if their cosine distance is below a threshold).
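A minimal sketch of threshold-based grouping follows. It assumes two documents belong together when their cosine similarity reaches the threshold, so that a higher value means more similar. The greedy single-pass strategy and the use of each cluster's first member as its representative are illustrative assumptions, not the project's documented method.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def group_by_threshold(docs: Sequence[Sequence[float]],
                       threshold: float) -> List[List[int]]:
    """Assign each document to the first cluster whose representative
    (its first member) is similar enough; otherwise start a new cluster.

    Returns clusters as lists of document indices.
    """
    clusters: List[List[int]] = []
    for i, doc in enumerate(docs):
        for cluster in clusters:
            if cosine_similarity(docs[cluster[0]], doc) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters
```

For example, `group_by_threshold([[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]], 0.8)` places the first two vectors in one cluster and the last two in another.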

EXISTING SYSTEM:

Existing systems greedily pick the next frequent item set, which represents the next cluster, so as to minimize the overlap between the documents that contain both that item set and some of the remaining item sets.

  • In other words, the clustering result depends on the order in which the item sets are picked, which in turn depends on the greedy heuristic. By contrast, the method adopted here does not select clusters in a sequential order; instead, each document is assigned to its best cluster.

PROPOSED SYSTEM:

  • The main work is to develop a novel hierarchical algorithm for document clustering that is both efficient and accurate.
  • It focuses on studying and exploiting the cluster overlap phenomenon to design cluster merging criteria. A new way of computing the overlap rate is proposed in order to improve time efficiency and “the veracity” of the result. Building on the hierarchical clustering method, the Expectation-Maximization (EM) algorithm is used with a Gaussian Mixture Model to estimate the parameters, and the two sub-clusters with the largest overlap are merged.
  • Experiments on both public data sets and document clustering data show that this approach can improve the efficiency of clustering and save computing time.
  • Given a data set satisfying the distribution of a mixture of Gaussians, the degree of overlap between components affects the number of clusters “perceived” by a human operator or detected by a clustering algorithm. In other words, there may be a significant difference between intuitively defined clusters and the true clusters corresponding to the components in the mixture.
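The merging step described above can be sketched as follows. The listing does not define its overlap rate, so this example substitutes the Bhattacharyya coefficient of two one-dimensional Gaussians as a stand-in. It also assumes the component weights, means, and variances have already been estimated (for instance by EM). The `Component` type and all function names are hypothetical.

```python
import math
from itertools import combinations
from typing import List, NamedTuple, Sequence


class Component(NamedTuple):
    weight: float  # mixing proportion
    mean: float
    var: float     # variance, must be positive


def overlap(a: Component, b: Component) -> float:
    """Bhattacharyya coefficient of two 1-D Gaussians (1.0 means identical).

    Assumption: used here only as a stand-in for the overlap rate.
    """
    s = a.var + b.var
    scale = math.sqrt(2.0 * math.sqrt(a.var * b.var) / s)
    return scale * math.exp(-((a.mean - b.mean) ** 2) / (4.0 * s))


def merge(a: Component, b: Component) -> Component:
    """Moment-matched merge: preserves the pair's total weight, mean, variance."""
    w = a.weight + b.weight
    mean = (a.weight * a.mean + b.weight * b.mean) / w
    second = (a.weight * (a.var + a.mean ** 2)
              + b.weight * (b.var + b.mean ** 2)) / w
    return Component(w, mean, second - mean ** 2)


def merge_most_overlapping(components: Sequence[Component]) -> List[Component]:
    """One agglomeration step: merge the pair with the largest overlap."""
    if len(components) < 2:
        return list(components)
    i, j = max(
        combinations(range(len(components)), 2),
        key=lambda p: overlap(components[p[0]], components[p[1]]),
    )
    rest = [c for k, c in enumerate(components) if k not in (i, j)]
    return rest + [merge(components[i], components[j])]
```

Calling `merge_most_overlapping` repeatedly until the desired number of components remains yields a simple bottom-up hierarchy. A real document clustering system would work with multivariate components rather than the one-dimensional case shown here.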

Modules:

      1. Admin

      2. Employee

      3. Global K-means

Admin:

The admin plays a major role in the system and is responsible for modifying its content. The admin can view all related resources and user details, delete an employee/user from the system, and view reports on the registered people; these reports are the output of the clustering implementation.

Employee:

An employee must register with the system. After registration, the employee receives a user ID and password, which are used to log in and reach the employee's home page. An employee can update his or her personal profile and view the reports.

Global K-means:

Every user of the system can view the implementation of the Global K-means algorithm. This module takes data from the database as input and presents the processed data to the viewer; that is, it analyzes the existing data in the database. Based on the selected employee characteristic (employee qualification or technology), it generates n clusters.
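The sketch below assumes the module follows the incremental scheme usually known as global k-means. That scheme starts from a single center at the data mean and adds one center at a time, trying every data point as the starting position of the new center and keeping the run with the lowest clustering error. Encoding employee characteristics as numeric tuples is an illustrative assumption, and the function names are hypothetical rather than taken from the project source code.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: Sequence[Point], centers: Sequence[Point],
           iterations: int = 50) -> Tuple[List[Point], float]:
    """Plain Lloyd iterations from the given centers; returns (centers, error)."""
    centers = list(centers)
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: sq_dist(p, centers[c]))
            groups[nearest].append(p)
        # An empty group keeps its previous center.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
        if updated == centers:
            break
        centers = updated
    error = sum(min(sq_dist(p, c) for c in centers) for p in points)
    return centers, error


def global_kmeans(points: Sequence[Point], k: int) -> List[Point]:
    """Grow from 1 to k centers, trying each point as the new center's start."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    centers = [tuple(sum(col) / len(points) for col in zip(*points))]
    for _ in range(2, k + 1):
        best = None
        for candidate in points:
            trial = kmeans(points, centers + [candidate])
            if best is None or trial[1] < best[1]:
                best = trial
        centers = best[0]
    return centers
```

Because every point is tried as a starting position at each step, the result does not depend on random initialization, at the cost of many more k-means runs than the standard algorithm.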

HARDWARE AND SOFTWARE REQUIREMENTS:

Software Requirements:

Language                         :  ASP.NET, C#.NET

Technologies                  :  Microsoft.NET Framework

IDE                                     :  Visual Studio 2008

Operating System           :  Microsoft Windows XP SP2 or Later Version

Hardware Requirements:

Processor                       : Intel Pentium or higher

RAM                                  : 512 MB (Minimum)

Hard Disk                        : 40 GB