Data Mining with Big Data (2014)



Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.


· Big Data applications have emerged in which data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a “tolerable elapsed time.” The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions. In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible.

· The unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.



· The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations and data volumes may continuously grow, an effective computing platform will have to take distributed large-scale data storage into consideration for computing.

· The challenges at Tier II center around semantics and domain knowledge for different Big Data applications. Such information can provide additional benefits to the mining process, as well as add technical barriers to the Big Data access (Tier I) and mining algorithms (Tier III).

· At Tier III, the data mining challenges concentrate on algorithm designs in tackling the difficulties raised by the Big Data volumes, distributed data distributions, and by complex and dynamic data characteristics.
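The Tier I point about data stored at different locations can be illustrated with a small Java sketch. It is only a sketch under stated assumptions: each site computes a compact local summary, and the summaries are merged centrally so that raw data never has to be moved. The class name PartialStats and the choice of count, sum, and sum of squares as the summary are illustrative assumptions, not part of the model described here.

```java
/** Compact summary of the values held at one (hypothetical) data site. */
final class PartialStats {
    final long count;
    final double sum;
    final double sumSq;

    PartialStats(long count, double sum, double sumSq) {
        this.count = count;
        this.sum = sum;
        this.sumSq = sumSq;
    }

    /** Summarize the values available locally at a single site. */
    static PartialStats of(double[] values) {
        double s = 0.0;
        double sq = 0.0;
        for (double v : values) {
            s += v;
            sq += v * v;
        }
        return new PartialStats(values.length, s, sq);
    }

    /** Combine two local summaries without revisiting the raw data. */
    PartialStats merge(PartialStats other) {
        return new PartialStats(count + other.count, sum + other.sum, sumSq + other.sumSq);
    }

    double mean() {
        return count == 0 ? Double.NaN : sum / count;
    }

    /** Population variance derived from the merged sums. */
    double variance() {
        if (count == 0) {
            return Double.NaN;
        }
        double m = mean();
        return sumSq / count - m * m;
    }

    public static void main(String[] args) {
        // Two hypothetical sites, each summarizing its own local values.
        PartialStats siteA = PartialStats.of(new double[] {1.0, 2.0, 3.0});
        PartialStats siteB = PartialStats.of(new double[] {4.0, 5.0});
        PartialStats global = siteA.merge(siteB);
        System.out.println("count=" + global.count + " mean=" + global.mean()
                + " variance=" + global.variance());
    }
}
```

Because merge only adds three numbers, new sites or newly arrived batches can be folded in as data volumes grow. The sum-of-squares form is used for brevity and can lose precision on very large or poorly scaled inputs.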


· We propose a HACE theorem to model Big Data characteristics. The characteristics of HACE make it an extreme challenge to discover useful knowledge from the Big Data.

· The HACE theorem suggests that the key characteristics of the Big Data are 1) huge with heterogeneous and diverse data sources, 2) autonomous with distributed and decentralized control, and 3) complex and evolving in data and knowledge associations.

· To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of the Big Data.


· Provide the most relevant and most accurate social sensing feedback to better understand our society in real time.



1. Integrating and mining biodata

2. Big Data Fast Response

3. Pattern matching and mining

4. Key technologies for integration and mining

5. Group influence and interactions



Integrating and mining biodata:

We have integrated and mined biodata from multiple sources to decipher and utilize the structure of biological networks and to gain new insights into the functions of biological systems. We address the theoretical underpinnings and current and future enabling technologies for integrating and mining biological networks. We have expanded and integrated the techniques and methods in information acquisition, transmission, and processing for information networks. We have developed methods for semantic-based data integration, automated hypothesis generation from mined data, and automated scalable analytical tools to evaluate simulation results and refine models.


Big Data Fast Response:

We propose to build a stream-based Big Data analytic framework for fast response and real-time decision making. The work includes:

· Designing Big Data sampling mechanisms to reduce Big Data volumes to a manageable size for processing;

· Building prediction models from Big Data streams. Such models can adaptively adjust to dynamic changes in the data, as well as accurately predict future trends in the data; and

· Building a knowledge indexing framework to ensure real-time data monitoring and classification for Big Data applications.
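The sampling step above can be sketched in Java with reservoir sampling, a standard way to keep a fixed-size uniform sample of a stream whose length is not known in advance. The text does not say which sampling mechanism is intended, so the choice of reservoir sampling and the names ReservoirSampler and sample are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Illustrative reservoir sampler (Algorithm R); not the authors' mechanism. */
final class ReservoirSampler {

    /** Returns up to k items drawn uniformly at random from the stream. */
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        List<T> reservoir = new ArrayList<T>(k);
        long seen = 0;
        while (stream.hasNext()) {
            T item = stream.next();
            seen++;
            if (reservoir.size() < k) {
                // Fill the reservoir with the first k items.
                reservoir.add(item);
            } else {
                // Pick a slot uniformly in [0, seen); the new item survives
                // only if the slot falls inside the reservoir (probability k / seen).
                long j = (long) (rng.nextDouble() * seen);
                if (j < k) {
                    reservoir.set((int) j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Hypothetical stream of 1,000 readings reduced to a sample of 5.
        List<Integer> readings = new ArrayList<Integer>();
        for (int i = 0; i < 1000; i++) {
            readings.add(i);
        }
        System.out.println(sample(readings.iterator(), 5, new Random()));
    }
}
```

The sampler makes one pass and holds only k items in memory, which fits the case where storing all observed data is infeasible.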

Pattern matching and mining:

We perform a systematic investigation on pattern matching, pattern mining with wildcards, and application problems as follows:

· Exploration of the NP-hard complexity of the matching and mining problems,

· Multiple patterns matching with wildcards,

· Approximate pattern matching and mining, and

· Application of our research to ubiquitous personalized information processing and bioinformatics
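As a minimal illustration of pattern matching with wildcards, the Java sketch below finds every position where a pattern occurs in a text, treating '?' as a wildcard that matches exactly one arbitrary character. The wildcard semantics, the class name WildcardMatcher, and the brute-force scan are all assumptions chosen for clarity. The harder problem variants mentioned above (the NP-hard cases, multiple patterns, approximate matching) are not covered by this sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative single-pattern matcher; '?' matches any one character. */
final class WildcardMatcher {
    static final char WILDCARD = '?';

    /** Returns the start index of every occurrence of pattern in text. */
    static List<Integer> findAll(String text, String pattern) {
        List<Integer> hits = new ArrayList<Integer>();
        int n = text.length();
        int m = pattern.length();
        if (m == 0) {
            return hits; // an empty pattern is treated as matching nowhere
        }
        for (int start = 0; start + m <= n; start++) {
            if (matchesAt(text, pattern, start)) {
                hits.add(start);
            }
        }
        return hits;
    }

    /** Checks the pattern against the text window beginning at start. */
    private static boolean matchesAt(String text, String pattern, int start) {
        for (int j = 0; j < pattern.length(); j++) {
            char p = pattern.charAt(j);
            if (p != WILDCARD && p != text.charAt(start + j)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical DNA-like sequence, in the spirit of the bioinformatics use case.
        System.out.println(findAll("ACGTACGAAC", "G?A"));
    }
}
```

The scan takes time proportional to the text length times the pattern length, which is adequate for short patterns.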


Key technologies for integration and mining:

We have performed an investigation on the availability and statistical regularities of multisource, massive and dynamic information, including cross-media search based on information extraction, sampling, uncertain information querying, and cross-domain and cross-platform information polymerization. To break through the limitations of traditional data mining methods, we have studied heterogeneous information discovery and mining in complex inline data, mining in data streams, multigranularity knowledge discovery from massive multisource data, distribution regularities of massive knowledge, and quality fusion of massive knowledge.


Group influence and interactions:

· Employing group influence and information diffusion models, and deliberating group interaction rules in social networks using dynamic game theory

· Studying interactive individual selection and effect evaluations under social networks affected by group emotion, and analyzing emotional interactions and influence among individuals and groups, and

· Establishing an interactive influence model and its computing methods for social network groups, to reveal the interactive influence effects and evolution of social networks.
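One way to picture an information diffusion model on a social network is the deterministic linear threshold model sketched below in Java. The text does not name a specific diffusion model, so this choice is an assumption, as are the class name LinearThresholdDiffusion and the array-based graph encoding. A member becomes active once the fraction of its active influencers reaches its personal threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative linear threshold diffusion over a small social graph. */
final class LinearThresholdDiffusion {

    /**
     * influencers[v] lists the members who influence member v;
     * threshold[v] is the fraction of them that must be active before v activates.
     * Returns every member that is active once the spread stops.
     */
    static Set<Integer> spread(int[][] influencers, double[] threshold, Set<Integer> seeds) {
        Set<Integer> active = new HashSet<Integer>(seeds);
        boolean changed = true;
        while (changed) {
            changed = false;
            // Collect activations first so each round uses the previous round's state.
            List<Integer> newlyActive = new ArrayList<Integer>();
            for (int v = 0; v < influencers.length; v++) {
                if (active.contains(v) || influencers[v].length == 0) {
                    continue;
                }
                int activeCount = 0;
                for (int u : influencers[v]) {
                    if (active.contains(u)) {
                        activeCount++;
                    }
                }
                if ((double) activeCount / influencers[v].length >= threshold[v]) {
                    newlyActive.add(v);
                }
            }
            if (!newlyActive.isEmpty()) {
                active.addAll(newlyActive);
                changed = true;
            }
        }
        return active;
    }

    public static void main(String[] args) {
        // Hypothetical five-member group; member 0 starts the spread.
        int[][] influencers = { {}, {0}, {0, 1}, {2, 4}, {3} };
        double[] threshold = {0.5, 0.5, 0.5, 1.0, 0.5};
        Set<Integer> seeds = new HashSet<Integer>(Arrays.asList(0));
        System.out.println(spread(influencers, threshold, seeds));
    }
}
```

Varying the seed set or the thresholds shows how the final set of active members depends on group structure, which is the kind of interactive influence effect the module sets out to study.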



· Processor  -  Pentium IV

· Speed      -  1.1 GHz

· RAM        -  512 MB (min)

· Hard Disk  -  20 GB

· Keyboard   -  Standard Keyboard

· Mouse      -  Two or Three Button Mouse

· Monitor    -  LCD/LED Monitor



· Operating System      -  Windows XP/7

· Programming Language  -  Java/J2EE

· Software Version      -  JDK 1.7 or above

· Database              -  MySQL



Xindong Wu, Fellow, IEEE, Xingquan Zhu, Senior Member, IEEE, Gong-Qing Wu, and Wei Ding, Senior Member, IEEE, “Data Mining with Big Data”, IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 1, JANUARY 2014.
