XML Query System

ABSTRACT

XML has become the lingua franca for data exchange and integration across administrative and enterprise boundaries. Nearly all data providers are adding XML import or export capabilities, and standard XML Schemas and DTDs are being promoted for all types of data sharing. The ubiquity of XML has removed one of the major obstacles to integrating data from widely disparate sources – namely, the heterogeneity of data formats. However, general-purpose integration of data across the wide area also requires a query processor that can query data sources on demand, receive streamed XML data from them, and combine and restructure the data into new XML output — while providing good performance for both batch-oriented and ad-hoc, interactive queries.

The aim of this project is to design and develop a simple Java-based query engine that reads and parses XML documents and searches collections of XML documents through an XQuery front end. The engine will be developed as a straightforward component API that allows it to be easily embedded in end-user applications.
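As a sketch of what such an embeddable component API might look like, consider the Java interface below. It is a minimal illustration only: the interface name and its methods are hypothetical and do not correspond to the actual engine API.

import java.io.Reader;

// Hypothetical sketch of an embeddable XML query-engine API.
// The names below are illustrative only, not the engine's actual interface.
public interface XmlQueryEngine {

    /** Parse the XML document supplied by the reader and register it under a name. */
    void registerDocument(String name, Reader xmlSource) throws Exception;

    /**
     * Evaluate an XQuery expression over the registered documents and
     * return the result serialized as XML.
     */
    String execute(String xquery) throws Exception;
}

An embedding application would register its documents and then pass XQuery strings to execute, receiving serialized XML results back.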

Existing System:

For many years, a wide variety of domains, ranging from scientific research to electronic commerce to corporate information systems, have had a great need to be able to integrate data from many disparate data sources at different locations, controlled by different groups. Until recently, one of the biggest obstacles was the heterogeneity of the sources’ data models, query capabilities, and data formats. Even for the most basic data sources, custom wrappers would need to be developed for each data source and each data integration mediator, simply to translate mediator requests into data source queries, and to translate source data into a format that the mediator can handle.

The emergence of XML as a common data format, as well as the support for simple web-based query capabilities provided by related XML standards, has suddenly made data integration practical in many more cases. XML itself does not solve all of the problems of heterogeneity — for instance, sources may still use different tags or terminologies — but often, data providers come to agreement on standard schemas, and in other cases, we can use established database techniques for defining and resolving mappings between schemas. As a result, XML has become the standard format for data dissemination, exchange, and integration. Nearly every data management-related application now supports the import and export of XML, and standard XML Schemas and DTDs are being developed within and among enterprises to facilitate data sharing.

In many data integration applications, XML is merely a “wire format,” the result of some view over a live, dynamic, non-XML source. In fact, the source may only expose subsets of its data as XML, via a query interface with access restrictions, e.g., the source may only return data matching a selection value, as in a typical web form. Since the data is controlled and updated externally and only available in part, it is difficult or impossible to cache the data. Moreover, the data sources may be located across a wide-area network or the Internet itself, so queries must be executed in a way that is resilient to network delays. Finally, the sources may be relatively large, in the 10s to 100s of MB or more, and may require an appreciable amount of time to transfer across the network and parse. We refer to these types of data sources as “network-bound”: they are only available across a network, and the data can only be obtained through reading and parsing a (typically finite) stream of XML data.

To this point, integration of network-bound, “live” XML data has not been well studied. Most XML work in the database community has focused on designing XML repositories and warehouses, exporting XML from relational databases, adding information retrieval-style indexing techniques to databases, and on supporting query subscriptions or continuous queries that provide new results as documents change or are added.

Clearly, both warehousing and indexing are useful for storing, archiving, and retrieving file-based XML data or documents, but for many integration applications, support for queries over dynamic, external data sources is essential. This requires a query processor that can request data from each of the sources, combine this data, and perhaps make additional requests of the data sources as a result. To the best of our knowledge, no existing system provides this combination of capabilities. The Web community has developed a class of query tools that are restricted to single documents and do not scale to large documents. The database community’s web-based XML query engines, such as Niagara and Xyleme, come closer to meeting the needs of data integration, but they are still oriented towards smaller documents (which may be indexable or warehoused), and they give little consideration to processing data from slow sources or XML that is larger than memory.

Proposed System:

In this project, we describe Tukwila’s architecture and implementation, and we present a detailed set of experiments demonstrating that the Tukwila XML query processing architecture provides superior performance to existing XML query systems within our target domain of network-bound data. Tukwila produces initial results rapidly and completes queries in less time than previous systems, and it also scales better to large XML documents. The result is the first scalable query processor for network-bound, live XML data. We validate Tukwila’s performance by comparing it with leading XSLT and data integration systems under a number of different classes of documents and queries (ranging from document retrieval to data integration); we show that Tukwila can read and process XML data at a rate roughly equivalent to the performance of SQL and the JDBC protocol across a network.

Detailed System Description:

Querying XML

During the past few years, numerous alternative query languages and data models for XML have been proposed, including XML-QL and XSLT. XSLT is a single-document-oriented query language consisting of rules: each rule matches a particular path in an XML tree and applies a transformation to the underlying subtree. XML-QL was a data-oriented query language, adapted from the semistructured database community, and could join data across documents, but had few document-oriented features. Recently, the World Wide Web Consortium has combined the features of these languages in its XQuery language specification and accompanying data model. The XQuery data model defines an XML document as a tree of ordered nodes of different content types (e.g., element, processing instruction, comment, text), where element nodes may also have unordered attributes. For example, the XML document of Figure 1 can be modeled as the tree of Figure 2. In this diagram, we have represented elements as labeled nodes, text content as leaf nodes, attributes as annotations beside their element nodes, and special IDREF-typed reference attributes as dashed edges from their elements to their targets (where the target element is identified by an ID-typed attribute of the same name).
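Figures 1 and 2 are not reproduced in this document. As a rough stand-in, the short Java program below uses the standard JAXP DOM API from J2SE, with a small made-up XML fragment (not the actual Figure 1 document), to print the same kind of tree view: element nodes with their attributes, and text content as leaves.

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

public class DataModelDemo {
    // Illustrative fragment only -- not the actual Figure 1 document.
    static final String XML =
        "<db><book year=\"2002\"><title>XML Querying</title>"
        + "<author>Jane Doe</author></book></db>";

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(XML.getBytes("UTF-8")));
        print(doc.getDocumentElement(), "");
    }

    // Walk the tree: element nodes are labeled interior nodes, text nodes are
    // leaves, and attributes are annotations attached to their element nodes.
    static void print(Node node, String indent) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            System.out.println(indent + "element: " + node.getNodeName());
            NamedNodeMap attrs = node.getAttributes();
            for (int i = 0; i < attrs.getLength(); i++) {
                Node a = attrs.item(i);
                System.out.println(indent + "  @" + a.getNodeName() + " = " + a.getNodeValue());
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            System.out.println(indent + "text: \"" + node.getNodeValue() + "\"");
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            print(c, indent + "  ");
        }
    }
}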

The XQuery language is designed to extract and combine subtrees within this data model. It is generally based on a FOR-LET-WHERE-RETURN structure (commonly known as a “flower” expression): the FOR clause provides a series of XPath expressions for selecting input nodes, the LET clause similarly defines collection-valued expressions, the WHERE clause defines selection and join predicates, and the RETURN clause creates the output XML structure. XQuery expressions can be nested within a RETURN clause to create hierarchical output, and, like OQL, the language is designed to have modular and composable expressions. Furthermore, XQuery supports several features beyond SQL and OQL, such as arbitrary recursive functions.
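To make the FOR-WHERE-RETURN structure concrete, the fragment below evaluates a tiny FLWOR expression from Java. This is only a sketch and assumes a third-party XQuery processor, Saxon (its s9api package), is on the classpath; the project's own engine is not shown here.

import net.sf.saxon.s9api.Processor;
import net.sf.saxon.s9api.SaxonApiException;
import net.sf.saxon.s9api.XQueryCompiler;
import net.sf.saxon.s9api.XQueryEvaluator;
import net.sf.saxon.s9api.XQueryExecutable;
import net.sf.saxon.s9api.XdmItem;

public class FlworDemo {
    public static void main(String[] args) throws SaxonApiException {
        // A small FLWOR ("flower") expression: FOR binds $i to each input value,
        // WHERE filters the bindings, and RETURN constructs the output XML.
        String query =
            "for $i in (1, 2, 3, 4, 5) " +
            "where $i mod 2 = 0 " +
            "return <even>{$i}</even>";

        Processor processor = new Processor(false);            // Saxon HE
        XQueryCompiler compiler = processor.newXQueryCompiler();
        XQueryExecutable executable = compiler.compile(query);
        XQueryEvaluator evaluator = executable.load();

        // One output item per surviving iteration: <even>2</even>, <even>4</even>
        for (XdmItem item : evaluator) {
            System.out.println(item);
        }
    }
}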

XQuery execution can be considered to begin with a variable binding stage: the FOR and LET XPath expressions are evaluated as traversals through the data model tree, beginning at the root. The tree matching the end of an XPath is bound to the FOR or LET clause’s variable. If an XPath has multiple matches, a FOR clause will iterate and bind its variable to each, executing the query’s WHERE and RETURN clause for each assignment. The LET clause will return the collection of all matches as its variable binding.

A query typically has numerous FOR and LET assignments, and legal combinations of these assignments are created by iterating over the various query expressions. An example XQuery appears in Figure 3. We can see that the variable $b is assigned to each book subelement under the db element in document books.xml; $t is assigned the title within a given $b book, and so forth. Our version of XPath includes extensions allowing for disjunction along any edge (e.g., $n can be either an editor or author), as well as a regular-expression-like Kleene star operator (not shown). In the example, multiple match combinations are possible, so the variable binding process is performed in the following way. First, the $b variable is bound to the first occurring book. Then the $t and $n variables are bound in order to all matching title and editor or author subelements, respectively. Every possible pairing of $t and $n values for a given $b binding is evaluated in a separate iteration; then the process is repeated for the next value of $b. We observe that this process is virtually identical to a relational query in which we join books with their titles and authors — we will have a tuple for each possible (title, editor|author) combination per book. The most significant difference is in the terminology: for XQuery, we have an “iteration” that produces a binding for each variable, and in a relational system we have a “tuple” with a value in each attribute. The RETURN clause specifies a tree-structured XML constructor that is output on each iteration, with the variables replaced by their bound values. Note that variables in XQuery are often bound to XML subtrees (identified by their root nodes) rather than to scalar values. The result of the example query appears in Figure 4. An item element is output for each possible combination of bindings.
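Since Figures 3 and 4 are not reproduced here, the Java string constant below gives a rough reconstruction, based purely on the description above, of what such a query could look like; the actual figure query and the engine's XPath-extension syntax may differ.

public class BooksQueryExample {
    // Rough reconstruction of the kind of query described above;
    // the actual Figure 3 query may use different syntax.
    static final String QUERY =
        "for $b in doc(\"books.xml\")/db/book, " +
        "    $t in $b/title, " +
        "    $n in $b/(editor|author) " +          // disjunction along an edge
        "return <item>{$t}{$n}</item>";

    // Output shape (cf. Figure 4): one <item> element is emitted for every
    // (title, editor|author) binding combination within each book bound to $b.
}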

Number of Modules

After careful analysis, the system has been identified as consisting of the following modules:

XML Core Engine

Word Processing Module

Data Reader

Streaming Engine

Required Hardware

1.   256 MB RAM.

2.   20 MB Hard Disk Space

3.   P-IV Processor.

 Required Software

1.   Windows 2000/XP operating system.

2.   Internet Explorer 6.0 or Mozilla Firefox

3.   Java 2 Platform, Standard Edition (J2SE) v1.4.2 or above as the core Java development tool (http://java.sun.com/j2se/)

4.   XQuery

5.   Eclipse for Development
