Motivation

Big data is generally stored in relational databases such as Oracle, DB2, SQL Server, and MySQL, and in data warehouses such as Teradata. This data is generally heterogeneous and distributed, making it difficult to query accurately, quickly, and thoroughly. Although big data environments are migrating to scalable, fault‐tolerant cloud environments, the cloud remains experimental in nature because it lacks adequate data security and because the need for query tools that use the MapReduce paradigm remains unmet. As a result, data remains distributed in many formats, both structured and unstructured, and only non‐essential data is currently stored in the cloud.

Need

DAS will allow end users to query distributed data sources in natural language without knowing the source formats and locations. In the government space, most distributed archives and databases, such as NASA’s DAAC and the DoD’s DCGS, are autonomously maintained. Additionally, our personal communications with personnel from big retailers such as Sears and Walmart reveal that their databases are also highly distributed and heterogeneous, with less than 10% of the data residing in cloud environments, and that it takes an analyst almost a day to extract data from the relevant sources after a request is placed. Our approach will allow analysts to query data sources directly in natural language and will reduce this one‐day turnaround time to seconds.

Approach

DAS searches distributed structured and unstructured “big data” sources by semantically analyzing natural language queries, regardless of the data’s location, content, or format. It accepts natural language queries through a web‐based user interface and deploys “intelligent agents” to scan unrelated data sources and return answers that support the decision‐making process. DAS is format‐agnostic: users can perform distributed searches within the cloud without knowing the formats or locations of individual data sources. The data stores need not be traditional relational databases, nor do they need to be on the same network. The agents build and execute queries based on natural language user input, which is a non‐trivial exercise. Secured agents build temporary tables from multiple unrelated data sources by taking computations to the data sources, thus avoiding large downloads. The agent‐assembled data is then analyzed for underlying trends. Machine Analytics is uniquely positioned in this marketplace.
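The idea of taking computations to the data sources, rather than downloading the data, can be illustrated with a minimal sketch. It is not the DAS implementation: the two in-memory SQLite databases, the `sales` table, and the function names are all assumptions made for illustration. Each hypothetical agent runs an aggregate at its own source and returns only small partial results, which a coordinator then merges.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one remote source (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def agent_partial_aggregate(conn):
    """Run the aggregation at the source and return only (region, sum, count) rows."""
    cur = conn.execute(
        "SELECT region, SUM(amount), COUNT(*) FROM sales GROUP BY region"
    )
    return cur.fetchall()


def merge_partials(partials):
    """Combine per-source partial sums and counts into global averages per region."""
    totals = {}
    for part in partials:
        for region, subtotal, count in part:
            acc = totals.setdefault(region, [0.0, 0])
            acc[0] += subtotal
            acc[1] += count
    return {region: subtotal / count for region, (subtotal, count) in totals.items()}


# Two unrelated toy sources; only the small aggregate rows leave each one.
source_a = make_source([("east", 100.0), ("east", 300.0), ("west", 50.0)])
source_b = make_source([("east", 200.0), ("west", 150.0)])

averages = merge_partials(
    [agent_partial_aggregate(source_a), agent_partial_aggregate(source_b)]
)
print(averages)
```

In this sketch the coordinator never sees individual rows, only one (sum, count) pair per region per source, which is why the transfer stays small even when the underlying tables are large.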

In summary, DAS answers queries through the following stages:

  • Accept a search query from a user in natural language via a web interface (e.g. “What are the capitals of the states bordering New York?”).
  • Automatically translate the query into a set of sub‐queries using a combination of planning and traditional database query optimization techniques.
  • Generate a query plan represented in XML and guide the execution by spawning intelligent agents with various types of wrappers as needed for distributed sites.
  • Merge the answers returned by the agents and return them to the user.
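The stages above can be sketched for the example query about the states bordering New York. This is a toy under stated assumptions, not the DAS design: the XML element and attribute names, the `WRAPPERS` table, and the two dictionaries standing in for unrelated data sources are hypothetical, and the plan is written by hand rather than derived from natural language.

```python
import xml.etree.ElementTree as ET

# Toy stand-ins for two unrelated data sources (land borders only).
BORDERS = {
    "New York": ["Connecticut", "Massachusetts", "New Jersey",
                 "Pennsylvania", "Vermont"],
}
CAPITALS = {
    "Connecticut": "Hartford", "Massachusetts": "Boston",
    "New Jersey": "Trenton", "Pennsylvania": "Harrisburg",
    "Vermont": "Montpelier",
}


def build_plan(state):
    """Hand-written two-step plan; a real translator would derive it from the query text."""
    plan = ET.Element("plan")
    ET.SubElement(plan, "step", id="1", source="borders", arg=state)
    ET.SubElement(plan, "step", id="2", source="capitals", input="1")
    return plan


# One hypothetical wrapper per source; each maps a list of inputs to a list of answers.
WRAPPERS = {
    "borders": lambda values: [n for v in values for n in BORDERS.get(v, [])],
    "capitals": lambda values: [CAPITALS[v] for v in values if v in CAPITALS],
}


def execute_plan(plan):
    """Run steps in document order, feeding each step its literal arg or a prior result."""
    results = {}
    last_id = None
    for step in plan.findall("step"):
        if "input" in step.attrib:
            values = results[step.get("input")]
        else:
            values = [step.get("arg")]
        last_id = step.get("id")
        results[last_id] = WRAPPERS[step.get("source")](values)
    return results[last_id]


plan = build_plan("New York")
print(ET.tostring(plan, encoding="unicode"))
answer = execute_plan(plan)
print(answer)
```

Representing the plan as XML keeps it independent of any one source's query language, so each wrapper can translate its own step into whatever its source understands.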

Innovation

Our approach is innovative because no other currently available technology can query distributed data sources in natural language without knowledge of their formats and locations. The need for this capability is described above. Our natural language query translation, which uses hybrid deep linguistic processing and machine learning, together with the plan generation, its XML representation, and distributed execution, is unique and is a Machine Analytics trade secret. Machine Analytics currently has three patents pending: two in natural language processing and one in semantic search.

Demo

A DAS demonstration exists in a LAN setting that mimics distributed data sources and is available on request.