The majority of the world’s digital universe is unstructured, in the form of text, images, audio, and video. Text is especially significant because subjective opinions are most often expressed in writing, in the form of articles, blogs, interviews, assessments, and so on. It is almost impossible for human analysts in any business vertical to go through the available data manually and extract the actionable intelligence that is buried inside textual documents. What we need is a tool that automatically searches texts and extracts useful information from them for use in a variety of applications, including document summarization, topic extraction, sentiment analysis, and identity management from open sources.


To search and extract information from a textual source, one needs to categorize documents and semantically analyze the content of each document in an automated manner, and then extract the evidence embedded in it and propagate that evidence into relevant analytics models. Simple keyword lookup is not enough, because the same word carries different meanings depending on the context surrounding it. We need capabilities that structure information: analyzing the content, extracting it, and representing it in a suitable format from which evidence can be extracted semantically.


ATP provides a set of powerful techniques for categorizing and structuring unstructured textual documents. For document categorization, ATP offers both supervised classification techniques, including Bayesian classifiers and Support Vector Machines, and unsupervised techniques, including Latent Semantic Analysis and Latent Dirichlet Allocation. For information structuring, ATP employs deep Natural Language Processing (NLP) techniques, including stemming, part‐of‐speech (POS) tagging, chunking, and named‐entity and co‐reference resolution, and then extracts the content and represents it in the form of subject‐predicate‐object triples. For example, the triple (John Smith, plays, football) is extracted from “John Smith plays football” by recognizing the named entity “John Smith”.
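To make the triple representation concrete, here is a minimal Java sketch of a subject‐predicate‐object triple together with a toy extractor. The class names `Triple` and `ToyTripleExtractor` are hypothetical, and the heuristic (a leading run of capitalized tokens is treated as the named entity, the next token as the predicate, and the remainder as the object) is an assumption made purely for illustration. It is not ATP's API and stands in for none of the NLP steps listed above.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Optional;

/** A subject-predicate-object triple, e.g. (John Smith, plays, football). */
final class Triple {
    final String subject;
    final String predicate;
    final String object;

    Triple(String subject, String predicate, String object) {
        this.subject = Objects.requireNonNull(subject);
        this.predicate = Objects.requireNonNull(predicate);
        this.object = Objects.requireNonNull(object);
    }

    @Override
    public String toString() {
        return "(" + subject + ", " + predicate + ", " + object + ")";
    }
}

/** Hypothetical toy extractor; illustrative only, not ATP's algorithm. */
final class ToyTripleExtractor {
    static Optional<Triple> extract(String sentence) {
        // Drop trailing sentence punctuation, then split on whitespace.
        String[] tokens = sentence.trim().replaceAll("[.!?]+$", "").split("\\s+");

        // Assumption: the leading run of capitalized tokens is the named entity.
        int i = 0;
        while (i < tokens.length
                && !tokens[i].isEmpty()
                && Character.isUpperCase(tokens[i].charAt(0))) {
            i++;
        }
        // Need a subject, a predicate token, and at least one object token.
        if (i == 0 || i + 1 >= tokens.length) {
            return Optional.empty();
        }
        String subject = String.join(" ", Arrays.copyOfRange(tokens, 0, i));
        String predicate = tokens[i];
        String object = String.join(" ", Arrays.copyOfRange(tokens, i + 1, tokens.length));
        return Optional.of(new Triple(subject, predicate, object));
    }
}
```

Calling `ToyTripleExtractor.extract("John Smith plays football")` yields a triple whose fields are `John Smith`, `plays`, and `football`. Real text needs the full pipeline of tagging, chunking, and resolution; the sketch only shows the shape of the output.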


On the information categorization side, our innovation lies not only in efficient implementations of mathematically well‐founded, cutting‐edge clustering techniques but also in a kernel‐based supervised technique whose implementation outperforms the Naive Bayesian Classifier. On the information structuring side, Machine Analytics’ innovations are proprietary in‐house algorithms for named‐entity and co‐reference resolution, information extraction, and summarization. Machine Analytics currently has three patents pending: two in Natural Language Processing and one in Semantic Search.
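For readers unfamiliar with the baseline named above, a Naive Bayesian text classifier can be sketched in a few lines of Java. This is a generic multinomial Naive Bayes with add‐one (Laplace) smoothing. It is neither ATP's implementation nor the kernel‐based technique, and the class name `NaiveBayesTextClassifier` and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Generic multinomial Naive Bayes sketch; not ATP's implementation. */
final class NaiveBayesTextClassifier {
    private final Map<String, Integer> docsPerLabel = new HashMap<>();
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> wordsPerLabel = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    // Lowercase and split on anything that is not a letter or digit.
    private static String[] tokenize(String text) {
        return text.toLowerCase().split("[^a-z0-9]+");
    }

    /** Adds one labeled training document. */
    void train(String label, String text) {
        totalDocs++;
        docsPerLabel.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            if (word.isEmpty()) continue;
            counts.merge(word, 1, Integer::sum);
            wordsPerLabel.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** Returns the most probable label, or null if nothing has been trained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docsPerLabel.keySet()) {
            // Log prior: fraction of training documents carrying this label.
            double score = Math.log(docsPerLabel.get(label) / (double) totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = wordsPerLabel.getOrDefault(label, 0);
            for (String word : tokenize(text)) {
                // Words never seen in training are skipped.
                if (word.isEmpty() || !vocabulary.contains(word)) continue;
                int count = counts.getOrDefault(word, 0);
                // Log likelihood with add-one smoothing over the vocabulary.
                score += Math.log((count + 1.0) / (total + vocabulary.size()));
            }
            if (score > bestScore) {
                bestScore = score;
                best = label;
            }
        }
        return best;
    }
}
```

After training on a handful of labeled documents with `train(label, text)`, `classify(text)` returns the label with the highest posterior score. Working in log space avoids numeric underflow on long documents.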


A robust implementation of ATP exists in platform‐independent Java. The engine’s categorization, extraction, and summarization functionality is accessed through an API. A demonstration is available upon request.