AI & Big Data Research & Technology Commercialization

Researchers at COMStar Tech and The George Washington University have addressed the challenges of AI & Big Data by developing a groundbreaking approach to information storage, information retrieval, knowledge extraction, feature clustering, and pattern prediction based on a new type of brain computational model. It moves away from the von Neumann computational model that has dominated computer design since the 1940s, toward a design that mimics known characteristics of the human brain's processing of vast amounts of data. In contrast to traditional information systems aimed at exact results or approximate facts, the new technology strives for knowledge extraction.

The initial idea of the computational brain model was first presented in the paper “On Clusterization of "Big Data" Streams,” published by ACM. The paper has already been recognized and has raised substantial interest, and it serves as a foundation for many future developments.

The objective of this research is to bring forward a new, simple, and efficacious tool for some of the most demanding operations of “Big Data” methodology: storing, searching, clustering, and predicting diverse information items in a data-stream mode, and retrieving them at high speed.

Current Research

The following research focus areas cover different aspects of this work and show how the brain-inspired design can be applied to cope with Big Data problems.
Combined, these techniques improve the speed of Big Data systems enormously while still achieving adequate levels of accuracy. Specifically, the approximate search can achieve speed-up ratios of 500x-6000x over a baseline naïve search technique. Among the fields that would benefit most from this innovation are “Big Data” processing, machine learning, database management, artificial intelligence, and web applications. All algorithms were implemented in both C/C++ and Python.
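To give a sense of where such speed-ups come from, the toy sketch below compares a naive linear scan against an indexed lookup on synthetic data. This is not a reproduction of the published benchmarks; every name and parameter here is illustrative only.

```python
import random
import string
import time

# Synthetic corpus of random fixed-length keys (illustrative only).
random.seed(42)
corpus = ["".join(random.choices(string.ascii_lowercase, k=8)) for _ in range(100_000)]
queries = random.sample(corpus, 500)

# Baseline naive search: a linear scan of the corpus for every query.
start = time.perf_counter()
hits_naive = sum(1 for q in queries if q in corpus)      # O(n) per query
t_naive = time.perf_counter() - start

# One linear pass builds a hash index; each query is then O(1) on average.
index = set(corpus)
start = time.perf_counter()
hits_fast = sum(1 for q in queries if q in index)
t_fast = time.perf_counter() - start

print(f"naive {t_naive:.3f}s vs indexed {t_fast:.6f}s "
      f"(~{t_naive / t_fast:.0f}x speed-up), hits equal: {hits_naive == hits_fast}")
```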

Intelligent Software-Defined Storage (SDS)

Intelligent SDS refers to storage infrastructure managed and automated by intelligent software rather than by the storage hardware itself. It enables access to very large collections of diverse files and improves the speed and efficiency of storage for varied data.

Intelligent Scalable Clustering

This new approach can find models, discover useful knowledge and insights, and uncover hidden features or characteristics that naturally divide the cases. We provide a fast and noise-robust pattern prediction and classification algorithm using characteristics derived from our novel intelligent scalable clustering scheme. The average complexity of clustering each input pattern is O(1), as illustrated by the sketch below.
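The published clustering scheme itself is not reproduced here. As a hedged stand-in, the following minimal sketch uses simple grid hashing, a well-known technique that likewise achieves average O(1) cost per input pattern; the class and function names are purely illustrative.

```python
from collections import defaultdict

def cell_key(vector, cell_size=1.0):
    """Quantize a feature vector to an integer grid cell.

    Cost is O(d) in the vector dimension and independent of how many
    patterns have already been clustered -- hence average O(1) per
    input pattern with respect to the data size.
    """
    return tuple(int(x // cell_size) for x in vector)

class GridClusterer:
    def __init__(self, cell_size=1.0):
        self.cell_size = cell_size
        self.clusters = defaultdict(list)   # cell key -> member patterns

    def add(self, vector):
        key = cell_key(vector, self.cell_size)
        self.clusters[key].append(vector)   # average O(1) hash insert
        return key                          # the key doubles as a cluster id

# Patterns landing in the same grid cell are treated as one cluster.
clusterer = GridClusterer(cell_size=2.0)
for v in [(0.5, 1.2), (1.1, 0.8), (5.0, 5.5), (4.7, 5.9)]:
    print(v, "->", clusterer.add(v))
```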

Cyber-Physical Stream Processing

This is a novel approach that achieves high success probabilities with efficient space requirements. CPS extracts the most frequent items, with appearance frequencies as low as 2%. The algorithm processes data streams in a single pass (on-the-fly) to provide up-to-the-moment analysis and statistics on currently arriving streams.
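The CPS algorithm itself is not given here. As a hedged illustration of single-pass frequent-item extraction with a 2% frequency floor, the sketch below uses the classical Misra-Gries summary: with k - 1 counters it surfaces every item whose frequency exceeds 1/k of the stream, so k = 50 covers a 2% threshold. The published algorithm may differ.

```python
def misra_gries(stream, k=50):
    """Single-pass frequent-items summary.

    With k - 1 counters, every item whose true frequency exceeds
    len(stream) / k is guaranteed to survive in the summary, so
    k = 50 captures items appearing in at least 2% of the stream.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop the ones that reach zero.
            counters = {i: c - 1 for i, c in counters.items() if c > 1}
    return counters

# Candidate heavy hitters from a small synthetic stream.
stream = ["a"] * 40 + ["b"] * 30 + list("cdefghij") * 2
print(misra_gries(stream, k=10))   # 'a' and 'b' survive as candidates
```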

A New Multi-Core Pipelined Architecture for Executing Sequential Programs for Parallel Computing

This promising architecture processes Big Data streams efficiently on-the-fly while executing sequential programs on a parallel-pipelined model. The new architecture offers several advantages over conventional models: it reduces the complexity, data dependency, latency, and cost overhead of parallel computing.
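The actual architecture is a hardware design; as a software-only, hedged analogy, the sketch below maps the sequential stages of a program onto separate threads connected by queues, so that successive stream items occupy different stages simultaneously. All names here are illustrative.

```python
import threading
import queue

SENTINEL = object()  # marks the end of the stream

def stage(fn, inbox, outbox):
    """Run one pipeline stage: apply fn to every arriving item."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)
            return
        outbox.put(fn(item))

# Three sequential steps of a program, each mapped to its own thread,
# so different stream items are processed in different stages at once.
steps = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
queues = [queue.Queue() for _ in range(len(steps) + 1)]
threads = [threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
           for i, fn in enumerate(steps)]
for t in threads:
    t.start()

for x in range(5):          # feed the stream on-the-fly
    queues[0].put(x)
queues[0].put(SENTINEL)

while (result := queues[-1].get()) is not SENTINEL:
    print(result)            # ((x + 1) * 2) - 3
for t in threads:
    t.join()
```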

Multi-core Pipelining with FPGAs

Experiments with the pipeline prototype have further confirmed a noticeable improvement from pipelining in the reconfigurable FPGA design.

Extracting Parallelism in Mobile GPUs/CPUs with Combinatorial Architectures

Advantages

  • Intelligent systems fed by “Big Data” streams
  • Artificial intelligence systems applied to diagnosis, decision-making, and gaming
  • Produces huge cost savings by reducing storage and processing costs for very large (terabyte-scale and beyond) “Big Data” systems
  • Provides highly efficient data access that surpasses current approaches limited by conventional computational models
  • Improves response time with speed-up ratios of 100x-6000x
  • High accuracy at high speed
  • Linear algorithms and approaches for on-the-fly stream processing
  • Both structured and unstructured data can be processed and analyzed efficiently
  • High-dimensional data can be processed

Patents

  • Intelligent Scalable Clustering (Being filed by COMStar Tech in 2016)
  • Multi-core Pipelining (Being filed by COMStar Tech in 2016)
  • Intelligent Software Defined Storage (filed by GWU in 2014)
  • Cyber-Physical Stream Processing (filed by GWU in 2014)
  • Multi-Layer Multi-Processor Information Conveyor with Periodic Transferring of Processors' States for On-The-Fly Transformation of Continuous Information Flows and Operating Method Therefor, US Patent No. 6,145,071, owned by The George Washington University.

Applications

These inventions could find numerous applications in our Big Data intelligent system. We have developed several demo systems and verified them using real-world data for the important applications described below; additional applications that could benefit from these systems are mentioned as well.

Fuzzy Searching, Text Mining, Voice Search

Searching through a large volume of data is critical for companies, scientists, and search-engine applications because of its time and memory complexity. We have developed a new approach for using fuzzy techniques to search big data. The method has linear time complexity for generating the dictionary and constant time complexity for accessing the data; updating with new data sets takes linear time in the number of new data points. The demo system searches English text, but the technique can be applied to other languages as well, and the search speed is very fast. Potentially, this technique can be used for speech recognition, image searching, and more.
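The published method is not detailed here. The sketch below uses the well-known deletion-dictionary idea (similar to SymSpell) as a hedged stand-in: a single linear pass over the vocabulary generates the dictionary of single-character-deletion variants, and each query then needs only a constant number of hash probes.

```python
from collections import defaultdict

def variants(word):
    """The word itself plus every single-character deletion of it."""
    yield word
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]

def build_index(vocabulary):
    """One linear pass over the vocabulary builds the fuzzy index."""
    index = defaultdict(set)
    for word in vocabulary:
        for v in variants(word):
            index[v].add(word)
    return index

def fuzzy_lookup(index, query):
    """Constant number of hash probes per query (edit distance <= 1)."""
    matches = set()
    for v in variants(query):
        matches |= index.get(v, set())
    return matches

index = build_index(["search", "seance", "stream", "cream"])
print(fuzzy_lookup(index, "serch"))    # {'search'} -- one deletion away
print(fuzzy_lookup(index, "streams"))  # {'stream'} -- one insertion away
```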

Biomedical applications and searching for genes disorders in genome databases

We have developed a demo system for such biomedical applications. The demo shows how a new disease could be recognized in synthetic data. The system receives a patient record based on answers to tens of yes-or-no questions, and it then assigns the disorder this patient might have, along with its probability of occurrence. The assignment is computed by a majority vote within the patient's cluster (see the sketch after the list below). Our system has the following advantages:
  • Accumulates several million patient records
  • Presents biomarker data in a specific form that is clustered automatically
  • For a new biomarker case entering the system, its diagnostic class is assigned immediately
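As a hedged illustration of the voting step described above (the record encoding and diagnosis labels here are hypothetical, not the demo system's), the sketch below returns the majority diagnosis of a cluster together with its empirical probability.

```python
from collections import Counter

def majority_diagnosis(cluster_records):
    """Vote over the labelled records already in the patient's cluster.

    Each record is (answers, diagnosis); the returned probability is
    the share of cluster members carrying the winning diagnosis.
    """
    votes = Counter(diagnosis for _, diagnosis in cluster_records)
    label, count = votes.most_common(1)[0]
    return label, count / len(cluster_records)

# Hypothetical cluster: yes/no answer vectors with known disorders.
cluster = [
    ((1, 0, 1, 1), "disorder A"),
    ((1, 0, 1, 0), "disorder A"),
    ((1, 1, 1, 1), "disorder B"),
]
print(majority_diagnosis(cluster))   # ('disorder A', 0.666...)
```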

Disease Diagnosis

We have developed a diagnosis system using our new approach. Experimental diagnosis results on real-world disease data, the Breast Cancer Wisconsin (Diagnostic) Dataset and the SPECT Heart Diagnosis Dataset, have verified that the new technology increases the quality of clustering and prediction while reducing the overall computational cost. Compared with conventional machine learning algorithms such as k-NN, Naive Bayes, and other statistical models, our new intelligent algorithm outperforms them, reaching high accuracy (around 90%) at very high speed.
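The intelligent algorithm itself is not public here, but the conventional baselines it is compared against can be reproduced with scikit-learn on the Breast Cancer Wisconsin (Diagnostic) data. The sketch below shows only those baselines, under the assumption that scikit-learn's bundled copy of the dataset and a simple train/test split are acceptable substitutes for the original evaluation setup.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

# Breast Cancer Wisconsin (Diagnostic) -- one of the two datasets cited above.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Conventional baselines the new clustering-based predictor is compared to.
for name, model in [("k-NN", KNeighborsClassifier(n_neighbors=5)),
                    ("Naive Bayes", GaussianNB())]:
    model.fit(X_train, y_train)
    print(f"{name}: accuracy = {model.score(X_test, y_test):.3f}")
```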

Protein Prediction

Progress in bioinformatics has resulted in the accumulation of information about genomic sequences and protein structures, and this information is anticipated to be applied to drug discovery. We are working toward this objective and aim to build a technology that predicts changes in the activity of proteins (gain or loss of function) from information about genomic variation, in particular mutations involving amino acid changes. Our new approach has been verified to be able to
  • precisely predict changes in protein activity based on information about genomic variations in cancer cases (gain or loss of function)
  • experimentally verify the predicted activity changes
  • deliver highly versatile and accurate predictions not restricted to single-family genes
  • improve the technology based on the results

Fraud Detection

Fighting fraud continues to rank high among strategic business drivers in retail banking. Banking institutions commonly rely on after-the-fact manual reports and threshold rules, and they typically try to draw that knowledge from structured data; however, most of the relevant knowledge resides in unstructured databases. Furthermore, fraud must be detected in real time: the transaction should be evaluated, and an authorization or decline decision made, before funds move. Analytics is the only way to detect fraudulent behavior patterns efficiently, yet most of the related data, such as customer account information, credit scores, behavior, transaction types, payments, history records, geographical locations, and comparisons, is unstructured and must be analyzed. Both good and bad behavior changes unpredictably over time, so detection cannot be reduced to setting thresholds. We have recently developed a novel approach based on our new Big Data technology to handle bank fraud detection, identity protection, and related problems, in collaboration with some banks in the U.S.
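As a hedged sketch of the pre-authorization flow described above (every function, field, and threshold here is hypothetical, not the deployed system), a transaction is scored against the customer's behavioral profile before any funds move.

```python
def risk_score(transaction, profile):
    """Toy behavioral score: how far the transaction deviates from the
    customer's usual amount and known locations (hypothetical fields)."""
    amount_ratio = transaction["amount"] / max(profile["avg_amount"], 1.0)
    location_penalty = 0.0 if transaction["location"] in profile["locations"] else 0.5
    return min(1.0, amount_ratio / 10.0 + location_penalty)

def authorize(transaction, profile, threshold=0.7):
    """Decide before funds movement: approve or decline in real time."""
    return "decline" if risk_score(transaction, profile) >= threshold else "approve"

profile = {"avg_amount": 80.0, "locations": {"Washington", "Arlington"}}
print(authorize({"amount": 60.0, "location": "Washington"}, profile))  # approve
print(authorize({"amount": 900.0, "location": "Lagos"}, profile))      # decline
```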

Computer security

Security is becoming an increasingly important concern as applications become more frequently accessible over networks and, as a result, vulnerable to a wide variety of threats. This system can be used as an Intrusion Prevention System (IPS) or Intrusion Detection System (IDS) for both knowledge-based and behavior-based cases. The 23-Questions approach is utilized to examine the behavior of a given user, program, or piece of malware and issues an alarm if it might cause harm. It is very useful for tracking unauthorized intruders in many areas, such as bank fraud detection.

New-Generation Multi-core Architectures

The pipelined and combinatorial architectures we proposed could be used for new generation unconventional multi-core architectures, heterogeneous computing, and streaming computing.

The main feature of the suggested systems is flexibility across various Big Data research strategies. We would like to interest application users and researchers in this approach, and then see how we can fit our operations to their specific needs. This new technology is well suited not only for applications such as speech recognition, fuzzy searching, protein prediction, biomarkers, Precision Medicine, cancer diagnostics, fraud detection, computer security, scientific research, judicial investigations, systems security, large-systems diagnostics, network classification, and social media, among many others, but can also be effectively applied to various computational intelligence problems. A further advantage of this new technology is its ability to process unstructured data; it has been used to analyze data patterns, perform clustering, and carry out predictive analysis.