Data Analysis – Quality Assurance, Data Analysis & Science

Data Analytics

Data analysis covers two major areas: first, the analysis of structured and unstructured raw data to extract useful knowledge; second, the statistical evaluation of structured data. QADAS focuses on domain- and organization-specific requirements and conditions, and implements both standard and ad hoc models and methods.

Data analysis generally comprises several typical classes of tasks. Which of them to apply, and which methods are optimal, depends on the particular system.

Data formalization.
Data formalization aims at making data evaluable and compatible. It includes three basic tasks: first, integrating data from different sources to achieve uniformity; second, structuring unsorted pieces of information; third, making data quantitative.
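
As a minimal sketch of the first and third tasks, the following Python snippet maps records from two sources onto one shared schema and turns a text rating into a number. The source names, field names, unit conversion, and rating scale are all hypothetical and chosen only for illustration.

```python
# Hypothetical records from two sources with different field names and units.
crm_rows = [
    {"customer": "A-17", "revenue_eur": 1200.0, "rating": "good"},
    {"customer": "B-02", "revenue_eur": 450.0, "rating": "poor"},
]
shop_rows = [
    {"cust_id": "C-33", "revenue_cents": 98000, "rating": "average"},
]

# Assumed ordinal encoding that makes the text rating quantitative.
RATING_SCALE = {"poor": 1, "average": 2, "good": 3}


def from_crm(row):
    """Map a CRM record onto the shared schema."""
    return {
        "customer_id": row["customer"],
        "revenue_eur": float(row["revenue_eur"]),
        "rating_score": RATING_SCALE[row["rating"]],
    }


def from_shop(row):
    """Map a shop record onto the shared schema, converting cents to euros."""
    return {
        "customer_id": row["cust_id"],
        "revenue_eur": row["revenue_cents"] / 100,
        "rating_score": RATING_SCALE[row["rating"]],
    }


# Integration: one uniform list, regardless of where a record came from.
unified = [from_crm(r) for r in crm_rows] + [from_shop(r) for r in shop_rows]
for record in unified:
    print(record)
```

Keeping one small mapping function per source makes it easy to add a further source later without touching the shared schema.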

Clustering or segmentation.
Clustering sorts the instances of a dataset into subgroups of similar instances. One of the biggest challenges is deciding which attributes to include and which to exclude in order to get the best results.
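
To illustrate the idea, here is a minimal k-means sketch in plain Python. K-means is only one of many clustering methods and is used here as an assumption, not as a recommendation; the sample points, the two attributes, and the choice of seeding the centroids with the first k points are likewise illustrative.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Made-up instances described by two attributes each.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.2, 0.8), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends directly on which attributes make up each point and on how they are scaled, which is the attribute-selection challenge mentioned above.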

Anomaly or outlier detection.
Anomaly or outlier detection identifies instances that do not conform to the typical data in a dataset or data cluster. The deviation can relate to the data attributes themselves or to the purpose for which the data is used (contextual outliers). Different outlier detection methods exist, for instance proximity-based, grid-based, distance-based, or clustering-based methods.
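
A distance-based method can be sketched as follows: a point is flagged when the distance to its k-th nearest neighbor exceeds a threshold. The function name, the sample points, and the values of k and the threshold are assumptions made for this example; suitable values depend on the data.

```python
import math


def knn_outliers(points, k=2, threshold=3.0):
    """Return points whose distance to their k-th nearest neighbor exceeds threshold."""
    outliers = []
    for i, p in enumerate(points):
        # Distances from p to every other point, nearest first.
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        if distances[k - 1] > threshold:
            outliers.append(p)
    return outliers


# Made-up data: four points close together and one far away.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8), (2.0, 1.5), (9.0, 9.0)]
print(knn_outliers(points))
```

This brute-force version compares every pair of points, so it is only practical for small datasets.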

Association-rule data mining.
Association-rule mining is a technique for detecting groups of items that frequently co-occur. Unlike clustering and anomaly detection, which look for similarities or differences between instances in a dataset, association-rule mining looks at relations between data attributes.
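
Two common measures for such rules are support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The sketch below computes both for item pairs; the transactions and the helper name rule_metrics are invented for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up transactions, each a set of items bought together.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

# Count how often each item and each (alphabetically ordered) pair occurs.
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)


def rule_metrics(antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent."""
    pair = tuple(sorted((antecedent, consequent)))
    support = pair_counts[pair] / len(transactions)
    confidence = pair_counts[pair] / item_counts[antecedent]
    return support, confidence


support, confidence = rule_metrics("bread", "butter")
print(f"bread -> butter: support={support:.2f}, confidence={confidence:.2f}")
```

For larger item sets, dedicated algorithms such as Apriori avoid counting every combination; this sketch only covers pairs.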

Prediction.
Predictive analytics is the use of data to predict future trends and events. It includes classification, pattern recognition, and regression.
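
As a small regression example, the sketch below fits a straight line by ordinary least squares and extrapolates one step ahead. The monthly sales figures are made up, and a linear trend is an assumption that real data would need to justify.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variance terms (the common 1/n factor cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical sales per month.
months = [1, 2, 3, 4, 5]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept  # predicted value for month 6
print(slope, intercept, forecast)
```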

Missing data treatment.
Missing data is data that was not captured for variables of the investigated area or model. It can skew the results of any evaluation and produce biased estimates that lead to invalid conclusions. Missing data also reduces the statistical power of the analysis, which further undermines the validity of the results.

There are two primary ways to deal with missing data: imputation and removal. Imputation develops reasonable guesses for the missing values and can be useful when the portion of missing data is relatively low. Removal discards the affected data to reduce bias, but it can leave too few observations and lower the reliability of the analysis.
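
Both approaches can be sketched in a few lines. The example below uses mean imputation, which is only one of several possible imputation strategies, and listwise deletion for removal; the rows and column names are hypothetical.

```python
from statistics import mean

# Hypothetical measurements; None marks a value that was not captured.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 41, "income": None},
    {"age": 29, "income": 39000},
]


def drop_incomplete(rows):
    """Removal (listwise deletion): keep only rows with no missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Mean imputation: replace each missing value with its column mean."""
    columns = rows[0].keys()
    means = {
        c: mean(r[c] for r in rows if r[c] is not None) for c in columns
    }
    return [
        {c: r[c] if r[c] is not None else means[c] for c in columns}
        for r in rows
    ]


complete = drop_incomplete(rows)
imputed = impute_mean(rows)
print(len(complete), "rows remain after removal")
print(imputed)
```

Here removal halves the number of observations, while imputation keeps all four rows at the cost of inserting estimated values.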

The aim is to find the best approach for the particular situation and model.

Data visualization.
Data visualization is the graphical representation of information and data using visual elements such as charts, graphs, or maps. Visualization tools thus provide a comprehensible way to see and understand trends, outliers, and patterns in data.