Data quality
Data quality is a prerequisite for success in business and scientific research. Its definition depends on the context. Data analysts describe it through a set of data quality dimensions, each reflecting a different aspect of quality. Dozens of such dimensions have been proposed, and their concepts often overlap or recur under different names. To evaluate data quality properly and to plan improvement measures, the dimensions should be formalized and quantified.
QADAS supports the formalization of data quality concepts, provides solutions, and develops tools for their evaluation.
The most popular data quality dimensions are broken into several categories.
Intrinsic category
Correctness. Data correctness, or accuracy, refers to the degree to which data represents the real-world objects it describes. In many cases, correctness is evaluated by comparing the data with a reference source.
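As a rough illustration of comparing data with a reference source, the sketch below computes the share of records whose value matches the reference. The function name `correctness_ratio`, the field names, and the sample rows are assumptions made for this example, not part of any particular tool.

```python
def correctness_ratio(records, reference, key, field):
    """Share of records whose `field` equals the reference value for the same key.

    Records without a counterpart in the reference are skipped, because
    they cannot be judged. Returns None when nothing could be checked.
    """
    ref_by_key = {row[key]: row[field] for row in reference}
    checked = 0
    matched = 0
    for row in records:
        if row[key] not in ref_by_key:
            continue
        checked += 1
        if row[field] == ref_by_key[row[key]]:
            matched += 1
    return matched / checked if checked else None


# Hypothetical sample data: record 2 is misspelled, record 3 has no reference.
records = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Munchen"},
    {"id": 3, "city": "Hamburg"},
]
reference = [
    {"id": 1, "city": "Berlin"},
    {"id": 2, "city": "Munich"},
]
ratio = correctness_ratio(records, reference, "id", "city")
```

Exact equality is the simplest possible comparison; numeric or free-text fields would usually need a tolerance or a normalization step before matching.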
Validity. Validity refers to the degree to which data values comply with the rules defined for the system. These can be external rules, for instance regulations in the financial sector, or internal system rules. Validity is closely related to the correctness, completeness, and consistency of the data.
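One way to formalize validity is to express each rule as a predicate and count the records that pass all of them. The rules, field names, and allowed currency codes below are purely hypothetical; a real system would load them from its own regulations or internal specifications.

```python
import re

# Hypothetical rules: each maps a field name to a predicate on its value.
RULES = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 130,
    "email": lambda v: isinstance(v, str)
    and re.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+", v) is not None,
    "currency": lambda v: v in {"EUR", "USD", "GBP"},
}


def validity_violations(record, rules=RULES):
    """Return the sorted names of fields whose values break their rule."""
    return sorted(
        name
        for name, check in rules.items()
        if name in record and not check(record[name])
    )


def validity_ratio(records, rules=RULES):
    """Share of records with no rule violations, or None for an empty input."""
    if not records:
        return None
    valid = sum(1 for r in records if not validity_violations(r, rules))
    return valid / len(records)
```

Fields that are absent from a record are ignored here on purpose: whether a value must be present is a completeness question rather than a validity one.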
Uniqueness. Uniqueness means that no duplicate or redundant information exists across the datasets of the system. Each entity modeled in the system is captured and represented only once, within the proper component or database segment, so that no entity exists more than once within the data.
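A minimal sketch of a uniqueness check follows. It assumes that an entity is identified by a chosen set of fields and that values differing only in case or surrounding whitespace denote the same entity; both assumptions, and the function names, are illustrative.

```python
from collections import Counter


def _entity_key(row, key_fields):
    # Normalize case and whitespace so trivially different spellings collide.
    return tuple(str(row[f]).strip().lower() for f in key_fields)


def duplicate_keys(records, key_fields):
    """Map each normalized key that occurs more than once to its count."""
    counts = Counter(_entity_key(row, key_fields) for row in records)
    return {key: n for key, n in counts.items() if n > 1}


def uniqueness_ratio(records, key_fields):
    """Distinct entities divided by total records, or None for an empty input."""
    if not records:
        return None
    distinct = {_entity_key(row, key_fields) for row in records}
    return len(distinct) / len(records)


# Hypothetical sample: the first two rows describe the same customer.
customers = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "ann lee ", "email": "ANN@example.com"},
    {"name": "Bob Ray", "email": "bob@example.com"},
]
```

Real duplicate detection often needs fuzzy matching (typos, transposed names); exact matching on normalized keys only catches the simplest cases.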
Integrity. When we assign unique identifiers to different objects (customers, products, etc.) within our system, we simplify the management of the data. At the same time, this introduces the requirement that the object identifier be used as a foreign key throughout the whole data set. This is referred to as referential integrity. Rules associated with referential integrity are constraints against duplication and inconsistency.
Reliability/Consistency. Data reliability has two aspects. The first aspect relates to the functioning of different data sources in the system. It should be ensured that, regardless of which source collects particular data or where it resides, the data cannot contradict a value that resides in a different source or is collected by a different component of the system. The second aspect relates to the closeness of the initial data value to the subsequent data value.
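Referential integrity can be checked by looking for rows whose foreign key points at no existing object. The sketch below assumes two in-memory tables represented as lists of dictionaries; the table and column names are hypothetical.

```python
def orphaned_rows(child_rows, parent_rows, foreign_key, parent_key):
    """Return child rows whose foreign key matches no parent identifier."""
    parent_ids = {row[parent_key] for row in parent_rows}
    return [row for row in child_rows if row[foreign_key] not in parent_ids]


# Hypothetical sample: order 102 refers to a customer that does not exist.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 100, "customer_id": 1},
    {"order_id": 101, "customer_id": 2},
    {"order_id": 102, "customer_id": 9},
]
orphans = orphaned_rows(orders, customers, "customer_id", "customer_id")
```

In a relational database the same rule is normally enforced declaratively with a foreign key constraint; a check like this is mainly useful for data that arrives from files or from systems without such constraints.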
Data Decay. The measure of the rate of negative change to data.
Objectivity. Reflects the extent to which information is unbiased, unprejudiced, and impartial.
Reputation. Reflects the extent to which users hold the information in high regard in terms of its source and/or content.
Contextual category
Completeness. Completeness means that certain attributes must be assigned values. Completeness rules are based on the following three constraint levels:
- Mandatory attributes that require a value.
- Optional attributes, which may have a value based on some conditions.
- Inapplicable attributes, which must not have a value (for instance, a maiden name for a single male).
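The three constraint levels above can be sketched as a small rule set. The schema below (field names, the conditions under which an optional value is expected, and the conditions under which a value is inapplicable) is entirely hypothetical and only mirrors the maiden-name example.

```python
def is_empty(value):
    """Treat None and blank strings as missing."""
    return value is None or (isinstance(value, str) and not value.strip())


# Hypothetical schema for a person record.
MANDATORY = ["id", "last_name"]
EXPECTED_WHEN = {  # optional attributes: a value is expected if the condition holds
    "spouse_name": lambda r: r.get("marital_status") == "married",
}
INAPPLICABLE_WHEN = {  # a value must be absent if the condition holds
    "maiden_name": lambda r: r.get("sex") == "M"
    and r.get("marital_status") == "single",
}


def completeness_issues(record):
    """Return (field, problem) pairs for every completeness rule the record breaks."""
    issues = []
    for field in MANDATORY:
        if is_empty(record.get(field)):
            issues.append((field, "missing mandatory value"))
    for field, condition in EXPECTED_WHEN.items():
        if condition(record) and is_empty(record.get(field)):
            issues.append((field, "missing conditional value"))
    for field, condition in INAPPLICABLE_WHEN.items():
        if condition(record) and not is_empty(record.get(field)):
            issues.append((field, "value present but not applicable"))
    return issues
```

A completeness score for a dataset could then be taken as the share of records for which `completeness_issues` returns an empty list.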
Data Coverage. Reflects the degree to which all required records in the dataset are present.
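Coverage can be quantified as the share of required records that are actually present. The sketch assumes that each record is identified by a key and that the list of required keys is known in advance; both the function name and the sample keys are illustrative.

```python
def coverage(present_keys, required_keys):
    """Return (ratio of required keys that are present, set of missing keys).

    The ratio is None when no keys are required.
    """
    required = set(required_keys)
    if not required:
        return None, set()
    missing = required - set(present_keys)
    return 1 - len(missing) / len(required), missing


# Hypothetical example: a report must contain one record per country code.
ratio, missing = coverage(["DE", "FR", "IT"], ["DE", "FR", "IT", "ES"])
```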
Amount of data. Reflects the extent to which the volume or quantity of available data is appropriate for the tasks.
Effectiveness or usefulness. Reflects the capability of the data set to enable the achievement of specified goals or fulfill specified tasks with the accuracy and completeness required in the context of use.
Efficiency. Reflects the extent to which data can quickly meet the needs of users.
Timeliness (currency). Refers to the degree to which data are up to date and the extent to which they remain correct despite possible time-related changes.
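A simple way to quantify currency is to measure the share of records updated within an acceptable age. The sketch assumes each record carries a timezone-aware last-update timestamp in a field called `updated_at`; the field name and the 30-day threshold are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone


def currency_ratio(records, max_age, now, field="updated_at"):
    """Share of records whose last update is no older than `max_age`."""
    if not records:
        return None
    fresh = sum(1 for row in records if now - row[field] <= max_age)
    return fresh / len(records)


# Hypothetical sample, evaluated at a fixed reference time for reproducibility.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    {"id": 1, "updated_at": datetime(2024, 5, 20, tzinfo=timezone.utc)},
    {"id": 2, "updated_at": datetime(2023, 1, 1, tzinfo=timezone.utc)},
    {"id": 3, "updated_at": datetime(2024, 5, 31, tzinfo=timezone.utc)},
]
fresh_share = currency_ratio(rows, timedelta(days=30), now)
```

Passing `now` explicitly rather than reading the clock inside the function keeps the measurement repeatable. Note that a recent timestamp only suggests that a value is current; it does not prove that the value still matches reality.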
Timeliness (availability). Refers to the extent to which data are available in the expected time frame.
Credibility. Reflects the degree to which data values are regarded as true and believable by users and data consumers.
Ease of manipulation. Reflects the extent to which data are easy to manipulate and apply to different formats.
Maintainability. The measure of the degree to which data can be easily updated, maintained, and managed.
Representational category
Interpretability. The degree to which data are presented in appropriate language, symbols, and units of measure.
Consistency. Consistency reflects the plausibility of data values, that is, the extent to which data are presented in the same format within a record, a data file, or a database, and to which semantic rules are preserved throughout the system.
Conciseness. Reflects the extent to which information is compactly represented without losing completeness.
Conformance / Alignment. Refers to whether data are stored and presented in a format that is consistent with the domain values.
Usability. Reflects the extent to which information is clear and easily used.
Access category
Availability / Accessibility. The ease with which data can be consulted or retrieved by users or programs.
Confidentiality. The degree to which disclosure of data should be restricted to authorized users. Relates to the security dimension.
Security. The degree to which access to information is appropriately restricted.
Traceability. Availability of the data lineage, that is, the possibility to identify the source of the data and the transformations they have passed through.