We have to assume that the input data may be stored either in one multi-model DBMS or in a polystore, i.e., a set of single-model or multi-model systems. To infer a multi-model schema, we first need to process all types of input data. In addition, for some systems, a (partial) user-defined schema may exist. Or, for some models, a verified single-model inference approach may exist. These results can be integrated into the multi-model result.

First, in “Basic building blocks” section, we introduce two basic building blocks of the proposed approach:

  1. 1

    A type hierarchy used for inference of data types and

  2. 2

    A unifying representation of a single record or a possibly existing schema called Record Schema Description.

Next, we provide an overview of the general workflow of the algorithm in “Schema inference workflow” section. Then, we introduce in detail the inference approach, namely its record-based (“Record-based local inference algorithm” section) and property-based (“Property-based local inference algorithm” section) version. Finally, we discuss the process of inference of integrity constraints (“Gathering candidates for ICs and redundancy” section).

Basic building blocks

To be able to process all the considered data models using a single unified approach, we propose two auxiliary structures – the type hierarchy (see Type hierarchy” section) and the record schema description (see “Record Schema Description” section). The type hierarchy enables us to cope with the problem of distinct sets of simple types used in the models (or their system-specific implementations). The record schema description enables one to describe a schema of one kind (including the case of a schema of a single record) regardless of its model(s).

Type hierarchy

The type hierarchy is a simple tree-based representation of basic data types that can occur in the input models and their mutual relationships. It enables us to quickly find a common supertype that covers a set of data types even though not all of them are supported in each model.

The data types supported across the data models, forming the set ({mathbb {T}}’), are represented as vertices (v_i) and the natural hierarchy between the types is represented as edges (e : v_j rightarrow v_i), when values of (v_j) involve also values of (v_i). For example, the fact that String is a generalization of Integer is represented as String (rightarrow) Integer. Additionally, we assign a unique prime (p_i) or number 1 to each vertex (v_i). The numbers are assigned using the BFS algorithm, starting from 1 assigned to the root of the hierarchy and ensuring that (p_i < p_j) if (e : v_i rightarrow v_j). The integer representation of node (v_i) is then computed as (T_i = prod _{j=0}^{i} p_j), where (p_0, dots , p_i) are prime numbers (or 1 assigned to the root) assigned to vertices (v_0, dots , v_i) on the path from root node (v_0) to (v_i). The concept of union type UT, is recursively defined as a union (denoted using operator (oplus)) of types (T_i, i = 0, …, n), i.e., (UT := oplus _{i=0}^n T_i), where (T_i) is a simple type or a union type. We will denote the set of all types, i.e., both basic and union types as ({mathbb {T}}).

Next, we can introduce an associative and distributive operator bestGeneralType, defined as (T_i sqcup T_j := gcd(T_i, T_j)), where gcd is the greatest common divisor of prime products (T_i) and (T_j) representing the best general type of (T_i) and (T_j).

In addition, we introduce additional associative and distributive operator typeMerge, denoted as (diamond), defined as:

  • (T_i diamond T_j := T_i oplus T_j) if (T_i ne T_j)

  • (UT diamond T := UT oplus T) if (T nsubseteq UT)

  • (UT diamond T := UT) if (T subseteq UT)

  • (UT_1 diamond UT_2 := UT_1 oplus UT_2) if (UT_1 nsubseteq UT_2)

  • (UT_1 diamond UT_2 := UT_1) if (UT_2 subseteq UT_1)

Example 4.1

Let us illustrate an example of the type hierarchy of data types used in Fig. 1. The hierarchy illustrated in Fig. 10 is represented as a tree having an artificial root AnyTypeFootnote 10 and then the natural type hierarchy. As we can see, we do not consider the hierarchy typical for programming languages, i.e., an object being a supertype. Instead, we consider representation hierarchy, i.e., natural conversions between data types. For example, since anything can be represented as a String, it is the root of the tree. Additionally, String is assigned with 1, i.e., String is a supertype of all types.

As we can also see, each node is assigned with prime (in the BFS order), whereas the data type itself is represented as a product (corresponding to the path from String to the data type). For example, (String := 1), (Tuple := 1 times 2 times 19 times 43), and (Double := 1 times 11 times 31).

Additionally, in the example we also represent any ComplexType as an even number, because (Collection := 1 times 2) is a supertype of all complex types. This allows us to quickly distinguish simple and complex types using binary operations (lowest bit value test).

Fig. 10
figure 10

An example of Type Hierarchy

Example 4.2

Having the data type hierarchy from Fig. 10, we can represent a union type as a union of data types represented as products. For example:

  • A union type (UT_s := Double oplus Long) of two simple types is computed as a simple union of products, i.e., (UT_s := (1 times 11 times 31) oplus (1 times 11 times 37))Footnote 11.

  • A union of two complex types (UT_c := Tuple oplus Map) is represented as (UT_c := 2 times 19 times 43 oplus 2 times 23).

  • Finally, a union of two union types (UT_u := UT_s oplus UT_c) is represented as a union (UT_u := 11 times 31 oplus 11 times 37 oplus 2 times 19 times 43 oplus 2 times 23).

Example 4.3

Computing the best general type can be done as follows:

  • Having a union type of simple types (UT_s := 11 times 31 oplus 11 times 37), the best best general type is computed as (gcd(11 times 31, 11 times 37) = 11 cong Number).

  • Having (UT_c := 2 times 19 times 43 oplus 2 times 23), the best general type is computed as (gcd(2 times 19 times 43, 2 times 23) = 2 cong Collection).

  • Finally, the best general type of (UT_u := 11 times 31 oplus 11 times 37 oplus 2 times 19 times 43 oplus 2 times 23) is (gcd(11 times 31, 11 times 37, 2 times 19 times 43, 2 times 23) = 1 cong String).

Note that the implementation of finding the best general type may follow the hierarchical structure of the tree. Therefore it is not needed to compute (gcd()) in an explicit way (with the exponential complexity of the algorithm). Instead, we traverse the tree from its root, and we try to divide all the type representations by a prime assigned to the node. If it returns an integer, we traverse deeper. Otherwise, we try the sibling nodes. Having all the siblings processed and no subtree to traverse, the gcd is found.

Record schema description

The Record Schema Description (RSD) enables us to describe a schema of one kind regardless of its model(s). It naturally covers also the case of a trivial schema of a single record. So, in the proposed multi-model inference process it serves for representation of:

  1. 1.

    All types of input schemas, i.e.,

    1. (a)

      an existing user-defined schema,

    2. (b)

      a single-model schema inferred using a single-model approach, and

    3. (c)

      a basic schema inferred using the local schema inferrer in the remaining cases, i.e., for kinds without a schema,

  2. 2.

    Intermediate schemas created during the inference process by the global schema inferrer, and

  3. 3.

    The resulting multi-model schema that is transformed to a required output form.

The RSD has a tree structure, and it describes the structure of a property, a record, or a kind (i.e., a set of records) because all the three cases can be treated in the same way, as we have discussed above. The root of an RSD corresponds to a root property of the respective data model (e.g., the root XML element or the anonymous root of a JSON hierarchy), or it is an artificial root property with trivial settings encapsulating the properties (e.g., in the relational or graph model). An RSD describing a property (or a record or a kind) p is recursively defined as a tuple rsd = (name, unique, share, id, types, models, children, regexp, ref), where:

  • name is the name of property p extracted from the data (e.g., _id, person) or it can be anonymous (e.g., in case of items of JSON arrays or an artificial root property).

  • unique is the IC specifying uniqueness of values of p. Its values can be T (true), F (false), or U (unknown) for intermediate steps of the inference process.

  • share = ((share_p), (share_x)) is a tuple, where (share_p) is the number of all occurrences of property p and (share_x) is the number of parent properties containing p at least once. Note that (share_p > share_x) reflects so-called repeating property, i.e., property forming an element of an array. Also note that if we combine (p.share_x) with (pp.share_p) (of any type), where pp is the parent property of p, we get the optionality of p, i.e., (p.share_x = pp.share_p) reflects a required property, while (p.share_x < pp.share_p) reflects an optional property.

  • id is the IC specifying that the property is an identifier. Its values can also be T, F, or U with the same meaning.Footnote 12

  • types is a set of data types that cover the property. For a simple property it involves simple data types (i.e., String, Integer, …). For a complex property it involves the following values:

    • Array, i.e., ordered (un)named (not) unique child properties (e.g., child elements in XML or items of arrays in JSON),

    • Set, i.e., unordered unnamed unique child properties (e.g., items of Set in column store Cassandra), and

    • Map, i.e., unordered named unique child properties (e.g., attributes of a relational table).

    As we will see later, in the final phase of the inference process, the set is reduced to a single resulting datatype.

  • models is a (possibly empty) set of models (JSON = JSON document, XML = XML document, REL = relational, GRAPH = graph, COL = column, KV = key/value) that involve the property. If the set contains more than one item, it represents cross-model redundancy. If the value of models within a child property changes, it corresponds to embedding one model to another.

  • children is a (possibly empty) set of recursively defined child properties.

  • (Optional) regexp specifies a regular expression over the set children, or its subset (e.g., in case of schema-mixed systems or case of XML elements and attributes, forming together child properties).

  • (Optional) ref specifies that the property references another property. Since we do not specify any restriction on the referencing and referenced property models, we also cover self-references and the third possible combination of multiple models, i.e., inter-model references.

Example 4.4

Figure 11 provides sample RSDs of kinds from Fig. 1 (having the respective colors) in their textual form. Each node is described as a tuple of values of its above-listed components (in the given order) in curly brackets. If the set children is not empty, in curly brackets, there occur the child properties described in the same way.

Fig. 11
figure 11

Having the unifying representation of all possible types of input data having any of the supported models, we can propose a much more straightforward multi-model inference approach. It is based on an essential feature of RSDs – the fact that two RSDs can be merged to describe a common schema of the respective kinds. The merging strategy is a part of the proposed approach (see “Merging of RSDs—function merge()” section).

Schema inference workflow

The proposed inference process takes into account the following features and specifics of the multi-model environment:

  • Various aspects of the combined models and their specifics known for popular multi-model DBMSs [50] (such as sets/maps/arrays/tuples, (un)ordered properties, various treatments of missing values etc.),

  • Local ICs, various types of redundancy (both intra-model and inter-model), and intra/inter-model references,

  • (Partial) schemas required/allowed in selected models or inferred single-model schemas,

  • Possible, but not compulsory user interaction involving modification of suggested candidates (for ICs, redundancy etc.) and specification of non-detected cases, and

  • Processing of Big Data, i.e., maximum possible reduction of unnecessary information and parallel processing.

The input of the inference process is formed of the following:

  1. 1.

    A non-empty set of single/multi-model DBMSs ({mathcal {D}}_1, {mathcal {D}}_2, …) which together contain a set of kinds (kappa _1, kappa _2, …, kappa _{N}). Each kind is associated with its model(s). For each model supported in a particular DBMS ({mathcal {D}}_i) we also know whether it is schema-less/full/mixed and whether the order of sibling properties of a kind must be preserved.

  2. 2.

    A (possibly empty) set of predefined schemas (sigma ‘_1, sigma ‘_2, …,) (sigma ‘_{n}), (n le N), (partially) describing selected kinds.

  3. 3.

    A (possibly empty) set of user-specified input information which can be of the following types:

    1. (a)

      A redundancy set of kinds (RK = { kappa _1, kappa _2, …, kappa _{r} }, r le N) which describe the same part of reality, i.e., they will have a common schema (sigma). (Note that there is no restriction on the models the kinds in R can have. On the other hand, we do not know the schema of all kinds at this stage, so the redundancy cannot be specified at a higher level of granularity.)

    2. (b)

      A simple data type assigned to a selected property.

    3. (c)

      A local IC assigned to a selected property. The possible constraints involve identifier, unique, or (not) null.

    4. (d)

      A reference represented by an ordered pair of properties where the first one is the referencing property and the second one is the referenced property.

In other words, the user specifies the data on which the inference process should be applied. Depending on the type of the database system where it is stored, i.e., its specific features, the inference process can re-use an existing user-defined schema, it knows whether the order of siblings should be preserved, etc. Eventually, at the beginning or during the inference process (see “Architecture and implementation” section), the user can provide a partial inferred schema, user-specified simple data types, ICs, and references for selected properties, as well as redundantly stored kindsFootnote 13.

The general workflow of the inference process has two main phases—local and global. In the local phase, the process assumes as the input a large set of data, and the task is to reduce the information in parallel efficiently, i.e., we infer basic local schemas for each kind. The aim of the global phase is to merge the local schemas and enrich them with additional information gathered from the input data (i.e., ICs and references). Since we can assume that the number of all kinds in the whole multi-model schema is several orders smaller than the amount of input data, this phase does not need to be parallelised.

The workflow consists of the following stages:

  1. 1.

    Local schema inferrer For each kind (kappa) we generate its local RSD as follows:

    1. (a)

      If (kappa) has a predefined schema (sigma ‘_{kappa }), we transform it into RSD representation.

    2. (b)

      Otherwise, we generate for (kappa) a basic RSD using a parallel approach as follows:

      1. (i)

        We generate a trivial RSD for each record (or property, depending on the selected type of the algorithm—see “Record-based local inference algorithm” section and “Property-based local inference algorithm” section) of κ.

      2. (ii)

        We merge (see “Merging of RSDs—function merge()” and “Forest appender—function addToForest()” section) trivial RSDs of κ and eventually all kinds in its respective redundancy set to RKκ a basic RSD.

  2. 2.

    Gathering of footprints Parallel to the local schema inference, for each kind (kappa) we gather its auxiliary statistics, called footprints (see “Gathering candidates for ICs and redundancy” section), as follows:

    1. (a)

      Phase map We gather footprints for each value of each property (p_i^kappa) of (kappa).

    2. (b)

      Phase reduce We merge all footprints of values of each property (p_i^kappa), resulting in an aggregated footprint of property (p_i^kappa).

    3. (c)

      Candidate Set We apply a set of rules on the merged footprints of each property to produce a set of candidates for redundancy, local ICs, and references. Note that when the structure of kinds is inferred, the redundancy can be specified at the level of (complex) properties. A redundancy set of properties (RP = { pi _1, pi _2, …, pi _{s} },) (s in {mathbb {N}}), where each property is a part of some kind regardless its model, describes the same part of reality.

    4. (d)

      The user can confirm/refute the candidates at this stage or add new ones.

  3. 3.

    Global Schema Inferrer Having a unified RSD representation and footprints for each input kind (kappa), we generate the final multi-model schema as follows:

    1. (a)

      (Optionally), we perform the full check of candidates. It is not done if the user confirms the candidates.

    2. (b)

      We merge all redundancy sets of properties, i.e., for each property (pi _i in RP_j), (i,j in {mathbb {N}}) we extend its schema by joining RSDs of all properties in (RP_j).

    3. (c)

      We extend the RSDs with discovered references.

    4. (d)

      We create the final multi-model schema formed by all inferred RSDs.

  4. 4.

    We transform the resulting set of RSDs and respective ICs to the user-required output.

Next we introduce two versions of the local inference algorithm—record-based and property-based. The former follows the usual strategy in the existing works, i.e., “horizontal” processing; the latter introduces a “vertical” optimisation for complex data.

Record-based local inference algorithm

The more intuitive approach, so-called Record-Based Algorithm (RBA), considers a record, i.e., the root property including all its child properties, as a working unit. The input of Algorithm 1 consists of the particular database wrapper (w_D) (implementing specific behaviour and features of the system D) and set (N_D) of names of kinds whose schemas are to be inferred. Having initiated an empty schema S (i.e., a forest of RSDs), the algorithm infers the schema of each kind (kappa) during three logically different steps:

  • Preparation phase The data is first loaded using a particular database wrapper (w_D), and then each record is in parallel mapped into an RSD describing its trivial schema. The result of the preparation phase is a collection R (possibly containing duplicities) of RSDs describing schemas of individual records within kind (kappa).

  • Reduce phase Next, the collection R is merged using function merge() (see “Merging of RSDs—function merge()” section) in parallel into single (r_kappa) describing the basic RSD of kind (kappa).

  • Collection phase The inferred schema (r_kappa) is added to set S.

Having all the kinds processed, the resulting schema S is returned.

figure a

Merging of RSDs—function merge()

During the merging process we merge the information, and we modify respectively the regular expression describing the data. In particular, having two RSDs (rsd_1) = ((name_1), (unique_1), (share_1), (id_1), (types_1), (models_1), (children_1), (regexp_1), (ref_1)) and (rsd_2) = ((name_2), (unique_2), (share_2), (id_2), (types_2), (models_2), (children_2), (regexp_2), (ref_2)), the merging process creates the resulting RSD rsd = (name, unique, share, id, types, models, children, regexp, ref) as follows:

  • Within local stage, names (name_1), (name_2) are always equal, i.e., only properties having the same name are merged, therefore (name := name_1).

  • unique is set to minimum value of (unique_1), (unique_2), where F < U < T. In other words, the fact that a property is not unique (i.e., either (unique_1) or (unique_2) is set to F) cannot be changed. If neither (unique_1) nor (unique_2) is set to F but at least one is set to U, we have to wait to finish parallel checking of the data. Otherwise, having (unique_1) and (unique_2) set to T, the resulting unique is set to T.

  • share := ((share_p), (share_x)), where (share_p := share_{p_1} + share_{p_2}) and (share_x := share_{x_1} + share_{x_2}).

  • Similarly to unique, the same principle applies for id.

  • (types := types_1 diamond types_2) (see “Type hierarchy” section).

  • (models := models_1 cup models_2).

  • (children := children_1 cup children_2), whereas the child properties with the same name are recursively merged too.

  • regexp is the result of merging of regular expressions (regexp_1) and (regexp_2) using existing verified schema inference approaches [4, 5].Footnote 14 If (regexp_1 = epsilon), then (regexp := regexp_2). Else (regexp := regexp_1).

  • If (ref_1 = ref_2), then (ref := ref_1). Otherwise, it has to be resolved by the user/default settings.

figure b

Property-based local inference algorithm

Alternatively, instead of a whole record, the working unit of the local inference process can be a single property, whose eventual child properties are processed separately too. This strategy is not common in the existing approaches; however, it may lead to a performance boost (as we show in “Experiments” section) when, e.g., the record is highly structured and contains a large number of nested properties. Moreover, this version of the algorithm can be merged with the mining of footprints (see “Gathering candidates for ICs and redundancy” section) into a single algorithm.

To work with individual properties instead of records, we need to be able to distinguish their RSDs. Therefore, we introduce the notion of a hierarchical name that uniquely identifies each property p by concatenating the names of properties on the path from the root property to property p, where each step is separated by a delimiter ’/’ (slash). As the 0th and 1st steps, we use the name of the system and the kind where the property occurs. Additionally, if the property is an anonymously named element of an array, the name of the property consists of ‘_’ (underscore) and its data type.

Example 4.5

For example, property productId from the JSON document model in Fig. 9 has hierarchical name: /mongoDB/Order/items/_Object/productId

The input of the Property-Based Algorithm (PBA) (see Algorithm 3) also consists of the particular database wrapper (w_D) and set (N_D) of names of kinds whose schemas are to be inferred. Having initiated an empty schema S, the algorithm processes each kind (kappa) as follows:

  • Preparation phase For each property p of each record, the hierarchical name of the property and the trivial RSD of the property is extracted from the data in parallel, forming a collection NP of pairs ((name_p, rsd_p)). Next, the grouping according to (name_p) is performed, resulting in the set GP of pairs ((name_p, P)), where P is a set of RSDs of properties with the same (name_p).

  • Reduce phase Next, each collection P is in parallel aggregated using function aggregateByHierarchicalName() (see “Aggregating of RSDs—function aggregateByHierarchicalName()” section) into (rsd_p) describing the basic RSD of property p with hierarchical name (name_p). The resulting set of pairs ((name_p, rsd_p)) is denoted as AP.

  • Collection phase: Set AP is iterated and each (rsd_p) is added into schema S, continuously enriching and building the schema of kind (kappa) using function addToForest() (see “Forest appender—function addToForest()” section).

Finally, the resulting schema S is returned as the result.

figure c

Aggregating of RSDs—function aggregatebyhierarchicalname()

Generation of a basic (local) RSD consists of generating an RSD for each property and their aggregation into a common property schema. During the process, we aggregate the information, and we modify respectively the regular expression describing the order of the nested properties.

As we can see in Algorithm 4, having a collection of RDSs (rsd_i) = ((name_i), (unique_i), (share_i), (id_i), (types_i), (models_i), (children_i), (regexp_i), (ref_i)), (i = 1, …, n), the aggregation process creates the resulting RSD (rsd_p) = (name, unique, share, id, types, models, children, regexp, ref) corresponding to a schema of property p as follows:

  • The names (name_i) are always equal, i.e., only the properties having equal hierarchical name are aggregated, therefore (name := name_1).

  • Unique is set to minimum value of (unique_i), (i = 1, …, n), where F < U < T.

  • Share is set to the sum of shares, i.e., share := ((share_p), (share_x)) = ((sum ^n_{i=1}share_{p_i}), (sum ^n_{i=1}share_{x_i}))

  • Similarly to unique, the same principle applies for id.

  • (types := types_1 cup … cup types_n), whereas if there appear two types (t_i, t_j in types), s.t. (t_i subset t_j), then (t_i) is removed from types.

  • (models := bigcup ^n_{i=1}{models_i}).

  • regexp is the result of merging regular expressions (regexp_1,) (dots ,) (regexp_n) using an existing verified approach.

  • Within the aggregate function, children is always an empty set as individual children have their own hierarchical name and thus are processed separately. Its content is resolved in later stage of the algorithm within function addToForest() (see “Forest appender—function addToForest()” section).

  • For all (ref_1, dots , ref_n) either (ref_i = epsilon) or all the values of (ref_i) are equal, therefore (ref := ref_1) is selected. If (ref = epsilon), the references are resolved after applying the candidates stage (see “Global phase” section).

figure d

Forest appender—function addToForest()

The purpose of this function (see Algorithm 5) is to join RSDs describing the schema of particular properties to form an RSD corresponding to a schema of the whole kind. Moreover, the RSDs describing a schema of a single kind are grouped into a forest. Having pairs ((name_p,rsd_p)) which are alphabetically ordered according to (name_p) (in ascending order), the parent property is always included in the schema S before its children (note that locally the schema is a tree). If the properties are not ordered, then if any parent property is missing, we can insert an empty placeholder of the not-yet-processed parent allowing it to include its children. As soon as the parent is being processed, it replaces the placeholder.

figure e

Gathering candidates for ICs and redundancy

To detect integrity constraints and redundancy efficiently, we utilise a two-stage approach:

  1. 1.

    We efficiently detect a set of candidates.

  2. 2.

    The user can confirm/refute them or request a full check.

For this purpose, we introduce a set of lightweight and easy to compute footprints and we apply them to compute candidates for ICs (i.e., primary keys, intra- and inter-model references, and interval-based value constraints) and redundancy in data. A naive approach would compare active domains of all pairs of properties. Instead, when walking through all the data during the schema inference process, the same access can be exploited to mine statistical (and other) information about the active domains, i.e., the footprints. They can be then used to compare active domains and determine the desired integrity constraints more efficiently.

Property domain footprint (PDF)

For each property p, we define an active domain descriptor utilizing basic statistics and the Bloom filter [54], so-called Property Domain Footprint (PDF). It is represented as a tuple PDF = (count, first, unique, required, repeated, sequential, min, minHash, max, maxHash, totalHash, averageHash, bloomFilter).

  • Count is the number of items of the active domain.

  • First is the number of parent properties in which p occurs.

  • (unique in {texttt {T}, texttt {F}}) represents the uniqueness of values within a particular active domain. It is computed by counting the occurrence of each item of the active domain.

  • (required in {texttt {T}, texttt {F}}) represents the nullability of the value of a particular property. It is computed by comparing p.first of property p with pp.count of its parent property pp, i.e., (required := (p.first = pp.count)).

  • (repeated in {texttt {T}, texttt {F}}) represents whether the property is a direct element of an array (T) or a single value (F). It is computed using auxiliary features count and first, i.e., (repeated := (count div first > 1)).

  • (sequential in {texttt {T}, texttt {F}, texttt {U}}) represents the possibility of an active domain of a simple property to represent a sequence of integers. The default value is U (if the property is not of data type Integer). The sequential feature is computed using auxiliary features min, max, and count, i.e., (sequential := (max – min = count – 1)).

  • min is the minimum value of the active domain.

  • minHash is the minimum hashed value of the active domain. It allows to compare values of distinct data types efficiently and prevents the comparison of possibly extensive data, e.g., BLOBs.

  • max is the maximum value of the active domain.

  • maxHash is the maximum hashed value of the active domain.

  • totalHash is the sum of hashed values of the active domain.

  • (averageHash := (totalHash div count)) represents the average of hash of unique values within the active domain.

  • bloomFilter is an array of “small” size (sigma) describing a much larger active domain K of property p at the cost of false positives (i.e., equal hashes of two different values). Using multiple hash functions (H_1(), dots , H_n()) returning values from ({1, dots , sigma }), each distinct value (k in K) is hashed and each value of (bloomFilter[H_i(k)]) is incremented.

PDF miner algorithm

The purpose of the footprint miner (see Algorithm 6) is to create a PDF for each property. First, the data are loaded from the database store in the form of records using a particular database wrapper. For each property p of each individual record a footprint f is created describing a single value of active domain of a certain property. The hierarchical name (name_p) is attached to each footprint instance. Next, the instances are merged to create distinct unique sets of each active domain using function mergeValueDuplicates(). Then, the distinct values are first grouped by function groupByKey(), resulting in set GP of pairs ((name_{p_i}, F_i)), where (F_i = {f_{i_0}, dots , f_{i_n}}) is the set of footprints. Second, they are grouped to determine the footprint (f_p) describing the whole active domain. To do so, merge function aggregateByHierarchicalName() (see Algorithm 7) is applied.

figure f
figure g

Finally, the aggregated property features are appended to the tree structure representing the data and the missing features (i.e., repeated, sequential, and averageHash) are resolved.

Candidate builder algorithm

Having computed footprint (f_p) for each property p, we can determine candidates for identifiers, references, and data redundancy. We propose Algorithm 8 that consists of three phases:

  1. 1.

    Identifier candidate The candidates for identifiers (C_{ident}) are inferred from the footprints in set F. An identifier must be unique within the active domain of the respective property p and required. Also, the property can not be a direct element of an array. Therefore, the algorithm tests whether (f_p.unique = f_p.required = texttt {T}) and (f_p.repeated = texttt {F}).

  2. 2.

    Reference candidate Candidates for references (C_{ref}) are inferred on the basis of several observations. A reference is a property that refers to the identifier of another kind. Therefore, we search the Cartesian square (F times C_{ident}) excluding pairs (cc), (c in C_{ident}) in order to find pairs (fc), s.t. f is the footprint of the referencing property (p_f) and c is the footprint of the referenced property (p_c). Additionally, the active domain of referencing property (p_f) must form a subset of active domain of the referenced property (p_c). Therefore, we compare active domains of both the properties using function formsSubset() (see Algorithm 9), i.e., we analyse footprints f and c using the following rules:

    • The referencing property does not have to be strictly unique, as the one-to-many (as well as many-to-one or many-to-many) relationship may occur.

    • The referencing property does not have to be required as the lower bound of relationship may be zero-to-one/many.

    • It must hold that (f.minHash ge c.minHash), i.e., the referencing property (p_f) does not contain a smaller value than the referenced property (p_c).

    • Similarly, it must hold that (f.maxHash le c.maxHash).

    • Finally, only if all the above conditions are satisfied, Bloom filters are compared. To denote that property f is a reference to property c for each pair of elements f.bloomFilter[i],  c.bloomFilter[i] it must hold that f.bloomFilter[i] (le) c.bloomFilter[i]. In other words, active domain of (p_f) must be a subset of active domain of (p_c).

    Additionally, we distinguish between strong and weak reference candidates. Having sequential, unique, and required set to T for both the referencing and referenced properties may imply that there is no relationship between the properties (i.e., both properties form a technical (auto-incremented) identifier of their kind). Therefore, such a combination may lead to a weak candidate for a referenceFootnote 15.

  3. 3

    Redundancy candidate Finally, reference candidates may be extended into data redundancy candidates (C_{red}). Naturally, each pair of referencing and referenced properties having footprints f and c store redundant information. However, we assume that redundantly stored data should cover a more significant part. Hence, we check their descendants and siblings to find more pairs of properties whose active domains form a mutual subset. If there is at least k pairs of neighbouring properties forming such subsets, the reference candidate (fc) is marked as a weak reference candidate and, together with its neighbourhood, extended into a redundancy candidate. If for all the pairs of properties in the redundancy candidate the active domains are equal, we speak about full redundancy. Otherwise, i.e., when one kind contains only a subset of records of another kind, it is a partial redundancy. Also, note that only redundant properties are considered as a part of redundancy, even though both kinds may contain properties having the same name. If multiple properties can form a redundancy pair with the same property, it is up to the user to decide.

figure h
figure i

Example 4.6

Figure 12 introduces an example of footprints of selected properties of the multi-model data from Fig. 9 and their application for detection of candidates. The upper part of the Fig. contains the data, i.e., a subset of properties from relational table Product (violet), nested documents _ from JSON collection Order (green) and nested elements Product from XML collection Invoice (grey). The bottom three parts correspond to the three phases, where we can see the values of respective necessary features of footprints.

In order to determine all the footprint features, the duplicate values are removed, the unique values are hashed (using, e.g., rolling hash (rh(v) := v_{rh}) for each (v in {mathbb {V}})) and then minHash, maxHash, averageHash, and bloomfilter are computed as was described. averageHash is rounded to floor(averageHash) and to compute the Bloom filter (of size = 2), hash functions (h_1(v_{rh}) := v_{rh} mod 2) and (h_2(v_{rh}) := floor(sqrt(v_{rh})) mod 2) are used.

Regarding the detection of candidates, first all footprints are iterated and their unique, required, and repeated features are checked. In this particular case, only the footprint of property id satisfies that unique and required are both set to T and repeated is set to F, therefore only a single identifier candidate is created and propagated to the next phase of the algorithm. Also note that id is probably a technical identifier based on auto-increment (i.e., sequential is also set to T).

Next, all relevant distinct pairs of footprints are compared to find candidates for references. If any footprint feature does not satisfy the requirements for a reference, it is denoted by the red colour. As we can see, only properties productId (JSON) and productId (XML) satisfy the requirements and therefore form the set of reference candidates (C_{ref}).

Finally, reference candidates are checked to form redundancy candidates, whereas (k = 2). In this case, we compare siblings (as there are no further nested properties) of pairs (id, productId) (JSON) and (id, productId) (XML). In the former case, there is a redundancy between properties title (REL, JSON), brand (REL, JSON), and price (REL, JSON). In the latter case, there is a redundancy only between properties title (REL, XML). In the former case, the number of pairs (3 ge k), therefore, the reference candidate is extended into the redundancy candidate, and the former reference candidate is marked as weak. In the latter case, the reference candidate remains unchanged as (1 < k), and it does not form the candidate for redundancy.

Also note that (idtitlepricebrand) (REL), (productIdtitlepricebrand) (JSON) form a partial redundancy, since multiple requirements are violated, e.g., features average are not equal (see the bold font).

Fig. 12
figure 12

An example of selected footprints and building of candidates

Fig. 13
figure 13

Example of redundancy, k=1, k=2, k=3

Global phase

The local phase consists of inference of local (single-system, single-kind) schemas described as tree-based RSDs and building a set of candidates for identifiers, references, and redundancy. The global phase applies the knowledge gained in the previous steps and joins RSDs using candidates for references and redundancy into the resulting global multi-model schema. It can also begin with an optional full check of candidates, i.e., removing false positives.

Checking of candidates

Depending on the implementation, either the user may confirm/refute the suggested candidates (or denote user-specified candidates) using an interactive interface (see “Architecture and implementation” section). Or, (s)he may decide which subset of candidates for references and redundancyFootnote 16 will be thoroughly checked. The checking itself is implemented as a distributed MapReduce job:

  • Checking of references involves mapping of each value of a referencing property ref into tuple ((value, texttt {REF})) while each value of referenced property id is mapped to tuple ((value, texttt {ID})) as well. Reduction and further mapping by key takes place as follows:

    • If the list assigned to value contains both REF and ID, the result is mapped to 0.

    • If the list assigned to value contains only ID, the result is mapped to 1.

    • Otherwise, the result is mapped to (-1).

    Finally, the minimum value is selected. If the result is (-1), the candidate for reference is not valid and removed. If the result is 0, it is a full reference (i.e., the active domains of ref and id are equal). The result of 1 denotes that property ref refers to a subset of the active domain of property id.

  • Checking of redundancy is similar; in addition we have to check values of all neighbouring redundant properties of the referencing and referenced properties ref and id. First, for each record its redundant properties are mapped to tuple (value({red_subrecord, source})), where value is the value of referencing/referenced property, (red_subrecord) are values of ordered redundant properties in the record, and (source in {texttt {REF}, texttt {ID}}). Next, the tuples are reduced by key and then mapped as follows:

    • If the list assigned to value contains two equal sub-records from distinct sources, the result is mapped to 0. If the sub-records are not equal, the result is mapped to (-1).

    • If the list assigned to value contains only sub-record from the source REF, the result is mapped to (-1).

    • Otherwise, the result is mapped to 1.

    Finally, the minimum value is selected. If the result is (-1), the candidate for redundancy is not valid, i.e., either there is a sub-record in the kind containing the referencing property ref that is not a part of the kind containing referenced property id, or the sub-records with the same value do not share the same values in all redundant neighbouring properties. If the result is 0, the redundancy is full. Otherwise, the redundancy is partial.

Joining of RSDs

Joining of RSDs may be implemented variously, depending on the selected output of the inference process. Either the RSDs, together with the confirmed/thoroughly checked candidates for identifiers, references, and redundancy, are transformed to an output format, such as XML Schema [55], JSON Schema [47] etc. Or, a more abstract representation using ER [51] or UML [9] can be used. How the information is captured depends on the selected format. For example, for representation of redundancy, the globally defined XML elements and respective XML references can be used. In the implementation of the inference approach (see “Architecture and implementation” section), we also use a simple visualisation of the forest of RSDs, where the identifiers, as well as redundant parts of the trees, are respectively graphically denoted, and the trees are interlinked with a new type of edges representing the references.

Example 4.7

Depending on the selected parameters of the algorithm, we can get as a result, e.g., the ER model depicted in Fig. 1. If we look closely, e.g., at entity Product, in Fig. 13 we can see an example of its alternative result depending on parameter (k = 1) or (k in {2,3}). The colours represent the respective “overlapping” models and properties (relational, JSON document, or XML document). If (k = 1), we would get one common schema for kind Product represented in all the three models. If (k in {2,3}), we would get a common schema for the relational and JSON document model and a different schema for the XML document model.

Note that property productId is an identifier only in the relational table Product (illustrated by the purple colour). Also, note that non-redundant property price has probably a different meaning in distinct models. In the case of the XML document model, it could be a purchase price, whereas, in the relational and JSON document model, it could be a selling price.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.springeropen.com/)