A Schema-less Data Model for Extensible Data Types

Persisting data is one of the most common operations in programming, but at the time, one of the most misunderstood and difficult to implement without mixing paradigms. The basic problem—I speculate— is for the circumstance that persisting data is present in almost every application we use today. That encourages the idea that the databases should be the hub of all the programming discipline and applications. But that line of thinking was not always the case. The entrance of the No-SQL databases and other persistence techniques, besides the ubiquitous relational databases, motivates the return to treat storage as always should be: an output mechanism, an output device for applications.

The proposal below is about a storage data model. It presents the structure of an ideal storage for any type of data in a strictly modular way. The model represents any subject with common or fixed properties and history, without an explicit expression of its content. That is, as the model shows, the main feature: separate structure from content. We, as programmers, should construct the structure of the software entities, and treat variations as content, details to be posponed or lately bind to those structures. As such, the proposed model should be used by many different and disparate applications by simply translating, adapting, the software objects, to record entities.

Schema-oriented databases

It’s well-known that relational database modelers try to represent real-world entities in tables. When you see a schema of those databases, immediately you can imagine the domain it covers, the relations between the entities involved, and so on. That kind of orientation to modeling databases could be called “schema-oriented” design: the model resembles the subject that matters.

This concession, nonetheless, evidence the mixing concepts of object-orientation with storage mechanism. The database doesn’t need to represent the world, the database is an output mechanism of the software pieces that represents the world under the object-oriented paradigm.

Schema-less orientation

There is another way to approach the persistence layer of business objects data. (There is more than one, of course.)

Years ago, with the supporting of XML in my favorite RDBMS, I incorporate XML columns in fixed-schema tables. The purpose was to enable new properties for exiting entities, adding multiple XML new schemas. A simple Employee entity could immediately have new attributes by defining XML content stored in those columns. Also, I could store complete objects related to the entities at very little cost. The same mechanism provides me with the language tools (XPath, etc.) to express or store the data as records without any special treatment from the clients.

That mixed solution—fixed schema part with a less fixed or schema-less part—is very common in different platforms. See, for example, the Content Management Systems. It’s regular for them to provide multiple and repeated columns as in Column1, Column2, Column3, etc. or indicating the data type as in String1, String2, String3,…, Date1,… Boolean1, etc.

Another attempt is to model generic tables with columns like: Entity, Attribute, Value, etc., and so on, defining the complete metadata of a data type.

In the lines below, I will show you one model that provides such extensibility and complying with well-established object-oriented principles.

A schema-less model

What do I want to say with the above title? What does it mean? In these lines I want to show a relational Data Model (DM), that is, a set of related entities (tables), supporting the storage of complex Data Types (DT) by entering new properties and values, with no schema changes or column addition. The new entities are supported by new content and not by new a structure.

The Subject

This DT starts with a root entity: the Subject

subject

This subject could be any entity. The simple representation is for illustration purposes only. A real subject could be any entity: Profile, Person, Provider, Employee, etc. In this table you could define any stable property of the subject (no-volatile properties), that can’t change over time. For example, the DOB (date of birth) and DOD (date of death) for a person entity. Other properties, as stable they might seem, as the identifiers of subjects, or also gender for humans, could change over the time.

The uuid should be a unique identifier implementation to uniquely identify the subject when multiple databases are supported. In those cases, we could use only this column as a primary key replacing the id column (commonly implemented as an integer by the auto-numbering mechanisms of the database system).

Subject Properties

Next we could see the subject properties:

subject_properties_history

This table subject_properties_history has two sets of attributes to denote time span (dates) or enabled qualities (start_at: datetime, start_end: datetime, is_active: boolean). Those attributes express the reality that the properties have a moving history that might be change over time or simply disappear. Notice—and this is extremely important— that this table doesn’t know about the details of the concrete property type values. But, as any entity, it also have a type. This type could be anything you want, but I have notice that is very useful to distinguish the whole set of historical properties as actual, archive, systemX, or any other similar criteria.

Abstract Data Types

But, before entering into the details of the property values, let’s review another part of the model. Lets see the representation of the data abstractions:

adt

This is the most simple type representation. A type that supports inheritance. With this model you could define hierarchies as Person -> Employee, Country -> State -> City, or vehicle->car. The adt_description attribute is optional (the declared intentions of the designer of the type). It’s metadata about the types.

Abstract Data Type Properties

But where are the data type properties? The properties or members of the ADT are

adt_properties

Notice that the adt_property table also has date and time attributes (property_start, property_end) to provide historicity to the abstract type properties. In such way, a property could became in the future, or has a limited lifespan. Also, see the numeric enum_value column to support enumerations or numeric constant named-values properties. For example:

ADT Enum: Color

ADT, Properties: Red=0, Green=1, Blue=2, etc.

Primitive and Common Types

The last property_data_type_id attribute is a special one. This attribute adds another constraint or link (besides the FK relation) with the ADT table to indicate the primitive or complex type of the property itself, because the properties are types themselves. For example, you could define a Person data type, with properties as FullName, NickName, DOB, IsCitizen, etc., and remark in the ADT table that they are Text, Dates, Booleans, respectively. The representation allows to define those also. I recommend that the first values you should define in the abstract_data_types are the primitive or basic data types for the properties as this:

primitive_adt_values

Pay attention to the names of those types. They try to be agnostic to any implementation of the model. Those values are general words, any word you like, to represent the common primitive or common types.

Subject Property Instances

Now we have all to link the subject properties history with the ADT properties. This relation is one of association. But not always a simple one. Some associations has qualities. There could be many properties during a subject history. Even the same property could appear more than one at the same time.

subject_property_tags

This relation is totally flexible. It represents any text property of the subject. Or, as the model allow, the tag_value might be empty for properties that are sufficient with no content. Look that the implementation is a many-to-many relation as any association is in the relational world. But also notice that the history and the properties types of the model has no knowledge of each other, besides the declaration of the history type. That’s the role of the linked tables. They act as abstractions to relate concrete entities.

With this separation, the historicity of the subject properties is not the same of the lifespan of the ADT properties. The details of how you want to manage this relation (expiring the subject property with the expiration of the ADT property, e. g.) depends of the model implementer.

The last part of the model is the zone where you store the other values. Those are the values of dates, numbers, booleans, binaries, etc., that are the subject property content.

all_subject_properties

The tables with the prefix subject_property_X store the values of the ADT properties to which the subject properties are associated. The tables know about the ADT properties and history, but it’s not reciprocally, the other way. The ADT table and its ADT property, and the history tables are totally decoupled from the environment. The other tables, that represents the implementation details, the concrete values, depend on those tables, instead. Details should depend of Policies, not vice versa.

See that we have one table for each type the model captures. The date data is stored has a range. Also, the numeric types. You could define more complex values (with multiple properties or ranges), but the real magic is to limit their content to one primitive type for all those values (numbers, dates, booleans, etc.) as in the numeric and date range tables above.

Evaluation of the Model with SOLID

The well-known SOLID principles (R. Martin) are a set of guidelines, or axioms, you might use as a quality-assurance mechanism for the models designed with object-orientation. Let’s review our model from that perspective.

SRP Single-Responsibility Principle

This principle states that a software module should have only one reason for change. That is, that the entities should be highly cohesive to one domain. As you could confirm, each entity (table) of the model is dedicated to one, and only one responsibility. They have only one reason for change and are highly cohesive. The application of the Normalization forms are a set of guidelines to relational tables design that result in such cohesive entities, if correctly applied.

OCP Open-Close Principle

Is this model open to extensions and closed to modifications? The model was created to just that principle in mind. Allowing the representation of any data type, and the creation of any type and property, the model separate the schema from the content, the interface of the database with the real or concrete values.

The expansiveness of the model is granted by adding new abstract types and properties as values in tables. But also by adding new subject_property_X tables to persist new set of values, without changing the others.

LSP Liskov Substitution Principle

Tables can’t inherit from other tables. The notion of super-types and sub-types is not present in the relational world. But, that doesn’t mean that the underline concept of inheritance is represented in some way.

In the model, the auto-reference in the abstract_data_types table allows a parent-child relation within types, as we already said above about hierarchies. Those records are equally treated and can replace one by the other. The other example of this parent-child relation is in the adt_properties table. The field property_data_type_id is a reference to another type, converting the properties in a kind of type also, that inherit from the primitive types or any other. With the model, this movement from the same table ADT or within its properties and parent types is transparent.

ISP Interface Segregation Principle

This principle encourages low-coupling and modularity. A strict application of the high-cohesion principle will imply the segregation of any fat interface into little multiple interfaces. In the model, instead of constructing entities with multiple columns and disparate attributes, we split them to a more granular schema around the primitives. That will result in a more decoupled relations, when the dependency of a client code of an entity of Person, for example, only has to retrieve from one simple, highly normalized, table than a bloated table with dozens of columns (its interface) he don’t need nor use.

From another perspective, managing little tables, improves performance when the query engines has to deal with little logical structures instead of heavy ones.

DIP Dependency Inversion Principle

This principle dictates that all concrete implementation package should be linked by abstractions. It’s the “low-coupling” principle applied to the packages.

The relational world doesn’t have abstract tables. All tables are concrete. But not all tables represent an entity attributes implementation. Not all tables store data of the world we want to persist. That is the case with all linked tables. Those tables decouples real-value entities (ADT properties and subject properties, as in the example) with a sort of link that allows the navigation back and froward, without the knowledge of the main tables. That simulates the role of the abstract or interface in the object-oriented world, when the concrete classes in the associated packages don’t know about the others directly, but by mediation of abstractions.

Conclusions

The data model I present here hides the business semantic from any viewer. It’s insufficient to see the database design model to reveal its content domain. The same model could be used to represent a Library managing system, or a Healthcare application.

The model could store any type of data, primitive or complex. The managing complication of the model is to create a Data-Access-Layer (DAL) with procedures, functions and views to retrieve or operate in the data, applying the semantic part of the solution. This DAL will reveal the domain of the database and its data storage rules. With the same storage model, you could design multiple DAL for different systems, extending (OCP again) the capabilities of the model to unexpected new applications.

Anuncios

Responder

Por favor, inicia sesión con uno de estos métodos para publicar tu comentario:

Logo de WordPress.com

Estás comentando usando tu cuenta de WordPress.com. Cerrar sesión /  Cambiar )

Google+ photo

Estás comentando usando tu cuenta de Google+. Cerrar sesión /  Cambiar )

Imagen de Twitter

Estás comentando usando tu cuenta de Twitter. Cerrar sesión /  Cambiar )

Foto de Facebook

Estás comentando usando tu cuenta de Facebook. Cerrar sesión /  Cambiar )

w

Conectando a %s