Applying object technologies to ETL solutions, part 1

Object technologies have been part of the programming discipline at least since the first soft-object (if I’m allowed to use this term) was mentioned. What is the software thing I call an object? What does it mean to use object technologies? This short article explores an implementation of object-oriented programming principles and patterns for natively procedural ETL solutions. In the following lines I will talk about the SOLID principles, the IPO model, the GoF patterns, the GRASP principles, and the most used data structures and algorithms for ETL. Although the list seems too dense for a quick treatment, I will keep it concise and limited to the most important features.

Two of the most used mechanisms in ETL solutions are the extraction of data from disparate sources into a text file destination and, conversely, the loading of plain text files (or similar sources) into databases. Although these procedures differ in scope, they share some similarities: we can think of each as the inverse of the other. Both are old, regular, and well-established mechanisms for exchanging data in B2B relations.

In the first case, the E(xtract) part of the ETL acronym retrieves data from different sources into a flat-file destination. In the other case, we L(oad), the L in the ETL acronym, the content of a plain text file into a relational database (or any other type of database). Both procedures may need to T(ransform) the data, yes, the T of ETL, to clean it, validate it against rules and constraints, or manipulate it in some other way. Both procedures are usually approached with a procedural mindset, because the ETL also declares the workflow of the whole operation. From the workflow perspective, the procedure for an extraction would be: query the data from multiple sources, combine it, clean and transform the result, and then store it in the plain file destination with a provided layout. Here we notice immediately that those steps (or any intermediate step, such as a security layer for encryption) can be segregated into well-defined components in almost any ETL development tool.

Let’s review the extraction case as an example. I use the SOLID principles all the time in my regular object-oriented applications. It’s good practice to write code founded only on empirically validated principles and design patterns. Those principles, and their application, justify or rule out many decisions in the design or implementation of almost any piece of software.

Following SRP (Single Responsibility Principle), I tend to segregate all functionality at the component level. One component (function, class, package, etc.) must serve one, and only one, task. In the extraction example I could split the functions into a set of Querier, Joiner, Cleaner, Transformer and TextWriter components. The resulting proliferation of items could discourage even the most enthusiastic advocate of object technologies, but in the long run the benefits become apparent.
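To make the split concrete, here is a minimal Python sketch of the extraction case, with one function per responsibility. Everything in it is an assumption made for illustration: the function names, the in-memory lists standing in for real sources (`crm`, `erp`), the cleaning and transformation rules, and the pipe-delimited layout. A real ETL tool would express each piece as its own step or module.

```python
import csv
import io

def query(source):
    """Querier: read rows from one source (here, an in-memory list)."""
    return list(source)

def join(*batches):
    """Joiner: combine the rows of several sources into one list."""
    return [row for batch in batches for row in batch]

def clean(rows):
    """Cleaner: drop rows that have no id (hypothetical rule)."""
    return [row for row in rows if row.get("id") is not None]

def transform(rows):
    """Transformer: normalize the name field (hypothetical rule)."""
    return [{**row, "name": row["name"].strip().title()} for row in rows]

def write_text(rows, out, fieldnames, delimiter="|"):
    """TextWriter: store rows in a delimited flat file with the given layout."""
    writer = csv.DictWriter(
        out, fieldnames=list(fieldnames), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)

# Two hypothetical sources.
crm = [{"id": 1, "name": " ada "}]
erp = [{"id": None, "name": "ghost"}, {"id": 2, "name": "GRACE"}]

buffer = io.StringIO()
write_text(transform(clean(join(query(crm), query(erp)))), buffer, ["id", "name"])
```

Each function can be tested, replaced or reused on its own, which is the benefit that pays for the larger number of pieces.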

Following OCP (Open-Closed Principle), the component-based solution should be extensible: it should be possible to add another Querier for another type of source, or a new Transformer to adjust a piece of data to another validation rule or format. The design must also allow the addition of an intermediate security layer for encryption, or a data validation layer to identify data quality exceptions, without massive refactoring of the other components or architectural changes.
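One possible way to get this kind of extensibility is a small registry of Queriers, sketched below in Python. The registry, the decorator and the source names (`customers_db`, `partners_api`) are hypothetical, and the Queriers return canned rows instead of touching a real database or API. The point is only that `extract` never changes when a new source appears.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]
Querier = Callable[[], Iterable[Row]]

# Hypothetical registry: maps a source name to the Querier that reads it.
QUERIERS: Dict[str, Querier] = {}

def register_querier(name: str):
    """Decorator that adds a Querier to the registry under the given name."""
    def decorator(func: Querier) -> Querier:
        QUERIERS[name] = func
        return func
    return decorator

def extract(names: List[str]) -> List[Row]:
    """Closed for modification: it only looks Queriers up by name."""
    rows: List[Row] = []
    for name in names:
        rows.extend(QUERIERS[name]())
    return rows

@register_querier("customers_db")
def query_customers() -> List[Row]:
    return [{"id": 1, "source": "db"}]

# Added later for a new type of source; extract() is left untouched.
@register_querier("partners_api")
def query_partners() -> List[Row]:
    return [{"id": 2, "source": "api"}]

rows = extract(["customers_db", "partners_api"])
```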

The LSP (Liskov Substitution Principle) is harder to apply in ETL solutions. In the ETL world there is no inheritance as in the object-oriented world. Usually, ETL tools allow a workflow to call any process chain or steps you want. From this perspective you can call a «parent» module with the same interface as any of its «child» modules, hiding the inner execution from callers and resembling inheritance in some way. Later on, this concept will be related to the «Controller» GRASP principle.
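In object-oriented terms, the «parent» module behaves roughly like a composite: it honors the same contract as its children, so a caller can use either one without noticing the difference. The Python sketch below illustrates that reading under my own assumptions; the class names (`Step`, `StepGroup`, `TrimNames`, `UppercaseNames`) and the `run(rows)` contract are invented for the example and do not come from any ETL tool.

```python
from abc import ABC, abstractmethod

class Step(ABC):
    """Common contract: take a list of rows, return a list of rows."""
    @abstractmethod
    def run(self, rows):
        ...

class TrimNames(Step):
    def run(self, rows):
        return [{**row, "name": row["name"].strip()} for row in rows]

class UppercaseNames(Step):
    def run(self, rows):
        return [{**row, "name": row["name"].upper()} for row in rows]

class StepGroup(Step):
    """«Parent» module: same run() contract, inner steps hidden from callers."""
    def __init__(self, *children):
        self._children = children

    def run(self, rows):
        for child in self._children:
            rows = child.run(rows)
        return rows

def execute(step: Step, rows):
    """The caller only knows the Step contract, not what is behind it."""
    return step.run(rows)

single = execute(TrimNames(), [{"name": " ada "}])
grouped = execute(StepGroup(TrimNames(), UppercaseNames()), [{"name": " ada "}])
```

Because `StepGroup` can stand in wherever a single `Step` is expected, the workflow that calls it does not need to change when the group gains or loses children.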

With ISP (Interface Segregation Principle) I return to SRP by separating each functionality into its own module, while also avoiding bloated components and strongly coupled client-server relations. For interface design under the ETL model, I prefer to apply the IPO (input-process-output) model. With a clear, well-defined, and simple Input and Output, each component is completely decoupled and free to Process (split, merge, validate, transform) the data. Each component is also a small unit, focused on a single task. This makes it easy to implement, extend and test. It is also possible to insert another link into the chain without any harmful impact on the others.
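A rough Python illustration of the IPO idea: if every link takes a list of rows as Input and returns a list of rows as Output, the chain is just an ordered list of links, and adding one does not touch the others. The step names and the sample rules are assumptions made for the example.

```python
from functools import reduce

def split_full_name(rows):
    """Process: split 'full_name' into 'first' and 'last'."""
    result = []
    for row in rows:
        first, _, last = row["full_name"].partition(" ")
        result.append({"first": first, "last": last})
    return result

def uppercase_last(rows):
    """Process: uppercase the 'last' field."""
    return [{**row, "last": row["last"].upper()} for row in rows]

def drop_missing_last(rows):
    """Process: validation link that discards rows with an empty 'last'."""
    return [row for row in rows if row["last"]]

def run_chain(steps, rows):
    """Feed the output of each step into the input of the next one."""
    return reduce(lambda data, step: step(data), steps, rows)

data = [{"full_name": "Ada Lovelace"}, {"full_name": "Plato"}]

chain = [split_full_name, uppercase_last]
# A new link is inserted in the middle; the existing steps are unchanged.
extended = [split_full_name, drop_missing_last, uppercase_last]

before = run_chain(chain, data)
after = run_chain(extended, data)
```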

But ISP is not only about separating functionality behind a clear interface. The importance of the principle lies in decoupling client-server relations. No client should be forced to implement or depend on an interface that it will not use. Bloated objects bundle many operations, diminishing the reusability of the code. This is a kind of «hard-coded» architecture anti-pattern. How can you reuse an Encryptor component, for example, if all its operations are mingled with other unrelated operations?
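As a sketch of what a reusable Encryptor could look like, the Python below declares a narrow interface with a single operation and two unrelated clients that depend only on it. All names are hypothetical, and `XorEncryptor` is a toy stand-in that is not real encryption; it is there only so the example runs without external libraries.

```python
from typing import Protocol

class Encryptor(Protocol):
    """Narrow interface: the only thing a client needs to know."""
    def encrypt(self, data: bytes) -> bytes:
        ...

class XorEncryptor:
    """Toy placeholder, NOT real encryption; for illustration only."""
    def __init__(self, key: int):
        self._key = key

    def encrypt(self, data: bytes) -> bytes:
        return bytes(byte ^ self._key for byte in data)

def protect_file_payload(payload: bytes, encryptor: Encryptor) -> bytes:
    """Client 1: a flat-file writer that needs its payload protected."""
    return encryptor.encrypt(payload)

def protect_message(text: str, encryptor: Encryptor) -> bytes:
    """Client 2: an unrelated messaging step reusing the same component."""
    return encryptor.encrypt(text.encode("utf-8"))

encryptor = XorEncryptor(0x2A)
file_bytes = protect_file_payload(b"abc", encryptor)
message_bytes = protect_message("abc", encryptor)
```

Neither client has to know about queries, layouts or logging to use the component, which is what makes it reusable elsewhere.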

Again, the difficult part of applying this principle is justifying the large number of packages, modules or components, and confronting the «monolithic» syndrome (all code in one place) of many software development teams.

For DIP (Dependency Inversion Principle), I choose to mix «Indirection» and the concept of a «Controller» from the GRASP principles. Only under rare circumstances would I start the ETL chain by calling its first link directly. I prefer to use a «Controller», «Director», «Manager», «Master» or any global entry point (a kind of Façade in GoF terms) that is in charge of distributing tasks, managing the workflows, and applying the global logging, exception handling, and monitoring requirements. This abstraction prevents the concrete modules from calling each other directly, and allows us to change the inner chain of calls as desired without damaging the main process workflow.
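A minimal Python sketch of such an entry point follows. The class name `EtlController`, the `(name, step)` pairs and the logger name are assumptions for illustration; monitoring is reduced to log messages. The steps never call each other: the controller owns the order of the chain and the cross-cutting concerns.

```python
import logging

logger = logging.getLogger("etl.controller")

class EtlController:
    """Global entry point: owns the chain, the logging and the error handling."""

    def __init__(self, steps):
        # steps is a sequence of (name, callable) pairs; each callable
        # takes a list of rows and returns a list of rows.
        self._steps = list(steps)

    def run(self, rows):
        for name, step in self._steps:
            logger.info("starting step %s", name)
            try:
                rows = step(rows)
            except Exception:
                # Log once, centrally, then let the failure surface.
                logger.exception("step %s failed", name)
                raise
            logger.info("finished step %s with %d rows", name, len(rows))
        return rows

controller = EtlController([
    ("clean", lambda rows: [row for row in rows if row]),
    ("double", lambda rows: [row * 2 for row in rows]),
])
result = controller.run([1, 0, 2])
```

Reordering, adding or removing a step means changing only the list handed to the controller, not the steps themselves.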

In a following part I will go deeper into the application of the GoF patterns, the GRASP principles and other object technologies to ETL solutions. Remember that ETL is not object-oriented, but good patterns and principles can be propagated to other disciplines and programming models with good effect. They are good teaching tools, and they encapsulate the «secrets of the trade».
