ETL – Extract, Transform, Load. Raw data coming from various sources, both internal and external, needs to be cleaned and prepared so it can be integrated in a meaningful way before any analytics can be performed on it. There are many tools that can be used to perform ETL. With this information in mind:
What are some of the trending ETL tools in the industry, and which ones have you used? What are some of the challenging ETL processes you have come across or solved?
What are the limitations of the ETL step itself compared to a graph data representation? Are there any uses of machine learning algorithms to perform ETL? Finally:
I believe one of the easiest ETL tools to use is Alteryx. Alteryx allows you to input data, perform many different transformations on the data, and then output the results. What makes it easy to use is how the program is designed: the user interface is all drag and drop, and once your transformation is complete it looks like a process flow, which is very easy to read and follow. Alteryx also has a great community to search when you are struggling to google answers. It's a very expensive program, which may limit the number of people who have used it. I'm currently using Alteryx at my workplace, but I don't have access to an Alteryx Gallery (Server), which would allow me to schedule my workflows so they run automatically.
One of the challenges I came across during an ETL exercise is storing data. Some of the data at my workplace is "point in time" data, meaning it can only be measured on the day it is captured, which doesn't allow for good historical tracking. For example, for our customers we know how many of them have an email address today. Let's say today we have 100 email addresses; we don't know whether we had 90 or 100 yesterday. So each month I query the email table, join it to our customer table, and store a monthly history. I use an INSERT statement to add that month's records to a table I've already set up. This allows me to measure whether the number of emails for our customers has increased or decreased over time.
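The monthly snapshot approach described above can be sketched like this. It is a minimal illustration, not the actual workplace schema: the table and column names (customer, email, email_history) are hypothetical stand-ins.

```python
import sqlite3
from datetime import date

# Hypothetical tables; names and columns are assumptions for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE email (customer_id INTEGER, address TEXT);
    -- History table set up once; one row per customer per snapshot month.
    CREATE TABLE email_history (snapshot_month TEXT, customer_id INTEGER, has_email INTEGER);
""")
con.executemany("INSERT INTO customer VALUES (?, ?)",
                [(1, "Ann"), (2, "Bob"), (3, "Cal")])
con.executemany("INSERT INTO email VALUES (?, ?)",
                [(1, "ann@example.com"), (3, "cal@example.com")])

# Monthly job: join email to customer and append a point-in-time snapshot.
snapshot = date.today().strftime("%Y-%m")
con.execute("""
    INSERT INTO email_history (snapshot_month, customer_id, has_email)
    SELECT ?, c.customer_id,
           CASE WHEN e.customer_id IS NULL THEN 0 ELSE 1 END
    FROM customer c
    LEFT JOIN email e ON e.customer_id = c.customer_id
""", (snapshot,))

# Trend over time: customers with an email, per snapshot month.
for row in con.execute("""
        SELECT snapshot_month, SUM(has_email)
        FROM email_history
        GROUP BY snapshot_month"""):
    print(row)
```

Because each run appends rather than overwrites, the history table accumulates one row per customer per month, so the count can be compared across months even though the source only knows "today."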
With a relational database you would use ETL to load a star schema, setting up dimension and fact tables. In a graph data model you use one transactional data set and tell the software which fields to merge or connect on.
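The contrast can be sketched with one toy transactional data set shaped both ways. The field names (order_id, customer, product) are illustrative assumptions; the "graph" here is just a plain adjacency dictionary, not a real graph database.

```python
# A toy transactional data set; field names are illustrative assumptions.
transactions = [
    {"order_id": 1, "customer": "Ann", "product": "Widget"},
    {"order_id": 2, "customer": "Ann", "product": "Gadget"},
    {"order_id": 3, "customer": "Bob", "product": "Widget"},
]

# Star-schema view: dimensions keyed by surrogate IDs, facts holding foreign keys.
customers = {n: i for i, n in enumerate(sorted({t["customer"] for t in transactions}), 1)}
products = {n: i for i, n in enumerate(sorted({t["product"] for t in transactions}), 1)}
facts = [(t["order_id"], customers[t["customer"]], products[t["product"]])
         for t in transactions]

# Graph view: the same rows become edges by "connecting on" the chosen fields.
neighbors = {}
for t in transactions:
    neighbors.setdefault(t["customer"], set()).add(t["product"])

print(facts)      # fact rows with foreign keys into the two dimensions
print(neighbors)  # each customer linked directly to the products they bought
```

Both views hold the same information; the relational side adds surrogate keys and separate tables up front, while the graph side only needs to know which fields to connect on.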
Several ETL (Extract – Transform – Load) tools are available for data management, such as Oracle, Informatica, Stitch, SAS, and Qlik. The hospital uses the Cerner EMR software, which is now part of Oracle. Cerner Command Language (CCL) is the primary database query and scripting language used to extract healthcare data from the EMR's Oracle database. Programmers can write CCL queries and scripts against the raw data and transform the information into operational reporting formats. Most reports are sent to the Discern Explorer menu, Cerner's proprietary way for the end user to access reports and data. The CCL results can also be sent to other databases for custom SQL queries, as well as to the data warehouse for real-time dashboards.
Our CCL programmers maintain the CCL code and transform the data into the EDW. My part is to validate the accuracy of the data going to the EDW for the end user to extract into reporting. This is often challenging, as I need to understand where the source data is coming from, what data points to pull, and what the end result should be.
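One common way to do that kind of validation is a source-to-target reconciliation: compare row counts, a summed measure, and any keys missing from the warehouse. The sketch below is a generic illustration using SQLite, not the hospital's actual tables; every name (src_encounters, edw_encounters, charge) is a made-up stand-in.

```python
import sqlite3

# Hypothetical source and warehouse tables; names and columns are assumptions.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE src_encounters (encounter_id INTEGER, charge REAL);
    CREATE TABLE edw_encounters (encounter_id INTEGER, charge REAL);
""")
src = [(1, 100.0), (2, 250.0), (3, 75.0)]
con.executemany("INSERT INTO src_encounters VALUES (?, ?)", src)
con.executemany("INSERT INTO edw_encounters VALUES (?, ?)", src[:2])  # simulate a missed load

def reconcile(con, source, target, key, measure):
    """Compare row counts, a summed measure, and keys missing from the target.
    Table/column names are interpolated, so they must come from trusted code."""
    n_src, sum_src = con.execute(f"SELECT COUNT(*), SUM({measure}) FROM {source}").fetchone()
    n_tgt, sum_tgt = con.execute(f"SELECT COUNT(*), SUM({measure}) FROM {target}").fetchone()
    missing = [r[0] for r in con.execute(
        f"SELECT {key} FROM {source} EXCEPT SELECT {key} FROM {target}")]
    return {"rows": (n_src, n_tgt), "sums": (sum_src, sum_tgt), "missing_keys": missing}

report = reconcile(con, "src_encounters", "edw_encounters", "encounter_id", "charge")
print(report)
```

A mismatch in any of the three checks points at where to start digging: missing keys usually mean a dropped load, while matching counts with mismatched sums suggest a transformation error.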
Relational databases are great for storing and processing data, and the practice has been around for several decades. They do well with online transaction processing, analytical processing, business intelligence, ETL, and advanced analytics, and the information in the database can be processed into visualizations and graphs. An RDBMS links data together through modeling with keys and joins. Graph databases offer flexibility and are easier to work with when linking information; by contrast, relational modeling tends to fall short when querying deep, complex relationships, which graph traversal handles more naturally.
I could not locate any information on true machine learning algorithms performing ETL end to end. Still, many tools use algorithms to provide a more automated approach to extracting, processing, and transforming the data.
The following is an example of our source data in the EMR and how it is stored, extracted, transformed, and loaded into the EDW for the end user to use for data mining, reporting, and Qlik dashboards.