- PROJECT: Data Remapping Engine on Kafka Connect
- TEAM DESCRIPTION: SOLO, INTERN POSITION
- GOAL: Define a simple means of converting data from one Avro schema to another, with the ability to reverse the change if needed
- PROBLEM STATEMENT: Define an effective means of changing data from format A to format B without the delays an external service would impose
- OUTCOME: Deployed a Kafka Connect plugin that integrated with the existing Kafka buffer, transforming data into either format with a latency of <=5ms per task
- DURATION FOR POC [Proof Of Concept]: 3 weeks
- CHALLENGES:
- - Parsing the data sent to the Kafka buffer by a Python producer failed in Java due to a faulty serialization configuration
- - Define a means of adapting the platform quickly in the event of a future schema change
- - Allow the service to transform data from A to B, or back again, depending on an admin-configurable option
- - An effective means of monitoring the service to detect failures caused by an incorrect data format
- - A scalable service able to handle the current data throughput, benchmarked at 50 messages/second
- SOLUTIONS:
- - Fixed the faulty serialization issue by parsing the string representation of the serialized Struct data into an equivalent JSON format with the aid of regular expressions
- - Enabled quick changes to the transformation direction using custom-designed configuration options on Kafka Connect
- - Managed multiple schema versions (and resolved the earlier Python serialization issue) by registering a schema version per format in Schema Registry, which also allowed upgrades of previously outdated schemas
- - Auto-generated Java classes from the Avro schemas, making new code easy to manage once the old schema became obsolete
- - Delivered benchmarking stats during code review of <=5ms per message (about 200 messages/second), comfortably covering the current 50 messages/second throughput
- - Documented and implemented various performance-driven configurations for the Kafka producer and consumer to allow for much faster throughput and better data retention in case of client failure
- - Built a notification service that sent emails whenever the client went down or an error occurred, enabling faster notice and recovery actions
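The write-up doesn't record the exact producer and consumer settings used, but tuning of the kind described typically touches standard Kafka properties like these (values below are illustrative, not the project's actual numbers):

```properties
# Producer: batch more aggressively for throughput (illustrative values)
linger.ms=20
batch.size=65536
compression.type=snappy

# Producer: retain data safely if the client fails
acks=all
enable.idempotence=true

# Consumer: pull larger batches per poll
max.poll.records=500
fetch.min.bytes=1024
```

Higher `linger.ms` and `batch.size` trade a little latency for throughput, while `acks=all` with idempotence protects against data loss on client failure.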
- MISCELLANEOUS: Tests, coverage data and documentation were also produced for this project within the stated time frame.
- TECHNOLOGIES: Java, Confluent Kafka Connect, Regular Expressions, Schema Registry
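The core idea of the engine — rename and drop fields to go from format A to B, with the ability to reverse the change — can be sketched with plain maps. The field names below (`pwr` -> `power_watts`, etc.) are invented for illustration; the real schemas were company-specific:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of a reversible field remapping between two record
// formats. Field names are hypothetical; the real mapping came from
// the old and new Avro schemas.
public class SchemaRemapper {

    // Mapping from old-format field names to new-format field names.
    private static final Map<String, String> FIELD_MAP = new LinkedHashMap<>();
    static {
        FIELD_MAP.put("pwr", "power_watts");
        FIELD_MAP.put("src", "source_type");
    }

    // Transform a record from format A to B by renaming fields;
    // fields absent from the mapping are dropped (the schema revision
    // removed redundant ones).
    public static Map<String, Object> forward(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_MAP.entrySet()) {
            if (record.containsKey(e.getKey())) {
                out.put(e.getValue(), record.get(e.getKey()));
            }
        }
        return out;
    }

    // Reverse the change: rename new-format fields back to old names.
    public static Map<String, Object> reverse(Map<String, Object> record) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : FIELD_MAP.entrySet()) {
            if (record.containsKey(e.getValue())) {
                out.put(e.getKey(), record.get(e.getValue()));
            }
        }
        return out;
    }
}
```

In the real plugin this logic lives inside a Kafka Connect transform and operates on Connect's `Struct` type rather than plain maps, with the direction chosen by the admin-configurable option described above.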
- THE LONG STORY
- I worked as an intern at a company called GRIT Systems, whose business centered on configurable IoT meters that report the power and energy usage of the different power sources [generator, inverter, etc.] connected to them.
- The problem began when a new set of meters was being made: a schema revision would remove redundant fields and add new ones to support new device functionality, which meant changing the data format.
- The devices push data to a Kafka buffer, and an Elasticsearch connector then pulls that data out of the buffer and drops it into Elasticsearch, which acts as our big database. The Elasticsearch and Kafka clusters share a schema, a structure that validates that the data being sent has the same shape in both. The connector breaks when it finds data that doesn't match that schema, hence the issue. We needed an effective way to make sure the data got changed once it reached Kafka, before being sent into Elasticsearch, so things would continue smoothly. We also couldn't simply change the schema, because the cluster held about 40GB of data, and remapping the entire cluster was too computationally expensive and costly.
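For context, the Kafka-to-Elasticsearch leg of this pipeline is typically wired up with Confluent's Elasticsearch sink connector. A minimal config might look like the sketch below (connector name, topic, and URL are made up for illustration):

```json
{
  "name": "meter-data-es-sink",
  "config": {
    "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
    "topics": "meter-readings",
    "connection.url": "http://localhost:9200",
    "key.ignore": "true",
    "schema.ignore": "false"
  }
}
```

With `schema.ignore=false` the connector relies on the record schema, which is exactly why records in the wrong format break it.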
- The end goal was to set up another topic that would hold the non-conforming data, transform it to follow the right structure, and then push it to the main production cluster.
- My first idea was a Python client that would pull the data, remap the JSON, and send it back to Kafka. This was partially built and later reviewed by another engineer and me, but it was too slow and would impose bandwidth and management issues, since it would have to run constantly to pull data. Even then it couldn't remap the data in time, barely crossing 5 messages/second, which wasn't enough.
- I then had the idea of using Kafka Streams and invested heavy time researching and prototyping with it, but it didn't seem to work for our type of data, as it wasn't a good fit for heavily nested records. I ended up trying Kafka Connect, which was similar but let you build your own custom application logic to handle the data, rather than relying on tightly coupled, map-reduce-like helper functions.
- Kafka Connect was the most promising framework, and after some deliberation I began a two-week sprint to come up with a POC demonstrating the features that would solve the challenges highlighted above. After setting up minimal configurations for the producer, I noticed I was getting a Struct object instead of a JSON string from our test buffer, which was fed by a Python producer running on a Raspberry Pi in the local workshop. After two days of debugging my own configuration, I concluded the fault was in the Python client, since a Java producer I set up produced the correct JSON response.
- While trying to fix this problem, I went through multiple libraries, such as Twitter's Bijection, JSON conversion libraries like GSON and Guava, and even Apache Spark (at that point I nearly gave up on Kafka Connect, but the same issue surfaced there too), attempting to convert the Struct response into the JSON being transmitted.
- I later found out that the Struct object comprised both the schema and the JSON object being sent, hence the issue persisting despite deserialization. After some days going through the Python code, I discovered the client was serializing the data with the Avro schema itself before sending it to Kafka, where it was then serialized again by Schema Registry, which held and managed the current schema. Tests with the library confirmed this double serialization, and the error was corrected.
- After a few days with no other success, I noticed that the string representation of the Struct data structure was very close to JSON if I changed the structure just a bit. Give or take a few hours of regular-expressions work, I was able to transform the data into JSON and successfully deserialize it using GSON. Performance benchmarks and notifications followed, and the initial targets were comfortably met.
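The regex trick can be sketched as follows. Connect's `Struct#toString()` output looks roughly like `Struct{pwr=230.5,src=generator}`, which is close enough to JSON to repair with a few substitutions. The field names here are invented, and real data with nested structs or values containing `{`, `,` or `=` would need more careful handling, so this is illustrative rather than production-grade:

```java
import java.util.regex.Pattern;

// Sketch: repair the Struct string representation into parseable JSON
// using regular expressions. Not robust to exotic values; illustrative only.
public class StructStringToJson {

    // A bare key followed by '=' becomes a quoted JSON key.
    private static final Pattern KEY = Pattern.compile("(\\w+)=");

    // A bare alphabetic value after ':' becomes a quoted JSON string
    // (numeric values are left untouched).
    private static final Pattern BARE_STRING =
            Pattern.compile(":([A-Za-z_][\\w ]*)");

    public static String toJson(String structString) {
        // "Struct{...}" -> "{...}"
        String s = structString.replace("Struct{", "{");
        // key= -> "key":
        s = KEY.matcher(s).replaceAll("\"$1\":");
        // :generator -> :"generator"
        s = BARE_STRING.matcher(s).replaceAll(":\"$1\"");
        return s;
    }
}
```

For example, `toJson("Struct{pwr=230.5,src=generator}")` yields `{"pwr":230.5,"src":"generator"}`, which GSON can then deserialize normally.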
- A week of cleanup followed, and the project went live for a successful pre-test and test run.