Data Engineering: The Hottest Career You Can Pursue

Jun Kim
Published in Inside Outcome · Feb 6, 2020 · 7 min read


Making data engineering “sexy” (source: reddit.com)

Data science is the sexiest profession of the 21st century, according to Harvard Business Review anyway. And I agree. We get to use cutting-edge technology to solve interesting problems with immense impact while being paid to do so. Yes, this is another article that is going to explain how sexy data science is and why you should work in it. Well, more specifically, data engineering.

Get paid to solve interesting problems? Count me in. (source: imgur.com)

(TLDR; Data engineering is sexy. Cutting-edge technology is sexy. Big impact is sexy.)

What is Data Engineering?

You don’t know what data engineering is? Data engineering is the core aspect of data science that focuses on the design and development of data pipelines that perform data ingestion. It is the step that makes it possible for any enterprise to pull actionable business insights out of its data. If you would like to read more about it, check out Robert Chang’s guide to data engineering. It reads like a data engineering bible: https://medium.com/@rchang/a-beginners-guide-to-data-engineering-part-i-4227c5c457d7.

(TLDR; data engineering = data pipelines.)

Before I go any further, I know what you are thinking. Data ingestion does not sound too sexy, but hear me out, because there is a way to make it sexy (the real reason I am writing this article). Essentially, there are two ways to ingest data: batch processing or real-time processing (streaming).

(TLDR; you can ingest data in a batch or in a stream (real-time) and real-time is the sexy way.)

What is Batch Processing?

Simply put, batch processing executes jobs in batches. A real-life example is dirty laundry in a hamper: doing laundry every Sunday, after you have collected a week’s worth of dirty clothes, is batch processing. A business-y example is a computer program that aggregates the sum of sales for the prior day or week. In most cases, batch processes are kicked off by a time-based trigger (e.g. every Sunday).
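
To make the trigger distinction concrete, here is a minimal sketch of the daily sales example. The `sales` table and the SQLite database are hypothetical stand-ins for a real warehouse:

```python
import sqlite3
from datetime import date, timedelta

def aggregate_yesterdays_sales(db_path="sales.db"):
    """Batch job: sum all sales recorded yesterday.

    Kicked off by a time-based trigger (e.g. a daily cron entry like
    `0 2 * * * python batch_sales.py`), not by individual events.
    """
    yesterday = (date.today() - timedelta(days=1)).isoformat()
    with sqlite3.connect(db_path) as conn:
        (total,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM sales WHERE sale_date = ?",
            (yesterday,),
        ).fetchone()
    print(f"Total sales for {yesterday}: {total}")
    return total

if __name__ == "__main__":
    aggregate_yesterdays_sales()
```

The key point is that nothing here reacts to individual sales; the job simply runs on a schedule and sweeps up whatever has accumulated.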

(TLDR; batch processing ingests data in batches.)

Laundry in a batch 👍 (source: alamy.com)

What is Real-time Processing?

Real-time processing executes jobs in real time, as events happen. The real-life example is washing your clothes as soon as they get dirty. You may think that this is a poor application of real-time processing, and you are absolutely right. Some processes (like doing your laundry) are better left as batch than real-time, unless your favorite shirt is dirty and you have a date tonight. A more appropriate real-life example is washing your hands each time you use the bathroom. A business-y example is a computer program that runs whenever a new sale is made, keeping a running sum of sales for the day. Unlike batch processes, real-time processes tend to be kicked off by an event-based trigger (e.g. a sale).
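
And here is the same sales example turned event-driven, as a toy sketch (the event stream is simulated with a plain list):

```python
running_total = 0.0

def on_sale(amount):
    """Event-based trigger: called each time a new sale event arrives,
    keeping a running sum for the day instead of waiting for a nightly batch."""
    global running_total
    running_total += amount
    print(f"Running total for today: {running_total:.2f}")

# Hypothetical stream of sale events arriving throughout the day:
for sale_amount in [19.99, 5.00, 42.50]:
    on_sale(sale_amount)
```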

(TLDR; real-time processing ingests data in real-time.)

Washing just a shirt? I don’t know about that 🤔(source: alamy.com)

What Got Me Thinking About Real-time Processing

I work as a data engineer for a health-tech company called Outcome Health (OH). OH is the leading point-of-care experience platform that supports the patient and physician relationship by providing relevant content for the most important moments of the health journey.

Proof of play is an auditing process that shows our stakeholders that relevant content (an advertisement, for example) was displayed on the digital screens of the more than 100,000 devices OH has installed in doctors’ offices all over the States. OH’s proof-of-play process leverages image processing and EXIF data. With an automated proof-of-play process, OH is able to offer its stakeholders the three T’s: transparency, trust, and technology.
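
As a rough illustration of the EXIF half of that (and not OH’s actual implementation), reading the capture timestamp out of a screenshot with Pillow might look like:

```python
from PIL import Image            # pip install Pillow
from PIL.ExifTags import TAGS

def screenshot_timestamp(path):
    """Return the timestamp stored in an image's EXIF data, if any."""
    exif = Image.open(path).getexif()
    for tag_id, value in exif.items():
        if TAGS.get(tag_id) == "DateTime":   # tag 306: when the file was written
            return value
    return None

# Hypothetical usage: check when a device's screenshot was actually captured.
print(screenshot_timestamp("device_screenshot.jpg"))
```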

(TLDR; Outcome Health uses proof-of-play to make sure the devices play the correct content.)

My First Real-time Data Pipeline Project

I joined OH last October to help transition this proof-of-play process from batch to real time. The proof-of-play process ran in batches, meaning that at the end of each day it audited all screenshots saved during the prior day. The impact of this lag was not abysmal, but it needed to be addressed, as it meant the proof-of-play process was not fully optimized: OH was not always able to utilize all of its resources to their maximum capacity.

(TLDR; Outcome Health wanted proof-of-play to run in real-time and I got hired to work on it.)

Tech Stack for the Real-time Process

Now the fun part. For the real-time proof-of-play process, OH’s data engineering team decided on a tech stack of AWS managed services (Lambda, S3, and RDS) with Python as the programming language. The reasoning behind using AWS was that managed AWS services improve productivity and reduce overhead expenses, and, plus, OH is a Python + AWS shop.

(TLDR; I built the real-time process using Lambda, S3, RDS, and Python.)

Diagram of real-time proof-of-play process
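
To give a feel for the shape of that pipeline, here is a minimal sketch of an S3-triggered Lambda. The function names, table, and environment variable are hypothetical, and I am assuming a Postgres-flavored RDS instance reachable via psycopg2:

```python
import os
import boto3
import psycopg2  # assumes RDS runs Postgres; bundle as a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    """Entry point invoked asynchronously for each new screenshot landing in S3.

    The S3 event notification carries the bucket and key of the uploaded
    screenshot; we fetch it, audit it, and record the result in RDS.
    """
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()

        passed = audit_screenshot(body)  # hypothetical image-processing step

        with psycopg2.connect(os.environ["RDS_DSN"]) as conn:
            with conn.cursor() as cur:
                cur.execute(
                    "INSERT INTO proof_of_play (s3_key, passed) VALUES (%s, %s)",
                    (key, passed),
                )

def audit_screenshot(image_bytes):
    """Placeholder for the actual proof-of-play check (image processing + EXIF)."""
    return len(image_bytes) > 0
```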

Considerations

Four major considerations went into the design and development of the real-time proof-of-play process:

  1. Concurrency
  2. Error handling
  3. Performance (speed and accuracy)
  4. Cost

Concurrency and Error Handling

Upload count of screenshots per minute

It was estimated that we would see at most 750 screenshots per minute (as you can see in the graph above). With the default concurrency limit of 1,000, we were more than safe to push the real-time proof-of-play Lambda into production. But we needed to be REALLY sure. That is why we chose to invoke our Lambda function asynchronously. You can read more about asynchronous invocation here: https://docs.aws.amazon.com/lambda/latest/dg/invocation-async.html.

Using asynchronous invocation meant Lambda would manage the function’s asynchronous event queue and retry on errors all on its own. Even if we exceeded the concurrency limit, Lambda requests would be throttled, moved back into the queue, and retried for up to 6 hours.
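
For reference, here is roughly what that setup looks like with boto3; the function name and values are illustrative, not our production settings:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

# Invoke asynchronously: Lambda queues the event and handles retries itself.
lambda_client.invoke(
    FunctionName="proof-of-play",          # hypothetical function name
    InvocationType="Event",                # async; returns 202 immediately
    Payload=json.dumps({"s3_key": "screenshots/device-123.jpg"}),
)

# Optionally tune how long throttled/failed events stay in the queue.
lambda_client.put_function_event_invoke_config(
    FunctionName="proof-of-play",
    MaximumEventAgeInSeconds=21600,        # 6 hours, matching the text above
    MaximumRetryAttempts=2,
)
```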

Concurrency, check. Error handling, check.

Performance and Cost

Distribution of billed duration of 6,000 test Lambda requests

Addressing speed, we ran performance tests to make sure each auditing run does not take too much time. We learned that, even with the cold start of the Lambda execution environment taken into account, each run takes only 10,800 ms on average. We also ran accuracy tests to make sure each audit run in real time produces the same results as the batch auditing process.

Estimate of AWS Lambda cost

To sum up, at 10,800 ms/run and 300,000 runs/month, it costs approximately $6.83/month to run the real-time proof-of-play process. How much would you pay to be able to address your core business problems immediately rather than a day later? In other words, how much would you pay to be sexy?
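
For the curious, the arithmetic roughly reproduces that figure if you assume 128 MB of memory and 2020-era Lambda pricing:

```python
# Back-of-the-envelope Lambda cost check (assumes 128 MB memory and
# 2020-era pricing; ignores the free tier).
runs_per_month = 300_000
seconds_per_run = 10.8                  # 10,800 ms average billed duration
memory_gb = 128 / 1024                  # 0.125 GB

gb_seconds = runs_per_month * seconds_per_run * memory_gb   # 405,000 GB-s
compute_cost = gb_seconds * 0.0000166667                    # ~ $6.75
request_cost = (runs_per_month / 1_000_000) * 0.20          # ~ $0.06

print(f"~${compute_cost + request_cost:.2f}/month")         # ~ $6.81
```

That lands within rounding of the ~$6.83 estimate above.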

Performance, check. Cost, check.

(TLDR; Concurrency, error handling, performance, and cost were considered and addressed in building the process.)

Conclusion

If you followed along the entire way without just reading the TLDRs, then I thank you, and I hope we are on the same page about data engineering: that it is not just about moving data from point A to point B, but the art of moving data from point A to point B in the most efficient, logical, and robust way possible; that in most cases, making your data pipelines real-time opens the door to opportunities you could not explore before; and that data engineering is indeed sexy.

Appendix I: Concurrency Testing

In order to best replicate the concurrency problem on a local machine, I used multiprocessing to write S3 objects that acted as triggers for the Lambda function.
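
A stripped-down version of that load generator might look like this (bucket name, key pattern, and counts are made up):

```python
import multiprocessing

import boto3

BUCKET = "proof-of-play-screenshots"     # hypothetical trigger bucket

def upload_trigger(i):
    """Write one S3 object; each PUT fires the Lambda via S3 event notification."""
    s3 = boto3.client("s3")              # one client per process
    s3.put_object(Bucket=BUCKET, Key=f"load-test/screenshot-{i}.jpg", Body=b"...")

if __name__ == "__main__":
    # Fan out uploads across processes to approximate concurrent device uploads.
    with multiprocessing.Pool(processes=16) as pool:
        pool.map(upload_trigger, range(750))   # ~750 screenshots/minute peak
```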

If you are interested in reading more about it, check out: https://docs.python.org/2/library/multiprocessing.html.

Appendix II: Where the Lambda Request Data Came from

Each Lambda request that runs emits a CloudWatch log containing the following relevant information: billed duration, max memory used, and the timestamp of the request. The analysis of Lambda requests was done using CloudWatch log data.
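
Since each REPORT line follows a fixed format, pulling the billed durations out with boto3 and a regex is straightforward; a sketch, with a hypothetical log group name:

```python
import re

import boto3

logs = boto3.client("logs")

# Lambda writes one REPORT line per invocation, e.g.:
# "REPORT RequestId: ... Billed Duration: 10800 ms Max Memory Used: 128 MB ..."
report_re = re.compile(r"Billed Duration: (\d+) ms")

billed_ms = []
paginator = logs.get_paginator("filter_log_events")
for page in paginator.paginate(
    logGroupName="/aws/lambda/proof-of-play",   # hypothetical log group
    filterPattern="REPORT",
):
    for event in page["events"]:
        match = report_re.search(event["message"])
        if match:
            billed_ms.append(int(match.group(1)))

print(f"{len(billed_ms)} requests, avg {sum(billed_ms) / len(billed_ms):.0f} ms")
```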
