Why Is My ETL So Expensive? Common Mistakes

Have you ever looked at your ETL costs and wondered why they are so high? Well, we’ve all been there. Today, we will cover the most common problems and solutions!

  1. Keep your stakeholders educated

    One of your key roles as a Data Engineer/Analytics Engineer is to transfer knowledge and make sure that your non-technical colleagues are on the same page. That can help you avoid some costly pitfalls!

    Let me share a small story here: At one company I worked for, one of our colleagues created a Google Sheet that was pulling from the biggest and heaviest tables we had in our DWH. They set the sheet to refresh every 5 minutes, and our costs went through the roof!

    And the devil is in the details: Our ETL cycle only ran every 4 hours, so the data was not actually being updated every 5 minutes as they thought. Had we made this clear to our stakeholders up front, we would have saved thousands of dollars.

  2. Keep your logic in the DWH

    In modern data warehouses, storage is inexpensive; that’s why you can consider building an OBT (One Big Table). That way, your stakeholders avoid writing complex joins and logic themselves, potentially saving the cost of their queries. Another benefit is that such tables unify the logic throughout the organization and help avoid data mistakes. You can read more about OBT here!
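    To make this concrete, here is a minimal sketch of what an OBT build could look like; all table and column names (raw.orders, analytics.orders_obt, and so on) are hypothetical:

        -- Pre-join facts and dimensions once, so stakeholders query a
        -- single wide table instead of re-running these joins (and
        -- paying for them) in every report.
        CREATE OR REPLACE TABLE analytics.orders_obt AS
        SELECT
          o.order_id,
          o.ordered_at,
          o.total_amount,
          c.customer_id,
          c.customer_segment,
          c.country,
          p.product_category
        FROM raw.orders AS o
        LEFT JOIN raw.customers AS c ON o.customer_id = c.customer_id
        LEFT JOIN raw.order_items AS i ON o.order_id = i.order_id
        LEFT JOIN raw.products AS p ON i.product_id = p.product_id;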

  3. Look at usage patterns

    When creating a new table, think about how it will be grouped and filtered. Based on that, you can implement clustering and partitioning, which will significantly reduce the cost of querying the table you’ve built - and it’s easy to do: just define the clusters and partitions in your model configuration.
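    For example, assuming a dbt model on BigQuery (the table and column names here are hypothetical), the configuration could look like this:

        -- Queries filtering on event_date scan only the matching daily
        -- partitions; clustering on customer_id prunes data further.
        {{ config(
            materialized='table',
            partition_by={'field': 'event_date', 'data_type': 'date'},
            cluster_by=['customer_id']
        ) }}

        SELECT
          event_date,
          customer_id,
          event_name,
          revenue
        FROM {{ source('raw', 'events') }}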

  4. Consider changing tables to incremental instead of rebuilding with each run

    A common challenge when building history tables is how to structure them so that they always have up-to-date data without generating additional costs. One solution is incremental tables. Incremental tables update only the new or changed data rather than recreating the entire dataset, which reduces the amount of data processed and lowers costs. By using a combination of timestamps or unique identifiers, these tables track changes efficiently, ensuring that only relevant updates are applied.
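    As an illustration, here is a sketch of a dbt incremental model (table and column names are hypothetical): the first run builds the full table, and later runs only process rows newer than what the table already holds, merging changed rows by their unique key.

        {{ config(
            materialized='incremental',
            unique_key='order_id'
        ) }}

        SELECT
          order_id,
          status,
          total_amount,
          updated_at
        FROM {{ source('raw', 'orders') }}
        {% if is_incremental() %}
          -- {{ this }} refers to the already-built target table
          WHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})
        {% endif %}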

  5. Create a tracking dashboard

    Even if you implement the best practices, pitfalls can still occur. Building a dashboard that tracks your costs and identifies the most expensive tables, the most heavily used tables, or even the users that generate the highest costs can help you quickly spot the urgent fixes. From there, you can discuss the best solution with your team or your stakeholders.
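    If you’re on BigQuery, a query like the one below can be a starting point for such a dashboard. It assumes on-demand pricing at roughly $6.25 per TiB scanned (check the current rate for your region) and surfaces the users generating the most query cost over the last 30 days:

        SELECT
          user_email,
          COUNT(*) AS query_count,
          ROUND(SUM(total_bytes_billed) / POW(1024, 4), 2) AS tib_billed,
          ROUND(SUM(total_bytes_billed) / POW(1024, 4) * 6.25, 2) AS est_cost_usd
        FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
        WHERE job_type = 'QUERY'
          AND creation_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
        GROUP BY user_email
        ORDER BY est_cost_usd DESC;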

    If any of these problems sounds familiar, feel free to ping me - together we will find a solution! ;)
