Engineering (not just AI) can automate tasks, freeing up people to do more valuable things. In 2023, automation led to a 15% reduction in operational headcount needs. Here are some highlights:
Data
Snagajob's strength lies in its data, and that data is underpinned by Kafka. However, Kafka on its own is useless for ad-hoc analytics. Enter Snowflake. By converting Kafka data (topics) into Snowflake data (tables), real-time analytics becomes possible.
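As a rough illustration of the idea (not our actual pipeline), here is a minimal sketch that consumes a Kafka topic and lands each JSON message in a Snowflake VARIANT column, using confluent-kafka and the Snowflake Python connector. The topic name, table name, and connection details are placeholders.

```python
# Minimal sketch (placeholders, not our actual pipeline): consume JSON
# events from a Kafka topic and land them in a Snowflake table as
# semi-structured (VARIANT) rows that SQL can query directly.
from confluent_kafka import Consumer
import snowflake.connector

consumer = Consumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "snowflake-ingest",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["job_application_events"])  # hypothetical topic name

conn = snowflake.connector.connect(
    account="my_account", user="ingest_user", password="...",
    warehouse="INGEST_WH", database="RAW", schema="KAFKA",
)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS job_application_events_raw (payload VARIANT)")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    # PARSE_JSON stores the message body as a queryable VARIANT value.
    # (Row-by-row inserts are shown for brevity; real pipelines batch.)
    cur.execute(
        "INSERT INTO job_application_events_raw (payload) SELECT PARSE_JSON(%s)",
        (msg.value().decode("utf-8"),),
    )
```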
Making this a reality originally required multi-team coordination, which meant new features would go live and weeks would pass before detailed analysis could occur (thankfully, no data was lost no matter how long it took).
The sequence looked like this:
 - Product team asks for a topic to be provisioned
 - Data team creates topic, ensuring it follows best practices (product team cannot develop until this is complete)
 - Product team deploys code that writes to new topic
 - Data team develops code to ingest the data raw
 - Data team develops code to flatten the data into easy-to-use tables (see the sketch after this list)
 - Any time new fields are added to the topic, the above process must repeat (other than step 1)
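To make "raw" versus "flattened" concrete: the raw table holds the topic's JSON as-is (as in the ingestion sketch above), and a separate, hand-written piece of code projects it into typed columns. A sketch of what that per-topic flatten step might look like, with hypothetical table and column names:

```python
# Sketch of the hand-written "flatten" step for one topic: project the raw
# VARIANT payload into typed, easy-to-query columns. Table and column names
# are illustrative, not our schema. Every new field on the topic meant
# editing this statement by hand and redeploying.
FLATTEN_SQL = """
CREATE OR REPLACE VIEW ANALYTICS.JOB_APPLICATION_EVENTS AS
SELECT
    payload:event_id::string           AS event_id,
    payload:worker_id::string          AS worker_id,
    payload:job_id::string             AS job_id,
    payload:applied_at::timestamp_ntz  AS applied_at
FROM RAW.KAFKA.JOB_APPLICATION_EVENTS_RAW
"""

def flatten_topic(cursor) -> None:
    # cursor: a snowflake.connector cursor, as in the ingestion sketch above
    cursor.execute(FLATTEN_SQL)
```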

This process was intentionally created to ensure data followed patterns and practices (to compensate for data quality issues of the past). But it added way too much friction and consumed way too many resources.
How it works today
 - Product team immediately starts development, no communication necessary.
 - Data team is notified of the newly created data/topic and can review it for anomalies asynchronously.
 - Internally developed code/infrastructure automatically ingests new data/topics into Snowflake, both raw and flattened into easy-to-use tables. Any fields added later are flattened automatically as well (a rough sketch follows this list).
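The internal library itself isn't shown here, but the core idea can be sketched: discover which fields exist in the raw payloads and regenerate the flattened view to include all of them, so newly added fields appear without any hand edits. The function, table, and view names below are hypothetical.

```python
# Rough sketch of automatic flattening (an illustration of the idea, not our
# internal library): discover which top-level keys appear in the raw payloads
# and regenerate the flattened view so every field is exposed as a column.

def discover_fields(cursor, raw_table: str) -> list[str]:
    # LATERAL FLATTEN over the VARIANT column yields one row per top-level key.
    cursor.execute(
        f"SELECT DISTINCT f.key FROM {raw_table}, LATERAL FLATTEN(input => payload) f"
    )
    return [row[0] for row in cursor.fetchall()]

def rebuild_flat_view(cursor, raw_table: str, view_name: str) -> None:
    fields = discover_fields(cursor, raw_table)
    # Fields added to the topic later show up here automatically; no hand edits.
    columns = ",\n    ".join(f'payload:"{f}" AS "{f}"' for f in fields)
    cursor.execute(
        f"CREATE OR REPLACE VIEW {view_name} AS\nSELECT\n    {columns}\nFROM {raw_table}"
    )
```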

Product teams can develop faster, operational costs on the data team are dramatically reduced (but oversight still exists), and the company now knows how new features are doing instantly.
There are other benefits too:
 - Costs reduced (Snowflake way down, AWS up a tiny bit)
 - One set of code to support (instead of two PER topic)
 - Monitoring went from crude to full-featured (for free, by leveraging our internal libraries instead of a third-party tool that only offered logging).
Deployment
Deploying new code is risky. Problems most often occur when something changes: code, infrastructure, load, etc. We built a blue/green deployment pipeline that limits how much traffic is exposed to a change at a time, through traffic ramping.
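Conceptually, the ramp is a loop: send a small slice of traffic to the new (green) environment, watch error rates, and keep increasing until it serves everything. A hedged sketch of that loop follows; the step sizes, soak time, and the routing/health-check hooks are illustrative assumptions, not our actual pipeline. The waiting at each step is where the per-deployment time cost described below comes from.

```python
import time

# Sketch of blue/green traffic ramping: gradually shift load to the new
# (green) environment and roll back if errors spike. The step sizes, soak
# time, and the two callables are illustrative assumptions, not a real API.

RAMP_STEPS = [5, 25, 50, 100]   # percent of traffic sent to green
SOAK_SECONDS = 600              # watch each step for ~10 minutes
ERROR_THRESHOLD = 0.01          # roll back if more than 1% of requests fail

def ramp_deployment(set_green_traffic_percent, green_error_rate) -> bool:
    for pct in RAMP_STEPS:
        set_green_traffic_percent(pct)     # e.g., weighted load-balancer rules
        time.sleep(SOAK_SECONDS)
        if green_error_rate() > ERROR_THRESHOLD:
            set_green_traffic_percent(0)   # roll back: all traffic to blue
            return False
    return True                            # green now serves 100% of traffic
```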
The cost of reducing that risk was time (it also backed up deployment queues, wasting developer time). This tradeoff can make sense where bad deployments are common. However, looking at the last month, how often did the ramping mitigate a disaster? Zero. The last quarter? Zero. The last year? Zero. We weren't mitigating real risk, and the cost was too high.
In this case, the solution was to remove automation. Forty-five minutes per deployment were given back to engineers. Bonus: it cut tech debt (legacy infrastructure) and reduced costs (the two sets of infrastructure now run side by side for a much shorter time).