Reflections on the Tech Stack of Flux

In the early days of Flux, I often sent a link to our tech stack to candidates. In my mind, a big draw of working at a startup is the freedom to assemble a tech stack from the beginning, experiment with it, and learn from the experiences. Now is my time to say goodbye to the stack and write some thoughts.

Background: Flux is an HR Tech startup that had six engineers for its first three years, was acquired by Beamery in 2021, and grew to 16 engineers in 2022. The product consists of a web application and a recommender system (matching engine).

Languages and Frameworks

We stuck to two programming languages: JavaScript and Python. This was one of the best choices we made. It allowed us to spend less time on toolchains and more time building the actual business.

The web app is all JavaScript: Vue.js for the front-end and Node.js/Express.js for the backend.

Vue.js delivered the simplicity and productivity we hoped for. It was easy for new engineers to pick up and get productive. We used Vuetify as the component library, and it was OK. The choices of component libraries for Vue are limited compared to React. Our biggest complaint is the painful migration from Vue 2 to 3, lots of rewriting… I’d probably go for something other than Vue the next time.

Node.js/Express.js worked reasonably well for a web API server. We never had big issues with Node’s performance, and every common function seems to have a few packages available on NPM to choose from. It’s quite easy and pleasant to build web APIs with Node.js. My complaints are the legacy quirks of Node.js (e.g., losing the complete stack trace when legacy promise code is used) and the maturity of NPM packages (it’s not uncommon to find a bunch of choices, none of which is solid).

Express.js is very unopinionated, which means that for every function you need to find a package. Vue.js is better, but we still needed to pick a component library and some other pieces. Most of our package picks were OK, but there were two big ones I kind of regret:

  • We picked Objection.js as ORM as I wanted to use a SQL-friendly one and Sequelize really turned me off. But Objection.js was too niche and went out of maintenance recently.
  • Vuetify is the component library we chose, and it turned out to have a lot of issues we had to deal with. But the choices of component libraries for Vue.js are pretty limited anyway, and Vuetify already seemed to be the least bad one.

There are some benefits of using JS for both front-end and backend. Some tools can be shared, e.g., NPM, eslint, some libraries such as lodash and axios. But there are still plenty of differences that need to be figured out separately, e.g., testing tools, test coverage. Since we ended up building another API server for the matching engine in Python, I’d probably prefer Python for all backend API servers next time.

I didn’t write a lot of Python at Flux, so I have fewer opinions about it. My impression from that limited exposure is that the “sugar” tools are a little behind JS, e.g., linting. But when there is a popular choice, it’s often more mature and solid than the JS options, e.g., ORM, API documentation, HTTP client retry…

We used FastAPI for the Python API. No complaints there.

We used TensorFlow at first, then switched to PyTorch. It seems PyTorch is the more prevalent choice these days.

Data

For the OLTP database we used AWS Aurora Postgres RDS. It’s pretty great: stable and performant. Performance Insights was always helpful for identifying performance issues. The replication slot acted up a few times over the years, but never on production, and it was not hard to resolve.

For the data warehouse we chose Snowflake at the beginning. It was great, very easy to start with, and barely needed any “maintenance”. But in year 5 we moved to BigQuery for three reasons:

  1. Snowflake was slow to “compile” a query. A 200-line interactive query can take ~6 seconds to “compile” and then 2 seconds to execute. We tried to increase the compute size and talked with Snowflake support and couldn’t find a way to reduce this time. In comparison, on BigQuery it only takes 3~4 seconds for the query to run. So there is a big performance difference even on a small dataset.
  2. Snowflake’s contract is not very friendly. They asked me to sign an annual contract and buy credits. And leftover credits can only be rolled over to the next year if you renew the annual contract. So if you don’t renew, you lose leftover credits instead of rolling them over to pay-as-you-go. This pissed me off.
  3. Snowflake is more expensive than BigQuery. Our monthly spending was reduced ~20x.
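
To put the latency numbers from the first point side by side, here is a small back-of-the-envelope calculation. It uses only the rough figures quoted above (about 6 seconds to compile plus 2 seconds to execute on Snowflake, 3 to 4 seconds on BigQuery); the variable names are just for illustration, and the numbers describe one kind of query, not a benchmark.

```python
# Rough figures quoted in the text, in seconds (approximate, not a benchmark).
snowflake_compile_s = 6.0
snowflake_execute_s = 2.0
bigquery_total_range_s = (3.0, 4.0)

snowflake_total_s = snowflake_compile_s + snowflake_execute_s

# Speedup range: the best case uses the fastest BigQuery time, the worst case the slowest.
speedup_best = snowflake_total_s / min(bigquery_total_range_s)
speedup_worst = snowflake_total_s / max(bigquery_total_range_s)

# Share of Snowflake's wall-clock time spent before execution even starts.
compile_share = snowflake_compile_s / snowflake_total_s

print(f"Snowflake total: {snowflake_total_s:.0f} s")
print(f"BigQuery speedup: {speedup_worst:.1f}x to {speedup_best:.1f}x")
print(f"Compile share on Snowflake: {compile_share:.0%}")
```

In other words, the query ran roughly 2x to 2.7x faster on BigQuery, and about three quarters of the Snowflake time went to the “compile” step, which is why a bigger compute size didn’t help.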

For data replication from OLTP to the data warehouse, we used StitchData for a long time. It was pretty good for small amounts of data but can get expensive as the data volume increases. It’s also a SaaS service, so the data has to flow out of our VPCs, which can be undesirable in some use cases (large data volumes, privacy-sensitive clients such as banks or large enterprises). We switched to Airbyte as a self-hosted alternative and it didn’t go so well. There was error after error, which appeared to be rough edges of Airbyte. When I left we were trying to switch to Google Datastream. It was surprising how the “L” in ELT, a part that was supposed to be simple, turned out to be pretty complex.

We used dbt for data transformation for reporting and other purposes, and it was a big win. No complaints about it.

We also used Segment to collect data (pageviews, vendor data) and feed it into the data warehouse. It worked pretty well.

For ML workloads we used AWS Step Functions for some data pipeline orchestration. It did a decent job as a general-purpose orchestration tool. Sometimes it felt a bit clunky to write the configurations in CDK code, but we could probably have done a better job there.
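
For readers who haven’t seen Step Functions, here is a minimal sketch of what a two-step pipeline looks like in Amazon States Language, built as a plain Python dict. This is not our actual pipeline or our CDK code: the state names and Lambda ARNs are hypothetical placeholders, and the retry settings are arbitrary.

```python
import json

# Hypothetical Lambda ARNs; placeholders, not real resources.
EXTRACT_ARN = "arn:aws:lambda:us-east-1:123456789012:function:extract-features"
TRAIN_ARN = "arn:aws:lambda:us-east-1:123456789012:function:train-model"


def build_pipeline_definition() -> dict:
    """Return a two-step state machine in Amazon States Language."""
    return {
        "Comment": "Illustrative ML data pipeline (hypothetical steps)",
        "StartAt": "ExtractFeatures",
        "States": {
            "ExtractFeatures": {
                "Type": "Task",
                "Resource": EXTRACT_ARN,
                # Retry the task with exponential backoff if it fails.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 10,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                "Next": "TrainModel",
            },
            "TrainModel": {
                "Type": "Task",
                "Resource": TRAIN_ARN,
                "End": True,
            },
        },
    }


definition = build_pipeline_definition()
definition_json = json.dumps(definition, indent=2)
print(definition_json)
```

The resulting JSON is the kind of definition a state machine is created from. Writing it through an infra-as-code tool adds another layer on top, which is probably where some of the clunkiness came from.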

For the BI system we used Mode at first, for its embedding capability. After a few years we replaced embedded reports with native builds, and moved from Mode to Metabase for internal use.

Platform

We’re mostly an AWS shop. AWS is pretty solid in general. Some products are clunky but overall the service and support have been satisfactory. We somehow ended up using GCP and Azure as well: GCP for BigQuery, Azure for a “Skill Extraction” API at first, then OpenAI API.

Our compute was almost all serverless, either on ECS Fargate or Lambda. I intentionally avoided Kubernetes, and ECS worked just fine for our use cases. Lambda’s 15-minute run time limit and package size limit did become an issue, so eventually we dockerized almost all workloads.
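
To illustrate the kind of contortion the run time limit pushes you toward, here is a sketch of a handler-style function that stops before its time budget runs out and hands back the unprocessed items. It is for illustration only, not code from Flux. `get_remaining_time_in_millis()` is the method the Lambda Python runtime exposes on its context object; `FakeContext`, `process_batch`, and the safety margin are assumptions made up for this sketch so it can run locally.

```python
import time


class FakeContext:
    """Local stand-in for the Lambda context object (for this sketch only)."""

    def __init__(self, budget_ms: int) -> None:
        self._deadline = time.monotonic() + budget_ms / 1000

    def get_remaining_time_in_millis(self) -> int:
        return max(0, int((self._deadline - time.monotonic()) * 1000))


def process_batch(items, context, safety_margin_ms=50):
    """Process items until the time budget is nearly used up.

    Returns (results, leftover) so the caller can re-queue the leftover items.
    """
    results = []
    for index, item in enumerate(items):
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return results, items[index:]
        results.append(item * 2)  # placeholder for the real work
    return results, []


results, leftover = process_batch(list(range(5)), FakeContext(budget_ms=10_000))
print(results, leftover)
```

Once a job needs this kind of bookkeeping, running it as a container on Fargate, where there is no such limit, is usually the simpler option.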

We used Datadog for monitoring and logging. Datadog’s metric-based monitoring was a good solution. Almost all logs went into CloudWatch Logs then shipped into Datadog for easier search and integration with monitors.

We used Sentry for error alerting and APM. Sentry is a great product and I really like it. We used Datadog APM at first and found Sentry APM was better.

For infra-as-code we used AWS CDK. It was a pretty good tool for an AWS shop. We introduced some Terraform later for Datadog monitoring and GCP. If I had known we’d end up multi-cloud, I’d probably have gone for Terraform from the beginning.

Others

Besides the ones mentioned above, Flux also used a bunch of SaaS tools. I’ll make a list here with brief comments.

  • Atlassian suite: Jira, Confluence, Bitbucket, OpsGenie. They worked reasonably well and have tight integrations. It got expensive later, but migration would have been costly and disruptive, so we stuck with it.
  • Forest Admin: it was a convenient “back office” tool for data browsing and a place to build small tools. But it got quite expensive later.
  • SendGrid: transactional email service. Works well.
  • AWS SNS/SQS: simple queuing solution. Works fine.
  • Cypress: for e2e testing. Pretty good tool. Easy to start, but somewhat hard to write really good tests.
  • AWS API Gateway: for host mapping and rate limiting of the external API. We also ran an experimental project with an API Gateway + Lambda serverless web server setup for preview environments.
  • OpenVPN: for accessing resources in private subnets. The stability of it became an issue a few years later. Every few weeks it would go down and we’d have to reboot it.
  • FullStory: awesome tool to watch user sessions.

Overall I was happy with the majority of Flux’s tech stack choices. We did some switches and learned things along the way. Now onto the next venture.
