HubSpot's Stack - Built for Shipping at Scale

At HubSpot we've designed our team structure, development processes and technical architecture to promote strong team ownership and iteration velocity. HubSpot makes a marketing and sales platform trusted by more than 10,000 customers to power their online presence. Those businesses rely on our SaaS platform to host their website, blog, landing pages and forms, collect web analytics, manage their contacts, deliver marketing emails, support their social media efforts and much more. The capability is relatively broad -- each functional area has many companies that focus on just one vertical -- and is growing as we continue to introduce new products.

Our stack is a function not only of building B2B SaaS offerings and of the company's stage and size, but also of the cultural values that we want our products and organization to reflect.

Small, Autonomous Teams are the core

Each team typically consists of a tech lead (TL) and two developers working with a product manager and a designer. We keep teams small to eliminate scale challenges and communication overhead (e.g. very few meetings). This also allows the tech leads to be deeply technical and product-focused, and to spend time coaching the two developers they work alongside. (A small number of people managers support the TLs by focusing on organizational issues and structure.) This unit owns a functional part of the product (e.g. "Social Media") and is chartered to make meaningful progress for its customers. Around this team are others that provide services for user testing, data management, reliability monitoring, etc., so that it can focus on solving customers' problems. The product team is full-stack, so it can build whatever is needed without external dependencies or approval. Its work is informed by many feedback loops: usability testing, direct customer interactions, usage tracking, customer support and other stakeholders. The team decides what to build, how best to implement it and how to manage its ongoing operation. If something is broken, the team is responsible for fixing it -- there is no QA team to offload responsibility to. If the user flow is confusing, the team owns iterating on it. When customers are excited about what has shipped, the team gets the kudos.

The result has enabled rapid progress and harnessed developers' passion for improving the customer experience. Often, as teams grow, individual contributors are forced into people or project management; this structure allows for technical mentoring while minimizing the number of full-time people managers. The primary challenge in this model is driving design and technical consistency. Conway's Law poses an obstacle: teams must communicate effectively to avoid silos and divergence. A strong set of peer communities (e.g. PMs, Java back-end developers) spans product teams and works to keep everyone on the same page. Cross-cutting initiatives are sometimes harder and often lead to "eventual consistency", where teams evolve toward a similar goal over time.

Microservices

The HubSpot products are composed of 300+ different web services and dozens of static front-end apps. Together these microservices form the products our customers buy. Most web services are written in Java using the Dropwizard framework, and the front-end largely uses Backbone and React in CoffeeScript. A single team will likely own several services. The exact scope of a service varies, but having more than ~5 actively contributing developers can lead to coordination overhead. Services communicate through RESTful JSON APIs or through a messaging system like Apache Kafka.

This architecture aligns well with having clear ownership and quick iterations. There are over 1,000 separately deployable units that can be scaled independently. Each family of services is owned by a team, so the appropriate people are notified of a service problem. It has facilitated scaling the organization, as there is a pattern for forming new teams and spinning up new services. The cost is additional complexity in understanding how a distributed architecture is performing and in managing configuration (e.g. services using different versions of a shared library). For smaller teams, moving away from a monolithic app (e.g. Rails) may not pay off until the amount of code or the number of developers reaches a tipping point.

Approaching Continuous Delivery

The goal of shipping frequently is first and foremost to increase our rate of learning. It enables getting rapid feedback from users and data points on the code's scalability.

To allow small teams to focus on improving the product, we've developed many internal tools that make shipping code simple and provide a safety net. Any commit to our hundreds of GitHub repos triggers a Jenkins build of the master branch that uses standard buildpacks for different frameworks (e.g. Java Dropwizard, static apps, etc.). Assuming all tests pass, pushing the deploy button puts that build on the relevant hosts. A build is first deployed to a shared QA environment, which aims to mirror the production environment as closely as possible (except in scale). Only builds that have been deployed to QA are eligible to be deployed to production. The goal is a "Heroku-like" experience in which developers face zero friction in pushing small changes frequently and are shielded from needing to know the exact steps being performed.
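
The QA-before-production rule can be pictured as a small eligibility check. The sketch below is only an illustration under assumed names (Build, deploy, DeployError and the "qa"/"prod" labels are all hypothetical); it is not HubSpot's internal tooling.

```python
from dataclasses import dataclass, field

# Hypothetical environment labels, chosen for this sketch only.
ENVIRONMENTS = ("qa", "prod")


@dataclass
class Build:
    """A CI build of the master branch (hypothetical model)."""
    number: int
    tests_passed: bool
    deployed_to: set = field(default_factory=set)


class DeployError(Exception):
    """Raised when a build is not eligible for the requested environment."""


def deploy(build: Build, environment: str) -> None:
    """Record a deploy, enforcing the QA-before-production rule."""
    if environment not in ENVIRONMENTS:
        raise DeployError(f"unknown environment: {environment}")
    if not build.tests_passed:
        raise DeployError(f"build {build.number} has failing tests")
    if environment == "prod" and "qa" not in build.deployed_to:
        raise DeployError(f"build {build.number} must be deployed to QA first")
    # A real tool would push artifacts to hosts here; the sketch only records it.
    build.deployed_to.add(environment)
```

With this shape, the developer's "deploy button" is a single call, and the ordering constraint lives in the tool rather than in anyone's memory.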

Core to the goal of small, safe deploys are feature flags (or gates). As a feature develops over time, it can be safely merged into master in stages behind a flag. This separates deploying code from releasing the feature. The primary advantage is that the developer controls who sees the new functionality while ensuring that it is technically sound in production. A typical feature might progress from being shown to just the developer, to her team, to customers in a beta group, and then to all customers. Depending on the scope of the feature, that may happen within hours or weeks. At any time the feature can be hidden again without another deploy. No longer should developers have to wait weeks to merge branches, or pray when they deploy a large set of changes to all customers (especially on a Friday at 5pm). The downside is some additional code complexity -- effectively extra "if" statements -- and a clean-up task after a feature is fully released. It is a simple concept that has been around for years but still appears to be used less than it deserves.
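
As a rough illustration of the staged rollout described above, here is a minimal in-memory sketch. The names (Stage, User, FeatureFlags, the "new-editor" flag) are assumptions made for the example, not HubSpot's actual gate system, and a production version would read flag state from a shared store so it can change without a deploy.

```python
from dataclasses import dataclass
from enum import IntEnum


class Stage(IntEnum):
    """Rollout stages, ordered from narrowest to widest audience."""
    OFF = 0
    DEVELOPER = 1
    TEAM = 2
    BETA = 3
    EVERYONE = 4


@dataclass(frozen=True)
class User:
    name: str
    # Narrowest stage whose audience includes this user (hypothetical field).
    audience: Stage = Stage.EVERYONE


class FeatureFlags:
    """In-memory flag store, standing in for shared, live-updatable state."""

    def __init__(self) -> None:
        self._stages = {}

    def set_stage(self, flag: str, stage: Stage) -> None:
        self._stages[flag] = stage

    def is_enabled(self, flag: str, user: User) -> bool:
        # Unknown flags default to OFF, so unreleased code stays hidden.
        stage = self._stages.get(flag, Stage.OFF)
        return stage != Stage.OFF and user.audience <= stage


flags = FeatureFlags()
flags.set_stage("new-editor", Stage.TEAM)
author = User("author", Stage.DEVELOPER)
customer = User("customer")  # an ordinary customer, outside any beta group

# The extra "if" statement that the flag adds to application code:
view_for_author = "new" if flags.is_enabled("new-editor", author) else "old"
view_for_customer = "new" if flags.is_enabled("new-editor", customer) else "old"
```

Widening the audience is a call to set_stage with a later stage, and setting the stage back to OFF hides the feature again without shipping new code.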

The only way to feel comfortable deploying frequently is to have insight into the state of the running service. In addition to off-the-shelf exception tracking and tracing tools, we've invested in ensuring every service has built-in health checks with reasonable defaults. An internal project, dubbed "Rodan", instruments services to collect standard metrics (e.g. requests/sec, server errors) as well as developer-defined ones. Each service is part of a family that has configurable alerts and PagerDuty integration, which lets teams set appropriate thresholds. This has struck a useful balance: developers own their alerting rules, and inboxes are not swamped with email alerts.
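
To make the idea of team-owned thresholds concrete, the sketch below evaluates a family's alert rules against a snapshot of metrics. It does not reflect Rodan's real interface, which is internal; AlertRule, evaluate, the metric names and the threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    """One threshold owned by the team that runs the service family."""
    metric: str
    threshold: float
    page: bool = False  # True: page the on-call; False: lower-priority notice


def evaluate(rules, metrics):
    """Return (rule, value) pairs for each rule whose metric exceeds its threshold."""
    firing = []
    for rule in rules:
        value = metrics.get(rule.metric)
        # A metric missing from the snapshot is skipped, not treated as zero.
        if value is not None and value > rule.threshold:
            firing.append((rule, value))
    return firing


# Hypothetical rules for one service family.
rules = [
    AlertRule("server_errors_per_min", threshold=5, page=True),
    AlertRule("requests_per_sec", threshold=1000),
]
snapshot = {"server_errors_per_min": 12, "requests_per_sec": 300}
firing = evaluate(rules, snapshot)
```

Only the server-error rule fires for this snapshot. Its page flag is what a PagerDuty integration would key on, while non-paging rules could go to a quieter channel.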

Having a distributed architecture means that recording application metrics is often the best way to see what's truly happening. Our applications capture a lot of runtime metrics (e.g. from Rodan, Dropwizard Metrics, etc.). To explore and visualize those metrics, a HubSpot developer created and open-sourced Lead.js. It pulls time-series data from Graphite or OpenTSDB and lets users interactively graph metric data. This has been invaluable for understanding behaviors when things go wrong.

Mesos, Singularity and beyond

HubSpot has long been a heavy user of Amazon EC2 for server hosting. More recently we've taken advantage of Apache Mesos and an open-sourced app we created called Singularity to manage our large EC2 cluster. Together they have driven significantly higher density per server, which reduces cost. Previously most services provisioned instances in isolation because coordination was too complicated. Even more important than the cost savings has been insulating developers from the details of any particular instance. Now the platform can handle problems like instance failures or availability-zone issues that previously required developer intervention. Mesos has been adopted by companies like OpenTable and Groupon, but it has required significant investment to see the benefits of this new platform. As we invest in our stack, we look to solve problems programmatically with small developer teams rather than with larger numbers of people in specialized operations roles.
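
The density gain from scheduling services onto shared hosts can be illustrated with a toy packing exercise. This is not how Mesos or Singularity place tasks; first-fit-decreasing is swapped in here purely as a simple stand-in, and the CPU demands and host size are made-up numbers.

```python
def first_fit_decreasing(demands, capacity):
    """Place each demand on the first host with room, largest demands first."""
    hosts = []  # each entry lists the demands placed on one host
    for demand in sorted(demands, reverse=True):
        if demand > capacity:
            raise ValueError(f"demand {demand} exceeds host capacity {capacity}")
        for host in hosts:
            if sum(host) + demand <= capacity:
                host.append(demand)
                break
        else:
            # No existing host had room, so start a new one.
            hosts.append([demand])
    return hosts


# Hypothetical CPU-core demands for eight services, and 8-core hosts.
demands = [2, 1, 3, 1, 2, 1, 4, 2]
isolated_hosts = len(demands)  # one instance per service, provisioned in isolation
shared_hosts = len(first_fit_decreasing(demands, capacity=8))
```

For these made-up numbers, the eight services need eight hosts when provisioned in isolation but fit on two shared hosts, which is the kind of density improvement the paragraph above refers to.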

This stack has evolved considerably over time as HubSpot has grown, and we expect that to continue. In particular, as we look to scale our existing products and build completely new ones, there will be a series of significant challenges to solve. Hopefully the principles behind how we choose to build software will guide us as we tackle them.

For more from HubSpot's Product Team, follow @hubspotdev or check out http://dev.hubspot.com/blog