Balena's Support as Field Research

Balena’s support is quite unlike that of most other technology companies. At balena, every engineer, from the most junior to the most senior, is responsible for spending some time (ideally 20% of their workweek) providing support to our users. The obvious benefit is that our customers get top-tier support from the very people who built the products, but an arguably bigger benefit is that our team gets to see how our products work out in the real world.

Balena’s CEO, Alex, has long joked about renaming support to “field research”. Readers familiar with our emphasis on feedback loops may already recognize why. By having our team provide boots-on-the-ground support to our users, we are fully exposed to the realities of how our products work (and fail) in practice. Through the loop process of identifying patterns, brainstorming with the team, and iterating on our products, the feedback we receive from our support channels directly shapes the future of what we make.

Scheduling Support

As a remote, asynchronous team scattered across the globe, we are able to offer support 21 hours of the day. To make sure that support shifts are fully covered, while also ensuring no one team member has to provide support for too long at a time or outside of their self-specified working hours, we use a constraint-programming scheduling algorithm. You can read more about it in our Scheduler section.
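
To make the constraints above concrete, here is a minimal sketch of shift assignment as a constraint-satisfaction problem. It is not our actual scheduler: it uses plain backtracking search rather than a constraint-programming solver, the agent names and hours are hypothetical, and "not too long at a time" is simplified to an assumed daily cap per person.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical data: each agent's self-specified working hours (UTC, end exclusive).
WORKING_HOURS: Dict[str, Tuple[int, int]] = {
    "ana": (6, 14),
    "bo": (12, 20),
    "chen": (0, 8),
}
MAX_HOURS_PER_AGENT = 4  # assumed cap on support hours per person per day


def can_take(agent: str, hour: int, load: Dict[str, int]) -> bool:
    """An agent may cover an hour inside their working hours while under the cap."""
    start, end = WORKING_HOURS[agent]
    return start <= hour < end and load[agent] < MAX_HOURS_PER_AGENT


def solve(hours: List[int]) -> Optional[Dict[int, str]]:
    """Assign one agent to every hour via backtracking, or return None if impossible."""
    assignment: Dict[int, str] = {}
    load = {agent: 0 for agent in WORKING_HOURS}

    def backtrack(i: int) -> bool:
        if i == len(hours):
            return True  # every hour is covered
        hour = hours[i]
        for agent in WORKING_HOURS:
            if can_take(agent, hour, load):
                assignment[hour] = agent
                load[agent] += 1
                if backtrack(i + 1):
                    return True
                # Undo the choice and try the next agent.
                del assignment[hour]
                load[agent] -= 1
        return False

    return assignment if backtrack(0) else None


schedule = solve(list(range(4, 16)))  # cover 04:00 to 16:00 UTC
```

With this sample data the only valid schedule gives chen hours 4 to 7, ana hours 8 to 11, and bo hours 12 to 15. An hour that falls outside everyone's working hours makes `solve` return `None`.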

A Case Study of the Loop at Work

A central idea that informs our understanding of support is that each support thread is most often the result of a highly motivated individual taking the time to seek help. We’re keenly aware that the vast majority of issues with our products won’t ever be surfaced to us through support. Because of this, we assume that for every support issue we see, several times as many people have the same problem and don’t report it. On top of that, as our products develop and we continue to onboard new users, perhaps another 10x as many people will eventually run into that same issue. Keeping this in mind, we often find ourselves building features directly in response to particular support threads, especially if the issue is one that we could see a lot of people running into.

One example of such a feature build was the addition of device metrics to our dashboard on balenaCloud. This feature provides a simple graphical view of device stats such as CPU load, temperature, and current storage capacity. What’s now a standard feature for any provisioned device started out as a support thread from a customer who had more than 40% of their fleet (over 500 devices) mysteriously stop responding. Let’s take a deeper look at this particular support inquiry and follow its timeline through our internal loop process as it eventually evolved into a major feature release.

Part 1: The Support Thread

In July of 2020, we received a support request from one of our enterprise customers. They had woken up one day to find that more than 40% of their fleet was unresponsive.

After the customer provided us with access to their fleet, we were able to quickly trace the issue to a lack of storage space. The affected devices had simply run out of available space on their SD cards and stopped working. We gave the customer some guidance on freeing up space and helped them resolve this particular issue.

Part 2: Internal Discussion

After the support thread itself was closed, however, internal discussion about the issue continued. As a preventative solution for the future, one member of the team suggested we create a series of blog posts to help newer users become more aware of the causes of common fleet failures. Others chimed in and agreed this was a good idea and something our users would benefit from. A little while later, David, a member of our team who had helped handle the original support issue, revived an earlier discussion about better device metrics on the fleet dashboard as another potential solution. The idea was that if the fleet owner had been able to see a simple visual of the remaining storage capacity for all the devices in their fleet, they would have noticed that storage was filling up and known to take steps to prevent an outage.
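
The core of that idea can be sketched in a few lines: given per-device storage figures, surface the devices that are close to full. This is an illustration only, not how the dashboard is implemented. The `DeviceStorage` type, the device names, and the 90% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeviceStorage:
    """Hypothetical per-device storage reading."""
    name: str
    used_bytes: int
    total_bytes: int

    @property
    def used_fraction(self) -> float:
        return self.used_bytes / self.total_bytes


def nearly_full(devices: List[DeviceStorage], threshold: float = 0.9) -> List[str]:
    """Return names of devices at or above the threshold, fullest first."""
    flagged = [d for d in devices if d.used_fraction >= threshold]
    flagged.sort(key=lambda d: d.used_fraction, reverse=True)
    return [d.name for d in flagged]


# Made-up sample fleet; real readings would come from the devices themselves.
fleet = [
    DeviceStorage("kiosk-01", used_bytes=15_200, total_bytes=16_000),
    DeviceStorage("kiosk-02", used_bytes=6_000, total_bytes=16_000),
    DeviceStorage("kiosk-03", used_bytes=15_900, total_bytes=16_000),
]
print(nearly_full(fleet))  # ['kiosk-03', 'kiosk-01']
```

A fleet owner looking at a list like this would see which devices need attention well before they stop responding.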

David then decided to take the discussion to one of our weekly product calls, where we discuss the state and future of our balena.io product, so he created a brainstorm topic with a summary of the issue. Creating the brainstorm topic, and attaching it to the agenda for the appropriate call, did not require David to do any context switching. Instead, he used the “#product” hashtag directly inside the Flowdock thread where our internal discussion was happening, including a summary of the problem and proposed solution, and our cross-platform integration automatically created a brainstorm topic inside Jellyfish with the details that David included in his Flowdock message:

[Screenshot: the brainstorm topic created in Jellyfish]
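
The hashtag-based routing described above might look roughly like the following sketch. It does not use the real Flowdock or Jellyfish APIs: `BrainstormTopic`, `AGENDAS`, and `topics_from_message` are hypothetical names, and the only behaviour taken from the text is that a “#product” hashtag in a message yields a brainstorm topic carrying that message's details.

```python
import re
from dataclasses import dataclass
from typing import List

HASHTAG = re.compile(r"#(\w+)")

# Hypothetical mapping from hashtag to the call whose agenda receives the topic.
AGENDAS = {"product": "weekly product call"}


@dataclass
class BrainstormTopic:
    agenda: str
    summary: str
    source_thread: str  # lets readers trace the topic back to the discussion


def topics_from_message(text: str, thread_id: str) -> List[BrainstormTopic]:
    """Create one topic per recognised hashtag; the summary is the text minus hashtags."""
    summary = " ".join(HASHTAG.sub("", text).split())
    return [
        BrainstormTopic(AGENDAS[tag.lower()], summary, thread_id)
        for tag in HASHTAG.findall(text)
        if tag.lower() in AGENDAS
    ]


topics = topics_from_message(
    "#product Devices ran out of storage; show storage metrics on the dashboard",
    thread_id="thread-123",
)
```

Keeping a reference to the source thread in the topic is what lets someone on the call follow the discussion back to the original support issue.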

When the time came for the actual product call, David explained the problem, referencing the original support thread, and also offered the device metrics idea as a potential solution. The team agreed that this solution made sense, and the discussion was then earmarked to be brought to an architecture call. Our “arch calls”, as they’re known, get into the nuts and bolts of the architecture behind our products. Once this second call took place, and the technical feasibility of a device metrics dashboard was established, the team got the go-ahead to begin the work required to make this feature a standard for any provisioned device.

It’s important to take a step back here and note that, so far in this process, there has been no formal push from any layer of management to move this initiative along. The team members involved in resolving the original support thread were self-directed in starting internal discussions, ideating on potential solutions, and bringing the topic to the weekly calls. The rest of the team then pitched in, and the whole discussion followed a very natural path that didn’t rely on any managers, timelines, or external pressure. This is exactly the process that we want our culture to encourage.

Part 3: Release

Any time a new product feature gets the go-ahead from the broader team, the first step of implementation is to draft a “spec”. This spec outlines exactly what will be built, divvies up the work among members of the team, and sets down a tentative timeline leading up to the feature release.