September 28, 2023
In our previous article, we introduced the concept of device reliability engineering (DRE) and explored the importance of integrating DRE practices into embedded team workflows. In this article, we'll take it further to explain why IoT developers need to implement DRE techniques to scale device fleets without compromising quality.
The Growing Challenges of IoT
When developing IoT devices, one universal truth prevails: you will ship bugs. No matter how rigorous your QA process, how talented your team, or how long you delay launch, your device is bound to ship with issues. Security holes, battery drain, random resets: any of these bugs can disrupt the customer’s experience. A wristwatch owner might grumble about a battery that depletes too quickly, or a store may have to turn customers away while its POS system is down. Depending on the severity of the use case, disgruntled users may simply be annoyed, or they may air their dissatisfaction on social platforms, potentially causing severe harm to your business's reputation.
Advancements in connectivity and hardware have paved the way for extraordinary use cases at the edge. However, these same developments also increase the probability of encountering crippling challenges during product development, operations, and maintenance. Having a firmware architecture plan that accounts for these issues before they become problems is critical.
Software teams faced similar challenges and responded with software reliability engineering (SRE): techniques for developing and maintaining software systems whose reliability can be quantitatively and regularly evaluated. The need is real. Consider embedded software pioneer Jack Ganssle’s contention about the reality of software engineering: the elite (top 1%) inject about 11 bugs per thousand lines of code (KLOC) from requirements gathering through coding, while the remaining 99% average about 120 bugs per KLOC.
Applying the same maxim to IoT device development underscores the need to plan for firmware development and updates in advance.
A More Scalable & Sustainable Strategy
Back to the unhappy customers. Any company or developer of IoT devices knows there will be bugs, security issues, and missing features, and anticipating these post-launch issues should be part of the IoT product lifecycle. DRE includes the engineering practices, infrastructures, and tools that can be used to manage device reliability at scale, post-launch.
End users place a premium on reliability, yet reliability varies greatly across software, hardware, and systems. The IoT ecosystem’s interconnectedness requires hardware developers to approach product development differently from how they approached earlier embedded devices. The most significant shift is away from a launched-and-done approach, where developers wrote static firmware for commoditized products and had no further interaction or engagement with the product once launched. Now, whether over Bluetooth LE, LTE, Zigbee, or a mesh network, IoT devices are connected and regularly transmitting sensitive and personal data to and from the cloud. Device manufacturers must integrate the types of long-term reliability tools that have proven successful for their software counterparts.
DRE suits the high price points and near-limitless reach of modern connected devices across industries, and it offers IoT developers a solid approach for building reliable, updateable, and long-lasting devices.
Employing DRE Techniques
Implementing three key DRE techniques can help teams everywhere build more reliable, high-quality devices and de-risk product launches.
1. Comprehensive OTA Update Management
Over-the-air (OTA) updates deliver new software, firmware, or other data to connected devices from the cloud, giving you an insurance policy against the issues you ship and reducing the need for product recalls. With OTA firmware updates, developers can fix bugs and release new features while keeping devices operational.
The key is architecting OTA correctly so that updates succeed reliably and can be tested thoroughly. A successful system gives you granular control over your releases, and visibility into them, to reduce potential risks.
Granular release management with A/B tests, beta tests, and other experiments enables methodical and low-risk releases. The same control is necessary when working with industrial customers who want updates on different schedules. One way to manage this is to create groupings of devices, called Cohorts, which allow groups of devices to be updated separately.
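To make the cohort idea concrete, here is a minimal sketch in Python. The `Cohort` class, the `target_for` helper, and the device IDs and version strings are all hypothetical names chosen for illustration; they do not describe any particular platform's API.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """A named group of devices that share one release target (hypothetical model)."""
    name: str
    target_version: str
    device_ids: set = field(default_factory=set)


def target_for(device_id, cohorts, default_version):
    """Return the version a device should run, based on the first cohort containing it."""
    for cohort in cohorts:
        if device_id in cohort.device_ids:
            return cohort.target_version
    # Devices outside every cohort stay on the default release.
    return default_version


cohorts = [
    Cohort("beta", "1.3.0-beta", {"dev-001", "dev-002"}),
    Cohort("factory-a", "1.2.0", {"dev-100"}),
]
print(target_for("dev-001", cohorts, "1.1.0"))  # beta device gets the beta build
print(target_for("dev-999", cohorts, "1.1.0"))  # unassigned device gets the default
```

Keeping the cohort lookup on the server side means a device only has to ask which version it should run, so a customer's schedule can change without touching firmware.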
Complete upgrade path control is necessary to handle the most complex migrations. Having access to multiple release types enables this control. With must-pass-through releases, you can force devices to install a specific software version before moving to any later version, which matters when newer versions cannot be installed directly from older ones. Alternatively, you can use delta releases to apply a lightweight update to devices in the field.
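One way to picture must-pass-through releases is as a rule for choosing a device's next version. The sketch below is an assumption about how such a rule could work, not a description of a specific product: versions are plain tuples, and `next_release` and `upgrade_path` are hypothetical helper names.

```python
def next_release(current, releases):
    """Pick the next version to install for a device running `current`.

    `releases` is a list of (version_tuple, must_pass_through) pairs.
    Returns None when the device is already up to date.
    """
    newer = sorted((v, mpt) for v, mpt in releases if v > current)
    if not newer:
        return None
    for version, must_pass_through in newer:
        if must_pass_through:
            # Stop at the first mandatory release instead of skipping it.
            return version
    # No mandatory stop in the way: jump straight to the latest release.
    return newer[-1][0]


def upgrade_path(current, releases):
    """List every version the device installs on its way to the latest release."""
    path = []
    while (nxt := next_release(current, releases)) is not None:
        path.append(nxt)
        current = nxt
    return path


releases = [
    ((1, 0, 0), False),
    ((1, 1, 0), True),   # e.g. a release that migrates stored data
    ((1, 2, 0), False),
    ((2, 0, 0), False),
]
print(upgrade_path((1, 0, 0), releases))  # [(1, 1, 0), (2, 0, 0)]
```

A device on 1.0.0 stops at 1.1.0 first, then skips 1.2.0 and goes directly to 2.0.0, because only 1.1.0 is marked as mandatory.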
Incremental rollouts reduce the risk of shipping a new bug to every device at once. Every release contains potential issues, so gradually rolling out updates limits the blast radius of a new issue and, crucially, prevents a problem from impacting all customers simultaneously. Developers can use staged rollouts to restrict the number of devices receiving a release until they are confident that the release is working as expected.
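A common way to implement a staged rollout is to hash each device into a stable bucket and compare that bucket against the current rollout percentage. The sketch below assumes that approach; `rollout_bucket`, `is_eligible`, and the device and release identifiers are hypothetical.

```python
import hashlib


def rollout_bucket(device_id, release_id):
    """Map a device to a stable bucket in [0, 100) for a given release."""
    digest = hashlib.sha256(f"{release_id}:{device_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(device_id, release_id, rollout_percent):
    """A device receives the release once the rollout percentage passes its bucket."""
    return rollout_bucket(device_id, release_id) < rollout_percent


fleet = [f"dev-{i:04d}" for i in range(1000)]
for percent in (1, 10, 50, 100):
    count = sum(is_eligible(d, "2.0.0", percent) for d in fleet)
    print(f"{percent:>3}% rollout -> {count} devices eligible")
```

Because the bucket depends only on the device and release IDs, raising the percentage only ever adds devices: anything that received the release at 10% is still included at 50%. Mixing the release ID into the hash also means the same devices are not always first in line for every release.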
Firmware update security is one of the most critical parts of OTA architecture. Firmware code signing creates a verifiable signature for a file, proving that it came from a trusted source and hasn't been tampered with. By implementing signature verification in a bootloader, developers can verify the authenticity of a given firmware update, and the bootloader can then decide whether to warn the user, void the device's warranty, or simply refuse to run an unauthenticated binary.
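The verify-then-decide flow can be sketched in a few lines of Python. One important caveat: real firmware signing normally uses an asymmetric scheme such as ECDSA or Ed25519, so the device only ever holds a public key. Python's standard library has no asymmetric signatures, so this sketch substitutes HMAC-SHA256 with a shared key purely to show the control flow. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def sign_firmware(image: bytes, key: bytes) -> bytes:
    """Build-server side: produce an authentication tag for the image.

    Stand-in only: HMAC uses a shared secret, unlike real code signing.
    """
    return hmac.new(key, image, hashlib.sha256).digest()


def verify_firmware(image: bytes, tag: bytes, key: bytes) -> bool:
    """Bootloader side: recompute the tag and compare in constant time."""
    expected = hmac.new(key, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)


def boot_decision(image: bytes, tag: bytes, key: bytes) -> str:
    """Refuse to run an image that fails verification."""
    return "boot" if verify_firmware(image, tag, key) else "refuse"


key = b"demo-key-not-for-production"
image = b"\x7fFIRMWARE v2.0.0 payload"
tag = sign_firmware(image, key)

print(boot_decision(image, tag, key))            # boot
print(boot_decision(image + b"\x00", tag, key))  # refuse (image was modified)
```

Changing even one byte of the image changes the expected tag, so the tampered image is rejected. In a production design the same check would run against a public key stored in the bootloader.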
2. Performance Metrics
After launching a product, IoT developers need access to performance metrics to measure device fleet health accurately. Some of the most common metrics to prioritize are:
The system collecting these metrics needs three essential characteristics:
By collecting datapoints for individual devices, developers can investigate device anomalies reported through customer support or engineering teams. Organizations should capture metric values, and changes in metric behavior, on a timeline. The customer support or engineering team answering customer calls can then identify operational correlations, such as between battery use and writes to flash. A robust metric system makes that possible.
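As an illustration of reading a per-device timeline, the sketch below turns periodic samples into per-interval changes so that battery drop and flash writes can be compared side by side. The sample fields (`ts`, `battery_pct`, `flash_writes`) and the `interval_deltas` helper are hypothetical, and the data is made up for the example.

```python
def interval_deltas(samples):
    """Convert cumulative samples (sorted by 'ts') into per-interval changes."""
    deltas = []
    for prev, cur in zip(samples, samples[1:]):
        deltas.append({
            "start": prev["ts"],
            "end": cur["ts"],
            "battery_drop": prev["battery_pct"] - cur["battery_pct"],
            "flash_writes": cur["flash_writes"] - prev["flash_writes"],
        })
    return deltas


# Hourly samples from one device; flash_writes is a cumulative counter.
samples = [
    {"ts": 0,     "battery_pct": 100, "flash_writes": 0},
    {"ts": 3600,  "battery_pct": 99,  "flash_writes": 10},
    {"ts": 7200,  "battery_pct": 94,  "flash_writes": 500},
    {"ts": 10800, "battery_pct": 93,  "flash_writes": 510},
]

worst = max(interval_deltas(samples), key=lambda d: d["battery_drop"])
print(worst)  # the hour with the largest battery drop also has the most flash writes
```

Here the second hour stands out on both measures, which is the kind of correlation a support or engineering team would want to follow up on.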
One major benefit of performance metrics is that they can be captured asynchronously, which is handy for limited-connectivity devices. Beyond individual metrics, there should be some level of aggregation, with dashboards that indicate overall fleet health and make data trends quick to identify.
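Fleet-level aggregation can be as simple as grouping device reports by firmware version and summarizing each group. The sketch below assumes each report carries a `version` and a `battery_drop_pct_per_hour` field; those field names and the `fleet_summary` helper are hypothetical.

```python
from collections import defaultdict
from statistics import median


def fleet_summary(reports):
    """Summarize a battery metric per firmware version across all reports."""
    by_version = defaultdict(list)
    for report in reports:
        by_version[report["version"]].append(report["battery_drop_pct_per_hour"])
    return {
        version: {
            "reports": len(values),
            "median_drop": median(values),
            "worst_drop": max(values),
        }
        for version, values in by_version.items()
    }


reports = [
    {"device_id": "dev-1", "version": "1.2.0", "battery_drop_pct_per_hour": 1.0},
    {"device_id": "dev-2", "version": "1.2.0", "battery_drop_pct_per_hour": 1.2},
    {"device_id": "dev-3", "version": "1.3.0", "battery_drop_pct_per_hour": 2.9},
    {"device_id": "dev-4", "version": "1.3.0", "battery_drop_pct_per_hour": 3.1},
]
for version, stats in sorted(fleet_summary(reports).items()):
    print(version, stats)
```

Because reports can arrive whenever a device regains connectivity, the summary is computed over whatever has been received so far. Comparing versions side by side is one quick way to spot a regression introduced by a release.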
Another key use for metrics is alerting. A metric system should let you configure alerts so that, when certain conditions are met, it immediately notifies the team via email, Slack, or an incident management platform instead of waiting for a team member to review the charts.
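A threshold-based alert configuration might look like the following sketch. The `AlertRule` structure, the metric names, and the channel labels are hypothetical; actually delivering a message to email or Slack is left out, and the function only returns what would be sent.

```python
import operator
from dataclasses import dataclass

# Comparison operators an alert rule may use.
OPS = {">": operator.gt, "<": operator.lt, ">=": operator.ge, "<=": operator.le}


@dataclass
class AlertRule:
    metric: str
    op: str
    threshold: float
    channel: str  # e.g. "email" or "slack"; delivery itself is out of scope here


def evaluate(rules, snapshot):
    """Return (channel, message) pairs for every rule the snapshot violates."""
    fired = []
    for rule in rules:
        value = snapshot.get(rule.metric)
        if value is not None and OPS[rule.op](value, rule.threshold):
            fired.append((rule.channel, f"{rule.metric}={value} {rule.op} {rule.threshold}"))
    return fired


rules = [
    AlertRule("crash_rate_pct", ">", 2.0, "slack"),
    AlertRule("median_battery_life_h", "<", 20.0, "email"),
]
snapshot = {"crash_rate_pct": 3.5, "median_battery_life_h": 26.0}
print(evaluate(rules, snapshot))  # [('slack', 'crash_rate_pct=3.5 > 2.0')]
```

Running a check like this each time the fleet aggregates refresh turns the dashboard into something that reaches out to the team, rather than something the team has to remember to look at.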
3. Remote Debugging
Traditionally, debugging starts with scattered reports of customer issues. While they may all relate to the same problem, customers don’t provide descriptions consistent or detailed enough for the support specialists answering the phone or reading emails to understand the issue entirely. Eventually, the business gathers enough feedback and manually converts it into logs. With sufficient data, teams get devices back to the lab and into engineers' hands. This process is slow and expensive.
Remote debugging accelerates the process at a much lower cost. Devices report issues automatically to a cloud pipeline that analyzes the data, collates reports of individual crashes and errors, intelligently groups traces of the same issue type together, and shares those reports with engineering.
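Grouping traces of the same issue is often done by fingerprinting each crash report. The sketch below assumes one simple heuristic: hash the fault reason together with the top few stack frames. The report fields, the `fingerprint` and `group_reports` helpers, and the sample data are hypothetical, and a real pipeline would likely use more signals.

```python
import hashlib
from collections import defaultdict


def fingerprint(report, depth=3):
    """Hash the fault reason plus the top `depth` stack frames (function names only)."""
    frames = report["frames"][:depth]
    key = "|".join([report["reason"], *frames])
    return hashlib.sha256(key.encode()).hexdigest()[:12]


def group_reports(reports):
    """Collect the device IDs that reported each distinct fingerprint."""
    groups = defaultdict(list)
    for report in reports:
        groups[fingerprint(report)].append(report["device_id"])
    return dict(groups)


reports = [
    {"device_id": "dev-1", "reason": "HardFault",
     "frames": ["flash_write", "log_flush", "main_loop", "task_a"]},
    {"device_id": "dev-2", "reason": "HardFault",
     "frames": ["flash_write", "log_flush", "main_loop", "task_b"]},
    {"device_id": "dev-3", "reason": "Watchdog",
     "frames": ["ble_poll", "main_loop"]},
]
for issue, devices in group_reports(reports).items():
    print(issue, devices)
```

The first two reports differ only below the third frame, so they land in the same group, while the watchdog reset forms its own. Engineering then sees two issues, ranked by how many devices hit each, instead of three unrelated tickets.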
By adopting DRE techniques, developers can de-risk a product launch, prepare for the inevitability of post-launch issues, and deliver a continuously improving, higher-quality IoT product overall.
The Memfault platform enables businesses to improve IoT devices with DRE. Memfault gives your team scalable and sustainable processes that will increase engineering efficiency and collaboration to accelerate IoT and edge device delivery while minimizing risk.
If you’re interested in learning more about Memfault, talk to our team.