Unknown Unknowns

Mark Sage - 12 min read - 26/10/2024

We know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know.

This is what the US Secretary of Defense, Donald Rumsfeld, said during a news briefing on February 12th, 2002, when responding to a question about potential Iraqi weapons of mass destruction.

Whilst it could be argued that many of those unknown unknowns never actually materialised as ‘knowns’, the phrase itself is genuinely useful, as it speaks to how we assess and manage risk.

From an operational perspective, being ‘operationally ready’ is not simply being able to run your loyalty programme, but also being ready for when things go wrong. In this respect, getting to know these ‘unknowns’ is a key part of ensuring you’re actually operationally ready.

We can’t plan for things we have no information on or can’t consider — the things we don’t yet know — but that in itself suggests a requirement: to ensure we do know; to go after the information so that we can make decisions.

Essentially, we’re trying to turn as many ‘unknown unknowns’ into ‘known knowns’ so that we can plan for them.

With operational readiness, there is always an element of the unknown. Things can go wrong that you weren’t expecting and it’s the support framework you put in place that helps to mitigate these risks and ultimately to respond to them when they happen. However, with operational readiness, it’s much better to draw out these potential issues before launch so that you can prevent them from happening entirely.

To do this requires a systematic approach to uncover these issues, gather information on them and put in place plans, product features and processes to manage them.

The challenge sometimes though is that there is a tendency to focus on the ‘happy path’.

The happy path refers to the way in which you expect users to participate in the programme. How they enrol, how they earn and burn, and how they check their points. Consideration may be given to alternative user flows, such as ‘forgotten password’ or ‘change of address’, but these are still just part of the happy path — the normal, expected usage for ‘normal’ users.

It can be tempting to assume that users are like us — have the same phone as us, have the same understanding as us.

The issue here is that this ‘segment of one’, where we apply our own view of the world on the wider user base, means we overlook the myriad of ways other users may interact with us.

They may be using an older handset with a smaller screen; they may turn off mobile data to save money; they may be using a temporary pay-as-you-go SIM card; they may be connected to a different app-store; they may no longer have access to the mobile or email they signed up with — they may not actually even have a smart device capable of taking part.

All these differences create unique ‘happy paths’ for your unique customers. For them, that situation is ‘normal’ and it’s your solution that should account for it and cater for it. It might not match what you planned, but it doesn’t make it incorrect.

As an example, I joined a travel retailer that was launching a new loyalty programme which had been in design for two years; as I joined, we were just heading into the final UAT phase.

Like yuu Rewards, it had an app-first strategy, as we wanted to create a much tighter connection with our customers and a richer, more rewarding marketing channel. Also, like yuu Rewards, we had an OTP (One Time Password) as part of the enrolment flow, so that we could verify the person’s mobile device and ensure the account was set up and managed securely.

The challenge, however, was that our members were travellers. They would generally be crossing a border to shop with us, and when you cross a border, you typically change telco provider. Suddenly, after launch, we had customers who didn’t have roaming enabled on their phone, or had a temporary travel SIM, or only had a data travel package, or whose local telco didn’t send roaming SMS in a timely way.

Whatever the reason, we were faced with a customer base who didn’t act like a domestic customer and our OTP process was failing, meaning enrolments were failing.

It didn’t fail for everyone, but it failed for enough people that we had to scramble for a temporary work-around whilst we re-planned and re-designed.

It didn’t have to be that way though. It didn’t have to be a case of ‘we don’t know what we don’t know’. We could have known, but to do that, we would have had to put ourselves in the customer’s shoes.

To be clear, yuu Rewards wasn’t perfect either and we had unexpected operational issues — the biggest being an assumption that users would remember at least some of their security credentials. In short, they didn’t — but that’s another story we’ll look at later.

However, we did get a lot right, and one of the reasons for that is the systematic process we utilised to map out and uncover issues a customer might have.

We called this process ‘A Day in the Life’ (DITL) and it was a way of mapping out, step by step, what each stakeholder might need to do. We didn’t invent this approach; it was something that one of the team members, Julie Scarth, had utilised whilst working on Virgin Atlantic’s loyalty programme. It was, though, a great tool for helping us dive down into the detail and forcing us to question each step.

The DITL format breaks down the various user journeys, such as ‘Downloading the App’, ‘Migrating an Account’ or ‘Enrolment’, so that we first ensure we have all the different capabilities covered.

Then, for each capability, we break it down further such as ‘Enrolment via iOS’ or ‘Enrolment via WeChatMP’; then further still such as ‘Enrolment via a mobile that hasn’t been used before’ or ‘Enrolment via a mobile that has been previously used’.

The idea with Day in the Life (DITL) is that, as a team, you write down all the different steps and drill into them in all the different directions.

It looks like a mix of user stories and acceptance tests, but these are essentially written from scratch.

We didn’t want to rely on assumed test coverage by simply reusing the original product requirements (or agile stories), which is what UAT would traditionally do. Instead, we imagined ourselves as the customer, starting from the marketing materials and working through the processes of joining and participating in the programme.

From scanning the QR code, to downloading the app, each step was mapped out and consideration given as to whether the user could go in a different direction.

What if the OTP didn’t arrive on time? Did they wait? Did they try again? What if it never arrived?

What if the mobile they used was already known to the system? Had they enrolled before? Had someone else enrolled last year on a pre-paid SIM that had now been recycled and re-issued?

Many of these journeys we’d already thought about during development, but that didn’t matter. As part of the DITL process, we would be re-thinking them again and re-testing them.

When mapping out the Day in the Life processes, it can be easy to only think about the end customers — the user. However, there are many users, and the DITL approach is specifically named to ensure we consider all of them.

So, we mapped out a Day in the Life of a Call Centre Agent, a Day in the Life of an In-Store Team Member, a Day in the Life of Marketing, a Day in the Life of Operations.

The purpose here was to force us to look at the product and processes through the eyes of each and every stakeholder, and to ensure we had systems and processes in place to serve them.

Again, we didn’t assume we’d already got things right and so didn’t start from existing SOPs or ‘Ways of Working’. Instead, we re-thought through the likely scenarios for each user of the different systems and mapped these out.

Whilst the Day in the Life (DITL) approach served us well in ensuring we’d maximised the coverage across the different functional areas, and tried to uncover as many unknown unknowns as we could, this hadn’t addressed the non-functional areas.

These non-functional areas such as system performance, security, reliability, and availability, also needed to be tested to uncover any unknown unknowns.

This is a much more well-trodden path, with established approaches such as performance testing, pen(etration) testing and integration testing; all of these designed to test the limits of the system in different ways.

One of the challenges with non-functional unknown unknowns, however, lies in the interface between business and IT.

Take performance testing, for example. The first question IT asks is: what’s your expected peak load? How many enrolments per day? How many transactions — not just per day, but per second? When you’re sending an SMS for each enrolment and unveiling the new programme all at the same time, that peak can get pretty high, pretty quickly.

If you choose to send a compelling offer as a push notification to all members at lunch-time, what’s your expected response? (For the record, the expected response was not to bring down our systems for about 20 minutes… so it’s worth thinking about!)
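As a back-of-envelope illustration of the kind of peak-load guess only the business can make, here is a quick sketch. Every figure below is a hypothetical assumption for illustration, not an actual programme number:

```python
# Back-of-envelope peak-load estimate. All figures are assumptions
# for illustration only, not actual yuu Rewards numbers.
launch_day_enrolments = 100_000   # assumed enrolments on launch day
campaign_window_hours = 2         # assumed: most arrive just after the announcement
burst_factor = 5                  # assumed: spikes run well above the window average

avg_per_second = launch_day_enrolments / (campaign_window_hours * 3600)
peak_per_second = avg_per_second * burst_factor

print(f"window average: {avg_per_second:.1f} enrolments/sec")
print(f"planning peak:  {peak_per_second:.1f} enrolments/sec (each one triggers an SMS)")
```

Even rough figures like these give the tech teams a concrete target to size against and, just as importantly, to performance-test against.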

Generally, the tech teams won’t want to be accountable for coming up with these numbers for two reasons.

Firstly, if they are wrong, and they’ve underestimated the business demand, then the buck tends to stop with them, even though they don’t know what the marketing plan is, and they don’t know how popular it is likely to be. Ultimately, only the business can make a guess on this — and it is a guess.

Secondly, if they over-estimate, this burns cash. Creating highly scalable and highly performant systems takes a lot of money. If you over-engineer it, and then only use 30% of the capacity, there will be some questions asked on the budget.

IT knows how to deliver a system that can meet the non-functional needs, but it falls on the business to define those needs and to essentially uncover the unknown unknowns.

A good example of this is around enrolment within yuu Rewards.

SMS was a key part of our enrolment process, as we used it to establish two-factor authentication during enrolment so we had something you know (your userid/password) and something you have (your mobile number). This meant that we needed rock-solid availability and scalability on SMS delivery, as any delays or downtime here would essentially block enrolment.

From a non-functional perspective, you could limit your scope and system responsibility to internal systems — the core yuu Rewards technical stack — making sure that these had resilience and scaling built in. However, this would ignore the downstream external third parties who were ultimately responsible for delivery, including the SMS gateway (in our case Twilio) and the telco providers themselves.

To understand this better, we brought these vendors in to see how they could support availability and how we could test it. As we started to unpack this end-to-end delivery, it was interesting to find ‘pinch points’ in the solution — for example, caps on the volume of SMS going through. We could work around them by switching on additional ‘connections’, but if we hadn’t looked, we’d never have known, and our SMS would have been ‘queued’ or, worse, bounced.

Ultimately though, even with a commitment for 24/7 hyper-care being provided by these partners at launch, we felt there were still too many unknown unknowns in the process. Too many partners who couldn’t guarantee delivery, as they were also dependent on someone else downstream.

So, we solved it by creating a known unknown. We basically created a new functional solution to bridge any non-functional issue — a release valve, you could say — to be used if things didn’t work out.

For SMS OTP, we allowed a bypass for new enrolments, whereby they could skip the OTP if it wasn’t arriving for any unknown reason and re-do this verification later.

To protect security and data integrity, we prevented redemption until the OTP was completed, which meant no value could leak out. We also provided a means for another account to lay claim to that same mobile number and move it to their account if they could provide an OTP — this was done, in part, to prevent any malicious actors from blocking our platform by mass enrolment of numbers.
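The shape of that release valve can be sketched as a simple state model. The class and function names below are illustrative assumptions, not the actual yuu Rewards implementation:

```python
# Sketch of the OTP 'release valve': enrolment may proceed without OTP, but
# the account stays restricted until the number is verified. Illustrative only.

class Account:
    def __init__(self, mobile):
        self.mobile = mobile
        self.verified = False        # OTP not yet completed

    def can_earn(self):
        # Earning stays open even while verification is pending.
        return True

    def can_redeem(self):
        # Redemption is blocked until the OTP completes, so no value can leak.
        return self.verified

    def complete_otp(self):
        self.verified = True


def claim_mobile(accounts, mobile, claimant):
    """Move a mobile number to whichever member can actually pass the OTP.

    Mirrors the safeguard against blocking the platform through mass
    enrolment of numbers that were never verified."""
    for acct in accounts:
        if acct.mobile == mobile and not acct.verified:
            acct.mobile = None       # release the number from the stale account
    claimant.mobile = mobile
    claimant.complete_otp()
```

An unverified enrolment can then earn but not redeem; once `complete_otp()` runs, redemption unlocks, and a stale, never-verified claim on a number can always be displaced by someone who can pass the OTP.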

We repeated this approach across many different areas.

For POS integration, we utilised a ‘store and forward’ model at the POS, so that if connectivity was down for any unknown reason, we could keep in-store earning live and catch up later.
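A minimal sketch of such a store-and-forward queue, with illustrative names, might look like this:

```python
# Minimal store-and-forward sketch: earn events are queued locally at the POS
# and forwarded when connectivity allows. Names are illustrative assumptions.
from collections import deque

class PosEarnQueue:
    def __init__(self, send_fn):
        self.pending = deque()
        self.send_fn = send_fn        # posts one event upstream; may raise

    def record_earn(self, member_id, points):
        # Always accept the earn locally, so in-store earning stays live.
        self.pending.append({"member": member_id, "points": points})
        self.flush()

    def flush(self):
        # Forward queued events in order; on failure, stop and retry later.
        while self.pending:
            try:
                self.send_fn(self.pending[0])
            except ConnectionError:
                return                # network down: keep events for next time
            self.pending.popleft()
```

The key property is that `record_earn` never fails on the customer: the worst case is that the event sits in the queue until the next successful `flush`.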

On the app itself, we used localised caching of all data, including offers and redeemed coupons, such that the app was fully functional for a member even without data — with the exception that they could not make a new redemption.
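A cache-first pattern along those lines can be sketched as follows (illustrative names; the real app logic is considerably richer):

```python
# Cache-first app data sketch: every read is served from the local cache, and
# a refresh replaces the cache only when the network allows. Illustrative only.
class AppCache:
    def __init__(self):
        # Last known good snapshot; starts empty on first install.
        self.data = {"offers": [], "coupons": [], "balance": 0}

    def refresh(self, fetch_fn):
        try:
            self.data = fetch_fn()    # may raise when offline
        except ConnectionError:
            pass                      # keep serving the last good snapshot

    def get(self, key):
        return self.data[key]

    def can_redeem(self, online):
        # A new redemption is the one action that needs a live connection.
        return online
```

A failed `refresh` is invisible to the member: they simply keep seeing the last good snapshot, with only new redemptions gated on connectivity.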

Even with a robust approach to operational readiness, things can still go wrong after launch, as they did for yuu Rewards when the app simply wouldn’t open for thousands of members.

When someone opens the yuu Rewards app, it immediately tries to connect to download the latest offers, balance information, rewards, status, etc. It basically needs to update and refresh all of the cached information it has, and as part of this process, it renews the security token for the user if it has expired.

Except in this case, that process broke for an unknown reason.

As each customer’s token expired, we got more and more members whose app wouldn’t open — it was just hanging.

When the mobile device itself has no data and can’t reach the internet, then the app simply works offline and this is almost transparent to the user. But, if the app can actually connect, but can’t get all the right answers — if there is an unknown and unexpected break downstream, as happened with the token — then things can get tricky.

A hanging app is a big problem.

With 50% of app opens happening whilst the member is standing at the till, when the app fails, everything fails. It slows down the queue, frustrates customers and frustrates cashiers — frustrations we saw pretty quickly as stores started reporting issues and customers started contacting us directly.

We ultimately fixed that single issue around the security token failing, but it also highlighted a weakness in our overall solution design — the app driven loyalty id. If another unknown unknown emerged, we’d have the same challenge again.

To combat this, as we’d done with SMS, we put in place a release valve that would work, when other areas had stopped working. We didn’t simply fix the bug, we put a workaround in to mitigate any similar bugs that might appear — ensuring we now had a known unknown.

On the app load screen, you could now see and access your loyalty id instantly. It works whilst the app is loading, regardless of what the app is doing in the background. Whether connected or not, updating or hanging — the loyalty id is cached, instantly accessible, and always there to let the member earn.
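The principle behind that load-screen behaviour can be sketched as: show the cached id first, refresh later, and never let the refresh block the screen. Names here are illustrative assumptions, not the actual app code:

```python
# Sketch of an 'instant access loyalty id' load screen: the cached id renders
# immediately, while the network refresh runs in the background where it can
# fail or hang without blocking the screen. Illustrative names throughout.
import threading

class LoadScreen:
    def __init__(self, storage, refresh_fn):
        self.storage = storage        # persistent on-device key-value store
        self.refresh_fn = refresh_fn  # background sync; may hang or fail

    def open_app(self):
        # 1. Show the cached loyalty id straight away: no network involved.
        loyalty_id = self.storage.get("loyalty_id")
        # 2. Refresh in the background; a broken refresh never blocks step 1.
        threading.Thread(target=self._safe_refresh, daemon=True).start()
        return loyalty_id

    def _safe_refresh(self):
        try:
            self.refresh_fn()
        except Exception:
            pass                      # downstream breakage stays invisible here
```

The design choice is simply ordering: the id comes from durable local storage before any network call starts, so a hanging token renewal downstream can no longer take the id down with it.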

Just a few months after we’d added that capability, it was put to use in a completely different scenario — something we also hadn’t thought about or planned for, another unknown unknown. Yet, the new ‘instant access loyalty id’ feature worked flawlessly — allowing all members to continue earning and to continue accumulating points, whilst the team worked to resolve the wider issue.

Dealing with unknown unknowns requires a methodical and proactive approach. As the saying goes, ‘a stitch in time saves nine’, and it’s no different with operational readiness. Finding issues today will pay dividends tomorrow, in terms of programme stability, customer experience and, ultimately, member engagement.

Rather than simply building, testing, launching and then fixing things as they appear, a desire for robust operational readiness asks you to consider things in two ways.

Think like a user; every user — Working through every user, whether the cashier or the call centre agent, helps ensure you’ve thought about the different problems to solve. How can a cashier void a basket? How can a customer unlock their account if they’ve changed their SIM? It forces you to think differently and helps to uncover unknown unknowns.

Don’t simply fix. Mitigate — When something breaks, it shows you not only where the code went wrong, but also where the overall process went wrong. If you design to support things breaking, then overall, more things keep working. Not being able to log in doesn’t need to prevent earning. Not being able to get the OTP doesn’t need to prevent enrolment. Engineering ways to control for potential unknown issues — creating known unknowns — puts you back in control.

Ultimately, the best approach to operational readiness is to make your unknowns known where you can, and create support for known unknowns where you can’t.

Let’s collaborate

If you’re exploring how to shape customer behaviour — through loyalty, platforms, or data —
there’s always more to unpack.

Sometimes that starts with a conversation.
Sometimes it turns into something more.

Customer platforms, loyalty, and behaviour design
