Lessons Learned Using an AI Army to Test Office
From 2016 until March of 2023, I worked on using AI to test the desktop applications in Microsoft Office. A set of thousands of AI agents running on machines in a lab would learn how to navigate, control, and observe the Microsoft Office client applications. It was an incredible learning experience, and I am going to share some of it with you.
As the tool in question has not been released publicly by its creators, I am going to describe it indirectly, without getting specific. The capabilities I describe are not much different from those of other tools that have previously been on the market, so I am comfortable saying as much as I do; these ideas are already out there.
How the AI learned to control the application
The AI agents would launch the application and record the UI state using the Microsoft Windows Accessibility API. Each agent would consider possible actions from those available given the UI state and select one to act upon. After taking the action, the agents would record the UI state change, and send that to a service running in the cloud. This service would build a graph of the states and actions coming from all the different agents, resulting in a model that describes how to traverse the application from one state to another via a series of actions.
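To make the loop concrete, here is a minimal sketch of the agent side of that design. It is not the actual tool; the control wrapper, the service endpoint, and the state hashing are assumptions standing in for whatever the real system does on top of the accessibility tree.

```python
import random
import requests  # assumed transport; the real agents used their own service protocol

MODEL_SERVICE_URL = "https://model-service.example/transitions"  # placeholder endpoint

def observe_state(window):
    """Reduce the accessibility tree to a hashable description of the UI state.
    `window` is a hypothetical wrapper over the Windows UI Automation tree."""
    controls = sorted((c.control_type, c.name, c.is_enabled) for c in window.descendants())
    return hash(tuple(controls))

def run_agent(app, steps=200):
    """Explore: pick any enabled control, act, and report the state transition."""
    state = observe_state(app.top_window())
    for _ in range(steps):
        candidates = [c for c in app.top_window().descendants() if c.is_enabled]
        action = random.choice(candidates)   # nothing scripted: any enabled control will do
        action.invoke()                      # perform the action through the accessibility API
        new_state = observe_state(app.top_window())
        # The cloud service aggregates these edges from every agent into a graph
        # describing how to traverse the application from one state to another.
        requests.post(MODEL_SERVICE_URL, json={
            "state": state,
            "action": [action.control_type, action.name],
            "new_state": new_state,
        })
        state = new_state
```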
While the agents were sending their various pieces of the model to the service, we would monitor the product telemetry stream. We would watch for any behaviors in the stream we wanted the tool to learn. Those behaviors were called rewards, and we would send the service a list of the rewards we observed, each with a time stamp and a machine identifier. We also sent the service a distribution pattern stating what percentage of the workload we wanted to aim at each reward.
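The reward plumbing amounted to two pieces of data. The shapes below are my own illustration (the reward names and field names are invented), but they capture what we sent: observed reward events with a time stamp and machine identifier, plus a target distribution over the workload.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class RewardEvent:
    reward_id: str        # a telemetry behavior we want the agents to learn to reach
    machine_id: str       # which lab machine emitted it
    timestamp: datetime   # when it appeared in the telemetry stream

# Target distribution: the share of agent sessions that should goal-seek each reward.
# The reward names here are illustrative, not real Office telemetry events.
REWARD_DISTRIBUTION = {
    "document_saved_to_cloud": 0.40,
    "chart_inserted": 0.25,
    "comment_added": 0.20,
    "free_exploration": 0.15,   # leave some of the workload exploring randomly
}
assert abs(sum(REWARD_DISTRIBUTION.values()) - 1.0) < 1e-9
```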
The service would build a model that predicted which action sequences produced the time- and machine-correlated rewards. Then, in future runs, the service would instruct agents to attempt to hit a given reward again and send them the predicted action sequences.
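The heart of that correlation is simple to sketch, even though the real service built a predictive model on top of it: for each reward event, look at what the same machine was doing in the time window leading up to it. The data shape here is an assumption for illustration.

```python
from datetime import timedelta

def sequences_preceding_rewards(reward_events, transitions, window=timedelta(minutes=5)):
    """Collect, for each reward, the actions the same machine took just before it.
    `transitions` is assumed to be a list of {'machine_id', 'timestamp', 'action'}
    records reported by the agents; the real service learned to predict sequences
    like these rather than just listing them."""
    sequences = []
    for event in reward_events:
        preceding = [
            t["action"]
            for t in transitions
            if t["machine_id"] == event.machine_id
            and event.timestamp - window <= t["timestamp"] <= event.timestamp
        ]
        sequences.append((event.reward_id, preceding))
    return sequences
```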
How do you know there is a problem when an AI agent is doing the driving?
AI-driven testing is more like observing users than it is like automated checks that emit PASS and FAIL signals. With real users, we watch product telemetry streams for signs of crashes, performance degradation, product failure states, or anything else that suggests an emerging issue in production or real-world usage. The product emits the same telemetry stream whether it is in use by real customers or by AI agents running in a lab. The key thing you are looking for is anything in that telemetry signal that looks wrong.

The first signal I started tracking was product crashes. Microsoft has a powerful, high-scale infrastructure for collecting and analyzing product crashes, and the Microsoft Office team had already built several tools to help product teams select and investigate crashes occurring in the wild. This existing toolset and infrastructure gave me a leg up on tracking failures and comparing them to what customers were seeing. I also chose crashes because a crash is always considered a bug; there is no gray area around whether the crash is supposed to happen. All that remains once a crash is known is establishing priority and importance, and the product teams had an existing process to triage and review incoming crashes.

There were other signals of interest. The performance team had anomaly detection tools that could report whether the monthly releases were showing performance degradation in public release. One of the release teams was working on a set of key product feature health metrics. The experimentation team wanted to track behavior and coverage differences across different feature flag and gate settings prior to release. My team and I were working toward producing reports on those signals as well, but that was still a work in progress by the time I left.
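As a rough illustration of the kind of report I was after (the field names are assumptions, not the real telemetry schema), the analysis boils down to filtering the stream to crash events from lab machines and ranking the failure buckets by volume, the same way production crashes get triaged.

```python
from collections import Counter

def lab_crash_buckets(telemetry_events, lab_machines):
    """Rank crash buckets seen on lab machines by volume.
    `telemetry_events` is assumed to be an iterable of dicts with 'event',
    'machine_id', and 'failure_bucket' fields; the real pipeline was Microsoft's
    crash reporting infrastructure, not this sketch."""
    counts = Counter(
        e["failure_bucket"]
        for e in telemetry_events
        if e["event"] == "crash" and e["machine_id"] in lab_machines
    )
    return counts.most_common()   # highest-volume crashes first, ready for triage
```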
The Good, the Bad, and the Maddening
Let’s lead with the good.
It works well.
The AI agents learned their way around the application well. There are substantial issues and limits (covered more later), but given that nobody explicitly writes code saying how to get to a specific button, dialog, or page inside the application, it is impressive how much of the application's behavior the agents can learn and cover. The tool was at version 1.0, but even at that point the methodology was sound.
You want failures, it will find failures.
The AI agents were doing better at finding crashes, especially, than any of the prior automation efforts or prior testing. I attribute this to volume, scale, and randomness. We had the tool running thousands of application sessions every day, for as long as an hour at a time, 24x7. That volume is difficult to duplicate without automation, and the randomness gets into areas of the product one does not think of during testing or does not have time to craft specific steps for. The AI agents do not have that problem.
There is no such thing as a flaky failure or an aborted check.
A flaky failure is when a given test under the same conditions produces different results from run to run. Frequently a script aborts the run at the point of failure. The failure only exists because the script has an encoded expectation of what is supposed to happen. The expectation is either implied by the execution steps or explicitly stated via an assert that checks some part of the resulting state against an encoded expectation. With the kind of AI agents we are talking about here, the lack of expectation means the agent has no flaky failure to report and no reason to abort. The agent doesn't lock up waiting for some UI control to appear or a screen to go away. The agent keeps on going. It may be using goal seeking as a way of driving its behaviors, but even if the states do not appear as its model describes, the agent keeps going. It applies a few techniques to try to find its way to a state in its sequence, but it does not treat the inability to do so as a reason to declare failure. Even if the application under test crashes or hangs, the agent launches it again and keeps going.

From the view of the tester, flake does not exist, in the same way it does not exist when analyzing real user telemetry. Instead of comparing against an expectation of consistency, everything has a rate of occurrence and patterns that correlate with and predict the event. The notion of "flake" only makes sense when talking about something that is meant to look the same way every time. It is nonsense when the entire space is constantly changing. This style of testing is more akin to stress and reliability testing than to your usual automated build validations.
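In practice that means the analysis looks like the sketch below rather than an assert: every behavior gets an occurrence rate per session, and you watch the rates move. The event names and data shape here are illustrative.

```python
from collections import Counter

def occurrence_rates(session_events, total_sessions):
    """Express every observed behavior as a rate instead of a PASS/FAIL verdict.
    `session_events` is assumed to be (session_id, event_name) pairs pulled from
    the telemetry stream."""
    counts = Counter(event for _, event in session_events)
    return {event: n / total_sessions for event, n in counts.items()}

# A 'hang_detected' rate that moves from 0.2% of sessions to 3% is a signal worth
# chasing; a single odd session is not a "flaky failure" to be rerun until green.
```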
It helps product teams with nagging issues they have seen with real users.
I had several instances where a product team engineer would contact me and ask what the tool was doing. They would say something like, "We have been tracking this issue in production for a while but have never been able to figure out how the users are getting there. I saw this week that the issue was getting hit a lot, and it was from a lab machine." I was able to show them what the agents were doing at the time they hit the issue (it's all in the model), and because the machines were in the lab, we had access to the product log files as well, giving the engineer much more information than they could normally get from a customer.

And now some of the bad and maddening.
You must keep fighting the “replace this other testing” misconception.
Several times I had to explain to the people around me that this tool was not going to replace our existing scenario-based automated checks. It was not going to replace the outsourced test vendors covering product usage scenarios. I asked the development manager on the team that makes the tool, and he said that no matter how hard he tries to explain otherwise, people still come back to him thinking all the testing is going to shift to this army of AI agents. The idea is seductive, but it is important to know what your tools and methods do and do not do. Testing methods complement each other more than they replace each other. Tools enhance and extend our capabilities; rarely do they replace them. When you optimize for something (randomized exploration at scale) you trade off against something else (precise state checking), and vice versa. I find that even very smart people, when first presented with an idea like the one described here, take a while to grasp the implications, abilities, and limitations.
The costs are high.
I was using thousands and thousands of machine hours to produce the results I was getting. Microsoft Office has a very large pool of machines running automation (over 10 million tests a day and climbing), so most of the time I was still well under any concerns. But I was running enough that when cost or system capacity pressures did come up, the conversation came my direction. The next point explains why the costs were so high.
Randomly driven coverage is slow.
A typical desktop application has a set of controls that are immediately available and enabled. Then there are second, third, fourth, and further tiers of controls which are only available if another control is used first, or if certain data exists in the application, or if the application is running in a certain mode or state. When an application has a lot of controls (each of the Office clients has tens of thousands of distinct controls available), it takes a long time for an agent to discover a control and add it to the model. The probability of observing and using a given control diminishes the deeper it sits in the tiering system and the more difficult its prior conditions are to create. This means that most of the actions from the agents are going to focus on the controls that are present in the default state with no prior conditions required. Getting to states that require a sequence of controls to be chosen, or specific data that enables them in the application, takes far longer. For Office, we were eventually able to get the tool to hit somewhere around 70% of the controls in Word, Excel, and PowerPoint. The problem was the agents would take multiple weeks to hit the controls deeper and deeper in the hierarchy. We had aspirations of using data from the tool to drive a daily build health signal, and this kind of coverage inefficiency was not going to work for that.
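A back-of-the-envelope calculation shows why. The numbers below are made up for illustration, not measured from Office, but if every state offers a couple hundred enabled controls and a target control only appears after one specific sequence of picks, a uniformly random agent needs an enormous number of attempts to stumble onto it.

```python
def expected_attempts(branching: int, depth: int) -> float:
    """Expected number of uniformly random action picks before hitting a control
    that only appears after one specific `depth`-step prefix, when each state
    offers roughly `branching` enabled controls."""
    probability = (1 / branching) ** depth
    return 1 / probability

print(expected_attempts(200, 1))   # 200 picks for a control on the default surface
print(expected_attempts(200, 3))   # 8,000,000 picks for one buried three choices deep
```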
Understanding agent behavior is difficult.
The pitch for the system (coming from the PMs; the development manager was more guarded about this) was that the AI agents simulated real user behavior. This is partly based on the way rewards could be tailored to the same signals we had seen from end users. But it was also because the tool was using the same APIs that accessibility tools use to drive, control, and describe applications to people who are unable to use the mouse, keyboard, screen, or whatever mode their tools substitute for. The API is meant to be descriptive of the UI; you cannot do anything through it beyond what a person using the mouse, keyboard, or other devices is limited to. Supposedly.

The truth is that even the accessibility APIs present the product controls and behaviors in a way that is very different from how a human perceives them. It is the job of the accessibility tools to offer a human-consumable presentation, because the API underlying it is cryptic and verbose. There are a lot of concepts and constructs in that API stack which are hidden or represented more abstractly by the time a human interacts with a tool built on them. But to an AI agent, there is only that API, and the information it sends to its server to describe the state it was in and the actions it took is a raw representation of the API constructs. Add to this that a lot of the objects and data which manifest via that API are generated automatically by build-time processes most of the developers on the product team don't know about, and you wind up with a description of what the AI agent was doing that is very difficult for a developer to relate to the product code they know. It wasn't an API designed to facilitate investigation.
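To see how raw that view is, dump a few elements of the automation tree yourself. The sketch below uses the pywinauto library, which wraps the Windows UI Automation API; exact property names can vary a little between versions, and the window title match assumes Word is already running.

```python
from pywinauto import Desktop

# Attach to a running Word window via the UI Automation backend.
win = Desktop(backend="uia").window(title_re=".*Word")
for elem in win.descendants()[:50]:
    info = elem.element_info
    # Much of what comes back is panes, custom controls, and auto-generated
    # automation ids that do not map cleanly onto anything a user would name.
    print(info.control_type, repr(info.name), info.automation_id)
```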
The tool sometimes fixates on failures the product team does not care about.
There is a control in one of the applications which, for some reason, the tool was able to use to crash the application a lot. I was investigating this failure and talking with the product team engineer driving all the crash fixes for that application, and when we looked at real user crashes, there were zero users hitting that crash. Digging further, we found users weren't using this control much at all. This is a bit strange, because the control is immediately available in the application's initial state. It has been there for many years. If you read books or how-to articles on this application, the functionality this control represents is mentioned a fair amount as a "smart way" to use the features. Given all that, you would expect the end users to hit the same set of crashes all the time, but they don't. What is probably really going on comes down to this app's heavily skewed reader/creator distribution: the feature is only ever used by the creators, and it is not a feature that needs to be used more than once or twice per document. The end users are not in there often enough to hit the problem states. The tool, meanwhile, doesn't have the same notions that guide the users one way or the other; it just sees a control available on the first screen and starts using it.
Content creation and editing was unexplored territory.
For content-based applications, there is a massive functional space that is not available via the menus and controls in the UI. A lot of the functionality exists in editing and presenting the content. If I type the sentence "A group of grean donkeys leaned on the fence," a set of behaviors and features will light up around the introduction of the word "grean." Likewise, if I start a new line with a hyphen, hit space, and type some text, the hyphen will turn into a bullet when I hit ENTER. Unless I have turned that feature off, and unless a variety of other conditions apply based on what was entered before.
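To make the gap concrete, here are some hypothetical examples of "trigger content" an agent would have to type to reach that functional space; none of it is reachable through the menus and controls alone.

```python
# Hypothetical trigger content an agent would need to type to light up
# editing-surface behaviors; examples only, not an exhaustive or official list.
TRIGGER_SNIPPETS = [
    "A group of grean donkeys leaned on the fence",  # misspelling -> proofing features
    "- first item\n",                                # hyphen, space, text, Enter -> auto bullet
    "1. step one\n",                                 # numbered-list autoformat
    "http://example.com ",                           # autolink on the trailing space
    "(c) ",                                          # common autocorrect substitution
]
```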
Some functionality just isn’t going to happen.
Maybe it is a matter of the probability being too small, but there were certain preconditions that made it almost impossible for the AI agents on their own to hit certain functionality. I had done a bunch of "cheating" on some of these by creating documents full of triggering content (e.g., tables, images, charts, WordArt, and other things I didn't trust the tool to find on its own) and placing the documents on the MRU list so the tool would randomly load them. That gave the coverage a substantial boost, but it meant the specific document I had put on the test machines was in the model rather than a more generalized way of getting to the same state. Even with that, though, there were features we had not seen hit yet.
There are things you do not want your AI agents to do.
One of the controls immediately visible in the Office applications is the "send us feedback" control. This was one of the first features where a product team contacted us, asked what was happening, and requested we tell the tool not to use that feature. We had to do this with several other features (we never did put Outlook into the workload, as I was terrified of the tool spamming the internet with mail). In some cases, it was not so much the feature use as the fact that the tool always started from a new install of Office on the test machine. That created pressure on features that only run during initial setup, and the internal deployment of the backend services we were using was not configured to handle that particular capacity demand.
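What we ended up needing is the moral equivalent of the small blocklist filter sketched below, applied before the agent picks its next action. The control names are illustrative, not the identifiers the applications actually expose.

```python
# Controls the agents must never use; names are illustrative placeholders.
BLOCKED_CONTROL_NAMES = {
    "Send Feedback",
    "Contact Support",
    "Sign Out",
}

def allowed_actions(candidates):
    """Filter the candidate controls before the agent makes its random choice."""
    return [c for c in candidates if c.name not in BLOCKED_CONTROL_NAMES]
```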
Being predictable and not getting lost are a tradeoff.
For a hardcoded script, when the resulting application state is different at some point along the way, the script will often fail and abort. Even if the difference in the state is innocuous, the presence of something other than exactly what the script expected to see and act on next is enough to prevent it from behaving correctly. The script gets lost. The possibility of getting lost is even greater if the tool is acting against a far richer, far longer sequence of steps that was collected randomly from some other agent on some other machine. While navigating a set of target states, it is easy for one of those states to be different enough that the tool cannot find the next action it intended to take.

To keep going, the tool had several techniques to get back to where it intended to be. It would use those techniques to build a new action sequence to get back to the state it wanted the application to be in. This allowed the tool to continue goal seeking, even if the application behaviors were not exactly what the existing model described. The new path would be sent to the server to augment the model. This kind of intelligence works well to keep the test running, and it improves goal seeking for rewards. But it also makes it difficult to reproduce prior failures, as the tool would very often choose a different sequence midway through a run to correct for variations in application state. Exact replay would be too brittle to succeed, but smart, self-healing replay makes reproducing prior behavior difficult.
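The recovery techniques amount to searching the learned graph for a detour. Here is a minimal sketch of that idea, assuming the aggregated model is a dictionary of state-to-(action, next state) edges; the real tool's recovery logic was richer than this.

```python
from collections import deque

def find_detour(model, current_state, goal_state):
    """Breadth-first search over the learned state/action graph to rebuild a path
    when replay drifts off the expected sequence. `model` maps a state to a list
    of (action, next_state) edges aggregated from all agents."""
    queue = deque([(current_state, [])])
    seen = {current_state}
    while queue:
        state, path = queue.popleft()
        if state == goal_state:
            return path              # new action sequence, also reported back to the model
        for action, next_state in model.get(state, []):
            if next_state not in seen:
                seen.add(next_state)
                queue.append((next_state, path + [action]))
    return None                      # no known route back; the agent just keeps exploring
```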
Some states just cannot be reproduced by simple replay.
Some states are entirely run-time dependent. Even if the agent does the exact same thing, the application will be in a different state than before.
We were executing these runs with real Microsoft Live accounts that had a real site in Microsoft 365. The Office clients connect to Microsoft 365 cloud-backed services such as grammar/spell checking, design suggestions, document storage, co-authoring document notifications, and a bunch of other capabilities that come from the Office services. The accounts were shared with other tests. This meant that the user configuration, data, and state on the server would accumulate over time and start to affect the tests. One of the most common sources of variation between runs was one agent building a document and then saving it. The document would be saved in that user's document folder in Microsoft 365. Some later instance of the tool, logged in as the same user, would pick up that file and perform more changes to it. Each run through that file was different from the time before. In the log of actions for a session, you would see that some random file was opened or saved, and then at some point later some kind of failure happened. The problem is that whatever state that document was in had accumulated over many runs.
Conclusion
My main point in this article was to share a very interesting experience doing something quite different from the types of automation efforts I see elsewhere. It also interests me because I see much pondering about the use of AI as a tool for product testing, but so much of it comes from people who have not had a chance to actually try it. I spent several years doing it, so I figured it was worth sharing some of the ideas.

Core to success, I believe, is the human-centric nature of using such tools. You can give the AI a lot to do, but at every instance that mattered, it was a person looking at results, making decisions, and taking the final action. The tools are there for the human to use, not to serve. I believe there is a real and productive future for machine learning and AI tools in support of product testing. I believe the range of uses is broad, from tiny and mundane to large and bold. I believe there are a lot of misconceptions, misrepresentations, and broken promises out there already, and quite a few more coming. But I also know from my own experience that the idea has promise.