Unlearning to Build Great AI Apps

by Sam Stone


Product strategies from Classical ML to adapt (or ditch) for the generative AI world

Image source: Tome

Years ago, the first piece of advice my boss at Opendoor gave me was succinct: “Invest in backtesting. AI product teams succeed or fail based on the quality of their backtesting.” At the time, this advice was tried-and-true; it had been learned the hard way by teams across search, recommendations, life sciences, finance, and other high-stakes products. It’s advice I held dear for the better part of a decade.

But I’ve come to believe it’s not axiomatic for building generative AI products. A year ago, I switched from classical ML products (which produce simple output: numbers, categories, ordered lists) to generative AI products. Along the way, I discovered many principles from classical ML no longer serve me and my teams.

Through my work at Tome, where I’m Head of Product, and conversations with leaders at generative AI startups, I’ve recognized 3 behaviors that distinguish the teams shipping the most powerful, useful generative AI features. These teams:

  1. Simultaneously work backwards (from user problems) and forwards (from technology opportunities)
  2. Design low-friction feedback loops from the outset
  3. Reconsider the research and development tools from classical ML

These behaviors require “unlearning” numerous things that remain best practices for classical ML. Some may seem counter-intuitive at first. Nonetheless, they apply to generative AI applications broadly, ranging from horizontal to vertical software, and startups to incumbents. Let’s dive in!

(Wondering why automated backtesting is no longer a tenet for generative AI application teams? And what to replace it with? Read on to Principle 3)

(More interested in tactics, rather than process, for how generative AI apps’ UI/UX should differ from classical ML products? Check out this blog post.)

“Working backwards” from user problems is a credo in many product and design circles, made famous by Amazon. Study users, size their pain points, write UX requirements to mitigate the top one, identify the best technology to implement it, then rinse and repeat. In other words: figure out the most important nail to hit, then choose the hammer to hit it with.

This approach makes less sense when enabling technologies are advancing very rapidly. ChatGPT was not built by working backwards from a user pain point. It took off because it offered a powerful, new enabling technology through a simple, open-ended UI. In other words: “We’ve invented a new hammer, let’s see which nails users will hit with it.”

The best generative AI application teams work backwards and forwards simultaneously. They do the user research and understand the breadth and depth of pain points. But they don’t simply progress through a ranked list sequentially. Everyone on the team, PMs and designers included, is deeply immersed in recent AI advances. They connect these unfolding technological opportunities to user pain points in ways that are often more complex than one-to-one mappings. For example, a team will see that user pain points #2, #3, and #6 could all be mitigated via model breakthrough X. Then it may make sense for the next project to focus on “working forwards” by incorporating model breakthrough X, rather than “working backwards” from pain point #1.

Deep immersion in recent AI advances means understanding how they apply to your real-world application, not just reading research papers. This requires prototyping. Until you’ve tried a new technology in your application environment, estimates of user benefit are just speculation. The elevated importance of prototyping means flipping the traditional spec → prototype → build process to prototype → spec → build. More prototypes get discarded, but that’s the only way to consistently spec features that match useful new technologies to broad, deep user needs.

Feedback for system improvement

Classical ML products produce relatively simple output types: numbers, categories, ordered lists. And users tend to accept or reject these outputs: you click a link in the Google search results page, or mark an email as spam. Each user interaction provides data that is fed directly back into model retraining, so the link between real-world use and model improvement is strong (and mechanical).

Unfortunately, most generative AI products tend not to produce new, ground-truth training data with each user interaction. This challenge is tied to what makes generative models so powerful: their ability to produce complex artifacts that combine text, images, video, audio, code, etc. For a complex artifact, it’s rare for a user to “take it or leave it”. Instead, most users refine the model output, either with more/different AI or manually. For example, a user may copy ChatGPT output into Word, edit it, and then send it to a colleague. This behavior prevents the application (ChatGPT) from “seeing” the final, desired form of the artifact.

One implication is to let users iterate on output within your application. But that doesn’t eliminate the problem: when a user does not iterate on an output, does that mean “wow” or “woe”? You could add a sentiment indicator (e.g. thumbs up/down) to each AI response, but interaction-level feedback response rates tend to be very low, and the responses that are submitted tend to be biased toward the extremes. Most users perceive sentiment collection as additional friction, because it rarely helps them get to a better output right away.

A better strategy is to identify a step in the user’s workflow that signifies “this output is now good enough”. Build that step into your app and log what the output looked like at that point. For Tome, where we help users craft presentations with AI, the key step is sharing a presentation with another person. To capture this signal, we’ve invested heavily in sharing features, and we then evaluate which AI outputs were shareable as generated and which required heavy manual editing before sharing.
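To make this concrete, here is a minimal sketch of logging at the “good enough” step. The `Presentation` object, the event fields, and the edit-ratio heuristic are all illustrative, not Tome’s actual internals:

```python
import difflib
import json
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Presentation:
    id: str
    ai_generated_text: str  # what the model originally produced
    current_text: str       # what the user is sharing now

def edit_ratio(original: str, edited: str) -> float:
    """Rough proxy for how much manual editing happened (0.0 = unchanged)."""
    return 1.0 - difflib.SequenceMatcher(None, original, edited).ratio()

def on_share(presentation: Presentation, user_id: str) -> None:
    """Called when the user shares a presentation -- the 'good enough' signal."""
    event = {
        "event": "presentation_shared",
        "user_id": user_id,
        "presentation_id": presentation.id,
        "shared_at": datetime.now(timezone.utc).isoformat(),
        # Snapshot both versions so we can later separate "shareable as generated"
        # from "heavily edited before sharing".
        "ai_output_snapshot": presentation.ai_generated_text,
        "shared_snapshot": presentation.current_text,
        "edit_ratio": edit_ratio(presentation.ai_generated_text, presentation.current_text),
    }
    print(json.dumps(event))  # stand-in for a real analytics/event pipeline
```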

Feedback for user assistance

Free text has emerged as the dominant user-desired method of interacting with generative AI applications. But free text is a Pandora’s box: give a user free text input to AI, and they’ll ask the product to do all sorts of things it cannot. Free text is a notoriously difficult input mechanism through which to convey a product’s constraints; in contrast, an old-fashioned web form makes it very clear what information can and must be submitted, and in exactly what format.

But users don’t want forms when doing creative or complex work. They want free text — and guidance on how to craft great prompts, specific to the task at hand. Tactics for assisting users include example prompts or templates, and guidance on optimal prompt length and formatting (for instance, whether to include few-shot examples). Human-readable error messages are also key (for example: “This prompt was in language X, but we only support languages Y and Z.”)
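For illustration, a rough sketch of prompt validation that returns human-readable errors; `detect_language` is a placeholder for whichever language-identification library or model you use, and the supported-language list is made up:

```python
SUPPORTED_LANGUAGES = {"en": "English", "es": "Spanish"}  # illustrative

def detect_language(prompt: str) -> str:
    """Placeholder: swap in a real language-ID library or an LLM call."""
    return "en"

def validate_prompt(prompt: str) -> str | None:
    """Return a user-facing error message, or None if the prompt is usable."""
    if len(prompt.strip()) < 10:
        return "That prompt is very short. Try describing your topic and audience."
    lang = detect_language(prompt)
    if lang not in SUPPORTED_LANGUAGES:
        supported = ", ".join(SUPPORTED_LANGUAGES.values())
        return f"This prompt appears to be in an unsupported language. We currently support: {supported}."
    return None

print(validate_prompt("Make a deck about our Q3 roadmap for the exec team"))  # None -> OK
```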

One upshot of free text inputs is that unsupported requests can be a fantastic source of inspiration for what to build next. The trick is to be able to identify and cluster what users are trying to do in free text. More on that in the next section…

Something to build, something to keep, something to discard

Build: natural language analytics

Many generative AI applications allow users to pursue very different workflows from the same entry point: an open-ended, free-text interface. Users are not selecting from a drop-down “I’m brainstorming” or “I want to solve a math problem” — their desired workflow is implicit in their text input. So understanding users’ desired workflows requires segmenting that free text input. Some segmenting approaches are likely to be enduring — at Tome, we are always interested in desired language and audience type. There are also ad hoc segmentations, to answer specific questions on the product roadmap — for example, how many prompts request a visual element like an image, video, table or chart, and thus which visual element should we invest in?
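One way to do this kind of segmentation, sketched below, is to embed each prompt and cluster the embeddings. This assumes the sentence-transformers and scikit-learn libraries; the model name, the example prompts, and the number of clusters are illustrative:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

prompts = [
    "pitch deck for a seed-stage fintech startup",
    "lesson plan on photosynthesis for 8th graders",
    "quarterly business review for the sales team",
    # ... in practice, thousands of prompts pulled from your logs
]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(prompts)  # one vector per prompt

kmeans = KMeans(n_clusters=3, random_state=0)
labels = kmeans.fit_predict(embeddings)

for prompt, label in zip(prompts, labels):
    print(label, prompt)
# Read a sample from each cluster to name it ("sales decks", "education", ...)
```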

Natural language analytics should complement, not supplant, traditional research approaches. It is especially powerful when paired with structured data, which is typically queried via traditional SQL. Lots of key data is not free text: when did the user sign up, what are the user’s attributes (organization, job, geography, etc.). At Tome, we tend to look at language clusters by job function, geography, and free/paid user status — all of which require traditional SQL.
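As a small illustration (the column names are invented), joining prompt clusters to user attributes pulled from your warehouse might look like this in pandas:

```python
import pandas as pd

# Prompt clusters (e.g. from the embedding step above), keyed by user.
prompt_clusters = pd.DataFrame({
    "user_id": [1, 2, 3],
    "cluster": ["sales decks", "education", "sales decks"],
})

# User attributes, which in practice would come from a SQL query on the warehouse.
user_attributes = pd.DataFrame({
    "user_id": [1, 2, 3],
    "job_function": ["sales", "teacher", "founder"],
    "plan": ["paid", "free", "free"],
})

joined = prompt_clusters.merge(user_attributes, on="user_id")
print(joined.groupby(["plan", "cluster"]).size())  # cluster mix by plan tier
```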

And quant insights should never be relied on without qualitative insights. I’ve found that watching a user navigate our product live can sometimes generate 10x the insight of a user interview (where the user discusses their product impression post-hoc). And I’ve found scenarios where one good user interview unlocked 10x the insight of quant analysis.

Keep: tooling for low-code prototyping

Two tooling types enable high-velocity, high-quality generative AI app development: prototyping tools and output quality assessment tools.

There are many different ways to improve an ML application, but one strategy that is both fast and accessible is prompt engineering. It’s fast because it does not require model retraining; it’s accessible because it involves natural language, not code. Allowing non-engineers to manipulate prompt engineering approaches (in a dev or local environment) can dramatically increase velocity and quality. Often this can be implemented via a notebook. The notebook may contain a lot of code, but a non-engineer can make significant advances iterating on the natural language prompts without touching the code.
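One possible shape for such a notebook, assuming the OpenAI Python SDK (any provider works, and the model name is illustrative): the natural-language prompt pieces live in one clearly marked cell, and the code around them stays fixed between iterations.

```python
# --- The only part a non-engineer needs to edit: plain-language prompt pieces ---
SYSTEM_PROMPT = "You are an expert presentation writer. Keep slides concise and punchy."
PROMPT_TEMPLATE = """Write an outline for a presentation about: {topic}
Audience: {audience}
Tone: {tone}"""

# --- The code below stays fixed between prompt iterations ---
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_outline(topic: str, audience: str, tone: str) -> str:
    prompt = PROMPT_TEMPLATE.format(topic=topic, audience=audience, tone=tone)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; use whatever model you've deployed
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(generate_outline("remote work policy", "HR leaders", "practical"))
```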

Assessing prototype output quality is often quite hard, especially when building a net-new feature. Rather than investing in automated quality measurement, I’ve found it significantly faster and more useful to poll colleagues or users in a “beta tester program” for 10–100 structured evaluations (scores + notes). The enabling technology for a “polling approach” can be light: a notebook to generate input/output examples at modest scale and pipe them into a Google Sheet. This allows manual evaluation to be parallelized, and it’s normally easy to get ~100 examples evaluated, across a handful of people, in under a day. Evaluators’ notes, which provide insights into patterns of failure or excellence, are an added perk; notes tend to be more useful for identifying what to fix or build next than the numeric scores.
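A sketch of that light enabling tech, assuming a prototype callable like the `generate_outline` example above: run a small batch of inputs through it, write the input/output pairs to a CSV, and import the CSV into a Google Sheet for evaluators to score.

```python
import csv

test_inputs = [
    {"topic": "series A fundraising", "audience": "VCs", "tone": "confident"},
    {"topic": "onboarding checklist", "audience": "new hires", "tone": "friendly"},
    # ... ideally 50-100 inputs sampled from real prompts
]

with open("prototype_eval.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["topic", "audience", "tone", "output", "score_1_to_5", "notes"])
    for case in test_inputs:
        output = generate_outline(**case)  # the prototype under test
        writer.writerow([case["topic"], case["audience"], case["tone"], output, "", ""])

# Import prototype_eval.csv into a Google Sheet, split the rows across a few
# evaluators, and collect scores plus free-text notes in the last two columns.
```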

Discard: automated, backtested measures of quality

A tenet of classical ML engineering is to invest in a robust backtest. Teams retrain classical models frequently (weekly or daily), and a good backtest ensures only good new candidates are released to production. This makes sense for models outputting numbers or categories, which can be scored against a ground-truth set easily.

But scoring accuracy is harder with complex (perhaps multi-modal) output. You may have a piece of text you consider great and are thus inclined to call it “ground truth”, but if the model output deviates from it by one word, is that meaningful? By one sentence? What if the facts are all the same, but the structure is different? What if it’s text and images together?

But not all is lost. Humans tend to find it easy to assess whether generative AI output meets their quality bar. That doesn’t mean it’s easy to transform bad output into good, just that users tend to be able to make a judgment about whether text, image, audio, etc. is “good or bad” in a few seconds. Moreover, most generative AI systems at the application layer are not retrained on a daily, weekly, or even monthly basis, because of compute costs and/or the long timelines needed to acquire sufficient user signal to warrant retraining. So we don’t need quality evaluation processes that are run every day (unless you’re Google or Meta or OpenAI).

Given the ease with which humans can evaluate generative AI output, and the infrequency of retraining, it often makes sense to evaluate new model candidates based on internal, manual testing (e.g. the polling approach described in the subsection above) rather than an automated backtest.
