
OpenAI PM Interview: ChatGPT Launch Metrics


Welcome to the ninth edition of PM Interview Prep Weekly! I’m Ajitesh, and we’re coming full circle—back to metrics questions, but this time with the most successful product launch in consumer tech history.

The Context

When ChatGPT launched on November 30, 2022, OpenAI expected maybe a few thousand users to try their “research preview.” They got 1 million in 5 days. Two months later? 100 million users—the fastest-growing consumer application in history.

But here’s the thing: there was no revenue model. No pricing strategy. No monetization plan. This wasn’t a product launch in the traditional sense—it was an experiment to answer a simple question: Would people actually find value in conversational AI?

This is my favorite type of metrics case because it forces you to think differently. Most product launches have clear business objectives: grow revenue, increase market share, hit specific conversion targets. But what do you measure when you’re launching something completely new to the world? How do you define success for an experiment that might fundamentally change how humans interact with technology?

At Google Cloud, when we launched Gemini capabilities, we faced similar challenges. The AI market was moving so fast that traditional metrics like “time to value” or “adoption rate” felt inadequate. We were competing with OpenAI’s ChatGPT, which had already captured mindshare, and we needed metrics that would help us understand genuine product-market fit beyond the novelty effect.

I learned that for experimental products - especially in AI - you need metrics that separate curiosity from real value. A million sign-ups means nothing if nobody comes back after trying it once. High engagement means nothing if your AI is producing harmful outputs. Revenue means nothing if you’re sacrificing the core experience that makes your product special.

The ChatGPT case is perfect because it embodies all of these tensions. It requires you to think about novelty effects, AI safety, experimental product validation, and long-term strategic positioning all at once.

Today’s case: You’re launching ChatGPT to the public for the first time. How would you measure its success?

The Case

Interviewer: “You’re launching ChatGPT to the public for the first time as a research preview. How would you measure its success?”

Key Context (Typically Shared When Probing):

  • Launch Date: November 30, 2022 (powered by GPT-3.5)
  • Business Model: Completely free, no revenue strategy
  • Strategic Goal: Validate conversational AI value, not monetization
  • Market Position: First mainstream consumer conversational AI
  • Key Risk: AI safety—preventing harmful or offensive outputs
  • Expected Users: A few thousand experimenters
  • Actual Result: 1 million users in 5 days, 100 million in 2 months

The Interview Approach

Note: Like all metrics questions, you want to show comprehensive thinking before narrowing to what truly matters. The framework helps you stay structured, but adapt it naturally to the conversation.

I follow this four-step approach for metrics cases:

  1. Clarifying Questions - Understand scope and strategic context
  2. Brainstorm Broadly - Map goals, actions, and metrics comprehensively
  3. Finalize Narrowly - Select 3-5 priority metrics with clear rationale
  4. Conclude Thoughtfully - Address measurement challenges and trade-offs

Let’s walk through how I’d tackle this case:

1. Clarifying Questions - 2 minutes

I’d start by establishing what makes this launch unique:

“Before diving into metrics, let me clarify a few things to ensure we’re aligned:

  • Is this launch US-only or global? (US only, English language)
  • Confirming we’re talking about the November 2022 launch of GPT-3.5 as a research preview, correct? (Yes)
  • Are we measuring success for the experimental phase or planning for commercialization? (Experimental phase—validation, not monetization)”

The key insight here is recognizing that this isn’t a traditional product launch. It’s an experiment with massive strategic implications.

2. Brainstorm Metrics (Go Broad) - 5-7 minutes

Now I’d map Goals → Actions → Metrics. This structure ensures metrics actually measure what matters.

The Goal: The initial launch wasn’t about monetization—there wasn’t even a pricing model. Based on Sam Altman’s interviews at the time, the goal was to answer: Can we deliver value to users through a conversational interface on an LLM?

This was fundamentally an experiment to validate demand and understand real-world usage patterns. But they also needed to do this responsibly—avoiding offensive responses, preventing system abuse, and maintaining AI safety.

So we’re really measuring two things:

  1. Value Validation: Are people finding genuine utility?
  2. AI Safety: Are we deploying this responsibly?

Map User Actions: Let me map what valuable engagement looks like—the actions that signal our goals are being met:

  • Discovery: User hears about ChatGPT (social media, word-of-mouth)
  • First Session: Tries it out of curiosity, asks initial questions
  • “Wow Moment”: Experiences surprisingly good conversational quality
  • Experimentation: Tests boundaries, tries different use cases
  • Value Discovery: Finds specific applications (writing help, learning, problem-solving)
  • Return Usage: Comes back days later for real needs (not just novelty)
  • Advocacy: Shares interesting responses, tells friends

For AI safety, we need to watch for:

  • Harmful or offensive outputs being generated
  • Users gaming the system or attempting jailbreaks
  • Content quality degradation at scale

Brainstorm Metrics to Measure Actions: Now, let me brainstorm broadly across all possible metrics before narrowing down:

Engagement & Value Metrics:

  • 7-day retention rate (% returning within 7 days)
  • Daily/Weekly Active Users (DAU/WAU)
  • Session length and conversation depth
  • Response share rate (viral moments)
  • Returning users with different use cases
  • User-initiated conversations (vs. one-off tests)

Quality & Satisfaction Metrics:

  • Thumbs up/down ratio per response
  • Regeneration rate (response quality proxy)
  • User feedback comments and sentiment
  • Conversation success rate (user perspective)
  • Share/screenshot rate of responses

AI Safety Metrics (Critical):

  • Harmful content generation rate
  • Thumbs down with safety concerns
  • Automated content policy violations
  • Human-in-the-loop quality assessments
  • Jailbreak attempt success rate

Growth & Word-of-Mouth:

  • New user sign-ups per day
  • Social media mention volume
  • Press coverage and media sentiment
  • User base growth rate week-over-week

Infrastructure & Reliability:

  • Response latency (p50, p95, p99)
  • System uptime during viral spikes
  • Error rate and timeout frequency
  • Cost per conversation (infrastructure)
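
Before narrowing down, it helps to keep the Goal → Action → Metric chain explicit. Here is a minimal sketch of that mapping as a plain data structure; the groupings simply mirror the lists above and are a note-taking aid, not an analytics system:

```python
# Goal -> user action -> candidate metrics, mirroring the brainstorm above.
goal_action_metrics = {
    "value_validation": {
        "return_usage": ["7-day retention rate", "DAU/WAU"],
        "value_discovery": ["session length", "conversation depth", "regeneration rate"],
        "advocacy": ["response share rate", "social media mention volume"],
    },
    "ai_safety": {
        "harmful_outputs": ["harmful content generation rate", "content policy violations"],
        "system_abuse": ["jailbreak attempt success rate"],
        "quality_at_scale": ["human-in-the-loop quality assessments"],
    },
}

# Flatten the mapping to sanity-check that every metric ties back to a goal.
for goal, actions in goal_action_metrics.items():
    for action, metrics in actions.items():
        for metric in metrics:
            print(f"{goal} -> {action} -> {metric}")
```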

3. Finalize Metrics (Go Narrow) - 5 minutes

Now for the hard part—choosing what actually matters. Here’s what I’d prioritize:

Primary Metric: 7-Day Retention Rate

  • Definition: Percentage of users who return within 7 days of their first session
  • Why Primary: This is the ultimate test of value. Anyone can try something novel once. Coming back in 7 days means you found genuine utility beyond curiosity. This filters out the novelty effect that makes all other metrics misleading in the first few weeks.
  • Target: 40-60% would be exceptional for a new product (accounting for the AI novelty factor)
  • What It Tells Us: Whether people discovered use cases worth returning for
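
To make the definition concrete, here is a minimal sketch of how 7-day retention could be computed from a raw session log. The file name, column names, and pandas approach are illustrative assumptions, not OpenAI’s actual pipeline:

```python
import pandas as pd

# Hypothetical usage log: one row per session, with user_id and timestamp columns.
events = pd.read_csv("sessions.csv", parse_dates=["timestamp"])

# Each user's first session defines the start of their retention window.
first_seen = events.groupby("user_id")["timestamp"].min().rename("first_seen")
events = events.join(first_seen, on="user_id")

# A user counts as retained if they have any session 1-7 days after their first one.
days_since_first = (events["timestamp"] - events["first_seen"]).dt.days
returned = events.loc[(days_since_first >= 1) & (days_since_first <= 7), "user_id"].unique()

# Note: a real analysis would exclude users whose 7-day window has not closed yet.
retention_7d = len(returned) / events["user_id"].nunique()
print(f"7-day retention: {retention_7d:.1%}")
```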

Secondary Metric: Weekly Active Users (WAU)

  • Definition: Users actively engaging with ChatGPT each week after their initial signup
  • Why Secondary: Growth in weekly actives reflects both satisfaction (existing users keep coming back) and word-of-mouth (new users arrive because others found it useful). It signals organic momentum from genuine usefulness, not just marketing hype.
  • Leading Indicators: Session depth (turns per conversation), positive ratings, sharing behavior
  • What It Tells Us: Whether we’re creating sustainable engagement patterns
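
For completeness, a quick sketch of how weekly actives could be counted from the same hypothetical session log used in the retention example above:

```python
import pandas as pd

# Same illustrative session log as before: one row per session, with user_id and timestamp.
events = pd.read_csv("sessions.csv", parse_dates=["timestamp"])

# Distinct users active in each calendar week.
weekly_active_users = events.set_index("timestamp").resample("W")["user_id"].nunique()
print(weekly_active_users.tail())  # week-over-week WAU trend
```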

Guardrail Metric: AI Safety Incident Rate

  • Definition: Percentage of conversations resulting in harmful, offensive, or policy-violating outputs
  • Measurement: Combination of thumbs-down reports flagged for safety concerns and human evaluation of a test corpus
  • Why Guardrail: One major safety incident could destroy trust and derail the entire experiment. This must stay below threshold even as we scale virally.
  • Target: <1% of conversations flagged for safety issues
  • What It Tells Us: Whether we can scale responsibly
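
One way to operationalize the guardrail is to blend automated flags with a human-reviewed sample. The sketch below assumes a per-conversation table with hypothetical field names; it is not OpenAI’s moderation stack:

```python
import pandas as pd

# Hypothetical per-conversation table: auto_policy_flag (bool), user_reported_safety (bool),
# and human_label ("safe" / "unsafe", empty for conversations nobody reviewed).
convs = pd.read_csv("conversations.csv")

# Automated signal: policy classifier hits or thumbs-down reports flagged as safety concerns.
auto_incident_rate = (convs["auto_policy_flag"] | convs["user_reported_safety"]).mean()

# Calibration signal: incident rate on the (much smaller) human-reviewed sample.
reviewed = convs[convs["human_label"].notna()]
human_incident_rate = (reviewed["human_label"] == "unsafe").mean()

print(f"Automated incident rate: {auto_incident_rate:.2%} (guardrail: below 1%)")
print(f"Human-reviewed incident rate: {human_incident_rate:.2%} on {len(reviewed)} conversations")
```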

Why These Three?

Notice the strategic logic:

  1. Retention answers: “Is there real value beyond novelty?”
  2. Active Users answers: “Is that value spreading organically?”
  3. Safety Rate ensures: “Are we deploying this responsibly?”

Together, these three metrics tell you whether you’ve achieved the experimental goal: validating that conversational AI can deliver genuine value at scale without causing harm.

4. Conclude - 3 minutes

Addressing Key Measurement Challenges:

The Novelty Effect Problem: ChatGPT was completely new to most people. This creates massive measurement challenges:

  • Initial engagement will be artificially high due to curiosity
  • Users might rate experiences highly simply because expectations are undefined
  • Viral growth spikes will create misleading growth curves
  • Need to wait 4-6 weeks before extrapolating any long-term patterns

That’s why 7-day retention is crucial—it lets the novelty wear off before measuring value.

AI Safety at Scale: As we go from thousands to millions of users:

  • Edge cases become common cases (a 1-in-10,000 failure shows up roughly 100 times a day at 1 million daily conversations)
  • Adversarial users will actively try to break safety measures
  • Need robust human-in-the-loop quality checks alongside automated systems
  • Must balance between safety and experience (overly restrictive = unusable)

What We’re Willing to Accept:

  • High initial churn: 60-70% of curious users never returning is fine
  • Lower engagement than social media: This isn’t TikTok—deep sessions matter more than frequency
  • Some false positives on safety: Better to be cautious early, refine over time
  • Infrastructure costs: Value validation justifies high compute costs initially

Key Takeaways

  1. Novelty Requires Different Metrics: For groundbreaking products, standard metrics mislead. You need measures that separate curiosity from genuine value—hence the focus on retention over raw growth.

  2. Experimental Goals ≠ Commercial Goals: ChatGPT’s launch goal was validation, not monetization. Your metrics must match the strategic stage. Revenue too early would’ve incentivized wrong behaviors.

  3. Safety is Non-Negotiable for AI: One viral harmful incident could’ve destroyed ChatGPT before it began. Guardrail metrics aren’t optional—they’re existential. Always include them when discussing AI products.

  4. Go Broad, Then Narrow with Purpose: I brainstormed 20+ potential metrics, then narrowed to 3 with clear strategic rationale. This shows comprehensive thinking AND judgment—both critical PM skills.

  5. Context Changes Everything: The same product (conversational AI) launched commercially would have completely different metrics. Always ground your metrics in strategic context, not generic frameworks.

  6. Retention > Growth for Validation: When experimenting, understanding depth matters more than breadth. Would you rather have 1 million curious users or 100,000 who come back weekly? The latter teaches you more.

Common Pitfalls to Avoid

  • Ignoring AI Safety: Treating this like a normal app launch without addressing safety shows lack of AI product awareness
  • Only Vanity Metrics: Focusing solely on DAU or sign-ups without retention or quality signals
  • No Novelty Effect Consideration: Failing to acknowledge that early metrics will be skewed by curiosity
  • Missing the Experimental Nature: Proposing revenue or monetization metrics for what was fundamentally a research preview
  • Analysis Paralysis: Listing 15 metrics without clear prioritization or strategic rationale
  • Generic Framework Following: Mechanically applying a framework without adapting to the unique context

Practice This Case

Want to practice this metrics case with an AI interviewer who understands OpenAI’s strategic context?

Practice here: PM Interview: Metrics - ChatGPT Launch

The AI interviewer will probe your understanding of experimental product launches, challenge you on novelty effects, and ensure you’re balancing growth with AI safety—just like a real OpenAI PM would.


Frequently Asked Questions

Q: How is measuring AI products different from traditional products?

A: AI products have unique challenges: novelty effects skew early metrics, safety is non-negotiable, and user expectations are undefined. Traditional acquisition/retention funnels miss the critical “value discovery” phase where users figure out what the AI is actually good for. You need metrics that separate curiosity from genuine utility.

Q: Why focus on 7-day retention instead of daily active users?

A: For experimental products like ChatGPT, DAU is misleading because it includes curiosity-driven usage. 7-day retention filters out the novelty effect and measures whether users found genuine value worth returning for. It’s a better predictor of long-term success than raw engagement numbers.

Q: How do you balance growth metrics with safety metrics?

A: Safety is a guardrail metric, not a trade-off. You set a threshold (like <1% safety incidents) that you cannot cross, regardless of growth impact. For experimental AI products, one viral safety incident can destroy years of progress, so safety always comes first.

Q: What if the interviewer pushes for revenue metrics?

A: Remind them of the strategic context. ChatGPT launched as a free research preview with no monetization plan. Focusing on revenue metrics would’ve incentivized wrong behaviors (paywalling, optimizing for paying users instead of usage validation). Match your metrics to the strategic phase.

Q: How do you handle the “too many metrics” feedback?

A: Start broad to show comprehensive thinking, then narrow with clear rationale. Explain why you’re prioritizing retention (value validation), active users (organic growth), and safety (responsible scaling). Three metrics with strategic justification is better than ten without purpose.

Q: What about competition from Google, Microsoft, etc.?

A: For the November 2022 launch, ChatGPT was first to market for consumer conversational AI. The goal was market creation, not market share capture. Competitive metrics become relevant later, but for experimental validation, user value matters more than competitor benchmarking.

PM Tool of the Week: Claude for Sheets

This week, let’s talk about Claude for Sheets from Anthropic, a tool I’ve been loving for my analysis work.

As PMs, we’re constantly buried in spreadsheets: user survey results, feature requests, customer feedback, etc. Traditionally, you’d either read through hundreds of rows manually or write formulas where they could help.

Claude for Sheets lets you use AI directly in Google Sheets to:

  • Categorize at scale: Tag 500 user feedback comments by theme in seconds
  • Extract insights: Pull key pain points from long-form feedback
  • Smart summaries: Condense customer interview notes into action items

What I love is how it feels native to your workflow—no switching contexts, no copying data to another tool. You’re just enhancing your existing spreadsheet analysis with AI capabilities.
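
The add-on lives inside Sheets, but if you ever want to script the same kind of tagging, a minimal sketch with Anthropic’s Python SDK looks roughly like this (the model name, theme list, and prompt are assumptions; adapt them to your own setup):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative theme list; use whatever taxonomy fits your feedback.
THEMES = ["pricing", "performance", "missing feature", "bug", "praise", "other"]

def tag_feedback(comment: str) -> str:
    """Ask Claude to assign one theme to a single feedback comment."""
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumption: substitute the model you use
        max_tokens=10,
        messages=[{
            "role": "user",
            "content": f"Tag this feedback with exactly one theme from {THEMES}:\n\n{comment}",
        }],
    )
    return response.content[0].text.strip()

print(tag_feedback("The app keeps logging me out every few minutes."))
```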

Using a PM tool that’s saving you hours each week? Reply and tell me about it!


What’s your take on metrics for experimental AI products? How would you measure success differently for research previews versus commercial launches? Hit reply—I’d love to hear your perspective.


About PM Interview Prep Weekly

Every Monday, get one complete PM case study with detailed solution walkthrough, an AI interview partner to practice with, and insights on what’s new in PM interviewing.

No fluff. No outdated advice. Just practical prep that works.

— Ajitesh
CEO & Co-founder, Tough Tongue AI
Ex-Google PM (Gemini)
LinkedIn | Twitter

