Your product goes down at 9am on a Tuesday. Peak usage hours. 10,000 active users suddenly can't access the system.
Twitter erupts. Support tickets flood in. Your biggest customer's CEO emails your CEO directly.
Engineering is scrambling. Your PM says "two hours to fix." Sales is panicking. Marketing asks: "What should we post?"
This is when most companies make it worse. Radio silence for hours. Then a generic "We're experiencing technical difficulties" that tells customers nothing. Followed by a defensive blog post explaining why it wasn't really your fault.
Here's how to manage crises without destroying customer trust.
The Crisis Communication Framework
Bad crisis response: Stay quiet, downplay severity, blame external factors
Good crisis response: Communicate early and often, own the problem, show specific action
The First 15 Minutes: Initial Acknowledgment
Your window: 15 minutes from when customers start reporting issues
Why: Customers are already experiencing the problem. Silence makes them assume you don't know or don't care. Fast acknowledgment shows you're on it.
StatusPage's research: Companies that acknowledge incidents within 15 minutes see 40% fewer support tickets than those who wait an hour.
The initial statement (3 sentences max):
"We're aware that [specific problem]. Our engineering team is investigating the cause. We'll update you within [specific timeframe]."
Example from Netlify outage: "We're aware that some sites are not loading. Our team is investigating the issue and working on a fix. We'll provide an update within 30 minutes."
Post this:
- Status page (first)
- Twitter/LinkedIn (immediately after)
- In-app banner (if possible)
- Email to enterprise customers (optional in the first 15 minutes)
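If you run a hosted status page, script the acknowledgment so hitting the 15-minute window doesn't depend on someone remembering the wording under pressure. Here's a minimal sketch in Python, assuming Atlassian Statuspage's REST incidents endpoint; the page ID, API key, and payload fields are placeholders to verify against your provider's current docs.

```python
import os
import requests

# Placeholders: set these for your own status page account.
PAGE_ID = os.environ["STATUSPAGE_PAGE_ID"]
API_KEY = os.environ["STATUSPAGE_API_KEY"]

def acknowledge_incident(problem: str, next_update_minutes: int) -> dict:
    """Post the 3-sentence initial acknowledgment to the status page.

    Assumes the Statuspage REST API's incidents endpoint; verify the
    endpoint and payload fields against your provider's current docs.
    """
    body = (
        f"We're aware that {problem}. "
        "Our engineering team is investigating the cause. "
        f"We'll update you within {next_update_minutes} minutes."
    )
    resp = requests.post(
        f"https://api.statuspage.io/v1/pages/{PAGE_ID}/incidents",
        headers={"Authorization": f"OAuth {API_KEY}"},
        json={"incident": {
            "name": "Service disruption",
            "status": "investigating",
            "body": body,
        }},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()

# Example: acknowledge_incident("some sites are not loading", 30)
```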
Hours 1-4: Ongoing Updates
Update cadence: Every 30-60 minutes, even if no progress
Why: Silence creates anxiety. Regular updates (even "still working on it") show you're actively managing the situation.
The update template:
- Status: What's working, what's not
- Impact: How many customers affected
- Cause: What we know (be specific if known, honest if unknown)
- Fix: What we're doing and ETA (if available)
- Next update: When to expect next communication
Example from GitLab major outage:
"Update 11:30am UTC:
- Status: GitLab.com still experiencing degraded performance
- Impact: ~40% of API requests failing
- Cause: Database replication lag from config change
- Fix: Rolling back change, expect 45 min to full recovery
- Next update: 12:15pm UTC or when resolved"
What makes this work: Specificity builds trust. Customers know what's happening, why, and when it should be fixed.
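One way to keep every update hitting those five fields is to capture each update as a structured record and render the message from it. A minimal sketch (the field names are illustrative, not from any particular tool):

```python
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    """One ongoing-update entry: the five fields from the template above."""
    status: str       # what's working, what's not
    impact: str       # how many customers affected
    cause: str        # what we know, or an honest "unknown"
    fix: str          # what we're doing and ETA if available
    next_update: str  # when to expect the next communication

    def render(self, timestamp: str) -> str:
        return "\n".join([
            f"Update {timestamp}:",
            f"- Status: {self.status}",
            f"- Impact: {self.impact}",
            f"- Cause: {self.cause}",
            f"- Fix: {self.fix}",
            f"- Next update: {self.next_update}",
        ])

# Example, mirroring the GitLab-style update above:
update = IncidentUpdate(
    status="GitLab.com still experiencing degraded performance",
    impact="~40% of API requests failing",
    cause="Database replication lag from config change",
    fix="Rolling back change, expect 45 min to full recovery",
    next_update="12:15pm UTC or when resolved",
)
print(update.render("11:30am UTC"))
```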
Resolution: The All-Clear
When service is restored, don't just say "We're back."
The resolution statement:
- Issue resolved: Time service fully restored
- What happened: Root cause explanation (technical but accessible)
- What we're doing: How we're preventing recurrence
- Next steps: What customers should do (if anything)
- Postmortem: When you'll publish full analysis
Example from Stripe outage resolution:
"As of 2:45pm PST, all Stripe services are fully operational.
What happened: A code deployment introduced a bug that caused API timeouts under high load.
Impact: 3 hours of degraded service affecting 18% of API requests.
Prevention: We've rolled back the deployment, added load testing for this scenario, and implemented additional monitoring.
Postmortem: We'll publish full incident report within 48 hours at status.stripe.com"
The Security Incident Framework
Security incidents require a different approach than service outages.
The First Hour: Controlled Disclosure
Don't post public updates until you understand the scope and have a solution.
Why: Public security announcements before you have a fix create panic and give attackers information.
The triage process:
0-15 min: Internal assessment
- What data/systems are compromised?
- How many customers affected?
- Is attack still active?
- Do we need external help (security firms, law enforcement)?
15-30 min: Customer notification (private)
- Email directly affected customers
- Be specific about what data may be compromised
- Provide immediate action steps (change passwords, revoke API keys)
30-60 min: Executive notification
- Brief CEO, board, legal counsel
- Determine whether disclosure regulations apply (GDPR, CCPA, breach notification laws, etc.)
- Get legal approval for public statement
The Public Statement: What to Say
Only post publicly after:
- Affected customers notified privately
- Immediate threat contained
- Legal review complete
The security incident template:
- What happened: Type of security incident (data breach, unauthorized access, DDoS)
- Scope: What data or systems affected
- Timeline: When it occurred and when discovered
- Impact: How many customers/records affected
- Response: What we did immediately
- Resolution: Current status
- Next steps: What customers should do
- Investigation: Ongoing security review and external audit
Example from Okta security incident:
"We recently identified unauthorized access to our support case management system. An attacker accessed files uploaded by certain customers during support cases. Approximately 2.5% of customers may be affected. We've contacted impacted customers directly. We've revoked the attacker's access, engaged external cybersecurity forensics, and are implementing additional security controls. Full report will be published after forensics complete."
The Product Failure Framework
Different from outages: your product has a fundamental flaw that caused customer harm
The Ownership Statement
Don't blame customers, edge cases, or "unforeseen circumstances."
Example of bad response: "Some users experienced issues due to unusual usage patterns that exceeded typical thresholds. We recommend following best practices outlined in our documentation."
Translation: "It's your fault for using it wrong."
Example of good response: "We discovered a bug that caused data loss for customers using [specific feature]. This is unacceptable. We take full responsibility. Here's what we're doing to make it right."
Basecamp's bug that deleted customer data:
"We messed up. A bug in our code caused permanent data loss for 8 customers. This is inexcusable. We're personally reaching out to every affected customer, providing full refunds, and working with them to recover what we can. We've fixed the bug and added safeguards to prevent this type of failure. We're deeply sorry."
Why this works: No excuses, clear ownership, specific action, sincere apology.
The Customer Communication Tiers
Not all customers need the same level of communication.
Tier 1: Directly Affected Customers (Personalized)
- Communication method: Phone call or personal email from account executive or CSM
- Timing: Immediate (within first hour)
- Message: Specific impact to their account, immediate action required, direct support contact
Example: "Your API keys may have been compromised. We recommend regenerating them immediately. I'm available at [phone] to walk you through this personally."
Tier 2: Enterprise/High-Value Customers (Segment-Specific)
- Communication method: Email from executive team
- Timing: Within 2 hours
- Message: Situation overview, business impact assessment, dedicated support
Example: "We know you rely on [product] for critical operations. Here's the detailed timeline of today's incident and how we're preventing recurrence. Your dedicated CSM is standing by."
Tier 3: All Other Customers (Broadcast)
- Communication method: Email, status page, social media
- Timing: After crisis contained
- Message: General update, transparent accounting, apology
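If your CRM or billing data already flags enterprise and top accounts, you can make the tier decision mechanical instead of debating it mid-incident. A rough sketch, assuming hypothetical account fields like directly_affected and is_enterprise:

```python
from dataclasses import dataclass

@dataclass
class Account:
    name: str
    directly_affected: bool  # e.g., their data or API keys were in scope
    is_enterprise: bool
    top_account: bool        # one of your top ~20 accounts

def communication_tier(account: Account) -> dict:
    """Map an account to a tier, channel, and timing per the tiers above."""
    if account.directly_affected:
        return {"tier": 1, "channel": "phone call or personal email from AE/CSM",
                "timing": "within the first hour"}
    if account.is_enterprise or account.top_account:
        return {"tier": 2, "channel": "email from executive team",
                "timing": "within 2 hours"}
    return {"tier": 3, "channel": "email, status page, social media",
            "timing": "after the crisis is contained"}

# Example:
print(communication_tier(Account("Acme Corp", directly_affected=False,
                                 is_enterprise=True, top_account=False)))
```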
Common Crisis Communication Mistakes
Mistake #1: Waiting until you have all the answers
You delay communication for hours while investigating. Customers assume the worst.
Fix: Acknowledge fast, update often, even if you don't have complete information yet.
Mistake #2: Corporate speak and jargon
"We're experiencing a service degradation affecting a subset of users." What does that mean?
Fix: Plain language. "Our app is down for some customers. We're working on fixing it."
Mistake #3: Blaming external factors
"AWS had an outage." (Subtext: Not our fault)
Fix: Customers don't care whose fault it is. They care that you solve their problem. Own the impact.
Mistake #4: Under-communicating impact
"Small issue affecting limited users." Actually affected 50% of customers.
Fix: Be honest about scope. Trust is destroyed when people discover you minimized impact.
Mistake #5: No follow-through on promised postmortem
You promise full transparency and detailed analysis. Then never publish it.
Fix: Always publish the postmortem. Commit to a date. Hold yourself accountable.
The Postmortem Framework
Within 48-72 hours of resolution, publish a detailed incident report.
The structure:
Summary: One-paragraph overview of what happened
Timeline: Minute-by-minute account of incident
- When issue started
- When detected
- What actions taken when
- When resolved
Root cause: Technical explanation (accessible to non-technical readers)
Impact analysis:
- Number of customers affected
- Duration of impact
- Data/functionality lost
Immediate fixes: What was done to resolve
Long-term prevention:
- What we're changing in our systems
- New monitoring/testing being implemented
- Timeline for implementation
Lessons learned: What we could have done better
PagerDuty's incident postmortems: They publish every major incident with this level of detail. This builds immense trust because customers see a genuine commitment to improvement.
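One way to make sure the postmortem actually ships is to generate its skeleton the moment the incident closes. A minimal sketch that emits the structure above as a stub to fill in; the section prompts paraphrase this framework and aren't tied to any specific tool.

```python
POSTMORTEM_SECTIONS = [
    ("Summary", "One-paragraph overview of what happened."),
    ("Timeline", "When it started, when detected, actions taken, when resolved."),
    ("Root cause", "Technical explanation, accessible to non-technical readers."),
    ("Impact analysis", "Customers affected, duration, data/functionality lost."),
    ("Immediate fixes", "What was done to resolve."),
    ("Long-term prevention", "System changes, new monitoring/testing, timeline."),
    ("Lessons learned", "What we could have done better."),
]

def postmortem_stub(incident_title: str, incident_date: str) -> str:
    """Generate a markdown skeleton so the report is started, not just promised."""
    lines = [f"# Postmortem: {incident_title} ({incident_date})", ""]
    for heading, prompt in POSTMORTEM_SECTIONS:
        lines += [f"## {heading}", f"_{prompt}_", ""]
    return "\n".join(lines)

# Example: print(postmortem_stub("API degradation", "2024-06-11"))
```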
Quick Start: Build Crisis Playbook This Week
Day 1: Create crisis communication team
- Who's responsible for drafting statements? (PMM, typically)
- Who approves before posting? (CEO, Legal, depending on severity)
- Who posts to which channels? (Support, Social Media Manager)
Day 2: Build status page and update templates
- Set up StatusPage.io or similar
- Create template for initial acknowledgment (just fill in details)
- Create template for ongoing updates
- Create template for resolution statement
Day 3: Define notification tiers
- Tier 1: Who gets personal calls? (Top 20 accounts)
- Tier 2: Enterprise segment email list
- Tier 3: All customer broadcast list
Day 4: Set up monitoring and escalation
- When does support escalate to crisis team? (Severity thresholds)
- Who's on-call for crisis communication?
- How do they get alerted?
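Escalation rules hold up better when they live in a config everyone can read rather than in someone's head. A rough sketch of severity thresholds; the numbers and triggers are placeholders to adapt, not recommendations.

```python
# Placeholder thresholds: tune these to your own product and customer base.
ESCALATION_POLICY = {
    "sev1": {  # full outage or suspected security incident
        "trigger": "core product down OR data/security compromise suspected",
        "page": ["on-call engineer", "crisis comms lead", "exec on-call"],
        "acknowledge_publicly_within_minutes": 15,
    },
    "sev2": {  # major degradation
        "trigger": "key feature degraded for a significant share of customers",
        "page": ["on-call engineer", "crisis comms lead"],
        "acknowledge_publicly_within_minutes": 30,
    },
    "sev3": {  # minor issue
        "trigger": "isolated bug, small number of customers affected",
        "page": ["on-call engineer"],
        "acknowledge_publicly_within_minutes": None,  # status page optional
    },
}

def should_engage_crisis_team(severity: str) -> bool:
    """Support escalates to the crisis comms team for sev1 and sev2."""
    return severity in ("sev1", "sev2")
```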
Day 5: Run crisis simulation
- Simulate outage or security incident
- Practice full communication flow
- Time how fast team can respond
- Identify gaps in process
The Uncomfortable Truth
Most companies have zero crisis communication plan until they need one. Then they improvise badly under pressure.
What doesn't work:
- Radio silence followed by defensive blog post
- Corporate jargon that tells customers nothing
- Blaming external factors
- Under-communicating severity
- Promising transparency then going dark
What works:
- Acknowledge within 15 minutes
- Update every 30-60 minutes
- Plain language, specific details
- Own the problem fully
- Publish detailed postmortem
Your customers don't expect perfection. They expect honesty and competence when things break.
Build the playbook before the crisis. You won't have time during.