Incident Response Plan

Identifying Incidents

DevOps engineers can be notified of an incident in any of the following ways:

Customer sign-up notification (alerts us to unexpected issues during the ReachOut.AI sign-up process)
Customer complaint notification (alerts us to unexpected issues while using ReachOut.AI features)
Service monitoring notification (alerts us to unexpected issues in the ReachOut.AI service infrastructure)

Incident Types

Infrastructure Failure Incident

In the event of an isolated incident with our cloud hosting provider, our High Availability configurations will keep us online. We will remain operational during a single Availability Zone outage, but a widespread outage may cause downtime until the upstream service is restored. In such cases, you should:

Confirm that the issue really originates with the upstream provider.
Check the upstream provider's status updates and share them internally so that we can inform users.
If the service issue lasts longer than a couple of hours, start assessing the possibility of migrating the affected service to another Region.

Security Breach

If you notice a security breach of any kind, you should:

Escalate the issue internally and communicate with users if any data was leaked.
Gather and analyze evidence that led to the classification as a security breach.

In case of affected instances:

Turn them off and create snapshots for future investigation.
Rotate any credential that might have been present in the instances.
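
The two steps above can be sketched in code. This is a minimal illustration, not our actual tooling: it assumes the affected instances run on AWS EC2 and that `ec2` behaves like a boto3 EC2 client; the function name `isolate_instance` is hypothetical. It stops the instance, then starts a snapshot of every volume attached to it. Credential rotation is not covered here.

```python
from typing import Any, List


def isolate_instance(ec2: Any, instance_id: str) -> List[str]:
    """Stop an instance and snapshot every volume attached to it.

    Assumption: `ec2` behaves like a boto3 EC2 client.
    Returns the IDs of the snapshots that were started.
    """
    # Stop (do not terminate) so the disks are preserved for investigation.
    ec2.stop_instances(InstanceIds=[instance_id])

    # Look up the volumes attached to the affected instance.
    # A single call is enough for a sketch; very large result sets
    # would need pagination.
    volumes = ec2.describe_volumes(
        Filters=[{"Name": "attachment.instance-id", "Values": [instance_id]}]
    )["Volumes"]

    snapshot_ids = []
    for volume in volumes:
        volume_id = volume["VolumeId"]
        snapshot = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"Forensic snapshot of {instance_id} ({volume_id})",
        )
        snapshot_ids.append(snapshot["SnapshotId"])
    return snapshot_ids
```

With boto3 installed and credentials configured, the client would typically be created with `boto3.client("ec2")`. Passing the client in as a parameter keeps the function easy to exercise against a stand-in object before running it on real infrastructure.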

In case of compromised credentials, for example through email phishing:

Rotate any credential that might have been compromised.
Assume more things have been compromised and investigate other possible affected targets.

Security breaches include, but are not limited to:

Loss or theft of personal computing devices used to store or access ReachOut.AI systems.
Breaches of any ReachOut.AI systems.
Unintended disclosure of ReachOut.AI sensitive information.

Reacting to Incidents

Ensure the whole team knows by announcing it on the ReachOut.AI Team channel. Use @channel to attract everyone’s attention.
Identify the affected services. If it takes more than a few minutes, ask engineers who are online for assistance, possibly by starting a Slack or WhatsApp chat to share findings without interrupting remediation efforts.
When you’ve identified the affected services, decide on the severity of the incident:
- Was there a security breach?
- Is customer data affected?
- Is the incident part of a larger outage at a vendor such as AWS?
- Will a reliable fix be easy to produce?
- Can you do it on your own?
- How long will it take you to deploy it?
- Do you need someone to review your fix before and after you deploy it?
- Do we need to go into maintenance mode in the meantime?
- Are you sure what you are fixing is the actual root cause of the problem?
Make sure the DevOps team is aware of the issue. If none of them are online, contact them immediately by phone. They will most likely know about the issue before anyone else, but verify if you’re unsure.
Create an activity log to track what changes are being made and what is known about the outage. For example, post short updates in a channel such as #ReachOut-Notifications. The log is very useful for hand-overs and for writing the post-mortem.
Discuss in the ReachOut.AI Team channel if we should enter maintenance mode. Maintenance mode should be used if the outage is expected to take more than a few minutes. If it’s decided that we should enter maintenance mode, a developer should immediately do so.
If users contact us, use the BetterUptime Incident Maintenance Mode if ReachOut.AI is in maintenance, and Incident – Not in Maintenance Mode if it is not.
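
The activity log mentioned above does not need special tooling; short channel updates are enough. For teams that prefer a structured record, the following is a minimal sketch of one possible shape. The class and field names (`ActivityLog`, `LogEntry`, `record`, `timeline`) are hypothetical and not part of any existing ReachOut.AI tool. Each entry carries a timestamp, an author, and a message, and the log can be rendered as a chronological timeline for hand-overs or a post-mortem.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LogEntry:
    timestamp: datetime
    author: str
    message: str


@dataclass
class ActivityLog:
    entries: List[LogEntry] = field(default_factory=list)

    def record(self, author: str, message: str,
               when: Optional[datetime] = None) -> LogEntry:
        """Append an entry; defaults to the current time in UTC."""
        entry = LogEntry(when or datetime.now(timezone.utc), author, message)
        self.entries.append(entry)
        return entry

    def timeline(self) -> str:
        """Render all entries in chronological order, one per line, in UTC."""
        ordered = sorted(self.entries, key=lambda e: e.timestamp)
        return "\n".join(
            f"{e.timestamp.astimezone(timezone.utc):%Y-%m-%d %H:%M} UTC"
            f" | {e.author}: {e.message}"
            for e in ordered
        )
```

Entries are sorted when rendering rather than when recording, so updates added late (for example, a finding noted after the fact) still appear in the right place in the timeline.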

After the Incident is Solved

Confirm that the incident is resolved.
Update the team on the ReachOut.AI Team channel.
Ensure we have exited maintenance mode if it was activated.
If the maintenance takes significantly longer than anticipated, we will send an email to explain the situation.
Ensure monitoring is established to detect this issue in the future.
For lengthy incidents or those impacting multiple services, conduct a post-mortem analysis with a detailed timeline to identify root causes and enhance future processes.

Other notes:

Data Subject Notification: In the event of a data breach that affects personal data, we will notify affected data subjects without undue delay, providing them with information about the nature of the breach, potential consequences, and measures taken to mitigate risks.
Data Protection Impact Assessments (DPIAs): For incidents that may impact personal data, we will conduct a Data Protection Impact Assessment to evaluate risks and implement necessary measures to protect data subjects' rights.
Incident Documentation: All incidents will be documented, including the nature of the incident, the response actions taken, and the outcomes. This documentation will be maintained to demonstrate compliance with GDPR accountability requirements.
Training and Awareness: Regular training sessions will be conducted for all staff to ensure they are aware of GDPR requirements and their responsibilities in the event of an incident.
Third-Party Notifications: If a significant data breach occurs, we will notify the relevant data protection authority within 72 hours of becoming aware of the breach.
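
Because the 72-hour window starts when we become aware of the breach, it helps to compute the deadline explicitly and record it in the activity log. The following is a minimal sketch; the function names are hypothetical, and it assumes the moment of awareness is recorded as a timezone-aware timestamp.

```python
from datetime import datetime, timedelta, timezone

# The notification window stated above.
NOTIFICATION_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at: datetime) -> datetime:
    """Return the latest time by which the authority must be notified."""
    if aware_at.tzinfo is None:
        # Naive timestamps are ambiguous across time zones, so reject them.
        raise ValueError("aware_at must be timezone-aware")
    return aware_at + NOTIFICATION_WINDOW


def time_remaining(aware_at: datetime, now: datetime) -> timedelta:
    """Return how much of the window is left (negative if it has passed)."""
    return notification_deadline(aware_at) - now
```

For example, a breach we became aware of on 2024-03-01 at 09:30 UTC gives a deadline of 2024-03-04 at 09:30 UTC.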