Increase in errors on public pages
Incident Report for Atlassian Statuspage
Postmortem

SUMMARY

On May 14, 2020, between 08:48 and 09:04 GMT-0700 (Pacific Daylight Time), a large volume of malicious traffic bypassed our traffic management layer, which caused a significant portion of requests to public status pages to fail. Engineers started remediation work within a minute of discovery, and impact was fully mitigated in 16 minutes.

We apologize to customers whose services were impacted during this incident; we are taking immediate steps to improve the platform’s performance and availability.

TECHNICAL REASONS

We use a traffic management layer to mitigate large volumes of malicious traffic. The attackers used a specific high-volume attack pattern, which the layer detected. During the attack, the layer blocked 70% of the malicious traffic, but the remaining 30% still got through. That traffic was routed to our application stack, where it caused performance degradation.

During this window, a significant portion of requests to public status pages failed. To restore service across our network of status pages, we temporarily restricted access to one customer's status page, which was the target of the malicious traffic. That status page was restricted at 08:59:55, which returned our service to normal operations. We restored service to the affected status page at 09:21:04.

ROOT CAUSE

The traffic management layer was not configured correctly for this specific attack pattern. The layer attempted to mitigate the attack traffic by rate-limiting client connections on a per-IP basis, but it did not ban the offending IPs. As a result, bad clients continued to send traffic at the permitted rate, which degraded our backends until we restricted access to the status page that all of the traffic targeted.
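
The difference between rate-limiting a client and temporarily banning it can be sketched in a few lines of Python. This is only an illustration, not the actual traffic management layer, whose implementation this report does not describe. The class `PerIpLimiter`, the helper `count_admitted`, the 5-requests-per-second limit, the 30-second ban, and the example IP address are all assumptions made for the sketch.

```python
from collections import defaultdict, deque


class PerIpLimiter:
    """Hypothetical sliding-window per-IP limiter with an optional temporary ban."""

    def __init__(self, max_requests, window_ms, ban_ms=0):
        self.max_requests = max_requests
        self.window_ms = window_ms
        self.ban_ms = ban_ms  # 0 means rate-limit only, never ban
        self._hits = defaultdict(deque)  # ip -> timestamps of admitted requests
        self._banned_until = {}          # ip -> time (ms) at which the ban expires

    def allow(self, ip, now_ms):
        # Reject immediately while a ban is active; lift it once it has expired.
        until = self._banned_until.get(ip)
        if until is not None:
            if now_ms < until:
                return False
            del self._banned_until[ip]

        # Drop admitted requests that have aged out of the window.
        hits = self._hits[ip]
        while hits and now_ms - hits[0] >= self.window_ms:
            hits.popleft()

        if len(hits) >= self.max_requests:
            # Over the limit: always reject, and ban the IP if bans are enabled.
            if self.ban_ms > 0:
                self._banned_until[ip] = now_ms + self.ban_ms
                hits.clear()
            return False

        hits.append(now_ms)
        return True


def count_admitted(limiter, ip, requests_per_s, duration_s):
    """Replay a steady stream from one IP and count requests that reach the backend."""
    admitted = 0
    for i in range(requests_per_s * duration_s):
        now_ms = i * 1000 // requests_per_s
        if limiter.allow(ip, now_ms):
            admitted += 1
    return admitted


if __name__ == "__main__":
    ip = "203.0.113.7"  # documentation-range address, purely illustrative
    print(count_admitted(PerIpLimiter(5, 1000), ip, 10, 60))
    print(count_admitted(PerIpLimiter(5, 1000, ban_ms=30_000), ip, 10, 60))
```

Under these assumed numbers, a single client sending 10 requests per second for 60 seconds gets 300 requests through with rate-limiting alone, but only 10 when each violation also triggers a 30-second ban. Multiplied across many attacking IPs, the rate-limit-only case still delivers a sustained load to the backends, which matches the failure mode described above.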

REMEDIAL ACTION PLAN & NEXT STEPS

To protect our customers, we continually evolve our defense-in-depth mechanisms. As part of this work, we plan to harden several layers of our stack and, specifically, to change our traffic management layer so that it addresses the high-volume attack pattern that caused this incident. These changes will prevent an incident from the same type of attack in the future.

Defense-in-Depth Action Plan

  • Implement parameter and route validation checks across our stack
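
As a rough illustration of what a parameter and route validation check might look like, the Python sketch below rejects requests whose path or query parameters are not on an allow-list before they reach the application. The report does not specify the actual checks; the routes, parameter names, length cap, and the function `is_valid_request` are hypothetical.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Hypothetical allow-list: route pattern -> permitted query parameter names.
ALLOWED_ROUTES = {
    re.compile(r"/"): set(),
    re.compile(r"/history"): {"page"},
    re.compile(r"/incidents/[a-z0-9]{1,32}"): set(),
}

MAX_PARAM_LENGTH = 64  # assumed cap on the length of any query value


def is_valid_request(url: str) -> bool:
    """Return True only if the path matches a known route and every query
    parameter is permitted for that route and within the length cap."""
    parts = urlsplit(url)
    for pattern, allowed_params in ALLOWED_ROUTES.items():
        if pattern.fullmatch(parts.path):
            params = parse_qsl(parts.query, keep_blank_values=True)
            return all(
                name in allowed_params and len(value) <= MAX_PARAM_LENGTH
                for name, value in params
            )
    return False  # unknown route: reject before it reaches the application
```

With this allow-list, `is_valid_request("/history?page=2")` is accepted, while `is_valid_request("/history?page=2&debug=1")` and `is_valid_request("/admin")` are rejected. Rejecting malformed requests early keeps them from consuming application resources.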

Traffic Management Layer Action Plan

  • Configure the traffic management layer to temporarily block IPs that exhibit malicious behavior

Posted May 29, 2020 - 11:20 PDT

Resolved
This incident has been resolved and our service to public pages has fully stabilized. We will publish a postmortem on the details of this incident soon.
Posted May 14, 2020 - 10:17 PDT
Monitoring
We've identified the source of the traffic that affected our ability to serve public pages, and the service has stabilized. We are continuing to monitor.
Posted May 14, 2020 - 09:35 PDT
Investigating
We are currently investigating this issue.
Posted May 14, 2020 - 09:08 PDT
This incident affected: Hosted Pages (HTTP Pages, HTTPS Pages).