Cristinel Anastasoaie

Incident report - recent downtime for AU data center

As you are probably aware, sites on all data centers have experienced some downtime over the past two weeks. First, we apologize for any inconvenience this might have caused you. Below is a detailed explanation of what happened and the measures we're taking to prevent this in the future.

Starting on September 26th, sites on all data centers began experiencing intermittent downtime. Sites on our Asia Pacific data center experienced longer and more frequent downtime than we had announced in the AWS maintenance blog post.

The downtime has been caused by three distinct events and was amplified by timing:

  • Amazon AWS infrastructure upgrade - this operation involved many server restarts and failing over from one Amazon availability zone to another and then back (basically, we had to execute a scheduled disaster recovery procedure). Our team worked 24/7 to make this major AWS-wide infrastructure upgrade as smooth as possible for all our customers. During these procedures, the sites on the data center under maintenance became totally unavailable, while sites on the other two data centers kept their front-ends running but had most of the back-end services disabled because we needed to stop the data replication between data centers. While Amazon performed the restarts outside business hours for each region, the restarts of the NA and Europe data centers fell during AU business hours and thus had some impact on all sites by preventing customers from accessing some of the back-end services. We are looking into architectural changes that will limit the impact of such operations from one data center on the others.
  • Load balancer crash - this week we encountered a load balancer crash. We worked with the vendor to identify the root cause and decided to upgrade the system’s firmware; this procedure is almost complete and we are closely monitoring the load balancer for any unforeseen issues that might arise.
  • A network connectivity issue between Amazon data centers triggered an automatic failover of the database servers to the backup servers. This type of operation usually causes up to several minutes of downtime. We are currently trying to identify a network architecture change that could help mitigate this type of occurrence.

Once again, our apologies for any inconvenience this incident might have caused. Both our team and Amazon are fully committed to providing the utmost level of security and reliability to all our customers, and we continuously dedicate efforts to improving on these fronts.

Sincerely,

The Adobe Business Catalyst Team

Cristinel Anastasoaie

Scheduled system maintenance on EU datacenter - June 16th 2014

To ensure the highest levels of performance and reliability, we've scheduled a database server upgrade on our EU AWS data center. To minimize the customer impact, the upgrade is scheduled for the most convenient hours for the region and will take up to 4 hours to complete. During the maintenance procedure, creating and updating content, Partner registration, trial site creation, publish from Muse, sFTP, APIs and some site admin sections will not be available. Additionally, all sites on the EU data center will experience 10 minutes of downtime at some point during the maintenance window. Except for the scheduled 10 minutes of downtime, the website front-ends will not be impacted by the maintenance.

Maintenance schedule:

  • Start date and time: Monday, June 16th, 3:00 AM UTC (check data center times)
  • Duration: We are targeting a 4-hour maintenance window

Customer impact:

  • Partner registration, Trial site creation, Muse Publish, APIs, FTP and some admin sections will not be available throughout the entire maintenance window
  • All websites and services on the EU data center will experience 10 minutes of downtime at some point within the maintenance window
  • Creating or updating content on the impacted sites will be unavailable during the maintenance procedure

For up-to-date information about system status, check the Business Catalyst System Status page. We apologize for any inconvenience caused by these service interruptions. Please make sure that your customers and team members are made aware of these important updates.

Thank you for your understanding and support,

The Adobe Business Catalyst Team

Dragos Manescu

Major DNS Upgrade and Service Maintenance on February 8

The Business Catalyst team has just finished work on a major upgrade of the DNS architecture that will increase the performance and the availability of the system.
This is an infrastructure upgrade that will improve both system scalability and resilience, while keeping the existing DNS management user experience unchanged.

The upgrade is scheduled for Wednesday, February 8th, from 05:00 to 11:00 AM EST (check local time). During this time frame, functionality related to domain management, website creation and new partner registration will be suspended.

Please find below a schedule of the maintenance:
  • Time frame for this operation: 05:00 to 11:00 AM EST (check local time); the duration will be about 6 hours.
  • Systems affected: 
    • DNS management for existing websites
    • Set-up of internal or external MX records or addition of new email addresses
    • Site Cancellation from the Partner Portal
    • Partner Activation: if a customer has registered as a partner prior to the above time frame but hasn’t visited the Partner Portal to activate her account, she will have to wait until 11:00 AM EST (check local time) for this activation
    • New website creation
    • New partner registration
We sincerely apologize for any inconvenience caused by these service interruptions.
 
The Business Catalyst Team
Cristinel Anastasoaie

Business Catalyst Service Maintenance - November 12

To ensure the highest reliability and performance levels for our services, we've scheduled a database server upgrade on our Asia Pacific datacenter. The upgrade is scheduled for Saturday, November 12 at 1:00 AM AEDT (check local time) and will take one hour to complete.

During the upgrade, customers of our Asia Pacific datacenter (including the Business Catalyst website) will experience three service interruptions of 5 minutes each.

Please find below the maintenance schedule and the list of affected services:

  • Start of maintenance: Saturday, November 12, 1:00 AM AEDT (check local time)
  • End of maintenance: Saturday, November 12, 2:00 AM AEDT (check local time)
  • Duration: 1 hour
  • Systems affected: Site front-ends, Admin console, Partner Portal, FTP services, API services
  • Customer impact: three service interruptions of 5 minutes each

We sincerely apologize for any inconvenience caused by these service interruptions.

The Business Catalyst Team

Cristinel Anastasoaie

Scheduled System Update - October 31st, November 1st and 2nd

We are planning to update our database and server infrastructure between October 31st and November 2nd. For each datacenter, the update will take up to 6 hours and will cause two downtime sessions of up to 15 minutes each, one at the start of the update and another at the end. During the downtime, the following Business Catalyst services will be unavailable:

  • Admin Console Access
  • Partner Portal
  • FTP
  • Dreamweaver extension
  • Muse
  • Business Catalyst APIs
  • Partner registration
  • Trial site creation

Additionally, during each of the planned downtimes, the Business Catalyst front-end service will experience up to 1 minute of service interruption, during which site visitors will see a "Site under maintenance" page.

Please find below the schedule and expected downtime hours for each of the data centers.

Monday, October 31st, Asia Pacific datacenter update:

  • Duration: 6 hours and 15 minutes
  • Start time: Mon, 31 Oct, 21:00 Sydney time (check local time)
  • Downtime (affecting all sites): up to 15 min, starting 21:00 and ending 21:15 (check local time)
  • End of maintenance: Tue, 1 Nov, 3:15 AM (check local time)
  • Downtime (affecting all sites): up to 15 min starting 3:00 AM and ending 3:15 AM (check local time)

Tuesday, November 1st, North America datacenter update

  • Duration: 6 hours and 15 minutes
  • Start time: 1:00 AM PDT (check local time)
  • Downtime (affecting all sites): up to 15 min, starting 1:00 AM and ending 1:15 AM (check local time)
  • End of maintenance: 7:15 AM PDT (check local time)
  • Downtime (affecting all sites): up to 15 min, starting 7:00 AM and ending 7:15 AM (check local time)
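
The times listed for each data center follow from simple arithmetic: a 6-hour update with a 15-minute downtime session at the start and another filling the last 15 minutes of the window. The following is a minimal sketch of that calculation, not the tooling used by the Business Catalyst team. The function name `maintenance_schedule` is hypothetical, the year 2011 is an assumption (the post does not state it), and the Asia Pacific start is taken as 21:00 on Monday, October 31st.

```python
from datetime import datetime, timedelta


def maintenance_schedule(start, update=timedelta(hours=6),
                         downtime=timedelta(minutes=15)):
    """Return (first_downtime, second_downtime, end) for one data center.

    Each downtime is a (begin, end) pair of naive local datetimes.
    The first session begins with the update; the second fills the
    final `downtime` of the maintenance window.
    """
    end = start + update + downtime
    first = (start, start + downtime)
    second = (end - downtime, end)
    return first, second, end


# Asia Pacific: 21:00 Sydney time on October 31 (year is an assumption).
ap_first, ap_second, ap_end = maintenance_schedule(datetime(2011, 10, 31, 21, 0))
print("First downtime: ", ap_first[0].isoformat(), "to", ap_first[1].isoformat())
print("Second downtime:", ap_second[0].isoformat(), "to", ap_second[1].isoformat())
print("End of window:  ", ap_end.isoformat())
```

The same function applied to a 1:00 AM start gives the North America times. All values are local wall-clock times for the data center in question; the sketch does not handle daylight saving transitions that fall inside a window.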

Wednesday, November 2nd, Europe datacenter update

If you have any questions, please contact the Business Catalyst support team.

Cristinel Anastasoaie

Update on migration schedule and duration

Last update: 3:00 AM February 19, Australian EDT

During this week, our engineering and operations teams executed a series of dry-runs to better predict the maintenance window duration as well as identify potential roadblocks.

While we haven't discovered anything that might stop us from going ahead with our migration plans, the exercises revealed that executing the plan takes up the full 6-hour window, leaving us no buffer time.

To mitigate this risk, we have decided to add 2 hours to the maintenance window. Thus, the migration will start at 12:00 AM February 20 Sydney time and last for 8 hours. Please use a time zone converter to find the time in your local time zone.

I have updated the communication already published on the bcstatus.com website to include these changes. Please expect the final email communication to be sent on Saturday, February 19.

Thank you,
Cristinel Anastasoaie
Adobe Business Catalyst Product Manager


New Schedule For Sydney Datacenter Migration

We are continuing to experience intermittent issues on Adobe Business Catalyst's legacy Sydney datacenter. All dates/times in this post are in Australian Eastern Daylight Saving Time (+11 GMT). We have systems engineers working on the issue. I know this is the 3rd business day in a row and it is really getting long in the tooth for Partners and site owners alike. Paul Gubbay, a VP of Engineering at Adobe, will be posting on the blog shortly to share some of his thoughts on this very serious situation.

  • Issue: Business Catalyst services hosted on the legacy Sydney Primus Datacenter are exhibiting slow response times. Web pages for these sites are being served slowly, and customers are reporting problems accessing the Admin UI or transferring data through FTP. This is the 3rd business day in a row this has occurred.
  • Time of Incident Start: 2 Feb 2011 11:18AM Australian Eastern Daylight Time
  • Time of Incident End: Ongoing - ETA is unknown
  • Technical Action: Although we installed an extra switch and an extra firewall into the environment yesterday and moved the OpenSRS migration to use the secondary switch/firewall, system engineers suspect there's still too much HTTP traffic coming through the primary firewall. I mentioned that we were going to put in a load balancer for the 2 firewalls as well, but this was not required in the end because we put the second firewall on the second switch. Our plan now involves moving 2 web servers (out of 3) across to the network using the secondary firewall/switch combination to balance the load (resulting in 5-10 minutes of downtime).

New Schedule For Migration

On to the topic of datacenter migration: given all the feedback we've received in the comments below, we've now scheduled the migration to occur at 1:00AM Sunday 13 February (check local times here) to give the lowest customer impact possible for Australian businesses.

To give you some background, we originally chose 9am Saturday morning because we thought it would help those partners and site owners with externally hosted DNS make the switch in sync with the migration; however, this isn't required anymore. Additionally, you have all made it clear that the impact to your customers' businesses would be unacceptable if we were to do the migration at the original time. With this in mind, the updated details are as follows:

  • What's Happening?: We are migrating all sites and BC application infrastructure from Sydney Primus to Sydney Ultimo in one bulk-move
  • Target Start Date/Time: 1:00AM Sunday 13 February 2011 (Australian EDT) | 6:00AM Saturday 12 February 2011 (US Pacific) | 2:00PM Saturday 12 February 2011 (London) | check local times here
  • Target End Date/Time: 6:00AM Sunday 13 February 2011 (Australian EDT) | 11:00AM Saturday 12 February 2011 (US Pacific) | 7:00PM Saturday 12 February 2011 (London) | check local times here
  • How Long Will It Take? We will have a scheduled maintenance window of 5 hours, during which all sites hosted on Sydney Primus will be unavailable. Partner Portal access and new site creation will be unavailable at this time as well.
  • What are we doing? Simply put, we are going to replicate all databases between Sydney Primus and Sydney Ultimo. We will also set up a high-speed direct datalink between the 2 locations, to ensure databases are kept in sync prior to the migration. At the scheduled time of the migration we will reconfigure DNS settings and make other related BC architectural changes to point to the new Ultimo Datacenter. We will also need to restart all web servers.
  • Customer Impact - Worldwide: During the migration you will not be able to create new BC sites on any datacenter. You will not be able to access the Partner Portal during the maintenance window. No action is required from you.
  • Customer Impact - sites hosted on legacy Sydney DC with redelegated DNS: In addition to the above, all sites hosted on legacy Sydney DC will be offline for the maintenance window of 5 hours. There will be no front-end pages being served or Admin console access. No action is required from you.
  • Customer Impact - sites hosted on legacy Sydney DC with externally hosted DNS: In addition to the 2 points above, you will be required to change your DNS settings with your DNS host (e.g. MelbourneIT, GoDaddy) to point to the IP address of the new datacenter after the migration has started.
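
For readers who want to check the start and end times above for their own region, here is a minimal sketch of the conversion. It uses fixed UTC offsets rather than a time zone database, and those offsets are assumptions that hold for mid-February only: UTC+11 for Sydney (as stated in this post), UTC-8 for US Pacific and UTC+0 for London. The function name `window_in_zones` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets valid for mid-February (assumptions, not a tz database lookup).
AEDT = timezone(timedelta(hours=11), "AEDT")  # Sydney, daylight saving time
PST = timezone(timedelta(hours=-8), "PST")    # US Pacific, standard time
GMT = timezone(timedelta(0), "GMT")           # London, winter time


def window_in_zones(start, duration, zones):
    """Map each zone name to the (start, end) of the window in that zone."""
    end = start + duration
    return {tz.tzname(None): (start.astimezone(tz), end.astimezone(tz))
            for tz in zones}


# Target start: 1:00AM Sunday 13 February 2011 (Australian EDT), 5-hour window.
start = datetime(2011, 2, 13, 1, 0, tzinfo=AEDT)
windows = window_in_zones(start, timedelta(hours=5), [AEDT, PST, GMT])

for name, (begin, end) in windows.items():
    print(name, begin.strftime("%Y-%m-%d %H:%M"), "to", end.strftime("%Y-%m-%d %H:%M"))
```

For dates near a daylight saving changeover, a time zone database (for example Python's `zoneinfo` module) would be a safer choice than fixed offsets.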

Sites with Externally Hosted DNS - Action Required

There have been some questions around what happens for sites with their DNS externally hosted. The Engineering team are looking into an improved solution right now, which is to keep a proxy server in the legacy Sydney Datacenter so that all requests coming in to the old legacy IP addresses get routed through to the new datacenter transparently. Likewise, the pages being served will come from the new DC through the proxy server back out to customers. This is not a permanent solution, but it gives you a longer window in which to make your DNS changes and also lessens the impact on your customers when you do make the change. We will likely keep the proxy server running for at least 30 days after the migration before we fully decommission the legacy datacenter.

For partners or site owners with externally hosted DNS, we advise you to lower the TTL for your records to 1800 seconds (30 minutes) during the next week in preparation for the migration, so that when you do make an IP address change following the migration, the new settings take less time to propagate.
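
To illustrate why the TTL should be lowered ahead of time, here is a minimal sketch of the timing involved. It assumes resolvers honor TTLs, which is not always the case in practice. A resolver that cached a record just before the TTL was lowered may keep it for the full old TTL, so the TTL has to be lowered at least that long before the IP address change. The function name `ttl_plan`, the old TTL of 86400 seconds and the change time used below are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def ttl_plan(ip_change_at, old_ttl_s, new_ttl_s=1800):
    """Return (lower_ttl_by, propagated_by) for a planned IP address change.

    lower_ttl_by:  latest time to publish the lowered TTL so that every
                   cached copy carrying the old TTL has expired by the change.
    propagated_by: latest time a TTL-honoring resolver can still return
                   the old IP address after the change.
    """
    lower_ttl_by = ip_change_at - timedelta(seconds=old_ttl_s)
    propagated_by = ip_change_at + timedelta(seconds=new_ttl_s)
    return lower_ttl_by, propagated_by


# Hypothetical example: records currently at a 1-day TTL, IP change at 06:00.
change = datetime(2011, 2, 13, 6, 0)
lower_by, done_by = ttl_plan(change, old_ttl_s=86400)
print("Lower the TTL no later than:", lower_by.isoformat())
print("Old IP may be served until: ", done_by.isoformat())
```

With these example values, lowering the TTL a week in advance, as advised above, leaves a comfortable margin over the one-day minimum.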

Thanks for reading and check back in a bit for Paul's post.
Eddy Chan
Business Catalyst Product Manager

Legacy Sydney Datacenter Issues Update

At the time of writing, we continue to experience issues on BC's legacy Sydney datacenter. We have systems engineers working on the issue and I am posting an official update on the situation. Please note that for the purposes of this post, all dates/times are posted as Australian Eastern Daylight Saving (+11 GMT) time.

To give you some background surrounding these issues, we originally had 2 Watchguard firewalls in place in our legacy (Sydney Primus) datacenter, one acting as the primary, the other as a backup. The primary firewall developed a hardware issue, causing last Friday's outage, and we failed over to the backup firewall.

Yesterday, we suffered another major outage from 11am to 5:30pm due to the backup firewall being unable to handle the load. To rectify this, we have installed an additional firewall with a load balancer to distribute the load across 2 firewalls, and to try and stabilize the situation. We are also adding another network switch which will take approximately 2 hours. We are working with the vendor to procure another primary firewall as soon as possible, giving us triple redundancy.

Other actions we are taking to improve stability in the Sydney Primus datacenter include:

  1. Rebooting the NAS server tonight (1AM Wed 2 February 2011) - this will result in 25 minutes of downtime during off-peak hours, however the reboot will free up system resources and improve performance of that server
  2. Throttling OpenSRS mail migration - given that we are experiencing load issues on our firewall, we have taken steps to throttle our OpenSRS migration from Sydney Primus. The legacy mail server was physically located in the same location behind the same firewall as the other servers. This has unfortunately extended our mail migration period by another 72 hours.

System engineers are monitoring the situation 24/7 and you can be assured they are doing everything possible to keep the system stable.

Plan for Migrating to Sydney Ultimo

Obviously, keeping the old DC stable isn't our final fix for these on-going issues. Our medium term goal is to migrate all sites from Sydney Primus to Sydney Ultimo as soon as possible, with the least amount of customer impact. I've just finished meeting with the Engineering and Systems teams, who have put together a technical plan which I'm sharing publicly to keep you informed of the situation. Please be aware that the following is subject to change over the next 10 days.

  • What's Happening?: We are migrating all sites and BC application infrastructure from Sydney Primus to Sydney Ultimo in one bulk-move
  • Target Date/Time: 7am Saturday 12 February 2011 (AEDT). This is 2 weekends from now.
  • How Long Will It Take? We will have a scheduled maintenance window of 5 hours, during which all sites hosted on Sydney Primus will be unavailable
  • What are we doing? Simply put, we are going to replicate all databases between Sydney Primus and Sydney Ultimo. We will also set up a high-speed direct datalink between the 2 locations, to ensure databases are kept in sync prior to the migration. At the scheduled time of the migration we will reconfigure DNS settings and make other related BC architectural changes to point to the new Ultimo Datacenter. We will also need to restart all web servers.
  • Customer Impact - Worldwide: During the migration you will not be able to create new BC sites on any datacenter. You will not be able to access the Partner Portal during the maintenance window. No action is required from you.
  • Customer Impact - sites hosted on legacy Sydney DC with redelegated DNS: In addition to the above, all sites hosted on legacy Sydney DC will be offline for the maintenance window of 5 hours. There will be no front-end pages being served or Admin console access. No action is required from you.
  • Customer Impact - sites hosted on legacy Sydney DC with externally hosted DNS: In addition to the 2 points above, you will be required to change your DNS settings with your DNS host (e.g. MelbourneIT, GoDaddy) to point to the IP address of the new datacenter. More details and instructions on this will follow in the near future.

Over the coming days I will be posting regular communications around this datacenter migration, including detailed instructions if action is required from you or your customers, and more technical details around the plan as well. We've learnt some important lessons from the mail migration communication process; thank you for the feedback you've provided.

Finally, I want to thank all our partners for sticking with us through these trying times. I read the forums and the comments on this blog and I understand that many of you have built businesses on BC and that you're feeling pain. We know that this is disruptive to you and we are throwing everything we can at the problem to fix it. I will be posting daily updates to the blog on the situation and try to answer as many questions as possible via this channel.

Thanks for reading,
Eddy Chan
Business Catalyst Product Manager