


We always want Facebook's products and services to work well for anyone who uses them, no matter where they are in the world. This motivates us to be proactive in detecting and addressing problems in our production infrastructure, so we can avoid failures that could slow down or interrupt service to the millions of people using Facebook at any given time.


In 2011, we introduced the Facebook Auto Remediation (FBAR) service, a set of daemons that execute code automatically in response to detected software and hardware failures on individual servers. Every day, without human intervention, FBAR takes these servers out of production and sends requests to our data center teams to perform physical hardware repairs, making isolated failures a nonissue.

As our infrastructure continues to grow, we also have to be proactive in detecting and addressing problems at the rack level or for other failure domains such as network switches or backup power units. Given that multiple services can be collocated on a single rack, performing this type of maintenance on a daily basis would interrupt dozens of teams, some multiple times, throughout the year.

To help minimize disruption, we built an enhancement on top of FBAR called Aggregate Maintenance Handlers that provides a way to safely automate maintenance on multiple servers at once. For cases where automation isn't enough, we also developed Dapper, a tool that enables manual intervention to ensure that scheduled maintenance can proceed safely. The rest of this post explores how the Aggregate Maintenance Handlers work for various outage scenarios, including what happens when automation fails, and how Dapper is used to coordinate automated and manual processes.

Automating with Aggregate Maintenance Handlers

While FBAR includes methods to disable and reenable a single host at a time, executing these methods serially or in parallel was not a safe enough approach for working on multiple hosts at once. The serial approach could be time-consuming or risk a service running out of capacity one server at a time. The parallel approach was prone to race conditions and could run a service out of capacity even faster.

Aggregate Maintenance Handlers offer a framework to automatically disable and enable servers in bulk, providing our engineers with full context on the maintenance work being performed and the full scope of servers affected.

Making decisions based on maintenance impact

Outages vary in size, length, and type: Some can affect a single rack, some can affect several; they can be long or short; some can affect only network connectivity while others can interrupt power supplies. Different services deal with different outages in different ways. When we schedule maintenance work, we give the Aggregate Maintenance Handler four pieces of information to determine the impact it will have on our overall infrastructure:

- Scope (a full list of servers affected by the maintenance)
- Maintenance type (network interruption, power interruption)
- Maintenance start time (e.g., 10:00 a.m. Pacific Standard Time)
- Maintenance duration (e.g., two hours)
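
As a rough illustration, the four pieces of information above could be carried in a small record like the one below. This is only a sketch: the names `ImpactDescription` and `MaintenanceType`, the field names, and the sample hostnames and date are hypothetical and are not taken from FBAR or Dapper.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import FrozenSet


class MaintenanceType(Enum):
    """The two maintenance types named in the post (hypothetical enum)."""
    NETWORK_INTERRUPTION = "network"
    POWER_INTERRUPTION = "power"


@dataclass(frozen=True)
class ImpactDescription:
    """Hypothetical container for the four inputs given to a handler."""
    scope: FrozenSet[str]              # every server affected by the maintenance
    maintenance_type: MaintenanceType  # network or power interruption
    start_time: datetime               # when the maintenance begins
    duration: timedelta                # how long it is expected to last

    @property
    def end_time(self) -> datetime:
        # Derived value: the moment the affected hosts could be re-enabled.
        return self.start_time + self.duration


# Example values are made up for illustration.
impact = ImpactDescription(
    scope=frozenset({"web001", "web002", "cache001"}),
    maintenance_type=MaintenanceType.POWER_INTERRUPTION,
    start_time=datetime(2024, 1, 15, 10, 0),
    duration=timedelta(hours=2),
)
```

Making the record immutable (`frozen=True`) is a design choice of this sketch, so that every service evaluating the same maintenance sees the same description.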

Our engineers can then use this impact description to make decisions about automation and optimize how the outage should be handled. Let's look at three simplified examples:

- A stateless web server could handle a network or power interruption of any length by being removed from a load balancer pool. The only concern in this case would be to ensure that there are enough web servers still available to handle all requests.
- A cache machine serving a static index from memory could handle a lengthy network interruption by being taken out of a load balancer pool. Once the network is restored, the machine could immediately resume serving the index. A short power interruption, on the other hand, would require reloading the index into memory. Dealing with a reboot would require proactively replacing the server with one not affected by the same maintenance.
- A MySQL replica with a busy replication stream could handle a short power interruption. The host would be removed from a load balancer pool, data would be stored on disk, and the MySQL server would quickly catch up on replication after rebooting. Conversely, interrupting network connectivity for hours could cause it to fall too far behind, making a proactive replacement of the replica server a better option.
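
The three examples above can be written down as a lookup table. The sketch below is an assumption-laden illustration, not the actual handler code: the service keys, the action names, and the one-hour cutoff between "short" and "long" are all hypothetical, and only the combinations the examples explicitly mention are filled in. Any other combination returns `None` rather than guessing.

```python
from datetime import timedelta
from typing import Optional

# Hypothetical action names.
REMOVE_FROM_POOL = "remove_from_pool"
REPLACE_HOST = "replace_host"

# Assumed cutoff between a "short" and a "long" interruption; the post
# does not give a number.
LONG_INTERRUPTION = timedelta(hours=1)

# (service, maintenance type, length) -> action, filled in only for the
# cases described in the three examples.
DECISION_MATRIX = {
    ("web", "network", "short"): REMOVE_FROM_POOL,
    ("web", "network", "long"): REMOVE_FROM_POOL,
    ("web", "power", "short"): REMOVE_FROM_POOL,
    ("web", "power", "long"): REMOVE_FROM_POOL,
    ("cache", "network", "long"): REMOVE_FROM_POOL,
    ("cache", "power", "short"): REPLACE_HOST,
    ("mysql_replica", "power", "short"): REMOVE_FROM_POOL,
    ("mysql_replica", "network", "long"): REPLACE_HOST,
}


def choose_action(service: str, maintenance_type: str,
                  duration: timedelta) -> Optional[str]:
    """Return the action for this maintenance, or None if unspecified."""
    length = "long" if duration >= LONG_INTERRUPTION else "short"
    return DECISION_MATRIX.get((service, maintenance_type, length))
```

For instance, `choose_action("mysql_replica", "network", timedelta(hours=3))` selects a replacement, while the same replica facing a five-minute power interruption is simply removed from its pool.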

Taking into account the length and type of interruptions allows us to build a simple decision-making matrix for each service:


Handler disable/enable process

Once the appropriate maintenance has been selected and scheduled, the handler follows a four-step flow to disable the affected hosts:

1. Preflight check
2. Pre-disable
3. Host-level disable
4. Post-disable

Preflight check: The preflight check is called at the start of the disable process and checks whether there would be enough capacity available in the unaffected servers for the maintenance to be performed safely. It returns a true or false response that either allows the maintenance work to move forward or halts it, respectively. The preflight check can also be called independently as part of a scheduling process, giving teams more time to handle scenarios where the preflight check might return false.
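
A minimal sketch of such a capacity check follows. It assumes capacity can be expressed as a simple count of healthy hosts; the function name, parameters, and host names are hypothetical, and a real service would likely use a richer capacity model.

```python
from typing import Set


def preflight_check(pool: Set[str], scope: Set[str], min_required: int) -> bool:
    """Return True if the service keeps enough hosts outside the maintenance.

    pool:         all hosts currently serving the service (assumed input)
    scope:        hosts affected by the maintenance
    min_required: fewest hosts the service needs to stay healthy (assumed)
    """
    unaffected = pool - scope
    return len(unaffected) >= min_required


web_pool = {"web1", "web2", "web3", "web4", "web5", "web6"}

# Two of six hosts affected, four required: maintenance may proceed.
small_ok = preflight_check(web_pool, {"web1", "web2"}, min_required=4)

# Three of six hosts affected, four required: maintenance is halted.
large_ok = preflight_check(web_pool, {"web1", "web2", "web3"}, min_required=4)
```

Because the function has no side effects, it can be called at scheduling time as well as at the start of the disable process, which matches the independent use described above.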

Let's imagine the following six-rack row in a data center with the given constraints:


Now let's imagine two maintenance scenarios:


Preflight checks for the web servers would pass in both scenarios, but in scenario B, preflight checks would fail for both the cache and database servers, and the maintenance would not be allowed to proceed automatically. (This scenario is addressed in more detail in the next section.)

When all preflight checks pass, our Aggregate Maintenance Handlers allow us to wrap a smarter layer of code around pre-existing host-level disable/enable logic.

Pre-disable: This step is generally used to ensure that hosts currently considered spares in our pools are not accidentally reintroduced into production when multiple hosts are swapped out during host-level disable or bulk operations.


Host-level disable: In some cases, this is a no-op because hosts were bulk-disabled in the pre-disable step. In all other cases it becomes a parallel execution of host-level disable logic inherited from FBAR.

Post-disable: This step is used primarily to verify that pre-disable and host-level disables succeeded. It also allows the author to inspect the results of the host-level disable step and decide whether to ignore certain types of failures if they remain below a desired threshold.
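
The four steps described above can be sketched as a single function. This is an illustrative outline under stated assumptions, not the real handler interface: the function name, the callback signatures, the thread pool size, and the `max_failure_ratio` parameter are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_disable_flow(
    hosts: List[str],
    preflight: Callable[[List[str]], bool],
    pre_disable: Callable[[List[str]], None],
    disable_host: Callable[[str], bool],
    max_failure_ratio: float = 0.0,
) -> bool:
    """Run the four-step disable flow; return True if maintenance may proceed."""
    # 1. Preflight check: halt before touching anything if capacity is short.
    if not preflight(hosts):
        return False

    # 2. Pre-disable: bulk work, such as fencing off spares in the scope.
    pre_disable(hosts)

    # 3. Host-level disable: run the single-host logic in parallel.
    def safe_disable(host: str) -> bool:
        try:
            return bool(disable_host(host))
        except Exception:
            return False  # treat an exception as a failed disable

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(safe_disable, hosts))

    # 4. Post-disable: tolerate failures only below the chosen threshold.
    failures = results.count(False)
    return failures <= max_failure_ratio * len(hosts)
```

With four hosts, one failed host-level disable, and `max_failure_ratio=0.25`, the flow reports success; with the default ratio of `0.0`, the same single failure stops the maintenance.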

This flow is represented in the following animation:


The enabling process is identical to the disable process: pre-enable, host-level enable, and post-enable. With automation, we can safely perform regular maintenance at the rack or multi-rack level while minimizing disruption to other teams and the services that people on Facebook use.

Coordinating with humans: When automation isn't possible (or fails)

Although our goal is to be able to automate all of the maintenance work that needs to happen in our infrastructure, there are times when manual intervention is required to ensure that the maintenance can happen safely.

Failed preflight checks or no automation

In some cases, the scheduled work affects large enough sets of servers that the preflight checks will refuse to allow the maintenance to proceed automatically. Our automation is intentionally conservative and prefers manual intervention over possibly risky larger-scale operations. In other cases, automation has not yet been implemented or has been temporarily disabled, either for reliability reasons or because a service is in a degraded state and we prefer to prevent automated changes from happening.

Failed automation

Even though we have a high success rate when we invoke our Aggregate Maintenance Handlers, there are still occasions where things go wrong. When a failure happens, our maintenance process notifies the service's owner that the automation has failed. Once they've manually confirmed that the hosts have been properly disabled, the maintenance is allowed to continue.

Mixing automation and manual work

To help coordinate automated and manual processes, we've developed Dapper, a tool that can be used by a variety of teams (e.g., data center teams, technical program managers, infrastructure engineers, production engineers) to schedule maintenance work by providing the impact description mentioned above (hosts affected, maintenance type, start time, and duration).

The workflow for maintenance executed by Dapper is as follows:


Lessons learned

We learned a few lessons early on as we were scaling from automated single-host repairs up to rack-level and multi-rack maintenance work.

Serial use of disable logic

Disabling hosts one at a time had two possible negative side effects. The first was running out of capacity at some point during the maintenance, resulting in the maintenance work being blocked until a human intervened:


Worse, when the swap logic for a service preferred to reuse hosts in the same rack, we could either accidentally reintroduce hosts back into production, or at best, run into an infinite loop:
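
The same-rack problem can be illustrated with a toy swap function. Everything here is hypothetical (the function name, the rack map, the host names); the `exclude` argument shows one possible guard, in the spirit of the pre-disable step, and is not a description of the actual fix.

```python
from typing import Dict, FrozenSet, List, Optional


def pick_replacement(host: str, rack_of: Dict[str, str], spares: List[str],
                     exclude: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Pick a spare for `host`, preferring spares in the same rack."""
    candidates = [s for s in spares if s not in exclude]
    same_rack = [s for s in candidates if rack_of[s] == rack_of[host]]
    preferred = same_rack or candidates
    return preferred[0] if preferred else None


rack_of = {"a1": "A", "a2": "A", "b1": "B"}
spares = ["a2", "b1"]
scope = frozenset({"a1", "a2"})  # the whole of rack A is under maintenance

# Single-host logic knows nothing about the scope and picks a2, a spare
# that is about to lose power as well.
naive_choice = pick_replacement("a1", rack_of, spares)

# Passing the full scope keeps affected spares out of consideration.
safe_choice = pick_replacement("a1", rack_of, spares, exclude=scope)
```

Here `naive_choice` is `"a2"` and `safe_choice` is `"b1"`. When the serial process later reaches `a2`, it would have to swap it out again, which is how the loop described above can arise.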


Parallel use of disable logic

Swapping hosts in parallel rather than one at a time could possibly prevent some of the issues seen in the serialized approach, but introduced other problems. The most common problem was that invoking single-host logic in parallel would cause a race condition where individual operations would find a replacement host, but the aggregate result would cause a service to run out of capacity:
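
The race can be reduced to a small deterministic model. It assumes every parallel operation reads the same snapshot of the spare pool before any of them claims a spare; the function names and the numbers are made up for illustration.

```python
def single_host_can_swap(spares_snapshot: int) -> bool:
    # Each host-level operation only asks: is at least one spare free?
    return spares_snapshot >= 1


def aggregate_can_swap(spares: int, hosts_to_swap: int) -> bool:
    # The aggregate view asks: are there enough spares for every host?
    return spares >= hosts_to_swap


spares = 2
hosts_to_swap = 4

# All four operations see the same two spares, and each concludes it is safe.
individual_results = [single_host_can_swap(spares) for _ in range(hosts_to_swap)]

# Evaluated as a whole, the swap is two spares short.
aggregate_result = aggregate_can_swap(spares, hosts_to_swap)
```

Every entry in `individual_results` is `True` while `aggregate_result` is `False`, which is the gap between host-level and aggregate decisions that the paragraph above describes.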


Expanding automation

The framework provided by Dapper and Aggregate Maintenance Handlers has grown beyond just physical maintenance work, expanding to include disabling and enabling hosts as part of software releases, or kernel, BIOS, and OS upgrades.

The production engineers working on Dapper are passionate about further expanding the reach of automation and building tools that allow Facebook's teams to lower the burden of operations work, freeing them up to tackle bigger, more challenging problems.