
With the recent addition of SharedArrayBuffer, concurrency is making its way into the JavaScript language. This new feature allows JavaScript programs to perform concurrent access to SharedArrayBuffer objects. WebKit supports SharedArrayBuffer, and it has full optimization support in our compiler pipeline. Unfortunately, JavaScript does not allow any objects other than SharedArrayBuffer to be shared.

This post considers a wild thought experiment: what would it take to extend concurrency to the entire JavaScript heap? In such a world, any object could be shared with other threads. This would be no small change. Existing JavaScript VM optimizations exploit the fact that there is only one thread of execution, so concurrency is sure to introduce some performance problems. This post is concerned with whether this is technically feasible at all, and if it is, what the cost would be.

We offer a basic strawman API to illustrate what we mean by concurrency. Most of this post is concerned with how WebKit's JavaScript VM (called JavaScriptCore, or JSC for short) could implement the strawman. Implementing it would take a lot of work, and we think the implementation should aim for the following goals:

  • No performance regression for code that does not use concurrency.
  • Linear scalability when a program runs in parallel without intentionally sharing any objects. We don't expect two threads to be fully twice as fast as one, since we know concurrency has some overhead, but if you have two CPUs, two threads should be faster than one, and hopefully close to twice as fast.
  • Linear scalability (including speed-up over the best serial baseline) on a corpus of parallel code that does share objects.
  • Sensible semantics, including a memory model that is not much weaker than what today's hardware provides.
  • Compatibility. For example, concurrent JavaScript programs need a story for how to use the DOM without requiring the DOM implementation to be rewritten.

Our proposed implementation scheme relies on 64-bit systems, but that is mostly because our engine is already 64-bit-centric.

It's not yet possible to evaluate the performance of this scheme empirically, but the sketch does give an intuitive sense of what performance would look like. Specifically, our scheme ensures that most property accesses incur at most one extra arithmetic instruction of overhead (which is nearly free), while a handful of especially tricky (but rare) accesses incur up to about a 7x overhead. The end of this post compares our scheme to other techniques for implementing concurrency.

Strawman Concurrent JS

Our strawman proposal for concurrent JS is simply to add threads. Threads get separate stacks, but everything else is shared. Threads appeal to us because they are general-purpose: we can imagine many concurrent programming models built on top of threads. So, if we can get our VM to support threads, we can probably also support many other concurrent and parallel programming models. This post uses technical feasibility as the gating factor for adding any concurrent programming model to JavaScript.

 


This section makes the strawman concrete, mostly to provide context on which parts of the implementation are hard and which are easy. It's useful to know what the API looks like in order to understand the constraints it creates.

Every program starts with one thread. Threads can start other threads. Threads live in the same heap and can share objects with each other. This section shows the API, describes the memory model, and shows how concurrency interacts with the DOM.

Possible API

We propose:

  • a simple API for creating threads,
  • a change to the Atomics object to support building lock objects,
  • a lock and condition variable API that can be built on top of Atomics,
  • a way to create thread-local variables, and
  • some helpers to allow for incremental adoption.

This post uses this strawman to show what it would take to implement threads in JavaScriptCore.

We want it to be easy to create threads:

new Thread(function() { console.log("Hello, threads!"); });

This starts a new thread, which eventually prints "Hello, threads!". Note that even this simple example shares many things with the new thread. For example, the function object captures the lexical scope of the thread in which it was created, so when the thread accesses the console variable, it is accessing a shared object.

Threads can be joined, to wait for them to finish and get their result:
let result = new Thread(() => 42).join(); // returns 42

In web browsers, the main thread cannot block, so join() will throw an exception if the code above runs on the main thread. We can support asynchronous versions of blocking operations:

new Thread(() => 42).asyncJoin().then((result) => /* result is 42 */)

The Thread object for the current thread is available at any time:

let myThread = Thread.current;

Threads may need to wait for each other to avoid races. Locks accomplish this. Rather than simply introducing a lock API, we propose extending the Atomics API so that users can build any kind of lock they like. We will provide a good lock implementation as part of our proposed API, but we want to encourage the creation of other kinds of synchronization primitives. The existing SharedArrayBuffer specification already allows developers to build custom locks using the Atomics API. That API lets you say things like:
Atomics.wait(array, index, expectedValue);

and:

Atomics.wake(array, index, numThreadsToWake);

Currently, the array must be an integer typed array backed by a SharedArrayBuffer. We propose extending all Atomics methods that take an array/index to also take an object and a property name. Since an index is a property name, this does not change the behavior of code that already uses this API with SharedArrayBuffers. It also means that the Atomics methods that currently take integer values (for storing or comparing elements in typed arrays) would be able to take any JavaScript value when used with ordinary JavaScript properties. Atomics.wake, Atomics.wait, and Atomics.compareExchange are sufficient to implement any kind of lock using just one JavaScript property.
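To make the wait/wake pattern concrete, here is a minimal sketch of the kind of lock those three primitives enable, written as a C++20 analogue (std::atomic's compare_exchange, wait, and notify play the roles of Atomics.compareExchange, Atomics.wait, and Atomics.wake; the class and member names are ours, not part of the proposal):

#include <atomic>

// A tiny mutex built from compare-exchange, wait, and wake: the same three
// primitives the extended Atomics API would expose on ordinary JS properties.
class TinyLock {
public:
    void lock()
    {
        unsigned expected = 0;
        // Fast path: CAS the state from unlocked (0) to locked (1).
        while (!m_state.compare_exchange_weak(expected, 1)) {
            m_state.wait(1); // Like Atomics.wait(lock, "state", 1): block while locked.
            expected = 0;
        }
    }

    void unlock()
    {
        m_state.store(0);
        m_state.notify_one(); // Like Atomics.wake(lock, "state", 1).
    }

private:
    std::atomic<unsigned> m_state { 0 };
};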


Additionally, we propose adding a Lock API:

let lock = new Lock();
lock.hold(function() { /* ...perform work with lock held... */ });

On the main thread, holding the lock can be done via a promise:

lock.asyncHold().then(function() { /* ...perform work with lock held... */ });

This works because each thread has its own runloop. We can also add a Condition API:

let cond = new Condition();
cond.wait(lock); // Wait for a notification while the lock is temporarily released.
// ...
cond.asyncWait(lock).then(function() { /* ...perform work with lock reacquired... */ });
// ...
cond.notify(); // Notify one thread or promise.
// ...
cond.notifyAll(); // Notify all threads and promises.

Condition.prototype.wait releases the lock you pass it before waiting, and reacquires it before returning. The asynchronous variant associates the promise with the Condition object, so that if the condition is notified, the promise will be fulfilled on the current thread.

Using Thread.current and WeakMap, anyone can implement thread-local variables. Nonetheless, the underlying runtime can sometimes do something smarter, and we don't want to require every JavaScript programmer to know how to implement thread-local variables with a WeakMap. Hence, we propose a simple API:
let threadLocal = new ThreadLocal();
function foo()
{
    return threadLocal.value;
}
new Thread(function() {
    threadLocal.value = 43;
    print("Thread sees " + foo()); // Will always print 43.
});
threadLocal.value = 42;
print("Main thread sees " + foo()); // Will always print 42.

Finally, we want it to be easy for users to assert that an object stays on one thread:

var o = {f: "hello"}; // Could be any object.
Thread.restrict(o);
new Thread(function() {
    console.log(o.f); // Throws ConcurrentAccessError
});

Thread.restrict makes an object throw ConcurrentAccessError in response to any proxyable operation performed by any thread other than the one that called Thread.restrict.

Memory Model

Processors and compilers like to reorder memory accesses. Both processors and compilers love to hoist loads. This happens because the following code:

let x = o.f
o.g = 42
let y = o.f

may be transformed by the compiler or the processor into:

let x = o.f
o.g = 42
let y = x

Effectively, the "load" into y gets moved above the store to o.g. The processor can perform the same optimization dynamically, by caching the load of o.f and reusing the cached result for the second load.

Processors and compilers also like to sink stores. This happens because the following code:

o.f = 42
let tmp = o.g
o.f = 43

may be transformed by the compiler or the processor into:

let tmp = o.g
o.f = 43

This is "as if" the o.f = 42 statement got moved to just before o.f = 43. Even if the compiler doesn't perform this transformation, the processor may buffer stores and execute them when it is convenient, which means that o.f = 42 may execute after let tmp = o.g.

It makes most sense to stick with the existing SharedArrayBuffer memory model. It's already possible to write multithreaded code that has bounded nondeterminism using SharedArrayBuffer, and we won't completely shield such code from races. But JavaScript objects are a lot more complicated than a buffer, so guaranteeing the basic invariants of the JS object model in the face of races is nontrivial. We propose that operations that modify JavaScript object storage execute atomically. Atomicity means that if many threads execute these operations concurrently, then each one will behave as if they had all executed in some order without any concurrency. Each of the following operations that JS can perform on objects should be atomic:

  • Adding a property.
  • Deleting a property.
  • Getting a property's value.
  • Setting a property's value.
  • Changing a property's configuration.
  • Getting a snapshot of the set of property names.

This doesn't always mean that the expression o.f is atomic, since this expression may do much more than load a property's value. In particular:

  • If o.f is a plain property on o itself, then it is atomic.
  • If o.f is a prototype access, then loading o's prototype is separate from loading f from the prototype.
  • If o.f is a getter, then loading the getter is one step (which is atomic if the getter is directly on o), but calling the getter is not atomic, since the getter may execute arbitrary code.

We propose that the low-level object operations be atomic. Some operations, like getting or setting a property's value, might be implemented using hardware primitives that allow reordering. We propose allowing get/set accesses to be reordered around each other, subject to the same memory model that get/set accesses to SharedArrayBuffers already have. While our strawman does allow for races and some memory-model strangeness, it does not allow JavaScript's object model invariants to be invalidated. For any heap created by a concurrent JS program, it should be possible to write a sequential JS program that creates an indistinguishable heap.

Finally, we expect memory management of the concurrent JS heap to work as it does in other garbage-collected multithreaded languages. In short, garbage collection must appear to happen atomically, and only at well-defined safepoints, such as loop back edges, allocation sites, and calls to native code that doesn't touch the heap. This requirement is satisfied by popular garbage collection algorithms like the ones in HotSpot or MMTk. It's also satisfied by classic algorithms like mark-sweep and semi-space, and even by collectors with no global stop-the-world phase, including ones so lock-free that they don't even stop threads to scan their stacks. WebKit's Riptide GC already mostly supports multiple threads, since our JIT threads can access the heap.

Interaction With The DOM

Extending concurrency to all of JavaScript will be hard; extending it to all of the DOM will be even harder. We extend our reasoning about threads to just enough of the DOM to make this strawman useful. We propose that by default, DOM objects throw ConcurrentAccessError in response to any proxyable operation from any thread other than the main one, just as if Thread.restrict had been called on them from the main thread. However, a few objects have to allow concurrent accesses just to make the language behave sanely. Something as obvious as:
new Thread(function() { console.log("Hello, threads!"); });

requires concurrent access to DOM objects. In WebKit, the DOM global object is in charge of storing a window's variable state. Ordinary JS properties of the global object (including console, Object, and so on) have to be accessible to concurrent threads, because those properties are scripts' global variables. Accesses from other threads to exotic properties of the global object, or to properties of its prototype, could throw. Attempts to add new properties to the global object, or to delete properties from it, from non-main threads could throw as well. These restrictions mean that in WebKit, handling the DOM global object's limited kind of concurrent property accesses is not much harder than handling the same kinds of accesses on ordinary JS objects.

Beyond that, namespace objects like console can be accessed by concurrent threads, because they can simply use the JS object model. There is no reason to restrict them to the main thread, and, more importantly, console is useful for debugging.

Strawman Summary

This section has proposed a strawman API for multithreaded programming in JavaScript. This API is powerful enough to implement custom synchronization primitives. It's also powerful enough to allow racy programs to execute, but it does not allow races to break the structural integrity of the language. Threads are general enough that many other programming models can be implemented on top of them.

Implementing Concurrent JS in WebKit

This section shows that we could implement our threads-based strawman without having to disable any of JavaScriptCore's fundamental performance optimizations. Our scheme aims to achieve close to zero overhead even for programs that read and write the same objects from multiple threads.

Languages that already support threads, like Java and .NET, have a lot in common with JavaScript. Here are some of the things those languages have in common with JavaScript:

  • Like JavaScript, those languages use tracing-based garbage collectors. Our GC mostly supports multiple threads already, since JIT threads are allowed to read the heap. The biggest missing piece is thread-local allocation, which enables concurrent allocation. For our GC this just means a separate free list per thread for each allocator. Our GC has a fixed number of allocators, and we already have fast thread-local storage, so this will be a mechanical change.
  • Like JavaScript, those languages are implemented using multiple tiers of JITs and possibly an interpreter. Our WebAssembly VM already supports multithreaded tier-up from BBQ (build bytecode quickly) to OMG (optimized machine-code generation). We aren't worried about doing the same for JavaScript.
  • Like implementations of JavaScript, implementations of those languages use inline caches to accelerate dynamic operations. We will have to make some changes to our inline caches to support concurrency, which we describe later in this post. This is not the hardest part of adding concurrency, since it's not new territory: it has been done before.

These similarities suggest that many of the techniques used to make concurrency work in virtual machines for languages like Java can be reused for concurrent JavaScript.

Where things get hard is JavaScript's ability to dynamically reconfigure objects. Objects in Java and .NET have a fixed size (once allocated, an object's size never changes), while JavaScript objects tend to be variable-size. Concurrency in those statically-typed languages relies on the fact that concurrent accesses to fixed-size objects are atomic by default, up to the machine's pointer width (so 64-bit systems access 64-bit properties atomically by default). Pointer values in Java and .NET are the memory addresses holding the object's data, and reading or writing any field only requires some address arithmetic (like adding an offset) plus a single memory access instruction. Even if the memory model permits surprising reorderings, racing accesses to the same field (or to different fields of the same object) never corrupt the whole object or cause a crash. On the other hand, the variable size of JavaScript objects means that object accesses require multiple memory accesses in some cases. A sequence of operations involving multiple memory accesses is not atomic by default. In the worst case, the VM could crash because internal object state gets corrupted. Even if that doesn't happen, races could cause writes to be lost or to appear out of order (writing A and then B to a field could result in reads of that field first seeing A, then B, and then A again).

In our strawman proposal, concurrent JavaScript implies that fundamental operations such as adding a new property to an object, changing the value of a property, reading a property, and removing a property all proceed atomically. No race should ever lead to a VM crash, lost writes, or the value of a property experiencing time travel.

We propose an algorithm that allows most JavaScript object accesses to be wait-free and require minimal overhead compared to our existing serial JS implementation. Wait-free operations execute without ever blocking and complete in a bounded number of steps regardless of contention. This algorithm borrows ideas from real-time garbage collection, locking algorithms, and type inference. We plan to use a tiered defense against the overheads of concurrency:

  1. We propose using our existing polymorphic-inline-cache-based type inference system to infer relaxed notions of thread-locality of objects (and their types) that we call transition-thread-locality (TTL). TTL objects will use the same object model as they do today and accesses to those objects will enjoy close to zero overhead. TTL does not imply true thread-locality; for example even objects that are read and written by many threads may be inferred TTL.
  2. We propose using a segmented object model for objects that fail TTL inference. This will introduce an extra load instruction, and some arithmetic, to the property access fast path. We fear that by itself, this technique would not be fast enough to meet our performance goals — but since the segmented object model will only be used surgically for those objects (or types of objects) that fail TTL inference, we suspect that the net cost will be small enough to be practical.
  3. Operations on non-TTL objects that don’t benefit from the segmented object model will use locks. During our development of the Riptide concurrent GC, we added an internal lock to each object. This lock occupies just 2 bits and allows the lock/unlock fast path to require just an atomic compare-and-swap (CAS) and a branch.

Before diving into the details, we first set up some expectations for the hardware that this proposed system would run on. Then we describe this proposal in reverse. We consider the cost of just using per-object locking. Then we review the existing JSC object model and describe what kinds of operations happen to already be atomic within this model. Next we introduce segmented butterflies, which are the key to allow dynamically reconfiguring objects. Then we show how our TTL inference will allow us to use our existing object model for hopefully most of the JavaScript heap. This section then considers some loose ends, like how to do inline caching concurrently, how resizable arrays fit into TTL and segmented butterflies, how to handle large thread counts using the local optimizer lock (LOL), and finally how to handle native state that isn’t thread-safe.


Hardware Expectations

This post describes how to convert JavaScriptCore to support concurrent JavaScript. JSC is currently optimized for 64-bit systems, and most of our existing concurrency support (concurrent JIT, concurrent GC) only works on 64-bit systems. This section briefly summarizes what we expect from a 64-bit system to be able to use our scheme.

JSC is optimized for x86-64 and ARM64. This scheme assumes the weaker of the two memory models (ARM64) and assumes that the following atomic instructions are inexpensive enough that it’s practical to use them as an optimization over locking:

  • 64-bit loads and stores are atomic by default. We expect that these accesses may be reordered around other accesses. But we also expect that if memory access instruction B has a dataflow dependency on memory access instruction A, then A will always come before B. For example, in a dependent load chain like a->f->g, a->f will always execute before _->g.
  • 64-bit CAS (compare-and-swap). JSC uses 64-bit words for JavaScript properties and two important meta-data fields in the object header: the type header and the butterfly pointer. We want to be able to atomically CAS any JS property and either of the meta-data properties in the object header.
  • 128-bit DCAS (double-word compare-and-swap). Sometimes, we will want to CAS all of the meta-data inside JS objects, which means CASing both the 64-bit type header and the adjacent 64-bit butterfly pointer.

Per-Object Locking

Each object in JSC already has a lock, which we use to synchronize certain fundamental JavaScript operations with the garbage collector. We can use this lock to protect any operations that need to be synchronized between JavaScript threads. The rest of this post deals with optimizations that allow us to avoid this lock. But let's consider exactly what the cost of the lock is. Our internal object lock algorithm requires an atomic compare-and-swap (CAS) and a branch for locking, and another CAS and branch for unlocking. In the best case, a CAS costs something like at least 10 cycles; this is the amortized cost assuming the CAS succeeds. On some hardware, the cost is much higher. Assuming a branch is one cycle, this means at least 22 extra cycles for each object operation that requires the lock. Some rare operations are already expensive enough that 22 cycles is not bad, but many fast path operations would not be able to handle such overhead. Below we consider some operations and how much they would be affected if they had to use locking.

  • Adding a new property currently requires only one load, a branch, and two stores on the optimized fast path. Each of those is a single-cycle operation, leading to about 4 cycles in the fastest case. Adding locking bumps this to 26 cycles, a 7x slowdown. In some cases, we can optimize adding a new property down to just one store. In those cases, adding locking creates a 23x regression.
  • Changing the value of an existing property requires a load, a branch, and a store. That's 3 cycles in the fastest case. Adding two CASes and two branches bumps it to 25 cycles, an 8x slowdown. In some cases, we can optimize a property store down to a single store instruction. Putting locking around it then amounts to a 23x regression.
  • Loading an existing property's value requires a load, a branch, and another load. Adding locking is an 8x slowdown, or 23x in the cases where we would have optimized the access down to a single load instruction.
  • Deleting a property is already slow. It's fine for it to use locks. Deleting properties is relatively rare, but we have to support it.
  • Dictionary lookups, JSC-speak for property accesses that we execute dynamically without any inline cache or compiler smarts, would see only a small overhead from locking. Those code paths are already expensive enough that adding locking to them is unlikely to be a problem.
  • Snapshotting the set of properties does not change. It's already possible for concurrent threads to snapshot any object's property set in JSC, because of the locking we implemented for concurrent garbage collection.
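For concreteness, here is a minimal sketch of the kind of 2-bit lock being costed above, using std::atomic (the layout and names are ours; JSC's real lock lives in two spare header bits and parks threads rather than spinning):

#include <atomic>
#include <cstdint>
#include <thread>

// Sketch of a lock that fits in 2 bits of a header byte: one bit for
// "locked", one for "there may be waiters". The uncontended lock and unlock
// fast paths are each one CAS plus one branch, which is where the
// per-operation cycle estimates above come from.
class ObjectLockSketch {
public:
    static constexpr uint8_t isLockedBit = 1;
    static constexpr uint8_t hasWaitersBit = 2;

    void lock()
    {
        uint8_t expected = 0;
        if (m_bits.compare_exchange_strong(expected, isLockedBit))
            return; // Fast path: one CAS, one branch.
        for (;;) {
            uint8_t bits = m_bits.load();
            if (!(bits & isLockedBit)
                && m_bits.compare_exchange_weak(bits, bits | isLockedBit))
                return;
            m_bits.compare_exchange_weak(bits, bits | hasWaitersBit);
            std::this_thread::yield(); // A real VM would park the thread.
        }
    }

    void unlock()
    {
        uint8_t expected = isLockedBit;
        if (m_bits.compare_exchange_strong(expected, 0))
            return; // Fast path: nobody waiting.
        m_bits.store(0); // Slow path: clear both bits; waiters retry.
    }

private:
    std::atomic<uint8_t> m_bits { 0 };
};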

While some operations like deletion are already expensive enough that locking is not a problem, we don’t believe it’s practical to add such extreme costs to the fast cases of JavaScript object access. The resulting programming language would be too slow.

Designing a fast concurrent JS implementation requires introducing new algorithms for property accesses that can run concurrently to each other without requiring any locking except in rare cases. In the sections that follow, we describe such an algorithm. First we review JavaScriptCore’s object model. Then we show cases where JavaScriptCore’s object model already behaves as if it had fixed-size objects; those cases never require locking. Next we show a technique called segmented butterflies, which allows for mostly wait-free concurrent object access. By itself, this technique is still too expensive for our tastes. So, we show how to infer which object types are transition-thread-local (TTL) to avoid synchronization when transitioning objects. This allows us to use flat butterflies (our existing object model) for the majority of objects. Then we describe how to handle deletion, dictionary lookups, inline caches, arrays, and thread-unsafe objects.


JavaScriptCore Object Model

The JavaScriptCore Object Model. JavaScript values and object pointers are implemented as pointers to the cell. The cell never moves and never changes size, but contains a pointer to the butterfly, which can grow either left (for named properties) or right (for array elements).

JavaScriptCore’s object model allows for four kinds of state, each of which is optional:

  • Native state represented using ordinary C++ objects.
  • Data for named object properties (like o.f) that we statically guessed the object would have.
  • Data for named object properties that were added dynamically, forcing us to change the size of the object.
  • Data for indexed object properties. These can change the size of the object in many ways.

The first two kinds of state do not involve resizing the object. The last two kinds of state require resizing. In JSC, fixed-sized state is stored directly in the object's cell. The cell is what a pointer to an object points to. Within the cell, there is a butterfly pointer that can be used to store dynamically-allocated and resizable state in a butterfly. Butterflies store named properties to the left of where the butterfly pointer points inside out-of-line slots, and indexed properties to the right as array elements. Each of those locations may store a tagged JavaScript value, which may be a number, pointer to another cell (representing a string, symbol, or object), or a special value (true, false, null, or undefined).
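As a rough C++ sketch of this layout (the type and field names here are ours and deliberately simplified; JSC's real definitions differ):

#include <cstddef>
#include <cstdint>

using EncodedJSValue = uint64_t; // Tagged: number, cell pointer, or true/false/null/undefined.

// Sketch of the layout described above. The cell is fixed-size and never
// moves; inline (statically-guessed) named properties follow the header.
struct CellSketch {
    uint32_t structureID;      // 32-bit index into the structure table.
    uint8_t indexingByte;      // Spare bits here hold the 2-bit per-object lock.
    uint8_t otherHeaderBytes[3];
    EncodedJSValue* butterfly; // Null if the object has no butterfly.
};

// Out-of-line named properties live to the LEFT of where the butterfly
// pointer points; array elements live to the RIGHT (offsets simplified).
inline EncodedJSValue loadOutOfLineProperty(EncodedJSValue* butterfly, unsigned index)
{
    return butterfly[-1 - static_cast<std::ptrdiff_t>(index)];
}

inline EncodedJSValue loadArrayElement(EncodedJSValue* butterfly, unsigned index)
{
    return butterfly[index]; // The array header sits just left of element 0.
}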


Every object has a structure, which contains the hashtable used for mapping property names to slots in the object. Objects typically share a structure with other objects that look like it (that have the same properties in the same order, among other requirements). The structure is referenced using a 32-bit index into a structure table. We do this to save space. The indexing and cell state bytes have some spare bits. We used two spare bits in the indexing byte to support per-object locking. We also have spare bits in the butterfly pointer, since pointers don't need all 64 bits on a 64-bit system.


Butterflies are optional. Many objects don’t have a butterfly. When a butterfly is allocated, each of the two sides are optional. The left side of the butterfly holds the out-of-line slots. It’s possible just to allocate that; in this case the butterfly pointer will point 8 bytes to the right of the end of the butterfly’s memory. The right side of the butterfly holds the array elements and the array header. The public length is the length as it is reported by array.length, while the vector length is the number of array element slots that were allocated.


Freebies

Within this object model, accesses to anything in the cell itself are atomic by default, just like accesses to objects in Java are atomic by default. This is a big deal, since our optimizer is generally successful at putting the most important object fields inline in the cell. Concurrent JS programs that mostly rely on data that ends up inside the cell will experience almost no overhead versus their serial equivalents.

Additionally, if we know that the butterfly won’t be reallocated ever again (i.e. the butterfly pointer is immutable), then we can access it directly without any problems. Those accesses will be atomic by default.


Alternatively, if we know that only the current thread will ever resize the butterfly and all other threads will only read it, then accesses to the butterfly will be atomic by default.

Problems happen when one thread attempts to transition the object (i.e. add a property and/or reconfigure the butterfly) while some other thread is writing to it (or also transitioning it). If we used our current implementation of object accesses in a concurrent setting, a transition on one thread could cause races that lead to behavior that does not conform to our proposed specification.

  • Writes made by another thread could disappear or experience time travel. This race happens because the transitioning thread might first complete a copy of the butterfly and then transition the object, while another thread does some more writes to the object in between those two steps. If a third thread is observing what is going on by repeatedly reading the state of the fields being written, then it will first see the state before any writes, then it will see the writes, and then once the transition completes it will see the original state again.
  • Two transitions performed concurrently could cause the butterfly pointer and the object’s type to be mismatched. The butterfly’s format is determined by fields in the object’s header that are not inside the same 64-bit word as the butterfly pointer. We cannot allow the butterfly and the header to get out of sync, as this would lead to memory corruption.
  • Two transitions performed concurrently can cause some minor heap corruption even if the butterfly is not involved. If a transition involving an inline property proceeds as two steps — first, change the object’s type, and then, store the new property — then two racing transitions adding two different properties may result in one property being added, where its value ends up being the intended value of the other property. For example, a race between o.f = 1 and o.g = 2 could result in o.f == 2 and "g" in o == false.

In the next section we show how to create an object model that has no such races, but comes with some cost. After that, we show how to use TTL inference to use our existing object model most of the time even in programs that share objects between threads.

Segmented Butterflies

Transitioning an object means allocating a new butterfly, and copying the contents of the old butterfly into the new one while other threads may be reading or writing the old butterfly. We want the copy to seem as if it had happened in one atomic step. Lots of research has gone into supporting concurrent copying of objects in the context of real-time garbage collectors. The approach that we propose to use is based on the Schism real-time garbage collector's arraylet object model. This section reviews the Schism arraylet object model and then shows how it can be used for butterfly transitions.


Schism was trying to solve the problem of implementing a copying garbage collector whose copying phase was completely concurrent to the application. It was based on the observation that copying an object concurrently to other threads accessing the object is easy if the object is immutable. It’s only hard if the object can be written to by the other threads. Schism solves the problem of copying mutable objects concurrently by boxing the mutable state in small fixed-size fragments (32 bytes in the paper). The object that gets copied is the spine that serves as an index for finding fragments. All object accesses use an extra indirection to find the fragment that contains the data being accessed. Spines are immutable objects because the fragments they point to never move.


WebKit already uses something like arraylets as well, in the WTF::SegmentedVector class template. We use it to prevent having to move C++ objects when the vector resizes. We also use it for implementing the JSGlobalObject's variable storage (for var statements in global scope). The term arraylets is due to Bacon, Cheng, and Rajan, who used them to control fragmentation in the Metronome garbage collector. Lots of arraylet research shows that the extra indirection has a high cost (often 10% or more). That is, if you double the cost of each array access by adding an extra indirection, then you will increase the total run time of the program by 10%.


Segmented Butterflies. A segmented butterfly comprises a spine (which resizes but contains only immutable state) and zero or more fragments (which don’t resize but are mutable). Splitting resizable state from mutable state to enable concurrent layout changes is a proven technique in concurrent garbage collection. The shape of the fragments matches the shape of an unsegmented butterfly, which we will use for some additional optimizations.

We can use this trick for butterflies. A segmented butterfly has a spine that contains pointers to fragments that contain mutable data. If we need to add new properties and those properties won’t fit in an existing fragment, we can grow the spine and allocate more fragments. Growing the spine implies reallocating it. We can safely allocate a new spine and memcpy the contents of the old one into it, since the spine’s contents never changes. In this world, no writes ever get lost or experience time travel when a butterfly is being transitioned. An access that races with the transition either gets to a fragment using the old spine, or the new spine; since both spines will contain the same fragment addresses for properties that existed before the transition, every possible interleaving results in threads reading and writing the same fragments.


The most natural way to implement segmented butterflies is probably to have 32-byte fragments like in Schism, since this is also a sweet-spot for our GC. The vector length is inside the butterfly spine, since this is an immutable property of that spine. The vector length allows the spine to self-identify how big its right side is. The public length is mutable, so we want to put it into one of the fragments. Note that the first fragment also contains the old vector length. This property is unused when we are using segmented butterflies. It allows us to convert a flat butterfly into a segmented butterfly by having the spine point at slices of the flat butterfly; we discuss this optimization more in a later section.
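Here is a sketch of the shapes involved, under the 32-byte-fragment assumption (all names are ours, for illustration):

#include <cstdint>

using EncodedJSValue = uint64_t;

// A fragment is fixed-size and mutable. Its shape matches a 32-byte slice of
// a flat butterfly, which is what later lets a spine point directly at
// slices of an existing flat butterfly.
struct Fragment {
    EncodedJSValue slots[4]; // 4 x 8 bytes = 32 bytes, as in Schism.
};

// The spine resizes but holds only immutable state: its vector length and
// fragment pointers never change once allocated, so growing the spine means
// allocating a new one and memcpy'ing the old contents into it.
struct Spine {
    uint32_t vectorLength;  // Self-identifies how big the right side is.
    Fragment* fragments[1]; // Variable-length in practice; fixed here for brevity.
};

// The price of segmentation: one extra dependent load on every access.
inline EncodedJSValue loadSegmented(const Spine* spine, unsigned index)
{
    Fragment* fragment = spine->fragments[index / 4]; // Extra indirection.
    return fragment->slots[index % 4];
}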


This still leaves the problem of concurrent transitions. Transitions have to reallocate the spine, change some bits in the type header, and store a property's value. Reallocating the spine is optional, since usually some fragment will already have a spare slot. Changing the type is also optional, for example in the case of array resizing. The type header and butterfly pointer can be set atomically using a DCAS (double-word compare-and-swap; 64-bit systems typically support 128-bit CAS). But this isn't enough, because no matter whether we set the property's value before or after the DCAS, we have a problem. The property's slot can be anywhere in memory, so there is no way to use CAS to perform the whole transition in one step.

If we set the newly-added value slot before changing everything else, we risk a race in which one thread tries to use some object slot for adding field f while another thread tries to use the same slot for field g. If we’re not careful, the second thread might win the race for the type (so everyone thinks we added property g) while the first thread wins the race for the property (so we get thread 1’s intended value for property f appearing as if it was stored by thread 2 into g).

If we set the newly-added value slot after changing everything else, then all loads have to contend with the possibility that slots can have “holes”. In other words, at any time, an object may claim a type that includes a property f even though the object does not have a value for f yet. We could change the language to allow this: putting a new property into an object would first involve defining it to undefined and then storing the real value.


We choose to put a lock around transitions. This lock isn’t for guarding races between transitions and other accesses, since those are atomic by default thanks to segmented butterflies — it’s for protecting transitions from each other. To ensure that objects don’t have holes, transitions store the new property’s value before changing the type. Transitions can use this algorithm:

  1. Allocate whatever memory needs to be allocated.
  2. Acquire the lock.
  3. Determine if we allocated the right amount of memory; if not, release the lock and go back to step 1.
  4. Store the new property’s value.
  5. Change the type and butterfly.
  6. Release the lock.
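Here is a deliberately simplified, self-contained sketch of that algorithm, with the type header reduced to a property count and the butterfly reduced to a flat slot array (all names are ours):

#include <atomic>
#include <cstdint>
#include <cstring>
#include <mutex>

using EncodedJSValue = uint64_t;

struct ObjectSketch {
    std::mutex internalLock;                   // Stands in for the 2-bit object lock.
    std::atomic<unsigned> propertyCount { 0 }; // Stands in for the type header.
    EncodedJSValue* butterfly { nullptr };

    void transitionAdd(EncodedJSValue value)
    {
        for (;;) {
            // Step 1: allocate outside the lock, guessing the needed size.
            unsigned guessedCount = propertyCount.load();
            EncodedJSValue* newButterfly = new EncodedJSValue[guessedCount + 1];

            std::unique_lock<std::mutex> locker(internalLock); // Step 2.
            if (propertyCount.load() != guessedCount) {
                // Step 3: a racing transition resized the object, so our
                // allocation is the wrong size. Release the lock and retry.
                locker.unlock();
                delete[] newButterfly;
                continue;
            }
            if (guessedCount)
                std::memcpy(newButterfly, butterfly, guessedCount * sizeof(EncodedJSValue));
            newButterfly[guessedCount] = value;    // Step 4: store the value first,
            butterfly = newButterfly;              // Step 5: then butterfly and type
            propertyCount.store(guessedCount + 1); //         (one atomic step in JSC).
            return;                                // Step 6: lock released by RAII.
            // (The old butterfly is leaked here; a real engine's GC reclaims it.)
        }
    }
};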

This results in the following cost model:

  • Transitions are about 7x costlier, since they now require locking.
  • Reading or writing existing property slots requires an extra load.
  • Deleting and dictionary lookups still work (we show details in a later section).

Recall that the arraylet research shows 10% or higher costs if arrays use a segmented object model. We’re proposing to use it for ordinary properties as well, if those properties were added in a sufficiently subtle way that we weren’t able to inline them into the cell. It’s probably safe to assume that if we implemented literally this, we incur at least a 10% slow-down. We would like to have close to zero overhead. The next section shows how we think we can get there.


Transition Thread Locality and Flat Butterflies

Segmented butterflies are only necessary when:

  • Some thread other than the thread that allocated an object tries to transition the object.
  • The thread that allocated the object tries to transition it, and other threads may be writing to it.

We can use flat butterflies — our existing object model — when we know that these things haven’t happened yet. This section describes a hybrid object model that uses either flat or segmented butterflies, depending on whether or not we have detected a possible write-transition race. This object model also allows us to avoid locking while doing many transitions.


On a 64-bit system, butterfly pointers have 48 bits of pointer information and have zeroes in their high 16 bits. 16 bits is enough to store:

  • The ID of the thread that allocated the object. We call this the butterfly TID.
  • A bit to indicate if any thread other than the allocating thread has attempted to write to the butterfly. We call this the butterfly shared-write (SW) bit.

Some systems have more than 48 pointer bits. In a later section, we show how to make this scheme useful even if we had to use fewer bits. Also, we could restrict butterfly allocations to the lower 2^48 addresses if we really wanted 16 spare bits.


Let’s say that these bits are encoded as follows:

static const uint16_t mainThreadTID = 0; // Occasionally, we can optimize a bit more for the main thread.
static const uint16_t notTTLTID = 0x7fff; // For when the thread is no longer transition-thread-local.

static inline uint64_t encodeButterflyHeader(uint16_t tid, bool sharedWrite)
{
    ASSERT(tid <= notTTLTID); // Only support 2^15 tids.
    return (static_cast<uint64_t>(tid) << 48)
         | (static_cast<uint64_t>(sharedWrite) << 63);
}

static inline uint64_t encodeButterfly(Butterfly* butterfly, uint16_t tid, bool sharedWrite)
{
    return static_cast<uint64_t>(bitwise_cast<uintptr_t>(butterfly))
         | encodeButterflyHeader(tid, sharedWrite);
}
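Decoding reverses this (a sketch to complement the snippet above; the helper names are ours, and a JIT would bake the constants into generated code rather than call helpers):

static inline Butterfly* decodeButterfly(uint64_t encodedButterfly)
{
    // Mask off the top 16 bits (TID and SW) to recover the 48-bit pointer.
    return bitwise_cast<Butterfly*>(static_cast<uintptr_t>(
        encodedButterfly & ((static_cast<uint64_t>(1) << 48) - 1)));
}

static inline uint16_t decodeButterflyTID(uint64_t encodedButterfly)
{
    return static_cast<uint16_t>(encodedButterfly >> 48) & 0x7fff;
}

static inline bool decodeButterflySharedWrite(uint64_t encodedButterfly)
{
    return encodedButterfly >> 63;
}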

We can use flat butterflies — our existing object model — whenever the TID matches the current thread. We set the SW bit and use a magic TID value (all bits set) to indicate that the butterfly is segmented. This section shows how to use these bits to allow the use of flat butterflies anytime we know that no transition race can occur. We call this transition thread locality inference.

Anytime a thread attempts to write to a butterfly and the TID matches the current thread, it can simply write to the butterfly without doing anything special.

Anytime a thread attempts to write to a butterfly and the TID does not match but the SW bit is set, it can simply write to the butterfly as well.


Anytime a thread attempts to read a butterfly, it just needs to check the TID to determine if the read should be segmented (extra indirection) or not.

Anytime a read or write encounters TID = notTTLTID and SW = true, it knows to use the segmented object model.

The following cases need special handling:

  • A thread other than the one identified by the TID attempts to write to the butterfly and the SW bit is not yet set. In this case, it needs to set the SW bit. This does not mean that the butterfly needs to become segmented. It only means that the SW bit must be set, so that any future attempts to transition the butterfly from the owning thread triggers a transition to the segmented butterfly object model.
  • A thread other than the one identified by the TID attempts to transition the butterfly. In this case, it needs to set the SW bit and all TID bits and perform a segmented butterfly transition. We describe this process in detail below.
  • A thread attempts to transition a butterfly that has the SW bit set. This also necessitates a segmented butterfly transition.
  • Even transitions with TID = current and SW = false need locking, to make sure that those assumptions aren’t violated during the transition.
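Putting these cases together, the write path might be shaped roughly like this (a sketch reusing the decode helpers above; currentTID, storeFlat, storeSegmented, and setSWBitWithDCAS are hypothetical stand-ins, declared but not implemented here):

struct Butterfly;
uint16_t currentTID();                                            // This thread's butterfly TID.
void storeFlat(Butterfly*, unsigned offset, uint64_t value);      // Direct slot store.
void storeSegmented(Butterfly*, unsigned offset, uint64_t value); // Through the spine.
void setSWBitWithDCAS(uint64_t& encodedButterfly);                // Also asserts the type header is unchanged.

void writeSlot(uint64_t& encodedButterfly, unsigned offset, uint64_t value)
{
    uint16_t tid = decodeButterflyTID(encodedButterfly);
    if (tid == currentTID()) {
        // Flat butterfly owned by this thread (SW true or false): plain store.
        storeFlat(decodeButterfly(encodedButterfly), offset, value);
        return;
    }
    if (tid == notTTLTID) {
        // Segmented object model: one extra indirection through the spine.
        storeSegmented(decodeButterfly(encodedButterfly), offset, value);
        return;
    }
    if (!decodeButterflySharedWrite(encodedButterfly)) {
        // First write from a non-owning thread: set the SW bit with a DCAS so
        // future transitions by the owning thread go segmented.
        setSWBitWithDCAS(encodedButterfly);
    }
    storeFlat(decodeButterfly(encodedButterfly), offset, value); // Still flat.
}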

The sections that follow show further refinements that reduce overhead even more. First we consider how to avoid checking TID and SW bits on every operation. Next we show how to use structure watchpointing to avoid locking on the most common transitions. Then we show how to handle array resizing. Finally we explain in more detail how flat butterflies can be transitioned to segmented ones.

Quickly Checking TID And SW Bits

This section will show how to make concurrent JS as fast as serial JS in many of the most important cases by simply extending optimizations that JavaScriptCore already uses.


JSC optimizes the code for heap accesses using inline caching. We use inline caches for accesses to named properties (like o.f) and for array accesses (like a[i]).

Objects that have the same properties at the same offsets will usually share the same structure, which is identified by 32 bits in the object’s type header. If a property access tends to see the same structure, then we will generate new machine code for this property access, which checks that the object has the expected structure and then accesses the property directly. Failed speculation causes recompilation of the inline cache. If the inline cache remains stable (doesn’t recompile much) for a long time and the function that contains it becomes eligible for optimized JIT compilation, then the optimizing JIT compiler might express the inline cache’s code directly in its IR, which has two outcomes: the structure check branches to OSR exit if it fails (causing abrupt termination of the optimized code’s execution), and all of the code for the inline cache (the structure check, the memory access, and any other steps) become eligible for low-level optimization by our DFG and B3 JIT compiler pipelines. Our compilers are good at making type-stable JavaScript property accesses perform great.


Array accesses use similar techniques to detect if an object has array elements, and if so, how they are formatted. We support multiple kinds of storage for array elements depending on how they are being used. Inline caches detect what kind of array each array access is accessing, and we then emit code that speculates for that kind of array access.

Another technique we already use is virtual memory. For example, our WebAssembly implementation uses virtual memory tricks to check the bounds of linear memory accesses for free. We can catch page faults using POSIX signal handlers or Mach exceptions. The handler will know the exact machine state at the point of the fault, and has the power to transfer execution to any state it likes. WebAssembly uses this to throw an exception, but we can use it to transfer control flow to slow paths that handle the generic case of a memory access. Essentially, this means that we can save cycles on safety checks if we can express the condition being checked as something that causes the virtual memory system to issue a page fault. Concurrent JS will require a combination of inline caching and virtual memory tricks to make TID and SW checking cheap.


Inline caching means emitting different code for each property access, and then recompiling each property access potentially many times as we learn new information about what this property access may do. Inline caches are able to get incredibly precise information about the behavior of each property access because we always emit a fast path access that can only handle exactly those cases that we have seen so far. We learn new information by recording everything we know about those accesses that failed the fast path. The complete log of failed accesses is then LUBed (least-upper-bounded) together to create a minimal set of AccessCases. We can implement optimizations for new kinds of property accesses — such as the ones that have to check TID and SW bits — by considering in what ways a particular access site might be special and so specially optimizable. Below is a list of conditions we may encounter along with the strategy for how inline caches can test this condition and handle it.

  • Probably many object accesses will always see TID = current and SW = false. These accesses can simply subtract encodeButterflyHeader(currentTID, 0) from the butterfly before accessing it. If the speculation is wrong, the virtual memory subsystem will issue a page fault because of the non-zero high bits. We can catch this as a Mach exception or POSIX signal and divert execution to a slow path. The subtraction of this constant can often be encoded directly into the subsequent butterfly access instruction, so these kinds of accesses will experience no overhead in concurrent JS relative to what we have today. Note that this optimization does not apply to transitions, which must install a new butterfly while atomically asserting that nothing bad has happened; we consider those further below. Also note that this optimization slightly favors the main thread (and so all single-threaded programs): if currentTID = 0, then we don't need to add anything.
  • Probably many object accesses will always see TID = current and SW = true. These can be handled exactly like the previous case. This state is for objects that we already know can be written to from many threads, but where that has only happened since the last time the current thread transitioned the object.
  • Probably many object reads will see TID != notTTLTID and SW = anything. These just need to check that the high bits of the butterfly are not notTTLTID. This can be done with a single compare/branch. Other than that, they can proceed exactly as they do in our current engine.
  • Probably many object writes will see TID != notTTLTID and SW = true. This means the object is written to by many threads. We are writing to it too, but we don't need to set the SW bit because it is already set. This check can also be done in a single compare/branch. Together with the read optimization above, this means that concurrent JS programs that share objects will typically see only one extra cycle (for a fused compare/branch) on butterfly accesses.
  • Sometimes, an object write will see TID != notTTLTID and SW = false. This means having to set the SW bit using a DCAS, which sets the SW bit while asserting that the type header has not changed. These accesses are more expensive than their friends, but this only happens the first time any thread writes to a shared object. Our inline cache infrastructure will probably make it so that writes that sometimes see SW = false and sometimes see SW = true (and don't necessarily see TID = current, but never see notTTLTID) are branchy: they will have a fast path for SW = true and a slightly slower, but still inline, path for SW = false. In the next section, we will describe optimizations that require invoking a function on the structure the first time the SW bit of any object of that structure gets set. Since this inline cache knows the structure at the time of generation, it can ensure that the structure has already been told that objects of its type may have the SW bit set.
  • Probably some object accesses will see TID = notTTLTID and SW = true. By convention, we will say that it's impossible to have TID = notTTLTID and SW = false (internally in our VM it is technically possible to transition an object without "writing" to it, but we pretend those are writes anyway). These accesses can subtract encodeButterflyHeader(notTTLTID, true) from the butterfly before accessing it, and they must then perform an extra load when reading from the butterfly. The extra load is necessary because of the segmented object model: the butterfly pointer points at a spine, which we use to find the fragment containing the value we're interested in. This is an extra cycle or so of overhead, since it introduces a load-load dependency.
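The first two cases above rely on the subtraction trick; as a sketch (speculativeLoad is our name, not JSC's):

#include <cstdint>

using EncodedJSValue = uint64_t;

// Subtract the header we expect to see. If the guess is right, the result is
// a clean 48-bit pointer; if it is wrong, the high bits are nonzero and the
// access faults, which we catch (Mach exception or POSIX signal) and divert
// to a slow path.
inline EncodedJSValue speculativeLoad(uint64_t encodedButterfly,
                                      uint64_t expectedHeader, unsigned offset)
{
    uint64_t raw = encodedButterfly - expectedHeader; // Usually folded into the
                                                      // load's addressing mode.
    return reinterpret_cast<EncodedJSValue*>(raw)[offset];
}

On the main thread, expectedHeader is zero (mainThreadTID = 0, SW = false), so no arithmetic is needed at all.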

These optimizations leave an unresolved issue: transitions. Transitions now require acquiring and releasing a lock. We have to do this anytime we add any property. In the next section, we show how to solve this problem using watchpoints. Then we describe how to implement transitions that need to create a segmented butterfly out of a flat one.

Watchpoint Optimizations

Transitions are difficult to optimize. Every transition, including those that see TID = current, needs to acquire the object's internal lock to ensure that it sets the butterfly, adjusts the object's type header, and stores the new value of the property all in one atomic step. Luckily, we can dramatically improve the performance of transitions by using our engine's structure watchpointing infrastructure.


Each object has a structure. Many objects will share the same structure. Most inline cache optimizations begin with a check to see if the object’s structure matches the inline cache’s expectations. This means that when an inline cache is compiled, it has a pointer to the Structure object in hand. Each structure can have any number of watchpoint sets in it. A watchpoint set is simply a boolean field (starts valid, becomes invalid) and a set of watchpoints. When the set is fired (the field goes from valid to invalid), all watchpoints are invoked. We can add two watchpoint sets to Structure:

  • transitionThreadLocal. This remains valid so long as all objects that have this structure have TID != notTTLTID.
  • writeThreadLocal. This remains valid so long as all objects that have this structure have SW = false.

This enables the following optimizations on top of the inline cache optimizations described above:

  • Transitions with TID = current and SW = false can proceed as they do in our existing engine so long as the structure's transitionThreadLocal and writeThreadLocal watchpoint sets are both still valid. This means not doing any extra locking or even CAS when transitioning the object. This works even if other threads are concurrently reading from the object. This works even when building up an object that eventually gets read and written to from other threads, since in that case the writeThreadLocal watchpoint set only fires on the structure that is the target of the final transition used to build the object. The last transition can proceed without locking because the watchpoint sets of the source of the transition are still valid.
  • Any checks for TID != notTTLTID can be elided if the structure’s transitionThreadLocal watchpoint set is still valid and a watchpoint is installed.
  • Any checks for SW == false can be elided if the structure's writeThreadLocal watchpoint set is still valid and a watchpoint is installed.

The fact that we can elide TID != notTTLTID and SW == false checks means that reads and writes to those objects don’t actually have to check anything; they just need to mask off the high bits of the butterfly.

Most importantly, this means that transitions that happen on the same thread that allocated the object don't need any locking so long as the structures involved have valid transitionThreadLocal and writeThreadLocal watchpoint sets. Structures are dynamically inferred by our engine so as to have a close relationship to the code that creates those objects. Hence, if you write a constructor that adds a bunch of fields to this but does not escape this, then the structures corresponding to the chain of transitions that occurred in the constructor will all have valid transitionThreadLocal and writeThreadLocal watchpoint sets. To make sure that the object continues to have a flat butterfly, you need to either never dynamically add properties to the object, or never write to the object on any thread other than the one that constructed it. Objects that play by these rules will experience almost no concurrency overhead, since property accesses will have at most one extra instruction (a mask) on the fast path.


Watchpoints, and the operations that cause them to fire, will execute under safepoint: if we ever perform some operation that finds that it has to invalidate a watchpoint, it will do the operation while all other threads are stopped as if for a garbage collection. This means that if some optimized code is performing non-atomic transitions involving some structures, and some other thread attempts to write or transition an object that uses any of those structures, then it will not actually perform the write until that optimized code reaches a safepoint and gets invalidated. Most of the time, the watchpoint sets will tell us that they have been invalidated even before the optimizing JIT compilers try to install any watchpoints. We expect few watchpoint invalidations in steady state.


To summarize, if our optimizer is able to guess which object properties you will add to an object at the time of allocation, then the cost model of your object accesses does not change at all, since inline properties get concurrency for free. If you do have out-of-line properties, then they will perform almost exactly as they do now (occasionally, an extra arithmetic instruction will be involved in computing the butterfly access) so long as either the object is only written to by the thread that created it (and read by anyone), or no new properties are added to the object after creation (in which case it can be read and written by anyone). If you break this pattern, then transitions will become about 7x more expensive and all other object accesses will become 2x more expensive. This slow-down will be extremely surgical: only those property accesses that touch segmented butterflies will experience the slow-down. This system is designed to let you dynamically and concurrently add stuff to your objects, and the hope is that if you do so in moderation, you won’t see much of a change in performance.


Arrays

Array element accesses will benefit from TTL similarly to the way that named property accesses do:

  • Accesses to TTL arrays will be about as fast as they are now.
  • Accesses to non-TTL arrays will require one extra indirection.

We handle array transitions a bit specially. Many array transitions in JavaScriptCore already use locking because of their need to avoid races with the garbage collector. Adding locking to the remaining array transitions may cause performance issues, so this section considers an alternative.

Resizing an array in response to an out-of-bounds store or something like array.push is the most common kind of array transition. Fortunately, this transition does not change anything in the type header of the object — it only changes the butterfly pointer. Whether or not the butterfly has array elements is reflected in the structure. Therefore, we can just use CAS on the butterfly pointer, and then have a rule that any changes to butterfly pointers of objects that have array elements requires CAS even if the object lock is already held. Because array structures’ transitionThreadLocal and writeThreadLocal watchpoints are unlikely to be intact (any shared array use will invalidate them, since arrays share structures), we expect that even transitions on TTL arrays will have to use CAS in the common case. This butterfly pointer CAS is sufficient to ensure that a simultaneous attempt to make the array not TTL won’t get confused by our resize operation.
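As a sketch of that rule (names are ours; the encoded butterfly pointer carries the TID/SW bits described earlier):

#include <atomic>
#include <cstdint>

// Resizes publish the new butterfly via CAS on the encoded pointer, even
// while the object lock is held, so a concurrent attempt to mark the array
// not-TTL (which also rewrites those bits) cannot be silently overwritten.
bool tryPublishResizedButterfly(
    std::atomic<uint64_t>& encodedButterfly,
    uint64_t expected,   // The encoded butterfly our copy was based on.
    uint64_t newEncoded) // New allocation, same TID/SW bits.
{
    // On failure, another thread transitioned the object first; the caller
    // re-reads the pointer and retries or takes a slow path.
    return encodedButterfly.compare_exchange_strong(expected, newEncoded);
}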


One CAS for array resizing is probably cheap enough to be practical. The CAS is likely cheap relative to the current costs of resizing, which involves allocation, copying data, and initializing the newly allocated memory.

Objects that have both named properties and array elements will now have to use both locking and CAS on some of the transitions involving named properties. Fortunately, objects that have both array elements and named properties are sufficiently uncommon that we can probably get away with increasing their named property transition cost a bit.


Quickly Transitioning From Flat To Segmented

Turning flat butterflies into segmented ones requires a special kind of transition. Fortunately, this transition is cheap. Butterfly fragments contain only data. They look just like fragments (fixed-size slices) of the payload of a flat butterfly. Therefore, we can transition a flat butterfly to a segmented one by allocating a spine and pointing its fragment pointers at the original flat butterfly.

Any read or write to the flat butterfly while the butterfly is being transitioned to segmented will seem to users of the segmented butterfly to have happened to the fragments. In other words, although some threads may mistakenly think that the butterfly is still flat, their accesses to that butterfly will still be sound even after the butterfly is already segmented.


Making this work right requires making sure that the segmented butterfly’s fragments share the same memory layout as the flat butterfly that they were converted from. For this reason, the first array fragment contains the public length and the old vector length; it will remember the vector length that the flat butterfly used forever, and the real vector length will be in the segmented butterfly’s spine. This ensures that if there is a race between a flat butterfly array access and a flat-to-segmented transition, then the flat butterfly access will correctly know the flat butterfly’s size because its vector length will not change. To line up the object models, we also store out-of-line properties in reverse order in the out-of-line fragments, to match what flat butterflies do.


This works so long as the flat butterfly accesses loaded their butterfly pointer before the transition occurred. If they load it later, then their TID check will fail, probably in the form of a page fault on an access to the butterfly.

Deletion and Dictionary Lookups

JavaScriptCore tries to make objects share structures whenever possible. This relies on two properties of structures:

  • Structures are immutable.
  • When a new structure is needed, we generally hash cons it.

This is a great optimization if it indeed causes objects to reuse structures. But some objects will have a unique structure no matter what the VM tries to do about it. JavaScriptCore tries to detect when this happens, and puts the object in dictionary mode. In dictionary mode, the structure has a 1:1 mapping to the object. Also, the structure becomes mutable. Adding and removing properties means editing the structure and the object in tandem.


Deletion is closely related to dictionary mode, since deletion will immediately put the object in dictionary mode.

It’s already the case that mutations to a dictionary require holding the structure’s lock. This is necessary to support concurrent JIT and concurrent GC.

To support concurrent JS, we only need to make these changes:

  1. Reading from a dictionary requires holding the structure’s lock, in case some other thread is changing the dictionary.
  2. Properties added before the object entered dictionary mode must be treated specially by deletion.

We are not worried about the performance hit of grabbing a lock for all reads of a dictionary. It’s already relatively expensive to read a dictionary.


The existing facilities for safely transitioning objects will naturally support transitioning to dictionary mode. Usually, dictionary transitions don't involve deleting properties. They happen when we detect that the program is adding so many properties to the object that it's probably going to perform better as a dictionary. In that case, the fact that some other thread may be in the middle of accessing this object without holding any locks does not matter. For non-dictionary objects we require that only transitions take the lock. Reads and writes involve separate steps for checking the object's type, loading the butterfly, and then accessing the butterfly. But if none of the properties that were added before the dictionary transition get deleted, then it's fine for some other thread to race when accessing those old properties. We will call this the phenomenon of tardy accesses. Even though those tardy accesses don't do any locking, they are correct for the same reasons that they are correct before the object becomes a dictionary. That problem is up to the butterfly object model, which will be either flat or segmented depending on whether the object is still TTL. Dictionary mode operates orthogonally to TTL inference and butterflies.


But if any of the properties added before the dictionary transition get deleted, then we have to handle this deletion specially. Normally, deletions cause the deleted slots to become reusable. We cannot do this here, because then a tardy read to some deleted property f might end up reading the value of some newly added property g. Or, a tardy write to some deleted property f might end up overwriting the value of some newly added property g. We can prevent either of these bad outcomes by simply not reusing the space freed up by deleted properties, if those properties had been added before the dictionary transition.


This does not lead to unbounded memory usage. When the GC does its safepoint, it already knows that all memory accesses have completed. Therefore, the simplest implementation is to have GC change the state of those deleted properties when it visits the object. Once the GC flags those property slots during a safepoint, future property additions can reuse those slots.

Inline Caches

Once we enable concurrency, recompiling an inline cache becomes harder because you need to be careful when changing code that may be currently executing on another CPU. We plan to have a tiered defense against this problem: we will infer the thread-locality of code to use the same inline caches as we do now as much as possible, we will buffer inline cache changes whenever it’s safe to do so, and finally we will rely on safepoints if an inline cache needs to be reset eagerly.


We expect that in many programs, inline caches will reach steady state before the code ever executes on more than one thread. Hence, we plan to have each block of code track its thread-locality. So long as it has only ever executed on one thread, it will remember this and check for it on entry. So long as this is true, any inline caches within that code can be modified without any extra synchronization. We can track thread-locality on as fine-grain a level as we like; for example it can even be per-basic-block. It’s probably most natural if we use JavaScriptCore’s notion of CodeBlock as the granularity; this roughly corresponds to functions in the source code.


Once the thread-locality of code is broken, we can switch to buffering inline cache changes. Our PolymorphicAccess inline cache infrastructure already buffers changes because that’s most efficient even if we don’t have to do expensive synchronization. For inline caches that may be executing globally, we can globally buffer all changes. For example, the VM can perform once-per-millisecond safepoints to flush changes to all global inline caches.

Sometimes, we need to make an immediate change to an inline cache. When this happens, we will rely on our ability to safepoint the system — i.e. to stop all threads at a point where we can account for each thread’s state. This is not a fast operation when there are many threads. Java virtual machines already have to do something like this for class hierarchy invalidation and biased locking. By design, these eager invalidations are unlikely. We only perform optimizations that can lead to eager invalidations if we have profiling data that suggests that it would be unlikely.


Local Optimizer Lock

So far, this algorithm relies on being able to sneak 16 bits of information into the high bits of the butterfly pointer. Some hardware will not allow us to do this. For example, fewer than 16 high pointer bits may be used for the all-zero check in virtual memory. If the system only allows for 8 checked high bits, then we will only be able to support 127 threads. Concurrent JavaScript would probably still be useful even if it had a hard 127 thread limit, but this would be an unusually low limit to impose at the language level. This section shows how to overcome this limit.


If the object model can only handle thread-locality inference for 127 threads, then we can either choose to have the 127th thread get no thread-locality inference or we can try to map more than one logical thread onto the same thread identifier. Threads mapped to the same TID won't be able to execute concurrently to each other. To do this, we can borrow the idea of the GIL (global interpreter lock) from Python. The CPython implementation only allows one native thread to interpret Python code at a time. It achieves this by having a lock (the so-called GIL) that protects the interpreter. The lock is released and reacquired periodically, which gives the appearance of concurrency. Since we can also release the lock around any potentially-blocking operations, we can even avoid deadlocks that arise from insufficient concurrency. We can apply this technique here: if we are limited to 127 threads, then we can have 127 locks protecting JS execution. So long as there are 127 or fewer threads, this lock will not do anything; anytime we try to release it, we will do nothing because the lock will tell us that nobody is waiting to acquire it. "Releasing and reacquiring" the lock will really be a cheap load and branch to verify that there is no need to release it.
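That "releasing and reacquiring" fast path can be as cheap as described; a sketch (names ours; the slow path is hypothetical and not shown):

#include <atomic>

// Sketch of the periodic release point: with 127 or fewer threads, nobody
// ever waits, so "releasing and reacquiring" is just a load and a branch.
class LocalOptimizerLockSketch {
public:
    void releasePoint()
    {
        if (!m_someoneWaiting.load(std::memory_order_relaxed))
            return; // Fast path: nobody is waiting; keep the lock.
        handOff();  // Slow path: let a waiting thread run, then reacquire.
    }

private:
    void handOff(); // Hypothetical; parks us and wakes a waiter.
    std::atomic<bool> m_someoneWaiting { false };
};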


This lock will be local to a thread pool rather than global to the whole engine, and it will protect all of our optimized paths rather than just protecting the interpreter. Hence the name: local optimizer lock, or LOL for short.

Thread-Unsafe Objects

DOM objects behave like JavaScript objects, but are actually a proxy for complicated logic implemented in C++. That logic is usually not thread-safe. The native code that supports the DOM transitively uses many native APIs that are meant to only be used from the main thread. It would take a lot of work to make the DOM completely thread-safe. Our strawman proposal only requires making the DOM global object capable of handling lookups to self properties, which we need to allow threads to access global JavaScript properties like Object and Thread.


In WebKit, variable resolution and self property resolution on the JSDOMWindow largely follows a pattern of relying on existing JS property lookup mechanisms. It uses exotic behavior when the window has a null frame. We already support installing a watchpoint on the non-nullness of frame. Hence, we can support fast concurrent accesses to properties of JSDOMWindow while honoring the frame special case by having threads use the existing watchpoint set. This implies that some of the JSDOMWindow slow paths will need locking, but that’s probably acceptable since the majority of global object accesses are inline cached.


Native apps that want to dip their toes into concurrent JS may also want to restrict sharing for some of their classes. The thread-affinity check on accesses to objects will need to be implemented in the VM itself to give our compilers the ability to optimize it away. This means that it’s possible to expose the functionality to C/Objective-C API clients.

Summary

We think that we can make it possible to execute JavaScript concurrently in WebKit while providing good enough performance that developers would enjoy using this feature. At the heart of this technique is the combination of segmented butterflies, transition thread locality inference, and lots of cheap per-object locks. So long as programs obey certain behaviors, they should experience minuscule overheads: property accesses will cost about as much as they do today and objects will not require any extra memory. If some objects decide that they don’t want to play by our rules, they will get slightly higher overheads.


Related Work

Segmented butterflies are a straightforward combination of the array object model in the Schism garbage collector and the butterfly object model that we have used for a long time in JavaScriptCore.

Transition-thread-locality inference is inspired by work on biased locking. As in the HotSpot biased locking scheme, we use types to reason about whether we have a guarantee of thread locality. Unlike that scheme, we don’t rely on deoptimization to break a thread-object relationship. Our primary mechanism of informing other threads that an object is no longer transition-thread-local is to have different tag bits in the butterfly pointer.


The combination of segmented butterflies, transition-thread-locality inference, and the ability of both schemes to fall back on per-object locking when things get weird is not something that we have seen before. Most object-oriented systems that allow concurrency do it either by having an object model that naturally avoids expensive-to-resolve races, like in Java, or using some mechanisms to restrict the amount of concurrency in the language.

For example, CPython uses a global interpreter lock (GIL) to ensure that the interpreter never races with itself. This is a powerful technique that captures some of what concurrency is about: once I/O or other kinds of blocking happen, you want your other threads to be able to do some useful work in the meantime. The GIL allows this, so it makes threads useful to a large class of applications. Unfortunately, it doesn’t allow programs to exploit parallelism. The best feature of the GIL is that it is easy to implement. For example, it could be used to implement our proposed thread semantics in any JS engine. The scheme in this post is all about full concurrency optimized for 64-bit platforms.


An effort called the Gilectomy is underway to remove CPython’s GIL and replace it with fine-grained locking and clever atomic algorithms. The Gilectomy is not yet as fast as stock CPython; both single-thread and multi-thread programs run significantly slower in the Gilectomy. The performance problems appear to be related to locking in the allocator (we plan to use thread-local allocation buffers) and in the reference counter (we don’t have reference counting). The Gilectomy does not remove locking from accesses to objects. Like JavaScript’s objects, Python’s objects are dictionaries that can resize dynamically. Most of our scheme is about how to make accesses to those objects fast even when multiple threads are reading and writing to the same object.


PyPy also has a GIL removal effort underway, but they haven’t said much about how they plan to handle synchronizing object accesses aside from using locks. We will also have locks, but we also came up with optimizations to avoid locking in most cases.

Another approach was the title locking scheme in SpiderMonkey. It has since been removed. That scheme also involved per-object locks, but did not have our optimizations for avoiding locking in most cases.

Daloze, Marr, Bonetta, and Mossenbock propose a scheme for concurrency in Truffle, a runtime that supports many dynamic languages including JavaScript. That scheme requires objects that are shared to use synchronization for all writes. Sharing is detected by a write barrier. When a thread-local object is stored into a shared object, the runtime will traverse objects transitively reachable from the escaping object. All of those objects are marked shared so that future writes to those objects use synchronization. Our scheme does not require synchronization on shared writes, except when those writes are transitions. Our scheme does not require traversing object graphs; transition thread locality inference is on a per-object basis. A TTL object may point to a non-TTL object or vice-versa.


Cohen, Tal, and Petrank recently introduced the layout lock, which allows fast reads and writes to objects that may have their layout changed. Readers need an extra load (which must be fenced or dependent), writers must acquire a read lock (which can be as cheap as a pair of fenced loads and a branch), and transitions must acquire a write lock and do some extra book-keeping. We suspect this has a cost model similar to segmented butterflies, except for writes. Segmented butterflies require only an extra dependent load for writing, which is somewhat cheaper than acquiring a read lock, which requires two loads fenced around a store. Fencing around stores is inconvenient; for example, forcing a store to happen before a load on x86 requires an xchg instruction for the store, which is significantly more expensive than a plain mov. On ARM, it requires using the acq/rel variants of the load and store. The extra dependent load for writing to a segmented butterfly, on the other hand, requires no extra fencing on either x86 or ARM. Transitions and reads are probably about equally fast with segmented butterflies and layout locks, but writes matter a lot to us. Writes happen often enough that segmented butterflies are likely to be noticeably faster than layout locks for the JavaScript object model. On the other hand, the segmented approach is not easy to reproduce in many different kinds of data structures. Because it is more general, the layout lock could be used for the more complex JSC data structures like SparseArrayValueMap and Map/WeakMap/Set. Those data structures may be too complicated for segmentation, but not too complicated for the layout lock. Like segmented butterflies, the layout lock can be combined with transition-thread-locality inference to avoid doing any locking until it is needed to protect racy transitions.

Conclusion

This post has shown how WebKit's JavaScript implementation could be modified to support threads. Our scheme allows all JavaScript objects to be shared. Our scheme avoids locking on fast paths, like the fast paths for property accesses. We expect the scheme can be implemented with low enough overhead that serial code would not regress, and that concurrent code would only pay a significant price when adding properties to objects that are written to by multiple threads.

The next step in this crazy thought experiment is to attempt to implement it and see if it works. We will be tracking progress under bug 174276.
