volatile-2: How volatile is implemented inside the JVM

Posted by My Blog on November 12, 2023

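0. Why write this post?

Since JITWatch can show the machine code the JIT produces, it should be possible to toy with a volatile variable, read and write it, and see what the resulting memory barriers actually look like. Can we see the instructions that correspond to LoadLoad, LoadStore, StoreLoad, and StoreStore?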

1. What do the barriers look like after JIT compilation?

Environment: JDK 1.8, Intel Core i5

[Figure: JIT-compiled code for volatile read, read]
[Figure: JIT-compiled code for volatile write, write]

So on x86, the JIT-compiled code emits a memory barrier only on a volatile write (more precisely, a StoreLoad barrier).
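
A probe class of roughly the following shape is enough to reproduce this kind of experiment. It is only a sketch: the class, field, and method names are made up for illustration and are not the code behind the figures above. It assumes the JVM is started with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly` (which needs the hsdis disassembler library) or is run under JITWatch.

```java
// Hypothetical probe for inspecting JIT output; all names are illustrative.
final class VolatileProbe {
    // The volatile field whose accesses we want to find in the assembly.
    static volatile int v;

    // Two consecutive volatile reads (the "read, read" case).
    static int readRead() {
        int a = v;
        int b = v;
        return a + b;
    }

    // Two consecutive volatile writes (the "write, write" case).
    static void writeWrite(int x) {
        v = x;
        v = x + 1;
    }

    public static void main(String[] args) {
        long sink = 0;
        // Call both methods often enough for the JIT to compile them.
        for (int i = 0; i < 1_000_000; i++) {
            writeWrite(i);
            sink += readRead();
        }
        // Print the accumulated value so the loop is not dead code.
        System.out.println(sink);
    }
}
```

In the disassembly of `writeWrite`, the thing to look for is a `lock`-prefixed instruction after the stores; `readRead` should contain only plain loads.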

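2. Memory barrier categories

The volatile guarantee has two aspects:

  • At the compiler level, it prevents instruction reordering
  • At the CPU level, it prevents reordering as observed through data visibility across caches

To handle each scenario precisely and improve performance, barriers are divided into four kinds [1]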

  • LoadLoad: Load1; LoadLoad; Load2 [2]

    if (IsPublished)                   // Load and check shared flag
    {
        LOADLOAD_FENCE();              // Prevent reordering of loads
        return Value;                  // Load published value
    }
    
  • StoreStore: Store1; StoreStore; Store2

    Value = x;                         // Publish some data
    STORESTORE_FENCE();
    IsPublished = 1;                   // Set shared flag to indicate availability of data
    
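  • LoadStore: Load1; LoadStore; Store2

  • StoreLoad: Store1; StoreLoad; Load2. It is equivalent to a full memory barrier and is the most expensive of the four [3]

    It is the only barrier that rules out the outcome r1 = 0 and r2 = 0 in the classic store-buffering test, where one thread runs X = 1; r1 = Y and the other runs Y = 1; r2 = X [2]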
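
The r1 = 0 and r2 = 0 case can be tried from Java with a small litmus test. The sketch below is an illustration under assumptions, not code from the referenced article: the class and method names are invented, and because starting a fresh pair of threads per round is slow, the race window is narrow. With both shared fields declared volatile, the Java memory model forbids the (0, 0) outcome, so the counter must stay at zero.

```java
// Hypothetical store-buffering litmus test; all names are illustrative.
final class StoreBufferLitmus {
    // Shared locations. volatile makes each store followed by a load
    // behave as if a StoreLoad barrier sat between them.
    static volatile int x, y;
    // Per-thread results, read only after join().
    static int r1, r2;

    // Runs the test `iterations` times and counts (r1, r2) == (0, 0).
    static int countBothZero(int iterations) {
        int bothZero = 0;
        try {
            for (int i = 0; i < iterations; i++) {
                x = 0;
                y = 0;
                Thread t1 = new Thread(() -> { x = 1; r1 = y; });
                Thread t2 = new Thread(() -> { y = 1; r2 = x; });
                t1.start();
                t2.start();
                t1.join();   // join() makes r1 and r2 visible here
                t2.join();
                if (r1 == 0 && r2 == 0) {
                    bothZero++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return bothZero;
    }

    public static void main(String[] args) {
        System.out.println("both zero: " + countBothZero(2_000));
    }
}
```

Removing `volatile` from `x` and `y` makes the (0, 0) outcome legal, although this simple harness may run for a long time without ever hitting it.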

3. Memory barrier implementation

3.1 Barrier definitions [4]

openjdk: jdk/src/hotspot/share/runtime/orderAccess.hpp

class OrderAccess : public AllStatic {
 public:
  static void     loadload();
  static void     storestore();
  static void     loadstore();
  static void     storeload();

  static void     acquire();
  static void     release();
  static void     fence();
  ...
};

3.2 Linux implementation on x86

openjdk: jdk/src/hotspot/os_cpu/linux_x86/orderAccess_linux_x86.hpp

// Implementation of class OrderAccess.

// A compiler barrier, forcing the C++ compiler to invalidate all memory assumptions
static inline void compiler_barrier() {
  __asm__ volatile ("" : : : "memory");
}

inline void OrderAccess::loadload()   { compiler_barrier(); }
inline void OrderAccess::storestore() { compiler_barrier(); }
inline void OrderAccess::loadstore()  { compiler_barrier(); }
inline void OrderAccess::storeload()  { fence();            }

inline void OrderAccess::acquire()    { compiler_barrier(); }
inline void OrderAccess::release()    { compiler_barrier(); }

inline void OrderAccess::fence() {
   // always use locked addl since mfence is sometimes expensive
#ifdef AMD64
  __asm__ volatile ("lock; addl $0,0(%%rsp)" : : : "cc", "memory");
#else
  __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory");
#endif
  compiler_barrier();
}

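On x86, StoreLoad is therefore the only barrier that has to be implemented with an actual instruction, and it is equivalent to the full barrier fence(). The other barriers already meet the visibility requirements at the CPU level, so they only need to prevent compiler reordering.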
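
The same barrier vocabulary is reachable from Java code through the static fence methods on `java.lang.invoke.VarHandle`. That API exists only in Java 9 and later, so it does not apply to the JDK 1.8 used in section 1. The sketch below rewrites the publish/consume pattern from section 2 with explicit fences; the class and field names are hypothetical, and it makes no claim about which machine instructions a given JIT emits for each fence.

```java
import java.lang.invoke.VarHandle;

// Hypothetical publish/consume example using explicit fences (Java 9+).
final class FencedPublisher {
    // Plain (non-volatile) fields; ordering comes from the fences only.
    private int value;
    private boolean published;

    // Writer side: Store1; StoreStore; Store2.
    void publish(int x) {
        value = x;                    // publish some data
        VarHandle.storeStoreFence();  // keep the data store ahead of the flag store
        published = true;             // set the shared flag
    }

    // Reader side: Load1; LoadLoad; Load2. Returns -1 if nothing is published.
    int tryRead() {
        if (published) {              // load and check the shared flag
            VarHandle.loadLoadFence(); // keep the flag load ahead of the data load
            return value;             // load the published value
        }
        return -1;
    }

    // Full barrier, the counterpart of OrderAccess::fence().
    static void full() {
        VarHandle.fullFence();
    }
}
```

In ordinary application code, declaring the flag `volatile` is the simpler way to get the same guarantee; the explicit fences are shown only to mirror the four barrier names.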

[Figure: cache visibility rules by platform]

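3.3 What __asm__ volatile means [5]

  • __asm__ embeds assembly code in C
  • __asm__ volatile tells the optimizer not to delete, cache, or reorder the statement
  • __asm__ volatile ("" : : : "memory") emits no instruction; the "memory" clobber stops the compiler from reordering memory accesses across it
  • __asm__ volatile ("lock; addl $0,0(%%esp)" : : : "cc", "memory") emits lock; addl $0,0(%%esp) under the same restrictions; the "cc" clobber tells the compiler that the flags register is modified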

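3.4 x86 barrier instructions

According to the memory ordering section (7.2) of IA32-3:

  • SFENCE: serializes all store (write) operations that occurred before it in the program instruction stream, but does not affect load (read) operations
  • LFENCE: serializes all load (read) operations that occurred before it in the program instruction stream, but does not affect store (write) operations
  • MFENCE: serializes all store (write) and load (read) operations that occurred before it in the program instruction stream
  • LOCK: in a multiprocessor environment, the LOCK prefix prevents read/write reordering, takes exclusive ownership of the shared memory, and completes the operation atomically. Locked instructions have a total order [6]. A locked instruction is generally cheaper than MFENCE, so it is used in its place.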

4. Implementation of volatile reads and writes [7]

4.1 Bytecode implementation

getfield & putfield

[Figure: bytecode compiled for volatile variable reads and writes]
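
javac emits the same getfield and putfield opcodes whether or not a field is volatile. What differs is the field's access flags in the class file: a volatile field carries ACC_VOLATILE (0x0040), and the JVM checks that flag when it executes the access. The flag can be observed through reflection, as in this small sketch (the class and field names are made up for illustration):

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical demo: the volatile-ness of a field lives in its access flags.
final class VolatileFlagDemo {
    volatile int hot;   // compiled with ACC_VOLATILE set
    int plain;          // compiled without it

    // Returns true if the named field of this class is declared volatile.
    static boolean isVolatile(String fieldName) {
        try {
            Field f = VolatileFlagDemo.class.getDeclaredField(fieldName);
            return Modifier.isVolatile(f.getModifiers());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("hot:   " + isVolatile("hot"));
        System.out.println("plain: " + isVolatile("plain"));
    }
}
```

Running `javap -v` on the compiled class shows the same information in the field's `flags` line.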

4.2 JVM implementation (x86)

1) Definition: jdk8u/hotspot/src/cpu/x86/vm/templateTable_x86_32.cpp

void TemplateTable::getfield(int byte_no) {
  getfield_or_static(byte_no, false);
}

void TemplateTable::putfield(int byte_no) {
  putfield_or_static(byte_no, false);
}

2) Implementation details

void TemplateTable::getfield_or_static(int byte_no, bool is_static) {
  // A large amount of unrelated code is omitted here
  ...

  __ bind(Done);
  // Doug Lea believes this is not needed with current Sparcs (TSO) and Intel (PSO).
  // volatile_barrier( );
}

void TemplateTable::putfield_or_static(int byte_no, bool is_static) {
  // A large amount of unrelated code is omitted here
  ...

  // Check for volatile store
  __ testl(rdx, rdx);
  __ jcc(Assembler::zero, notVolatile);
  volatile_barrier(Assembler::Membar_mask_bits(Assembler::StoreLoad |
                                               Assembler::StoreStore));
  __ bind(notVolatile);
}

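A volatile read needs no barrier on x86, which matches the OrderAccess implementation, so the key question is how the write is implemented. Next, look at what volatile_barrier(Assembler::Membar_mask_bits) does: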

void TemplateTable::volatile_barrier(Assembler::Membar_mask_bits order_constraint ) {
  // Helper function to insert a is-volatile test and memory barrier
  if( !os::is_MP() ) return;    // Not needed on single CPU
  __ membar(order_constraint);
}

3) Finally: jdk8u/hotspot/src/cpu/x86/vm/assembler_x86.hpp

enum Membar_mask_bits {
    StoreStore = 1 << 3,
    LoadStore  = 1 << 2,
    StoreLoad  = 1 << 1,
    LoadLoad   = 1 << 0
  };

  // Serializes memory and blows flags
  void membar(Membar_mask_bits order_constraint) {
    if (os::is_MP()) {
      // We only have to handle StoreLoad
      if (order_constraint & StoreLoad) {
        // All usable chips support "locked" instructions which suffice
        // as barriers, and are much faster than the alternative of
        // using cpuid instruction. We use here a locked add [esp],0.
        // This is conveniently otherwise a no-op except for blowing
        // flags.
        // Any change to this code may need to revisit other places in
        // the code where this idiom is used, in particular the
        // orderAccess code.
        lock();
        addl(Address(rsp, 0), 0);// Assert the lock# signal here
      }
    }
  }

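So only StoreLoad gets a barrier (implemented with lock); the other barriers need nothing at the CPU instruction level, which is consistent with the OrderAccess implementation shown earlier.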
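
The decision made by membar can be restated as a few lines of Java. This is a toy model of the mask test only, with invented names; it is not HotSpot code and emits no instructions. The bit values follow the Membar_mask_bits enum quoted above.

```java
// Toy model of the x86 membar() mask test; names are illustrative.
final class MembarModel {
    static final int LOAD_LOAD   = 1 << 0;
    static final int STORE_LOAD  = 1 << 1;
    static final int LOAD_STORE  = 1 << 2;
    static final int STORE_STORE = 1 << 3;

    // True when the modelled membar would emit "lock addl $0, (rsp)":
    // only on a multiprocessor, and only if StoreLoad is requested.
    static boolean emitsLockedAdd(int orderConstraint, boolean isMultiProcessor) {
        return isMultiProcessor && (orderConstraint & STORE_LOAD) != 0;
    }

    public static void main(String[] args) {
        // Mask passed after a volatile store in putfield_or_static.
        int volatileStore = STORE_LOAD | STORE_STORE;
        System.out.println(emitsLockedAdd(volatileStore, true));
        // A mask without StoreLoad needs no instruction on x86.
        System.out.println(emitsLockedAdd(LOAD_LOAD | LOAD_STORE, true));
    }
}
```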

5. References