Rust JVM — 2

Introduction

In this post we will start turning what the spec tells us into code. Our initial target is to be able to read a .class file into memory. For that we will need to implement the ClassFile data structure and its dependencies.

Right now it is fundamental to have a rough idea of how the specification (https://docs.oracle.com/javase/specs/jvms/se11/html/index.html) is organized, so please go ahead and take some time to get familiar with it. I won’t be getting into every possible detail of the spec, so to make sure we are on equal footing, read up to the third chapter: these chapters are introductory and present a lot of concepts you will need in order to follow these posts.

Branch setup

We will be working on a feature branch called feature/classfile_def for a couple of posts, so if you are following along, the changes will be committed there, not on master.

Shut up and get to it — The class file definition

Ok, let’s take a peek at the definition of a ClassFile:

https://docs.oracle.com/javase/specs/jvms/se11/html/jvms-4.html

ClassFile {
 u4 magic;
 u2 minor_version;
 u2 major_version;
 u2 constant_pool_count;
 cp_info constant_pool[constant_pool_count-1];
 u2 access_flags;
 u2 this_class;
 u2 super_class;
 u2 interfaces_count;
 u2 interfaces[interfaces_count];
 u2 fields_count;
 field_info fields[fields_count];
 u2 methods_count;
 method_info methods[methods_count];
 u2 attributes_count;
 attribute_info attributes[attributes_count];
}

If you’ve read the first three chapters (please tell me you did) you know that the syntax uN means an unsigned value of N bytes, which thankfully maps nicely to Rust’s primitive types (u1, u2 and u4 become u8, u16 and u32). So, let’s begin with something simple:

// class_commons/src/class_file/mod.rs

pub struct ClassFile {
    magic: u32,
    minor_version: u16,
    major_version: u16,
    // no count for the constant pool necessary
    cp_info: Vec<CPInfo>,
    access_flags: ClassAccessFlags,
    this_class: u16,
    super_class: u16,
    // no need for a count again
    interfaces: Vec<u16>,
    fields: Vec<FieldInfo>,
    methods: Vec<MethodInfo>,
    attributes: Vec<AttributeInfo>,
}

A Vec is essentially a size + ptr, so this should cover the need to keep a variable with the size of each Vec, right? Indeed, that’s what we are doing, but not everything is peachy: we are already introducing some non-compliant changes. In Rust, a Vec keeps track of its size with a usize, whose width depends on the architecture the code is compiled for. Chances are the same goes for the ptr part of the Vec, since it must be able to handle 64-bit addressing where it is available, but will only be a 32-bit pointer on 32-bit architectures (I’m kind of guessing, I haven’t checked the implementation details). To be strict, we would need to implement our own Vec that only handles 16-bit indexing. We might do that eventually, but it is not a priority for now. Just be aware that for now we are non-compliant (I’ll keep a TODO on that code portion).
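Short of writing a custom Vec, one cheap way to notice when we drift outside what the format allows is to check the length against u16 before using it as a count. This is only a sketch; count_as_u2 is a hypothetical helper name and is not part of the project.

```rust
use std::convert::TryFrom;

/// Hypothetical helper (not from the project): returns the number of items
/// as the spec's `u2`, or `None` if there are more than 65,535 of them.
pub fn count_as_u2<T>(items: &[T]) -> Option<u16> {
    // `len()` is a usize; `try_from` fails instead of silently truncating.
    u16::try_from(items.len()).ok()
}
```

A caller could run this on interfaces, fields, methods and attributes before writing a class file back out, and treat None as an error.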

Also, something to keep in mind is that this struct is only used for reading the .class file into memory before turning it into a more complete runtime representation. Indeed, the HotSpot implementation of the ClassFile data structure (hotspot/share/classfile/classFileParser.hpp) is quite different from what the specification describes.

Next, as per the spec we will need the following constants defined

// commons/src/constants.rs

// Matches the u4 magic field.
pub const MAGIC_NUMBER: u32 = 0xCAFEBABE;

// Major versions are u16 so they compare directly with the u2 major_version field.
pub const JAVA_1_2_VERSION: u16 = 46;
pub const JAVA_1_3_VERSION: u16 = 47;
pub const JAVA_1_4_VERSION: u16 = 48;
pub const JAVA_1_5_VERSION: u16 = 49;
pub const JAVA_6_VERSION: u16 = 50;
pub const JAVA_7_VERSION: u16 = 51;
pub const JAVA_8_VERSION: u16 = 52;
pub const JAVA_9_VERSION: u16 = 53;
pub const JAVA_10_VERSION: u16 = 54;
pub const JAVA_11_VERSION: u16 = 55;

I’m limiting support starting with the 1.2 versions since things before were… weird…
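To give an idea of where these constants end up being used, here is a rough sketch of checking the first eight bytes of a class file. The names read_header and HeaderError are made up for this example and the constants are redeclared locally so the snippet stands on its own; the only facts taken from the spec are the field order (magic, minor_version, major_version) and that multi-byte values are stored big-endian.

```rust
// Local copies of the constants, so this sketch is self-contained.
const MAGIC: u32 = 0xCAFE_BABE;
const MIN_MAJOR: u16 = 46; // Java 1.2
const MAX_MAJOR: u16 = 55; // Java 11

/// Hypothetical error type for this sketch only.
#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic(u32),
    UnsupportedMajor(u16),
}

/// Reads magic, minor_version and major_version from the start of `bytes`.
/// Returns `(minor, major)` when the magic matches and the major version
/// is inside the range we decided to support.
pub fn read_header(bytes: &[u8]) -> Result<(u16, u16), HeaderError> {
    if bytes.len() < 8 {
        return Err(HeaderError::TooShort);
    }
    // Class files store multi-byte values in big-endian order.
    let magic = u32::from_be_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if magic != MAGIC {
        return Err(HeaderError::BadMagic(magic));
    }
    let minor = u16::from_be_bytes([bytes[4], bytes[5]]);
    let major = u16::from_be_bytes([bytes[6], bytes[7]]);
    if major < MIN_MAJOR || major > MAX_MAJOR {
        return Err(HeaderError::UnsupportedMajor(major));
    }
    Ok((minor, major))
}
```

Note that the minor version comes before the major version in the file, which is easy to get backwards.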

We also need to better describe the struct ClassAccessFlags. It needs to be represented by a u16. To make things nice to handle we will use the bitflags crate. Update the definition of the struct with the following:

// commons/class_file/mod.rs

bitflags! {
    pub struct ClassAccessFlags: u16 {
        const ACC_PUBLIC = 0x0001;
        const ACC_FINAL = 0x0010;
        const ACC_SUPER = 0x0020;
        const ACC_INTERFACE = 0x0200;
        // 0x0400 per Table 4.1-B (0x4000 is ACC_ENUM)
        const ACC_ABSTRACT = 0x0400;
        const ACC_SYNTHETIC = 0x1000;
        const ACC_ANNOTATION = 0x2000;
        const ACC_ENUM = 0x4000;
        const ACC_MODULE = 0x8000;
    }
}
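If you are curious what the masks boil down to without any crate involved, the following dependency-free sketch decodes a raw u16 into the names of the flags that are set. It deliberately does not use bitflags; class_flag_names is a hypothetical helper, and the mask values are the ones listed in Table 4.1-B of the spec.

```rust
/// Class-level masks from Table 4.1-B of the JVM 11 spec, paired with names.
const CLASS_FLAG_NAMES: [(u16, &str); 9] = [
    (0x0001, "ACC_PUBLIC"),
    (0x0010, "ACC_FINAL"),
    (0x0020, "ACC_SUPER"),
    (0x0200, "ACC_INTERFACE"),
    (0x0400, "ACC_ABSTRACT"),
    (0x1000, "ACC_SYNTHETIC"),
    (0x2000, "ACC_ANNOTATION"),
    (0x4000, "ACC_ENUM"),
    (0x8000, "ACC_MODULE"),
];

/// Hypothetical helper: lists the name of every flag whose bit is set in `raw`,
/// in ascending mask order.
pub fn class_flag_names(raw: u16) -> Vec<&'static str> {
    CLASS_FLAG_NAMES
        .iter()
        .filter(|(mask, _)| (raw & *mask) != 0)
        .map(|(_, name)| *name)
        .collect()
}
```

For instance, a typical public class compiled by a modern javac carries 0x0021, which decodes to ACC_PUBLIC and ACC_SUPER.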

Next up, the document talks about how names work inside the VM. You should read up on that, especially if you are not yet familiar with how modules work. We will get back to the definitions in §4.2.3 in the future. For now we will also skip the implementation details of §4.3. If you have ever worked with JNI, the details in this subsection will be very familiar to you.

Constant Pool Info

Next up, cp_info. Roughly, this struct uses its first byte (the tag) to determine how to interpret the following bytes. Because we have a fixed set of possible values we will represent this as an enum. Table 4.4-A shows us which cases we need in our enum, which will look roughly like this:

// commons/class_file/cp_info.rs

pub enum CPInfo {
    ClassInfo {…},
    FieldRef {…},
    MethodRef {…},
    InterfaceMethodref {…},
    String {…},
    Integer {…},
    Float {…},
    Long {…},
    Double {…},
    NameAndType {…},
    Utf8 {…},
    MethodHandle {…},
    MethodType {…},
    Dynamic {…},
    InvokeDynamic {…},
    Module {…},
    Package {…},
}

I won’t go through the details of the implementation of each case, but it mirrors the contents of the documentation. You can check the implementation in the GitHub project through this direct link: https://github.com/pedrohjordao/justvm11/blob/628c7809f893cffe9e31fc5136b3331276152f88/class_commons/src/class_file/cp_info.rs#L39. Take special care when reading the string for the Utf8 case, since the spec stores it in a modified UTF-8 encoding rather than standard UTF-8; we will implement the parsing later.
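To make the "first byte decides everything" idea concrete, here is a small sketch that maps the tag byte to the kind of entry that follows. CpTag and cp_tag_from_byte are names invented for this example (the real enum in the repository carries the payload of each case as well); the tag values are the ones from Table 4.4-A.

```rust
/// Kinds of constant pool entries, without their payloads.
/// Hypothetical type for this sketch; the project's CPInfo holds the data too.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum CpTag {
    Utf8,
    Integer,
    Float,
    Long,
    Double,
    Class,
    String,
    Fieldref,
    Methodref,
    InterfaceMethodref,
    NameAndType,
    MethodHandle,
    MethodType,
    Dynamic,
    InvokeDynamic,
    Module,
    Package,
}

/// Maps the first byte of a cp_info entry to its kind (Table 4.4-A).
/// Returns `None` for values the spec does not define, such as 2, 13 or 14.
pub fn cp_tag_from_byte(tag: u8) -> Option<CpTag> {
    let kind = match tag {
        1 => CpTag::Utf8,
        3 => CpTag::Integer,
        4 => CpTag::Float,
        5 => CpTag::Long,
        6 => CpTag::Double,
        7 => CpTag::Class,
        8 => CpTag::String,
        9 => CpTag::Fieldref,
        10 => CpTag::Methodref,
        11 => CpTag::InterfaceMethodref,
        12 => CpTag::NameAndType,
        15 => CpTag::MethodHandle,
        16 => CpTag::MethodType,
        17 => CpTag::Dynamic,
        18 => CpTag::InvokeDynamic,
        19 => CpTag::Module,
        20 => CpTag::Package,
        _ => return None,
    };
    Some(kind)
}
```

A reader would call this once per entry and then read however many bytes that kind requires.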

Descriptors

Let’s go back to §4.3 and talk about descriptors. Field types and method signatures are represented using descriptors. A descriptor is a string that represents a type; a method descriptor is built from the types of the method’s parameters followed by its return type. Being familiar with the way these descriptors are constructed is fundamental to understanding how to parse them.

The next couple of structures we will implement (FieldInfo and MethodInfo) both have pointers to descriptors.
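As a preview of what parsing a descriptor involves, here is a minimal sketch of the grammar in §4.3. Every name in it (FieldType, parse_field_descriptor, parse_method_descriptor) is hypothetical and not taken from the project, and it skips some checks the spec asks for, such as the limit of 255 array dimensions and validating the characters of a class name.

```rust
/// A parsed field type. Hypothetical representation for this sketch.
#[derive(Debug, PartialEq)]
pub enum FieldType {
    Byte,
    Char,
    Double,
    Float,
    Int,
    Long,
    Short,
    Boolean,
    Object(String),        // L<ClassName>;
    Array(Box<FieldType>), // [<ComponentType>
}

/// Parses one field type from the front of `s`, returning it with the unparsed rest.
fn parse_one(s: &str) -> Option<(FieldType, &str)> {
    let mut chars = s.chars();
    let first = chars.next()?;
    let rest = chars.as_str();
    match first {
        'B' => Some((FieldType::Byte, rest)),
        'C' => Some((FieldType::Char, rest)),
        'D' => Some((FieldType::Double, rest)),
        'F' => Some((FieldType::Float, rest)),
        'I' => Some((FieldType::Int, rest)),
        'J' => Some((FieldType::Long, rest)),
        'S' => Some((FieldType::Short, rest)),
        'Z' => Some((FieldType::Boolean, rest)),
        'L' => {
            // The class name runs up to the next ';' and must not be empty.
            let end = rest.find(';')?;
            if end == 0 {
                return None;
            }
            Some((FieldType::Object(rest[..end].to_string()), &rest[end + 1..]))
        }
        '[' => {
            let (component, rest) = parse_one(rest)?;
            Some((FieldType::Array(Box::new(component)), rest))
        }
        _ => None,
    }
}

/// A field descriptor is exactly one field type with nothing left over.
pub fn parse_field_descriptor(s: &str) -> Option<FieldType> {
    match parse_one(s)? {
        (ty, "") => Some(ty),
        _ => None,
    }
}

/// A method descriptor is `(` parameter types `)` return type.
/// A `None` return type stands for `V` (void).
pub fn parse_method_descriptor(s: &str) -> Option<(Vec<FieldType>, Option<FieldType>)> {
    let mut rest = s.strip_prefix('(')?;
    let mut params = Vec::new();
    while !rest.starts_with(')') {
        let (ty, next) = parse_one(rest)?;
        params.push(ty);
        rest = next;
    }
    let ret = &rest[1..]; // skip the ')'
    if ret == "V" {
        Some((params, None))
    } else {
        Some((params, Some(parse_field_descriptor(ret)?)))
    }
}
```

The spec’s own example, (IDLjava/lang/Thread;)Ljava/lang/Object;, describes a method taking an int, a double and a Thread and returning an Object.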

FieldInfo

The field_info structure represents the fields in a class. Like in a lot of other parts of the JVM, this struct uses “pointers” into the constant pool to represent its information, such as field name, field descriptor and attributes. These pointers are not memory pointers in the traditional sense of the concept, but indexes into the constant pool of the ClassFile data structure.
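Since these indexes show up everywhere, here is a small sketch of what following one looks like. PoolEntry is a simplified stand-in for the CPInfo enum and utf8_at is a hypothetical helper. The sketch assumes the Vec stores the entry the spec calls constant_pool[1] at position 0, because the spec indexes the pool from 1 to constant_pool_count - 1. It also assumes the Vec keeps a placeholder for the extra slot that Long and Double entries occupy, otherwise the arithmetic below would be off.

```rust
/// Simplified stand-in for the post's CPInfo: only what this sketch needs.
pub enum PoolEntry {
    Utf8(String),
    Other,
}

/// Follows a constant pool index expecting to find a Utf8 entry.
/// Returns `None` for index 0, an out-of-range index, or a non-Utf8 entry.
pub fn utf8_at(pool: &[PoolEntry], index: u16) -> Option<&str> {
    // Spec indexes start at 1, so shift by one to get the Vec position.
    let slot = usize::from(index).checked_sub(1)?;
    match pool.get(slot)? {
        PoolEntry::Utf8(text) => Some(text.as_str()),
        PoolEntry::Other => None,
    }
}
```

Resolving a field’s name would then be a call like utf8_at(&pool, field.name_index), and the same goes for descriptor_index.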

Here is our representation of a FieldInfo

// commons/classfile/field_info.rs

pub struct FieldInfo {
    pub access_flags: FieldAccessFlags,
    pub name_index: u16,
    pub descriptor_index: u16,
    pub attribute_info: Vec<attribute_info::AttributeInfo>,
}

with the access flags as

// commons/classfile/field_info.rs

bitflags! {
    // Flags from Table 4.5-A; ACC_ANNOTATION is not a field flag, so it is not listed.
    pub struct FieldAccessFlags: u16 {
        const ACC_PUBLIC = 0x0001;
        const ACC_PRIVATE = 0x0002;
        const ACC_PROTECTED = 0x0004;
        const ACC_STATIC = 0x0008;
        const ACC_FINAL = 0x0010;
        const ACC_VOLATILE = 0x0040;
        const ACC_TRANSIENT = 0x0080;
        const ACC_SYNTHETIC = 0x1000;
        const ACC_ENUM = 0x4000;
    }
}

MethodInfo

Next we have the information for methods. This will look really similar to the FieldInfo structure, with the main difference being the access flags:

// commons/classfile/method_info.rs

pub struct MethodInfo {
    pub access_flags: MethodAccessFlags,
    pub name_index: u16,
    pub descriptor_index: u16,
    pub attribute_info: Vec<attribute_info::AttributeInfo>,
}

bitflags! {
    pub struct MethodAccessFlags: u16 {
        const ACC_PUBLIC = 0x0001;
        const ACC_PRIVATE = 0x0002;
        const ACC_PROTECTED = 0x0004;
        const ACC_STATIC = 0x0008;
        const ACC_FINAL = 0x0010;
        const ACC_SYNCHRONIZED = 0x0020;
        const ACC_BRIDGE = 0x0040;
        const ACC_VARARGS = 0x0080;
        const ACC_NATIVE = 0x0100;
        const ACC_ABSTRACT = 0x0400;
        const ACC_STRICT = 0x0800;
        const ACC_SYNTHETIC = 0x1000;
    }
}


You might have noticed that both MethodInfo and FieldInfo have a vector of something called AttributeInfo. This is a more complex structure, and covering it here would make this post run way too long. We will talk about its implementation in a future post.

Conclusion

I think that was a pretty good chunk of implementation for one day. Next post we will go through AttributeInfo which is one of the most complex structures of the class file.

See you then!