External Data Representation & Marshalling

11 min read · Jun 20, 2020

The information stored in running programs is represented as data structures, for example as sets of interconnected objects, whereas the information in messages consists of sequences of bytes. The individual primitive data items transmitted in messages can be data values of many different types, and not all computers store primitive values such as integers in the same order. The representation of floating-point numbers also differs between architectures. There are two variants for the ordering of integers: big-endian order, in which the most significant byte comes first, and little-endian order, in which the most significant byte comes last. The set of codes used to represent characters is another issue. For example, the majority of applications on systems such as UNIX use ASCII character coding, taking one byte per character, whereas the Unicode standard allows text in many different languages to be represented and takes two bytes per character. One of the following methods can be used to enable any two computers to exchange binary data values:

The values are converted to an agreed external format before transmission and converted to the local form on receipt; if the two computers are known to be of the same type, the conversion to external format can be omitted.
The values are transmitted in the sender's format, accompanied by an indication of the format used, and the recipient converts the values if necessary.

Note, however, that the bytes themselves are never altered during transmission. To support RMI or RPC, any data type that can be passed as an argument or returned as a result must be able to be flattened, with its individual primitive data values represented in an agreed format. An agreed standard for the representation of data structures and primitive values is called an external data representation.
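The effect of byte ordering can be seen in a few lines of Java. The following is a minimal sketch, not part of any middleware: the class name ByteOrderDemo and its helper methods are made up for illustration. It encodes the same 32-bit integer in both orders and shows that the receiver recovers the value only if it interprets the bytes in the sender's order.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: ByteOrderDemo and its methods are hypothetical names.
class ByteOrderDemo {
    // Encode a 32-bit integer as four bytes in the given byte order.
    static byte[] encode(int value, ByteOrder order) {
        return ByteBuffer.allocate(Integer.BYTES).order(order).putInt(value).array();
    }

    // Decode four bytes, interpreting them in the order the sender declared.
    static int decode(byte[] bytes, ByteOrder senderOrder) {
        return ByteBuffer.wrap(bytes).order(senderOrder).getInt();
    }

    public static void main(String[] args) {
        int value = 0x0A0B0C0D;
        byte[] big = encode(value, ByteOrder.BIG_ENDIAN);       // 0A 0B 0C 0D
        byte[] little = encode(value, ByteOrder.LITTLE_ENDIAN); // 0D 0C 0B 0A
        System.out.printf("big-endian first byte:    %02X%n", big[0]);
        System.out.printf("little-endian first byte: %02X%n", little[0]);

        // Correct interpretation: the receiver uses the sender's order.
        System.out.println(decode(little, ByteOrder.LITTLE_ENDIAN) == value);
        // Wrong interpretation: the bytes are unchanged but the value is not.
        System.out.printf("%08X%n", decode(big, ByteOrder.LITTLE_ENDIAN));
    }
}
```

The bytes in the message never change; only the interpretation applied by the receiver does, which is why the sender's format has to be agreed or indicated.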

Marshalling is the process of taking a collection of data items and assembling them into a form suitable for transmission in a message. Unmarshalling is the process of disassembling them on arrival to produce an equivalent collection of data items at the destination. Marshalling thus consists of the translation of structured data items and primitive values into an external data representation. Similarly, unmarshalling consists of the generation of primitive values from their external data representation and the rebuilding of the data structures.
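To make the two directions concrete, here is a hand-written sketch of marshalling and unmarshalling a small structure. The Person class, its fields and the chosen field order are assumptions made for this example; the external format is simply whatever DataOutputStream writes (big-endian integers and length-prefixed strings).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: MarshalDemo and Person are hypothetical names.
class MarshalDemo {
    static final class Person {
        final String name;
        final String place;
        final int year;

        Person(String name, String place, int year) {
            this.name = name;
            this.place = place;
            this.year = year;
        }
    }

    // Marshalling: flatten the structure into a sequence of bytes.
    static byte[] marshal(Person p) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(p.name);
            out.writeUTF(p.place);
            out.writeInt(p.year);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Unmarshalling: read the values back in the same agreed order.
    static Person unmarshal(byte[] message) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(message))) {
            String name = in.readUTF();
            String place = in.readUTF();
            int year = in.readInt();
            return new Person(name, place, year);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] message = marshal(new Person("Smith", "London", 1984));
        Person copy = unmarshal(message);
        System.out.println(message.length + " bytes: " + copy.name + ", " + copy.place + ", " + copy.year);
    }
}
```

Both sides must agree on the order and types of the fields, since nothing in the message describes them. Writing such code by hand for every type is tedious, which is one reason the task is usually left to middleware.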

This article covers the following three alternative approaches to external data representation and marshalling:

CORBA's common data representation, which is concerned with an external representation for the structured and primitive types that can be passed as the arguments and results of remote method invocations in CORBA. It can be used by a variety of programming languages.
Java's object serialization, which is concerned with the flattening and external data representation of any single object or tree of objects that may need to be transmitted in a message or stored on a disk. It is for use only by Java.
XML (Extensible Markup Language), which defines a textual format for representing structured data. It was originally intended for documents containing textual, self-describing structured data, for example documents accessible on the Web, but it is now also used to represent the data sent in messages exchanged by clients and servers in web services.

In the first two cases, marshalling and unmarshalling are intended to be carried out by a middleware layer, without any involvement on the part of the application programmer. Even in the case of XML, which is textual and therefore more accessible to hand-encoding, software for marshalling and unmarshalling is available for all commonly used platforms and programming environments. Because marshalling requires the consideration of all the finest details of the representation of the primitive components of composite objects, the process is likely to be error-prone if carried out by hand. Compactness is another issue that can be addressed in the design of automatically generated marshalling procedures.

In the first two approaches, the primitive data types are marshalled into a binary form. In the third approach (XML), the primitive data types are represented textually. The textual representation of a data value will generally be longer than the equivalent binary representation.

Another issue with regard to the design of marshalling methods is whether the marshalled data should include information concerning the type of its contents. For example, CORBA's representation includes just the values of the objects transmitted, and nothing about their types. On the other hand, both Java serialization and XML do include type information, but in different ways. Java puts all of the required type information into the serialized form, whereas XML documents may refer to externally defined sets of names (with types) called namespaces.

CORBA’s Common Data Representation (CDR)

CORBA CDR is the external data representation defined with CORBA 2.0. CDR can represent all of the data types that can be used as arguments and return values in remote invocations in CORBA. These consist of 15 primitive types, which include short (16-bit), long (32-bit), unsigned short, unsigned long, float (32-bit), double (64-bit), char, boolean (TRUE, FALSE) and octet (8-bit), together with a range of composite types. Each argument or result in a remote invocation is represented by a sequence of bytes in the invocation or result message.

Primitive types: CDR defines a representation for both big-endian and little-endian orderings. Values are transmitted in the sender's ordering, which is specified in each message. The recipient translates if it requires a different ordering. For example, a 16-bit short occupies two bytes in the message, and for big-endian ordering the most significant bits occupy the first byte and the least significant bits occupy the second byte. Each primitive value is placed at an index in the sequence of bytes according to its size. Suppose that the sequence of bytes is indexed from zero upwards. Then a primitive value of size n bytes (where n = 1, 2, 4 or 8) is appended to the sequence at an index that is a multiple of n in the stream of bytes. Floating-point values follow the IEEE standard, in which the sign, exponent and fractional part are in bytes 0 to n for big-endian ordering and the other way round for little-endian ordering. Characters are represented by a code set agreed between client and server.
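The alignment rule is easier to follow with a worked example. The sketch below is a simplification written for this article, not a CDR implementation: it ignores message headers, strings and constructed types, and the class and method names are invented. It places each primitive value at the next index that is a multiple of its size, in the byte order chosen by the sender.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrative only: a simplified CDR-style writer, not a real ORB component.
class CdrAlignmentSketch {
    // Smallest index >= position that is a multiple of size (size is 1, 2, 4 or 8).
    static int align(int position, int size) {
        return (position + size - 1) / size * size;
    }

    // Moves the write position forward to the next multiple of size,
    // leaving zero-valued padding bytes in between.
    static void pad(ByteBuffer buf, int size) {
        buf.position(align(buf.position(), size));
    }

    static byte[] marshal(byte octet, int longValue, short shortValue,
                          double doubleValue, ByteOrder senderOrder) {
        ByteBuffer buf = ByteBuffer.allocate(64).order(senderOrder);
        buf.put(octet);                           // index 0
        pad(buf, 4); buf.putInt(longValue);       // index 4, after 3 padding bytes
        pad(buf, 2); buf.putShort(shortValue);    // index 8, already aligned
        pad(buf, 8); buf.putDouble(doubleValue);  // index 16, after 6 padding bytes
        byte[] message = new byte[buf.position()];
        buf.flip();
        buf.get(message);
        return message;
    }

    public static void main(String[] args) {
        byte[] message = marshal((byte) 1, 1984, (short) 7, 2.5, ByteOrder.BIG_ENDIAN);
        System.out.println("message length: " + message.length); // 24
    }
}
```

The padding costs a few bytes but lets a receiver read each value directly from an aligned position.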

Constructed types: The primitive values that make up each constructed type are added to a sequence of bytes in a particular order.

Marshalling in CORBA

Marshalling operations can be generated automatically from the specification of the types of the data items to be transmitted in a message. The types of the data structures and the types of the basic data items are described in CORBA IDL, which provides a notation for describing the types of the arguments and results of RMI methods.

The CORBA interface compiler generates appropriate marshalling and unmarshalling operations for the arguments and results of remote methods from the definitions of the types of their parameters and results.

Java object serialization

In Java RMI, both objects and primitive data values may be passed as arguments and results of method invocations. An object is an instance of a Java class.

In Java, the term serialization refers to the activity of flattening an object or a connected set of objects into a serial form that is suitable for storing on disk or transmitting in a message, for example as an argument or the result of an RMI. Deserialization consists of restoring the state of an object or a set of objects from their serialized form. It is assumed that the process that does the deserialization has no prior knowledge of the types of the objects in the serialized form. Therefore some information about the class of each object is included in the serialized form. This information enables the recipient to load the appropriate class when an object is deserialized.

The information about a class consists of the name of the class and a version number. The version number is intended to change when major changes are made to the class. It can be set by the programmer or calculated automatically as a hash of the name of the class and its instance variables, methods and interfaces. The process that deserializes an object can check that it has the correct version of the class.
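In Java the version number is the serialVersionUID of the class. The following sketch uses two made-up classes, Person and Place, to show both cases: one sets the number explicitly and the other leaves it to the runtime to compute. ObjectStreamClass.lookup is the standard way to ask for the value that will be written to the stream.

```java
import java.io.ObjectStreamClass;
import java.io.Serializable;

// Illustrative only: VersionDemo, Person and Place are hypothetical names.
class VersionDemo {
    // The programmer sets the version number explicitly.
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        String name;
    }

    // No explicit version: the runtime computes a hash of the class details.
    static class Place implements Serializable {
        String name;
    }

    // Returns the version number that serialization associates with a class.
    static long versionOf(Class<?> type) {
        return ObjectStreamClass.lookup(type).getSerialVersionUID();
    }

    public static void main(String[] args) {
        System.out.println("Person: " + versionOf(Person.class)); // 1
        System.out.println("Place:  " + versionOf(Place.class));  // a computed hash
    }
}
```

Because the computed value depends on the details of the class, declaring the number explicitly is the usual way to keep it stable across compatible changes.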

Java objects can contain references to other objects. When an object is serialized, all the objects that it references are serialized together with it to ensure that when the object is reconstructed, all of its references can be fulfilled at the destination. References are serialized as handles. In this case, the handle is a reference to an object within the serialized form, for example the next number in a sequence of positive integers. The serialization procedure must ensure that there is a one-to-one correspondence between object references and handles. It must also ensure that each object is written only once: on the second or subsequent occurrence of an object, the handle is written instead of the object.
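A quick way to observe handles at work is to serialize a structure that refers to the same object twice. In this sketch (the HandleDemo and Person names are invented for the example) an array holds two references to one Person. After a round trip through the serialized form, the copy still holds two references to a single object rather than two separate copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustrative only: HandleDemo and Person are hypothetical names.
class HandleDemo {
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;

        Person(String name) {
            this.name = name;
        }
    }

    static byte[] serialize(Object object) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(object);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    // True if two references to one object still point to one object after a round trip.
    static boolean sharedReferenceSurvives() {
        Person smith = new Person("Smith");
        Person[] pair = { smith, smith };
        Person[] copy = (Person[]) deserialize(serialize(pair));
        return copy[0] == copy[1] && copy[0] != smith;
    }

    public static void main(String[] args) {
        System.out.println(sharedReferenceSurvives()); // true
    }
}
```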

To serialize an object, its class information is written out, followed by the types and names of its instance variables. If the instance variables belong to new classes, then their class information must also be written out, followed by the types and names of their instance variables. This recursive procedure continues until the class information and the types and names of the instance variables of all of the necessary classes have been written out. Each class is given a handle, and no class is written more than once to the stream of bytes.

The contents of instance variables that are primitive types, such as integers, chars, booleans, bytes and longs, are written in a portable binary format using methods of the ObjectOutputStream class. Strings and characters are written by its writeUTF method using the Universal Transfer Format (UTF-8), which enables ASCII characters to be represented unchanged (in one byte), whereas Unicode characters are represented by multiple bytes. Strings are preceded by the number of bytes they occupy in the stream.
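The length prefix and the one-byte ASCII encoding can be checked directly. The sketch below uses DataOutputStream.writeUTF, which follows the same DataOutput contract as the ObjectOutputStream method but writes no stream header, so the bytes are easier to inspect. The class name WriteUtfDemo is made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Illustrative only: WriteUtfDemo is a hypothetical name.
class WriteUtfDemo {
    // Returns the bytes that writeUTF produces for one string.
    static byte[] encode(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        // Two length bytes plus one byte for each of the five ASCII characters.
        System.out.println(encode("Smith").length);  // 7
        // Two length bytes plus two bytes for the single non-ASCII character.
        System.out.println(encode("\u00e9").length); // 4
    }
}
```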

External Data Representation in Java Object Serialization

To make use of Java serialization, for example to serialize a Person object, create an instance of the class ObjectOutputStream and invoke its writeObject method, passing the Person object as its argument. To deserialize an object from a stream of data, open an ObjectInputStream on the stream and use its readObject method to reconstruct the original object. The use of this pair of classes is similar to the use of DataOutputStream and DataInputStream.
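The steps just described can be sketched as follows. The Person class and its fields (name, place, year) are assumptions made for the example, and an in-memory byte array stands in for the network or disk.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustrative only: SerializationDemo and Person are hypothetical names.
class SerializationDemo {
    static class Person implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final String place;
        final int year;

        Person(String name, String place, int year) {
            this.name = name;
            this.place = place;
            this.year = year;
        }
    }

    // Serialize: create an ObjectOutputStream and pass the object to writeObject.
    static byte[] serialize(Person person) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(person);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Deserialize: open an ObjectInputStream on the data and call readObject.
    static Person deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (Person) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Person copy = deserialize(serialize(new Person("Smith", "London", 1984)));
        System.out.println(copy.name + ", " + copy.place + ", " + copy.year);
    }
}
```

Unlike hand-written marshalling code, nothing here names the individual fields: the class only has to implement Serializable.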

Serialization and deserialization of the arguments and results of remote invocations are generally carried out automatically by the middleware, without any participation by the application programmer. If necessary, programmers with special requirements may write their own versions of the methods that read and write objects. Another way in which a programmer can modify the effects of serialization is by declaring variables that should not be serialized as transient. Examples of things that should not be serialized are references to local resources such as files and sockets.
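The effect of the transient keyword can be seen in a small sketch. The Session class below is invented for the example, and a plain Object stands in for a local resource such as a socket. The transient field is skipped when the object is written, so it is null in the deserialized copy.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Illustrative only: TransientDemo and Session are hypothetical names.
class TransientDemo {
    static class Session implements Serializable {
        private static final long serialVersionUID = 1L;
        final String user;
        // Stand-in for a local resource; Object is not Serializable,
        // so without 'transient' writeObject would fail.
        transient Object localResource;

        Session(String user, Object localResource) {
            this.user = user;
            this.localResource = localResource;
        }
    }

    // Serializes a session and immediately deserializes the result.
    static Session roundTrip(Session session) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(session);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (Session) in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Session copy = roundTrip(new Session("alice", new Object()));
        System.out.println(copy.user);                  // alice
        System.out.println(copy.localResource == null); // true
    }
}
```

The receiving side is then responsible for re-creating any such resource locally if it needs one.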

Use of reflection

The Java language supports reflection: the ability to enquire about the properties of a class, such as the names and types of its instance variables and methods. It also enables classes to be created from their names, and a constructor with given argument types to be created for a given class. Reflection makes it possible to do serialization and deserialization in a completely generic manner. This means that there is no need to generate special marshalling functions for each type of object, as described above for CORBA.

Java object serialization uses reflection to find out the class name of the object to be serialized and the names, types and values of its instance variables. That is all that is needed for the serialized form.

For deserialization, the class name in the serialized form is used to create a class. This is then used to create a new constructor with argument types corresponding to those specified in the serialized form. Finally, the new constructor is used to create a new object with instance variables whose values are read from the serialized form.
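The reflection calls mentioned above can be tried out in isolation. The following is a loose illustration and not the actual mechanism inside ObjectInputStream: the Person class and the helper methods describe and create are invented for the example. It inspects the instance variables of an object and then builds a new object starting from nothing more than a class name and a list of argument types.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: ReflectionDemo, Person, describe and create are hypothetical names.
class ReflectionDemo {
    static class Person {
        String name;
        int year;

        Person(String name, int year) {
            this.name = name;
            this.year = year;
        }
    }

    // Lists "type name = value" for each instance variable of the object.
    static List<String> describe(Object object) {
        List<String> lines = new ArrayList<>();
        try {
            for (Field field : object.getClass().getDeclaredFields()) {
                if (Modifier.isStatic(field.getModifiers()) || field.isSynthetic()) {
                    continue; // only ordinary instance variables are of interest
                }
                field.setAccessible(true);
                lines.add(field.getType().getSimpleName() + " " + field.getName()
                        + " = " + field.get(object));
            }
        } catch (IllegalAccessException e) {
            throw new IllegalStateException(e);
        }
        return lines;
    }

    // Creates an object from a class name, constructor argument types and values.
    static Object create(String className, Class<?>[] argumentTypes, Object[] arguments) {
        try {
            Class<?> type = Class.forName(className);
            Constructor<?> constructor = type.getDeclaredConstructor(argumentTypes);
            constructor.setAccessible(true);
            return constructor.newInstance(arguments);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Object person = create(Person.class.getName(),
                new Class<?>[] { String.class, int.class },
                new Object[] { "Smith", 1984 });
        describe(person).forEach(System.out::println);
    }
}
```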

Extensible Markup Language (XML)

XML is a markup language that was defined by the World Wide Web Consortium (W3C) for general use on the Web. In general, the term markup language refers to a textual encoding that represents both a text and details as to its structure or its appearance. Both XML and HTML were derived from SGML (Standard Generalized Markup Language), a very complex markup language. HTML was designed for defining the appearance of web pages. XML was designed for writing structured documents for the Web.

XML data items are tagged with 'markup' strings. The tags are used to describe the logical structure of the data and to associate attribute-value pairs with logical structures. That is, in XML the tags relate to the structure of the text that they enclose, in contrast to HTML, in which the tags specify how a browser could display the text.

XML is used to enable clients to communicate with web services and to define the interfaces and other properties of web services. However, XML is also used in many other ways, including in archiving and retrieval systems: although an XML archive may be larger than a binary one, it has the advantage of being readable on any computer. Other examples of uses of XML include the specification of user interfaces and the encoding of configuration files in operating systems.

XML is extensible in the sense that users can define their own tags, in contrast to HTML, which uses a fixed set of tags. However, if an XML document is intended to be used by more than one application, then the names of the tags must be agreed between them. For example, clients usually use SOAP messages to communicate with web services. SOAP is an XML format whose tags are published for use by web services and their clients.

Some external data representations, such as CORBA CDR, do not need to be self-describing, because it is assumed that the client and server exchanging a message have prior knowledge of the order and the types of the information it contains. XML, however, was intended to be used by multiple applications for different purposes. The provision of tags, together with the use of namespaces to define the meaning of the tags, has made this possible. In addition, the use of tags enables applications to select just those parts of a document they need to process, and that processing will not be affected by the addition of information relevant to other applications.

External Data Representation in Extensible Markup Language

XML documents, being textual, can be read by humans. In practice, most XML documents are generated and read by XML-processing software, but the ability to read XML can be useful when things go wrong. In addition, the use of text makes XML independent of any particular platform.

The use of a textual rather than a binary representation, together with the use of tags, makes the messages large, so they require longer processing and transmission times, as well as more storage space. However, files and messages can be compressed: HTTP version 1.1 allows data to be compressed, which saves bandwidth during transmission.
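How much compression helps depends on the document, but repetitive tags tend to compress well. The sketch below builds a made-up XML document containing the same kind of element many times and compresses it with the gzip classes in the Java standard library. The class name and the sample document are assumptions for illustration, and no particular compression ratio is claimed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative only: XmlCompressionDemo and its sample document are hypothetical.
class XmlCompressionDemo {
    // Builds a repetitive XML document with 100 person elements.
    static byte[] sampleDocument() {
        StringBuilder xml = new StringBuilder("<people>");
        for (int i = 0; i < 100; i++) {
            xml.append("<person id=\"").append(i).append("\">")
               .append("<name>Smith</name><place>London</place><year>1984</year>")
               .append("</person>");
        }
        xml.append("</people>");
        return xml.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = in.read(chunk)) != -1) {
                bytes.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] document = sampleDocument();
        byte[] compressed = gzip(document);
        System.out.println("original:   " + document.length + " bytes");
        System.out.println("compressed: " + compressed.length + " bytes");
    }
}
```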

XML elements and attributes

Elements: An element in XML consists of a portion of character data surrounded by matching start and end tags.

Attributes: A start tag may optionally include pairs of associated attribute names and values. The syntax is the same as for HTML, in which an attribute name is followed by an equal sign and an attribute value in quotes. Multiple attribute values are separated by spaces.

Names: The names of tags and attributes in XML generally start with a letter, but can also start with an underscore or a colon.

Binary data: All of the information in XML elements must be expressed as character data, so binary values have to be encoded as text, for example in base64 notation.

XML namespaces: Traditionally, namespaces provide a means for scoping names. An XML namespace is a set of names for a collection of element types and attributes that is referenced by a URL. Any other XML document can use an XML namespace by referring to its URL.

XML schemas: An XML schema defines the elements and attributes that can appear in a document, how the elements are nested and the order and number of elements, and whether an element is empty or can include text. For each element, it defines the type and default value.
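To tie these definitions together, here is a small sketch that parses an XML description of a person using the DOM parser in the Java standard library. The document itself, with its person, name, place and year elements and its id attribute, is made up for the example and does not come from any published schema.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative only: XmlPersonDemo and its sample document are hypothetical.
class XmlPersonDemo {
    // One element (person) with an attribute (id) and three nested elements.
    static final String XML =
        "<person id=\"123456789\">"
        + "<name>Smith</name>"
        + "<place>London</place>"
        + "<year>1984</year>"
        + "</person>";

    // Parses the text and returns the outermost element.
    static Element parse(String xml) {
        try {
            Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return document.getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns the character data of the first nested element with the given tag.
    static String childText(Element parent, String tag) {
        return parent.getElementsByTagName(tag).item(0).getTextContent();
    }

    public static void main(String[] args) {
        Element person = parse(XML);
        System.out.println(person.getTagName());        // person
        System.out.println(person.getAttribute("id"));  // 123456789
        System.out.println(childText(person, "name"));  // Smith
        // Values arrive as character data and must be converted by the reader.
        System.out.println(Integer.parseInt(childText(person, "year")) + 1);
    }
}
```

Note that the year is read as text and only becomes an integer when the application converts it, which is the textual counterpart of the binary unmarshalling shown for the other two approaches.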

In this post, I have covered external data representation and marshalling using CORBA's Common Data Representation, Java object serialization and XML (Extensible Markup Language).