THRAY format specification
Work in progress.
(This is an informal specification of the THRAY format. In the future, an RFC may be planned.)
The THRAY format is a superset of JSON. The format extends JSON with support for comments and more data types.
Like JSON, THRAY is a text format, that is, a sequence of characters where each character is a Unicode code point.
Conventions used in this document
The grammar syntax used in this specification is ABNF as defined in RFC 5234 and RFC 7405.
Document
The text representing a valid THRAY format is called a THRAY document. A THRAY document contains a THRAY value with optional whitespace and comments.
document = value-with-wsc
value = null /
boolean /
float /
integer /
string /
binary /
array /
object /
extension
Null data type
The null data type can be used to represent an absence of a value. It has only one case-sensitive token.
null = %s"null"
Boolean data type
The boolean data type represents a boolean value. It has only two case-sensitive tokens.
boolean = %s"true" / %s"false"
Numbers
Unlike in JSON, numbers have two distinct data types: integer and float.
Numbers can be prefixed with optional character +
or -
to indicate sign.
Digits are sequence of one or characters for the appropriate base. Digits can be grouped using an underscore (_
) to increase readability. The grouping character must be preceded and followed by a digit character.
sign = "+" / "-"
digits-with-groups = 1*DIGIT *( "_" 1*DIGIT )
hex-digits-with-groups = 1*HEXDIG *( "_" 1*HEXDIG )
Integer data type
The integer data type represents whole numbers. Integers can be expressed in decimal or hexadecimal notations.
A base 10 integer is a sequence of digits. The set of characters for digits range from “0” to “9” inclusive.
A base 16 (hexadecimal) integer is a sequence of 0x
(case-sensitive) followed by digits. The set of characters for digits range from “0” to “9” and “A” to “F” inclusive, case-insensitive.
integer = [sign] ( hex-integer / decimal-integer )
decimal-integer = digits-with-groups
hex-integer = %s"0x" hex-digits-with-groups
Implementations must support number sizes up to 64-bit signed integer values. Implementations may support larger values.
Float data type
The float data type represents IEEE 754 double-precision floating point numbers.
Floats are a sequence of digits, a decimal separator, digits, and an optional exponent part.
float = [sign] ( infinity / nan / float-digits )
float-digits = digits-with-groups "." digits-with-groups [exponent-part]
exponent-part = "e" [sign] digits-with-groups
Other mathematical values, Infinity
and NaN
, are supported as case-sensitive tokens.
infinity = %s"Infinity"
nan = %s"NaN"
String data type
The string data type represents text values. Strings are delimited by a double quote character. The content consists of zero or more literals or escape sequences.
A string literal is a character interpreted without any transformations. These are space, ASCII printable characters excluding double quote and backslash, and Unicode characters.
string = string-line *string-line-continuation
string-line = DQUOTE *string-fragment DQUOTE
string-fragment = string-literal / escape-sequence
string-literal = 1*(
SP /
%x21 / ; ASCII printable excluding double quote and backslash
%x23-5B /
%x5D-7E /
unicode-char
)
backslash = %x5C
An escape sequence is a sequence of characters that represent an encoded character. An escape sequence begins with the escape character backslash followed by sequence of characters defined in the following table.
Unescaped |
Escaped |
---|---|
quotation mark ( |
|
backslash ( |
|
slash ( |
|
backspace (U+0008) |
|
form feed (U+000C) |
|
line feed (U+000A) |
|
carriage return (U+000D) |
|
tab (U+0009) |
|
Unicode codepoint (BMP only) |
|
Unicode codepoint |
|
escape-sequence = backslash (
DQUOTE /
backslash /
"b" / ; U+0008
"f" / ; U+000C
"n" / ; U+000A
"r" / ; U+000D
"t" / ; U+0009
char-code-brace /
char-code-bmp
)
char-code-brace = %s"u{" 1*6HEXDIG "}"
char-code-bmp = %s"u" 4HEXDIG
Surrogate pairs
When using the JSON 4-hexadecimal character Unicode escape sequence, only code points in the Basic Multilingual Plane may be used. To represent code points outside the plane, the character is represented using UTF-16 surrogate pairs.
Implementations must decode surrogate pairs into their corresponding characters. Implementations must return an error by default if decoding surrogate pairs fails. Implementations may use an alternate error handling, such as substitution by a Replacement Character (U+FFFD), if configured by the user.
Implementations should use the brace syntax for characters outside the BMP.
Line continuation
Line continuations can be used to break long lines for readability purposes. A line continuation joiner is a sequence of backslash, line separator, and zero or more whitespace characters.
string-line-continuation = line-continuation-joiner string-line
line-continuation-joiner = backslash newline *WSP
Introducing or removing line continuations must not affect the value of the string.
Binary data type
The binary data type represents a sequence of bytes.
Binary values can be encoded in hexadecimal or base 64 notations.
binary = binary-hex / binary-base64
Binary value sequences begin with a token representing the encoding, followed by the encoded value delimited by opening parenthesis and closing parenthesis characters.
binary-hex = %s"b16(" *HEXDIG ")"
binary-base64 = %s"b64(" *( ALPHA / DIGIT / "_" / "-" ) ")"
Hexadecimal encoded binary values use the encoding token b16
. Encoding is case-insensitive. However, implementations should use lowercase.
In base 64 encoding, the scheme “Base 64 Encoding with URL and Filename Safe Alphabet” defined in RFC 4648. Padding characters must not used. The encoding token is b64
.
Array data type
The array data type represents a list of values.
An array is a structure delimited by opening bracket and closing bracket characters. Values are separated by a comma. A trailing comma is optional.
array = "[" ( array-items / array-empty ) "]"
array-empty = wsc
array-items = value-with-wsc *( "," value-with-wsc) [ "," wsc ]
Object data type
The object data type represents a list of key-value pairs.
An object is a structure delimited by opening curly brace and closing curly brace characters. Key-value pairs are separated by a comma. A trailing comma is optional.
Key-value pairs are a sequence of a THRAY value, colon character, and a THARY value.
object = "{" ( object-pairs / object-empty ) "}"
object-empty = wsc
object-pairs = object-pair * ( "," object-pair ) [ "," wsc ]
object-pair = value-with-wsc ":" value-with-wsc
Implementations may choose to present the object as a list or a map.
Because keys can be any THRAY value, it is important for implementations to carefully consider the conversion process to a map. Implementations may opt to only accept integer and string keys and return an error otherwise. Implementations must return an error if there are duplicate keys by default. Accepting duplicates may introduce security vulnerabilities when implementations use differing deduplication schemes.
Extension data type
The extension data type annotates a value with a tag for use with custom data types. Extensions are delimited by a less-than sign and a greater-than sign characters. The content is a sequence of a tag, colon character, and a value.
Tags are one or more of alphanumerical, underscore, or hyphen characters.
extension = "<" extension-tag ":" value-with-wsc ">"
extension-tag = 1*( ALPHA / DIGIT / "_" / "-" )
Applications should avoid tags consisting of 1 to 3 characters or simple words for private use. These tags are intended for use with well-known standards in subsequent specifications. Applications may opt to use a namespace prefix (such as az-
) to avoid naming conflicts.
Miscellaneous grammar rules
Miscellaneous grammar rules common to various data types:
unicode-char = %x80-10FFFF
Rules not defined in this specification are Core ABNF rules defined in Appendix B.1 in RFC 5234.
Encoding
For data interchange, a THRAY document should be encoded as UTF-8. When using the UTF-8 encoding, a Byte Order Mark (U+FEFF) must not be used at the start of the document.
File registration
For filesystems, the filename extension of thray
is suggested to be used.
For MIME type, there is a possibility of registering application/thray
but this is not guaranteed.
Comments and whitespace
Whitespace is defined to be space or tab characters. Newline (line separator) is defined to be either line feed (Unix style) or {carriage return, line feed} (Windows style) character sequences.
Comments are segments of text that carry no semantic meaning to the represented data. They are used for user documentation purposes and may be present wherever whitespace is permitted.
There are two types of comments: line comments and block comments.
Line comments
Line comments begin with a character sequence of {slash, slash}, followed by zero or more of space, tab, printable ASCII, or Unicode characters.
Block comments
Block comments are delimited by character sequences {slash, asterisk} and {asterisk, slash}. The content consists of zero or more of space, tab, carriage return, line feed, printable ASCII, and Unicode characters (excluding the {asterisk, slash} sequence as defined in the grammar).