Skip to content

moddengine/m-tsv

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

M-TSV: TSV with Types and Metadata

M-TSV is an enhancement of the Tab-Separated Values (TSV) file format. It's fast to parse, concise, streamable, very efficient, and readable by spreadsheet software.

M-TSV files are a tab-separated format where:

  • Each record fits onto exactly one line, with each field separated by a tab character (0x09), and terminated by a line feed (0x0A).
  • All fields have the tab character, line feed, backslash, and carriage return characters escaped using common escape sequences (\t \n \\ \r).
  • File-level and column-level metadata can be included between the header row and the start of the records, or after all the records.
  • Columns have a static type, and types can be declared via column-level metadata for multiple different languages.
  • Support for metadata that can include non-row data for things like source, author, description, date, and software.
  • The filename should end in .m.tsv. This makes it easy to load into software that accepts TSV files, while still indicating the filetype.

This repository includes an encoder and parser in PHP and Typescript (javascript).

M-TSV Example

Id  Code    Price   Added   Description
#\M  Title   Product List
#\M  Creation-Date 2000-12-31T23:59:59.000000Z
#\M  Generator   ShopCatalogue/1.0
#\C A comment that will not be parsed at all
#\F  Type json
int string  string  string  string
#\F  Name variables
id  code    price   date_added  desc
10  prod-1  $ 10.54 2000-12-31  Product #1\nFirst Product
20  prod-2  $ 5.56  2001-01-01  Product #2
...
#\M  Time    Generated in 0.023s

Comparisons

JSON

Using JSON can be great for achieving data transfer between systems, but row-based data is rather inefficient, as each column heading is included with every row. This makes parsing slower, makes files larger, and has limited support for streaming. M-TSV sends the header only once, is very efficient (no extra quotes & field names), and is very easy to write a streaming parser for.

CSV

Using CSV can offer great interoperability, however there are competing encodings, quoting and escaping are not consistent between vendors, and by its nature, commas are found in lots of typical values. Streaming is possible, however writing a parser is definitely more complicated. There is no standard way to add metadata.

Features & Design Goals

  • Human Readability. The use of plain text and a simple structure can be read just using a text editor.
  • Compatibility. The .m.tsv file extension and the TSV-like structure ensure backward compatibility, making it easy to read with all popular spreadsheet software, with the only caveat being that Metadata would also be displayed.
  • Easy to Parse. Simple escape rules (\\, \t, \n, \r, \0), and clear metadata markers (#\M, #\F, and #\C) make parsing straightforward.
  • Self Documenting. Files can document types, intent, source, and description, making it easy to identify why, when, and how the file was created.
  • Fast. Fast to generate and parse, with minimal bytes wasted on field names and quotation marks.
  • Extensible. Metadata for both the document and the columns.
  • Flexible. Metadata can be added at the start or end of records, allowing for post-generation metadata to be added without file seeking.

M-TSV File Structure

The file is encoded as UTF-8 and has 4 distinct sections:

  • Field Headings. The first row of the file is dedicated to the human-readable field names, encoded as a single TSV record. This line defines the column count for all subsequent Data Area records.
  • Metadata Area. This optional section contains metadata records. It can appear before the Data Area and/or after.
  • Data Area. A line for each record with field values encoded as TSV.
  • Additional Metadata (optional). Additional document metadata and/or comments can appear after the data area.

Metadata Parsing

Lines beginning with the #\ sequence are metadata records. They are parsed as independent TSV records, and their field count does not need to match the Field Headings column count.

  • Document Metadata (#\M). A TSV record of three fields: the constant #\M, a Name string, and a Value string.
  • Column Metadata (#\F). A pair of lines.
    1. The first line is a TSV record of three fields: the constant #\F, a metadata Name (e.g., 'Type', 'Name'), and a metadata Value (e.g., 'json', 'variables').
    2. The subsequent line must be a TSV record containing a value for each data column, corresponding to the metadata defined in the line above. Its field count must match the Field Headings count.
  • Comment (#\C). A TSV record of two fields: the constant #\C, and a string containing the comment text, which is ignored by the parser.

TSV Encoding

Replace all of the following characters within a field with their corresponding escape sequences:

  • Tab (0x09) with the string \t
  • Null (0x00) with the string \0
  • New line (0x0a) with the string \n
  • Carriage return (0x0d) with the string \r
  • Backslash (0x5C) with the string \\

Each record is represented as individual fields, encoded as above, separated by tab characters (0x09) and ending with a new line character (0x0a).

If binary data needs to be stored, it should use an encoding like base64.

Data Record Rules

  • Empty Field: An empty string is represented by two consecutive tabs (... \t \t ...).
  • Column Mismatch: If a data record has fewer fields than the header, the missing fields should be treated as empty strings. If it has more fields, the extra fields should be ignored.
  • Null Values: The format does not have an explicit representation for NULL distinct from an empty string. This should be handled at the application level.

Standard Metadata (#\M)

Standard Metadata Field Names:

  • Title A descriptive title of the document.
  • Description A description of the file contents.
  • Creation-Date ISO8601 Date/Time (preferably in UTC) of when the file was created.
  • Modified-Date ISO8601 Date/Time (preferably in UTC) of when the file was updated.
  • Author The author of the file, e.g., "First Last [email protected]". Use multiple Author records for multiple authors.
  • Generator An identifier for the software that created the file.
  • Cursor A cursor value if the file is a partial result, used to continue fetching.

Names should be cased like HTTP Header Names.

Field Metadata (#\F)

Standard Field Metadata Name/Values:

  • Type Static field types, defined by a type system, e.g.:

    • json - JSON types (number, string, bool, buffer)
    • sql - SQL types (e.g., unsigned int, decimal, datetime, text, varchar(200))
    • C - C field types (e.g., int32, float, bool, ...)
  • Name Alternate field names for different uses, e.g.:

    • variables - standard variable names
    • sql - SQL field names

Other field metadata can be included with any (Name, Value) pairs as required.

About

TSV with Metatadata

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published