Subtext Graph Specification

Version: 0.1
Author: Sebastian

Abstract

This document describes a specification for storing a knowledge graph based on the Subtext markup language in a file system. Applications adhering to this specification are able to read and manipulate Subtext graphs in a conformant way. The specification includes a directory structure and a file format.

Introduction and motivation

Out of discontent with existing note-taking apps, the author started working on the note-taking application NENO (acronym for "network of notes") in early 2020. Please read the design principles of NENO, as they have a significant influence on this specification. Since the author did not want to create just another app, but an open and interoperable knowledge management system independent of single apps, the desire for an open specification arose. This also serves the goal of having all the de facto rules of NENO not merely documented in, and scattered throughout, the code, but collected in a normative ruleset.

Specification requirement levels

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.

Graph directory

A graph directory is a directory on a file system that contains a Subtext graph.

Subtext Graph

A Subtext graph consists of Subtext graph files and arbitrary graph files in a graph directory and its subdirectories. To obtain a representation of the full graph, implementations MUST collect and parse all Subtext graph files in the graph directory and its subdirectories.

Subtext graph file

A Subtext graph file is a UTF-8-encoded text file with the filename extension .subtext.

The Subtext graph file's name consists of its slug, followed by the filename extension.

This section is non-normative.

Note: The name of a Subtext graph file looks like this: <slug>.subtext (replace <slug> with the actual slug)

This section is non-normative.

Note: The author has considered using a filename extension other than .subtext for Subtext graph files (e.g. .subg, .sugra, or .neno), to distinguish Subtext graph files from files that just contain Subtext markup language. This would also promote the fact that Subtext graph files could contain content with a different type than Subtext markup. However, that would introduce a breaking change in how .subtext files are currently being used by several active users.

A Subtext graph file can contain zero or more headers, followed by an optional content section.

If a Subtext graph file has at least one header and a content section, the header section and the content section MUST be separated with an empty line.

If a Subtext graph file has no headers but a content section, the Subtext graph file MUST start with the content section, unless the lines from the beginning of the content section until the first empty line or EOF could be parsed as valid headers. In that case, implementations MUST insert a header section with an empty header at the beginning of the Subtext graph file.

This section is non-normative.

Note: A content section whose first lines could be parsed as a spec-compliant header section should be a very rare case. Here is an example of what the file has to look like in that case:

::

:this-is:the-content
:section:

If a Subtext graph file contains at least one header but no content section, the file ends with the line of the last header.

This section is non-normative.

Note: A Subtext graph file can have an empty content section, which is different from having no content section at all. If there is an empty content section, the file ends with two newline characters.

The default content type of a Subtext graph file is text/vnd.subtext.

Implementations MUST NOT add a newline character at the end of the file.

This section is non-normative.

Note: However, a file can still end with a newline character; it is then considered part of the content section.

Header

A header section can be at the beginning of a Subtext graph file and consists of one or more line-separated headers. A header has the following format:

:<KEY>:<VALUE>
where <KEY> is the header key and <VALUE> is the header value.

A header key has a minimum length of 0 characters, and a maximum length of 200 characters. Allowed are all Unicode characters, except for a colon (:) and the newline character. Implementations SHOULD use only lower-case letters and hyphens (-) in a header key.

A header value has a minimum length of 0 characters. There is no maximum length. Allowed are all Unicode characters, except for the newline character.

Empty header

An empty header key is a header key with zero characters. Empty header keys are reserved for special use.

An empty header has an empty header key and a value with zero characters (::).
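
This section is non-normative.

Note: The file layout and header format described so far can be sketched as a small parser and serializer. This is only an illustration under stated assumptions, not a reference implementation: the names Header, GraphFile, parseGraphFile and serializeGraphFile are hypothetical, "the first empty line" is taken to mean the first occurrence of two consecutive newline characters, and a content value of null stands for "no content section".

```typescript
// Hypothetical names; the specification does not define an API.
type Header = { key: string; value: string };
type GraphFile = { headers: Header[]; content: string | null };

// ":<KEY>:<VALUE>": the key has no colon and at most 200 code points.
const HEADER_LINE = /^:([^:\n]{0,200}):([^\n]*)$/u;

// Returns the headers if every line of `block` is a valid header, else null.
const parseHeaderBlock = (block: string): Header[] | null => {
  const headers: Header[] = [];
  for (const line of block.split("\n")) {
    const match = HEADER_LINE.exec(line);
    if (match === null) return null;
    headers.push({ key: match[1], value: match[2] });
  }
  return headers;
};

const parseGraphFile = (raw: string): GraphFile => {
  const text = raw.replace(/\r/g, ""); // \r characters are ignored
  if (text === "") return { headers: [], content: null };
  const separator = text.indexOf("\n\n"); // assumption: the first empty line
  const block = separator === -1 ? text : text.substring(0, separator);
  const headers = parseHeaderBlock(block);
  // No valid header section: the whole file is the content section.
  if (headers === null) return { headers: [], content: text };
  const content = separator === -1 ? null : text.substring(separator + 2);
  return { headers, content };
};

const serializeGraphFile = ({ headers, content }: GraphFile): string => {
  const lines = headers.map(({ key, value }) => `:${key}:${value}`);
  if (content === null) return lines.join("\n"); // ends with the last header
  if (lines.length === 0) {
    // Content that would be mistaken for headers needs a leading empty header.
    const headerLike = parseGraphFile(content).headers.length > 0;
    if (!headerLike) return content;
    lines.push("::");
  }
  return `${lines.join("\n")}\n\n${content}`; // no newline is appended
};
```

With these assumptions, serializing a file without headers whose content is the two lines :this-is:the-content and :section: yields the example file shown earlier, starting with the empty header (::).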

Canonical header

Implementations SHOULD include the following header when creating a Subtext graph file:
Key: created-at, value: An ISO 8601 timestamp of the time of the creation of this file

Implementations SHOULD include the following header when creating a Subtext graph file:
Key: updated-at, value: An ISO 8601 timestamp of the last time the file's name, any header, or the content section of the file has been changed.

If a Subtext graph file has an updated-at header, implementations SHOULD update the ISO 8601 timestamp in the header value each time the file's name, any header, or the content section of the file has been changed.

If the Subtext graph file has a content section with the default content type, implementations MAY include the following header when creating the Subtext graph file:
Key: content-type, value: The MIME type of the content

If the Subtext graph file has a content section with a content type other than the default, implementations MUST include the content-type header.

This section is non-normative.

Note: Example structure of a header section for a Subtext graph file with content.

:created-at:<ISO timestamp>
:updated-at:<ISO timestamp>
:content-type:<MIME type>

Structure of a header section for a Subtext graph file with content with example values:

:created-at:2024-09-29T19:22:43+02:00
:updated-at:2024-09-29T19:22:43+02:00
:content-type:text/vnd.subtext
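
This section is non-normative.

Note: The following sketch shows one way an implementation might produce and maintain the canonical headers. The helper names (headersForNewFile, touchUpdatedAt) and the Header type are hypothetical. The sketch uses Date.prototype.toISOString, which yields an ISO 8601 timestamp in UTC (ending in Z) rather than one with a local offset as in the example values above; both forms are ISO 8601.

```typescript
// Hypothetical types and helper names, for illustration only.
type Header = { key: string; value: string };

const DEFAULT_CONTENT_TYPE = "text/vnd.subtext";

// Headers for a newly created Subtext graph file.
const headersForNewFile = (
  now: Date,
  contentType: string = DEFAULT_CONTENT_TYPE,
): Header[] => {
  const timestamp = now.toISOString();
  const headers: Header[] = [
    { key: "created-at", value: timestamp },
    { key: "updated-at", value: timestamp },
  ];
  // The content-type header is required only for non-default content types.
  if (contentType !== DEFAULT_CONTENT_TYPE) {
    headers.push({ key: "content-type", value: contentType });
  }
  return headers;
};

// Refreshes updated-at, but only if the file already carries that header.
const touchUpdatedAt = (headers: Header[], now: Date): Header[] =>
  headers.map((header) =>
    header.key === "updated-at"
      ? { key: header.key, value: now.toISOString() }
      : header,
  );
```

An implementation would call touchUpdatedAt whenever the file's name, any header, or the content section changes.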

Content section

The content section can include any Unicode code points and MUST contain content of the Subtext graph file's content-type.

Slug

A slug is a string that identifies a Subtext graph file.

An implementation that wants to create a Subtext graph file MUST assign it a slug that is unique in the Subtext graph.

An implementation SHOULD map a slug to a file path relative to the graph directory when storing or retrieving the Subtext graph file for a specific slug. Slashes (/) in the slug SHOULD be interpreted as directory separators, so that a Subtext graph file that has a slug with a slash in it will be placed in a subdirectory of the graph directory.
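
This section is non-normative.

Note: The mapping between slugs and file paths can be sketched as two small functions. The function names are hypothetical, paths are assumed to be relative to the graph directory, and backslashes are normalized only as a convenience for paths obtained on Windows. The sketch does not validate the resulting slug.

```typescript
// Hypothetical helpers; paths are relative to the graph directory.
const SUBTEXT_EXTENSION = ".subtext";

// Relative path of the Subtext graph file for a slug.
// Slashes in the slug become directory separators.
const relativePathFromSlug = (slug: string): string =>
  `${slug}${SUBTEXT_EXTENSION}`;

// Slug for a relative path, or null if the path does not name
// a Subtext graph file.
const slugFromRelativePath = (relativePath: string): string | null => {
  const normalized = relativePath.replace(/\\/g, "/");
  if (!normalized.endsWith(SUBTEXT_EXTENSION)) return null;
  const slug = normalized.slice(0, -SUBTEXT_EXTENSION.length);
  return slug.length > 0 ? slug : null;
};
```

For example, the slug people/alice maps to the file alice.subtext in the subdirectory people of the graph directory.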

Slug syntax

A slug MUST have a length of at least 1 and at most 200 Unicode code points.

This section is non-normative.

Note: Windows systems can handle up to 255 chars in a filename, but we truncate at 200 to leave a bit of room for the filename extension and possible future prefixes and suffixes.

A slug MUST match this ECMAScript regular expression:

/^[\p{L}\p{M}\d_][\p{L}\p{M}\d\-._]*((?<!\.)\/[\p{L}\p{M}\d\-_][\p{L}\p{M}\d\-._]*)*$/u

A slug MUST NOT contain two dots in direct succession (..).

A slug MUST NOT start or end with a dot (.).

A slug MUST NOT start with a dash (-).

A slug contains one or more slug segments. The segments can be obtained by separating the slug at every slash (/).

A slug segment MUST NOT start or end with a dot (.).

A slug segment MUST NOT start with a dash (-).

This section is non-normative.

Note: Examples of valid slugs:

foo
foo/bar
f-o-o/b-a-r
f/o/o/b/a/r
foo/bar.png

Examples of invalid slugs:

/foo
foo/
.foo
foo.
foo./bar
foo/.bar
-foo
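
A validator combining the syntax rules above might look like the following sketch. The function name isValidSlug is hypothetical. Note that the regular expression alone is not sufficient: it accepts a trailing dot (foo.), two dots in succession (foo..bar), and a dash at the start of a later segment (foo/-bar), so the separately stated rules are checked explicitly.

```typescript
// The regular expression is copied from this specification.
const SLUG_PATTERN =
  /^[\p{L}\p{M}\d_][\p{L}\p{M}\d\-._]*((?<!\.)\/[\p{L}\p{M}\d\-_][\p{L}\p{M}\d\-._]*)*$/u;

// Hypothetical helper: checks the slug syntax rules of this section.
const isValidSlug = (slug: string): boolean => {
  // Length is counted in Unicode code points, not UTF-16 code units.
  const length = Array.from(slug).length;
  if (length < 1 || length > 200) return false;
  if (!SLUG_PATTERN.test(slug)) return false;
  if (slug.includes("..")) return false;
  // Per-segment rules: no leading or trailing dot, no leading dash.
  return slug.split("/").every(
    (segment) =>
      !segment.startsWith(".") &&
      !segment.endsWith(".") &&
      !segment.startsWith("-"),
  );
};
```

The sketch covers syntax only; it does not check for lower-case letters or for uniqueness within the graph.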

Implementations MUST only use lower-case letters in a slug.

Implementations MUST disallow the usage of dots (.) in slugs that do not point towards arbitrary graph files.

Alias

An alias is a slug that points to another slug.

To create an alias, an implementation MUST create a Subtext graph file whose slug is the alias.

This Subtext graph file MUST have a header with the key alias-of, and the value of the target slug.

A Subtext graph file that contains a header with the key alias-of SHOULD have no content section.
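
This section is non-normative.

Note: A minimal sketch of creating and resolving an alias follows. The names are hypothetical, and the in-memory representation (a map from slug to a map of header keys and values) is an assumption of this sketch. This specification does not state whether an alias may point to another alias, so the sketch resolves a single step only.

```typescript
// Assumption: headers of each Subtext graph file, indexed by slug.
type HeadersBySlug = Map<string, Map<string, string>>;

// Text of an alias file: a single alias-of header and no content section.
const createAliasFileText = (targetSlug: string): string =>
  `:alias-of:${targetSlug}`;

// Returns the slug that `slug` points to, or `slug` itself if the
// corresponding file has no alias-of header (or does not exist).
const resolveAlias = (slug: string, graph: HeadersBySlug): string =>
  graph.get(slug)?.get("alias-of") ?? slug;
```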

Arbitrary graph file

An arbitrary graph file is a file with a different filename extension than the one of a Subtext graph file. An arbitrary graph file is part of the Subtext graph.

Implementations MUST store an arbitrary graph file inside the graph directory or one of its subdirectories.

A Subtext graph file MUST be created to accompany and point towards the arbitrary graph file. This Subtext graph file MUST be stored in the same directory as the arbitrary graph file.

This section is non-normative.

Note: Technically, the Subtext graph file acts in this case as a sidecar file.

The arbitrary graph file's name MAY be the same as the slug of its accompanying Subtext graph file.

The arbitrary graph file's name SHOULD be the same as the last slug segment of the slug of its accompanying Subtext graph file.

This section is non-normative.

Note: It might be easier for implementations to enforce that the arbitrary graph file's name is the same as the last segment of its slug, because they can then derive the filename from the slug and do not need to keep track of possible filename collisions in addition to slug collisions.

This section is non-normative.

Note: Example filenames for Subtext graph files that point to arbitrary graph files are: song.mp3.subtext pointing towards song.mp3, good-movie.subtext pointing towards movie-1234.mp4

To identify a Subtext graph file as a file that accompanies an arbitrary graph file, the Subtext graph file MUST have a header with the key file and the arbitrary graph file's name as value.

A Subtext graph file with a header with the key file MUST NOT have a content section.

This section is non-normative.

Note: Example header in an accompanying graph file:
:file:song.mp3

A Subtext graph file that has a header with the key file MUST also have a header with the key size and as value the size of the arbitrary graph file in bytes.

If a Subtext graph file has a file header but no size header, an implementation MUST ignore this Subtext graph file and the arbitrary graph file it points towards.
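
This section is non-normative.

Note: The sidecar rules above can be sketched as follows. The function names and the Header type are hypothetical, and writing the size as a decimal integer is an assumption, since this specification only says that the value is the size in bytes.

```typescript
// Hypothetical types and helper names, for illustration only.
type Header = { key: string; value: string };

// Text of an accompanying Subtext graph file: headers only,
// no content section and no trailing newline.
const createSidecarText = (filename: string, sizeInBytes: number): string =>
  [`:file:${filename}`, `:size:${sizeInBytes}`].join("\n");

// A file header without a size header means that the Subtext graph file
// and the arbitrary graph file it points towards are ignored.
const shouldIgnoreSidecar = (headers: Header[]): boolean => {
  const has = (key: string): boolean =>
    headers.some((header) => header.key === key);
  return has("file") && !has("size");
};
```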

Creating a slug and a normalized filename for an arbitrary graph file

If an implementation wants to include an arbitrary graph file in the graph, the implementation MAY use the following algorithm to derive a slug and a normalized filename from the arbitrary graph file's original name:

/*********************
  Helper functions
*********************/

const getExtensionFromFilename = (filename: string): string | null => {
  const posOfDot = filename.lastIndexOf(".");
  if (posOfDot === -1) {
    return null;
  }

  const extension = filename.substring(posOfDot + 1).toLowerCase();
  if (extension.length === 0) {
    return null;
  }

  return extension;
};


const removeExtensionFromFilename = (filename: string): string => {
  const posOfDot = filename.lastIndexOf(".");
  if (posOfDot === -1) {
    return filename;
  }

  return filename.substring(0, posOfDot);
};

const sluggifyFilename = (filename: string): string => {
  return filename
    // Trim leading/trailing whitespace
    .trim()
    // Remove apostrophes (straight and typographic)
    .replace(/['’]+/g, "")
    // Replace invalid chars with dashes.
    .replace(/[^\p{L}\p{M}\d\-._]+/gu, "-")
    // Replace runs of one or more dashes with a single dash
    .replace(/-+/g, "-")
    // remove initial dot from dotfiles
    .replace(/^\./g, "")
    .toLowerCase()
    // remove leading and trailing dashes
    .replace(/^-+/, "")
    .replace(/-+$/, "");
};

/*********************
  Main function
*********************/

// Assumption: a slug is represented as a plain string.
type Slug = string;

const getSlugAndNameForNewArbitraryFile = (
  namespace: string, // e.g. "files" for "files/image.png"
  originalFilename: string,
  existingSlugs: Set<Slug>,
): { slug: Slug, filename: string } => {
  const extension = getExtensionFromFilename(originalFilename);
  const originalFilenameWithoutExtension = removeExtensionFromFilename(
    originalFilename,
  );
  const sluggifiedFileStem = sluggifyFilename(originalFilenameWithoutExtension);

  let n = 1;

  while (true) {
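    const showIntegerSuffix = n > 1;
    // With an empty stem (e.g. for dotfiles), use the bare number so that
    // the filename does not start with a dash.
    const stemWithOptionalIntegerSuffix = showIntegerSuffix
      ? (sluggifiedFileStem ? `${sluggifiedFileStem}-${n}` : `${n}`)
      : sluggifiedFileStem;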

    const filename = stemWithOptionalIntegerSuffix
    + (
      extension
        ? (
          stemWithOptionalIntegerSuffix
            ? "."
            : ""
        ) + extension.trim().toLowerCase()
        : ""
    );

    const slug: Slug = `${namespace}/${filename}`;

    if (!existingSlugs.has(slug)) {
      return { slug, filename };
    }
    n++;
  }
};

Subtext

Subtext markup, as defined in https://github.com/polyrainbow/subtext/, is the default content type for content sections. Implementations MUST incorporate a Subtext parser to be able to evaluate the edges of the Subtext graph.

Interpretation of slashlinks

Implementations MUST interpret the value of a Subtext slashlink as a slug and, if the entity that the slug refers to exists, interpret this slashlink as an edge of the Subtext graph.

Interpretation of wikilinks

Implementations MUST resolve a Wikilink value to a slug and, if the entity that the slug refers to exists, interpret this Wikilink as an edge of the Subtext graph.

This section is non-normative.

Note: Wikilinks can only point to other notes or aliases, but not arbitrary graph files, because dots are replaced when resolving the slug from the Wikilink value. Dots are a common symbol used in Wikilink texts, so it is not desirable to leave them as-is when resolving a slug.

Wikilink slug resolver algorithm

Implementations MUST use the following algorithm to resolve a slug from a Wikilink value:

/*
  We will replace dots with dashes, as we do not allow
  these chars in note slugs (even though they are generally allowed
  in slugs).
  As a consequence, this means that uploaded files with dots in slugs
  (like `files/image.png`) cannot be referenced via a Wikilink.
  Also, it will replace series of multiple slashes (//, ///, ...)
  with single slashes (/).
  In order to link to nested note slugs, we have to use "//" as separator,
  e.g. [[Person//Alice A.]]
*/
const sluggifyWikilinkText = (text: string): string => {
  return text
    // Trim leading/trailing whitespace
    .trim()
    // Remove apostrophes (straight and typographic)
    .replace(/['’]+/g, "")
    // Replace invalid chars with dashes. Keep / for processing afterwards
    .replace(/[^\p{L}\p{M}\d\-_/]+/gu, "-")
    // replace single slashes
    .replace(/(?<!\/)\/(?!\/)/g, "-")
    // replace multiple slashes (//, ///, ...) with /
    .replace(/\/\/+/g, "/")
    // Replace runs of one or more dashes with a single dash
    .replace(/-+/g, "-")
    .toLowerCase()
    // remove leading and trailing dashes
    .replace(/^-+/, "")
    .replace(/-+$/, "");
};

Additional implementation file

An additional implementation file is a file in the graph directory that is neither a Subtext graph file, nor an arbitrary graph file. Implementations MAY store additional implementation files in the graph directory. Such files MUST NOT have the filename extension .subtext and there MUST NOT be a Subtext graph file with the same slug as the additional implementation file's name.

This section is non-normative.

Note: As an example, an application might want to store the favorite notes of a user inside the graph directory. It could do that by creating a file named favorites.txt. The application then needs to take care that no arbitrary graph file with the same slug is created.
The application might also want to use dotfiles (e.g. .favorites) for this use case. Since dots (.) are disallowed at the beginning of a slug, there is no danger of a collision with a slug.

Newline character

Newline characters are \n characters. \r characters are ignored.

License

CC-BY-SA 4.0