The first step in any morloc project is to describe the relevant data. In this post, I will describe how a set of training and testing data for a machine learning project can be typed in morloc. This post will introduce the concept of dimensional typing, which is not yet implemented in morloc, so the morloc code in this post will not currently run. Instead, the focus of this post is to describe the features I want to build into morloc in the near future.
As a concrete example of a training/testing dataset, I will use the MNIST collection of hand-drawn digits. This collection consists of 60000 28x28 pixel training images and 10000 28x28 pixel testing images. All images are grayscale, with values between 0 and 255 representing intensity (from black to white).
We can represent MNIST as a type with 4 parameters. This type captures the top-level shape of the data (a pair of training and testing pairs):
type (MNIST x_training y_training x_testing y_testing) =
((x_training, y_training), (x_testing, y_testing))
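As a point of comparison, this parameterized type can be written directly in an existing language. Here is a Haskell rendering (only a stand-in for the morloc syntax above, with hypothetical parameter names):

-- A direct Haskell analogue: the four type parameters capture only
-- the top-level pair-of-pairs shape of the dataset.
type MNIST xTrain yTrain xTest yTest =
  ((xTrain, yTrain), (xTest, yTest))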
To capture the shape one layer deeper, we can replace the type parameters with a list of lists of lists of integers (awkward) for the data and a list of integers for the labels:
type MNIST = (([[[Int]]], [Int]), ([[[Int]]], [Int]))
Where [a] represents an ordered list of elements of type a. Besides being awkward, the [] notation does not capture the rectangular shape of the data. We can replace the inner lists of lists with Matrix Int, as below:
type MNIST = (([Matrix Int], [Int]), ([Matrix Int], [Int]))
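A Matrix type only really captures rectangularity if its constructor enforces it. As a minimal sketch, assuming a plain list-of-lists representation (the Matrix and mkMatrix names here are hypothetical, not from any particular library):

-- A hypothetical Matrix wrapper over lists of lists. The smart
-- constructor rejects ragged input, so any value of type Matrix a
-- is guaranteed to be rectangular.
newtype Matrix a = Matrix [[a]]

mkMatrix :: [[a]] -> Maybe (Matrix a)
mkMatrix rows
  | allEqual (map length rows) = Just (Matrix rows)
  | otherwise                  = Nothing
  where
    allEqual []     = True
    allEqual (x:xs) = all (== x) xs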
This type still does not capture the dependency between the data and label dimensions: there must be exactly as many labels as images. So we can extend the type with explicit dimensional information:
type MNIST =
  ( (Tensor_{60000,28,28} Int, Tensor_{60000} Int)
  , (Tensor_{10000,28,28} Int, Tensor_{10000} Int))
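Types along these lines can already be approximated in languages with type-level naturals. Below is a minimal Haskell sketch using GHC's DataKinds extension, where a hypothetical Tensor type carries its dimensions in the type and a flat list serves as placeholder storage:

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}

import GHC.TypeNats (Nat)

-- A hypothetical tensor type whose dimensions live in the type;
-- the flat list is just placeholder storage.
newtype Tensor (dims :: [Nat]) a = Tensor [a]

type MNIST =
  ( (Tensor '[60000, 28, 28] Int, Tensor '[60000] Int)
  , (Tensor '[10000, 28, 28] Int, Tensor '[10000] Int) )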
The addition of dimensions to the type allows the dimensionality of the program to be typechecked at compile time and serves as machine- and human-readable documentation. The dimensions also allow runtime validation of input data.
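Runtime validation amounts to comparing the observed lengths of the input against the dimensions promised by the type. Here is a sketch in Haskell, assuming raw nested-list input (the validMNIST name and the input format are illustrative):

-- Check raw nested-list MNIST input against the dimensions promised
-- by the type: 60000 and 10000 images of 28x28 pixels, with exactly
-- one label per image.
validMNIST :: (([[[Int]]], [Int]), ([[[Int]]], [Int])) -> Bool
validMNIST ((xTrain, yTrain), (xTest, yTest)) =
     imagesOK 60000 xTrain && length yTrain == 60000
  && imagesOK 10000 xTest  && length yTest  == 10000
  where
    imagesOK n imgs =
      length imgs == n &&
      all (\img -> length img == 28 && all ((== 28) . length) img) imgs

In morloc, a check like this could in principle be derived from the type itself rather than written by hand.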
One further step we could take would be to replace the Tensor type by generalizing the [] notation to arbitrary dimensions, as shown below:
type MNIST =
  ( ([Int]_{n=60000,28,28}, [Int]_n)
  , ([Int]_{m=10000,28,28}, [Int]_m))
I am also introducing dimension variables (m
and n
). Note that all indexing
expressions in a given type signature are in the same scope. They form a small,
integer-based language nested inside the larger type language. Repeated
dimensions could be removed, though in this case it doesn’t improve
readability:
type MNIST =
  ( ([Int]_{n=60000, x=28, y=x}, [Int]_n)
  , ([Int]_{m=10000, x, y}, [Int]_m))
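Dimension variables have a rough parallel in ordinary parametric types: reusing a type variable for two lengths forces those lengths to agree. A Haskell sketch with a hypothetical length-indexed Vec type (again under DataKinds):

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}

import GHC.TypeNats (Nat)

-- Hypothetical length-indexed vector: the length n is part of the type.
newtype Vec (n :: Nat) a = Vec [a]

-- Reusing n (and m) forces the data and label counts to agree,
-- much like the n and m dimension variables above.
type MNIST n m =
  ( (Vec n (Vec 28 (Vec 28 Int)), Vec n Int)
  , (Vec m (Vec 28 (Vec 28 Int)), Vec m Int) )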
In the case of the MNIST data, we already know its dimensionality. But if we want a description of learning data that can be reused in many contexts, we can generalize the type as follows:
type (MLData cell label) =
( ([cell]_{n, x...}, [label]_n)
, ([cell]_{m, x...}, [label]_m))
Where n and m represent the number of training and testing objects, respectively; cell and label represent the generic data and label types; and x... represents the dimensionality of the data. This general machine learning input data type requires that the training and test data have the same form and dimensionality, and requires exactly one label for each data object. The MNIST type is a specialization of this more general MLData type.
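This generalization can also be sketched in Haskell, with a type-level list of dimensions playing the role of x... (Tensor and Vec are the same placeholder types as before):

{-# LANGUAGE DataKinds #-}
{-# LANGUAGE KindSignatures #-}
{-# LANGUAGE TypeOperators #-}

import GHC.TypeNats (Nat)

newtype Tensor (dims :: [Nat]) a = Tensor [a]  -- placeholder tensor
newtype Vec (n :: Nat) a = Vec [a]             -- placeholder vector

-- dims plays the role of x...: it is shared between the training and
-- testing data, so both must have the same shape beyond the first axis,
-- while n and m count the training and testing objects.
type MLData (dims :: [Nat]) cell label n m =
  ( (Tensor (n ': dims) cell, Vec n label)
  , (Tensor (m ': dims) cell, Vec m label) )

Under this sketch, the concrete MNIST type above is just MLData '[28, 28] Int Int 60000 10000.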
There is more type information we could encode in the MNIST type, such as constraints on the allowed data values (0-255) and label values (0-9). I will discuss constraints later in this post. We might also want to layer an ontology over the data, for example stating that the input matrices are of the logical type "GrayscaleImage", which is itself a term in a broader ontology. Mapping morloc types into deep ontological frameworks is of great importance to the morloc ecosystem, but I will leave this discussion to a future post.