The following gist will be helpful for the latter part of this article:
Inspecting ggml_cont
Recently, I’ve been playing around with GGML. While doing so, I was looking through the examples, and I saw this in mnist_common.cpp:
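(Paraphrased from memory; ctx, conv_out, n_dense_in, and n_batch stand in for the variable names actually used in the example.)

```cpp
// Flatten the convolution output into the input of the fully-connected layer.
// ctx, conv_out, n_dense_in, and n_batch are placeholders for the example's real names.
ggml_tensor * dense_in = ggml_reshape_2d(ctx,
        ggml_cont(ctx, ggml_permute(ctx, conv_out, 1, 2, 0, 3)),
        n_dense_in, n_batch);
```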
This was on line 362. It preceded a dense matrix multiplication and addition for a fully-connected layer. It’s pretty clear what ggml_reshape_2d does. ggml_permute was a little confusing at first, but I found this article that discusses an analogous operation in NumPy that explains what the permutation does. However, ggml_cont was a little bit confusing. In ggml.h, all it says is:
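(Quoting from the version of ggml.h I was reading; newer versions may differ slightly.)

```c
// make contiguous
GGML_API struct ggml_tensor * ggml_cont(
        struct ggml_context * ctx,
        struct ggml_tensor  * a);
```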
Ok, that’s a little vague, but it basically tells us that ggml_cont makes the supplied tensor contiguous in memory. Let’s dig into the code. Looking at ggml.c, ggml_cont just calls ggml_cont_impl, which does the following:
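(Lightly abridged from ggml.c; the exact code depends on the version you are reading.)

```c
static struct ggml_tensor * ggml_cont_impl(
        struct ggml_context * ctx,
        struct ggml_tensor  * a) {
    // duplicate a (same type and shape), then record the CONT op in the graph
    struct ggml_tensor * result = ggml_dup_tensor(ctx, a);
    ggml_format_name(result, "%s (cont)", a->name);

    result->op     = GGML_OP_CONT;
    result->src[0] = a;

    return result;
}
```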
Well, this doesn’t explain much. But it shows us that it duplicates the tensor argument a and then marks this operation as GGML_OP_CONT. Recall that in mnist_common.cpp, this is called before ggml_reshape_2d. Let’s look at that function:
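(Again abridged from ggml.c; the bookkeeping may look slightly different in your version.)

```c
struct ggml_tensor * ggml_reshape_2d(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        int64_t               ne0,
        int64_t               ne1) {
    // precondition: a must be contiguous and the element count must match
    GGML_ASSERT(ggml_is_contiguous(a));
    GGML_ASSERT(ggml_nelements(a) == ne0*ne1);

    const int64_t ne[2] = { ne0, ne1 };
    // note that a is passed to ggml_new_tensor_impl as the view source
    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 2, ne, a, 0);
    ggml_format_name(result, "%s (reshaped)", a->name);

    result->op     = GGML_OP_RESHAPE;
    result->src[0] = a;

    return result;
}
```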
This function has a precondition that a must be contiguous, which it checks with ggml_is_contiguous. Following a sequence of nested calls, we see that ggml_is_contiguous_n is the function that ultimately does the check. Now, let’s look at that:
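(Reproduced roughly from ggml.c; treat this as a sketch of the logic rather than the exact code.)

```c
static bool ggml_is_contiguous_n(const struct ggml_tensor * tensor, int n) {
    size_t next_nb = ggml_type_size(tensor->type);
    if (tensor->ne[0] != ggml_blck_size(tensor->type) && tensor->nb[0] != next_nb) {
        return false;
    }
    next_nb *= tensor->ne[0]/ggml_blck_size(tensor->type);
    for (int i = 1; i < GGML_MAX_DIMS; i++) {
        if (tensor->ne[i] != 1) {
            if (i > n) {
                // dimensions above n must pack tightly against the previous one
                if (tensor->nb[i] != next_nb) {
                    return false;
                }
                next_nb *= tensor->ne[i];
            } else {
                // dimensions up to n do not need to be contiguous
                next_nb = tensor->ne[i]*tensor->nb[i];
            }
        }
    }
    return true;
}
```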
Reading through this, we can finally see what is happening. GGML checks that every dimension of order n or above is contiguous (here, tensor->nb is the “stride” and tensor->ne holds the shape of tensor). The function ggml_is_contiguous effectively calls ggml_is_contiguous_n(tensor, 0), basically checking that adjacent elements are, in fact, contiguous in memory.
Using ggml_cont in a toy program
You might think this seems a little arduous. From the comment above ggml_cont’s declaration that said “make contiguous”, you could surmise that, indeed, ggml_cont makes all the adjacent elements contiguous. But now let’s examine ggml_cont’s behavior empirically (the way I did it the first time, before I looked at the source code). I wrote a little program, and inside of main, I had this:
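(A condensed sketch: the context setup and the exact permute arguments are reconstructed, but the tensor names match what I describe below.)

```c
    // set up a small context with a real memory buffer (no_alloc = false)
    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // the integers 0..15 as floats
    float data[16];
    for (int i = 0; i < 16; i++) {
        data[i] = (float) i;
    }

    // store them in a 1-D GGML tensor
    struct ggml_tensor * tensor = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 16);
    memcpy(tensor->data, data, ggml_nbytes(tensor));

    // reshape to 2 x 2 x 4 x 1, then permute the axes to get 4 x 2 x 2 x 1
    struct ggml_tensor * t          = ggml_reshape_4d(ctx, tensor, 2, 2, 4, 1);
    struct ggml_tensor * permuted_t = ggml_permute(ctx, t, 1, 2, 0, 3);

    print_tensor(t);
    print_tensor(permuted_t);
```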
Note: print_tensor is available in the gist at the start of the article. It does exactly what you think it does.
This doesn’t do anything with ggml_cont just yet. It initializes an array with integers between 0 and 15, inclusive. Then, it stores it in a GGML tensor. I then reshape tensor to get a $2 \times 2 \times 4 \times 1$ tensor called t. Then, I permute the dimensions of this tensor to get a $4 \times 2 \times 2 \times 1$ tensor called permuted_t. This is pretty simple, and printing t and permuted_t shows exactly what you would expect: the values 0 through 15, laid out according to each tensor’s shape.
An unexpected turn
So far, so good. Now, I apply ggml_cont and then reshape the tensor to be $8 \times 2$:
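(The names cont_t and reshaped_t are just what I call the intermediate results here.)

```c
    // make the permuted tensor contiguous, then flatten it to 8 x 2
    struct ggml_tensor * cont_t     = ggml_cont(ctx, permuted_t);
    struct ggml_tensor * reshaped_t = ggml_reshape_2d(ctx, cont_t, 8, 2);

    print_tensor(reshaped_t);
```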
Based on what we saw before, this should print the same sixteen values, just laid out as an $8 \times 2$ tensor. But in reality, the printed tensor contains none of the original data.
What’s going on? We called reshape_4d and it was fine, so clearly the issue is with ggml_cont. I was stumped for a bit, so I asked DeepSeek. It gave me some verbose output, most of which was useless. But there was one part of the response that solved half the puzzle:
Tensor not actually computed yet: GGML uses a graph-based approach where operations are only computed when needed.
Of course! When you create a tensor in GGML (like when you call ggml_new_tensor_1d), the tensor does not hold any data yet. You are building the computational graph, i.e., the sequence of operations that you want GGML to perform. To actually compute the answer, you need to allocate the computational graph and compute it, as follows:
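(In my program, that looks roughly like this; reshaped_t is the tensor from the sketch above.)

```c
    // build the graph that produces reshaped_t, then actually run the computation
    struct ggml_cgraph * graph = ggml_new_graph(ctx);
    ggml_build_forward_expand(graph, reshaped_t);
    ggml_graph_compute_with_ctx(ctx, graph, /*n_threads=*/ 1);

    print_tensor(reshaped_t);
```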
Then you get the desired output: all sixteen values, printed in the expected $8 \times 2$ layout.
Glass half empty
But this still leaves one problem. How come ggml_reshape_4d and ggml_permute worked before we allocated the computational graph? This is where our inspection of the source code pays off. If you look at ggml_reshape_2d again:
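The important line is the one that creates the result (quoting the sketch from earlier):

```c
    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, a->type, 2, ne, a, 0);
```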
You can see that ggml_new_tensor_impl is called, and the tensor a is passed to it. Therefore, when reshaping, the data in a is passed along to result. So, when we get the result of reshape_4d (or reshape_2d or reshape_3d), it contains the original data from a, with the only difference being the shape of the tensor. This explains why we were able to print the result of reshape_4d. A similar situation holds for ggml_permute, except it calls ggml_view_tensor on a.
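For reference, ggml_view_tensor looks roughly like this (again paraphrased from ggml.c):

```c
struct ggml_tensor * ggml_view_tensor(
        struct ggml_context * ctx,
        struct ggml_tensor  * src) {
    // the view is created with src as its source, so it shares src's data
    struct ggml_tensor * result = ggml_new_tensor_impl(ctx, src->type, GGML_MAX_DIMS, src->ne, src, 0);
    ggml_format_name(result, "%s (view)", src->name);

    // and it inherits src's strides
    for (int i = 0; i < GGML_MAX_DIMS; i++) {
        result->nb[i] = src->nb[i];
    }

    return result;
}
```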
Now, let’s look at ggml_cont_impl again. We see that it calls ggml_dup_tensor, which is implemented as follows:
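(Paraphrased from ggml.c.)

```c
struct ggml_tensor * ggml_dup_tensor(struct ggml_context * ctx, const struct ggml_tensor * src) {
    // same type and shape as src, but no view source and none of src's data
    return ggml_new_tensor(ctx, src->type, GGML_MAX_DIMS, src->ne);
}
```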
So this calls ggml_new_tensor, but this time src is not passed in as a view source. Indeed, ggml_new_tensor just creates a new tensor of the same type and dimensions as src, but it does not supply any of the data contained in src itself. The result is that we get an empty tensor with the same dimensions as src, whose adjacent elements are contiguous in memory, but which does not contain any data. This solves the second half of the puzzle: we didn’t allocate the computational graph, and we cannot trust – without inspecting the source code – that an operation like ggml_cont, ggml_reshape_2d, or ggml_permute will execute the desired transformation before the computational graph is allocated and computed.
Conclusion
TLDR: Allocate and compute the computational graph before inspecting the result of any tensor operation, whether that be ggml_cont, ggml_conv_2d, ggml_pool_2d, etc. Though some of these operations (like reshaping) preserve the original data, others do not execute until the computational graph is built and computed.