How much does this draft smell?

Hi all, newbie here. I'm trying to implement a small autograd crate; this is what I've come up with so far:

use ndarray::{Array, Array2, Array3, Axis, Dimension, IxDyn};
use num_traits::Zero;
use std::ops::Add;

pub struct Tensor<A, D>
where
    D: Dimension,
{
    // The n-dimensional data itself.
    data: Array<A, D>,
    // Gradient accumulated during backprop; dynamic-dimensional so any shape fits.
    grad: Option<Box<Tensor<A, IxDyn>>>,
    // Closure that, given the gradient of this tensor, fills in the gradients
    // of the two operands that produced it.
    grad_fn: Option<
        Box<
            dyn FnOnce(
                Option<Box<Tensor<A, IxDyn>>>,
                &mut Option<Box<Tensor<A, IxDyn>>>,
                &mut Option<Box<Tensor<A, IxDyn>>>,
            ),
        >,
    >,
    requires_grad: bool,
    is_leaf: bool,
}

impl<A, D> Tensor<A, D>
where
    D: Dimension,
{
    /// Reduces a (possibly broadcast) tensor back to `target`'s shape by
    /// summing over the broadcast axes; the result is dynamic-dimensional.
    pub fn unbroadcast<T: Dimension>(self, target: &Tensor<A, T>) -> Tensor<A, IxDyn>
    where
        A: Clone + Zero + Add<Output = A>,
    {
        let mut unbroadcasted = Tensor {
            data: self.data.into_dyn(),
            grad: None,
            grad_fn: None,
            requires_grad: self.requires_grad,
            is_leaf: self.is_leaf,
        };
        // First sum away any extra leading axes introduced by broadcasting.
        while unbroadcasted.data.ndim() > target.data.ndim() {
            unbroadcasted.data = unbroadcasted.data.sum_axis(Axis(0));
        }
        // Then collapse the axes that were stretched from length 1.
        for ax_descr in target.data.axes() {
            if ax_descr.len() == 1 {
                let axis = ax_descr.axis();
                unbroadcasted.data = unbroadcasted.data.sum_axis(axis).insert_axis(axis);
            }
        }
        unbroadcasted
    }
}

impl<A, D> Clone for Tensor<A, D>
where
    A: Clone,
    D: Dimension,
{
    // NB: only the data is cloned; grad, grad_fn and the flags are reset.
    fn clone(&self) -> Self {
        Self {
            data: self.data.clone(),
            grad: None,
            grad_fn: None,
            requires_grad: false,
            is_leaf: false,
        }
    }
}

impl<T: Dimension + 'static, U: Dimension + 'static, A: Clone + Zero + 'static> Add<&Tensor<A, T>>
    for &Tensor<A, U>
{
    type Output = Tensor<A, U>;
    fn add(self, rhs: &Tensor<A, T>) -> Tensor<A, U> {
        // Element-wise sum; ndarray takes care of broadcasting the operands.
        let new = &self.data + &rhs.data;
        // Keep clones of both operands so the closure below can unbroadcast
        // the incoming gradient back to their shapes.
        let rhs_copy = rhs.clone();
        let self_copy = self.clone();
        Tensor {
            data: new,
            grad: None,
            // Given the gradient of the result, set the gradients of the two
            // addends, reduced back to their original shapes.
            grad_fn: Some(Box::new(move |grad, lhs_grad, rhs_grad| {
                let unwrapped = grad.unwrap();
                let grad_copy = unwrapped.clone();
                *lhs_grad = Some(Box::new(grad_copy.unbroadcast(&self_copy)));
                *rhs_grad = Some(Box::new(unwrapped.unbroadcast(&rhs_copy)));
            })),
            requires_grad: self.requires_grad || rhs.requires_grad,
            is_leaf: false,
        }
    }
}

#[macro_export]
macro_rules! tensor {
    // 3-D literal: tensor!([[1, 2], [3, 4]], [[5, 6], [7, 8]]; requires_grad)
    ($([$([$($x:expr),* $(,)*]),+ $(,)*]),+ $(,)*; $y:expr) => {{
        $crate::Tensor {
            data: $crate::Array3::from(vec![$([$([$($x,)*],)*],)*]),
            grad: None,
            grad_fn: None,
            requires_grad: $y,
            is_leaf: true,
        }
    }};
    // 2-D literal: tensor!([1, 2], [3, 4]; requires_grad)
    ($([$($x:expr),* $(,)*]),+ $(,)*; $y:expr) => {{
        $crate::Tensor {
            data: $crate::Array2::from(vec![$([$($x,)*],)*]),
            grad: None,
            grad_fn: None,
            requires_grad: $y,
            is_leaf: true,
        }
    }};
    // 1-D literal: tensor!(1, 2, 3; requires_grad)
    ($($x:expr),* $(,)*; $y:expr) => {{
        $crate::Tensor {
            data: $crate::Array::from(vec![$($x,)*]),
            grad: None,
            grad_fn: None,
            requires_grad: $y,
            is_leaf: true,
        }
    }};
}

(Playground)
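For reference, inside the crate I currently expect it to be used roughly like this (just a sketch, since nothing is properly pub yet):

let a = tensor!(1.0_f32, 2.0, 3.0; true);          // 1-D tensor
let b = tensor!([1.0_f32, 2.0], [3.0, 4.0]; true); // 2-D tensor
let c = &a + &a; // the result stores a grad_fn closure that will set the addends' gradients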

The thing I'm most concerned about so far is memory usage: is there a way to use less memory? Most importantly, am I thinking about the problem the right way? This is a very early sketch and I've got plenty to learn, so every comment is more than welcome. Thanks in advance for your time.

It's hard to say how good the code is when we don't know what it's supposed to do. I'd start by defining at least part of the public API. Then document that API, at least a little bit. Right now you seem to have some code that can't possibly be used (not enough is pub), which makes it hard to work backwards to figure out what you're trying to accomplish.

I'd also avoid macros to start with. They're hard to read and hard to use. Better to start with an API that uses types, as that's far easier to read and learn.

1 Like

Now that you mention it, it seems reasonable to provide some more information :sweat_smile:.

This code is an extract from a project I'm currently developing. It basically implements a struct, namely Tensor, and the addition operation between Tensors. In order to differentiate w.r.t. the addends, the Tensor resulting from the addition stores a closure whose role is to set the addends' gradients. Currently, as you can see, there are plenty of clone operations in my code, since the closure that implements the differentiation takes and moves the gradients of the addends plus the gradient of the result.

I am wondering if there are other ways of performing this operation without cloning so much.

In the end my aim is to implement the whole backprop algorithm, so, in order to avoid diving nose-first into a big, slow and memory-hungry mess, I'll gladly accept some comments on this idea.

You have a hidden graph (binary tree) of tensors inside your code, whereby every tensor created from Tensor::add stores the lhs and rhs tensor of the add operation. Cloning this tensor will recursively clone all the other lhs and rhs from the lhs and rhs respectively, which causes memory to explode and could even lead to a stack overflow due to the recursion. If chaining adds is so important to you, I strongly advise using Rc or Arc, which performs a shallow copy. You can have the Rc be either external or internal w.r.t. the Tensor type, depending on your needs. An external Rc is more verbose to use outside the API, but allows using a tensor that is not boxed; how useful that is, you'll have to decide. An internal Rc means that every tensor is stored on the heap, but allows for a nicer API.

Example (Internal):

// This is simply your current Tensor renamed
struct TensorInner { /* … */ }

struct Tensor {
    inner: Rc<TensorInner>
}
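
With the internal variant, cloning a Tensor is just a reference-count bump instead of a deep copy of the array data, e.g. (a sketch against the simplified type above):

impl Clone for Tensor {
    fn clone(&self) -> Self {
        // Shallow copy: only the Rc's reference count is incremented.
        Tensor { inner: Rc::clone(&self.inner) }
    }
}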

Example (External):

fn add(
    self: Rc<Self>,
    rhs: &Rc<Tensor<A, T>>
) -> Tensor<A, U> { /* … */ } 

Btw, it's weird that you take ownership of self in add but still clone it. If you have to clone it either way, consider taking a reference instead, or refactor your code to avoid the clone of self. I missed that the impl is for &Tensor, not Tensor.

EDIT / P.S.:
Since a tree is one-directional, you could possibly even get away with regular references, and no Rc-boxing would be needed. That could complicate the API quite a bit, though, and may restrict how it can be used. It most likely nudges the user towards a collection that can push with &self instead of &mut self, which results in a design similar to a linked list or multi-level tree, to avoid almost guaranteed borrow-checker complaints in anything but the most trivial usages where everything resides on the stack. I don't recommend this.

P.P.S.: Using Rc does not completely avoid the potential stack overflow. The stack overflow can still happen when calling grad_fn, because the tree structure still exists. Some kind of trampoline would be needed to avoid it, i.e. you return a function that either yields the result or another function. Building a trampoline for a tree data structure is non-trivial, though: you're essentially going to end up writing your own stack implementation, using heap memory, to be able to handle the recursion. However, perhaps the stack overflow doesn't happen in practice and you can simply avoid opening this can of worms. I really do hope so.
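
To sketch the "own stack" idea (not your Tensor type; a toy binary tree stands in for the expression graph, and summing stands in for calling grad_fn):

enum Node {
    Leaf(f64),
    Add(Box<Node>, Box<Node>),
}

// Iterative traversal: an explicit Vec on the heap replaces the call stack,
// so arbitrarily deep trees can't overflow it the way recursion could.
fn sum(root: &Node) -> f64 {
    let mut stack = vec![root];
    let mut total = 0.0;
    while let Some(node) = stack.pop() {
        match node {
            Node::Leaf(x) => total += *x,
            Node::Add(l, r) => {
                stack.push(&**l);
                stack.push(&**r);
            }
        }
    }
    total
}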

3 Likes

Cloning this tensor will recursively clone all the other lhs and rhs from the lhs and rhs respectively, which causes memory to explode (..)

When I clone a tensor I basically return another Tensor that contains the data cloned from the original. No recursive clone happens as far as I know; it's a dumb workaround nevertheless.

Btw, it's weird that you take ownership of self in add but still clone it. If you have to clone it either way, consider taking a reference instead, or refactor your code to avoid the clone of self.

self in add is a reference. Correct me if I'm wrong.

Currently I'd like to achieve the following things:

  1. Avoid cloning rhs and lhs in add. I tried to clone the reference, i.e. let rhs_copy = &rhs.clone(), but it didn't work.

  2. I'd also very much like to retain some pointers to the parents of a tensor, either by storing them in the Tensor struct itself (though I don't understand how to do this, as they do not necessarily have the same dimension as their child), or by having them returned by grad_fn; I tried the latter approach and was punished by an avalanche of lifetime errors.

Currently I'm stuck in between those two issues.

a + b + c creates a tree:

    +
   / \
  +   c
 / \
a   b
  1. It clones a and b to create <a + b> (an anonymous intermediate tensor that represents the addition). The clone operation of <a + b> is implemented by calling clone on a and b.
  2. It clones <a + b>, which clones a and b, and it clones c to create <<a + b> + c>. The clone operation of <<a + b> + c> is implemented by calling clone on <a + b> and c.

In this example, clone of a tensor is called inside clone once. If you add d, then clone is called inside clone, which is called inside clone. While from a concrete point of view there is no recursion, from an abstract point of view there is; it depends on whether you differentiate between the clone implementations or not.

What's more important than debating the term "recursion" in this context is that the calls are nested, which increases the amount of stack space needed to perform a clone, potentially leading to a stack overflow if enough additions have been performed.

Oops, I didn't see that you implemented it for &Tensor. Now it makes sense why you take self.

1 Like

Ok, now I get it, thank you very much.

This raises the following question for me: am I thinking about this right? What would be the most idiomatic way of providing a user with the means to do something like:

let a = SomeTensor();
let b = SomeTensor();
let c = SomeTensor();

let d = &a + &b + &c;
let e = &a + &b + &d;

And then to differentiate w.r.t. e or d?

Is addition of tensors associative, i.e. (a + b) + c = a + (b + c), and commutative, i.e. a + b = b + a? If so, one solution would be to balance the tree to minimize its depth.

Another way to avoid the tree is to convert it to a stack.

enum Item {
    Op(fn(&Tensor, &Tensor) -> Tensor),
    Val(Tensor),
}

and work with a post-fix operation list: [Val(Tensor), Val(Tensor), Op(+), Val(Tensor), Op(+)]

Either way, you have to make the dependency explicit as a parameter of the function and can't just use a closure to capture the parent tensors implicitly. What exactly this parameter accepts depends on your needs. You may have to use an enum or some kind of runtime type information.
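
To illustrate, evaluating such a post-fix list is just a loop over an explicit value stack; a sketch with f64 standing in for your Tensor:

enum Item {
    Op(fn(f64, f64) -> f64),
    Val(f64),
}

// Values are pushed; each Op pops its two operands and pushes the result.
// No recursion, so no risk of a stack overflow.
fn eval(items: Vec<Item>) -> Option<f64> {
    let mut stack = Vec::new();
    for item in items {
        match item {
            Item::Val(v) => stack.push(v),
            Item::Op(op) => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(op(lhs, rhs));
            }
        }
    }
    stack.pop()
}

// e.g. [Val(1.0), Val(2.0), Op(+), Val(3.0), Op(+)] evaluates to 6.0.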

Since you're cloning arrays so much, maybe ArcArray would be a good fit? It's copy-on-write. I'd avoid doing any operations or element-by-element traversal on "IxDyn" arrays, since there are some slow cases. It's a bit unfortunate: dyn-dimensional arrays have the same fast paths as other dimensionalities, but the fallback case is worse.
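
For example, something like this (arbitrary shape, just to show the sharing behaviour):

use ndarray::ArcArray2;

fn main() {
    let a: ArcArray2<f64> = ArcArray2::zeros((2, 2));
    let mut b = a.clone();      // cheap: shares the allocation, bumps a reference count
    b[[0, 0]] = 1.0;            // the first write makes `b` its own copy (copy-on-write)
    assert_eq!(a[[0, 0]], 0.0); // `a` still sees the original data
}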

1 Like

I'm not sure why you're using a closure at all. The gradient of a sum is the sum of the gradients. Nothing weird required; it's the simplest case for automatic differentiation. A usage example would be helpful, as you're doing things that are complicated for no apparent reason.
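
That is, for out = lhs + rhs the backward rule is just "pass the incoming gradient through to both operands"; roughly (ignoring broadcasting, with plain arrays standing in for your Tensor):

use ndarray::ArrayD;

// d(out)/d(lhs) = 1 and d(out)/d(rhs) = 1, so both operands simply
// accumulate the gradient flowing into `out`.
fn add_backward(grad_out: &ArrayD<f64>, lhs_grad: &mut ArrayD<f64>, rhs_grad: &mut ArrayD<f64>) {
    *lhs_grad += grad_out;
    *rhs_grad += grad_out;
}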

I would also suggest not calling the operation "clone", since it isn't really cloning the Tensor: it returns quite a different Tensor.

1 Like

:grinning: I'm happy to hear from you; I'm enjoying ndarray very much. Can you clarify a bit more what those fallback cases would be?

Also, thanks everyone for the precious suggestions; I'll keep you all posted.