What can I do to avoid "thread 'main' has overflowed its stack" when working with large arrays?
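The usual fix is to put the large data on the heap rather than the stack, e.g. with `Vec` or `Box<[T]>`. A minimal sketch (the 64 MiB size is just an illustrative assumption, chosen to exceed a typical 8 MiB main-thread stack):

```rust
fn main() {
    // A large array as a local variable lives on the stack and can overflow it:
    // let a = [0u8; 64 * 1024 * 1024]; // may abort: "has overflowed its stack"

    // A Vec puts the data on the heap instead, so stack size doesn't matter:
    let v = vec![0u8; 64 * 1024 * 1024];
    assert_eq!(v.len(), 64 * 1024 * 1024);
    assert_eq!(v[0], 0);
    assert_eq!(v[v.len() - 1], 0);
}
```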

That looks to me like the default implementation on the trait, no?

For low-enough alignments (which is almost everything, unless you're forcing particularly high alignment manually), it's just a call to calloc: alloc.rs - source

(Or whatever the allocator you're using chooses to do.)
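For reference, the Rust-side entry point being discussed can be exercised directly through `std::alloc::alloc_zeroed`; with the default system allocator and an ordinary alignment, this is the path that ends up in calloc. A sketch (the size and alignment values are arbitrary):

```rust
use std::alloc::{alloc_zeroed, dealloc, Layout};

fn main() {
    // 1 KiB with 8-byte alignment: a "low-enough" alignment for the
    // calloc fast path described above.
    let layout = Layout::from_size_align(1024, 8).unwrap();
    unsafe {
        let p = alloc_zeroed(layout);
        assert!(!p.is_null());
        // alloc_zeroed guarantees the returned block is zeroed.
        assert!(std::slice::from_raw_parts(p, 1024).iter().all(|&b| b == 0));
        dealloc(p, layout);
    }
}
```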

In Rust, they're not allocating at all - they're using whatever you've created them from.

I tried to find the implementation of calloc in libc, found this

It seems that calloc does more than malloc: a malloc is always performed, and the memset is guarded behind "likely"/"expect" branch hints.

void *
__libc_calloc (size_t n, size_t elem_size)
{
  mstate av;
  mchunkptr oldtop;
  INTERNAL_SIZE_T sz, oldtopsize;
  void *mem;
  unsigned long clearsize;
  unsigned long nclears;
  INTERNAL_SIZE_T *d;
  ptrdiff_t bytes;

  if (__glibc_unlikely (__builtin_mul_overflow (n, elem_size, &bytes)))
    {
       __set_errno (ENOMEM);
       return NULL;
    }

  sz = bytes;

  if (!__malloc_initialized)
    ptmalloc_init ();

  MAYBE_INIT_TCACHE ();

  if (SINGLE_THREAD_P)
    av = &main_arena;
  else
    arena_get (av, sz);

  if (av)
    {
      /* Check if we hand out the top chunk, in which case there may be no
	 need to clear. */
#if MORECORE_CLEARS
      oldtop = top (av);
      oldtopsize = chunksize (top (av));
# if MORECORE_CLEARS < 2
      /* Only newly allocated memory is guaranteed to be cleared.  */
      if (av == &main_arena &&
	  oldtopsize < mp_.sbrk_base + av->max_system_mem - (char *) oldtop)
	oldtopsize = (mp_.sbrk_base + av->max_system_mem - (char *) oldtop);
# endif
      if (av != &main_arena)
	{
	  heap_info *heap = heap_for_ptr (oldtop);
	  if (oldtopsize < (char *) heap + heap->mprotect_size - (char *) oldtop)
	    oldtopsize = (char *) heap + heap->mprotect_size - (char *) oldtop;
	}
#endif
    }
  else
    {
      /* No usable arenas.  */
      oldtop = 0;
      oldtopsize = 0;
    }
  mem = _int_malloc (av, sz);

  assert (!mem || chunk_is_mmapped (mem2chunk (mem)) ||
          av == arena_for_chunk (mem2chunk (mem)));

  if (!SINGLE_THREAD_P)
    {
      if (mem == 0 && av != NULL)
	{
	  LIBC_PROBE (memory_calloc_retry, 1, sz);
	  av = arena_get_retry (av, sz);
	  mem = _int_malloc (av, sz);
	}

      if (av != NULL)
	__libc_lock_unlock (av->mutex);
    }

  /* Allocation failed even after a retry.  */
  if (mem == 0)
    return 0;

  mchunkptr p = mem2chunk (mem);

  /* If we are using memory tagging, then we need to set the tags
     regardless of MORECORE_CLEARS, so we zero the whole block while
     doing so.  */
  if (__glibc_unlikely (mtag_enabled))
    return tag_new_zero_region (mem, memsize (p));

  INTERNAL_SIZE_T csz = chunksize (p);

  /* Two optional cases in which clearing not necessary */
  if (chunk_is_mmapped (p))
    {
      if (__builtin_expect (perturb_byte, 0))
        return memset (mem, 0, sz);

      return mem;
    }

#if MORECORE_CLEARS
  if (perturb_byte == 0 && (p == oldtop && csz > oldtopsize))
    {
      /* clear only the bytes from non-freshly-sbrked memory */
      csz = oldtopsize;
    }
#endif

  /* Unroll clear of <= 36 bytes (72 if 8byte sizes).  We know that
     contents have an odd number of INTERNAL_SIZE_T-sized words;
     minimally 3.  */
  d = (INTERNAL_SIZE_T *) mem;
  clearsize = csz - SIZE_SZ;
  nclears = clearsize / sizeof (INTERNAL_SIZE_T);
  assert (nclears >= 3);

  if (nclears > 9)
    return memset (d, 0, clearsize);

  else
    {
      *(d + 0) = 0;
      *(d + 1) = 0;
      *(d + 2) = 0;
      if (nclears > 4)
        {
          *(d + 3) = 0;
          *(d + 4) = 0;
          if (nclears > 6)
            {
              *(d + 5) = 0;
              *(d + 6) = 0;
              if (nclears > 8)
                {
                  *(d + 7) = 0;
                  *(d + 8) = 0;
                }
            }
        }
    }

  return mem;
}
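One detail worth pulling out of the source above: before allocating anything, `__libc_calloc` checks whether `n * elem_size` overflows (via `__builtin_mul_overflow`) and fails with ENOMEM if so. The same check translated to Rust, as a sketch (`checked_total_size` is a hypothetical name):

```rust
// Mirrors the overflow check at the top of __libc_calloc: the total
// byte count must not wrap around before we allocate.
fn checked_total_size(n: usize, elem_size: usize) -> Option<usize> {
    n.checked_mul(elem_size)
}

fn main() {
    assert_eq!(checked_total_size(4, 8), Some(32));
    // Overflow is reported instead of silently wrapping
    // (glibc sets errno to ENOMEM and returns NULL here).
    assert_eq!(checked_total_size(usize::MAX, 2), None);
}
```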

I suspect it uses the heap for large data but it is possible NumPy places such things on the stack.

The "trick" is to include "stack probes". When a thread starts, the operating system reserves the requested amount of stack space and places a "guard page" at the end. When the guard page is touched, a page fault occurs and the operating system takes charge: it extends the stack, places a guard page at the new end, then returns control to the thread. In other words, the operating system grows the stack as needed, as long as the thread touches the guard page. That's the purpose of a "stack probe": when a function needs lots of stack space, it can "probe ahead" to ensure the stack has been extended. Obviously, there's a performance hit from stack probes, so some compilers / run-time libraries choose not to emit them. Apparently Rust does not.
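Whether or not the stack grows on demand, you can also just reserve a bigger stack up front for a worker thread via `std::thread::Builder::stack_size`. A sketch (the 32 MiB and 16 MiB figures are assumptions for illustration, not recommendations):

```rust
use std::thread;

fn main() {
    // Request a 32 MiB stack so a 16 MiB local array fits comfortably.
    let handle = thread::Builder::new()
        .stack_size(32 * 1024 * 1024)
        .spawn(|| {
            let big = [0u8; 16 * 1024 * 1024]; // on the stack, but it fits
            big.len()
        })
        .unwrap();
    assert_eq!(handle.join().unwrap(), 16 * 1024 * 1024);
}
```

Note this only affects threads you spawn; the main thread's stack size is fixed at program start (configurable at link time, or via ulimit on Unix).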

If NumPy does use the stack then it very likely also uses stack probes to ensure there's enough space.

Rust has them, modulo how much LLVM supports a given platform. It's an exploit mitigation though (it segfaults), not a stack extension mechanism.


On some platforms, it does appear to be possible to grow the stack at runtime -- rustc does that (ensure_sufficient_stack in rustc_data_structures::stack - Rust) to be able to process excessively-nested syntax, for example.


I think all tier 1 is covered.

The implementation includes the answer to the ultimate question of life, the universe, and everything. That's handy.