The code without bounds checks contains this vectorized (AVX2) loop:
vpbroadcastd ymm1, dword ptr [r10]
mov rcx, r8
mov rdx, qword ptr [rsp]
xor ebp, ebp
mov rdi, qword ptr [rsp + 16]
.LBB0_26:
vpaddd ymm2, ymm1, ymmword ptr [r12 + rdx - 96]
vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx]
vpcmpeqd ymm3, ymm2, ymm3
vpxor ymm3, ymm3, ymm0
vpmaskmovd ymmword ptr [r12 + rcx], ymm3, ymm2
vpaddd ymm2, ymm1, ymmword ptr [r12 + rdx - 64]
vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 32]
vpcmpeqd ymm3, ymm2, ymm3
vpxor ymm3, ymm3, ymm0
vpmaskmovd ymmword ptr [r12 + rcx + 32], ymm3, ymm2
vpaddd ymm2, ymm1, ymmword ptr [r12 + rdx - 32]
vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 64]
vpcmpeqd ymm3, ymm2, ymm3
vpxor ymm3, ymm3, ymm0
vpmaskmovd ymmword ptr [r12 + rcx + 64], ymm3, ymm2
vpaddd ymm2, ymm1, ymmword ptr [r12 + rdx]
vpmaxud ymm3, ymm2, ymmword ptr [r12 + rcx + 96]
vpcmpeqd ymm3, ymm2, ymm3
vpxor ymm3, ymm3, ymm0
vpmaskmovd ymmword ptr [r12 + rcx + 96], ymm3, ymm2
add rbp, 32
sub rdx, -128
sub rcx, -128
add rdi, -4
jne .LBB0_26
cmp qword ptr [rsp + 24], 0
je .LBB0_30
.LBB0_28:
vpbroadcastd ymm1, dword ptr [r10]
lea rcx, [rbx + 4*rbp]
lea rdx, [rax + 4*rbp]
xor edi, edi
.LBB0_29:
vpaddd ymm2, ymm1, ymmword ptr [rdx + rdi]
vpmaxud ymm3, ymm2, ymmword ptr [rcx + rdi]
vpcmpeqd ymm3, ymm2, ymm3
vpxor ymm3, ymm3, ymm0
vpmaskmovd ymmword ptr [rcx + rdi], ymm3, ymm2
add rdi, 32
cmp r15, rdi
jne .LBB0_29
No such vectorized loop appears in the code with bounds checks. If you look at the generated assembly, you will also see that the bounds-checked code is much longer. Given that the algorithm written in this way is not cache friendly either, it is understandable why it can be slower.
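To make the connection to source code concrete, here is a minimal sketch of a loop shape that could compile down to something like the listing above. It assumes the original inner loop is an unsigned "add, compare, store if smaller" update over `u32` values, which is what the `vpaddd` / `vpmaxud` / `vpcmpeqd` / `vpmaskmovd` sequence appears to do. The function name `relax_row` and its parameters are hypothetical and are not taken from the code under discussion.

```rust
/// Hypothetical helper: for every j, replace row_i[j] with d_ik + row_k[j]
/// whenever that sum is smaller (unsigned comparison).
fn relax_row(row_i: &mut [u32], row_k: &[u32], d_ik: u32) {
    // `zip` stops at the end of the shorter slice, so no element access
    // can go out of range and no per-element bounds check is needed.
    for (dst, &via) in row_i.iter_mut().zip(row_k) {
        // Assumption: wrapping addition, to mirror what `vpaddd` does.
        let candidate = d_ik.wrapping_add(via);
        if candidate < *dst {
            *dst = candidate;
        }
    }
}
```

Iterating over zipped slices instead of indexing is one common way to give the compiler a chance to vectorize without resorting to `unsafe`. Whether it actually does so still has to be confirmed by looking at the generated assembly.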
But, as I already said, working with matrices and graphs is hard. There are reasons why LAPACK is still used nowadays (and remember that it is written in Fortran, not C, which is also for a reason). If you want to write a good implementation of this (and not a trivial one like the one we are discussing), you will probably need to consider a better memory layout, for instance one based on Z-order curves. This is why it is generally a good idea to use existing crates for handling matrices and graphs.
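As an illustration of the Z-order idea mentioned above, the following sketch maps a (row, column) pair to a position along a Z-order (Morton) curve by interleaving the bits of the two coordinates. The names `spread_bits` and `morton_index` are made up for this example. A real implementation would also have to decide how to pad the matrix to a power-of-two size, which is not shown here.

```rust
/// Move the 32 bits of `v` to the even bit positions of a u64,
/// leaving a zero bit between each pair of original bits.
fn spread_bits(v: u32) -> u64 {
    let mut x = v as u64;
    x = (x | (x << 16)) & 0x0000_FFFF_0000_FFFF;
    x = (x | (x << 8)) & 0x00FF_00FF_00FF_00FF;
    x = (x | (x << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
    x = (x | (x << 2)) & 0x3333_3333_3333_3333;
    x = (x | (x << 1)) & 0x5555_5555_5555_5555;
    x
}

/// Z-order index of the element at (row, col): row bits go to the odd
/// positions, column bits to the even positions.
fn morton_index(row: u32, col: u32) -> u64 {
    (spread_bits(row) << 1) | spread_bits(col)
}
```

With this mapping the four elements of every aligned 2x2 block (and, recursively, every aligned 4x4, 8x8, ... block) are stored contiguously, so accesses that are close in both dimensions tend to land close in memory as well.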
It is important to always benchmark the code, but benchmarking without reasoning about what is happening behind the scenes (and without using something like hyperfine to obtain statistically meaningful results) can only help you choose between two solutions. Rationally analyzing the generated code is necessary to find an optimal solution.