For the second question: yes, that would be possible unless you've established that the execution of _tx happens-before the x.load in t2, and your code doesn't do that. (However, .join() should internally use sufficient atomic synchronization to establish a happens-before relationship between the execution of the thread and anything that comes after .join().)
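To make the point about .join() concrete, here is a minimal sketch. It is not your code: the function name store_then_join and the variable names are made up for illustration. The store and the load are both Relaxed, so the only thing ordering them is the join:

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: spawns a thread that sets a flag with a Relaxed
/// store, joins that thread, and then reads the flag with a Relaxed load.
fn store_then_join() -> bool {
    let x = Arc::new(AtomicBool::new(false));
    let x_in_thread = Arc::clone(&x);

    let handle = thread::spawn(move || {
        x_in_thread.store(true, Ordering::Relaxed);
    });

    // join() synchronizes with the end of the spawned thread, so the store
    // above happens-before everything after this line.
    handle.join().unwrap();

    // The only writes to x are the initial `false` and the store of `true`,
    // and the store happens-before this load, so this load must return true.
    x.load(Ordering::Relaxed)
}
```

Calling store_then_join() should always return true. If you dropped the join and loaded x from another thread instead, nothing would order the store before the load, and false would be a legal result.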
For the first question, note that you establish that x.store(true, Release); happens-before the y.load(Acquire) in t1 and y.store(true, Release); happens-before the x.load(Acquire) in t2.
However, nothing establishes that (say) "either y.load(Acquire) in t1 happens-before the x.load(Acquire) in t2, or vice versa", or any other similar condition that would allow you to conclude "either y.store(true, Release); happens-before the y.load(Acquire) in t1 or x.store(true, Release); happens-before the x.load(Acquire) in t2".
Therefore, it'd be possible in the abstract model for y.store(true, Release); to not be visible to y.load(Acquire) in t1 and for x.store(true, Release); to not be visible to x.load(Acquire) in t2. That scenario would lead to _z == 0 at the end of main.
Note that even though you do a SeqCst load of z, the only guarantee is a single total order over the SeqCst operations themselves; one SeqCst operation somewhere in the code doesn't force everything else to be SeqCst. I'm fairly sure that you need SeqCst on the operations on x and y, and that Relaxed would actually suffice as the ordering for z.load(_). (I think it would be alright for the two stores to be Relaxed and the four loads of x and y to be SeqCst, but don't trust me on that. I'm not familiar enough with SeqCst.)
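Here is a sketch of how I understand your program, with the orderings pulled out as parameters so the two variants can be compared. This is a reconstruction, not your exact code: I'm assuming x and y are AtomicBool and z is a counter, and the function run_once and its store/load parameters are names I made up. Note that store must be an ordering that is valid for stores (Relaxed, Release, SeqCst) and load one that is valid for loads (Relaxed, Acquire, SeqCst), otherwise the atomic operations panic.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical helper: runs the four-thread experiment once and returns
/// the final value of z. `store` is used for the two stores to x and y,
/// `load` for the four loads of x and y.
fn run_once(store: Ordering, load: Ordering) -> usize {
    let x = Arc::new(AtomicBool::new(false));
    let y = Arc::new(AtomicBool::new(false));
    let z = Arc::new(AtomicUsize::new(0));

    let tx = {
        let x = Arc::clone(&x);
        thread::spawn(move || x.store(true, store))
    };
    let ty = {
        let y = Arc::clone(&y);
        thread::spawn(move || y.store(true, store))
    };
    let t1 = {
        let (x, y, z) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&z));
        thread::spawn(move || {
            // Wait until the store to x is visible, then check y.
            while !x.load(load) {
                std::hint::spin_loop();
            }
            if y.load(load) {
                z.fetch_add(1, Ordering::Relaxed);
            }
        })
    };
    let t2 = {
        let (x, y, z) = (Arc::clone(&x), Arc::clone(&y), Arc::clone(&z));
        thread::spawn(move || {
            // Wait until the store to y is visible, then check x.
            while !y.load(load) {
                std::hint::spin_loop();
            }
            if x.load(load) {
                z.fetch_add(1, Ordering::Relaxed);
            }
        })
    };

    for handle in vec![tx, ty, t1, t2] {
        handle.join().unwrap();
    }

    // The joins already order every fetch_add before this load,
    // so Relaxed is enough here.
    z.load(Ordering::Relaxed)
}
```

With run_once(Ordering::SeqCst, Ordering::SeqCst), the result should be 1 or 2: one of the two stores comes first in the single total order, so at least one of t1 and t2 must see both flags set. With run_once(Ordering::Release, Ordering::Acquire), the abstract model also allows 0, although you may never observe it on hardware with a strong memory model such as x86.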
For a concrete-ish example, you could imagine that you have four cores, each one executing a spawned thread, and it takes longer for atomic stores to reach further-away cores (and they won't wait for other cores' operations to be visible unless you tell them to):
| C1 | C2 | C3 | C4 |
| --- | --- | --- | --- |
| _tx | t1 | t2 | _ty |
We can imagine that the sequence of events is:
- x.store(true, Release) and y.store(true, Release) occur in C1 and C4.
- The store to x propagates to C2, and the store to y propagates to C3.
- The while-loop in t1 exits, and t1 sees that y.load(Acquire) is false. Similarly for t2, which sees that x.load(Acquire) is false.
- Finally, the stores to x and y finish propagating to all cores.
I'm just mentioning that scenario in case it's more intuitive to you; when it comes to writing correct atomic code, I think reasoning about concrete machines won't help as much as reasoning about the formal model.