Both versions look similar in assembly. It's not too surprising that they are not exactly identical: optimization looks for a local optimum, not a global one. Still, I would guess both versions are similarly efficient.
There is only one loop in both versions of the code: both make a single pass over the data. Note that iterators are lazy, so take_while doesn't consume the whole iterator up front; it pulls elements one at a time and stops at the first one that fails the predicate. Also, all the functions get inlined in both versions. So the code is not as different as you say.
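To make the laziness point concrete, here is a minimal sketch. It is not the code from this thread: the function names, the `limit` parameter, and the `visited` counter are hypothetical, added only to count how many elements each version actually pulls from the source.

```rust
use std::cell::Cell;

/// Iterator version (hypothetical example): sums the leading run of
/// values below `limit`. Also returns how many elements were pulled
/// from the underlying slice iterator.
fn sum_with_iterators(data: &[u32], limit: u32) -> (u32, usize) {
    // Counter that is bumped every time an element is requested.
    let visited = Cell::new(0usize);
    let sum: u32 = data
        .iter()
        .inspect(|_| visited.set(visited.get() + 1))
        .take_while(|&&x| x < limit)
        .sum();
    (sum, visited.get())
}

/// Hand-written loop version (hypothetical example) with the same
/// bookkeeping, for comparison.
fn sum_with_loop(data: &[u32], limit: u32) -> (u32, usize) {
    let mut sum = 0;
    let mut visited = 0;
    for &x in data {
        visited += 1;
        if x >= limit {
            break;
        }
        sum += x;
    }
    (sum, visited)
}
```

Called with `&[1, 2, 3, 10, 4, 5]` and a limit of `5`, both functions return `(6, 4)`: each one looks at four elements and stops at the `10`, without touching the rest of the slice. This only shows that the two versions do the same amount of work at the source level; it says nothing about what the optimizer emits for either one.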
Yeah, I have no idea how I overlooked that. However, the "harder to optimize" argument arising from the increased complexity probably still stands (the generated code is longer and contains three more branches).