That explanation of the prime-number example doesn't seem to match what's actually in the paper. GPT-4 goes from a wordy explanation arriving at the correct answer, "yes", to a single-word incorrect "no". GPT-3.5 goes from a wordy explanation with the right chain of thought but the wrong answer, "no", to a very wordy explanation with the correct answer, "yes". Neither of those looks like either model simply answering one way for everything.