The input image is first split into overlapping patches. Then, those patches go through tokens reduction block and main transformer to learn features with global information. To abstract global ...