Any advice on adding a CNN backbone?

I want to add a CNN such as ResNet50 or VGG as a backbone, and use its output feature map as the input to Swin Transformer.
My own dataset only includes a few thousand images, so I don't want to feed the raw image (or just the normalized image) directly into Swin Transformer.
Any advice on how to add a CNN backbone? The input size requirement of Swin Transformer seems to be strict.
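
To make the size concern concrete, here is a rough sketch of the constraint as I understand it, not taken from the Swin code. It assumes the token grid at every stage must be divisible by the window size, that patch merging halves the grid between stages, and that the image size divides evenly by the CNN stride. The function names `cnn_feature_side` and `swin_stage_sides` are made up for illustration.

```python
def cnn_feature_side(image_side, cnn_stride):
    """Side length of the CNN feature map.

    Assumption: image_side is an exact multiple of the CNN's total stride
    (e.g. stride 4 after ResNet50's first residual stage, 32 after the last).
    """
    if image_side % cnn_stride != 0:
        raise ValueError(f"{image_side} is not divisible by stride {cnn_stride}")
    return image_side // cnn_stride


def swin_stage_sides(feature_side, patch_size=1, window_size=7, num_stages=4):
    """Token-grid side length at each stage, or ValueError if a stage does not fit.

    Assumptions: each stage's grid must be divisible by window_size, and
    patch merging halves the grid between consecutive stages.
    """
    if feature_side % patch_size != 0:
        raise ValueError(f"{feature_side} is not divisible by patch size {patch_size}")
    side = feature_side // patch_size
    sides = []
    for stage in range(num_stages):
        if side % window_size != 0:
            raise ValueError(f"stage {stage}: grid {side} not divisible by window {window_size}")
        sides.append(side)
        if stage < num_stages - 1:
            # Patch merging combines 2x2 neighbours, so the side must be even.
            if side % 2 != 0:
                raise ValueError(f"stage {stage}: grid {side} cannot be halved")
            side //= 2
    return sides


# A 224x224 image through a stride-4 CNN stage gives a 56x56 feature map.
side = cnn_feature_side(224, 4)
print(side, swin_stage_sides(side, patch_size=1, window_size=7))
```

Under these assumptions, a stride-4 feature map from a 224x224 image with a patch size of 1 would fit all four stages, while a stride-32 feature map (7x7) would only fit a single stage.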