I was looking for something similar back at the start of summer, and the best I could find at the time was a Microsoft model on Hugging Face: huggingface.co/microsoft/speecht5_tts. It’s a bit robotic, but it’s pretty versatile, and since the output can be saved as a .wav file it’s easy to integrate into whatever system you’re working on or with.
The only tricky part is that you need to understand sampling rates: the file has to be written at the same rate the model generates audio at, or the voice won’t play back correctly. That said, I think the example on the Hugging Face page works as is.
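
To show what I mean about sampling rates, here’s a rough sketch using only the Python standard library. It doesn’t call the model at all: `fake_speech` is a made-up stand-in that generates a sine tone where the model’s waveform would go, and `write_wav` / `wav_duration_seconds` are helper names I picked for illustration. The 16 kHz figure is what the model card example uses when saving the audio, as far as I remember, so double check it there.

```python
import math
import os
import struct
import tempfile
import wave

# Assumption: SpeechT5 audio is 16 kHz (the rate used in the model card example).
SPEECHT5_SAMPLE_RATE = 16_000


def fake_speech(seconds=1.0, sample_rate=SPEECHT5_SAMPLE_RATE, freq=220.0):
    """Hypothetical stand-in for model output: a mono sine tone as floats."""
    n = int(seconds * sample_rate)
    return [0.5 * math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]


def write_wav(path, samples, sample_rate=SPEECHT5_SAMPLE_RATE):
    """Write mono float samples in [-1.0, 1.0] as a 16-bit PCM .wav file."""
    frames = bytearray()
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # avoid integer overflow on loud samples
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes = 16-bit samples
        wav_file.setframerate(sample_rate)  # the rate players will trust
        wav_file.writeframes(bytes(frames))


def wav_duration_seconds(path):
    """Duration a player would report, based on the rate stored in the header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


if __name__ == "__main__":
    samples = fake_speech(seconds=1.0)
    with tempfile.TemporaryDirectory() as tmp:
        right = os.path.join(tmp, "right.wav")
        wrong = os.path.join(tmp, "wrong.wav")
        write_wav(right, samples)                    # header matches the audio
        write_wav(wrong, samples, sample_rate=8000)  # header lies about the rate
        print("correct rate:", wav_duration_seconds(right), "s")
        print("wrong rate:  ", wav_duration_seconds(wrong), "s")
```

Same samples in both files, but the one with the wrong rate in its header plays back at half speed and sounds lower and slower. That’s the kind of mistake to watch for when you move the audio into your own pipeline.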