[LG] Is your batch size the problem? Revisiting the Adam-SGD gap in language modeling [Max Planck Institute for Intelligent Systems] arxiv.org