Parker Riley, Daniel Deutsch, George F. Foster, Viresh Ratnakar, Ali Dabirmoghaddam, Markus Freitag: Finding Replicable Human Evaluations via Stable Ranking Probability. NAACL-HLT 2024: 4908-4919