The repo only includes code for generating answers. Which metric should we use to evaluate them: exact match, or something else?
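
For reference, by exact match I mean roughly the sketch below. It is only an illustration: the normalization steps (lowercasing, stripping punctuation and English articles) and the names `normalize`, `exact_match`, and `exact_match_score` are my own assumptions, not anything taken from the repo.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace.

    These normalization rules are an assumption for illustration only.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> bool:
    """True when the normalized prediction equals the normalized reference."""
    return normalize(prediction) == normalize(reference)


def exact_match_score(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their paired reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

If the intended metric is different (for example a token-level F1 or a model-based judge), a pointer to the expected scoring would help.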