In m256_mul() (p256-m.c, line 596), shouldn't the time-critical reduction modulo p exploit the fact that m_prime = 1?
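To make the question concrete, here is a minimal sketch of the idea, not the actual p256-m code. It assumes 32-bit limbs and that the Montgomery quotient digit is computed in the usual way as u = (z0 + xi * y0) * m_prime mod 2^32. Since the low limb of the P-256 prime is 0xFFFFFFFF, p ≡ -1 (mod 2^32), so -p^-1 mod 2^32 = 1 and the final multiplication is a no-op. The helper names (neg_inv32, quotient_generic, quotient_p256, check_p256_shortcut) are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Low 32-bit limb of the P-256 prime p = 2^256 - 2^224 + 2^192 + 2^96 - 1. */
#define P256_P0 0xFFFFFFFFu

/* Hypothetical helper: -m0^{-1} mod 2^32 for odd m0, by Newton iteration. */
static uint32_t neg_inv32(uint32_t m0)
{
    uint32_t inv = m0;              /* m0 * m0 == 1 mod 8, so 3 bits are correct */
    for (int i = 0; i < 5; i++)     /* each round doubles the number of correct bits */
        inv *= 2u - m0 * inv;
    return 0u - inv;                /* negate modulo 2^32 */
}

/* Generic quotient digit (assumed form): one extra multiplication by m_prime. */
static uint32_t quotient_generic(uint32_t z0, uint32_t xi, uint32_t y0,
                                 uint32_t m_prime)
{
    return (z0 + xi * y0) * m_prime;
}

/* Same digit specialised for m_prime == 1: the multiplication disappears. */
static uint32_t quotient_p256(uint32_t z0, uint32_t xi, uint32_t y0)
{
    return z0 + xi * y0;
}

/* Sanity check that both forms agree when the modulus is p. */
static void check_p256_shortcut(void)
{
    uint32_t m_prime = neg_inv32(P256_P0);
    assert(m_prime == 1u);
    assert(quotient_generic(7u, 11u, 13u, m_prime) == quotient_p256(7u, 11u, 13u));
}
```

The saving is one 32x32 multiplication per outer-loop iteration, and it only applies on the code path where the modulus is known to be p.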