### Learning rates

Cosine Annealing: uses half of the cosine function (from its peak down to its trough) to decrease the learning rate as training progresses. Cosine annealing can be combined with Stochastic Gradient Descent with warm restarts (SGDR): when a cycle ends, the learning rate jumps back to its maximum value (a restart at the top of the cosine curve) and a new annealing cycle begins.
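A minimal sketch of this schedule in Python (the function name and fixed cycle length are illustrative; the original SGDR formulation also allows cycles to grow longer after each restart):

```python
import math

def cosine_annealing_lr(step, cycle_len, lr_max, lr_min=0.0):
    """Half-cosine decay from lr_max to lr_min within one cycle.

    The step index wraps around at cycle_len, so the learning rate
    jumps back to lr_max at the start of each cycle (a warm restart).
    """
    t = step % cycle_len  # position inside the current cycle
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

# At step 0 the rate sits at the top of the cosine curve (lr_max),
# halfway through a cycle it is at the midpoint, and at step == cycle_len
# it restarts back to lr_max.
print(cosine_annealing_lr(0, 100, 0.1))    # start of cycle: 0.1
print(cosine_annealing_lr(50, 100, 0.1))   # mid-cycle: 0.05
print(cosine_annealing_lr(100, 100, 0.1))  # warm restart: 0.1
```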