
=== Logistic Mixing ===

Let <math>P_i(1)</math> be the prediction by the <math>i</math>-th model that the next bit will be a 1. The final prediction <math>P(1)</math> is then calculated as:

*<math>x_i = \text{stretch}(P_i(1))</math>
*<math display="inline">P(1) = \text{squash}(\sum_i w_i x_i)</math>

where <math>w_i</math> is the weight given to the <math>i</math>-th model, <math>x_i</math> is that model's prediction in the stretched (logistic) domain, and

*<math>\text{stretch}(x) = \ln(x / (1 - x))</math>
*<math>\text{squash}(x) = \text{stretch}^{-1}(x) = 1/(1 + e^{-x})</math>
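The prediction step above can be sketched in a few lines of Python. This is a minimal floating-point illustration, not the code of any particular compressor; the function name <code>mix</code> and the example weights are assumptions made for this sketch, and input probabilities are assumed to lie strictly between 0 and 1.

```python
import math

def stretch(p):
    """Logit: ln(p / (1 - p)). Assumes 0 < p < 1."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

def mix(predictions, weights):
    """Combine per-model probabilities P_i(1) into a single P(1).

    `mix` is a hypothetical helper name used only in this sketch.
    """
    return squash(sum(w * stretch(p) for w, p in zip(weights, predictions)))

# Two models that both lean towards a 1 bit, with illustrative weights.
p = mix([0.9, 0.8], [0.5, 0.5])
```

Because the weighted sum is taken in the stretched domain, a model that is very confident (a probability near 0 or 1) contributes a large <math>x_i</math> and has more influence on the result than a model predicting close to 0.5.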

After each prediction, once the actual bit is known, the model is updated by adjusting the weights to reduce coding cost:

*<math>w_i \leftarrow w_i + \eta x_i (y - P(1))</math>

where <math>\eta</math> is the learning rate (typically 0.002 to 0.01), <math>y</math> is the actual value of the bit that was predicted, and <math>y - P(1)</math> is the prediction error.
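The weight update can be sketched as follows. This is a hedged illustration only: the name <code>update</code>, the initial weights of 0.3, the learning rate of 0.005 (chosen from the range quoted above) and the two fixed model predictions are all assumptions for the example.

```python
import math

def stretch(p):
    """Logit: ln(p / (1 - p)). Assumes 0 < p < 1."""
    return math.log(p / (1.0 - p))

def squash(x):
    """Logistic function, the inverse of stretch."""
    return 1.0 / (1.0 + math.exp(-x))

def update(weights, predictions, y, eta=0.005):
    """One mixing step followed by one weight update.

    Returns the new weights and the mixed prediction P(1) that was
    made before the actual bit y (0 or 1) was seen.
    `update` is a hypothetical helper name used only in this sketch.
    """
    x = [stretch(p) for p in predictions]
    p1 = squash(sum(w * xi for w, xi in zip(weights, x)))
    err = y - p1  # prediction error
    new_weights = [w + eta * xi * err for w, xi in zip(weights, x)]
    return new_weights, p1

# Model 0 always predicts 0.9 and model 1 always predicts 0.1,
# while every actual bit is 1 (illustrative data).
weights = [0.3, 0.3]
history = []
for _ in range(200):
    weights, p1 = update(weights, [0.9, 0.1], y=1)
    history.append(p1)
```

In this run the weight of the model that agrees with the data grows and the weight of the other model shrinks, so the mixed prediction <math>P(1)</math> moves from 0.5 towards 1 as more bits are seen.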