
==Number of possible values==
Categorical [[random variable]]s are normally described statistically by a [[categorical distribution]], which allows an arbitrary ''K''-way categorical variable to be expressed with separate probabilities specified for each of the ''K'' possible outcomes.  Repeated observations of such multiple-category categorical variables are often analyzed using a [[multinomial distribution]], which gives the probability of each possible combination of counts of the various categories over a fixed number of trials. [[Regression analysis]] on categorical outcomes is accomplished through [[multinomial logistic regression]], [[multinomial probit]] or a related type of [[discrete choice]] model.
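The relationship between the two distributions can be illustrated with a minimal sketch using only the Python standard library. The category names, probabilities, and function names below are hypothetical and chosen for illustration: a single draw follows a categorical distribution, and the vector of counts over ''n'' independent draws follows a multinomial distribution.

```python
import random


def sample_categorical(probs, rng):
    """Draw one outcome from a K-way categorical distribution.

    probs maps each category to its probability (illustrative values).
    """
    categories = list(probs)
    weights = [probs[c] for c in categories]
    return rng.choices(categories, weights=weights, k=1)[0]


def sample_multinomial(probs, n, rng):
    """Count occurrences of each category over n independent categorical draws."""
    counts = {c: 0 for c in probs}
    for _ in range(n):
        counts[sample_categorical(probs, rng)] += 1
    return counts


# Hypothetical 3-way categorical variable.
probs = {"red": 0.5, "green": 0.3, "blue": 0.2}
rng = random.Random(0)  # fixed seed so the run is repeatable
counts = sample_multinomial(probs, 100, rng)
print(counts)  # one count per category; the counts sum to 100
```

Every category appears in the result, including any that were never drawn, because the number of categories ''K'' is fixed in advance.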

Categorical variables that have only two possible outcomes (e.g., "yes" vs. "no" or "success" vs. "failure") are known as ''binary variables'' (or ''Bernoulli variables'').  Because of their importance, these variables are often treated as a separate class, with a separate distribution (the [[Bernoulli distribution]]) and separate regression models ([[logistic regression]], [[probit regression]], etc.).  As a result, the term "categorical variable" is often reserved for cases with three or more outcomes, sometimes termed a ''multi-way'' variable in contrast to a binary variable.

It is also possible to consider categorical variables where the number of categories is not fixed in advance.  As an example, for a categorical variable describing a particular word, we might not know in advance the size of the vocabulary, and we would like to allow for the possibility of encountering words that we have not already seen.  Standard statistical models, such as those involving the [[categorical distribution]] and [[multinomial logistic regression]], assume that the number of categories is known in advance, and they do not readily accommodate categories that are added after the model has been specified.  Such cases call for more flexible techniques.  An example is the [[Dirichlet process]], which falls in the realm of [[nonparametric statistics]].  In such a case, the model assumes that an infinite number of categories exist, but at any one time most of them (in fact, all but a finite number) have never been seen.  All formulas are phrased in terms of the number of categories actually seen so far rather than the (infinite) total number of potential categories in existence, and methods are created for incremental updating of statistical distributions, including adding "new" categories.
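One way to make the incremental updating concrete is the predictive rule of the Chinese restaurant process, a standard representation of the Dirichlet process. The following is a minimal sketch under stated assumptions: the concentration parameter <code>alpha</code>, the function names, and the example words are illustrative, and the base distribution that would assign an identity to a new category is left out. After ''n'' observations, a category seen ''n<sub>k</sub>'' times has predictive probability ''n<sub>k</sub>''/(''n'' + ''α''), and the probability of some not-yet-seen category is ''α''/(''n'' + ''α'').

```python
from collections import Counter


def predictive_probs(counts, alpha):
    """Return (probabilities of seen categories, probability of a new category).

    Uses only the categories observed so far; alpha is an assumed
    concentration parameter controlling how readily new categories appear.
    """
    total = sum(counts.values())
    seen = {c: n / (total + alpha) for c, n in counts.items()}
    p_new = alpha / (total + alpha)
    return seen, p_new


def observe(counts, category):
    """Incrementally update the counts; an unseen category is simply added."""
    counts[category] += 1


counts = Counter()
for word in ["the", "cat", "the"]:  # hypothetical observed words
    observe(counts, word)

seen, p_new = predictive_probs(counts, alpha=1.0)
# With counts the=2, cat=1 and alpha=1: the -> 2/4, cat -> 1/4, new -> 1/4
```

Observing a word outside the current vocabulary requires no change to the model: <code>observe</code> adds a new key, and subsequent predictive probabilities are computed over the enlarged set of seen categories.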