Categorical variable
==Number of possible values==
Categorical [[random variable]]s are normally described statistically by a [[categorical distribution]], which allows an arbitrary ''K''-way categorical variable to be expressed with a separate probability specified for each of the ''K'' possible outcomes. Such multiple-category categorical variables are often analyzed using a [[multinomial distribution]], which gives the probability of each possible combination of counts of the various categories over a fixed number of observations. [[Regression analysis]] on categorical outcomes is accomplished through [[multinomial logistic regression]], [[multinomial probit]] or a related type of [[discrete choice]] model.

Categorical variables that have only two possible outcomes (e.g., "yes" vs. "no" or "success" vs. "failure") are known as ''binary variables'' (or ''Bernoulli variables''). Because of their importance, these variables are often treated as a separate case, with a separate distribution (the [[Bernoulli distribution]]) and separate regression models ([[logistic regression]], [[probit regression]], etc.). As a result, the term "categorical variable" is often reserved for cases with 3 or more outcomes, sometimes termed a ''multi-way'' variable in contrast to a binary variable.

It is also possible to consider categorical variables where the number of categories is not fixed in advance. As an example, for a categorical variable describing a particular word, we might not know in advance the size of the vocabulary, and we would like to allow for the possibility of encountering words that we have not already seen. Standard statistical models, such as those involving the [[categorical distribution]] and [[multinomial logistic regression]], assume that the number of categories is known in advance, and they are not easily adapted when the number of categories changes on the fly. In such cases, more advanced techniques must be used. An example is the [[Dirichlet process]], which falls in the realm of [[nonparametric statistics]].
In such a case, it is assumed that infinitely many categories exist, but that at any one time all but a finite number of them have never been seen. All formulas are phrased in terms of the number of categories actually seen so far rather than the (infinite) total number of potential categories, and methods are created for incremental updating of statistical distributions, including adding "new" categories.
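
The idea of working only with the categories seen so far can be illustrated with a minimal sketch. It uses the Chinese restaurant process predictive rule, one common way of representing a Dirichlet process, in which an already seen category with count ''n''<sub>''k''</sub> has probability ''n''<sub>''k''</sub> / (''n'' + ''α'') and a not yet seen category has probability ''α'' / (''n'' + ''α''), where ''n'' is the number of observations so far. The class name <code>OpenCategoryCounter</code> and the parameter <code>alpha</code> are hypothetical names chosen for this illustration, not part of any library.

```python
from collections import Counter


class OpenCategoryCounter:
    """Running category counts with a Chinese-restaurant-process predictive rule.

    Illustrative sketch only: names and structure are assumptions.
    """

    def __init__(self, alpha=1.0):
        if alpha <= 0:
            raise ValueError("alpha must be positive")
        self.alpha = alpha       # concentration: weight reserved for unseen categories
        self.counts = Counter()  # stores only the categories seen so far
        self.total = 0           # number of observations so far

    def observe(self, category):
        """Incrementally update the counts; a new category is added on first sight."""
        self.counts[category] += 1
        self.total += 1

    def prob_seen(self, category):
        """Predictive probability that the next observation is this seen category."""
        return self.counts[category] / (self.total + self.alpha)

    def prob_new(self):
        """Predictive probability that the next observation is a category not yet seen."""
        return self.alpha / (self.total + self.alpha)


model = OpenCategoryCounter(alpha=1.0)
for word in ["the", "cat", "the", "dog"]:
    model.observe(word)

print(model.prob_seen("the"))  # 2 / (4 + 1)
print(model.prob_new())        # 1 / (4 + 1)
```

The probabilities of all seen categories plus the probability of a new category sum to one, so the total number of potential categories never has to be specified. A larger <code>alpha</code> reserves more probability for categories that have not yet been observed.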