{{refimprove article|date=August 2010}}

'''Data profiling''' is the process of examining the data available from an existing information source (e.g. a database or a [[computer file|file]]) and collecting [[descriptive statistics|statistics]] or informative summaries about that data.<ref name="Johnson2009">{{cite encyclopedia |first=Theodore |last=Johnson |date=2009 |title=Data Profiling |encyclopedia=Encyclopedia of Database Systems |publisher=Springer |location=Heidelberg}}</ref> The purpose of these statistics may be to:
# Find out whether existing data can be easily used for other purposes
# Improve the ability to search data by [[tag (metadata)|tagging]] it with [[Index term|keywords]], descriptions, or assigning it to a category
# Assess [[data quality]], including whether the data conforms to particular standards or patterns<ref>{{Cite journal|last1=Woodall|first1=Philip|last2=Oberhofer|first2=Martin|last3=Borek|first3=Alexander|year=2014|title=A classification of data quality assessment and improvement methods|url=http://www.inderscience.com/link.php?id=68656|journal=International Journal of Information Quality|language=en|volume=3|issue=4|page=298|doi=10.1504/ijiq.2014.068656}}</ref>
# Assess the risk involved in [[data integration|integrating data]] in new applications, including the challenges of [[Join (SQL)|join]]s
# Discover [[metadata]] of the source database, including value patterns and [[frequency distribution|distributions]], [[candidate key|key candidates]], [[inclusion dependency|foreign-key candidates]], and [[functional dependency|functional dependencies]]
# Assess whether known metadata accurately describes the actual values in the source database
# Understand data challenges early in any data-intensive project, so that late project surprises are avoided; finding data problems late in the project can lead to delays and cost overruns
# Have an enterprise view of all data, for uses such as [[master data management]], where key data is needed, or [[data governance]] for improving data quality.
==Introduction==
Data profiling refers to the analysis of information for use in a [[data warehouse]] in order to clarify the structure, content, relationships, and derivation rules of the data.<ref name="Kimball2008">{{cite book |first=Ralph |last=Kimball |display-authors=etal |date=2008 |title=The Data Warehouse Lifecycle Toolkit |url=https://archive.org/details/datawarehouselif00kimb_924 |url-access=limited |edition=Second |publisher=Wiley |isbn=9780470149775 |pages=[https://archive.org/details/datawarehouselif00kimb_924/page/n17 376]}}</ref> Profiling helps not only to understand anomalies and to assess data quality, but also to discover, register, and assess enterprise metadata.<ref name="Loshin2009">{{cite book |first=David |last=Loshin |date=2009 |title=Master Data Management |url=https://archive.org/details/masterdatamanage00losh |url-access=limited |publisher=Morgan Kaufmann |isbn=9780123742254 |pages=[https://archive.org/details/masterdatamanage00losh/page/n197 94]–96}}</ref><ref name="Loshin2003">{{cite book |first=David |last=Loshin |date=2003 |title=Business Intelligence: The Savvy Manager's Guide, Getting Onboard with Emerging IT |publisher=Morgan Kaufmann |isbn=9781558609167 |pages=110–111}}</ref> The result of the analysis is used to determine the suitability of the candidate source systems, usually giving the basis for an early go/no-go decision, and also to identify problems for later solution design.<ref name="Kimball2008"/>

==How data profiling is conducted==
Data profiling utilizes methods of descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, variation, aggregates such as count and sum, and additional metadata information obtained during data profiling such as data type, length, discrete values, uniqueness, occurrence of null values, typical string patterns, and abstract type recognition.<ref name="Loshin2009"/><ref name="Rahm2000">{{cite journal |first1=Erhard |last1=Rahm |first2=Hong Hai |last2=Do |title=Data Cleaning: Problems and Current Approaches |journal=Bulletin of the Technical Committee on Data Engineering |publisher=IEEE Computer Society |volume=23 |number=4 |date=December 2000}}</ref><ref name="Singh2010">{{cite journal |first1=Ranjit |last1=Singh |first2=Kawaljeet |last2=Singh |display-authors=etal |title=A Descriptive Classification of Causes of Data Quality Problems in Data Warehousing |journal=IJCSI International Journal of Computer Science Issues |volume=7 |issue=3 |series=2 |date=May 2010}}</ref> The metadata can then be used to discover problems such as illegal values, misspellings, missing values, varying value representation, and duplicates.

Different analyses are performed for different structural levels. For example, single columns could be profiled individually to gain an understanding of the frequency distribution of values, the data type, and the use of each column. Embedded value dependencies can be exposed in a cross-column analysis. Finally, overlapping value sets possibly representing foreign-key relationships between entities can be explored in an inter-table analysis.<ref name="Loshin2009"/>

Normally, purpose-built tools are used for data profiling to ease the process.<ref name="Kimball2008"/><ref name="Loshin2009"/><ref name="Rahm2000"/><ref name="Singh2010"/><ref name="Kimball2004">{{cite web |first=Ralph |last=Kimball |date=2004 |title=Kimball Design Tip #59: Surprising Value of Data Profiling |publisher=Kimball Group |url=http://www.kimballgroup.com/wp-content/uploads/2012/05/DT59SurprisingValue.pdf}}</ref><ref name="Olson2003">{{cite book |title=Data Quality: The Accuracy Dimension |url=https://archive.org/details/dataqualityaccur00olso_641 |url-access=limited |first=Jack E. |last=Olson |date=2003 |publisher=Morgan Kaufmann |pages=[https://archive.org/details/dataqualityaccur00olso_641/page/n159 140]–142}}</ref> The computational complexity increases when going from single-column, to single-table, to cross-table structural profiling. Therefore, performance is an evaluation criterion for profiling tools.<ref name="Loshin2003"/>
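As a minimal illustration, the sketch below uses the [[pandas (software)|pandas]] library to compute some of the column-level statistics described above and to run a simple inter-table containment check; the tables, columns, and values are hypothetical placeholders rather than examples taken from the cited sources.

<syntaxhighlight lang="python">
# Minimal data profiling sketch using pandas.
# The DataFrames and column names below are illustrative placeholders;
# a real profiling tool would read them from the source system.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 4],
    "email": ["a@example.com", "b@example.com", None,
              "d@example.com", "d@example.com"],
})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 2, 99],  # 99 has no matching customer row
})

def profile_column(series: pd.Series) -> dict:
    """Collect basic single-column profiling statistics."""
    stats = {
        "inferred_type": str(series.dtype),
        "row_count": len(series),
        "null_count": int(series.isna().sum()),
        "distinct_count": int(series.nunique(dropna=True)),
        "is_unique": bool(series.dropna().is_unique),  # candidate-key indicator
        "most_frequent": series.mode(dropna=True).tolist()[:3],
    }
    if pd.api.types.is_numeric_dtype(series):
        stats.update({
            "min": series.min(),
            "max": series.max(),
            "mean": series.mean(),
            "std_dev": series.std(),
        })
    return stats

# Single-column profiling: one statistics record per column.
for table_name, table in {"customers": customers, "orders": orders}.items():
    for column in table.columns:
        print(table_name, column, profile_column(table[column]))

# Inter-table analysis: what share of orders.customer_id values also occur
# in customers.customer_id? A ratio of 1.0 is consistent with an inclusion
# dependency and marks the column pair as a foreign-key candidate.
containment = orders["customer_id"].isin(customers["customer_id"]).mean()
print(f"orders.customer_id in customers.customer_id: {containment:.2f}")
</syntaxhighlight>

In this hypothetical example, the duplicated and missing values, the non-unique candidate key, and the containment ratio below 1.0 are the kinds of findings a profiling run would report back to the project team.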
==When is data profiling conducted?==
According to Kimball,<ref name="Kimball2008"/> data profiling is performed several times and with varying intensity throughout the data warehouse development process. A light profiling assessment should be undertaken immediately after candidate source systems have been identified and the DW/BI business requirements have been established. The purpose of this initial analysis is to clarify at an early stage whether the correct data is available at the appropriate level of detail and whether anomalies can be handled subsequently. If this is not the case, the project may be terminated.<ref name="Kimball2008"/>

More in-depth profiling is done prior to the dimensional modeling process in order to assess what is required to convert data into a dimensional model. Detailed profiling extends into the ETL system design process in order to determine the appropriate data to extract and which filters to apply to the data set.<ref name="Kimball2008"/>

Additionally, data profiling may be conducted in the data warehouse development process after data has been loaded into staging, the data marts, etc. Conducting profiling at these stages helps ensure that data cleaning and transformations have been performed correctly and in compliance with requirements.

==Benefits and examples==
The benefits of data profiling include improved data quality, a shorter implementation cycle for major projects, and improved understanding of the data by its users.<ref name="Olson2003"/> Discovering the business knowledge embedded in the data itself is one of the significant benefits derived from data profiling.<ref name="Loshin2003"/> Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.<ref name="Olson2003"/>

==See also==
* [[Data quality]]
* [[Data governance]]
* [[Master data management]]
* [[Database normalization]]
* [[Data visualization]]
* [[Analysis paralysis]]
* [[Data analysis]]

==References==
{{Reflist}}

{{DEFAULTSORT:Data Profiling}}
[[Category:Data analysis]]
[[Category:Data management]]
[[Category:Data quality]]