Data engineering
== Tools ==

=== Compute ===
High-performance computing is critical for the processing and analysis of data. One particularly widespread approach to computing for data engineering is [[dataflow programming]], in which the computation is represented as a [[directed graph]] (dataflow graph); nodes are the operations, and edges represent the flow of data.<ref name="sigops">{{cite web |last1=Schwarzkopf |first1=Malte |title=The Remarkable Utility of Dataflow Computing |url=https://www.sigops.org/2020/the-remarkable-utility-of-dataflow-computing/ |website=ACM SIGOPS |access-date=31 July 2022 |date=7 March 2020}}</ref> Popular implementations include [[Apache Spark]] and the [[deep learning]]-specific [[TensorFlow]].<ref name="sigops" /><ref name="sparkpaper">{{cite web |url=https://cs.stanford.edu/~matei/papers/2016/cacm_apache_spark.pdf |access-date=31 July 2022|title=sparkpaper}}</ref><ref name="tensorflow paper">{{cite web |last1=Abadi |first1=Martin |last2=Barham |first2=Paul |last3=Chen |first3=Jianmin |last4=Chen |first4=Zhifeng |last5=Davis |first5=Andy |last6=Dean |first6=Jeffrey |last7=Devin |first7=Matthieu |last8=Ghemawat |first8=Sanjay |last9=Irving |first9=Geoffrey |last10=Isard |first10=Michael |last11=Kudlur |first11=Manjunath |last12=Levenberg |first12=Josh |last13=Monga |first13=Rajat |last14=Moore |first14=Sherry |last15=Murray |first15=Derek G. |last16=Steiner |first16=Benoit |last17=Tucker |first17=Paul |last18=Vasudevan |first18=Vijay |last19=Warden |first19=Pete |last20=Wicke |first20=Martin |last21=Yu |first21=Yuan |last22=Zheng |first22=Xiaoqiang |title=TensorFlow: A system for large-scale machine learning |url=https://research.google/pubs/pub45381/ |website=12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) |access-date=31 July 2022 |pages=265–283 |date=2016}}</ref> More recent implementations, such as [[Differential Dataflow|Differential]]/[[Timely Dataflow|Timely]] Dataflow, have used [[incremental computing]] for much more efficient data processing.<ref name="sigops" /><ref name="differential-paper">{{cite web |last1=McSherry |first1=Frank |last2=Murray |first2=Derek |last3=Isaacs |first3=Rebecca |last4=Isard |first4=Michael |title=Differential dataflow |website=[[Microsoft]] |url=https://www.microsoft.com/en-us/research/publication/differential-dataflow/ |access-date=31 July 2022 |date=5 January 2013}}</ref><ref name="differential-github">{{cite web |title=Differential Dataflow |url=https://github.com/TimelyDataflow/differential-dataflow |publisher=Timely Dataflow |access-date=31 July 2022 |date=30 July 2022}}</ref>

=== Storage ===
Data is stored in a variety of ways; one of the key deciding factors is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs, using techniques such as data compression, partitioning, and archiving.

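As a minimal illustration of the partitioning and compression mentioned above, the following Python sketch groups records by a key and compresses each group separately. The function names, the record layout, and the choice of gzip-compressed JSON lines are assumptions made for this example only; production systems typically rely on dedicated storage formats and engines.

```python
import gzip
import json
from collections import defaultdict


def partition_and_compress(records, key):
    """Group records by records[key], then gzip each group as JSON lines.

    Hypothetical helper for illustration; not taken from any library.
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    partitions = {}
    for value, rows in groups.items():
        # One JSON document per line, encoded as UTF-8 before compression.
        payload = "\n".join(json.dumps(row, sort_keys=True) for row in rows)
        partitions[value] = gzip.compress(payload.encode("utf-8"))
    return partitions


def read_partition(blob):
    """Decompress one partition and parse its JSON lines back into dicts."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Example data (invented): three events spread over two days.
events = [
    {"day": "2022-07-30", "value": 1},
    {"day": "2022-07-31", "value": 2},
    {"day": "2022-07-30", "value": 3},
]
by_day = partition_and_compress(events, "day")
```

Partitioning by a frequently filtered field (here, the assumed <code>day</code> field) means that a query for a single day only needs to decompress and read one partition rather than the full data set.
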
==== Databases ====
If the data is structured and some form of [[online transaction processing]] is required, then [[databases]] are generally used.<ref name="mit">{{cite web |title=Lecture Notes {{!}} Database Systems {{!}} Electrical Engineering and Computer Science {{!}} MIT OpenCourseWare |url=https://ocw.mit.edu/courses/6-830-database-systems-fall-2010/pages/lecture-notes/ |website=ocw.mit.edu |access-date=31 July 2022}}</ref> Originally, mostly [[relational database]]s were used, with strong [[ACID]] transaction correctness guarantees; most relational databases use [[SQL]] for their queries. However, with the growth of data in the 2010s, [[NoSQL]] databases have also become popular, since they [[Horizontal scaling#Horizontal and vertical scaling|scale horizontally]] more easily than relational databases by giving up the ACID transaction guarantees, and they also reduce the [[object-relational impedance mismatch]].<ref name="leavitt">{{cite journal |last1=Leavitt |first1=Neal |title=Will NoSQL Databases Live Up to Their Promise? |journal=Computer |date=February 2010 |volume=43 |issue=2 |pages=12–14 |doi=10.1109/MC.2010.58 }}</ref> More recently, [[NewSQL]] databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.<ref name="aslett2012">{{cite web |url=http://cs.brown.edu/courses/cs227/archives/2012/papers/newsql/aslett-newsql.pdf |title=How Will The Database Incumbents Respond To NoSQL And NewSQL? |first=Matthew |last=Aslett |publisher=451 Group |publication-date=April 4, 2011 |year=2011 |access-date=February 22, 2020}}</ref><ref name="sigmodrecord">{{cite conference |first1=Andrew |last1=Pavlo |first2=Matthew |last2=Aslett |title=What's Really New with NewSQL? |book-title=SIGMOD Record |year=2016 |url=https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf |access-date=February 22, 2020}}</ref><ref>{{cite web |url=https://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext |title=NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps |first=Michael |last=Stonebraker |publisher=Communications of the ACM Blog |date=June 16, 2011 |access-date=February 22, 2020}}</ref><ref name="high scalability">{{cite web |url=http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html |title=Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In |first=Todd |last=Hoff |date=September 24, 2012 |access-date=February 22, 2020}}</ref>

==== Data warehouses ====
{{Main|Data warehouse}}
If the data is structured and [[online analytical processing]] is required (but not online transaction processing), then [[data warehouse]]s are a common choice.<ref name="ibm1">{{cite web |title=What is a Data Warehouse? |url=https://www.ibm.com/cloud/learn/data-warehouse |website=www.ibm.com |access-date=31 July 2022 |language=en-us}}</ref> They enable data analysis, mining, and [[artificial intelligence]] on a much larger scale than databases can allow,<ref name="ibm1"/> and data often flows from databases into data warehouses.<ref name="aws1">{{cite web |title=What is a Data Warehouse? {{!}} Key Concepts {{!}} Amazon Web Services |url=https://aws.amazon.com/data-warehouse/ |website=Amazon Web Services, Inc. |access-date=31 July 2022}}</ref> [[Business analyst]]s, data engineers, and data scientists can access data warehouses using tools such as SQL or [[business intelligence]] software.<ref name="aws1"/>

==== Data lakes ====
A [[data lake]] is a centralized repository for storing, processing, and securing large volumes of data.
A data lake can contain [[structured data]] from [[Relational database|relational databases]], [[semi-structured data]], [[unstructured data]], and [[binary data]]. It can be created on premises or in a cloud-based environment using services from [[Cloud computing|public cloud]] vendors such as [[Amazon (company)|Amazon]], [[Microsoft]], or [[Google]].

==== Files ====
If the data is less structured, it is often simply stored as [[Computer file|files]]. There are several options:
* [[File system]]s represent data hierarchically in nested folders.<ref name="redhat1">{{cite web |title=File storage, block storage, or object storage? |url=https://www.redhat.com/en/topics/data-storage/file-block-object-storage |website=www.redhat.com |access-date=31 July 2022 |language=en}}</ref>
* [[Block storage]] splits data into regularly sized chunks;<ref name="redhat1" /> this often matches up with (virtual) [[hard drives]] or [[solid state drives]].
* [[Object storage]] manages data using [[metadata]];<ref name="redhat1" /> often each file is assigned a key such as a [[UUID]].<ref name="s3">{{cite web |title=Cloud Object Storage – Amazon S3 – Amazon Web Services |url=https://aws.amazon.com/s3 |website=Amazon Web Services, Inc. |access-date=31 July 2022}}</ref>

=== Management ===
The number and variety of different data processes and storage locations can become overwhelming for users. This motivated the use of [[workflow management system]]s (e.g. [[Apache Airflow|Airflow]]), which allow data tasks to be specified, created, and monitored.<ref name="airflow">{{cite web |title=Home |url=https://airflow.apache.org/ |website=Apache Airflow |access-date=31 July 2022 |language=en}}</ref> The tasks are often specified as a [[directed acyclic graph|directed acyclic graph (DAG)]].<ref name="airflow" />
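
The idea of specifying tasks as a DAG can be sketched with Python's standard-library <code>graphlib</code> module (Python 3.9 or later). This is not the Airflow API; the task names and the <code>run_order</code> helper are hypothetical and only illustrate how a valid execution order follows from declared dependencies.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "load_warehouse": {"clean"},
    "report": {"load_warehouse"},
}


def run_order(dag):
    """Return the tasks in an order that respects every dependency.

    Raises graphlib.CycleError if the graph contains a cycle, in which
    case it is not a DAG and no valid order exists.
    """
    return list(TopologicalSorter(dag).static_order())


order = run_order(pipeline)


def has_cycle(dag):
    """Report whether the dependency graph fails to be acyclic."""
    try:
        run_order(dag)
    except CycleError:
        return True
    return False
```

Because every task here depends on exactly one predecessor, the only valid order is <code>extract</code>, <code>clean</code>, <code>load_warehouse</code>, <code>report</code>. A workflow management system would additionally handle scheduling, retries, and monitoring, which this sketch omits.
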