Data engineering
{{distinguish|Information Engineering}}
{{Short description|Software engineering approach to designing and developing information systems}}
{{Use mdy dates|date=August 2021}}
'''Data engineering''' refers to the building of [[System|systems]] to enable the collection and usage of [[data]]. This data is usually used to enable subsequent [[data analytics|analysis]] and [[data science]], which often involves [[machine learning]].<ref name="whatis1">{{cite web |title=What is Data Engineering? {{!}} A Quick Glance of Data Engineering |url=https://www.educba.com/what-is-data-engineering/ |website=EDUCBA |access-date=31 July 2022 |date=5 January 2020}}</ref><ref name="whatis2">{{cite web |title=Introduction to Data Engineering |url=https://www.dremio.com/resources/guides/intro-data-engineering/ |website=Dremio |access-date=31 July 2022}}</ref> Making the data usable usually involves substantial [[computer|compute]] and [[computer data storage|storage]], as well as [[data processing]].

== History ==
In the 1970s and 1980s, the term '''information engineering methodology''' (IEM) was created to describe [[database design]] and the use of [[software]] for data analysis and processing.<ref name="hist1">{{cite web |last1=Black |first1=Nathan |title=What is Data Engineering and Why Is It So Important? |url=https://quanthub.com/what-is-data-engineering/ |website=QuantHub |access-date=31 July 2022 |date=15 January 2020}}</ref> These techniques were intended to be used by [[database administrator]]s (DBAs) and by [[systems analyst]]s based upon an understanding of the operational processing needs of organizations for the 1980s. In particular, these techniques were meant to help bridge the gap between strategic business planning and information systems.
A key early contributor (often called the "father" of information engineering methodology) was the Australian [[Clive Finkelstein]], who wrote several articles about it between 1976 and 1980, and also co-authored an influential [[Savant Institute]] report on it with James Martin.<ref>"Information engineering", [https://books.google.com/books?id=U2Da-O9RAgIC&pg=PA29 part 3], [https://books.google.com/books?id=aMrnCDJzb9MC&pg=RA1-PA1 part 4], [https://books.google.com/books?id=Ux9iw6tMs6MC&pg=PA32 part 5], [https://books.google.com/books?id=dPLZ7QidjbEC&pg=RA1-PA1 part 6], by Clive Finkelstein. In ''Computerworld, In depths, appendix.'' May 25 – June 15, 1981.</ref><ref>Christopher Allen, Simon Chatwin, Catherine Creary (2003). ''Introduction to Relational Databases and SQL Programming.''</ref><ref>[[Terry Halpin]], [[Tony Morgan (computer scientist)|Tony Morgan]] (2010). ''Information Modeling and Relational Databases.'' p. 343</ref> Over the next few years, Finkelstein continued work in a more business-driven direction, which was intended to address a rapidly changing business environment; Martin continued work in a more data processing-driven direction. From 1983 to 1987, Charles M. Richter, guided by Clive Finkelstein, played a significant role in revamping IEM as well as helping to design the IEM software product (user data), which helped automate IEM.

In the early 2000s, data and data tooling were generally held by the [[information technology]] (IT) teams in most companies.<ref name="hist2">{{cite web |last1=Dodds |first1=Eric |title=The History of the Data Engineering and the Megatrends |url=https://www.rudderstack.com/blog/the-data-engineering-megatrend-a-brief-history |website=Rudderstack |access-date=31 July 2022}}</ref> Other teams then used data for their work (e.g. reporting), and there was usually little overlap in data skillset between these parts of the business.
In the early 2010s, with the rise of the [[internet]], the massive increase in data volumes, velocity, and variety led to the term [[big data]] to describe the data itself, and data-driven tech companies like [[Facebook]] and [[Airbnb]] started using the phrase '''data engineer'''.<ref name="hist1" /><ref name="hist2" /> Due to the new scale of the data, major firms like [[Google]], Facebook, [[Amazon (company)|Amazon]], [[Apple Inc.|Apple]], [[Microsoft]], and [[Netflix]] started to move away from traditional [[Extract transform load|ETL]] and storage techniques. They started creating '''data engineering''', a type of [[software engineering]] focused on data, and in particular [[data infrastructure|infrastructure]], [[data warehouse|warehousing]], [[Information privacy|data protection]], [[cybersecurity]], [[data mining|mining]], [[data modelling|modelling]], [[data processing|processing]], and [[metadata]] management.<ref name="hist1" /><ref name="hist2" /> This change in approach was particularly focused on [[cloud computing]].<ref name="hist2" /> Data started to be handled and used by many parts of the business, such as [[sales]] and [[marketing]], and not just IT.<ref name="hist2" />

== Tools ==

=== Compute ===
High-performance computing is critical for the processing and analysis of data.
One particularly widespread approach to computing for data engineering is [[dataflow programming]], in which the computation is represented as a [[directed graph]] (dataflow graph); nodes are the operations, and edges represent the flow of data.<ref name="sigops">{{cite web |last1=Schwarzkopf |first1=Malte |title=The Remarkable Utility of Dataflow Computing |url=https://www.sigops.org/2020/the-remarkable-utility-of-dataflow-computing/ |website=ACM SIGOPS |access-date=31 July 2022 |date=7 March 2020}}</ref> Popular implementations include [[Apache Spark]], and the [[deep learning]] specific [[TensorFlow]].<ref name="sigops" /><ref name="sparkpaper">{{cite web |url=https://cs.stanford.edu/~matei/papers/2016/cacm_apache_spark.pdf |access-date=31 July 2022|title=sparkpaper}}</ref><ref name="tensorflow paper">{{cite web |last1=Abadi |first1=Martin |last2=Barham |first2=Paul |last3=Chen |first3=Jianmin |last4=Chen |first4=Zhifeng |last5=Davis |first5=Andy |last6=Dean |first6=Jeffrey |last7=Devin |first7=Matthieu |last8=Ghemawat |first8=Sanjay |last9=Irving |first9=Geoffrey |last10=Isard |first10=Michael |last11=Kudlur |first11=Manjunath |last12=Levenberg |first12=Josh |last13=Monga |first13=Rajat |last14=Moore |first14=Sherry |last15=Murray |first15=Derek G. 
|last16=Steiner |first16=Benoit |last17=Tucker |first17=Paul |last18=Vasudevan |first18=Vijay |last19=Warden |first19=Pete |last20=Wicke |first20=Martin |last21=Yu |first21=Yuan |last22=Zheng |first22=Xiaoqiang |title=TensorFlow: A system for large-scale machine learning |url=https://research.google/pubs/pub45381/ |website=12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) |access-date=31 July 2022 |pages=265–283 |date=2016}}</ref> More recent implementations, such as [[Differential Dataflow|Differential]]/[[Timely Dataflow|Timely]] Dataflow, have used [[incremental computing]] for much more efficient data processing.<ref name="sigops" /><ref name="differential-paper">{{cite web |last1=McSherry |first1=Frank |last2=Murray |first2=Derek |last3=Isaacs |first3=Rebecca |last4=Isard |first4=Michael |title=Differential dataflow |website=[[Microsoft]] |url=https://www.microsoft.com/en-us/research/publication/differential-dataflow/ |access-date=31 July 2022 |date=5 January 2013}}</ref><ref name="differential-github">{{cite web |title=Differential Dataflow |url=https://github.com/TimelyDataflow/differential-dataflow |publisher=Timely Dataflow |access-date=31 July 2022 |date=30 July 2022}}</ref>

=== Storage ===
Data is stored in a variety of ways; one of the key deciding factors is how the data will be used. Data engineers optimize data storage and processing systems to reduce costs, using data compression, partitioning, and archiving.
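As a rough illustration of the partitioning and compression mentioned above, the following minimal Python sketch groups records by a date field and writes each group to its own gzip-compressed JSON Lines file. The function names (<code>write_partitioned</code>, <code>read_partition</code>), the <code>event_date</code> field, the file name, and the <code>key=value</code> directory layout are all assumptions made for this example and do not describe any particular product.

```python
import gzip
import json
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(records, root, key="event_date"):
    """Group records by `key` and write one gzip-compressed JSON Lines file per group."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)

    written = {}
    for value, rows in partitions.items():
        # Hypothetical layout: one directory per partition value, e.g. event_date=2022-07-30/
        part_dir = Path(root) / f"{key}={value}"
        part_dir.mkdir(parents=True, exist_ok=True)
        path = part_dir / "part-0000.jsonl.gz"
        # gzip.open in text mode compresses the data as it is written
        with gzip.open(path, "wt", encoding="utf-8") as handle:
            for row in rows:
                handle.write(json.dumps(row) + "\n")
        written[value] = path
    return written


def read_partition(root, value, key="event_date"):
    """Read a single partition back without opening any other partition's file."""
    path = Path(root) / f"{key}={value}" / "part-0000.jsonl.gz"
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle]


if __name__ == "__main__":
    events = [
        {"event_date": "2022-07-30", "user": "a"},
        {"event_date": "2022-07-31", "user": "b"},
        {"event_date": "2022-07-30", "user": "c"},
    ]
    with tempfile.TemporaryDirectory() as root:
        write_partitioned(events, root)
        print(read_partition(root, "2022-07-30"))
```

Because each partition lives in its own file, a query for a single day only has to decompress that day's data, which is one way such a layout can reduce processing cost.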
==== Databases ====
If the data is structured and some form of [[online transaction processing]] is required, then [[databases]] are generally used.<ref name="mit">{{cite web |title=Lecture Notes {{!}} Database Systems {{!}} Electrical Engineering and Computer Science {{!}} MIT OpenCourseWare |url=https://ocw.mit.edu/courses/6-830-database-systems-fall-2010/pages/lecture-notes/ |website=ocw.mit.edu |access-date=31 July 2022}}</ref> Originally mostly [[relational database]]s were used, with strong [[ACID]] transaction correctness guarantees; most relational databases use [[SQL]] for their queries. However, with the growth of data in the 2010s, [[NoSQL]] databases have also become popular because they [[Horizontal scaling#Horizontal and vertical scaling|scale horizontally]] more easily than relational databases by giving up the ACID transaction guarantees, and because they reduce the [[object-relational impedance mismatch]].<ref name="leavitt">{{cite journal |last1=Leavitt |first1=Neal |title=Will NoSQL Databases Live Up to Their Promise? |journal=Computer |date=February 2010 |volume=43 |issue=2 |pages=12–14 |doi=10.1109/MC.2010.58 }}</ref> More recently, [[NewSQL]] databases, which attempt to allow horizontal scaling while retaining ACID guarantees, have become popular.<ref name="aslett2012">{{cite web |url=http://cs.brown.edu/courses/cs227/archives/2012/papers/newsql/aslett-newsql.pdf |title=How Will The Database Incumbents Respond To NoSQL And NewSQL? |first=Matthew |last=Aslett |publisher=451 Group |publication-date=April 4, 2011 |year=2011 |access-date=February 22, 2020}}</ref><ref name="sigmodrecord">{{cite conference |first1=Andrew |last1=Pavlo |first2=Matthew |last2=Aslett |title=What's Really New with NewSQL?
|book-title=SIGMOD Record |year=2016 |url=https://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016.pdf |access-date=February 22, 2020}}</ref><ref>{{cite web |url=https://cacm.acm.org/blogs/blog-cacm/109710-new-sql-an-alternative-to-nosql-and-old-sql-for-new-oltp-apps/fulltext |title=NewSQL: An Alternative to NoSQL and Old SQL for New OLTP Apps |first=Michael |last=Stonebraker |publisher=Communications of the ACM Blog |date=June 16, 2011 |access-date=February 22, 2020}}</ref><ref name="high scalability">{{cite web |url=http://highscalability.com/blog/2012/9/24/google-spanners-most-surprising-revelation-nosql-is-out-and.html |title=Google Spanner's Most Surprising Revelation: NoSQL is Out and NewSQL is In |first=Todd |last=Hoff |date=September 24, 2012 |access-date=February 22, 2020}}</ref>

==== Data warehouses ====
{{Main|Data warehouse}}
If the data is structured and [[online analytical processing]] is required (but not online transaction processing), then [[data warehouse]]s are a main choice.<ref name="ibm1">{{cite web |title=What is a Data Warehouse? |url=https://www.ibm.com/cloud/learn/data-warehouse |website=www.ibm.com |access-date=31 July 2022 |language=en-us}}</ref> They enable data analysis, mining, and [[artificial intelligence]] on a much larger scale than databases can allow,<ref name="ibm1"/> and indeed data often flows from databases into data warehouses.<ref name="aws1">{{cite web |title=What is a Data Warehouse? {{!}} Key Concepts {{!}} Amazon Web Services |url=https://aws.amazon.com/data-warehouse/ |website=Amazon Web Services, Inc. |access-date=31 July 2022}}</ref> [[Business analyst]]s, data engineers, and data scientists can access data warehouses using tools such as SQL or [[business intelligence]] software.<ref name="aws1"/>

==== Data lakes ====
A [[data lake]] is a centralized repository for storing, processing, and securing large volumes of data.
A data lake can contain [[structured data]] from [[Relational database|relational databases]], [[semi-structured data]], [[unstructured data]], and [[binary data]]. A data lake can be created on premises or in a cloud-based environment using the services from [[Cloud computing|public cloud]] vendors such as [[Amazon (company)|Amazon]], [[Microsoft]], or [[Google]].

==== Files ====
If the data is less structured, then it is often just stored as [[Computer file|files]]. There are several options:
* [[File system]]s represent data hierarchically in nested folders.<ref name="redhat1">{{cite web |title=File storage, block storage, or object storage? |url=https://www.redhat.com/en/topics/data-storage/file-block-object-storage |website=www.redhat.com |access-date=31 July 2022 |language=en}}</ref>
* [[Block storage]] splits data into regularly sized chunks;<ref name="redhat1" /> this often matches up with (virtual) [[hard drives]] or [[solid state drives]].
* [[Object storage]] manages data using [[metadata]];<ref name="redhat1" /> often each file is assigned a key such as a [[UUID]].<ref name="s3">{{cite web |title=Cloud Object Storage – Amazon S3 – Amazon Web Services |url=https://aws.amazon.com/s3 |website=Amazon Web Services, Inc. |access-date=31 July 2022}}</ref>

=== Management ===
The number and variety of different data processes and storage locations can become overwhelming for users. This inspired the usage of a [[workflow management system]] (e.g.
[[Apache Airflow|Airflow]]) to allow the data tasks to be specified, created, and monitored.<ref name="airflow">{{cite web |title=Home |url=https://airflow.apache.org/ |website=Apache Airflow |access-date=31 July 2022 |language=en}}</ref> The tasks are often specified as a [[directed acyclic graph]] (DAG).<ref name="airflow" />

== Lifecycle ==

=== Business planning ===
Business objectives that executives set for the future are described in strategic business plans, defined in more detail in tactical business plans, and implemented through operational business plans. Most businesses today recognize the fundamental need to develop a business plan that follows this strategy. These plans are often difficult to implement because of a lack of transparency at the tactical and operational levels of organizations. This kind of planning requires feedback to allow early correction of problems caused by miscommunication and misinterpretation of the business plan.

=== Systems design ===
The design of data systems involves several components, such as architecting data platforms and designing data stores.<ref name="course">{{cite web |title=Introduction to Data Engineering |url=https://www.coursera.org/learn/introduction-to-data-engineering |website=Coursera |access-date=31 July 2022 |language=en}}</ref><ref>{{Cite book|title=What are The Phases of Information Engineering|last=Finkelstein|first=Clive}}</ref>

=== Data modeling ===
{{main article|Data modelling}}
This is the process of producing a [[data model]], an [[abstract model]] to describe the data and relationships between different parts of the data.<ref name="model1">{{cite web |title=What is Data Modelling?
Overview, Basic Concepts, and Types in Detail |url=https://www.simplilearn.com/what-is-data-modeling-article |website=Simplilearn.com |access-date=31 July 2022 |date=15 June 2021}}</ref>

== Roles ==

=== Data engineer ===
A '''data engineer''' is a type of software engineer who creates [[big data]] [[Extract, transform, load|ETL]] pipelines to manage the flow of data through the organization. This makes it possible to take huge amounts of data and translate it into [[business intelligence|insights]].<ref>{{cite report |last1=Tamir |first1=Mike |last2=Miller |first2=Steven |last3=Gagliardi |first3=Alessandro |date=11 December 2015 |title=The Data Engineer |ssrn=2762013 }}</ref> They are focused on the production readiness of data and on concerns such as formats, resilience, scaling, and security. Data engineers usually come from a software engineering background and are proficient in programming languages like [[Java (programming language)|Java]], [[Python (programming language)|Python]], [[Scala (programming language)|Scala]], and [[Rust (programming language)|Rust]].<ref>{{Cite web|date=2019-02-07|title=Data Engineer vs.
Data Scientist|url=https://www.springboard.com/blog/data-engineer-vs-data-scientist/|access-date=2021-03-14|website=Springboard Blog|language=en-US}}</ref><ref name="hist1" /> They will be more familiar with databases, architecture, cloud computing, and [[Agile software development]].<ref name="hist1" />

=== Data scientist ===
{{main article|Data science}}
'''Data scientists''' are more focused on the analysis of the data; they will be more familiar with [[mathematics]], [[algorithms]], [[statistics]], and [[machine learning]].<ref name="hist1" /><ref>{{Cite web |date=Jan 5, 2017 |title=What is Data Science and Why it's Important |url=https://www.edureka.co/blog/what-is-data-science/ |publisher=Edureka}}</ref>

==See also==
* [[Big data]]
* [[Information technology]]
* [[Software engineering]]
* [[Computer science]]

==References==
{{Reflist|2}}

==Further reading==
{{refbegin|2}}
* {{cite book |last1=Hares |first1=John S. |title=Information Engineering for the Advanced Practitioner |date=1992 |publisher=Wiley |isbn=978-0-471-92810-2 }}
* {{cite book |last1=Finkelstein |first1=Clive |title=An Introduction to Information Engineering: From Strategic Planning to Information Systems |date=1989 |publisher=Addison-Wesley |isbn=978-0-201-41654-1 }}
* {{cite book |last1=Finkelstein |first1=Clive |title=Information Engineering: Strategic Systems Development |date=1992 |publisher=Addison-Wesley |isbn=978-0-201-50988-5 }}
* Ian Macdonald (1986). "Information engineering". In: ''Information Systems Design Methodologies''. T.W. Olle et al. (ed.). North-Holland.
* Ian Macdonald (1988). "Automating the Information engineering methodology with the Information Engineering Facility". In: ''Computerized Assistance during the Information Systems Life Cycle''. [[T.W. Olle]] et al. (ed.). North-Holland.
* [[James Martin (author)|James Martin]] and [[Clive Finkelstein]]. (1981). ''Information engineering''. Technical Report (2 volumes), Savant Institute, Carnforth, Lancs, UK.
* James Martin (1989). ''Information engineering''. (3 volumes), Prentice-Hall Inc.
* {{cite book |last1=Finkelstein |first1=Clive |title=Enterprise Architecture for Integration: Rapid Delivery Methods and Technologies |date=2006 |publisher=Artech House |isbn=978-1-58053-713-1 }}
* {{cite book |last1=Reis |first1=Joe |last2=Housley |first2=Matt |title=Fundamentals of Data Engineering |date=2022 |publisher=O'Reilly Media |isbn=978-1-0981-0827-4 }}
{{refend}}

==External links==
{{commons category|Information Engineering}}
* [http://www.informatik.uni-bremen.de/uniform/gdpa/methods/m-iem.htm The Complex Method IEM] {{Webarchive|url=https://web.archive.org/web/20190720070308/http://www.informatik.uni-bremen.de/uniform/gdpa/methods/m-iem.htm |date=July 20, 2019 }}
* [https://web.archive.org/web/20060215222446/http://sysdev.ucdavis.edu/WEBADM/document/rad-archapproach.htm Rapid Application Development]
* [http://www.ies.aust.com Enterprise Engineering and Rapid Delivery of Enterprise Architecture]

{{Authority control}}
{{Engineering fields}}

[[Category:Software engineering]]
[[Category:Information systems]]
[[Category:Data management]]
[[Category:Data engineering]]
[[Category:Engineering disciplines]]