The demand for skilled data engineers is projected to grow rapidly, and it is no wonder: no matter what your company does, to succeed in today’s competitive environment you need a robust infrastructure to both store and access your company’s data, and you need it from the very beginning.

What exactly does a data engineer do, though? And how does one become a data engineer? In this article, we’re going to talk about this interesting field and how you can become a data engineer.

What Does a Data Engineer Do?

Data engineers create and maintain the analytics infrastructure that enables almost every other function in the data world. They are responsible for the development, construction, maintenance, and testing of architectures such as databases and large-scale processing systems. As part of this, data engineers also create the data set processes used in modeling, mining, acquisition, and verification.

Data engineers are expected to have a solid command of common scripting languages and tools for this purpose, and to use this skill set to constantly improve data quality and quantity by leveraging and improving data analytics systems.

The Difference Between Data Engineer and Data Scientist

While there is a certain amount of overlap in skills and responsibilities, these two positions are increasingly being separated into distinct roles.

Data scientists are much more focused on the interaction with the data infrastructure rather than the building and maintenance thereof. They are often tasked with conducting high-level market and business operation research to identify trends and relations, and as part of this, they use a variety of sophisticated machines and methods to interact with and act upon data.

Data scientists are often well-versed in Machine Learning and advanced statistical modeling, as they are expected to take the raw data and turn it into actionable, understandable content with the help of advanced mathematical models and algorithms. This information is often used as an analysis source to tell the “bigger picture” to the decision makers.

So what makes a data scientist different from a data engineer? Generally speaking, the main difference is one of focus. Data engineers are much more focused on building infrastructure and architecture for data generation; data scientists focus instead on advanced mathematics and statistical analysis of that generated data.

Key Skills for Data Engineers

Here are some of the key skills data engineers need.

Tools and Components of Data Architecture

Since data engineers are much more concerned with analytics infrastructure, most of their required skills are, predictably, architecture-centric.

In-Depth Knowledge of SQL and Other Database Solutions

Data engineers need to understand database management, so in-depth knowledge of SQL is hugely valuable. Likewise, other database solutions, such as Cassandra or Bigtable, are great to know if you plan on doing freelance or for-hire engineering, as not every database is going to be built on a familiar standard.
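As a rough illustration of the kind of SQL a data engineer writes every day, here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, the sample rows, and the top_customers helper are all hypothetical and exist only for this example; a production system would run similar queries against a real database server.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into an in-memory table and rank customers by total spend."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical schema, used only for this illustration.
        conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT customer, SUM(amount) AS total "
            "FROM orders "
            "GROUP BY customer "
            "ORDER BY total DESC, customer ASC "
            "LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Made-up sample data.
sample_orders = [("alice", 30.0), ("bob", 12.5), ("alice", 20.0), ("carol", 45.0)]
print(top_customers(sample_orders))
```

The same GROUP BY and ORDER BY pattern carries over to most SQL databases, although the exact dialect varies from one system to another.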

Data Warehousing and ETL Tools

Data warehousing and ETL (extract, transform, load) experience is essential to this position. Familiarity with data warehousing solutions like Redshift or Panoply, as well as with ETL tools such as StitchData or Segment, is hugely valuable. Experience with data storage and retrieval is equally vital, as the amount of data being dealt with is simply astronomical.
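To make the ETL idea concrete, the following is a minimal sketch of the three steps using only Python's standard library. The CSV layout, the revenue table, and every function name here are assumptions made for illustration; the tools named above handle the same steps at far larger scale and with many more safeguards.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the second record is missing its revenue value.
RAW_CSV = """user_id,country,revenue
1,us,10.50
2,DE,
3,us,4.25
"""


def extract(text):
    """Extract: parse the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(records):
    """Transform: drop incomplete records, normalize country codes, cast types."""
    cleaned = []
    for record in records:
        if not record["revenue"]:
            continue  # skip records with no revenue value
        cleaned.append(
            (int(record["user_id"]), record["country"].upper(), float(record["revenue"]))
        )
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a (hypothetical) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue (user_id INTEGER, country TEXT, revenue REAL)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(text):
    """Run all three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn


warehouse = run_etl(RAW_CSV)
print(warehouse.execute("SELECT country, SUM(revenue) FROM revenue GROUP BY country").fetchall())
```

Keeping the three steps as separate functions is a design choice for this sketch: each one can be tested or replaced on its own, which mirrors how larger pipelines are usually organized.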

Hadoop-Based Analytics (HBase, Hive, MapReduce, etc.)

A strong understanding of Apache Hadoop-based analytics is a very common requirement in this space, with working knowledge of HBase, Hive, and MapReduce often expected.
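For readers new to MapReduce, the programming model can be sketched in a few lines of plain Python. This is not Hadoop code and does not use any Hadoop API; it is a single-process illustration of the map, shuffle, and reduce phases, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence on a list of text lines."""
    mapped = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


print(word_count(["big data", "Big pipelines"]))
```

In a real cluster the map and reduce steps run in parallel across many machines and the framework performs the shuffle, but the division of work is the same.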

Coding

Speaking of solutions, knowledge of coding is a definite plus here (and possibly a requirement for many positions). Familiarity with, if not outright expertise in, Python, C/C++, Java, Perl, Golang, or other such languages is very valuable.

Machine Learning

While it is mainly the focus of data scientists, some understanding of how to act upon this data is also invaluable for data engineers. For this reason, some knowledge of statistical analysis and the basics of data modeling is hugely valuable.

While machine learning is technically the domain of the data scientist, knowledge in this area helps you construct solutions usable by your colleagues. It also has the added benefit of making you extremely marketable in this space, as being able to “put on both hats” makes you a formidable asset.

Various Operating Systems

Finally, intimate knowledge of UNIX, Linux, and Solaris is very helpful, as many math tools are going to be based on these systems due to their unique demands for root access to hardware and operating system functionality above and beyond what Microsoft’s Windows or Mac OS offers.

How Can I Become a Data Engineer?

Data engineering typically requires a more hybrid approach to education than other, more traditional careers. While teachers often have a degree specifically in teaching, data engineers often have a Computer Science or Information Technology degree that is then supplemented with vendor-specific certification programs and training materials.

As such, your degree, while important, is only part of the story; getting the proper certifications can be hugely valuable. There are a few data engineering-specific certifications:

  • Google’s Certified Professional — data engineering. This certification establishes that the student is familiar with data engineering principles and can function as either an associate or a professional in the field.
  • IBM Certified Data Engineer — Big Data. This certification focuses more on Big Data-specific applications of data engineering skill sets rather than general skills but is considered a gold standard by many.
  • CCP Data Engineer from Cloudera. Specific to Cloudera’s solutions, this certification shows the student has experience in ETL tools and analytics.
  • Secondary certifications, such as the MCSE (Microsoft Certified Solutions Expert), cover a wide range of topics but have specific sub-certifications such as MCSE: Data Management and Analytics.

There are, of course, online courses that purport to offer significant training in this field. Udemy offers numerous courses in data engineering and data science, and other sites, such as edX and Memrise, offer similar coursework. Some sites, such as DataCamp, focus specifically on data science and engineering, while others, such as Galvanize, are broader.

While these solutions can help you get your feet wet, so to speak, they come with a caveat: they rarely confer certification, and at best, many offer only their own certificate or diploma. As such, while they are great for general learning, they should not be considered a replacement for actual certification or an accredited diploma.

Hopefully, this piece has illuminated the specific talents, skills, and requirements expected of a data engineer. While the field is rapidly growing, it is fraught with obstacles. Therefore, attaining the best education possible while filling any gaps in skill sets with proper certification is key.
