Click here to learn more about author Gilad David Maayan.
Is SQL a requisite to be a data scientist? The answer is yes. Data Science has evolved, and while many data scientists still work with CSV files (text files with comma-separated values) these are not the best choice. The Python Panda library allows you to load data from CSV files but these files have a number of constraints. For example, they usually don’t connect to a database, requiring you to generate a new CSV file extract every time you have updated data. In the age of Big Data, this is simply impractical.
Relational databases provide the required support and agility to work with big data repositories. PostgreSQL is one of the leading relational database management systems. Designed especially to work with large datasets, Postgres is a perfect match for data science. In this article, we’ll cover what the pros and cons of using Postgres for Data Science are.
What Is a Data Scientist?
Data science is one of the most sought-after career options in recent years. As organizations manage huge datasets, the need for experts who can extract valuable information from them increases.
Data scientists process Big Data, helping organizations get actionable insights for their data. For example, they can detect a marketing niche or refine the company’s offerings to the latest market trends.
The field of Data Science is as diverse as the data it manipulates, but there are some core skills every beginner data scientist should have.
- Programming Skills: Data scientists use coding to analyze and process information, so having strong programming skills is important. Database and statistical programming languages, such as SQL or R, are a must. The more programming languages data scientists know and the more tools they can apply them to, the better.
- Statistics and Math Skills: Data science requires working with statistical data. You need the right algorithm to extract from the dataset what you want to know. To do this, data scientists need to have a strong grasp of statistics and math concepts.
Technical skills are a must. However, data scientists also need soft skills, such as determination, attention to detail, and analytical ability. Two of the basic skills for a career in data science are curiosity and flexibility. For example, data scientists often need to come up with out-of-the-box algorithms for data problems.
What Is PostgreSQL?
PostgreSQL is a relational database management system (RDBMS). This open-source platform is developed by the PostgreSQL Global Development Group, which is a mix of companies and individual contributors. It is supported by an extensive market for Software-as-a-Service (SaaS) solutions used to run Postgres in the cloud and on-premises.
PostgresSQL’s key features include:
- Free License: The platform is free to download, use, expand, and distribute.
- Complex Queries Support: One of the basic features of Postgres is its ability to process complex queries. Complex queries are requests you make from a database that goes beyond the basic SQL requests of SELECT and WHERE.
- Multiversion Concurrency Control: This feature enables different users to read and write information to a database at the same time.
- User-defined Types: Users can customize functions to define individual data types. Data scientists work with known or unknown data. Combining two or three data types into a new one helps data scientists address complex problems with lots of data coming from disparate sources.
- High Compliance of SQL ISO/IEC 9075 Standard: PostgreSQL meets 150 of the 164 mandatory features required for full compliance to the standard. Compliance with SQL standards has been a priority for Postgres since its inception.
- Strong Community Support: Postgres has a large community of contributors dedicated to expanding and developing the platform. There are extensive support documentation and forums.
- Programming Language Support: PostgresSQL supports all major programming languages such as Python, C, and Java. It also supports NoSQL queries through JSON.
- Multi-environment Support: One of the characteristics of PostgreSQL is its support of
cloud and on-premise environments. This is called multi-environment or cross-environment support. A recent report by RightScale shows most organizations using Postgres chose a hybrid cloud environment, a mix of on-premise and cloud systems.
Pros and Cons of Postgres for Data Science
PostgresSQL supports big data by adding JSON-B for documents and PostGIS for geolocation systems. It allows users to adapt the platform for their workloads.
The platform combines data analytics and transactional capabilities in hybrid transactional/analytical processing (HTAP). This feature enables databases to perform online analytical processing (OLAP) and online transaction processing (OLTP) at the same time. Organizations can use the HTAP technology in Postgres to manage information from IoT devices and other operational applications, for example.
Postgres has become very popular among data scientists because of its flexibility and extensibility. However, it does not store the data in columns, which makes it difficult for large data warehouses to process the data.
As with everything, there are pros and cons of using PostgresSQL for Data Science. Here are some of the advantages and disadvantages:
- SQL Rich: Due to the emphasis on SQL standard compliance, Postgres supports a lot of SQL syntax. That includes common table expressions, table inheritance, and Windows functions.
- Unstructured Data Support: Postgres supports NoSQL data, such as XML, JSON, and HStore.
- Parallel Queries: This feature allows running all cores in a processor at the same time. This is especially important in data science, where often there is a general query running.
- Declarative Partitioning: This capability makes it possible to can specify how to divide a table into pieces called partitions. That enables you, for example, to separate a different partition for each area code when working with large research datasets that are geographically distributed.
- No Compression: Not having enough space can limit the performance of some analyses. Compressing the data helps avoid forming a bottleneck when uploading to the cloud. PostgreSQL does not provide data compression, which makes uploading the data much slower.
- No Columns: Analytic databases usually store data in columns instead of rows. Postgres’ lack of columnar tables makes data ingestion difficult. For example, if you have a table with 300 columns you will need to break it into two or more tables to read it.
- No Built-in Machine Learning: This is probably the biggest drawback, as machine learning is essential to handling big data. Machine learning requires working with huge amounts of data to learn on its own and having this feature built in helps data science to process the huge datasets in an efficient way, with minimal, if any, human intervention. You can add Apache MADLib, an open-source library of database algorithms for machine learning, but sadly, it does not run on Windows. Fortunately, you can overcome this challenge by installing PLP/Python with a sci-kit-learn library to write your own machine learning algorithms directly in Postgres.
Where Can You Learn the Basics?
To learn PostgreSQL, you should start by learning SQL basics. This will give you a strong foundation on which to build your PostgreSQL knowledge. You can learn from free tutorials such as Codecademy. Once you have mastered the basics of SQL, you can start learning how to use it with Postgres.
While learning from the documentation can be the most thorough way, it can be tedious. There are many free and paid PostgreSQL courses online. Some of them include:
- PostgreSQL Tutorial: This free tutorial covers the fundamentals of Postgres, including basic functions.
- Postgres Video Course: This free video course uses Postgres to teach the basics of SQL for beginners.
- Postgres Administration Essentials: This comprehensive training is geared to database administrators and architects. While the courses themselves are free, the program requires a subscription.
- Postgres for Data Engineers: This paid course is geared for data engineers and covers the fundamentals of Postgres with data science examples and cases.
PostgreSQL provides a low-cost, powerful processing solution for Data Science. The biggest issue is that it does not provide data compression. You can work around this challenge by uploading in batches or running the database only in a cloud environment.
Still, a starting data scientist should strongly consider learning PostgreSQL. Postgres can prepare you to work with most data science tools, becoming your swiss-army knife of database knowledge.