John Snow Labs Now Provides Curated Data in Apache Parquet Format

by Angela Guess

A recent press release from the company reports, “John Snow Labs now delivers all datasets in Apache Parquet format. The new format drastically accelerates queries on common benchmarks. It also reduces disk space, bandwidth, and CPU usage. It is available alongside the existing CSV and JSON data formats and can be found on all subscriptions. Apache Parquet is an efficient, general-purpose columnar file format. It is self-describing and language-independent, and it supports multiple compression algorithms, partitioning for big data sets, and nested data structures. John Snow Labs is the first to deliver a data repository in Parquet format in the healthcare space, which is experiencing fast-growing adoption of big data analytics technologies.”

The release goes on, “Parquet was designed for Apache Hadoop and has been adopted by Apache Spark, Cloudera Impala, Hive, Presto and Apache Drill. The majority of big data analytics platforms now recommend it as the most efficient, highest-performing data format. Here are recent publicly available benchmarks: (1) IBM evaluated multiple data formats for Spark SQL and showed Parquet to be: 11 times faster than querying text files; 75% smaller in data storage thanks to built-in compression; the only format to query large files (1 TB in size) with no errors; higher in scan throughput on Spark 1.6. (2) Cloudera examined different queries and discovered that Parquet was: 2 to 15 times faster than Avro, and far faster than CSV; 72% smaller on a wide table and 25% smaller on a narrow table.”
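
For readers unfamiliar with the term, the "columnar" layout the release describes can be illustrated with a toy sketch in Python. This is not Parquet's actual on-disk format and does not use any Parquet library. The records, field names, and the `to_columns` helper are hypothetical, made up for illustration only. The sketch pivots row records into per-column lists, reads a single column without touching the others, and compresses both layouts with the standard-library `zlib` so the sizes can be compared.

```python
import zlib

# Hypothetical sample records; not taken from any John Snow Labs dataset.
rows = [
    {"patient_id": i, "state": "NY" if i % 2 else "CA", "age": 30 + (i % 5)}
    for i in range(1000)
]


def to_columns(records):
    """Pivot a list of row dicts into a dict mapping column name to values."""
    names = list(records[0].keys())
    return {name: [rec[name] for rec in records] for name in names}


columns = to_columns(rows)

# With a columnar layout, a query on one field reads only that column.
ages = columns["age"]
average_age = sum(ages) / len(ages)

# Serialize both layouts as plain text: one line per row vs. one line per column.
row_text = "\n".join(
    f"{r['patient_id']},{r['state']},{r['age']}" for r in rows
)
col_text = "\n".join(
    ",".join(str(v) for v in values) for values in columns.values()
)

# Compress each layout; the resulting sizes depend entirely on the data.
row_size = len(zlib.compress(row_text.encode("utf-8")))
col_size = len(zlib.compress(col_text.encode("utf-8")))

print(f"row-oriented: {row_size} bytes, column-oriented: {col_size} bytes")
print(f"average age: {average_age}")
```

The sizes printed here vary with the sample data and say nothing about the IBM or Cloudera figures quoted above. The sketch only shows the general idea: similar values sit next to each other in a columnar layout, and a query can skip the columns it does not need.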
