Hive Catalog fails to read Spark DataSource tables - incorrectly treats Parquet as SequenceFile #70956
Description
Steps to reproduce the behavior (Required)
1. Create a Spark DataSource table in Spark SQL:

```sql
CREATE TABLE test_spark_datasource_table (
  id INT,
  name STRING
)
USING PARQUET
LOCATION 'hdfs://lan/user/hive/warehouse/test.db/test_table';
```

2. Insert test data into the table in Spark SQL:

```sql
INSERT INTO test_spark_datasource_table VALUES
  (1, 'Alice'),
  (2, 'Bob');
```

3. Query the table through the Hive Catalog in StarRocks:

```sql
SELECT * FROM hive_catalog.test_db.test_spark_datasource_table;
```
Expected behavior (Required)
StarRocks should correctly identify the table as a Parquet table and read the actual schema and data from the Parquet files.
Real behavior (Required)
The query fails with the following error:

```
Failed to open the off-heap table scanner. java exception details: java.io.IOException: Failed to open the hive reader.
	at com.starrocks.hive.reader.HiveScanner.open(HiveScanner.java:218)
Caused by: java.io.IOException: hdfs:/user/hive/warehouse/test.db/test_table/part-00004-b0139440-a3b4-4755-864c-0c1c386da2f8-c000.snappy.parquet not a SequenceFile
	at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:2036)
	at org.apache.hadoop.io.SequenceFile$Reader.initialize(SequenceFile.java:1982)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1931)
	at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1945)
	at org.apache.hadoop.mapred.SequenceFileRecordReader.<init>(SequenceFileRecordReader.java:49)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.getRecordReader(SequenceFileInputFormat.java:64)
	at com.starrocks.hive.reader.HiveScanner.initReader(HiveScanner.java:194)
	at com.starrocks.hive.reader.HiveScanner.open(HiveScanner.java:214)
```
Root Cause:
Spark DataSource tables store their metadata differently from traditional Hive tables. When a table is created with `USING PARQUET`/`ORC`/etc., Spark stores:
- a placeholder SerDe (SequenceFile) in the standard Hive metastore fields
- a placeholder column schema (`cols`)
- the actual file format and schema in `TABLE_PARAMS`, under keys such as `spark.sql.sources.provider` and `spark.sql.sources.schema`

StarRocks currently reads only the standard Hive metastore fields, so it picks up the placeholder SequenceFile input format and fails when the underlying files are Parquet.
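The expected resolution logic can be sketched as follows. This is a minimal illustration, not StarRocks' actual code: the class `SparkTableFormatResolver`, its method names, and the fallback behavior are all hypothetical, but the metastore key `spark.sql.sources.provider` is the real property Spark writes.

```java
import java.util.Map;

// Hypothetical sketch: resolve the effective file format of a table
// whose metastore entry may belong to a Spark DataSource table.
public final class SparkTableFormatResolver {
    // Real key written by Spark into TABLE_PARAMS for DataSource tables.
    static final String PROVIDER_KEY = "spark.sql.sources.provider";

    /**
     * Returns the Spark provider (e.g. "parquet") when the table is a
     * Spark DataSource table; otherwise falls back to the input format
     * recorded in the standard Hive metastore fields.
     */
    public static String resolveFormat(Map<String, String> tableParams,
                                       String metastoreInputFormat) {
        String provider = tableParams.get(PROVIDER_KEY);
        if (provider != null && !provider.isEmpty()) {
            // The metastore SerDe/input format (often SequenceFile) is
            // only a placeholder here; trust the Spark provider instead.
            return provider.toLowerCase();
        }
        return metastoreInputFormat;
    }

    public static void main(String[] args) {
        Map<String, String> sparkTable = Map.of(PROVIDER_KEY, "PARQUET");
        System.out.println(resolveFormat(sparkTable,
            "org.apache.hadoop.mapred.SequenceFileInputFormat"));
        // prints "parquet"
    }
}
```

With logic along these lines, the scanner would select a Parquet reader for the table above instead of instantiating `SequenceFileRecordReader`.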
StarRocks version (Required)
- 3.3.11