Age | Commit message | Author |
|
closes #1674
|
|
Add support for avg row-width and major type statistics.
Parallelize the ANALYZE implementation and stats UDF implementation to improve stats collection performance.
Update/fix rowcount, selectivity and ndv computations to improve plan costing.
Add options for configuring collection/usage of statistics.
Add new APIs and implementation for stats writer (as a precursor to Drill Metastore APIs).
Fix several stats/costing related issues identified while running TPC-H and TPC-DS queries.
Add support for CPU sampling and nested scalar columns.
Add more test cases for collection and usage of statistics and fix remaining unit/functional test failures.
Thanks to Venki Korukanti (@vkorukanti) for the description below (modified to account for new changes). He graciously agreed to rebase the patch to the latest master, fixed a few issues and added a few tests.
FUNCS: Statistics functions as UDFs:
Separate
Currently using FieldReader to ensure consistent output type so that Unpivot doesn't get confused. All stats columns should be Nullable, so that stats functions can return NULL when N/A.
* custom versions of "count" that always return BigInt
* HyperLogLog-based NDV that returns BigInt and works only on VarChars
* HyperLogLog with binary output that only works on VarChars
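As a rough illustration of the technique behind the NDV function above (not Drill's actual UDF code, which builds on a library HyperLogLog implementation; class and method names here are hypothetical), a minimal HyperLogLog sketch in Java might look like this:

```java
// Minimal HyperLogLog sketch: 2^B registers, each holding the maximum
// "rank" (leading-zero count + 1) observed for hashes routed to it.
public class HllDemo {
    private static final int B = 14;          // register index bits
    private static final int M = 1 << B;      // number of registers
    private final byte[] registers = new byte[M];

    // splitmix64 finalizer: a cheap, well-mixing 64-bit hash.
    static long hash(long x) {
        x += 0x9E3779B97F4A7C15L;
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    public void add(long value) {
        long h = hash(value);
        int idx = (int) (h >>> (64 - B));             // top B bits pick a register
        long w = h << B;                              // remaining bits
        int rank = Long.numberOfLeadingZeros(w) + 1;  // position of first 1-bit
        if (rank > registers[idx]) registers[idx] = (byte) rank;
    }

    public long estimate() {
        double alpha = 0.7213 / (1 + 1.079 / M);      // bias-correction constant
        double sum = 0;
        int zeros = 0;
        for (byte r : registers) {
            sum += Math.pow(2, -r);
            if (r == 0) zeros++;
        }
        double e = alpha * M * M / sum;
        // Linear-counting correction for small cardinalities.
        if (e <= 2.5 * M && zeros > 0) {
            e = M * Math.log((double) M / zeros);
        }
        return Math.round(e);
    }

    public static void main(String[] args) {
        HllDemo hll = new HllDemo();
        for (long i = 0; i < 1000; i++) hll.add(i);   // 1000 distinct values
        System.out.println("estimated NDV = " + hll.estimate());
    }
}
```

The appeal for ANALYZE is that such a sketch uses constant memory and its binary form can be merged across fragments, which matches the StatisticsMerge operator described below.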
OPS: Updated protobufs for new ops
OPS: Implemented StatisticsMerge
OPS: Implemented StatisticsUnpivot
ANALYZE: AnalyzeTable functionality
* JavaCC syntax more-or-less copied from LucidDB.
* (Basic) AnalyzePrule: DrillAnalyzeRel -> UnpivotPrel StatsMergePrel FilterPrel(for sampling) StatsAggPrel ScanPrel
ANALYZE: Add getMetadataTable() to AbstractSchema
USAGE: Change field access in QueryWrapper
USAGE: Add getDrillTable() to DrillScanRelBase and ScanPrel
* since ScanPrel does not inherit from DrillScanRelBase, this requires adding a DrillTable to the constructor
* This is done so that a custom ReflectiveRelMetadataProvider can access the DrillTable associated with Logical/Physical scans.
USAGE: Attach DrillStatsTable to DrillTable.
* DrillStatsTable represents the data scanned from a corresponding ".stats.drill" table
* In order to avoid doing query execution right after the ".stats.drill" table is found, metadata is not actually collected until the MaterializationVisitor is used.
** Currently, the metadata source must be a string (so that a SQL query can be created). Doing this with a table is probably more complicated.
** Query is set up to extract only the most recent statistics results for each column.
closes #729
|
|
closes #1642
- Add output column names to JdbcRecordReader and use them for storing the results, since column names in the result set may differ when aliases aren't specified
|
|
- Remove plugins usage for instantiating test databases and tables
- Replace derby with h2 database
closes #1603
|
|
non-Linux systems
closes #1580
|
|
plugin
closes #1542
|
|
DrillFilterRel
- Fix workspace case insensitivity for JDBC storage plugin
|
|
- Fix RDBMS integration tests (expected decimal output and testCrossSourceMultiFragmentJoin)
- Update libraries versions
- Resolve NPE for empty result
|
|
- adding .circleci/config.yml to the project to launch CircleCI
- custom memory parameters
- usage of CircleCI machine
- excluding "SlowTest" and "UnlikelyTest" groups
- update maven version
- adding the libaio.so library to fix MySQL integration tests
- update com.jcabi:jcabi-mysql-maven-plugin library version
- TODO descriptions for the future enhancements of CircleCI build for Drill
close apache/drill#1493
|
|
execution
closes #1455
|
|
|
|
- Fix compilation errors for new version of Guava.
- Remove usage of deprecated API
- Shade guava and add dependencies to the shaded version
- Ban unshaded package
- Introduce drill-shaded module and move guava-shaded under it
- Add methods to convert shaded guava lists to the unshaded ones
- Add instruction for publishing artifacts to the Apache repository
|
|
closes #1425
|
|
closes #1415
|
|
implicitRIDColumn.
closes #1401
|
|
information.
|
|
- The Storage Plugins Handler service is used at the Drill start-up stage; it updates storage plugin configs from the
storage-plugins-override.conf file. If plugin configs are present in the persistence store, they are updated;
otherwise bootstrap plugins are updated and the resulting configs are loaded into the persistence store. If the enabled
status is absent in the storage-plugins-override.conf file, the last plugin config's enabled status persists.
- The 'drill.exec.storage.action_on_plugins_override_file' boot option is added. This is the action to be
performed on the storage-plugins-override.conf file after successfully updating storage plugin configs.
Possible values are: "none" (default), "rename" and "remove".
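For illustration, a storage-plugins-override.conf enabling a JDBC plugin might look like the following HOCON fragment (the plugin name and connection values are made up; the field names follow the JDBC storage plugin config):

```hocon
"storage": {
  mysql: {
    type: "jdbc",
    driver: "com.mysql.cj.jdbc.Driver",
    url: "jdbc:mysql://localhost:3306/db",
    username: "user",
    password: "pass",
    enabled: true
  }
}
```

On start-up the handler merges this over any persisted config for the `mysql` plugin, then applies the `action_on_plugins_override_file` action to the file.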
- The "NULL" issue with updating Hive plugin config by REST is solved. But clients are still being instantiated for disabled
plugins - DRILL-6412.
- The "org.honton.chas.hocon:jackson-dataformat-hocon" library is added for proper deserialization of the HOCON conf file
- Additional refactoring: "com.typesafe:config" and "org.apache.commons:commons-lang3" are placed into the DependencyManagement
block with proper versions; correct properties for metrics are specified in "drill-override-example.conf"
closes #1345
|
|
The operator is missing in the profile protobuf. This commit introduces that.
1. Added protobuf files (incl generated C++ and Java)
2. Updated JdbcSubScan's getOperatorType API
closes #1297
|
|
When viewing a profile for a query against a JDBC source, the visualized plan is not rendered. This is because the generated SQL pushed down to the JDBC source has a line break injected just before the FROM clause.
The workaround is to strip away any injected newlines ('\\n') at least for the SQL defined in the text plan, so that the backend Javascript can render it correctly.
In addition, any single line comments are also removed, but any block comments (i.e. /* .. */ ) are retained as they might carry hints.
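A sketch of that stripping in Java (hypothetical class and method names; the actual fix lives in the profile-rendering code path) could be:

```java
// Strip injected newlines and single-line comments from a SQL text plan
// so the frontend can render it on one line. Block comments (/* .. */)
// are kept, since they may carry hints.
public class PlanTextCleaner {
    static String clean(String sql) {
        return sql
            .replaceAll("--[^\\n]*", " ")  // drop single-line comments
            .replaceAll("\\n+", " ")       // collapse injected line breaks
            .trim()
            .replaceAll(" +", " ");        // normalize spacing
    }

    public static void main(String[] args) {
        String plan = "SELECT a, b -- projected cols\nFROM tbl /* +hint */\nWHERE a > 0";
        System.out.println(clean(plan));   // one line, block comment retained
    }
}
```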
This closes #1295
|
|
AbstractStoragePlugin
closes #1282
|
|
Timestamp types. (#3)
close apache/drill#1247
* DRILL-6242 - Use java.time.Local{Date|Time|DateTime} classes to hold values from corresponding Drill date, time, and timestamp types.
Conflicts:
exec/java-exec/src/main/java/org/apache/drill/exec/vector/complex/fn/ExtendedJsonOutput.java
Fix merge conflicts and check style.
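The java.time mapping can be pictured with a small conversion sketch (a generic example, not the vector code itself; the method names are illustrative). The key point is interpreting stored epoch values as UTC so no local time zone leaks into the Local* values:

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.LocalTime;
import java.time.ZoneOffset;

// Holding Drill DATE / TIME / TIMESTAMP values in
// java.time.Local{Date,Time,DateTime}.
public class TimeMapping {
    static LocalDateTime timestampFromMillis(long epochMillis) {
        // TIMESTAMP vectors store epoch millis; treat them as UTC.
        return Instant.ofEpochMilli(epochMillis).atOffset(ZoneOffset.UTC).toLocalDateTime();
    }

    static LocalDate dateFromDays(long epochDays) {
        // DATE vectors store days since the epoch.
        return LocalDate.ofEpochDay(epochDays);
    }

    static LocalTime timeFromMillis(int millisOfDay) {
        // TIME vectors store millis within the day.
        return LocalTime.ofNanoOfDay(millisOfDay * 1_000_000L);
    }

    public static void main(String[] args) {
        System.out.println(timestampFromMillis(0L));   // 1970-01-01T00:00
        System.out.println(dateFromDays(1L));          // 1970-01-02
        System.out.println(timeFromMillis(3_600_000)); // 01:00
    }
}
```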
|
|
|
|
Add ExprVisitors for VARDECIMAL
Modify writers/readers to support VARDECIMAL
- Added usage of VarDecimal for parquet, hive, maprdb, jdbc;
- Added options to store decimals as int32 and int64 or fixed_len_byte_array or binary;
Add UDFs for VARDECIMAL data type
- modify type inference rules
- remove UDFs for obsolete DECIMAL types
Enable DECIMAL data type by default
Add unit tests for DECIMAL data type
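The int32/int64 versus byte-array storage options can be pictured with BigDecimal's unscaled representation (an illustrative sketch with made-up method names, not the actual Parquet writer code):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// A VARDECIMAL value is logically (unscaled integer, scale). Small
// precisions fit the unscaled value in an int32/int64; larger ones
// need a byte array (fixed_len_byte_array or binary in Parquet terms).
public class DecimalStorage {
    static long asInt64(BigDecimal d) {
        return d.unscaledValue().longValueExact();  // throws if it doesn't fit
    }

    static byte[] asBinary(BigDecimal d) {
        return d.unscaledValue().toByteArray();     // big-endian two's complement
    }

    public static void main(String[] args) {
        BigDecimal d = new BigDecimal("123.45");    // unscaled 12345, scale 2
        System.out.println(asInt64(d));             // 12345
        System.out.println(d.scale());              // 2
        System.out.println(new BigInteger(asBinary(d)));  // round-trips to 12345
    }
}
```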
Fix mapping for NLJ when literal with non-primitive type is used in join conditions
Refresh protobuf C++ source files
Changes in C++ files
Add support for decimal logical type in Avro.
Add support for date, time and timestamp logical types.
Update Avro version to 1.8.2.
|
|
closes #1207
|
|
closes #1198
|
|
1. Overrode serialization methods for instances with passwords
2. Changed file permissions for configuration files
closes #1139
|
|
1. Fixed ser / de issues for Hive, Kafka, Hbase plugins.
2. Added physical plan submission unit test for all storage plugins in contrib module.
3. Refactoring.
closes #1108
|
|
closes #1045
|
|
|
|
change: Add back AbstractConverter in RelSet.java' from Calcite into DRILL
|
|
After the changes made in CALCITE-1056, if a filter has a predicate that is always false, the RelBuilder.filter() method returns a Values rel node instead of a Filter rel node. In order to preserve column types, the DrillRelBuilder.empty() method, whose result the filter method returns, was overridden so that it returns a Filter with a false predicate (its javadoc advises overriding this method). The goal of all other changes in this commit is to use our custom RelBuilder for all rules used in Drill.
|
|
- fixed all compilation errors (main changes were: Maven changes, changes RelNode -> RelRoot, implementing some new methods from updated interfaces, changes to some literals, logger changes);
- fixed unexpected column errors, validation errors and assertion errors after Calcite update;
- fixed describe table/schema statement according to updated logic;
- added fixes with time-intervals;
- changed precision of BINARY to 65536 (was 1048576) according to updated logic (Calcite overrides bigger precision to own maxPrecision);
- ignored some incorrect tests with DRILL-3244;
- changed "Table not found" message to "Object not found within" according to new Calcite changes.
|
|
1. Increased test parallelism and fixed associated bugs
2. Added test categories and categorized tests appropriately
- Don't exclude anything by default
- Increase test timeout
- Fixed flaky test
closes #940
|
|
Unify logback files.
|
|
empty batch.
1. Modify ScanBatch's logic when it iterates its list of RecordReaders.
1) Skip a RecordReader if it returns 0 rows and presents the same schema. A new schema (detected by calling Mutator.isNewSchema()) means either a new top-level field was added, a field in a nested field was added, or an existing field's type was changed.
2) Implicit columns are presumed to have a constant schema, and are added to the outgoing container before any regular column is added.
3) ScanBatch will return NONE directly (called "fast NONE") if all its RecordReaders have empty input and thus are skipped, instead of returning OK_NEW_SCHEMA first.
2. Modify IteratorValidatorBatchIterator to allow
1) fast NONE (before seeing an OK_NEW_SCHEMA)
2) a batch with an empty list of columns.
3. Modify JsonRecordReader when it gets 0 rows: do not insert a nullable-int column for 0-row input. Together with the ScanBatch change, Drill will skip empty JSON files.
4. Modify binary operators such as join and union to handle fast NONE on either one side or both sides. Abstract the logic in AbstractBinaryRecordBatch, except for MergeJoin, as its implementation is quite different from the others.
5. Fix and refactor the union all operator.
1) Correct the union operator's handling of 0 input rows. Previously it would ignore inputs with 0 rows and put nullable-int into the output schema, which caused various schema change issues in downstream operators. The new behavior is to take a 0-row schema into account
when determining the output schema, in the same way as with > 0 input rows. By doing that, we ensure the Union operator will not behave like a schema-lossy operator.
2) Add a UnionInputIterator to simplify the logic of iterating over the left/right inputs, removing a significant chunk of duplicated code from the previous implementation.
The new union all operator is half the code size of the old one.
6. Introduce UntypedNullVector to handle the convertFromJson() function when the input batch contains 0 rows.
Problem: convertFromJSON() differs from other regular functions in that its output schema is known only after evaluation is performed. When the input has 0 rows, Drill essentially has no
way to know the output type, and previously assumed Map type. That worked under the assumption that other operators, like Union, would ignore batches with 0 rows, which is no longer
the case in the current implementation.
Solution: Use MinorType.NULL as the output type for convertFromJSON() when the input contains 0 rows. The new UntypedNullVector is used to represent a column with MinorType.NULL.
7. HBaseGroupScan converts the star column into a list of row_key and column families. HBaseRecordReader should reject the star column, since it expects star to have been converted somewhere else.
In HBase a column family always has map type, and a non-rowkey column always has nullable varbinary type; this ensures that HBaseRecordReaders across different HBase regions will have the same top-level schema, even if a region is
empty or prunes all its rows due to filter pushdown optimization. In other words, we will not see different top-level schemas from different HBaseRecordReaders for the same table.
However, this change cannot handle hard schema change: c1 exists in cf1 in one region, but not in another. Further work is required to handle hard schema change.
8. Modify scan cost estimation when the query involves the * column. This removes planning randomness, since previously two different operators could have the same cost.
9. Add a new flag 'outputProj' to the Project operator, to indicate whether the Project is for the query's final output. Such a Project is added by TopProjectVisitor, to handle fast NONE when all the inputs to the query are empty
and are skipped.
1) the star column is replaced with an empty list
2) a regular column reference is replaced with a nullable-int column
3) an expression goes through ExpressionTreeMaterializer, and the type of the materialized expression is used as the output type
4) return OK_NEW_SCHEMA with the schema built by the above logic, then return NONE to the downstream operator.
10. Add unit tests for operators handling empty input.
11. Add unit tests for queries whose inputs are all empty.
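The "fast NONE" rule from the scan changes above can be sketched as a schematic simulation (illustrative names, not Drill's real ScanBatch API): if every reader is empty and introduces no new schema, the scan returns NONE without first returning OK_NEW_SCHEMA.

```java
import java.util.Arrays;
import java.util.List;

public class FastNoneDemo {
    enum Outcome { OK_NEW_SCHEMA, NONE }

    static class Reader {
        final int rowCount;
        final boolean newSchema;
        Reader(int rowCount, boolean newSchema) { this.rowCount = rowCount; this.newSchema = newSchema; }
    }

    static Outcome firstOutcome(List<Reader> readers) {
        for (Reader r : readers) {
            // Skip readers that contribute no rows and no schema change.
            if (r.rowCount == 0 && !r.newSchema) continue;
            return Outcome.OK_NEW_SCHEMA;  // something to report downstream
        }
        return Outcome.NONE;               // all readers skipped: "fast NONE"
    }

    public static void main(String[] args) {
        System.out.println(firstOutcome(Arrays.asList(new Reader(0, false), new Reader(0, false))));
        System.out.println(firstOutcome(Arrays.asList(new Reader(0, false), new Reader(5, true))));
    }
}
```

Downstream operators (union, join, validators) then have to tolerate receiving NONE before any OK_NEW_SCHEMA, which is what the remaining items in the list address.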
DRILL-5546: Revise code based on review comments.
Handle implicit columns in ScanBatch. Change the interface of ScanBatch's constructor.
1) Ensure that either the implicit column list is empty, or all the readers have the same set of implicit columns.
2) We can skip the implicit columns when checking whether there is a schema change coming from the record reader.
3) ScanBatch accepts a list instead of an iterator, since we may need to go through the implicit column list multiple times and verify that the sizes of the two lists are the same.
ScanBatch code review comments. Add more unit tests.
Share code path in ProjectBatch to handle normal setupNewSchema() and handleNullInput().
- Move SimpleRecordBatch out of TopNBatch to make it sharable across different places.
- Add Unit test verify schema for star column query against multilevel tables.
Unit test framework change
- Fix memory leak in unit test framework.
- Allow SchemaTestBuilder to pass in BatchSchema.
close #906
|
|
main code changes are in Calcite library.
update drill's calcite version to 1.4.0-drill-r20.
close #793
|
|
This closes #326.
|
|
for planning purpose
Also move Hive partition pruning rules to logical storage plugin rulesets.
this closes #300
|
|
avoid noise in logs
This closes #281
|
|
This commit adds integration tests for the JDBC plugin with MySQL. It
also refactors the existing Derby tests to have the same general pattern
as the MySQL tests: data is defined in an external .sql file and maven
is used to start/stop external resources for testing.
Add tests for ENUM and YEAR types.
Tests for the CLOB type with Derby.
This closes #251
|
|
are unsupported.
This closes #240
|
|
|
|
This closes #225
|
|
Makes classpath scanning a build-time class discovery
Makes the fmpp generation incremental
Removes some slowness in DrillBit closing
Reduces the build time by 30%
This closes #148
|
|
Fixes issues with bit, date, time and timestamp types in MySQL.
|
|
- add extends AutoCloseable to RecordReader, and rename cleanup() to close().
- fix many warnings
- formatting fixes
DRILL-1942-readers:
- renamed cleanup() to close() in the new JdbcRecordReader
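The AutoCloseable change means readers can participate in try-with-resources. A generic sketch (hypothetical interface and helper, not the actual Drill RecordReader):

```java
// With RecordReader extending AutoCloseable and cleanup() renamed to
// close(), callers can rely on try-with-resources for cleanup.
public class CloseableReaderDemo {
    interface RecordReader extends AutoCloseable {
        int next();              // returns number of rows read in this batch
        @Override void close();  // narrowed: no checked exception
    }

    static int drain(RecordReader reader) {
        int total = 0;
        try (RecordReader r = reader) {  // close() runs even if next() throws
            int n;
            while ((n = r.next()) > 0) total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        final int[] closed = {0};
        RecordReader r = new RecordReader() {
            int batches = 2;
            public int next() { return batches-- > 0 ? 10 : 0; }
            public void close() { closed[0]++; }
        };
        System.out.println(drain(r) + " rows, closed " + closed[0] + " time(s)");
    }
}
```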
Close apache/drill#154
|
|
- Move to leverage Calcite's JDBC adapter capabilities for pushdowns, schema, etc.
- Add test cases using Derby
|