The Amazon Science team plans to release the Redset dataset at VLDB 2024 (August 26-30, 2024)
Dataset Introduction
Redset is a dataset containing three months of user query metadata, collected from a selected sample of instances in the AWS Redshift fleet.
Dataset Uses
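The Amazon Science team intends to make this data available during VLDB 2024. The dataset has not been released yet, but judging from its schema and from the decision to publish it during the conference, it seems a reasonable guess that the Redshift development team will present an important paper at VLDB and publish improvement figures measured on real user workloads alongside it.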
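The schema is also worth studying from another angle: it shows how a top-tier database vendor describes the core metrics of a database, and which monitoring dimensions it considers important enough to record. For some time to come, this dataset is likely to serve as a benchmark dataset for optimization work in the database field, so it deserves attention. Dataset link: https://www.selectdataset.com/dataset/1dfe70fc50251057041a91e5a882eb57
Once the dataset is public, anyone interested in the database field may want to take a look right away.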
Dataset Schema
| Column Name | Description |
|---|---|
| instance_id | Uniquely identifies a Redshift cluster |
| cluster_size | Size of the cluster (only available for provisioned) |
| user_id | Identifies the user that issued the query |
| database_id | Identifies the database that was queried |
| query_id | Unique per instance |
| arrival_time | Timestamp when the query arrived on the system |
| compile_duration_ms | Time the query spent compiling in milliseconds |
| queue_duration_ms | Time the query spent queueing in milliseconds |
| execution_duration_ms | Time the query spent executing in milliseconds |
| feature_fingerprint | Hash value of the query fingerprint. A proxy for query-likeness, though not based on text. Will overestimate repetition. |
| was_aborted | Whether the query was aborted during its lifetime |
| was_cached | Whether the query was answered from result cache |
| cache_source_query_id | If query was answered from result cache, this is the query id for the query which populated the cache |
| query_type | Type of query, e.g., select, copy, ... |
| num_permanent_tables_accessed | Number of permanent tables accessed by the query (regular database tables) |
| num_external_tables_accessed | Number of external tables accessed by the query |
| num_system_tables_accessed | Number of system tables accessed by the query |
| read_table_ids | Comma separated list of unique permanent table ids read by the query |
| write_table_ids | Comma separated list of unique table ids written to by the query |
| mbytes_scanned | Total number of megabytes scanned by the query |
| mbytes_spilled | Total number of megabytes spilled by the query |
| num_joins | Number of joins in the query plan |
| num_scans | Number of scans in the query plan |
| num_aggregations | Number of aggregations in the query plan |
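To make the schema more concrete, here is a minimal sketch of how a few of these columns might be combined into workload metrics. It assumes the data arrives as CSV with booleans encoded as 0/1; the actual release format is not known from the schema alone. The sample rows are invented for illustration and are not taken from Redset, and the helper names (`parse_ids`, `load_queries`, `summarize`) are hypothetical.

```python
import csv
import io
from collections import Counter

# Synthetic rows for illustration only; the values are invented, not real Redset data.
# Assumption: CSV layout with 0/1 booleans. The real file format may differ.
SAMPLE_CSV = """instance_id,query_id,query_type,compile_duration_ms,queue_duration_ms,execution_duration_ms,was_cached,feature_fingerprint,read_table_ids,mbytes_scanned
1,100,select,120,0,900,0,abc,"10,11",512
1,101,select,0,5,3,1,abc,"10,11",0
1,102,copy,40,200,4000,0,def,,0
"""


def parse_ids(field):
    """Split a comma separated id list; an empty field means no tables."""
    return [int(part) for part in field.split(",") if part.strip()]


def load_queries(text):
    """Parse CSV text into dicts with typed fields and a derived total latency."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        compile_ms = int(raw["compile_duration_ms"])
        queue_ms = int(raw["queue_duration_ms"])
        exec_ms = int(raw["execution_duration_ms"])
        rows.append({
            "query_id": int(raw["query_id"]),
            "query_type": raw["query_type"],
            # End-to-end latency as the sum of the three duration columns.
            "total_ms": compile_ms + queue_ms + exec_ms,
            "was_cached": raw["was_cached"] == "1",
            "feature_fingerprint": raw["feature_fingerprint"],
            "read_table_ids": parse_ids(raw["read_table_ids"]),
        })
    return rows


def summarize(rows):
    """Compute cache hit rate, repeated-fingerprint fraction, and mean latency."""
    n = len(rows)
    counts = Counter(r["feature_fingerprint"] for r in rows)
    # Queries whose fingerprint occurs more than once. Per the schema notes,
    # the fingerprint will overestimate true repetition.
    repeated = sum(c for c in counts.values() if c > 1)
    return {
        "cache_hit_rate": sum(r["was_cached"] for r in rows) / n,
        "repeat_fraction": repeated / n,
        "mean_total_ms": sum(r["total_ms"] for r in rows) / n,
    }


if __name__ == "__main__":
    print(summarize(load_queries(SAMPLE_CSV)))
```

Summing the compile, queue, and execution durations is only one plausible definition of per-query latency; depending on the analysis, the queueing time may be better studied separately.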