以下の内容はhttps://kakakakakku.hatenablog.com/entry/2025/07/28/081738より取得しました。


DuckDB で Hugging Face のデータセットにクエリを実行する

DuckDB で hf:// パスを使うと Hugging Face のデータセット (Parquet, CSV, JSON, JSONL など) にクエリを実行できる🦆DuckDB のアナウンスを探したら2024年5月にリリースされていた❗️これは知らなかった.さっそく試していく \( 'ω')/

duckdb.org

Hugging Face 側のドキュメントは以下にある.

huggingface.co

DuckDB で指定する URL 形式は以下と書いてあった.

hf://datasets/{my-username}/{my-dataset}/{path_to_file} 

Hugging Face Datasets

huggingface.co

試す: cais/mmlu データセット📊

まずは cais/mmlu データセット(多分野の多肢選択式問題集)にクエリを実行する❗️

huggingface.co

件数を取得する

⚫◗ SELECT COUNT(*)
    FROM 'hf://datasets/cais/mmlu/all/test-*.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    14042     │
└──────────────┘

subject 別の件数を取得する

⚫◗ SELECT subject, COUNT(*) AS COUNT
    FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
    GROUP BY subject
    ORDER BY COUNT DESC;
┌─────────────────────────────────────┬───────┐
│               subject               │ COUNT │
│               varchar               │ int64 │
├─────────────────────────────────────┼───────┤
│ professional_law                    │  1534 │
│ moral_scenarios                     │   895 │
│ miscellaneous                       │   783 │
│ professional_psychology             │   612 │
│ high_school_psychology              │   545 │
│ high_school_macroeconomics          │   390 │
│ elementary_mathematics              │   378 │
│ moral_disputes                      │   346 │
│ prehistory                          │   324 │
│ philosophy                          │   311 │
│ high_school_biology                 │   310 │
│ nutrition                           │   306 │
│ professional_accounting             │   282 │
│ professional_medicine               │   272 │
│ high_school_mathematics             │   270 │
│ clinical_knowledge                  │   265 │
│ security_studies                    │   245 │
│ high_school_microeconomics          │   238 │
│ high_school_world_history           │   237 │
│ conceptual_physics                  │   235 │
│ marketing                           │   234 │
│ human_aging                         │   223 │
│ high_school_statistics              │   216 │
│ high_school_us_history              │   204 │
│ high_school_chemistry               │   203 │
│ sociology                           │   201 │
│ high_school_geography               │   198 │
│ high_school_government_and_politics │   193 │
│ college_medicine                    │   173 │
│ world_religions                     │   171 │
│ virology                            │   166 │
│ high_school_european_history        │   165 │
│ logical_fallacies                   │   163 │
│ astronomy                           │   152 │
│ high_school_physics                 │   151 │
│ electrical_engineering              │   145 │
│ college_biology                     │   144 │
│ anatomy                             │   135 │
│ human_sexuality                     │   131 │
│ formal_logic                        │   126 │
│ international_law                   │   121 │
│ econometrics                        │   114 │
│ machine_learning                    │   112 │
│ public_relations                    │   110 │
│ jurisprudence                       │   108 │
│ management                          │   103 │
│ college_physics                     │   102 │
│ college_mathematics                 │   100 │
│ college_chemistry                   │   100 │
│ us_foreign_policy                   │   100 │
│ college_computer_science            │   100 │
│ high_school_computer_science        │   100 │
│ abstract_algebra                    │   100 │
│ business_ethics                     │   100 │
│ global_facts                        │   100 │
│ computer_security                   │   100 │
│ medical_genetics                    │   100 │
├─────────────────────────────────────┴───────┤
│ 57 rows                           2 columns │
└─────────────────────────────────────────────┘

college_computer_science subject の件数を取得する

cais/mmlu データセットの場合は subject で絞り込む以外に URL でサブカテゴリを指定することもできる👌

⚫◗ SELECT COUNT(*)
    FROM 'hf://datasets/cais/mmlu/all/test-*.parquet'
    WHERE subject = 'college_computer_science';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     100      │
└──────────────┘

⚫◗ SELECT COUNT(*)
    FROM 'hf://datasets/cais/mmlu/college_computer_science/test-*.parquet';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│     100      │
└──────────────┘

college_computer_science subject の question を10件取得する

question の文字数が長いため DuckDB で line フォーマットを使う.

⚫◗ .mode line
⚫◗ SELECT question
    FROM 'hf://datasets/cais/mmlu/college_computer_science/test-*.parquet'
    LIMIT 10;
question = The access matrix approach to protection has the difficulty that

question = An integer c is a common divisor of two integers x and y if and only if c is a divisor of x and c is a divisor of y. Which of the following sets of integers could possibly be the set of all common divisors of two integers?

question = In the NoNicks operating system, the time required by a single file-read operation has four nonoverlapping components:
disk seek time-25 msec
disk latency time-8 msec
disk transfer time- 1 msec per 1,000 bytes
operating system overhead-1 msec per 1,000 bytes + 10 msec
In version 1 of the system, the file read retrieved blocks of 1,000 bytes. In version 2, the file read (along with the underlying layout on disk) was modified to retrieve blocks of 4,000 bytes. The ratio of-the time required to read a large file under version 2 to the time required to read the same large file under version 1 is approximately

question = You want to cluster 7 points into 3 clusters using the k-Means Clustering algorithm. Suppose after the first iteration, clusters C1, C2 and C3 contain the following two-dimensional points: C1 contains the 2 points: {(0,6), (6,0)} C2 contains the 3 points: {(2,2), (4,4), (6,6)} C3 contains the 2 points: {(5,5), (7,7)} What are the cluster centers computed for these 3 clusters?

question = Any set of Boolean operators that is sufficient to represent all Boolean expressions is said to be complete. Which of the following is NOT complete?

question = Consider the collection of all undirected graphs with 10 nodes and 6 edges. Let M and m, respectively, be the maximum and minimum number of connected components in any graph in the collection. If a graph has no selfloops and there is at most one edge between any pair of nodes, which of the following is true?

question = Resolution theorem proving for showing that a formula of propositional logic is not satisfiable has which of the following properties?
I. It is a sound proof system in the sense that there does not exist a proof of the unsatisfiability of a satisfiable formula of propositional logic.
II. It is a complete proof system in the sense that there is a proof of unsatisfiability for every unsa tisfiable formula of propositional logic.
III. It is a succinct proof system in the sense that whenever an unsatisfiable formula F of propositional logic has a resolution proof, F also has a proof whose length is polynomial in the length of F.

question = Which of the following statements describe(s) properties of a purely segmented memory system?
I. It divides memory into units of equal size.
II. It permits implementation of virtual memory.
III. It suffers from internal fragmentation.

question = Which of the following statements about floating-point arithmetic is NOT true?

question = Suppose sharing of tiles in a multilevel directory structure is achieved with directory entries that are links pointing to a node containing information about a shared file. Information in this node includes (1) the owner of the file, (2) a count of the number of links to the tile, and (3) the disk block numbers of the file. What is a primary drawback to this approach to sharing?

試す: motherduckdb/duckdb-text2sql-25k データセット📊

次に motherduckdb/duckdb-text2sql-25k データセット(プロンプトと DuckDB クエリのペア)にクエリを実行する❗️

huggingface.co

件数を取得する

⚫◗ SELECT COUNT(*)
    FROM 'hf://datasets/motherduckdb/duckdb-text2sql-25k/train.jsonl';
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│    25000     │
└──────────────┘

category 別の件数を取得する

⚫◗ SELECT category, COUNT(*) AS COUNT
    FROM 'hf://datasets/motherduckdb/duckdb-text2sql-25k/train.jsonl'
    GROUP BY category
    ORDER BY COUNT DESC;
┌───────────────────────────────────────┬───────┐
│               category                │ COUNT │
│                varchar                │ int64 │
├───────────────────────────────────────┼───────┤
│ sql/duckdb_table_functions            │  1325 │
│ sql/configuration                     │  1125 │
│ sql/functions/char807 │
│ sql/functions/nested                  │   586 │
│ sql/functions/numeric515 │
│ csv_import                            │   494 │
│ sql/information_schema                │   462 │
│ data/parquet/overview                 │   461 │
│ sql/statements/copy                   │   451 │
│ sql/functions/timestamptz             │   449 │
│ sql/functions/utility                 │   400 │
│ data/csv/overview                     │   372 │
│ sql/samples                           │   370 │
│ sql/functions/timestamp339 │
│ data/multiple_files/overview          │   337 │
│ extensions/sqlite                     │   336 │
│ extensions/spatial                    │   323 │
│ sql/functions/datepart                │   310 │
│ sql/functions/dateformat              │   308 │
│ sql/statements/pivot                  │   304 │
│ sql/introduction                      │   274 │
│ sql/statements/insert                 │   266 │
│ sql/statements/select                 │   265 │
│ extensions/httpfs                     │   261 │
│ sql/functions/date258 │
│ sql/functions/patternmatching         │   251 │
│ guides/import/csv_import              │   244 │
│ sql/aggregates                        │   244 │
│ sql/pragmas                           │   238 │
│ sql/statements/export                 │   237 │
│ sql/query_syntax/from237 │
│ sql/statements/create_table           │   235 │
│ sql/expressions/cast210 │
│ sql/query_syntax/with210 │
│ sql/statements/alter_table            │   206 │
│ sql/functions/bitstring               │   199 │
│ extensions/json                       │   198 │
│ guides/import/parquet_import          │   190 │
│ data/csv/auto_detection               │   187 │
│ sql/data_types/struct                 │   180 │
│ sql/query_syntax/orderby              │   179 │
│ extensions/full_text_search           │   177 │
│ data/parquet/metadata                 │   177 │
│ data/partitioning/hive_partitioning   │   177 │
│ guides/import/excel_export            │   176 │
│ sql/functions/interval                │   173 │
│ sql/statements/create_sequence        │   165 │
│ guides/import/s3_import               │   163 │
│ sql/query_syntax/select               │   162 │
│ guides/meta/explain161 │
│ sql/data_types/timestamp160 │
│ sql/query_syntax/grouping_sets        │   157 │
│ sql/data_types/numeric153 │
│ sql/functions/overview                │   152 │
│ data/insert                           │   152 │
│ sql/data_types/map                    │   150 │
│ guides/meta/summarize                 │   149 │
│ guides/import/s3_export               │   148 │
│ sql/data_types/union147 │
│ sql/statements/unpivot                │   146 │
│ sql/data_types/list                   │   145 │
│ sql/data_types/enum                   │   143 │
│ data/partitioning/partitioned_writes  │   141 │
│ sql/query_syntax/unnest               │   137 │
│ sql/data_types/interval               │   136 │
│ sql/expressions/subqueries            │   135 │
│ data/csv/tips                         │   134 │
│ guides/import/json_import             │   133 │
│ sql/query_syntax/groupby              │   131 │
│ sql/data_types/overview               │   131 │
│ sql/data_types/nulls                  │   128 │
│ sql/data_types/text                   │   125 │
│ sql/statements/attach                 │   121 │
│ sql/expressions/collations            │   121 │
│ sql/statements/create_type            │   120 │
│ guides/meta/list_tables               │   119 │
│ extensions/substrait                  │   118 │
│ sql/indexes                           │   117 │
│ extensions/postgres                   │   115 │
│ sql/window_functions                  │   115 │
│ sql/statements/checkpoint             │   113 │
│ guides/sql_features/asof_join         │   112 │
│ guides/import/query_sqlite            │   110 │
│ extensions/iceberg                    │   110 │
│ guides/import/json_export             │   108 │
│ data/json/overview                    │   104 │
│ extensions/aws                        │   104 │
│ sql/statements/create_macro           │   103 │
│ guides/import/parquet_export          │   103 │
│ sql/query_syntax/filter               │   100 │
│ guides/import/csv_export              │    96 │
│ data/multiple_files/combining_schemas │    96 │
│ sql/query_syntax/setops               │    96 │
│ sql/statements/set94 │
│ data/appender                         │    91 │
│ data/overview                         │    90 │
│ sql/query_syntax/sample               │    87 │
│ sql/expressions/comparison_operators  │    86 │
│ guides/import/http_import             │    86 │
│ sql/query_syntax/values86 │
│ sql/data_types/blob86 │
│ guides/index84 │
│ sql/statements/create_view            │    82 │
│ sql/data_types/boolean79 │
│ sql/expressions/case78 │
│ sql/functions/time                    │    77 │
│ guides/meta/explain_analyze           │    77 │
│ extensions/mysql                      │    76 │
│ sql/expressions/logical_operators     │    76 │
│ sql/statements/update                 │    75 │
│ sql/query_syntax/where73 │
│ sql/query_syntax/having70 │
│ sql/statements/vacuum                 │    68 │
│ extensions/excel                      │    68 │
│ sql/constraints                       │    67 │
│ sql/statements/drop67 │
│ sql/expressions/in64 │
│ sql/statements/create_schema          │    60 │
│ sql/data_types/date58 │
│ guides/import/query_parquet           │    58 │
│ sql/query_syntax/limit                │    55 │
│ sql/data_types/bitstring              │    54 │
│ sql/query_syntax/qualify              │    52 │
│ data/parquet/tips                     │    52 │
│ extensions/inet                       │    50 │
│ extensions/arrow                      │    49 │
│ extensions/overview                   │    48 │
│ guides/import/excel_import            │    46 │
│ sql/expressions/star                  │    46 │
│ guides/meta/describe                  │    45 │
│ extensions/working_with_extensions    │    44 │
│ extensions/tpcds                      │    44 │
│ extensions/azure                      │    41 │
│ sql/statements/delete41 │
│ guides/sql_features/full_text_search  │    39 │
│ sql/statements/call                   │    38 │
│ sql/functions/enum                    │    38 │
│ sql/functions/blob38 │
│ sql/data_types/timezones              │    37 │
│ extensions/tpch                       │    36 │
│ guides/import/query_postgres          │    34 │
│ extensions/official_extensions        │    31 │
│ sql/query_syntax/window               │    31 │
│ sql/expressions/overview              │    27 │
│ sql/case_sensitivity                  │    25 │
│ connect18 │
│ extensions/jemalloc                   │    16 │
│ sql/statements/use                    │    15 │
│ sql/statements/alter_view             │    13 │
│ sql/data_types/time                   │    13 │
│ extensions/icu                        │     8 │
│ extensions/autocomplete               │     5 │
├───────────────────────────────────────┴───────┤
│ 152 rows                            2 columns │
└───────────────────────────────────────────────┘

sql/functions/date category の prompt と query を10件取得する

prompt と query の文字数が長いため DuckDB で line フォーマットを使う.

⚫◗ .mode line
⚫◗ SELECT prompt, query
    FROM 'hf://datasets/motherduckdb/duckdb-text2sql-25k/train.jsonl'
    WHERE category = 'sql/functions/date'
    LIMIT 10;
prompt = Truncate the date April 20, 1992 to the nearest 2-month interval, with an offset of 1 month.
 query = SELECT time_bucket(INTERVAL '2 months', DATE '1992-04-20', INTERVAL '1 month')

prompt = Find the average official ratings per month for all performances.
 query = SELECT time_bucket(INTERVAL '1 month', "Date") as Month, AVG("Official_ratings_(millions)") as Average_Ratings
FROM "performance"
GROUP BY Month
ORDER BY Month;

prompt = How many years have passed since each shop with a finite open date started its operation?
 query = SELECT Shop_Name,
       date_diff('year', Open_Date, CURRENT_DATE) as years_since_opening
FROM shop
WHERE isfinite(Open_Date) = TRUE;

prompt = Generate a SQL query to adjust the StartDate of all policies with the same year in StartDate and EndDate by adding the difference in months between EndDate and StartDate.
 query = SELECT Policy_ID, Start_Date + INTERVAL (EXTRACT(MONTH FROM End_Date) - EXTRACT(MONTH FROM Start_Date)) MONTH AS Adjusted_Start_Date
FROM Policies
WHERE EXTRACT(YEAR FROM Start_Date) = EXTRACT(YEAR FROM End_Date);

prompt = Create a query that returns the start date of the current month.
 query = SELECT make_date(EXTRACT(YEAR FROM current_date), EXTRACT(MONTH FROM current_date), 1) AS start_of_month;

prompt = List the body builder IDs along with their total scores truncated to the month level, sorted by the truncated total in descending order.
 query = SELECT body_builder_id, date_trunc('month', total) as month_total
FROM body_builder
WHERE total IS NOT NULL
ORDER BY month_total DESC;

prompt = For each country, find the minimum between population and area.
 query = SELECT Country_name, least(Country_info.Population, Country_info.Area) as Min_Population_Area
FROM country

prompt = Calculate the number of days between registration and attendance for each student-course pair.
 query = SELECT student_id, course_id, date_diff('day', registration_date, date_of_attendance) as days_between
FROM Student_Course_Attendance sca
JOIN Student_Course_Registrations scr ON sca.student_id = scr.student_id AND sca.course_id = scr.course_id
ORDER BY student_id, course_id, days_between;

prompt = Truncate the date '1992-03-07' to the first day of the month.
 query = SELECT date_trunc('month', make_date(1992, 3, 7));

prompt = Format the birth date of gymnasts as the month and year of their birth.
 query = SELECT name, strftime(Person_Achievements['Birth_Date'], '%B %Y') as Birth_Month_Year
FROM people
WHERE Person_Achievements['Sport'] = 'Gymnastics'

まとめ

DuckDB で hf:// パスを使うと Hugging Face のデータセットにクエリを実行できる🦆

他にも気になるデータセットもあって活用できそう.特に DuckDB の入門コンテンツとして使えそうだ〜💪

huggingface.co

huggingface.co

huggingface.co




以上の内容はhttps://kakakakakku.hatenablog.com/entry/2025/07/28/081738より取得しました。
このページはhttp://font.textar.tv/のウェブフォントを使用してます

不具合報告/要望等はこちらへお願いします。
モバイルやる夫Viewer Ver0.14