parquet
Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Parquet is available in multiple languages including Java, C++, Python, etc...
A lot of duckdb seems to be oriented around using parquet files. The command line tools for manipulating them don't seem to be great so far.
The best way I've found to head a parquet file so far is to use pandas:
python -c "import pandas; df=pandas.read_parquet('data/playerstats.parquet'); print(df.tail(10).to_csv())"
I print it as csv because then I can display it nicely with tv:
python -c "import pandas; df=pandas.read_parquet('data/playerstats.parquet'); print(df.tail(40).to_csv())" | tv -ea | bat
piping it to bat makes it so that you can scroll through all the columns; if you don't do that tv
will tell you what columns it isn't displaying, but it won't show them.
Neat, if you use nushell, it's just:
/Users/llimllib/code/nbastatsใopen-df playerstats.parquet
โญโโโโโโโฌโโโโโโโโโโโโฌโโโโโโโโโโโโโโฌโโโโโโโโโโโฌโโโโโโโโโโโโโฌโโโโโโฎ
โ # โ PLAYER_ID โ PLAYER_NAME โ NICKNAME โ TEAM_ID โ ... โ
โ โ โ โ โ โ โ
โโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโโโโโโผโโโโโโโโโโโผโโโโโโโโโโโโโผโโโโโโค
โ 0 โ 201985 โ AJ Price โ AJ โ 1610612754 โ ... โ
โ 1 โ 201166 โ Aaron โ Aaron โ 1610612745 โ ... โ
โ โ โ Brooks โ โ โ โ
โ 2 โ 201189 โ Aaron Gray โ Aaron โ 1610612740 โ ... โ
parquet includes a CLI in its project (brew install parquet-cli
). It's a java program though, so naturally it's a terrible CLI citizen. It also appears to be completely undocumented on the parquet site and comes with no man page.
anyway, parquet cat -10 <file>
will print the first 10 rows, and -c
will select columns:
$ parquet cat playerstats.parquet -n 10 -c player_name,team_abbreviation
{"player_name": "AJ Price", "team_abbreviation": "MIN"}
{"player_name": "Aaron Brooks", "team_abbreviation": "DEN"}
{"player_name": "Aaron Gray", "team_abbreviation": "SAC"}
{"player_name": "Adonis Thomas", "team_abbreviation": "PHI"}
{"player_name": "Al Harrington", "team_abbreviation": "WAS"}
{"player_name": "Al Horford", "team_abbreviation": "ATL"}
{"player_name": "Al Jefferson", "team_abbreviation": "CHA"}
{"player_name": "Al-Farouq Aminu", "team_abbreviation": "NOP"}
{"player_name": "Alan Anderson", "team_abbreviation": "BKN"}
{"player_name": "Alec Burks", "team_abbreviation": "UTA"}
Note that normal unix arguments like -n10
instead of -n
will not work.
parquet help cat
will print the documentation for that subcommand, but parquet cat --help
will try to open the --help
file.