CSV kit

A very useful piece of software for working and analysing csv files is csvkit. A few useful functions are the following which have been taken from the above link.

Shows you stats about the file (e.g. col names, type, nulls, min, max, median, std deviation, unique values, most frequent)

$csvstat myfile.csv

Peak at excel file to display in terminal

$in2csv myfile.xlsx
$in2csv myfile.xlsx > myfile.csv  # can write xlsx file to csv by chaining operations

Look at data, formats nicely with borders

$csvlook myfile.csv,

See column names

$csvcut -n myfile.csv

Cut out specific columns

$csvcut -c 2,5,6 myfile.csv  # Cut out columns to just 2,5,6
$csvcut -c col1,col2,col3 myfile.csv  # can also specify by name (make sure not to leave spaces)

Chain commands to pass data along using the pipeline symbol |

$csvcut -c RespondentID,CollectorID,StartDate,EndDate Sheet_1.csv | csvlook | head

Find cells matching a regular expression (e.g. pattern “ddd-123-dddd”)

$csvgrep -c phone_number -r "\d{3}-123-\d{4}" mydata.csv > matchrx.csv

Merge csv files together so you can do aggregate analysis (assuming they have the same headers)

$csvstack firstfile.csv secondfile.csv > combinedfile.csv
# You can add an additional '-g' flag on csvstack to add a 'grouping column' (e.g. it'll say firstfile, secondfile)

Execute a SQL query directly on a CSV file

$csvsql --query "SELECT * FROM mydata WHERE AGE > 30" mydata.csv > query_results.csv

Sort data using csvsort

$csvcut -c StartDate,EndDate myfile.csv | csvsort -c StartDate | csvlook | head  # use -r to reverse order (i.e. desc) for csvsort

# Execute a SQL query by referencing two CSV files and doing a SQL Join
$csvsql --query "select m.usda_id, avg(i.sepal_length) as mean_sepal_length from iris as i join irismeta as m on (i.species = m.species) group by m.species" examples/iris.csv examples/irismeta.csv

Generate a create table statement for your csv data

$csvsql -i sqlite myfile.csv  # Can specify type of db with '-i' flag, other db's include mysql, mssql

Join to csv’s on common column header

$ csvjoin -c fips data.csv acs2012_5yr_population.csv > joined.csv

Automatically create a SQL table and import a CSV into the database (postgresql)

$createdb dbname
$csvsql --db postgresql://username:password@localhost:port/dbname --insert mydata.csv
$psql -q dbname -c "\d mydata"  # shows table, col names, data types
$psql -q dbname -c "SELECT * FROM mydata"

Extract a table from a SQL database into a CSV

sql2csv --db postgresql://username:password@localhost/dbname --query "select * from mydata" > extract.csv
sql2csv --db mssql://username:password@servername/dbname --query "select top 100 * from [dbname].[dbo].[tablename]" > extract.csv

Conduct analysis on csv in python

$ csvpy data.csv
Welcome! "data.csv" has been loaded in a CSVKitReader object named "reader".
>>> print len(list(reader))
1037
>>> quit()
Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s