CSV in Kotlin with Casting CSV

It all started with a small task to store a list of data records from an Android app in a simple file format that could easily be viewed and analyzed on any other platform. So I thought CSV would be the way to go. Knowing Gson, I was expecting there to be a simple CSV library that would make it as easy to read and write CSV as it is to handle JSON with Gson: Just define a data model, throw the raw file at it, and let it do its magic... But I didn't find such a thing. Just the usual libraries to work with raw CSV, which I wanted to avoid.

So I got myself a small weekend project and built a library on my own: Casting CSV.

Quick start

If you want to read data from a CSV file and map it directly to a Kotlin data class, or write data to a CSV file, just add the Casting CSV dependency, available on JCenter, to your project.

With Gradle, that would look like this:

implementation "com.floern.castingcsv:casting-csv-kt:1.1"

Now assume we have a CSV file containing a list of some transactions:

sender,receiver,amount
"John","Fred",42
"Claire","Mary",123
"Bob","Donald",16
"David","Jack",288

Read CSV

To read this data, we have to define a data class that represents the rows in the file. For each column in the CSV file we declare a corresponding property in the primary constructor:

data class Transaction(
    val sender: String,
    val receiver: String,
    val amount: Int
)

From there it's easy going: we just make use of the fromCSV() function from the Casting CSV library:

val transactions = castingCSV().fromCSV<Transaction>(csvFile)

And we're done!

Write CSV

Given the list of transactions from above, let's say we want to add a new transaction and save the CSV to disk. To accomplish that we'll simply use the toCSV() function:

val newTransactions = transactions.toMutableList()
// add new transaction
newTransactions += Transaction("Jeff", "Cathie", 78)
// save CSV
castingCSV().toCSV(newTransactions, csvFile.outputStream())

Now the CSV file contains another row:

"Jeff","Cathie",78

Custom type adapters

Out of the box we've only got support for the basic Kotlin types Int, Long, Byte, Short, Float, Double, Boolean and String, plus all of their nullable counterparts.

So, can we do something about other, more complex types?

As it happens, many datasets stored as CSV contain some form of time or date field. Revisiting the example from above, let's assume the CSV file contains a column for the date of the transaction in the format yyyy-MM-dd.

sender,receiver,amount,date
"John","Fred",42,"2019-03-24"
"Claire","Mary",123,"2020-04-01"
"Bob","Donald",16,"2020-11-29"

To represent the date in our Kotlin data class, we'll just use the LocalDate from Java 8, which makes it rather easy to parse and serialize that date format.
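As a quick check that LocalDate really fits this format, here is a minimal sketch using only the JDK. The helper name roundTrip() is made up for this illustration and is not part of Casting CSV.

```kotlin
import java.time.LocalDate

// Hypothetical helper for illustration: parse an ISO-8601 date string and serialize it again.
fun roundTrip(text: String): String {
    // LocalDate.parse() expects the ISO-8601 form yyyy-MM-dd by default.
    val date = LocalDate.parse(text)
    // toString() produces the same ISO-8601 form.
    return date.toString()
}

fun main() {
    val date = LocalDate.parse("2019-03-24")
    println("${date.year} / ${date.monthValue} / ${date.dayOfMonth}") // 2019 / 3 / 24
    println(roundTrip("2019-03-24") == "2019-03-24")                  // true
}
```

Since parsing and serializing are symmetric, the value written to the CSV file looks exactly like the value that was read.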

data class Transaction(
    val sender: String,
    val receiver: String,
    val amount: Int,
    val date: LocalDate
)

Now we need to tell Casting CSV how it has to handle the field type LocalDate. We do that by creating a custom TypeAdapter class which allows us to specify how the raw CSV string value has to be converted to a LocalDate and vice versa:

class LocalDateAdapter : TypeAdapter<LocalDate>() {
    override fun serialize(value: LocalDate?): String? = value?.toString()
    override fun deserialize(token: String): LocalDate? = LocalDate.parse(token)
}

Then we link the date property to our adapter with the @CsvTypeAdapter annotation:

data class Transaction(
    val sender: String,
    val receiver: String,
    val amount: Int,
    @CsvTypeAdapter(LocalDateAdapter::class)
    val date: LocalDate
)

And with that we're ready to handle dates in CSV.

Large files

All good so far, but if we try to read a CSV file that's really huge we might start having trouble with memory.

fromCsvAsSequence() reads the CSV line by line and exposes the rows as a Sequence. We pass a lambda that works on the Sequence and returns a final value. The lambda is required because the file handle is only open within its scope and is closed as soon as the function returns. So it's not possible to work on the Sequence afterwards.

Given a set of transactions we could sum up the amount of all transactions:

val totalAmount = castingCSV()
    .fromCsvAsSequence(csvFile.inputStream()) { transactions: Sequence<Transaction> ->
        transactions.sumOf { it.amount }
    }
println("Total transaction amount: €/£/¥/${totalAmount}")

Since the lines are read lazily, this can also be used to implement, for example, a findFirst() functionality without having to parse the whole file.
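The same pattern exists in the Kotlin standard library as File.useLines(), so the lazy behaviour can be sketched without the library at all. The row parser below is a deliberately naive stand-in (it ignores quoting and escaping rules), and findFirstAbove() is a name made up for this example; neither reflects how Casting CSV is implemented.

```kotlin
import java.io.File

data class Transaction(val sender: String, val receiver: String, val amount: Int)

// Naive row parser for illustration only: it ignores CSV quoting and escaping rules.
fun parseRow(line: String): Transaction {
    val (sender, receiver, amount) = line.split(",").map { it.trim('"') }
    return Transaction(sender, receiver, amount.toInt())
}

// Returns the first transaction above the threshold and the number of rows that had to be parsed.
fun findFirstAbove(file: File, threshold: Int): Pair<Transaction?, Int> {
    var parsedRows = 0
    // useLines() keeps the file open only while the lambda runs, then closes it.
    val match = file.useLines { lines ->
        lines.drop(1) // skip the header line
            .map { line -> parsedRows++; parseRow(line) }
            .firstOrNull { it.amount > threshold }
    }
    return match to parsedRows
}

fun main() {
    val csvFile = File.createTempFile("transactions", ".csv")
    csvFile.writeText(
        """
        sender,receiver,amount
        "John","Fred",42
        "Claire","Mary",123
        "Bob","Donald",16
        "David","Jack",288
        """.trimIndent()
    )
    val (match, parsedRows) = findFirstAbove(csvFile, 100)
    println(match)      // Transaction(sender=Claire, receiver=Mary, amount=123)
    println(parsedRows) // 2
    csvFile.delete()
}
```

Only two of the four rows are parsed, because the Sequence stops pulling lines as soon as firstOrNull() has found a match.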

By the way, toCSV() can write CSV from a Sequence too.

Configure CSV format

So far we've only used standard CSV, but we can adjust the CSV reader and writer to match our needs.

Instead of instantiating a simple CastingCSV instance with castingCSV(), we can make use of the config lambda and create a custom CsvConfig by specifying the properties we want.

Most commonly we might have to change the quotes or the delimiter symbol. Instead of the standard " and , we can convert the format to some TSV variation using a single quote ' and, of course, tabs \t as a delimiter.

castingCSV {
    // quote character to encapsulate fields:
    quoteChar = '\''
    // delimiter separating the fields:
    delimiter = '\t'
}

In case we have to work with a CSV file from the past millennium that doesn't use UTF-8 yet, we can set the charset property to something more archaic:

castingCSV {
    charset = Charset.forName("ISO-8859-1")
}
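To see why the charset matters, here is a small sketch using only the standard library. It decodes the same bytes with two different charsets; decode() is a hypothetical helper for this illustration.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper for illustration: decode raw bytes with the given charset.
fun decode(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    // "café" as it would be stored in an ISO-8859-1 encoded file.
    val latin1Bytes = "caf\u00E9".toByteArray(Charsets.ISO_8859_1)

    // Decoding with the matching charset restores the original text.
    println(decode(latin1Bytes, Charsets.ISO_8859_1) == "caf\u00E9") // true

    // Decoding with UTF-8 does not, because the single byte 0xE9 is not valid UTF-8 on its own.
    println(decode(latin1Bytes, Charsets.UTF_8) == "caf\u00E9")      // false
}
```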

Here is a list of all properties that can be specified, each with its default value:

castingCSV {
    // Charset to be used for the encoding of the CSV file:
    charset = Charset.forName("UTF-8")
    // Quote character to encapsulate fields:
    quoteChar = '"'
    // Delimiter separating the fields:
    delimiter = ','
    // Character to escape quotes inside strings:
    escapeChar = '"'
    // Skip empty lines when reading a CSV file, throw otherwise:
    skipEmptyLine = false
    // Skip lines with a different number of fields when reading a CSV file, throw otherwise:
    skipMismatchedRow = false
    // Value to write a `null` field:
    nullCode = ""
    // Line terminator:
    lineTerminator = "\r\n"
    // Append a line break at the end of file:
    outputLastLineTerminator = true
    // Quote mode: Only fields containing special characters, or all fields:
    quoteWriteMode = WriteQuoteMode.CANONICAL
}

Filter fields

By default, all properties are serialized in the order they appear in the class definition. With the header parameter, we can specify which properties of the data class should be written to the CSV output and in what order.

So if we only need to save a subset of the fields, say, to anonymize our transactions by excluding the sender and receiver, we can do that:

val csv = castingCSV().toCSV(transactions, header = listOf("date", "amount"))

The same goes for reading a subset of the CSV input in case the data class does not contain all the CSV fields.
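Conceptually, the header list acts as both a filter and an ordering. The following sketch shows that idea with plain maps instead of data classes; toCsvLines() is a hypothetical helper, it does no quoting or escaping, and it is not how the library implements the feature.

```kotlin
// Hypothetical helper for illustration: write only the columns named in `header`, in that order.
// No quoting or escaping is applied, so this is not a complete CSV writer.
fun toCsvLines(rows: List<Map<String, Any?>>, header: List<String>): List<String> {
    val headerLine = header.joinToString(",")
    val dataLines = rows.map { row ->
        // Missing or null values are written as an empty field.
        header.joinToString(",") { name -> row[name]?.toString() ?: "" }
    }
    return listOf(headerLine) + dataLines
}

fun main() {
    val rows = listOf(
        mapOf("sender" to "John", "receiver" to "Fred", "amount" to 42, "date" to "2019-03-24"),
        mapOf("sender" to "Claire", "receiver" to "Mary", "amount" to 123, "date" to "2020-04-01")
    )
    toCsvLines(rows, header = listOf("date", "amount")).forEach(::println)
    // date,amount
    // 2019-03-24,42
    // 2020-04-01,123
}
```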

Inner workings

While we're already at it, why don't we dig a little deeper and look at some implementation details?

Reflection magic

Automagically mapping the CSV columns to the data class properties and vice versa is the core functionality of the library, and the reason for its existence in the first place. The concept is quite simple: take all the column names from the header of the CSV file, take all the properties of the data class, and pair the ones with the same name. To be precise, we're not using the properties of the data class, nor its fields, but the parameters of the primary constructor. Why does that matter? When we want to create a new instance of the class, we have to choose a constructor (the one and only primary constructor) and provide a list of the arguments with their values:

val constructor = MyClass::class.primaryConstructor!!
val parameters = constructor.parameters
val myObject = constructor.callBy(
    parameters.associateBy(
        { param -> param },
        { param -> getValueFromCsv(param.name) }
        // getValueFromCsv() implementation left as an exercise for the reader
    )
)
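For readers who would rather not do the exercise, here is one possible, heavily simplified way to fill in the gap. It needs the kotlin-reflect artifact on the classpath. The functions convert() and rowToTransaction() are made up for this sketch, only Int and String are handled, and the library's actual implementation may look quite different.

```kotlin
import kotlin.reflect.KParameter
import kotlin.reflect.full.primaryConstructor

data class Transaction(val sender: String, val receiver: String, val amount: Int)

// Convert a raw CSV token to the parameter's declared type (only two types are handled here).
fun convert(token: String, param: KParameter): Any = when (param.type.classifier) {
    Int::class -> token.toInt()
    String::class -> token
    else -> throw IllegalArgumentException("Unsupported type: ${param.type}")
}

// Build a Transaction from one CSV row by matching column names to constructor parameter names.
fun rowToTransaction(header: List<String>, row: List<String>): Transaction {
    val constructor = Transaction::class.primaryConstructor!!
    val valuesByName = header.zip(row).toMap()
    val arguments = constructor.parameters.associateWith { param ->
        convert(valuesByName.getValue(param.name!!), param)
    }
    return constructor.callBy(arguments)
}

fun main() {
    val header = listOf("sender", "receiver", "amount")
    println(rowToTransaction(header, listOf("John", "Fred", "42")))
    // Transaction(sender=John, receiver=Fred, amount=42)
}
```

Because the arguments are looked up by name, the column order in the file does not have to match the parameter order of the constructor.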

So that makes sense for deserialization, but what about the other direction? For serialization it actually would be easiest to access the fields or the properties (these two are not the same thing) of the class, get their name and value and write that to the CSV output. So that's easy.

Or is it?

@annotations

There's one tiny little quirk when it comes to annotations, for example the one that allows us to provide a custom type adapter. The Kotlin compiler generates a property and its backing field for each constructor parameter, but annotations added to parameters are not propagated to the properties and fields. So if we use reflection to get the annotations of a field, we won't get any...

We can fix that by listing the annotation with the use-site targets field and property in addition to param:

data class MyClass(
    @param:CsvTypeAdapter(DateAdapter::class)
    @field:CsvTypeAdapter(DateAdapter::class)
    @property:CsvTypeAdapter(DateAdapter::class)
    val date: Date
)

...but that's just a horrible solution. As a developer using the library I wouldn't want to be forced to write something like that. I don't even want to specify any annotation target at all. As we previously learned from the deserialization part, the annotation without an explicit use-site target gets assigned to the constructor parameter. So from a developer experience point of view, we want to go that way. Thus, we have to pair up the class properties with their corresponding constructor parameter by matching their names:

val properties = MyClass::class.memberProperties.toList()
val parameters = MyClass::class.primaryConstructor!!.parameters
val paired = properties.map { property ->
    Pair(
        property,
        parameters.find { parameter -> property.name == parameter.name }
    )
}

Now we can read the value from the property and get the annotations from the parameter. A workaround I can live with.
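Put together, the workaround could look roughly like the sketch below, which again needs kotlin-reflect. The @Label annotation is a hypothetical stand-in for @CsvTypeAdapter, and describe() is a name invented for this example.

```kotlin
import kotlin.reflect.full.memberProperties
import kotlin.reflect.full.primaryConstructor

// Hypothetical stand-in for an annotation like @CsvTypeAdapter.
annotation class Label(val text: String)

data class Sample(
    @Label("transaction date") val date: String,
    val amount: Int
)

// Returns property name -> (value read from the property, label read from the constructor parameter).
fun describe(sample: Sample): Map<String, Pair<Any?, String?>> {
    val parameters = Sample::class.primaryConstructor!!.parameters
    return Sample::class.memberProperties.associate { property ->
        // Find the constructor parameter that has the same name as the property.
        val parameter = parameters.find { it.name == property.name }
        val label = parameter?.annotations?.filterIsInstance<Label>()?.firstOrNull()?.text
        property.name to Pair(property.get(sample), label)
    }
}

fun main() {
    val described = describe(Sample("2020-04-01", 42))
    println(described["date"])   // (2020-04-01, transaction date)
    println(described["amount"]) // (42, null)
}
```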

Code generation?

Since the current approach requires reflection, it's not necessarily the most performant code. Also, if we're using a code obfuscation tool like ProGuard, we have to define exception rules for the data classes, otherwise we can't map the CSV fields to the class properties at runtime.

I'm thinking of code generation at compile time to get rid of reflection. I've been using Moshi to work with JSON. It can generate complete JSON adapters at compile time, specifically for the data model we use, so there is no need for reflection at runtime anymore.

Maybe I will implement that in a future version of Casting CSV.
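To make the idea a bit more concrete, this is a hand-written sketch of what a generated adapter might look like. TransactionCsvAdapter and its functions are purely hypothetical; neither Casting CSV nor Moshi produces this exact code.

```kotlin
data class Transaction(val sender: String, val receiver: String, val amount: Int)

// Hypothetical shape of a generated adapter: every column is mapped explicitly,
// so no reflection is needed at runtime.
object TransactionCsvAdapter {
    val header = listOf("sender", "receiver", "amount")

    // Build a Transaction from the fields of one row, given in header order.
    fun fromRow(row: List<String>) = Transaction(
        sender = row[0],
        receiver = row[1],
        amount = row[2].toInt()
    )

    // Turn a Transaction back into the fields of one row, in header order.
    fun toRow(value: Transaction) =
        listOf(value.sender, value.receiver, value.amount.toString())
}

fun main() {
    val transaction = TransactionCsvAdapter.fromRow(listOf("Jeff", "Cathie", "78"))
    println(transaction)                              // Transaction(sender=Jeff, receiver=Cathie, amount=78)
    println(TransactionCsvAdapter.toRow(transaction)) // [Jeff, Cathie, 78]
}
```

Code like this accesses the properties directly instead of looking them up by name at runtime, which is the main appeal of generating it at compile time.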

Java?

While this library is completely written in and designed for Kotlin, it technically can be used with Java, or any JVM language for that matter. But the model class has to be a Kotlin data class, so there's that...

Frankly, using a Kotlin API from Java is a mess. If you want to use it with Java I'd recommend writing a wrapper around the CastingCSV API to hide the Kotlin specifics.

GitHub

Casting CSV is available as open source on GitHub: https://github.com/Floern/casting-csv-kt

It's licensed under the Apache License 2.0.

Feel free to open issues and feature requests, or even create pull requests for bug fixes.