Repository: BurntSushi/rust-csv — A CSV parser for Rust, with Serde support.
Stars: 1914 · Forks: 243 · License: Unlicense · Primary language: Rust
Languages: Rust (99.5%), Shell (0.3%), Python (0.2%)
Topics: csv, library, rust, rust-library
Open PRs: 34 · Open issues: 64 · Last activity: 4mo ago · Community health: 42%
Top contributors: BurntSushi, paolobarbolini, brandonw, EvinRobertson, fhartwig, huonw, igor-raits, jturner314, thaliaarchi, timhabermaas and others.
This is a new attempt to add serialization support for maps and flattened fields, since #223 seems to have stalled out. This adds checks to ensure a stable column order.

Why

There are multiple reasons:

To support having a row that is made up of nested structs. Sometimes you have data in a struct that is used throughout the program, and you want to write it to CSV with an additional column. Currently you would have to create two structs and copy all members of the first struct into the second, which can be annoying for structs with a large number of members: Example without this PR Example with this PR

To support some dynamic columns that depend on configuration not known at compile time, in addition to common fields that are known at compile time. Example

To achieve symmetry with the deserializer, which already has support for the attribute, so it's surprising that serialization support does not.

How

This implementation focuses on minimizing the impact on serialization performance. Adding serialization support on its own does not add any overhead and is implemented in the first commit (30cc9a0227d112d0aa17a7a288bb39a55ad96d54). The required steps are as follows: When encountering a map or a struct with a member, will call . Similar to , we check that we are not already in the process of serializing a row, so nested maps are not allowed. Then, for each entry of the map or member of the struct, and is called. Finally, is called.

The flatten attribute works the same way: treats any struct with one or more member as a map, so the following inputs are equivalent to the serializer: However, this falls apart when used with a map that has an unstable entry order, where the order of the entries is not guaranteed, like the . This is why commit 22acfdce90ff399d0b1eea174c3cb7e4bc45843a adds a check that errors when we encounter out-of-order keys. It does so by keeping a list of serialized keys and comparing each incoming key with the next expected key.
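The out-of-order-key check described in the last commit could be sketched in plain Rust as follows. All names here (`KeyChecker`, `check`, `end_row`, the error strings) are hypothetical illustrations of the idea, not the PR's actual API:

```rust
// Sketch: record the keys of the first serialized row, then require
// every later row to present the same keys in the same order.
struct KeyChecker {
    keys: Vec<String>, // keys recorded from the first row
    pos: usize,        // index of the next expected key in later rows
    first_row: bool,
}

impl KeyChecker {
    fn new() -> Self {
        KeyChecker { keys: Vec::new(), pos: 0, first_row: true }
    }

    // Called once per serialized map entry or struct member.
    fn check(&mut self, key: &str) -> Result<(), String> {
        if self.first_row {
            self.keys.push(key.to_string());
            return Ok(());
        }
        match self.keys.get(self.pos) {
            Some(expected) if expected == key => {
                self.pos += 1;
                Ok(())
            }
            Some(expected) => Err(format!("expected key `{expected}`, got `{key}`")),
            None => Err(format!("unexpected extra key `{key}`")),
        }
    }

    // Called when a row has been fully serialized.
    fn end_row(&mut self) {
        self.first_row = false;
        self.pos = 0;
    }
}

fn main() {
    let mut c = KeyChecker::new();
    for k in ["a", "b"] {
        c.check(k).unwrap();
    }
    c.end_row();
    assert!(c.check("a").is_ok());
    assert!(c.check("c").is_err()); // out of order: `b` was expected next
    println!("ok");
}
```

Comparing against the recorded key list keeps the happy path to a single string comparison per entry, which matches the PR's goal of minimal serialization overhead.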
Alternatives

This implementation only collects keys and checks them for map-based data, not for pure structs without members. This is done so that the check has no impact when no maps are used, but it means it is still possible to get out-of-order columns when mixing non-map and map-based data: Example

In my opinion it is very unlikely that anyone will do something like this; nevertheless, it would be possible to collect all keys regardless of whether they come from a map or a struct. This would add overhead for non-map data.

An alternative to raising an error for out-of-order keys is to support them by accumulating the serialized values and then writing them out at the end (in ) in the right order. That way would be supported, but we'd lose support for two columns with the same name, and we'd need to store the serialized values in addition to the serialized keys.

Future work

I did not add an option to enable or disable the key order check, but this could easily be added to the builder if we agree on this solution. I hope this helps to finally bring this feature to the crate. Thank you for your work in maintaining it!
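The buffering alternative mentioned above (accumulate the serialized values, then emit them in header order at the end of the row) could look roughly like this. The `reorder` helper is a hypothetical sketch, not part of rust-csv; note how looking values up by key inherently drops support for two columns with the same name:

```rust
// Sketch: collect (key, value) pairs in whatever order they arrive,
// then emit the values in the header's order at end of row.
fn reorder(header: &[&str], row: &[(&str, &str)]) -> Result<Vec<String>, String> {
    header
        .iter()
        .map(|h| {
            row.iter()
                .find(|(k, _)| k == h) // first match only: duplicate column names are lost
                .map(|(_, v)| v.to_string())
                .ok_or_else(|| format!("missing column `{h}`"))
        })
        .collect()
}

fn main() {
    let header = ["a", "b"];
    let row = [("b", "2"), ("a", "1")]; // keys arrived out of order
    assert_eq!(reorder(&header, &row).unwrap(), vec!["1", "2"]);
    println!("ok");
}
```

The trade-off is visible in the sketch: every value of a row must be held in memory until the row is complete, whereas the error-on-out-of-order approach can stream values straight to the writer.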
I haven't found any prior discussions on this, so I thought it might be useful to open an issue for it, as it's both a feature request and somewhat of a correctness bug. The current behavior of this crate is to try to flatten nested containers: https://docs.rs/csv/1.3.0/csv/struct.Writer.html#rules

The behavior of serialize is fairly simple:
1. Nested containers (tuples, Vecs, structs, etc.) are always flattened (depth-first order).
2. If has_headers is true and the type contains field names, then a header row is automatically generated.
However, some container types cannot be serialized, and if has_headers is true, there are some additional restrictions on the types that can be serialized. See below for details.

This design decision is understandable, as CSV is pretty limited and simply doesn't support nested containers (vs. JSON, etc.). However, I do wonder if this is a good default, as it interferes with correctness. Disclaimer: my knowledge of CSV, Rust, and especially this crate is limited, but here are my considerations.

The main problem seems to be that there is no official CSV specification. I've used https://en.wikipedia.org/wiki/Comma-separated_values and https://datatracker.ietf.org/doc/html/rfc4180 as references. (Aside: perhaps it would also make sense to document which specification this crate intends to conform to?) A good rule seems to be the following: all records should have the same number of fields, in the same order. IMO this is a requirement for correctly parsing CSV files (with some exceptions like time series), especially if the first record is a "header"; otherwise there just isn't enough context to understand the structure of the data.

This crate seems to support that rule as well: https://docs.rs/csv/1.3.0/csv/struct.Reader.html#error-handling

By default, all records in CSV data must have the same number of fields. If a record is found with a different number of fields than a prior record, then an error is returned.
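Rule 1 above, depth-first flattening of nested containers, can be illustrated with a simplified value model. This is a sketch of the documented behavior only, not serde's or rust-csv's actual code:

```rust
// Sketch: a nested record is flattened depth-first into one flat
// list of fields, which is what makes the field count vary per row.
enum Value {
    Field(String),
    Seq(Vec<Value>),
}

fn flatten(v: &Value, out: &mut Vec<String>) {
    match v {
        Value::Field(s) => out.push(s.clone()),
        Value::Seq(items) => {
            for item in items {
                flatten(item, out); // recurse before moving to the next sibling
            }
        }
    }
}

fn main() {
    // ("x", [1, 2], "y") flattens to the four fields x,1,2,y
    let row = Value::Seq(vec![
        Value::Field("x".into()),
        Value::Seq(vec![Value::Field("1".into()), Value::Field("2".into())]),
        Value::Field("y".into()),
    ]);
    let mut out = Vec::new();
    flatten(&row, &mut out);
    assert_eq!(out, vec!["x", "1", "2", "y"]);
    println!("ok");
}
```

The sketch also shows why the field count becomes data-dependent: a row whose inner Vec holds three elements flattens to five fields instead of four, which is exactly what trips the writer's same-number-of-fields check.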
This behavior can be disabled by enabling flexible parsing via the flexible method on ReaderBuilder. The writer also enforces that rule by default: https://docs.rs/csv/1.3.0/csv/struct.WriterBuilder.html#method.flexible

When disabled (which is the default), writing CSV data will return an error if a record is written with a number of fields different from the number of fields written in a previous record.

So the default is to flatten nested containers (like a in a ) and error until is used. It certainly makes sense that defaults to , but I'd prefer an option to support nested containers without changing the number of fields. I propose the following ideas (not sure how feasible they are, though):

1. Let the user supply a custom string function to handle the transformation. If possible, this crate would call that function each time it backtracks from such a nested container (optionally with the depth level?). The user could then choose an escaping technique to ensure that the string will represent a single CSV field. This approach is limited, as the type information / context is lost, but one could at least do simple transformations like quoting and escaping or replacing the delimiters. Using https://github.com/BurntSushi/rust-csv/issues/254 as an example, the user could use that custom function to rewrite () to, e.g., , , or (like in the desired example). Additional recursion levels could be supported through nested quoting, more delimiters (in that case the function should be called with the depth level), or custom approaches. Ideally, this would be pretty easy to integrate and flexible enough for most use cases.

2. Let the user supply a custom function that handles the nested containers. Similar to 1., but the function would get the raw data and produce the string. The type information would be preserved, but it would require reflection, increase the complexity, and could be considered out of scope.

3. Offer generic techniques to ensure nested containers can be put into a single CSV field. This crate would implement one (or multiple) of the (hopefully pretty universal) custom string functions mentioned in 1. (possibly forcing the introduction of additional constraints).

IMO my first proposal could be a decent tradeoff, but I might've missed something. What do you think @BurntSushi?

PS: I'm currently looking into a somewhat "exotic" use case with @ammernico (https://github.com/BurntSushi/rust-csv/issues/254#issuecomment-1822320445) where the types come from an API specification and mainly consist of structures that contain some vectors. This use case makes it difficult to flatten/convert the vectors into a string before passing the data to this CSV crate.

PPS: Huge thanks for this very useful crate and amazing documentation! :)
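Proposal 1 could be as simple as a user-supplied function that re-joins a nested container with a secondary delimiter so it occupies a single CSV field. A minimal sketch, assuming a hypothetical `collapse` helper rather than any real rust-csv API:

```rust
// Sketch of proposal 1: collapse a nested container into one CSV
// field by joining its elements with a secondary delimiter, so the
// outer record keeps a fixed number of fields.
fn collapse(inner: &[&str], inner_delim: char) -> String {
    inner.join(&inner_delim.to_string())
}

fn main() {
    // A Vec inside a record becomes the single field "1|2|3".
    let record = vec!["id".to_string(), collapse(&["1", "2", "3"], '|')];
    assert_eq!(record.len(), 2); // field count no longer depends on the Vec's length
    assert_eq!(record[1], "1|2|3");
    println!("{}", record.join(",")); // id,1|2|3
}
```

Deeper nesting would need the depth parameter mentioned above (e.g. a different secondary delimiter per level), and any occurrence of the secondary delimiter inside the data would still need quoting or escaping.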
I'm running a program where different input CSV files may have different delimiters, and currently my solution to make sure the delimiter is ';' is . It would be awesome if you could run something like to detect which delimiter is used in the file you are looking at. Would be nice for me at least. Thanks.
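A delimiter sniffer along those lines could count candidate delimiters over the first few lines and pick the one that occurs a consistent, non-zero number of times per line. This is a sketch with a hypothetical `sniff_delimiter` function; rust-csv has no such API today:

```rust
// Sketch: guess the delimiter of a CSV sample by requiring the same
// non-zero count of the candidate character on every sampled line.
fn sniff_delimiter(sample: &str, candidates: &[char]) -> Option<char> {
    let lines: Vec<&str> = sample.lines().take(5).collect();
    candidates.iter().copied().find(|&d| {
        let counts: Vec<usize> = lines.iter().map(|l| l.matches(d).count()).collect();
        counts
            .first()
            .map_or(false, |&c| c > 0 && counts.iter().all(|&n| n == c))
    })
}

fn main() {
    let data = "a;b;c\n1;2;3\n4;5;6\n";
    assert_eq!(sniff_delimiter(data, &[',', ';', '\t']), Some(';'));
    println!("ok");
}
```

A real implementation would also need to ignore delimiters inside quoted fields, which is one reason sniffing is heuristic rather than exact; the detected byte could then be passed to the builder's delimiter setting.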