My first project: Bindings to a C library

kodemartin · November 22, 2020, 11:02am

Hi all,

I recently published my first crate (https://github.com/kodemartin/rustpostal) and I would greatly appreciate a review.

Thank you!

H2CO3 · November 22, 2020, 12:17pm

Hi! Thanks for your work, it's an interesting library. This is a use case that I've been pondering myself for some time.

Let's jump into the review!

First of all, the customary way of wrapping a C library in Rust is to split the functionality in two separate crates:
1. A "sys" crate, in your example it would be called libpostal-sys. This should contain the immediate C FFI bindings. These include C types and functions, as well as equivalents of other C constructs that can't be directly translated to Rust, e.g. const items that replace #define macros.
2. A higher-level crate providing a safe, idiomatic, preferably RAII-enabled wrapper around the lower-level -sys one.
This is preferable to your current approach of mixing safe and unsafe APIs in the same create.
Next, it would be nice to follow the Rust API Guidelines as closely as possible. A couple of things that stick out immediately:
1. Most of your types and functions have no documentation.
2. unsafe blocks are generally very long and coarse-grained, often spanning almost entire functions.
3. There are no derived trait implementations on most of your types. Default, Debug, Clone, Copy (if applicable, especially on simple enums and newtype wrappers around primitive types), PartialEq, Eq (if applicable) and Hash are the most important ones you'll want to provide for interoperability. It is very hard to use types that lack these implementations.
4. Relatedly, NormalizeOptions::get_default_options() should just be the Default::default() impl. NormalizeOptions::initialize() could be the derived Default impl. (Having a constructor named initialize() is confusing anyway – why do you need it? Why isn't NormalizeOptions::new() enough?)
In general, taking Vec<_> in function arguments is considered an anti-pattern (see e.g. expand_address_with_options()). They can only be used if the caller actually has a Vec. It's preferable to accept a slice (&[T]) or better yet, an Iterator<Item=T> instead.
Your NormalizeOptions::as_libpostal_options() function takes &mut self, which is inappropriate for a function named as_something() (without it being called as_mut_something()). Do you really need this? If you are mutating the structure, then the method shouldn't be called as_something(), since the idiomatic naming convention dictates that as_something() methods perform a read-only conversion between types, and do not mutate the receiver.
Exposing separate (and unsafe!) setup and the teardown functions is very error-prone. Consider wrapping setup and teardown functionality into an RAII "context" type, which calls setup() in its constructor and teradown() in its Drop implementation.
Your documentation for address::parse_address() is unidiomatic: you explicitly call .into_iter() on the result before iterating, which is unnecessary.
The use libc; statement in ffi.rs is unnecessary.
Globally suppressing compiler warnings against unused items (#![allow(unused)]) is a code smell, too. Why do you need it? You should in general fix warnings rather than allowing them, especially if you are allowing them globally.
The AddressComponents bitset shouldn't be implemented manually. Bitset boilerplate can be easily implemented using the bitflags crate.

The code is unnecessarily verbose in some places. For example, the following code snippet:

let mut n: libc::size_t = 0;
let n = &mut n as *mut libc::size_t;
let raw = ffi::libpostal_expand_address(address.as_ptr() as *const libc::c_char, options, n);

could be rewritten as just

let mut n: usize = 0;
let raw = ffi::libpostal_expand_address(address.as_ptr() as *const libc::c_char, options, &mut n);

because usize is guaranteed to be the same as libc::size_t, and an implicit conversion ("coercion") from &mut T to *mut T exists (as well as from &T to *const T, of course).

kodemartin · November 22, 2020, 4:24pm

This is an amazing review! Thank you very much for the time you spent, the depth
you went into, and the constructive quality of your comments.

Indeed, many aspects of my first release seemed a bit off, and getting feedback
on the customary and idiomatic ways to do things in Rust, is extremely helpful.
A lot of things you mention, simply did not occur to me due to lack of
experience with the language.

Most of the recommendations are straight-forward and outline a roadmap for the
project.

To recap:

Use dedicated libpostal-sys crate for the ffi.

Q: Do you think this relates only to the ffi.rs module of the crate?
Or do you see declarations in other modules that should not be there?
Use RAII-enabled wrappers on the higher-level crate.
Converge to the Rust API Guidelines.

Note: You raise very good points in your specific remarks (derive
traits, implement Defaultetc.), so no argue there!
Use bitflags for AddressComponents. Note: I value also the generic
implications of the comment, to reuse accepted community solutions and void
boilerplate.
Revisit reasons for suppressing compiler warning, and prefer fixing any
underlying issues.

Apart from these general directions, I would like to discuss more two specific
comments:

NormalizeOptions::as_libpostal_options() :

Your NormalizeOptions::as_libpostal_options() function takes &mut self ,
which is inappropriate for a function named as_something() (without it being
called as_mut_something() ). Do you really need this? If you are mutating
the structure, then the method shouldn't be called as_something() , since
the idiomatic naming convention dictates that as_something() methods perform
a read-only conversion between types, and do not mutate the receiver.

This has troubled me a lot during the implementation. Wanted to create a
higher-level interface to construct ffi::libpostal_normalize_options. This
requires the population of a *const *const libc::c_char field, and the
only solutions I could think of, was to mutate
NormalizeOptions.language_c_strs and NormalizeOptions.language_ptrs.
The latter holds the data in the target ffi structure for the whole
lifetime of NormalizeOptions. So this was the reaseon for &mut self.

After reviewing API guidelines C-Conv I reckon the to_ convention is more appropriate, no?
About verbose code snippets:

an implicit conversion ("coercion") from &mut T to *mut T exists (as well as from &T
to *const T

Is this because of the CoercedUnsized
implementation?

Thanks again!

H2CO3 · November 22, 2020, 5:08pm

All of the other modules contain pub unsafe functions, which should not be there. The prime example of this is the setup-teardown pair which is directly in lib.rs. But as_libpostal_options() is also unsafe.

Not really. to_xyz() and into_xyz() methods still shouldn't mutate. In general, conversion methods should not mutate the receiver. The difference between them is only that:

as_() methods convert from either references or owned values to references or cheaply-constructed values (typically, primitive types or small Copy types), e.g.: as_ref(), as_str(), as_bytes().
as_mut_() methods convert from either mutable references or owned values to a mutable reference. E.g.: as_mut(), as_mut_slice().
to_() methods convert from borrowed to owned / expensive-to-construct types, e.g.: to_owned(), to_string().
into_() methods convert from owned to owned types, e.g.: into(), into_inner().

Why? Couldn't you create a new *const *const char value? Currently, you are indeed returning the pointers inside self – and while I'm not familiar with the internals of libpostal, it seems really dubious that you are returning something with pointers inside self but that return value has raw pointers and thus it doesn't have a lifetime that would tie it to self. All in all, this smells like Undefined Behavior waiting to happen.

No, it isn't. Nothing's unsized here. &T always coerces to *const T and &mut T always coerces to *mut T, full stop. Unsized coercions are an entirely different kind of beast. They are related to the fact that you can e.g. convert a &[T; N] reference-to-array into a &[T] reference-to-slice, and that you can do the same using Box<[T]>, and trait objects, etc. None of those compilcations is happening here. Note how that CoerceUnsized impl you linked to has T on one side but U, a different type, on the other side. In your case, there aren't two different pointed types. It's &mut T to *mut T, not &mut T to *mut U.

It's a lot simpler: a raw pointer has strictly less capabilities and information content than a reference — basically, references are pointers with extra lifetime and borrowing information. Therefore, you can always convert a reference to a raw pointer that points to the same type.

kodemartin · November 22, 2020, 6:01pm

You got a point(er) there! In fact, if currently self is moved after the ffi structure is created and dropped for some reason, we will be left with dangling reference. I have to revisit this.

This makes sense. Have to understand coercion better.

Are there any other references, sample codes you could recommend on the topics we discussed?

system · February 20, 2021, 6:01pm

This topic was automatically closed 90 days after the last reply. We invite you to open a new topic if you have further questions or comments.

Topic		Replies	Views
Rust FFI crates? help	10	902	January 12, 2023
Looking for feedback on API bindings for a C library code review	3	373	January 19, 2022
Where stop the ffi binding rustification? help	3	555	January 12, 2023
References on creating Rust bindings? help	4	482	January 12, 2023
Rust or C++, I am not sure yet community	20	1784	August 26, 2023

My first project: Bindings to a C library

Related Topics