Convert parquet::data_type::Decimal to f64

I'm finding nothing on how to do this.

I realize it may be lossy. I have Parquet data with Decimal columns that need to go into Postgres double precision (FLOAT8) fields. Any recommendations?

The Decimal enum from parquet has the scale and precision values. I think the easiest way to convert it into an f64 would be to use a string as an intermediate form and then parse that as f64.

I don't even see a way to convert it to a string, do you? The Debug impl is derived so it just outputs the enum fields.
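
One possible way to build such a string, sketched under the assumption that the unscaled integer has already been extracted into an `i128` and that the scale is non-negative. The function names and the `i128`/`usize` parameters are made up for illustration and are not part of the parquet crate:

```rust
/// Hypothetical helper (not a parquet API): renders `unscaled * 10^(-scale)`
/// as a plain decimal string such as "-0.005".
fn decimal_to_string(unscaled: i128, scale: usize) -> String {
    let sign = if unscaled < 0 { "-" } else { "" };
    // `unsigned_abs` avoids the overflow that `abs` would hit for `i128::MIN`
    let digits = unscaled.unsigned_abs().to_string();
    if scale == 0 {
        return format!("{sign}{digits}");
    }
    // left-pad with zeros so that at least one digit remains before the point
    let padded = format!("{digits:0>width$}", width = scale + 1);
    let (int_part, frac_part) = padded.split_at(padded.len() - scale);
    format!("{sign}{int_part}.{frac_part}")
}

/// Hypothetical helper: goes through the string form and lets the
/// standard library's float parser do the rounding.
fn decimal_to_f64_via_string(unscaled: i128, scale: usize) -> f64 {
    decimal_to_string(unscaled, scale)
        .parse()
        .expect("the generated string is always a valid float literal")
}
```

With these, `decimal_to_string(1234567890, 2)` gives `"12345678.90"`. The appeal of this route is that the only rounding happens inside `str::parse`, at the cost of an allocation per value.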

Something like this might work. Not tested much… also, the floats are probably not always perfectly precise (i.e. exactly the floating-point number that’s closest to the full-precision result), since there are multiple rounding steps involved, but it’s probably good enough.

use parquet::data_type::{AsBytes, Decimal};

fn decimal_to_f64(x: &Decimal) -> f64 {
    let bytes = x.as_bytes();
    let initial_zeros_amount = bytes.iter().position(|&b| b != 0).unwrap_or(bytes.len());
    let trimmed_bytes = &bytes[initial_zeros_amount..];

    // idea:
    // turn unscaled integer from big endian binary into a `f64` first
    // then apply `.scale()` information at the very end
    // I think, we don't really need `.precision()` at all - it's more of a validation thing?

    // if the integer is too large, `f64` won't be fully accurate anyway,
    // so we can discard the less significant bytes and keep only the first 8 (one `u64`) after trimming;
    // the first kept byte is nonzero, so there are at most 7 leading zero bits, then a leading `1`,
    // then at least 56 more bits - which is more than the 52 mantissa bits `f64` can have, anyway
    let (integer_prefix, more_bytes_amount) =
        if let Some((initial, more)) = trimmed_bytes.split_first_chunk() {
            let initial = u64::from_be_bytes(*initial);
            (initial, more.len()) // record how many bytes were skipped (second tuple entry)
        } else {
            // else we can represent the full trimmed_bytes
            // but u64::from_be_bytes wants a full array, so we prepend a 0s-padding
            const N: usize = u64::BITS as usize / 8;
            let pad_amount = N - trimmed_bytes.len();
            let padded: [u8; N] = std::array::from_fn(|i| {
                i.checked_sub(pad_amount) // subtracting `pad_amount` from the index shifts the bytes to the right
                    .map_or(0, |i| trimmed_bytes[i]) // first bytes use the fallback to 0
            });
            (u64::from_be_bytes(padded), 0)
        };
    let prefix_as_float = integer_prefix as f64;
    // the more_bytes_amount corresponds to a factor of 2.pow(8) == 256, for each byte (8 bits)
    let integer_as_float = prefix_as_float * 256_f64.powi(more_bytes_amount.try_into().unwrap());

    // apply `.scale()` according to the formula `unscaledValue * 10^(-scale)`
    // see also: https://github.com/apache/parquet-format/blob/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/LogicalTypes.md#decimal
    integer_as_float * 10_f64.powi(-x.scale())
}

Holy cow that is insanely complex. 🙂

I was thinking something along these lines:

// imagine that we have a variable named parquet_decimal whose value represents the number 12345678.90
let mut intermediate_string = String::new();
  
for (index, byte) in parquet_decimal.data().iter().enumerate() {
  intermediate_string.push(char::from_digit(*byte as u32, 10).unwrap());
  if index == (parquet_decimal.precision() - parquet_decimal.scale() - 1) as usize {
    intermediate_string.push('.');
  }
}

let as_f64: f64 = intermediate_string.parse().unwrap();

That doesn’t match the actual data representation at all, though, from what I could gather about it. To be fair, the rustdoc documentation falls remarkably short on such crucial information. The Parquet format’s LogicalTypes.md says:

The primitive type stores an unscaled integer value. For BYTE_ARRAY and FIXED_LEN_BYTE_ARRAY, the unscaled number must be encoded as two's complement using big-endian byte order (the most significant byte is the zeroth element).
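
To make that encoding concrete, here is a minimal sketch that decodes such a byte string and applies the scale. It assumes the value fits into 16 bytes (an `i128`); `unscaled_from_be_bytes` and `apply_scale` are hypothetical names for illustration, not functions from the parquet crate:

```rust
/// Hypothetical helper: decodes up to 16 big-endian two's-complement bytes
/// into an `i128`. Returns `None` if the input is longer than 16 bytes.
fn unscaled_from_be_bytes(bytes: &[u8]) -> Option<i128> {
    if bytes.len() > 16 {
        return None; // would not fit into an `i128`
    }
    // sign extension: pad with 0xFF if the top bit of the first byte is set, else with 0x00
    let fill: u8 = if bytes.first().is_some_and(|&b| b & 0x80 != 0) {
        0xFF
    } else {
        0x00
    };
    let mut buf = [fill; 16];
    // the input occupies the least significant (rightmost) bytes
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}

/// Hypothetical helper: computes `unscaled * 10^(-scale)` as an `f64`.
/// The `as f64` cast rounds if `unscaled` needs more than 53 bits.
fn apply_scale(unscaled: i128, scale: i32) -> f64 {
    unscaled as f64 / 10_f64.powi(scale)
}
```

For example, the bytes `[0x04, 0xD2]` decode to `1234` and `[0xFB, 0x2E]` to `-1234`; with a scale of 2, the former becomes `12.34`.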

I see. I only looked at the type's documentation on docs.rs, and didn't see this.

Good thing I’m reviewing my own quotes now, just realizing that I totally missed that this supports “signed decimal numbers”. I didn’t handle the negative case at all yet lol.


Edit: Adding sign-handling (mostly affects the trimming logic; and also the extending logic for short arrays), otherwise just changes u64 to i64

use parquet::data_type::{AsBytes, Decimal};

fn decimal_to_f64(x: &Decimal) -> f64 {
    let bytes = x.as_bytes();

    // First, trim redundant sign-extending bytes:

    // determine correct sign
    let neg = bytes.first().is_some_and(|&b| (b as i8).is_negative());
    // `s` is the bytes for sign extension; `0x00` for positive, `0xFF` for negative
    let s = if neg { -1_i8 as u8 } else { 0 };
    // count number of sign-extending bytes
    let mut after_extended_sign = bytes.iter().position(|&b| b != s).unwrap_or(bytes.len());
    let neg_after_extended = bytes
        .get(after_extended_sign)
        .is_some_and(|&b| (b as i8).is_negative());
    // the last sign-extending byte must be kept if the byte after it doesn't start with the same sign bit,
    // i.e. if stripping the whole prefix would result in the wrong sign
    if neg != neg_after_extended {
        // can't underflow: when `after_extended_sign == 0`, `bytes.first()` and
        // `bytes.get(after_extended_sign)` are the same byte, so `neg != neg_after_extended` is impossible
        after_extended_sign -= 1;
    }
    let trimmed_bytes = &bytes[after_extended_sign..];

    // Conversion Idea / Approach
    // ==========================
    //
    // Turn unscaled integer from big endian binary into a `f64` first,
    // then apply `.scale()` information at the very end.
    // I think, we don't really need `.precision()` at all - it's more of a validation thing?

    // If the integer is too large, `f64` won't be fully accurate anyway,
    // so we can discard the less significant bytes and keep only the first 8 (one `i64`) after trimming.
    // After trimming, those 8 bytes hold the sign bit, at most 7 redundant copies of it,
    // and then at least 56 further bits - which is more than the 52 mantissa bits `f64` can have, anyway
    let (integer_prefix, more_bytes_amount) =
        if let Some((initial, more)) = trimmed_bytes.split_first_chunk() {
            let initial = i64::from_be_bytes(*initial);
            (initial, more.len()) // record how many bytes were skipped (second tuple entry)
        } else {
            // else we can represent the full trimmed_bytes
            // but i64::from_be_bytes wants a full array, so we prepend a sign-padding
            const N: usize = i64::BITS as usize / 8;
            let pad_amount = N - trimmed_bytes.len();
            let padded: [u8; N] = std::array::from_fn(|i| {
                i.checked_sub(pad_amount) // subtracting `pad_amount` from the index shifts the bytes to the right
                    .map_or(s, |i| trimmed_bytes[i]) // first bytes use the fallback to `s`
            });
            (i64::from_be_bytes(padded), 0)
        };
    let prefix_as_float = integer_prefix as f64;
    // the more_bytes_amount corresponds to a factor of 2.pow(8) == 256, for each byte (8 bits)
    let integer_as_float = prefix_as_float * 256_f64.powi(more_bytes_amount.try_into().unwrap());

    // apply `.scale()` according to the formula `unscaledValue * 10^(-scale)`
    // see also: https://github.com/apache/parquet-format/blob/25f05e73d8cd7f5c83532ce51cb4f4de8ba5f2a2/LogicalTypes.md#decimal
    integer_as_float * 10_f64.powi(-x.scale())
}
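
To look at the trimming step from the function above in isolation, here is the same logic pulled out into a small standalone function that works on a plain byte slice. The name `trim_sign_extension` is made up for this sketch:

```rust
/// Hypothetical helper: strips redundant sign-extension bytes from a
/// big-endian two's-complement integer without changing its value.
fn trim_sign_extension(bytes: &[u8]) -> &[u8] {
    // the sign is the top bit of the first byte
    let neg = bytes.first().is_some_and(|&b| b & 0x80 != 0);
    // sign-extension byte: 0xFF for negative, 0x00 for non-negative
    let s: u8 = if neg { 0xFF } else { 0x00 };
    let mut start = bytes.iter().position(|&b| b != s).unwrap_or(bytes.len());
    let next_neg = bytes.get(start).is_some_and(|&b| b & 0x80 != 0);
    if neg != next_neg {
        // dropping the whole prefix would flip the sign, so keep one byte of it;
        // `start` can't be 0 here, because then both checks looked at the same byte
        start -= 1;
    }
    &bytes[start..]
}
```

A few cases: `[0x00, 0x00, 0x01]` trims to `[0x01]`, while `[0x00, 0x80]` (128) keeps its leading zero byte because `[0x80]` alone would read as -128. Likewise `[0xFF, 0xFF, 0x80]` trims to `[0x80]` (-128), and `[0xFF, 0xFF]` (-1) trims to `[0xFF]`.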

Comment-less condensed version (don’t ask me why I made this, it isn’t any more readable 😅)

fn d_to_f64(d: &Decimal) -> f64 {
    let b = d.as_bytes();
    let n = b.first() > Some(&127);
    let s = 255 * (n as u8);
    let i = b.iter().take_while(|&&b| b == s).count();
    let t = &b[i - (n != (b.get(i) > Some(&127))) as usize..];
    let f = 8_usize.min(t.len());
    let a = std::array::from_fn(|j| j.checked_sub(8 - f).map_or(s, |j| t[j]));
    let r = i64::from_be_bytes(a) as f64;
    r * 256_f64.powi((t.len() - f).try_into().unwrap()) * 10_f64.powi(-d.scale())
}