I'm writing a little hobby tool that examines the UCD (Unicode Character Database) and prints it as a markdown table, and I'm quite confused by the database. I'm using ucd-parse to read it, which has its own pitfalls, such as not being able to read from memory. What ultimately confuses me is that a UCD table row contains both a name and a Unicode 1.0 name. What confuses me further is that the name, for example, contains just <control> for control characters and doesn't tell you anything about what the character really is. Why is Unicode like this? Which attribute should I use (name or unicode1_name)?
To work around this clunky way of doing things, I've tried the following algorithm:
use std::collections::BTreeMap;
use std::fs::File;

use ucd_parse::{parse as parse_ucd, UnicodeData, UnicodeDataExpander};
// ...
println!("Loading UCD");
// Load the Unicode rows
let data: Vec<UnicodeData> = match parse_ucd(tempdir.path()) {
    Ok(d) => d,
    Err(e) => {
        eprintln!("Error: the downloaded UCD is invalid: {}", e);
        return;
    }
};
// Expand all the rows into the full Unicode character set
let data: Vec<UnicodeData> = UnicodeDataExpander::new(data.iter().cloned()).collect();
println!("Loaded {} unicode characters", data.len());
let mut outfile = match File::create("ucd.md") {
    Ok(f) => f,
    Err(e) => {
        eprintln!("Error: file creation failed for file ucd.md: {}", e);
        return;
    }
};
// Group the characters by general category, keyed by code point
let mut chars: BTreeMap<String, BTreeMap<u32, String>> = BTreeMap::new();
for row in &data {
    // Prefer the Unicode 1.0 name when it is present, then the current name
    let name = if !row.unicode1_name.is_empty() {
        row.unicode1_name.clone()
    } else if !row.name.is_empty() {
        row.name.clone()
    } else {
        String::from("Unspecified")
    };
    chars
        .entry(row.general_category.clone())
        .or_insert_with(BTreeMap::new)
        .insert(row.codepoint.value(), name);
}
Probably not the best way of doing things, but I couldn't really think of another way of resolving this issue. Is there another way I'm unaware of?
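One possible alternative, as a minimal sketch rather than a definitive answer: prefer the current name, and fall back to the Unicode 1.0 name only when the current name is empty or an angle-bracket placeholder such as <control>. The helper name `pick_name` is hypothetical (it is not part of ucd-parse), and treating every name wrapped in angle brackets as a placeholder is an assumption.

```rust
/// Hypothetical helper (not part of ucd-parse): choose a display name from the
/// `name` and `unicode1_name` fields of a UnicodeData.txt row.
fn pick_name(name: &str, unicode1_name: &str) -> String {
    // Assumption: placeholder names such as "<control>" are wrapped in angle brackets.
    let is_placeholder = name.starts_with('<') && name.ends_with('>');

    if !name.is_empty() && !is_placeholder {
        // A real current name is available, so use it.
        name.to_string()
    } else if !unicode1_name.is_empty() {
        // No usable current name: fall back to the Unicode 1.0 name.
        unicode1_name.to_string()
    } else if !name.is_empty() {
        // Only a placeholder is available; keep it rather than losing it.
        name.to_string()
    } else {
        String::from("Unspecified")
    }
}
```

With the rows from the loop above, this would be called as `pick_name(&row.name, &row.unicode1_name)`. For U+0000, whose fields are <control> and NULL, it yields NULL, while letters such as U+0041 keep their current name.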