Complete programming noob looking for a mentor

OK, so here's the deal: I'm in my early twenties, and the closest I ever got to programming is daisy-chaining IF functions in Excel.

About a year ago I started seriously toying with the idea of learning to program. So I began lurking around GitHub, following the programs I use, even submitting some non-programming-related pull requests, and reading about higher-level programming concepts (separation of concerns, the abstraction principle, algorithms, heuristics, etc.), to get a general sense of how things work. The biggest breakthrough I had was when I watched this talk on computer basics by Richard Feynman. That's the point when it all snapped into place. All of a sudden, programming stopped seeming like magic and became obvious to the point where I started being able to predict how things might work, which gave me a huge confidence boost.

Also, during this time, I learned HTML and CSS, and by extension, from spending a lot of time in a text editor, regular expressions.

So a few months ago, I was finally gonna take the plunge, get the K&R book and start learning C, because I like doing things the hard way. But then I came across Rust. Reading about it being a modern alternative to C and C++ made me think I might learn it instead, and then finding out about the documentation, the Playground and its being the most loved programming language in the Stack Overflow survey kind of sealed the deal.

Since then, I've read the Rust Book and found it pretty easy to follow, and now I would like to dive in and actually try to write a simple project. What I need is someone to help me figure out the general structure of the program, offer comments on the code and give me some pointers when I get stuck.

Any masochists... er, volunteers?

While it might be relatively hard to find Yoda for your Luke :wink: you might consider starting with some small contributions to existing projects.

There are at least two sources of issues to fix/implement that are guaranteed to be mentored.

TWIR call for participation
and
rust-cookbook examples. Admittedly, the list in the cookbook is a little empty right now, but I'll upload some issues next week.

Hope it helps!

I'd be willing to offer mentorship in Rust. I can also help you contribute to Redox's Ion Shell, if that kind of project interests you.

Thanks for the kind offers, guys. :smiley:

mmstick, if you don't mind, I'd like to build something from scratch. I've got a simple command line tool in mind that would proportionally distribute a given duration to multiple subtitles.

Well, if you can elaborate on that with as much detail as possible, I could point you in the right direction, as far as which features to use and how to organize your codebase.

What I usually do is start by writing down all the requirements of the program, and logically grouping them together. Don't sweat the implementation details until you've got a complete map of the program in your mind.

Yea, I know planning comes first. I've already got the whole thing pretty much worked out in my head; I just need to write it down.

Haha, I guess there's always more to it than you expect. Well, it took a bit more than I thought, but I think I've finally got all the details figured out. Sorry if it's a mess; I had a patchwork of paper notes, and I hope I've managed to make some kind of narrative out of it.

For a start, here's the flowchart of how I think the program could function:

Input

The program needs to take a subtitle file that has more than one group of lines per timed block, and then distribute the durations of those blocks to their internal groups of lines, essentially converting them to standard subtitles. The idea is to make subtitling of text-heavy intertitles easier by not having to manually time every subtitle within the intertitle; instead, you'd just time the start and end of each intertitle, and let the computer take care of those that contain multiple subtitles.

I'd like to be able to process two kinds of files: the standard SRT format, slightly modified to use multiple "sub-subtitles" (for lack of a better word) per subtitle block, and a simple custom format that uses frame numbers instead of timecodes.

SRT format:

1                                             // subtitle number
00:00:15,601 --> 00:00:17,186                 // start & end timecode, zero padded, h:min:sec,millisec
Subtitle one, for first intertitle.           // subtitle text, can have multiple lines
Has multiple lines.
                                              // ends with a blank line
2
00:00:37,623 --> 00:00:38,790
Subtitle two, for second, wordy, intertitle,
so I'd like to split it into multiple subs,

each one getting the portion of total duration
relative to its character count.

Blank lines would be used to separate
parts of such multi-part subtitles.

3
00:00:39,041 --> 00:00:40,542
Subtitle three, for third intertitle.

Custom frame-based format:

0
100
Subtitle blocks aren't numbered like in SRT.
Text works the same way as in SRT.

Timings use frame number;
starting and ending frame no. are each on its own line.

150
300
Subtitle two, for second intertitle.

350
425
Subtitle three, for third intertitle.

Some more dummy text for the third one.

I'm guessing the first step will be to parse those files into some kind of data structure. Maybe a vector made of structs?

Also, the SRT files should have their timecodes converted to a more manageable format – probably milliseconds, since that's the smallest unit they use.
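A minimal sketch of that conversion, assuming the timing line always follows the zero-padded h:min:sec,millisec layout shown in the SRT sample above. The function names are made up for illustration, and anything that doesn't fit the layout simply yields None:

```rust
/// Parses an SRT timecode such as "00:00:15,601" into milliseconds.
/// Returns None if the text is not in the h:min:sec,millisec layout.
fn timecode_to_ms(timecode: &str) -> Option<u64> {
    // Split "00:00:15,601" into "00:00:15" and "601".
    let (hms, millis) = timecode.trim().split_once(',')?;
    let mut fields = hms.split(':');
    let hours: u64 = fields.next()?.parse().ok()?;
    let minutes: u64 = fields.next()?.parse().ok()?;
    let seconds: u64 = fields.next()?.parse().ok()?;
    if fields.next().is_some() {
        return None; // more than three colon-separated fields
    }
    let millis: u64 = millis.parse().ok()?;
    Some(((hours * 60 + minutes) * 60 + seconds) * 1000 + millis)
}

/// Parses a full timing line such as "00:00:15,601 --> 00:00:17,186"
/// into a (start, end) pair of milliseconds.
fn parse_timing_line(line: &str) -> Option<(u64, u64)> {
    let (start, end) = line.split_once("-->")?;
    Some((timecode_to_ms(start)?, timecode_to_ms(end)?))
}
```

With this, the timing line of the first sample subtitle would come out as the pair (15601, 17186).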

I'd also like to make a few command line arguments available:

  • output format: required only if different from input
  • framerate: program should try to detect it from the filename; required if input and output formats differ (frame–timecode conversion); optional otherwise, and then used for alignment of milliseconds on frame boundaries
  • intersubtitle gap: optional; uses hardcoded value if none given

Distributing durations

I was thinking something like this, which will work both for frames and milliseconds:

  1. calculate the subtitle's duration
  2. count the number of text groups within it and take away 1 to get the number of gaps between them (they'll get either a user selected duration or a hardcoded one; two frames seems to be the norm)
  3. subtract the total length of gaps from the duration to get the available duration
  4. use the character count to distribute the duration, something like (chars in this text group / total chars) * duration and round down to nearest integer
  5. sum the discarded remainders of all text groups from the previous step, round down to nearest integer (which can at most be one less than the number of groups) and distribute between the groups, ranked by their individual remainders from the previous step, each group getting at most 1 (the "Largest remainder method"); in case of a tie in the ranking... do what?
  6. if we're storing the subs internally using start and end timings, convert the durations to timings
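The steps above could be sketched roughly like this. It is only one possible reading: the names are made up, the remainders are kept as integers (the part left over after dividing by the total character count) rather than as fractions, and ties in the ranking go to the earlier group, which is an assumption since the tie question is still open:

```rust
/// Splits `duration` among text groups in proportion to their character
/// counts, leaving `gap` units between neighbouring groups. Works for
/// frames and milliseconds alike. Returns None if there are no groups,
/// no characters, or the gaps alone exceed the duration.
fn distribute(duration: u64, gap: u64, char_counts: &[u64]) -> Option<Vec<u64>> {
    let total_chars: u64 = char_counts.iter().sum();
    if char_counts.is_empty() || total_chars == 0 {
        return None;
    }
    // Steps 2 and 3: n groups have n - 1 gaps between them.
    let gaps = gap * (char_counts.len() as u64 - 1);
    let available = duration.checked_sub(gaps)?;

    // Step 4: integer division rounds down; keep each remainder for ranking.
    let mut shares: Vec<u64> = Vec::with_capacity(char_counts.len());
    let mut remainders: Vec<(usize, u64)> = Vec::with_capacity(char_counts.len());
    for (index, &chars) in char_counts.iter().enumerate() {
        let product = chars * available;
        shares.push(product / total_chars);
        remainders.push((index, product % total_chars));
    }

    // Step 5: hand out the leftover units, one each, largest remainder first.
    // The sort is stable, so ties go to the earlier group (an assumption).
    let leftover = available - shares.iter().sum::<u64>();
    remainders.sort_by(|a, b| b.1.cmp(&a.1));
    for &(index, _) in remainders.iter().take(leftover as usize) {
        shares[index] += 1;
    }
    Some(shares)
}

/// Step 6: turns durations into (start, end) timings, beginning at `start`
/// and leaving `gap` units between one group's end and the next one's start.
fn to_timings(start: u64, gap: u64, durations: &[u64]) -> Vec<(u64, u64)> {
    let mut cursor = start;
    durations
        .iter()
        .map(|&duration| {
            let timing = (cursor, cursor + duration);
            cursor += duration + gap;
            timing
        })
        .collect()
}
```

For example, a 100-frame block with a 2-frame gap and groups of 3 and 5 characters has 98 frames available; rounding down gives 36 and 61, and the one leftover frame goes to the first group because its remainder is larger.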

I'd also like to optionally be able to align milliseconds to frame boundaries. I think that could be handled after step 4 if we have the framerate available and the input was SRT format.
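A rough sketch of the frame alignment, assuming the framerate is available as a floating-point number (so values like 23.976 work) and that snapping to the nearest frame is acceptable. Both the names and the round-to-nearest choice are assumptions:

```rust
/// Converts a time in milliseconds to the nearest frame number.
fn ms_to_frame(ms: u64, fps: f64) -> u64 {
    (ms as f64 * fps / 1000.0).round() as u64
}

/// Converts a frame number to the time, in milliseconds, at which it starts.
fn frame_to_ms(frame: u64, fps: f64) -> u64 {
    (frame as f64 * 1000.0 / fps).round() as u64
}

/// Snaps a time in milliseconds to the nearest frame boundary.
fn align_to_frame(ms: u64, fps: f64) -> u64 {
    frame_to_ms(ms_to_frame(ms, fps), fps)
}
```

The same two helpers would also cover the frame–timecode conversion needed when the input and output formats differ.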

For character counting, we'll assume the text is in UTF-8, and use unicode-segmentation crate or similar. One thing to keep in mind is that SRT format uses basic <i>, <b>, <u> HTML tags for styling, which shouldn't be counted.
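A sketch of the counting, with two loud assumptions: it counts Unicode scalar values using only the standard library (with the unicode-segmentation crate, one would count grapheme clusters instead), and it leaves line breaks out of the count. The function name is made up:

```rust
/// Counts the visible characters in a text group, skipping the <i>, <b>
/// and <u> styling tags (opening and closing) as well as line breaks.
fn visible_chars(text: &str) -> usize {
    const TAGS: [&str; 6] = ["<i>", "</i>", "<b>", "</b>", "<u>", "</u>"];
    let mut rest = text;
    let mut count = 0;
    'scan: while !rest.is_empty() {
        // If the remaining text starts with a styling tag, skip the whole tag.
        for tag in TAGS.iter() {
            if let Some(after) = rest.strip_prefix(*tag) {
                rest = after;
                continue 'scan;
            }
        }
        // Otherwise consume one character and count it unless it is a line break.
        let mut chars = rest.chars();
        if let Some(c) = chars.next() {
            if c != '\n' && c != '\r' {
                count += 1;
            }
        }
        rest = chars.as_str();
    }
    count
}
```

A lone "<" that isn't part of one of those tags is counted like any other character.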

Output

The output converts the internal time format if needed and writes a file in either of the input formats. It should probably use the same name as the input and append some suffix.

Rather than a vector of structures, which would incur some memory overhead and latency, you could make your IR an Iterator type which returns structures upon each iteration -- perhaps two different Iterator types for the different parsing methods, but both returning the same structures.


/// Unit of measurement for the start and end times
type Millisecond = usize;

/// Contains a multi-part subtitle which we will use to create
/// an iterator that will iterate over parts.
struct SubtitleString<'a> {
    data: &'a str,
    read: usize,
}

/// A complete subtitle element that will be returned by our tokenizers.
struct Subtitle<'a> {
    start: Millisecond,
    end: Millisecond,
    parts: SubtitleString<'a>,
}

/// A tokenizer for the SRT format.
struct SrtTokenizer<'a> {
    data: &'a str,
    read: usize,
}

/// The same, but for the other format
struct CustomTokenizer<'a> {
    data: &'a str,
    read: usize,
}

impl<'a> Iterator for SrtTokenizer<'a> {
    type Item = Subtitle<'a>;
    fn next(&mut self) -> Option<Subtitle<'a>> {
        // code for parsing SRT goes here; todo!() keeps the skeleton compiling
        todo!()
    }
}

impl<'a> Iterator for SubtitleString<'a> {
    type Item = &'a str;
    fn next(&mut self) -> Option<&'a str> {
        // code for parsing the multi-part subtitles goes here
        todo!()
    }
}

The benefit of this approach is that we've performed zero heap allocations thus far, and have no reliance on any types outside of Rust's core library (meaning we are no_std compatible at the moment).
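As an illustration, here is one way the body of the SubtitleString iterator might look. It is only a sketch, assuming that parts are separated by at least one blank line and that lines end with a plain "\n" (Windows line endings are not handled):

```rust
/// A multi-part subtitle; iterating over it yields one part at a time.
struct SubtitleString<'a> {
    data: &'a str,
    read: usize,
}

impl<'a> Iterator for SubtitleString<'a> {
    type Item = &'a str;

    fn next(&mut self) -> Option<&'a str> {
        let data: &'a str = self.data;
        // Skip past any blank lines left over from the previous part.
        let rest = &data[self.read..];
        let trimmed = rest.trim_start_matches('\n');
        self.read += rest.len() - trimmed.len();
        if trimmed.is_empty() {
            return None;
        }
        // A part runs until the next blank line or the end of the data.
        let len = trimmed.find("\n\n").unwrap_or(trimmed.len());
        self.read += len;
        Some(trimmed[..len].trim_end_matches('\n'))
    }
}
```

Every part it returns is a slice of the original data, so nothing is copied or allocated along the way.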

You could use similar approaches for calculating the subtitle's duration from the character count via that segmentation crate, and for writing an iterator with it that accounts for the special HTML tags.

For your output, you could just have a single implementation of the Display trait on your Subtitle element that formats the internal representation accordingly.

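// core::fmt is the same module that std re-exports as std::fmt.
use core::fmt::{self, Display, Formatter};

impl<'a> Display for Subtitle<'a> {
    fn fmt(&self, f: &mut Formatter) -> fmt::Result {
        // Placeholder: write the formatted subtitle into `f` here.
        write!(f, "format stuff here")
    }
}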
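To make the "format stuff here" part more concrete, here is a sketch of what writing one SRT block might look like. SrtCue is a hypothetical, simplified stand-in for a finished single-part subtitle, not the Subtitle type above, and the blank line between blocks is left to the caller:

```rust
use std::fmt;

/// Hypothetical stand-in for one finished subtitle, timed in milliseconds.
struct SrtCue<'a> {
    number: usize,
    start: u64,
    end: u64,
    text: &'a str,
}

/// Writes milliseconds as an SRT timecode, e.g. 15601 -> "00:00:15,601".
fn write_timecode(f: &mut fmt::Formatter<'_>, ms: u64) -> fmt::Result {
    let (hours, minutes) = (ms / 3_600_000, ms / 60_000 % 60);
    let (seconds, millis) = (ms / 1000 % 60, ms % 1000);
    write!(f, "{:02}:{:02}:{:02},{:03}", hours, minutes, seconds, millis)
}

impl<'a> fmt::Display for SrtCue<'a> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Subtitle number, then the timing line, then the text.
        writeln!(f, "{}", self.number)?;
        write_timecode(f, self.start)?;
        write!(f, " --> ")?;
        write_timecode(f, self.end)?;
        writeln!(f, "\n{}", self.text)
    }
}
```

Because it goes through Display, the same value can be printed with println! or written to a file with write!.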

As for command-line arguments, they could be parsed in a number of ways. The quickest solution is to use one of the available argument parsing crates, such as clap.

tl;dr - You are setting yourself up for failure. Read SICP, widely considered one of the best books ever written on the topic of computers. It was used as the programming textbook at MIT, Berkeley, Yale and dozens of other respected schools, primarily in the '90s, and it still regularly places in "Top 10 Best Book" lists for professional programmers. It maintains a spot among the top 8 "most mentioned" books on Stack Overflow as well as in Stack Overflow's community wiki of "Greatest Programming Books".

Use Racket, or Guile Scheme if you want to be old-school; use a setup guide, and take a look at Eli's answers when you're stuck.

Forget Rust & C for now.


Mark, is that you!?

Your story fits the profile of someone I met on a long plane ride not just vaguely, but exactly. Up to and including chaining Excel IF functions and Feynman on computing.

His name was Mark, and he was a consultant at McKinsey. By telling you this I've told you a lot of what you'd need to know about him; in fact, I might just be repeating myself if I told you he went to an Ivy. A very smart guy.

I saw he was watching a lecture on programming in C, from Coursera I think, and we got to talking about why he was learning to program.

He wanted to write real, super duper programs. He didn't like programming in Excel and he sensed a better way. I talked to him a little bit about it and we ultimately decided to keep in touch.

I recently contacted him (unrelated to your post) and was not surprised to find out that he'd ultimately given up. A driven and intelligent man by any measure, he ultimately failed to write meaningful programs of his own design. He got to that stage where nearly everyone bails on programming: "I understand the constructs, just not how they can make useful work". C exacerbates this problem, perhaps more so than any other commonly used language except perhaps its cousin + daughter C++.

If you think this isn't your problem, think on this -- you've read Feynman, so you already know what while, if and the basic operators do, and those can theoretically construct all programs and practically construct nearly all of them. Things like volatile, buffer overflows, stdio.h, the Unix Philosophy, CHAR_BIT != 8 - these are all details, and many programmers get by well without these individual morsels of knowledge -- so why aren't you already an expert programmer?

I submit to you it's because you don't know how to "assemble" these parts.

So why pick the language with the most opaque insights, the toughest to detect issues, the most features reflecting long-gone eras in computing, the most "raw" experience? This isn't an issue of "no pain no gain" - that concept doesn't really apply in programming.

C programmers like to think that because they've written C, they basically understand how the system works. I don't want this post to turn into a Tolstoy novel so I'll leave this at "not true" and move on.

I'll admit that I've been scarred by C and therefore have a sour view -- after writing wireless device drivers, rootkits, a Lisp implementation, tiling window manager and a hundred other just barely working programs in C, I cannot tell you how much starting with it has held back my development as a programmer. Heck, I've formally verified software written in C and I still got the feeling that it was "barely working"!

Rust is considerably better, both for novices and veteran programmers, but still far away from emphasizing the basic concepts of programming.

A quick note on Feynman. I got the book after this plane ride (Feynman Lectures on Computation). I haven't seen any video lectures of his, but at least that book provides a greater focus on the theoretical considerations of computing: information theory, physical limits of computation, quantum computing, etc. If you can learn to program this way, great, but I'd be surprised if it gave you a good review of what it takes to program well.

Do you really want this topic to be derailed into a discussion of the relative pedagogical merits of programming languages? :slight_smile:

Years of accumulated programmer discussions suggest that the search for a single ideal first programming language is pretty much an undecidable problem, whose only commonly accepted solution is to acquire experience from as wide a set of languages as possible instead of focusing on a single one, in order to maximize exposure to the diverse points of view present in the programming community.

I would thus not be so harsh as to state that someone who picks a learning environment which you happen to find suboptimal is setting oneself up for failure and should start over from scratch with a completely different environment.

Well, exactly. All I can say is that C is not suitable for initial learning, and neither is Scheme. Rust has one great advantage: you can find reusable components to do things like parse command line arguments. I would further say that worrying about using iterators rather than vectors would be premature optimization at this point - make it solid, make it correct, and then make it fast.

I could not agree with you more. Starting with C or Rust is like making your first run on skis on the Nosedive at Stowe or taking your first flying lesson in an F-16. SICP is an excellent suggestion. The point is to learn the major concepts of how to properly structure a complex program using a garbage-collected language without being distracted by the grubby details of memory management that C and Rust require.