Thanks to the AVFoundation team, I learned that audio and video samples are supposed to be interleaved: whenever either of the calls that encode ready samples reports that media data is ready, samples from both tracks should be written, not just the samples for the track that triggered the call. Doing that fixes encoding this video with x264 and ffmpeg.
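To make the idea of interleaving concrete, here is a minimal, language-neutral sketch in Python. It is not AVFoundation code and not necessarily what the AVFoundation team does internally: the `interleave` function, the `(timestamp, kind)` sample tuples, and the timestamps themselves are all hypothetical, chosen only to show audio and video samples being merged into one stream in presentation order instead of being written one whole track at a time.

```python
import heapq

def interleave(video, audio):
    """Merge two timestamp-sorted sample lists into a single list
    ordered by presentation time (hypothetical helper for illustration)."""
    # heapq.merge assumes each input is already sorted by the key.
    # On equal timestamps, the sample from the first input comes first.
    return list(heapq.merge(video, audio, key=lambda sample: sample[0]))

# Made-up samples: (presentation time in seconds, track kind).
video = [(0.000, "video"), (0.033, "video"), (0.067, "video")]
audio = [(0.000, "audio"), (0.023, "audio"), (0.046, "audio"), (0.070, "audio")]

muxed = interleave(video, audio)
for timestamp, kind in muxed:
    print(f"{timestamp:.3f} {kind}")
```

In a real writer the samples arrive incrementally, so the same ordering has to be maintained each time either track signals that it can accept more data, rather than in one pass over complete lists as above.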