!!! DRAFT DRAFT DRAFT !!!

DRAFT USAGE / SEMANTICS / RATIONALE SECTIONS FOR NUT SPEC


Overview of NUT

Unlike many popular containers, a NUT file can largely be viewed as a byte stream, as opposed to having a global block structure. NUT files consist of a sequence of packets, which can contain global headers, file metadata, stream headers for the individual media streams, optional index data to accelerate seeking, and, of course, the actual encoded media frames.

Aside from frames, all packets begin with a 64-bit startcode, the first byte of which is 0x4E, the ASCII character 'N'. In addition to identifying the type of packet to follow, these startcodes (combined with CRCs) allow for reliable resynchronization when reading damaged or incomplete files.

Packets have a common structure that enables a process reading the file both to verify packet contents and to bypass uninteresting packets without having to be aware of the specific packet type.

In order to facilitate identification and playback of NUT files, strict rules are imposed on the location and order of packets and streams.

Streams can be of class video, audio, subtitle, or user-defined data. Additional classes may be added in a later version of the NUT specification. Streams must be numbered consecutively beginning from 0. This allows simple and compact reference to streams in packet types where overhead must be kept to a minimum.


Packet Structure

Every NUT packet has a packet_header and a packet_footer. The packet_header consists of a 64-bit startcode, a forward_ptr, and a CRC on the packet header that is present only if the forward_ptr is larger than 4k. The forward_ptr gives the size of the packet from after the packet_header until the end of the packet_footer. The optional CRC prevents a demuxer from trusting a damaged startcode and a forward_ptr with a large value, which would cause it to skip or buffer a large part of the file, only to find that it is not really a NUT packet.

The packet_footer consists of reserved_bytes, which is room for any reserved fields which can be skipped by old demuxers, and a CRC covering the packet from after the packet_header until the CRC itself.


Variable Length Coding

Almost all fields in NUT are coded using VLCs. VLCs allow for compact storage of small integers, while still being extendible to arbitrarily large integers. The syntax of a VLC is, per byte, a 1-bit flag stating whether more bytes of the integer follow, and 7 bits that become the new least significant bits of the integer, shifting the previous value to the left by 7 bits.

Stuffing is allowed in VLCs by adding 0x80 bytes before the actual value, but a maximum of 8 bytes of stuffing is allowed in any VLC outside of a NUT packet. The only such fields are forward_ptr and fields in frame headers. This is to prevent demuxers from looping on a large amount of damaged 0x80 bytes. Fields inside a NUT packet are protected by a CRC which can be checked before decoding.


Header Structure

A NUT file must begin with a magic identification string, followed by the main header and a stream header for each stream, ordered by stream id. No other packets may intervene between these header packets.

For robustness, a NUT file needs to include backup copies of the headers. In the absence of valid headers at the beginning of the file, processes attempting to read a NUT file are recommended to search for backup headers beginning at each power-of-two byte offset in the file, and before end of file. A simple stop condition is provided to ensure that this search algorithm is bounded logarithmically in file length. The stop condition is finding any valid NUT packet (such as a syncpoint) during the search, as no packets are allowed between the offset where a search starts and the repeated header set.

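The byte layout described under Variable Length Coding can be sketched as follows. This is a minimal illustration, not the normative bitstream syntax: the names decode_vlc, encode_vlc, MAX_STUFFING and the limit_stuffing parameter are hypothetical and do not appear in the specification.

```python
MAX_STUFFING = 8  # stuffing limit the text gives for VLCs outside a NUT packet


def decode_vlc(data, pos=0, limit_stuffing=False):
    """Decode one VLC starting at data[pos]; return (value, next_pos).

    Set limit_stuffing=True for fields not protected by a packet CRC
    (forward_ptr and frame header fields, according to the text).
    """
    value = 0
    stuffing = 0
    leading = True  # True while only 0x80 stuffing bytes have been seen
    while True:
        if pos >= len(data):
            raise ValueError("truncated VLC")
        byte = data[pos]
        pos += 1
        if leading and byte == 0x80:
            stuffing += 1
            if limit_stuffing and stuffing > MAX_STUFFING:
                raise ValueError("more than 8 bytes of VLC stuffing")
        else:
            leading = False
        # The low 7 bits become the new least significant bits.
        value = (value << 7) | (byte & 0x7F)
        # A clear top bit marks the last byte of the integer.
        if not byte & 0x80:
            return value, pos


def encode_vlc(value):
    """Encode a non-negative integer as a VLC without stuffing."""
    if value < 0:
        raise ValueError("VLCs hold non-negative integers only")
    groups = [value & 0x7F]  # last byte: continuation flag clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier bytes: flag set
        value >>= 7
    return bytes(reversed(groups))
```

For example, encode_vlc(128) yields the two bytes 0x81 0x00, and decode_vlc accepts the same value with stuffing in front (0x80 0x81 0x00). Note that a stuffing byte contributes seven zero bits and a set continuation flag, so it never changes the decoded value.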
Metadata - Info Packets

The NUT main header and stream headers may be followed by metadata "info" packets, which contain (mostly textual, but other formats are possible) information on the file, on particular streams, or on particular time intervals ("chapters") of the file, such as: title, author, language, etc. One should note that info packets may occur at other locations in a file, particularly in a file that is being generated/transmitted in real time; however, a process interpreting a NUT file should not make any attempt to search for info packets except in their usual location, i.e. following the headers.

It is intended that processes presenting the contents of a NUT file will make automated responses to information stored in these packets, e.g. selecting a subtitle language based on the user's preferred list of languages, or providing a visual list of chapters to the user. Therefore, the format of info packets and the data they are to contain has been carefully specified and is aligned with International Standards for language codes and so forth. For this reason it is also important that info packets be stored in the correct locations, so that processes making automated responses to these packets can operate correctly.


Index

An index packet to facilitate O(1) seek-to-time operations may follow the headers. If an index packet does exist here, it should be placed after info packets, rather than before. Since the contents of the index depend on knowing the complete contents of the file, most processes generating NUT files are not expected to store an index with the headers. This option is merely provided for applications where it makes sense, to allow the index to be read without any seek operations on the underlying media when it is available.

On the other hand, all NUT files except live streams (which have no concept of "end of file") must include an index at the end of the file, followed by a fixed-size 64-bit integer giving the offset, backwards from end of file, at which the final index packet begins. This is the only fixed-size field specified by NUT, and it makes it possible to locate an index stored at the end of the file without resorting to unreliable heuristics.


Streams

A NUT file consists of one or more streams, intended to be presented simultaneously in synchronization with one another. Use of streams as independent entities is discouraged, and the nature of NUT's ordering requirements on frames makes it highly disadvantageous to store anything except the audio/video/subtitle/etc. components of a single presentation together in a single NUT file. Nonlinear playback order, scripting, and such are topics outside the scope of NUT, and should be handled at a higher protocol layer should they be desired (for example, using several NUT files with an external script file to control their playback in combination).

With each stream, a single media encoding format is associated. The stream headers convey properties of the encoding, such as video frame dimensions, sample rates, and the compression standard ("codec") used (if any). Stream headers may also carry with them an opaque, binary object in a codec-specific format, containing global parameters for the stream such as codebooks. Both the compression format and whatever parameters are stored in the stream header (including NUT fields and the opaque global header object) are constant for the duration of the stream.

Each stream has a last_pts context. For compression, every frame's pts is coded relative to the last_pts of its stream. In order for demuxing to resume from arbitrary points in the file, all last_pts contexts are reset by syncpoints.

Frames

NUT is built on the model that video, audio, and subtitle streams all consist of a sequence of "frames", where the specific definition of frame is left partly to the codec, but should be roughly interpreted as the smallest unit of data which can be decoded (not necessarily independently; it may depend on previously-decoded frames) to a complete presentation unit occupying an interval of time. In particular, video frames correspond to the usual idea of a frame as a picture that is displayed beginning at its assigned timestamp until it is replaced by a subsequent picture with a later timestamp. Subtitle frames should be thought of as individual subtitles in the case of simple text-only streams, or as events that alter the presentation in the case of more advanced subtitle formats. Audio frames are merely intervals of samples; their length is determined by the compression format used.

Frames need not be decoded in their presentation order. NUT allows for arbitrary out-of-order frame systems, from classic MPEG-1-style B frames to H.264 B pyramid and beyond, using a simple notion of "delay" and an implicitly-determined "decode timestamp" (dts). Out-of-order decoding is not limited to video streams; it is available to audio streams as well, and, given the right conditions, even subtitle streams, should a subtitle format choose to make use of such a capability.

Central to NUT is the notion that EVERY frame has a timestamp. This differs from other major container formats which allow timestamps to be omitted for some or even most frames. The decision to explicitly timestamp each frame allows for powerful high-level seeking and editing in applications without any interaction with the codec level. This makes it possible to develop applications which are completely unaware of the codecs used, and allows applications which do need to perform decoding to be more properly factored.

Keyframes

NUT defines a "key frame" as any frame such that the frame itself and all subsequent (with regard to presentation time) frames of the stream can be decoded successfully without reference to prior (with regard to storage/decoding order) frames in the stream. This definition may sometimes be bent on a per-codec basis, particularly with audio formats where there is MDCT window overlap or similar. The concept of key frames is central to seeking, and key frames will be the targets of the seek-to-time operation.


Representation of Time

NUT represents all timestamps as exact integer multiples of a rational number "time base". Files can have multiple time bases in order to accurately represent the time units of each stream. The set of available time bases is defined in the main header, while each stream header indicates which time base the corresponding stream will use.

Effective use of time bases both allows for compact representation of timestamps, minimizing overhead, and enriches the information contained in the file. For example, a process interpreting a NUT file with a video time base of 1/25 second knows it can convert the video to fixed-framerate 25 fps content or present it faithfully on a PAL display.

The scope of the media contained in a NUT file is a single contiguous interval of time. Timestamps need not begin at zero, but they may not jump backwards. Any large forward jump in timestamps must be interpreted as a frame with a large presentation interval, not as a discontinuity in the presentation. Without conditions such as these, NUT could not guarantee correct seeking in efficient time bounds.

Aside from provisions made for out-of-order decoding, all frames in a NUT file must be strictly ordered by timestamp. For the purpose of sorting frames, all timestamps are treated as rational numbers derived from a coded integer timestamp and the associated time base, and compared under the standard ordering on the rational numbers.

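The comparison of timestamps from streams with different time bases can be illustrated with exact rational arithmetic. The sketch below is only an illustration of the ordering rule stated above; the function names and the example time bases (1/25 for video, 1/48000 for audio) are assumptions chosen for the example, not values taken from the specification.

```python
from fractions import Fraction


def to_seconds(pts, time_base):
    """Return the exact rational time of a coded integer timestamp."""
    return pts * time_base


def compare_timestamps(pts_a, time_base_a, pts_b, time_base_b):
    """Return -1, 0 or 1 under the standard ordering on the rationals."""
    a = to_seconds(pts_a, time_base_a)
    b = to_seconds(pts_b, time_base_b)
    return (a > b) - (a < b)


# Hypothetical time bases for one video and one audio stream.
VIDEO_TB = Fraction(1, 25)
AUDIO_TB = Fraction(1, 48000)

# Video pts 50 and audio pts 96000 both denote exactly 2 seconds.
same_instant = compare_timestamps(50, VIDEO_TB, 96000, AUDIO_TB) == 0
```

Because Fraction keeps numerator and denominator as integers, no rounding occurs, which matters for time bases such as 1001/30000 that have no exact floating-point representation.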
Frame Coding

Each frame begins with a "framecode", a single byte which indexes a table in the main header. This table can associate properties such as stream id, size, relative timestamp, keyframe flag, etc. with the frame that follows, or allow the values to be explicitly coded following the framecode byte. By careful construction of the framecode table in the main header, an average overhead of significantly less than 2 bytes per frame can be achieved for single-stream files at low bitrates.

Framecodes can also be flagged as invalid, and seeing such a framecode indicates a damaged file. The framecode 0x4E ('N') is a special invalid framecode which marks that what follows is a NUT packet, and not a frame. The following 7 bytes, combined with 'N', form the full startcode of the NUT packet.


Syncpoints

Syncpoints are mini-NUT packets, which serve for seeking, error recovery, and error checking. They contain a startcode like all NUT packets, a timestamp, a back_ptr, and a CRC on the packet itself.

Syncpoints must be placed at least every 32 kB (or whatever max_distance is set to in the main header, 64 kB at most), unless the span between two syncpoints consists of a single frame. Syncpoints must be followed by a frame, and must be placed after headers (except those at end of file).

The timestamp coded in the syncpoint is a global timestamp, which is used to reset the last_pts context of all streams, and to find the appropriate syncpoint when seeking. Demuxing, including after seeking, can only begin at a syncpoint, so that the last_pts context is correct across all streams.

A back_ptr points to a previous syncpoint in the file. The area between the previous syncpoint and this one must contain a keyframe for every stream, with a pts lower than or equal to the timestamp of this syncpoint. This back_ptr is used for optimal seeking in files without an index. For compression, the back_ptr is relative to this syncpoint, and is divided by 16.
The reason for this is that the minimum size for a syncpoint is 16 bytes:
8 startcode + 1 forward_ptr + 1 timestamp + 1 back_ptr + 4 checksum + 1 frame_code


End of Relevance

EOR is a flag that can be attributed to a frame in any stream, and marks the end of relevance of a stream for presentation, such as a subtitle stream currently showing no subtitles on screen. The EOR flag can only be given to zero-byte frames, which must be flagged as keyframes as well. Once EOR is seen on such a stream, the stream is set EOR until the next keyframe on that stream. Streams which are set EOR are ignored by back_ptr in syncpoints until EOR is unset. The significance of EOR is to mark the stream as irrelevant when seeking and searching for optimal keyframes to begin demuxing.


Error Checking

There are several ways to detect a damaged stream in NUT during demuxing:

1. Invalid framecode - If a framecode which has been marked as invalid in the main header is found as the framecode in a frame header, then the stream is damaged. For this reason, 0x00 and 0xFF are recommended to be set as invalid framecodes in NUT.
2. Bad CRC on a NUT packet, packet_header, or frame header - fairly obvious.
3. Decoded frame size causes the distance from the last syncpoint to be bigger than max_distance, and the frame does not follow a syncpoint.
4. Decoded frame size is bigger than max_distance*2, and the frame header does not have a CRC.
5. Decoded frame pts is more than max_pts_distance higher than last_pts, and the frame header does not have a CRC.
6. A VLC is found with more than 8 bytes of stuffing in a frame header or forward_ptr.
7. Streams are found to not be strictly interleaved by comparing dts and pts - a precise formula for this check can be found in the spec.

All these conditions make it impossible for a demuxer to read, skip, or buffer a large amount of data from a file because of damaged data. In addition, max_pts_distance prevents an overly large pts caused by damaged data from causing a player to get stuck.

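Checks 3 to 5 in the list above depend only on values available once a frame header has been decoded, so they can be sketched as a single function. This is a hedged illustration: the function name, its parameters, and the choice to measure the distance from the start of the last syncpoint to the end of the frame are all assumptions made for this example; the specification remains the authority on the exact rules.

```python
def frame_header_checks(frame_start, frame_size, last_syncpoint,
                        follows_syncpoint, pts, last_pts, has_header_crc,
                        max_distance, max_pts_distance):
    """Return the numbers of the failed checks (3, 4, 5) from the list above.

    frame_start and last_syncpoint are byte offsets in the file (assumed
    convention); all parameter names are hypothetical.
    """
    failed = []

    # Check 3: the frame pushes the distance from the last syncpoint past
    # max_distance, and the frame does not directly follow a syncpoint.
    frame_end = frame_start + frame_size
    if not follows_syncpoint and frame_end - last_syncpoint > max_distance:
        failed.append(3)

    # Check 4: an oversized frame must carry a CRC in its frame header.
    if frame_size > 2 * max_distance and not has_header_crc:
        failed.append(4)

    # Check 5: a large forward pts jump must carry a CRC in its frame header.
    if pts > last_pts + max_pts_distance and not has_header_crc:
        failed.append(5)

    return failed
```

A demuxer built along these lines would treat any non-empty result as evidence of damage before it skips or buffers the frame data.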
Error Recovery

The recommended method for recovering from errors, once damage has been detected, is to search the file linearly from the current position to the closest syncpoint startcode, and to resume demuxing from there. If possible, rewind to the last syncpoint seen before starting the linear search, in case a syncpoint was already skipped while demuxing damaged data.


Seeking

An in-depth explanation of an optimal seeking algorithm can be found at http://wiki.multimedia.cx/index.php?title=NUT