
ILL - Data formats


Early data formats

With little processing power available on the NICOLE and CARINE data acquisition systems, initially little was documented about the data formats on these systems.
The CARINE data were tailored in content to each instrument. The Carine Manual (Kaiser et al.) described the contents of each field.
The NICOLE data were multichannel data blocks, but also included a few standard scaler values and a hand-edited text block which usually held quite a good description of the measurement. At the end of each measurement the data were copied sequentially to 9-track magnetic tape. In the very early days these tapes were taken to Saclay and written to cards for re-use on the university's IBM system but, with the installation of the DEC PDP10, they were then read each morning on this central computer and stored on disk. The DECtape data were copied onto the DEC10 by the instrument scientists; the data were not initially archived systematically, being conserved as separate files.

Archiving Data

The data disk, known as SPCT, had a capacity of about 40 MB and held the data for each CARINE and NICOLE instrument appended into a sequential file per instrument. Depending on the volume of data these files were copied to 800 bpi nine-track tape. Each run included a ten-character field NOMEXP, which was usually changed for each set of measurements, simplifying copying to tape without breaking a user's sequence of measurements. A sequential run number (NUMOR) was associated with each measurement. On some instruments (e.g. D11) this was reset at the start of each year; for other instruments, with smaller numbers of measurements, the number was simply incremented.

Data access

Calcul Scientifique provided a Fortran-callable routine to access data either from disk or tape. This routine, MEDOR, written in PDP10 assembler, allowed the return of different numbers of Fortran fields depending on the instrument data. It searched the sequential file of multiple measurements looking for the run requested, like a good dog. On tape there were often several files, e.g. for the tape HDA, HDA01, HDA02, HDA03, etc., speeding up access by using the hardware skip-to-end-of-file tape operation. Although four tape drives were available, the limited disk space for current data (about 40 MB) led to quite a demand for access to the data on tape.

A major improvement in efficiency followed the addition of directory files on disk indicating the starting block for each run. For tape access too, as the tape was read, the start point of each measurement could be retained and bi-directional tape motion optimised. This later routine was known as MEDIR (Christian Rey); the indexing idea is sketched below.
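Below is a minimal sketch, in modern Python rather than the original PDP10 assembler, of the principle behind such a directory file: a run number (NUMOR) is mapped to its starting 512-byte block in the concatenated archive, so that a single seek replaces a sequential scan. The file names and index layout are purely illustrative assumptions.

    BLOCK_SIZE = 512  # sector size of the concatenated archive

    def load_index(index_path):
        """Read a hypothetical directory file of 'numor start_block' pairs."""
        index = {}
        with open(index_path) as f:
            for line in f:
                numor, start_block = line.split()
                index[int(numor)] = int(start_block)
        return index

    def read_run(archive_path, index, numor, n_blocks):
        """Seek directly to the run's first block and read n_blocks sectors."""
        with open(archive_path, "rb") as f:
            f.seek(index[numor] * BLOCK_SIZE)
            return f.read(n_blocks * BLOCK_SIZE)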

These archive tapes were transcribed to 1600 bpi tapes in 1981, when the old NRZI tape drives were superseded by phase-encoded drives on the new machine. It was also an opportunity to rewrite the magnetic tapes before fading occurred. Subsequent copies were made to optical disk (VAX-VMS) and CD-ROM. For a small subset of data from D11, which contained the detector data in one file (64x64x36 bits = 22480 bytes), the top end of the spectrum was truncated because the file was copied using the default VAX-VMS block size of 17 kB. Fortunately most of these data were recorded in four 1024x36 bit records.

When NICOLE was replaced by individual 16-bit PDP11 computers the data formats were modified. The original proposal (Hildebrandt) was to record the histogram data, the parameters and a text file separately, and to use the RSX version number as the common identifying run number (its being an octal number was a small complication). A unified, extensible format was adopted instead (R. Ghosh, ILL82GH05G, ILL89GH3), which included these components in a single file, each starting on a formal 512-byte sector boundary, with the layout description in the first sector. This was used for PN1, IN4, IN5, IN13, D7, D11, D17, D22, etc. Though 16-bit unsigned integer binary data were initially stored, the time-of-flight instruments stored a list of overflow addresses at the end. The data were sent by network (DECnet-I) to a concentrator (PDP11/55). Before copying to tape the data were expanded, if necessary, to 32-bit integer data. For D11 and D17 the higher 16 bits were read and the address and contents of the first 128 non-zero values were stored in the data file (exceeding this number was the consequence of a poorly set-up instrument, since the detector was likely to be saturated and damaged). The lower 16 bits of data were stored as the histogram. Again, on rereading the data for treatment or archiving it was easy to read the overflows before the main histogram and reconstruct the 32-bit data in memory, as sketched below. Given that the instruments could easily produce 200 files per day, half of the space was saved and transfer times were usefully reduced. Storage on the PDP10 was always in 36-bit integers.
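A minimal sketch, assuming the layout described above (not the original PDP10/PDP11 code): the 16-bit histogram is read together with a list of (address, upper 16 bits) overflow pairs, and the full 32-bit counts are rebuilt in memory. The array size and the pair layout are illustrative assumptions.

    import numpy as np

    def rebuild_counts(low16, overflows):
        """Rebuild 32-bit counts from the 16-bit histogram plus the overflow list."""
        counts = low16.astype(np.uint32)
        for address, high16 in overflows:
            # the stored upper 16 bits supply the part lost from the 16-bit cell
            counts[address] = (int(high16) << 16) | int(low16[address])
        return counts

    # example: cell 7 overflowed once, true count = 65536 + 3
    low = np.zeros(64 * 64, dtype=np.uint16)
    low[7] = 3
    full = rebuild_counts(low, [(7, 1)])
    assert full[7] == 65539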

As more instruments were connected to the central computer, standardised header information was included at the start of each data file, simplifying archival and access. Initially this was in the form of an ASCII field similar to a Fortran format, e.g. 156X 512I 512F 512F 32(1024I). These data were stored on a set of tapes with series names AAA01, AAA02, etc. Data from the instruments using PDP11 computers were in separate files with names like 12345IN5.don.

Most instruments transferred data to the PDP10 using serial communications. To reduce overheads these lines were polled for activity, so that one task per instrument was no longer required. Transfers were typically launched either on a timed basis, or manually when data were required urgently. The SANS instruments continued to transfer at the end of each run.

Under VMS, the data were held on disk in files similar to those on the instruments, though concatenated together. Access on the central system used DEC Datatrieve database software. Successor routines to MEDIR were provided to allow data access, and these also transcribed the 36-bit PDP10 data from the existing archive (SPELIB, Philippe Blanchard). At this stage the data were identified by the year and cycle of the measurement, e.g. 861, 862, etc. The former PDP10 36-bit data were associated with the year/cycle 999. Since the 16 or 32 bit data were essentially 512-byte blocks concatenated in the archive file, it was convenient to have a single reading function. On the instruments the file was identified by name; from the database the Datatrieve routines would find the file (on disk, optical disk, etc.) and also give the starting block offset. The same routines could then treat the subsequent data, which had the same relative layout.

Conversion to Unix

The hardware and software aspects of this conversion are described on this page.

The data conversion itself was easily achieved with network data transfers; after the data had been received on the central system they were converted to ASCII and compressed (VMSCOMPILL, Ghosh 1992). The TAPDAT file format (described below) was used without change. Using compression allowed data to be retrieved as correct VMS text files or stream-format Unix files. Data were simply referred to as separate files using the run number (padded to six figures with leading zeros). The hierarchical directory structure identified the cycle and instrument name (though, of course, this information was held internally too). These filesystems were auto-mounted on each Unix computer as /usr/illdata/cycle/instrument/numor.Z files. A major advantage of using cross-system compression/decompression was to reduce the overheads of storage and network traffic in accessing the data: once compressed, the counts information can occupy only ten per cent of the binary data, with no loss of precision. Unless the data read from the network are decompressed onto a local disk, however, the network-load gains are lost. The whole database was progressively rewritten as ASCII files and stored on the SGI-IRIX server, though some errors and omissions required resorting again to the PDP10 tape images saved on CD-ROM (e.g. the instrument name D11A (1978-79) was unknown to the author of the Datatrieve installation!).
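A minimal sketch of retrieving one run under this scheme, assuming the auto-mounted layout above and a system zcat able to expand the old .Z (compress) files; the cycle and instrument names here are only examples.

    import subprocess

    def read_numor(cycle, instrument, numor, root="/usr/illdata"):
        """Return the decompressed ASCII text of one run file."""
        path = f"{root}/{cycle}/{instrument}/{numor:06d}.Z"  # six-digit run number
        # zcat handles the LZW-compressed .Z files on most Unix systems
        result = subprocess.run(["zcat", path], capture_output=True, check=True)
        return result.stdout.decode("ascii")

    # e.g. text = read_numor("861", "in5", 12345)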

Exporting data

The standardised archive, and the continual pressure for data to be treated away from the ILL to conserve this computing resource, led to standard programs to format the binary archived data as IBM-EBCDIC, and later ASCII, text tape files. TRANS (Didier Richard, 1977) included Fortran format statements to read successive data blocks. A simpler approach, TAPDAT (Pater, Ghosh 1981), used 80-character records in a fixed format of text, integer or floating-point values. Each data field was preceded by an AAAAAAAAAAA..., IIIIIIIIIIIII..., or FFFFFFFF... separator, followed by the number of data entities (and optionally the number of lines of descriptive text preceding the data). The layout was innately flexible and extensible, and subsequently became the basis for the ASCII data used on the Unix systems. The run number (six digits, padded with zeros, e.g. 00klmn) was used as the filename since this (or the compressed file 00klmn.Z) was compatible with all computer file systems, including MS-DOS.
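A minimal sketch of reading such records, assuming the separator line is followed by a line carrying the item count and an optional count of descriptive lines; that record layout is an assumption for illustration, not taken from the TAPDAT specification.

    def parse_tapdat(lines):
        """Split TAPDAT-style 80-character records into (type, values) fields."""
        fields, i = [], 0
        while i < len(lines):
            sep = lines[i].strip()
            # a separator is a run of identical 'A', 'I' or 'F' characters
            if not sep or sep[0] not in "AIF" or sep != sep[0] * len(sep):
                i += 1
                continue
            kind = sep[0]
            counts = lines[i + 1].split()
            n_items = int(counts[0])
            n_text = int(counts[1]) if len(counts) > 1 else 0
            i += 2 + n_text            # skip the descriptive text lines
            values = []
            while len(values) < n_items and i < len(lines):
                row = lines[i].split()
                if kind == "I":
                    values += [int(v) for v in row]
                elif kind == "F":
                    values += [float(v) for v in row]
                else:                   # 'A': keep whole text lines
                    values.append(lines[i].rstrip())
                i += 1
            fields.append((kind, values))
        return fields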

HDF-NeXus binary data

A group of spallation source scientists from Argonne and other centres started discussing a common format for the large volumes of time-of-flight data that their new machines were creating. The SoftNESS meeting of 1996 centred on use of the Hierarchical Data Format (HDF) from the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. This had been adopted by NASA and by Earth scientists for their large survey image datasets. It offered a range of storage and retrieval methods for large files. It was not designed to include much metadata, or MIME additions. Work started on a specification for neutron data. Several centres adopted the format, though often only partially; older home-grown formats still predominated.
Meetings advanced with the adoption of HDF version 4 by several neutron centres. Unfortunately there was great emphasis on raw rather than treated data, hence it was difficult to match the metadata schemes of similarly functioning but distinct instruments. Had the emphasis been on treated data, these differences would have been minimised by the initial treatment. In practice the initial reduction has always been developed locally.

By about 2000 the limitations of HDF4 led the NCSA development group to introduce HDF5. For the first time no backward compatibility was considered; the new design answered the many criticisms of HDF4, with better storage for metadata and better possibilities for including ASCII data. A considerable volume of data had already been created using HDF4. The neutron data scientists added a complete software layer on top of HDF to hide the differences between the two versions, thereby losing many of the advantages gained in HDF5. It would have been better to draw a line separating the two, as NCSA had clearly decided to do. This filtering software gained the name NeXus, for Neutron and X-ray data; by this time synchrotron users were showing interest.

The working group finally coalesced into creating a specification once XML was adopted for providing a prescription of the data and metadata. This also helped simulation techniques, since the spectrometer elements could finally be itemised and assembled into distinct instruments. Each component could be given a relative location on the beam-line, terminating at the detectors. At the ILL the first instrument with such data was BRISP, a new CRG design with large datasets; this was later followed by the other time-of-flight instruments and then the SANS instruments: first D33, with several detectors, then D11 and D22, though there were many illogical differences between the nomenclature and layouts for these last two, despite their having shared identical ASCII layouts for fifteen years. The principal advantage of HDF was that it was often included in commercial data-treatment packages (it was easy to compile and include the library routines in fourth-generation programs like IDL and MATLAB). In practice it was also quite easy to write a few utility routines to read the data with Fortran too! Several centres have worked on generic utility programs for visualising data, with significant investment arising from the Spallation Neutron Source at Oak Ridge being introduced into the more general community.
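A minimal sketch of reading such a file with the h5py library; the file name, the entry0 group and the detector dataset path are illustrative assumptions, since the internal names vary between instruments.

    import h5py

    with h5py.File("012345.nxs", "r") as f:
        entry = f["entry0"]                        # first NXentry group (assumed name)
        counts = entry["data/detector_data"][...]  # detector histogram (assumed path)
        print(counts.shape, counts.dtype)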

Whilst these software advantages are significant, especially in dealing with very large datasets, a number of questions remain over treating event data rather than image data. Sparse data are now being measured, which are ill-adapted to these storage methods. Whereas the ASCII data compressed to a very high level (>90%), the binary HDF data (which include possibilities for compression) offer a much poorer ratio. Again, the storage space for metadata, while better in HDF5, is not well adapted, for example, to step-scanning instruments like three-axis spectrometers. In fact the XML description for NeXus files offers a useful alternative for including data storage directly, hence sharing the development work of the NeXus group in defining components. These data hence remain in ASCII.
