Creating the Alternate Collation Routines

Each alternative collation sequence requires a set of four user-created routines--gtm_ac_xform_1 (or gtm_ac_xform), gtm_ac_xback_1 (or gtm_ac_xback), gtm_ac_version, and gtm_ac_verify. The original and transformed strings are passed between GT.M and the user-created routines using parameters of type gtm_descriptor or gtm32_descriptor. An "include file" gtm_descript.h, located in the GT.M distribution directory, defines gtm_descriptor (used with gtm_ac_xform and gtm_ac_xback) as:

typedef struct
{
    short len;
    short type;
    void *val;
} gtm_descriptor;
[Note] Note

On 64-bit UNIX platforms, gtm_descriptor may grow by up to 8 bytes as a result of compiler padding to meet platform alignment requirements. On 32-bit UNIX platforms, gtm_descriptor typically occupies 8 bytes: two 2-byte short fields followed by a 4-byte pointer.

gtm_descript.h defines gtm32_descriptor (used with gtm_ac_xform_1 and gtm_ac_xback_1) as:

typedef struct
{
    unsigned int len;
    unsigned int type;
    void *val;
} gtm32_descriptor;

where len is the length of the data, type is set to DSC_K_DTYPE_T (indicating that this is an M string), and val points to the text of the string.
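The difference between the two descriptors can be checked with a short sketch. It uses local copies of the typedefs shown above so that it compiles without gtm_descript.h; real collation code should include the header instead. The helper names fits_short_descriptor and print_descriptor_layout are hypothetical and are not part of GT.M.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Local copies of the typedefs shown above (assumption: same field order). */
typedef struct { short len; short type; void *val; } gtm_descriptor;
typedef struct { unsigned int len; unsigned int type; void *val; } gtm32_descriptor;

/* Hypothetical helper: returns 1 if a string of n bytes fits in the short
 * len field of gtm_descriptor, 0 if it needs gtm32_descriptor. */
int fits_short_descriptor(unsigned long n)
{
    return n <= (unsigned long)SHRT_MAX;
}

/* Hypothetical helper: prints the layout chosen by the current compiler,
 * including any padding inserted before the pointer field. */
void print_descriptor_layout(void)
{
    /* The pointer can never start before the end of the two short fields. */
    assert(offsetof(gtm_descriptor, val) >= 2 * sizeof(short));
    printf("gtm_descriptor:   size %zu, val at offset %zu\n",
           sizeof(gtm_descriptor), offsetof(gtm_descriptor, val));
    printf("gtm32_descriptor: size %zu, val at offset %zu\n",
           sizeof(gtm32_descriptor), offsetof(gtm32_descriptor, val));
}
```

Calling print_descriptor_layout() on the target platform shows whether the compiler padded either structure, and fits_short_descriptor() mirrors the 32,767-byte limit that decides which pair of routines a library needs.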

The interface to each routine is described below.

Transformation Routine (gtm_ac_xform_1 or gtm_ac_xform)

The gtm_ac_xform_1 or gtm_ac_xform routine transforms subscripts to the alternative collation sequence.

If the application uses strings longer than 32,767 (but less than 1,048,576) bytes, the alternative collation library must contain the gtm_ac_xform_1 and gtm_ac_xback_1 routines. Otherwise, the alternative collation library should contain gtm_ac_xform and gtm_ac_xback.

The syntax of this routine is:

#include "gtm_descript.h"
int gtm_ac_xform_1(gtm32_descriptor* in, int level, gtm32_descriptor* out, int* outlen);

Input Arguments

The input arguments for gtm_ac_xform_1 are:

in: a gtm32_descriptor containing the string to be transformed.

level: an integer; this is not used currently, but is reserved for future facilities.

out: a gtm32_descriptor to be filled with the transformed key.

Output Arguments

return value: A status code; as with gtm_ac_xform, zero indicates success and non-zero indicates failure.

out: A transformed subscript in the string buffer, passed by gtm32_descriptor.

outlen: A 32-bit signed integer, passed by reference, returning the actual length of the transformed key.

The syntax of the gtm_ac_xform routine is:

#include "gtm_descript.h"
long gtm_ac_xform(gtm_descriptor *in, int level, gtm_descriptor *out, int *outlen)

Input Arguments

The input arguments for gtm_ac_xform are:

in: a gtm_descriptor containing the string to be transformed.

level: an integer; this is not used currently, but is reserved for future facilities.

out: a gtm_descriptor to be filled with the transformed key.

Output Arguments

The output arguments for gtm_ac_xform are:

return value: a long result providing a status code; it indicates the success (zero) or failure (non-zero) of the transformation.

out: a gtm_descriptor containing the transformed key.

outlen: an integer (int, matching the syntax above), passed by reference, giving the actual length of the output key.

Example:

#include "gtm_descript.h"
#define MYAPP_SUBSC2LONG 12345678
static unsigned char xform_table[256] =
{
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15,
 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
 64, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93,
 95, 97, 99,101,103,105,107,109,111,113,115,117,118,119,120,121,
122, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94,
 96, 98,100,102,104,106,108,110,112,114,116,123,124,125,126,127,
128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,
144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,
160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,
176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,
208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,
224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,
240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
};

long
gtm_ac_xform (gtm_descriptor *in,     /* the input string */
              int level,              /* the subscript level */
              gtm_descriptor *out,    /* the output buffer */
              int *outlen)            /* the length of the output string */
{
  int n;
  unsigned char *cp, *cout;
/* Ensure space in the output buffer for the string. */
  n = in->len;
  if (n > out->len)
    return MYAPP_SUBSC2LONG;
/* There is space, copy the string, transforming, if necessary */
  cp = in->val;            /* Address of first byte of input string */
  cout = out->val;        /* Address of first byte of output buffer */
  while (n-- > 0)
    *cout++ = xform_table[*cp++];
  *outlen = in->len;
  return 0;
}

Transformation Routine Characteristics

The input and output values may contain <NUL> (hex code 00) characters.

The collation transformation routine may concatenate a sentinel, such as <NUL>, followed by the original subscript on the end of the transformed key. This permits the inverse transformation routine to simply retrieve the original subscript rather than calculating its value based on the transformed key.

If you prefer not to append the entire original subscript, GT.M allows you to concatenate a sentinel plus a predefined code so the original subscript can be easily retrieved by the inverse transformation routine, but still assures a reformatted key that is unique.
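The sentinel technique can be sketched as follows, assuming a case-insensitive collation in which the transformation folds ASCII lower case to upper case and therefore loses information. The transformed key is the folded string, a <NUL> sentinel, and then the original subscript, so the inverse routine only has to copy the tail. The typedef is a local copy of the one in gtm_descript.h, and the SKETCH_ status codes are hypothetical user-defined values.

```c
#include <assert.h>
#include <string.h>

/* Local copy of the typedef from gtm_descript.h, so the sketch stands alone. */
typedef struct { short len; short type; void *val; } gtm_descriptor;

#define SKETCH_SUBSC2LONG 12345678   /* hypothetical: output buffer too small */
#define SKETCH_BADKEY     12345679   /* hypothetical: key is not sentinel-shaped */

/* Key layout: folded(n bytes) + <NUL> sentinel + original(n bytes). */
long gtm_ac_xform(gtm_descriptor *in, int level, gtm_descriptor *out, int *outlen)
{
    int i, n = in->len;
    const unsigned char *src = (const unsigned char *)in->val;
    unsigned char *dst = (unsigned char *)out->val;

    (void)level;                       /* reserved, unused */
    if (2 * n + 1 > out->len)
        return SKETCH_SUBSC2LONG;
    for (i = 0; i < n; i++)            /* fold ASCII lower case to upper case */
        dst[i] = (src[i] >= 'a' && src[i] <= 'z')
                     ? (unsigned char)(src[i] - 'a' + 'A') : src[i];
    dst[n] = '\0';                     /* the sentinel */
    memcpy(dst + n + 1, src, (size_t)n);
    *outlen = 2 * n + 1;
    return 0;
}

/* The folded part has the same length as the original, so the original
 * occupies the last (len - 1) / 2 bytes; no search for the sentinel is needed. */
long gtm_ac_xback(gtm_descriptor *in, int level, gtm_descriptor *out, int *outlen)
{
    int n;
    const unsigned char *src = (const unsigned char *)in->val;

    (void)level;
    if (in->len < 1 || in->len % 2 == 0)
        return SKETCH_BADKEY;
    n = (in->len - 1) / 2;
    if (src[n] != '\0')
        return SKETCH_BADKEY;
    if (n > out->len)
        return SKETCH_SUBSC2LONG;
    memcpy(out->val, src + n + 1, (size_t)n);
    *outlen = n;
    return 0;
}
```

Because the original subscript is part of the key, two subscripts that differ only in case still produce distinct keys. The cost is a key slightly more than twice as long, which halves the longest subscript that fits in a given output buffer.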

Inverse Transformation Routine (gtm_ac_xback or gtm_ac_xback_1)

This routine converts transformed keys back to the original subscripts. The syntax of this routine is:

#include "gtm_descript.h"
long gtm_ac_xback(gtm_descriptor *in, int level, gtm_descriptor *out, int *outlen)

The arguments of gtm_ac_xback are identical to those of gtm_ac_xform.

The syntax of gtm_ac_xback_1 is:

#include "gtm_descript.h"
long gtm_ac_xback_1 ( gtm32_descriptor *src, int level, gtm32_descriptor *dst, int *dstlen)

The arguments of gtm_ac_xback_1 are identical to those of gtm_ac_xform_1.

Example:

#include "gtm_descript.h"
#define MYAPP_SUBSC2LONG 12345678
static unsigned char inverse_table[256] =
{
0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
64, 65, 97, 66, 98, 67, 99, 68,100, 69,101, 70,102, 71,103, 72,
104, 73,105, 74,106, 75,107, 76,108, 77,109, 78,110, 79,111, 80,
112, 81,113, 82,114, 83,115, 84,116, 85,117, 86,118, 87,119, 88,
120, 89,121, 90,122, 91, 92, 93, 94, 95, 96,123,124,125,126,127,
128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,
144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,
160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,
176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,
208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,
224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,
240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
};


long
gtm_ac_xback (gtm_descriptor *in,     /* the input string */
              int level,              /* the subscript level */
              gtm_descriptor *out,    /* the output buffer */
              int *outlen)            /* the length of the output string */
{
  int n;
  unsigned char *cp, *cout;
/* Ensure space in the output buffer for the string. */
  n = in->len;
  if (n > out->len)
    return MYAPP_SUBSC2LONG;
/* There is enough space, copy the string, transforming, if necessary */
  cp = in->val;            /* Address of first byte of input string */
  cout = out->val;        /* Address of first byte of output buffer */
  while (n-- > 0)
    *cout++ = inverse_table[*cp++];
  *outlen = in->len;
  return 0;
}
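Typing two 256-entry tables by hand invites mismatches between the transformation and its inverse. As a sketch, the tables shown in the examples above can instead be generated at startup, with the inverse derived from the forward table. The function names build_tables and tables_round_trip are hypothetical, and the code assumes an ASCII execution character set.

```c
#include <assert.h>

static unsigned char xform_table[256];
static unsigned char inverse_table[256];

/* Fills xform_table so that letters collate as A a B b ... Z z, then derives
 * inverse_table from it. Assumes ASCII ('A' == 65, 'a' == 97). */
void build_tables(void)
{
    unsigned char seen[256] = {0};
    int c;

    for (c = 0; c < 256; c++)          /* start from the identity mapping */
        xform_table[c] = (unsigned char)c;
    for (c = 0; c < 26; c++) {         /* interleave upper and lower case in 65..116 */
        xform_table['A' + c] = (unsigned char)(65 + 2 * c);
        xform_table['a' + c] = (unsigned char)(66 + 2 * c);
    }
    for (c = 91; c <= 96; c++)         /* [ \ ] ^ _ ` move up to 117..122 */
        xform_table[c] = (unsigned char)(c + 26);

    for (c = 0; c < 256; c++) {
        assert(!seen[xform_table[c]]); /* each output value is used only once */
        seen[xform_table[c]] = 1;
        inverse_table[xform_table[c]] = (unsigned char)c;
    }
}

/* Returns 1 if the inverse table undoes the forward table for every byte. */
int tables_round_trip(void)
{
    int c;

    for (c = 0; c < 256; c++)
        if (inverse_table[xform_table[c]] != c)
            return 0;
    return 1;
}
```

A round-trip check of this kind is also a cheap test for hand-written tables: if any byte fails it, the inverse routine returns a subscript different from the one the application stored.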

Version Control Routines (gtm_ac_version and gtm_ac_verify)

Two user-defined version control routines provide a safety mechanism to guard against a collation routine being used on the wrong global, or an attempt being made to modify a collation routine for an existing global. Either of these situations could cause incorrect collation or damage to subscripts.

When a global is assigned an alternative collation sequence, GT.M invokes a user-supplied routine that returns a numeric version identifier for the set of collation routines; GT.M stores this identifier with the global. The first time a process accesses the global, GT.M determines the assigned collation sequence, then invokes another user-supplied routine. The second routine matches the collation sequence and version identifier assigned to the global with those of the current set of collation routines.

When you write the code that matches the type and version, you can decide whether to modify the version identifier and whether to allow support of globals created using a previous version of the routine.

Version Identifier Routine (gtm_ac_version)

This routine returns an integer identifier between 0 and 255. This integer provides a mechanism to enforce compatibility as a collation sequence potentially evolves. When GT.M first uses an alternate collation sequence for a database or global, it captures the version; if it finds at some later startup that the version has changed, it generates an error. The syntax is:

int gtm_ac_version()

Example:

int gtm_ac_version()
{ 
    return 1;
}

Verification Routine (gtm_ac_verify)

This routine verifies that the type and version associated with a global are compatible with the active set of routines. Both the type and version are unsigned characters passed by value. The syntax is:

#include "gtm_descript.h"
int gtm_ac_verify(unsigned char type, unsigned char ver)

Example:

#include "gtm_descript.h"
#define MYAPP_WRONGVERSION 20406080    /* User condition */

int
gtm_ac_verify (unsigned char type, unsigned char ver)
{
  if (type == 3)
    {
      if (ver > 2)        /* version checking may be more complex */
        {
          return 0;
        }
    }
  return MYAPP_WRONGVERSION;
}
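A verification routine that continues to accept globals created under an earlier version might look like the following sketch. It assumes that type carries the collation sequence number and that version 1 keys remain readable by the version 2 routines; all SKETCH_ values are hypothetical.

```c
#include <assert.h>

#define SKETCH_ACT_NUMBER   3          /* hypothetical collation sequence number */
#define SKETCH_VERSION      2          /* version of the current routines */
#define SKETCH_OLDEST_OK    1          /* oldest version still readable */
#define SKETCH_WRONGVERSION 20406080   /* hypothetical user condition */

int gtm_ac_version(void)
{
    /* The identifier must fit the documented range of 0 to 255. */
    assert(SKETCH_VERSION >= 0 && SKETCH_VERSION <= 255);
    return SKETCH_VERSION;
}

/* Returns 0 when the global's type and version are compatible with these
 * routines, and a non-zero user condition otherwise. */
int gtm_ac_verify(unsigned char type, unsigned char ver)
{
    if (type != SKETCH_ACT_NUMBER)
        return SKETCH_WRONGVERSION;
    if (ver < SKETCH_OLDEST_OK || ver > SKETCH_VERSION)
        return SKETCH_WRONGVERSION;
    return 0;
}
```

Accepting a range is only safe when the newer routines transform every subscript exactly as the older ones did for keys that already exist; otherwise the check should reject the older version.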

Using the %GBLDEF Utility

Use the %GBLDEF utility to get, set, or kill the collation sequence of a global variable mapped by the current global directory. %GBLDEF cannot modify the collation sequence of a global that contains data or of a global whose subscripts span multiple regions. To change the collation sequence for a global variable that contains data, extract the data, KILL the variable, change the collation sequence, and reload the data. Use GDE to modify the collation sequence of a global variable that spans regions.

Assigning the Collation Sequence

To assign a collation sequence to an individual global use the extrinsic entry point:

set^%GBLDEF(gname,nct,act)

where:

  • The first argument, gname, is the name of the global. If the global name appears as a literal, it must be enclosed in quotation marks (" "). The name must be a legal M variable name, including the leading caret (^).

  • The second argument, nct, is an integer that determines whether numeric subscripts are treated as strings. The value is FALSE (0) if numeric subscripts are to collate before strings, as in standard M, and TRUE (1) if numeric subscripts are to be treated as strings (for example, where 10 collates before 9).

  • The third argument, act, is an integer specifying the active collation sequence, from 0 (standard M collation) to 255.

[Note] Note

set^%GBLDEF(gname) returns global-specific characteristics, which can differ from collation characteristics defined for the database file at MUPIP CREATE time from settings in the global directory. To see the region collation, use the DSE DUMP -FILEHEADER command; when the region uses standard M collation, the command displays no collation information, so standard collation is implied.

If the global contains data, this function returns FALSE (0) and does not modify the existing collation sequence definition.

If the global's subscripts span multiple regions, the function returns FALSE (0). Use the global directory (GBLNAME object in GDE) to set collation characteristics for a global whose subscripts span multiple regions.

Always execute this function outside of a TSTART/TCOMMIT fence. If $TLEVEL is non-zero, the function returns FALSE (0).

Example:

GTM>kill ^G

GTM>write $select($$set^%GBLDEF("^G",0,3):"ok",1:"failed")
ok
GTM>

This deletes the global variable ^G, then uses $$set^%GBLDEF as an extrinsic to set ^G to collation sequence number 3 with numeric subscripts collating before strings. Using $$set^%GBLDEF as an argument to $SELECT turns its return value into a message that shows whether the set succeeded. $SELECT returns "failed" if the requested collation sequence is undefined.

Examining Global Collation Characteristics

To examine the collation characteristics currently assigned to a global use the extrinsic entry point:

get^%GBLDEF(gname[,reg])

where gname specifies the global variable name. When gname spans multiple regions, reg specifies a region in the span.

This function returns the data associated with the global name as a comma delimited string having the following pieces:

  • A truth-valued integer specifying FALSE (0) if numeric subscripts collate before strings, as in standard M, and TRUE (1) if numeric subscripts are handled as strings.

  • An integer specifying the collation sequence.

  • An integer specifying the version, or revision level, of the currently implemented collation sequence.

[Note] Note

A "0" return from $$get^%GBLDEF(gname[,reg]) indicates that the global has no special characteristics and uses the region default collation, while a "0,0,0" return indicates that the global is explicitly defined to M collation.

Example:

GTM>Write $$get^%GBLDEF("^G")
1,3,1

This example returns the collation sequence information currently assigned to the global ^G.

Deleting Global Collation Characteristics

To delete the collation characteristics currently assigned to a global, use the extrinsic entry point:

kill^%GBLDEF(gname)

  • If the global contains data, the function returns FALSE (0) and does not modify the global.

  • If the global's subscripts span multiple regions, the function returns FALSE (0). Use the global directory (GBLNAME object in GDE) to set collation characteristics for a global whose subscripts span multiple regions.

  • Always execute this function outside of a TSTART/TCOMMIT fence. If $TLEVEL is non-zero, the function returns FALSE (0).

Example of Upper and Lower Case Alphabetic Collation Sequence

This example creates an alternate collation sequence that collates upper and lower case alphabetic characters in such a way that the set of keys "du Pont," "Friendly," "le Blanc," and "Madrid" collates as:

  • du Pont

  • Friendly

  • le Blanc

  • Madrid

This is in contrast to the standard M collation that orders them as:

  • Friendly

  • Madrid

  • du Pont

  • le Blanc

[Important] Important

No claim of copyright is made with respect to the code used in this example. Please do not use the code as-is in a production environment.

Please ensure that you have a correctly configured GT.M installation, correctly configured environment variables, with appropriate directories and files.

Seasoned GT.M users may want to download polish.c used in this example and proceed directly to Step 5 for compiling and linking instructions. First-time users may want to start from Step 1.

  1. Create a new file called polish.c and add the following code:

    #include <stdio.h>
    #include "gtm_descript.h"
    #define COLLATION_TABLE_SIZE     256
    #define MYAPPS_SUBSC2LONG        12345678
    #define SUCCESS     0
    #define FAILURE     1                
    #define VERSION     0
    
    static unsigned char xform_table[COLLATION_TABLE_SIZE] =
              {
              0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
              16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
              32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
              48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
              64, 65, 67, 69, 71, 73, 75, 77, 79, 81, 83, 85, 87, 89, 91, 93,
              95, 97, 99,101,103,105,107,109,111,113,115,117,118,119,120,121,
              122, 66, 68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90, 92, 94,
              96, 98,100,102,104,106,108,110,112,114,116,123,124,125,126,127,
              128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,
              144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,
              160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,
              176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
              192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,
              208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,
              224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,
              240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
              };
    
    static unsigned char inverse_table[COLLATION_TABLE_SIZE] =
              {
              0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
              16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31,
              32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47,
              48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63,
              64, 65, 97, 66, 98, 67, 99, 68,100, 69,101, 70,102, 71,103, 72,
              104, 73,105, 74,106, 75,107, 76,108, 77,109, 78,110, 79,111, 80,
              112, 81,113, 82,114, 83,115, 84,116, 85,117, 86,118, 87,119, 88,
              120, 89,121, 90,122, 91, 92, 93, 94, 95, 96,123,124,125,126,127,
              128,129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,
              144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,
              160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,
              176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,
              192,193,194,195,196,197,198,199,200,201,202,203,204,205,206,207,
              208,209,210,211,212,213,214,215,216,217,218,219,220,221,222,223,
              224,225,226,227,228,229,230,231,232,233,234,235,236,237,238,239,
              240,241,242,243,244,245,246,247,248,249,250,251,252,253,254,255
              };

    Each element of xform_table gives the collation value for the input byte at that index. Each element of inverse_table gives the original byte for the collation value at that index, so inverse_table reverses xform_table.

  2. Add the following code for the gtm_ac_xform transformation routine:

    long gtm_ac_xform ( gtm_descriptor *src, int level, gtm_descriptor *dst, int *dstlen)
    {
        int n;
        unsigned char  *cp, *cpout;

        /* Fail if the output buffer cannot hold the transformed key */
        n = src->len;
        if ( n > dst->len)
            return MYAPPS_SUBSC2LONG;
        cp  = (unsigned char *)src->val;
        cpout = (unsigned char *)dst->val;
        /* Map each input byte to its collation value */
        while ( n-- > 0 )
            *cpout++ = xform_table[*cp++];
        /* No terminator is written: the key may fill dst exactly, and *dstlen carries its length */
        *dstlen = src->len;
    #ifdef DEBUG
        /* Print input and output bytes directly from the caller's buffers */
        cp  = (unsigned char *)src->val;
        cpout = (unsigned char *)dst->val;
        fprintf(stderr, "\nInput = \n");
        for (n = 0; n < *dstlen; n++ ) fprintf(stderr," %d ",(int )cp[n]);
        fprintf(stderr, "\nOutput = \n");
        for (n = 0; n < *dstlen; n++ ) fprintf(stderr," %d ",(int )cpout[n]);
    #endif
        return SUCCESS;
    }
  3. Add the following code for the gtm_ac_xback reverse transformation routine:

    long gtm_ac_xback ( gtm_descriptor *src, int level, gtm_descriptor *dst, int *dstlen)
    {
        int n;
        unsigned char  *cp, *cpout;

        /* Fail if the output buffer cannot hold the original subscript */
        n = src->len;
        if ( n > dst->len)
            return MYAPPS_SUBSC2LONG;
        cp  = (unsigned char *)src->val;
        cpout = (unsigned char *)dst->val;
        /* Map each collation value back to the original byte */
        while ( n-- > 0 )
            *cpout++ = inverse_table[*cp++];
        /* No terminator is written: *dstlen carries the length */
        *dstlen = src->len;
    #ifdef DEBUG
        /* %.*s prints at most the given number of bytes, so no terminated copy is needed */
        fprintf(stderr, "Input = %.*s, Output = %.*s\n",
                (int)src->len, (char *)src->val, *dstlen, (char *)dst->val);
    #endif
        return SUCCESS;
    }
  4. Add code for the version identifier routine (gtm_ac_version) and the verification routine (gtm_ac_verify):

    int gtm_ac_version ()
    {
        return VERSION;
    }

    int gtm_ac_verify (unsigned char type, unsigned char ver)
    {
        return !(ver == VERSION);
    }
  5. Save and compile polish.c. On x86 GNU/Linux (64-bit Ubuntu 10.10), execute a command like the following:

    gcc -c polish.c -I$gtm_dist
    [Note] Note

    The -I$gtm_dist option adds the GT.M distribution directory, which contains gtm_descript.h and gtmxc_types.h, to the compiler's include search path.

  6. Create a new shared library or add the above routines to an existing one. The following command adds these alternative sequence routines to a shared library called altcoll.so on x86 GNU/Linux (64-bit Ubuntu 10.10).

    gcc -o altcoll.so -shared polish.o
  7. Set the environment variable gtm_collate_1 to point to the location of altcoll.so.

  8. At the GTM> prompt, execute the following command:

    GTM>Write $SELECT($$set^%GBLDEF("^G",0,1):"OK",1:"FAILED")
    OK

    This sets ^G to collation sequence number 1 with numeric subscripts collating before strings. If ^G already contains data, KILL it first, because set^%GBLDEF does not modify a global that contains data.

  9. Assign the following values to ^G:

    GTM>Set ^G("du Pont")=1
    GTM>Set ^G("Friendly")=1
    GTM>Set ^G("le Blanc")=1
    GTM>Set ^G("Madrid")=1
  10. See how the subscripts of ^G are ordered according to the alternative collation sequence:

    GTM>ZWRite ^G
    ^G("du Pont")=1
    ^G("Friendly")=1
    ^G("le Blanc")=1
    ^G("Madrid")=1

Example of Collating Alphabets in Reverse Order using gtm_ac_xform_1 and gtm_ac_xback_1

This example creates an alternate collation sequence that collates alphabets in reverse order. This is in contrast to the standard M collation that collates alphabets in ascending order.

[Important] Important

No claim of copyright is made with respect to the code used in this example. Please do not use the code as-is in a production environment.

Please ensure that you have a correctly configured GT.M installation, correctly configured environment variables, with appropriate directories and files.

  1. Download col_reverse_32.c from http://tinco.pair.com/bhaskar/gtm/doc/books/pg/UNIX_manual/col_reverse_32.c. It contains code for the transformation routine (gtm_ac_xform_1), the reverse transformation routine (gtm_ac_xback_1), and the version control routines (gtm_ac_version and gtm_ac_verify).

  2. Save and compile col_reverse_32.c. On x86 GNU/Linux (64-bit Ubuntu 10.10), execute a command like the following:

    gcc -c col_reverse_32.c -I$gtm_dist
    [Note] Note

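    The -I$gtm_dist option adds the GT.M distribution directory, which contains gtm_descript.h and gtmxc_types.h, to the compiler's include search path.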

  3. Create a new shared library or add the routines to an existing one. The following command adds these alternative sequence routines to a shared library called revcol.so on x86 GNU/Linux (64-bit Ubuntu 10.10).

    gcc -o revcol.so -shared col_reverse_32.o
  4. Set the environment variable gtm_collate_2 to point to the location of revcol.so. To set the local variable collation to this alternative collation sequence, set the environment variable gtm_local_collate to 2.

  5. At the GTM> prompt, execute the following command:

    GTM>Write $SELECT($$set^%GBLDEF("^E",0,2):"OK",1:"FAILED")
    OK
  6. Assign the following values to ^E:

    GTM>Set ^E("du Pont")=1
    GTM>Set ^E("Friendly")=1
    GTM>Set ^E("le Blanc")=1
    GTM>Set ^E("Madrid")=1
  7. Notice how the subscripts of ^E sort in reverse order:

    GTM>zwrite ^E
    ^E("le Blanc")=1
    ^E("du Pont")=1
    ^E("Madrid")=1
    ^E("Friendly")=1
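The contents of col_reverse_32.c are not reproduced here. As a rough sketch of what a pair of gtm32_descriptor routines could look like, the following code reverses the byte order by complementing every byte (255 minus the byte value), which is its own inverse. This is an assumption made for illustration, not the code in the downloadable file: it reverses all bytes rather than only alphabetic characters, and a key that is a prefix of another key still collates first. The typedef is a local copy of the one in gtm_descript.h, SKETCH_SUBSC2LONG is a hypothetical status code, and the return types follow the syntax lines shown earlier for each routine.

```c
#include <assert.h>

/* Local copy of the typedef from gtm_descript.h, so the sketch stands alone. */
typedef struct { unsigned int len; unsigned int type; void *val; } gtm32_descriptor;

#define SKETCH_SUBSC2LONG 12345678   /* hypothetical: output buffer too small */

/* Shared worker: writes 255 - byte for every input byte. Applying it twice
 * restores the original string, so both routines can use it. */
static int complement_copy(gtm32_descriptor *src, gtm32_descriptor *dst, int *dstlen)
{
    unsigned int i, n = src->len;
    const unsigned char *in = (const unsigned char *)src->val;
    unsigned char *out = (unsigned char *)dst->val;

    assert(dstlen != 0);
    if (n > dst->len)
        return SKETCH_SUBSC2LONG;
    for (i = 0; i < n; i++)
        out[i] = (unsigned char)(255 - in[i]);
    *dstlen = (int)n;
    return 0;
}

int gtm_ac_xform_1(gtm32_descriptor *src, int level, gtm32_descriptor *dst, int *dstlen)
{
    (void)level;                       /* reserved, unused */
    return complement_copy(src, dst, dstlen);
}

long gtm_ac_xback_1(gtm32_descriptor *src, int level, gtm32_descriptor *dst, int *dstlen)
{
    (void)level;
    return complement_copy(src, dst, dstlen);
}
```

A complete library would also need gtm_ac_version and gtm_ac_verify, as described in the version control section.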