[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2. mbl: Manchester basics library.

Chapter summary:
General utilities used by several mul libraries.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.1 Data Collectors, and Data Wrappers.

The STL iterators are great general purpose devices - however they force authors of new algorithms to write heavily templated code. This can prove difficult for the user, and impossible for some compilers in the case of extra-templated member functions. The mbl classes are templated, but on data-type, and this template parameter doesn't have the same potential to confuse.

The purpose of the mbl_data_collector and mbl_data_wrapper are to provide similar purpose as STL iterators, but using polymorphism, instead of templating to handle different types of iterator. The descendants of mbl_data_collector provide something similar to an STL output iterator. The descendants of mbl_data_wrapper provide something similar to an STL input iterator.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.1.1 Data Wrappers.

When reading data from an mbl_data_wrapper, you do not need to know which polymorphic concrete type is being used. This way a client can pass you data from any source. Reading data from an mbl_data_wrapper is straight forward.

 
    vnl_vector<double> my_sum(mbl_data_wrapper<vnl_vector<double> >& data)
    {
      data.reset();
      vnl_vector<double> sum = data.current();
      while (data.next())
        sum += data.current();
    }

You can do random access, using the set_index(int n) method, but there is no guarantee that the underlying iterator supports this efficiently, so you should use sequential access where possible. The wrappers themselves do not support IO, because they are only wrappers - they do not own the wrapped data.

Currently there is only one concrete wrapper, mbl_data_array_wrapper which wraps C-style data arrays. This can be used to wrap vcl_vectors. Wrappers have in the past been written to read files of data directly from disk.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.1.2 Data Collectors.

When writing data to an mbl_data_collector, you do not need to know which polymorphic concrete type is being used. This way you can pass a client data in any way they choose. Writing data to an mbl_data_collector is straight forward.

 
    void my_random_data(mbl_data_collector<vnl_vector<double> >& data)
    {
      // Create sampler
      vpdfl_axis_gaussian gaussian;
      gaussian.set(vnl_vector<double>(20,0.0),
                   vnl_vector<double>(20,1.0));
      vpdfl_sampler_base *sampler = gaussian.new_sampler();

      // set up stuff
      data.clear();
      vnl_vector<double> x;

      // put random samples in collector
      for (unsigned i=0 i<2000; ++i)
      {
        sampler->sample(x);  // generate random sample
        data.record(x);     // record sample
      }
    }

You can store a collector using vsl, although if you want to store a polymorphic collector variable, you will need to cast the pointer to a mbl_data_collector_base(1).

You can create a mbl_data_wrapper from an mbl_data_collector by calling collector.data_wrapper().

There are two concrete data collectors - mbl_data_collector_list and mbl_stochastic_data_collector. The former stores all the data given to it, whilst the latter stores a fixed size complete random subset of all the data given to it. This is useful if you do not know how much data an algorithm is going to produce, but for space or time reasons you can only store and use subset of it.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2 Random Number Generator

Rather than use vnl_sample(), vnl_random, provides a superior random number generator.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2.1 The randomness of vnl_random (Advanced Topic)

I was using a vpdfl_axis_gaussian_sampler to produce some test data, and there seemed to be too many outliers. I was worried about the randomness of the vnl_random::normal(), so I ran some tests.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2.1.1 Tim's rank diagram test

mbl_mz_random_graph1

This seems normal.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2.1.2 Testing specifically for outliers.

For a seed of 123456, the numbers of outliers at 1, 2, 3, and 4 s.d. were all greater than the expected value. This bias was reduced but still evident when the number of samples was increased from 10^4 to 10^5 and 10^6.

The results for a key of 9667566 are

 
Got 3183 samples outside 1 sd- expected3174
Got 463 samples outside 2 sd- expected455
Got 28 samples outside 3 sd- expected27.5
Got 0 samples outside 4 sd- expected0.6

These seem more normal.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2.1.3 Testing distribution of outliers separations.

One further possibility is that there might be correlation between adjacent samples. So I counted the separations (number of samples) between outliers (those that exceeded 2 standard deviations), and plotted the distribution of intervals, along with the expected distribution. The first graph is for a seed of 123456, and the second for a seed of 9667566. (Note: The key on the following two graphs is reversed.)

mbl_mz_random_graph2 mbl_mz_random_graph3


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.2.1.4 Conclusions

There does not appear to be a persistent bias in the number of outliers as the seed is modified.

A seed of 123456 does seem to give more outliers than expected, and in particular gives more adjacent outliers that expected - which can have significant effects on 2-D test data set generation.

Use a different seed 9667566

Ian Scott 29 November 2000


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.3 Miscellaneous

mbl_cloneable_ptr

Almost the opposite of a vbl_smart_ptr, this class acts like a pointer, but performs a deep copy of the referenced object when the pointer itself is copied. Supports IO via vsl.

mbl_chord

Describes a horizontal line on a image.

mbl_priority_bounded_queue

This acts like a vcl_priority_queue, but only stores the n least values, where n is the configurable bound.

mbl_stats_1d, mbl_sum_1d

Calculate, store, and manipulate running statistics and sums.

mbl_lda

Calculate, manipulate and use a Linear Discriminant Analysis of data.

mbl_gamma.h

Complete and incomplete gamma functions

mbl_k_means.h

Perform k means clustering.

mbl_matrix_products.h, mbl_matxvec.h

Perform various linear algebra operations.

mbl_print.h

Commands that debuggers can run easily on demand. Include this file in your application, and you call these functions from a debugger to display common containers.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4 Further Development


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.1 Other Algorithms.

It would be useful to have several random number generators all with a similar interface to vnl_random. Adding an abstract base class, would allow the use of the strategy pattern - for instance swapping in a more random generator.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.2 Input/Output

Since the random number generator has a real state, it could be useful to be able to save that state to a vsl_b_istream.


[ < ] [ > ]   [ << ] [ Up ] [ >> ]         [Top] [Contents] [Index] [ ? ]

2.4.3 Move contents to VXL

With a bit more development some of this code could be usefully moved into @VXL . e.g.

 
mbl_chord -> vil1_chord,
mbl_gamma.h -> vnl_gamma.h,
mbl_priority_bounded_queue.h -> vbl_priority_bounded_queue,
etc.

[ << ] [ >> ]           [Top] [Contents] [Index] [ ? ]

This document was generated on May, 1 2013 using texi2html 1.76.