Your Ad Here

Monday, May 24, 2010

New Requirements by NSF for Data Management

The American National Science Foundation (NSF) has recently announced new requirements for handling and sharing data from projects funded by the agency.

"Science is becoming data-intensive and collaborative," noted Ed Seidel, acting assistant director for NSF's Mathematical and Physical Sciences directorate. "Researchers from numerous disciplines need to work together to attack complex problems; openly sharing data will pave the way for researchers to communicate and collaborate more effectively."

"This is the first step in what will be a more comprehensive approach to data policy," added Cora Marrett, NSF acting deputy director. "It will address the need for data from publicly-funded research to be made public."

The new requirements, although publicly stated to encourage better sharing of data within the science community, are almost certainly related to the recent Climategate fiasco.  It should be seen as a reminder to us all to follow proper procedures when writing software, especially in the area of source code control.

For more information on Climategate see the articles in Wikipedia, Wallstreet Journal, the original files related to the scandal, and an analysis of the emails.

For more information on the subjects that are more directly computer related see the articles in Wikipedia for LIMS, data management, and source code control.

Friday, May 21, 2010

Roman Numeral to Decimal Conversion - Algorithm 2

A second method for converting Roman numerals to decimal numbers has been posted on the Computer Science 101 website.

According to the article on that site, the algorithm boils down to this:

  1. Set the character pointer to point at the right-most character.
  2. Set the value of the summing variable to be the value of the current character.
  3. Move the character pointer to the left one step.
  4. Compare the value of the current character to the previous character:
  5. If the current character's value is the same or larger than the previous character, ADD its value to the sum.
  6. Otherwise, if the current character's value is smaller than the previous character, SUBTRACT its value from the sum.
  7. Repeat steps 3-6 until the left-most character has been evaluated.

 
You can get a full listing of a function employing this algorithm written in Visual Basic on the Computer Science 101 website.
 
 

Thursday, May 13, 2010

Decimal Number to Roman Numeral Conversion Version 3

Two previous posts showed how to convert a decimal value to a Roman numeral.  The first method was very clunky and elementary, the second method is considerably more efficient.

It ocurred to me, after designing and implementing the second algorithm, that if you're willing to sacrifice a bit more memory space for a table, then there's a faster way to convert from decimal values to Roman numeral notation.  The following code shows an implementation of the algorithm in VB .NET.  I call it the MCXI algorithm due to the values of the four rows in the table. There are some funny things going on between the 0 based indexing used in accessing array data and 1 based indexing used in the MID function used for parsing the input string that you may need to address this if you use a different programming language.


    Private Sub btnConvert_Click(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles btnConvert.Click

        ' Algorithm MCXI
        ' Written by: Erik Oosterwal
        ' Written on: 1-21-2008
        '
        ' Routine to convert Decimal values to Roman Numeral notation.  This algorithm uses a table to store
        '   all the standard Roman Numeral notation for each digit in each Base-10 digit position (ones, tens,
        '   hundreds, thousands).  Each Decimal character is then evaluated to find its magnitude and extract
        '   the corresponding Roman Numeral notation for that value and prepend it to the output string.
        '
        ' This routine does not check for proper usage - the input is not checked for valid numeral characters
        '   and it does not check to verify that the values are positive values between 1 and 4000.  To make
        '   function robust, you check put checks in place for these conditions.

        'Store the Roman codes for each Decimal digit.
        Dim strRoman(,) As String = {{"", "I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX"}, _
                                     {"", "X", "XX", "XXX", "XL", "L", "LX", "LXX", "LXXX", "XC"}, _
                                     {"", "C", "CC", "CCC", "CD", "D", "DC", "DCC", "DCCC", "CM"}, _
                                     {"", "M", "MM", "MMM", "MMMM", "", "", "", "", ""}}

        Dim strOutput As String = Nothing               ' Initialize the output string to blank.
        Dim intLoopI As Integer = Nothing               ' Counter used for the Loop
        Dim intStrLength As Integer = Nothing           ' The length of the input string.

        intStrLength = Len(Me.txtInput.Text)            ' Find out how many digits are in the input number (magnitude).

        For intLoopI = intStrLength To 1 Step -1        ' MID function uses 1 based indexing for the string pointer.

            strOutput = strRoman(intStrLength - intLoopI, Val(Mid$(Me.txtInput.Text, intLoopI, 1))) & strOutput

        Next

        Me.txtOutput.Text = strOutput

    End Sub


As you can see, algorithm number three (MCXI Algorithm) executes the loop a maximum of 4 times and uses no division or multiplication. This should speed up the run time a great deal over the previous two listings.

Tuesday, May 11, 2010

Fizz Buzz Busted!?

On the site Computer Science 101, Erik Oosterwal presents two different solutions to the FizzBuzz problem and demonstrates why it is important for developers on larger software teams to write code that is developer-friendly even when the developer-friendly code is less efficient.

Here's a preview of one of the two solutions:


void FizzBuzz2(void)
{
    int i = 0;
    char output[10];
    for(i = 1; i<=100; i++)
    {
        strcpy(output, "");
        if(i%3==0)
        {
            strcpy(output, "Fizz");
        }
        if(i%5==0)
        {
            strncat(output, "Buzz", 4);
        }
        if(strcmp(output, "")==0)
        {
            printf("%d\n", i);
        }
        else
        {
            printf("%s\n", output);
        }
    }
}

Monday, May 10, 2010

Roman Numeral to Decimal Conversion

You can find algorithms for creating Roman Numerals from decimal values here and here

In this post we'll take a look at the first of two algorithms for accepting a Roman Numeral and converting it back into a decimal value.

The basic rules for this algorithms are:

Start at the left-most character in the Roman Numeral string and work your way to the last character one character at a time. Add the value of the current character to the total. If the current character has a larger value than the previous character, then subtract 2x the value of the previous character from the total.

For instance the Roman Numeral XIX has a decimal value of 19.  We start at the first character, 'X', and add 10 to the running total; our total is now 10.  We evaluate the second character, 'I' and add 1 to our running total, which now becomes 11.  We evaluate the third character, 'X', and add 10 to our total; our total is now 21, but...  Because the current character has a larger value than the previous character we must subtract 2x the value of the previous character, 1, from the total bringing the new total to 19.  There are no more characters so we stop.


Function Roman2Decimal(strInput as string) as integer

   Dim intValues() as integer                ' Reserve space for an array.
   Dim intLoopI as integer = nothing         ' Initialize the loop counter to 0.
   Dim intTotal as integer = nothing         ' Clear the total.
   Dim intPreviousPointer as integer = 99999 ' Set the value of the previous pointer to some MAX value.
   Dim intCurrentPointer as integer = nothing  ' This will hold the value of the current Roman Numeral character.

   intValues(M) = 1000             ' This array addressing method
   intValues(D) = 500              '  assumes that the language you
   intValues(C) = 100              '  are using allows you to use
   intValues(L) = 50               '  values other than integer values
   intValues(X) = 10               '  to access data in an array.
   intValues(V) = 5
   intValues(I) = 1

   For intLoopI = 0 to (len(strInput)-1)   ' Assuming 0 indexing...

      intCurrentPointer = intValues(mid$(strInput,intLoopI,1)) ' Get the value of the current Roman Numeral character
      intTotal = intTotal + intCurrentPointer                  ' Add the value to the total.

      If intCurrentPointer > intPreviousPointer then       ' If the current value is larger than the previous, then
         intTotal = intTotal - (2 * intPreviousPointer)    '  subtract twice the value of the previous character from the total.
      EndIf

      intPreviousPointer = intCurrentPointer      ' Move the value of the current character to be the
                                                  '  value of the previous character and get then next one.

   Next

   Roman2Decimal = intTotal

End Function

Spam

Due to the high number of spam comments recently, the ability to leave anonymous comments has been turned off. You can still leave comments by signing in with your Google or OpenID passwords.

Thanks for your patience in this matter.

Thursday, May 6, 2010

Infix to RPN Notation

Here is a limited algorithm for converting infix notation to RPN notation. The algorithm has no way of handling unary minuses (indicating negative numbers) nor does it deal with functions such as SIN(), COS(), SQR(), etc. I describe one way of handling unary minus in the comments of the program listing below and will alter the code to handle those as well as sines, cosines, square roots, etc. at a later date. In the mean time if you need an infix to RPN converter, this should handle most of your needs.

Parsing from left to right, any numeric value we encounter or any alphanumeric string that could be interpreted as a variable is sent directly to the output, including decimal points.

When we encounter anything other than numbers or letters (or a decimal point), we do the following:

Append a token separator to the output string (in our case a comma.)

If we encounter an open parenthesis, that parenthesis is pushed to a FILO stack.

If we encounter a closing parenthesis, we discard it and pop operators from the stack to the end of the output string until we come to an open parenthesis. The open parenthesis is then also discarded.

If we encounter a low priority operator (+ or -) then we must first pop all operators (+, -, *, /, ^) from the stack to the end of the output string until the stack is empty or we encounter an open parenthesis and then push the low priority operator on to the stack.

If we encounter a medium priority operator (* or /) then we must first pop any medium or high priority operators (*, /, ^) from the stack to the end of the output string until the stack is empty or we encounter a low priority operator (+ or -) or we encounter an open parenthesis, then push the medium priority operator on to the stack.

If we encounter a high priority operator (^) we pop all other high priority operators from the stack to the end of the output string then push the new high priority operator to the stack.

Once all the input character have been processed, we pop operators off the stack until it's empty.


The listing below shows a Visual Basic implementation of the algorithm:


'
'   Infix to RPN - program to convert infix notation to
'       RPN notation.  Operands are separated by commas, to
'       use a different separation character change the
'       line where txtSeparator is initialized.  The operator
'       stack has been set to an upper limit of 100 operators,
'       if you expect to have input strings with more than 100
'       operators (heaven forbid...) increase the size of
'       txtStack in the Dim statement.
'
'       This routine does not handle unary minus signs.  To allow
'       unary minus signs two additional changes must be made.  In the
'       Low Operator case (+, -) we must check to see if the previous
'       input character was an alphanumeric.  If it IS alphanumeric
'       we treat the minus as a normal binary operator.  If it is NOT
'       alphanumeric, then we append a zero and token separator to the
'       output string, and push a special unary minus character to the
'       stack, which has an operator value equal to ^.  You can use the
'       tilde "~" as the unary minus character since it's not being used
'       for anything else.  I'll make this changes to the routine at a
'       later date.
'
'   Written by -    Erik Oosterwal
'   Started on -    Nov. 14, 2005
'   Completed on -  Nov. 15, 2005

    Dim txtSeparator, txtStack(100), txtCharacter As String     ' Declare all variables
    Dim intPointer, intStackPointer As Integer                  ' used in the routine

    txtSeparator = ","      ' Set the token separator to a comma
    intPointer = 1          ' Start by looking at the first character in the input string
    intStackPointer = 1     ' Start the operator stack at 1
    txtOutput.Text = ""     ' Clear the output sting

    While intPointer <= Len(txtInput.Text)                      ' As long as there's characters in the string
    
        txtCharacter = Mid$(txtInput.Text, intPointer, 1)       ' get the next character.
        
        Select Case txtCharacter
        
            Case "0" To "9", ".", "A" To "Z", "a" To "z"        ' Numbers and variables get appended
                txtOutput.Text = txtOutput.Text + txtCharacter  '   directly to the output string.
                
            Case "("
                txtStack(intStackPointer) = txtCharacter        ' Open parentheses get pushed on the stack
                intStackPointer = intStackPointer + 1
                
            Case ")"                                                    ' If we find a closing parenthesis
                While intStackPointer > 1 And _
                        txtStack(intStackPointer - 1) <> "("            ' Clear the stack until we
                    txtOutput.Text = txtOutput.Text + txtSeparator + _
                        txtStack(intStackPointer - 1)                   ' get to the opening parenthesis.
                    intStackPointer = intStackPointer - 1
                Wend
                intStackPointer = intStackPointer - 1       ' Decrease the stack pointer to overwrite the opening parenthesis.
                
            Case "+", "-"                                               ' Lowest operators
                txtOutput.Text = txtOutput.Text + txtSeparator          ' Append a token separator to the output string.
                
' If there are any operators on the stack AND they're higher or equal valued than + and -, then...
'   pop them off the stack and append them to the output string.
                While intStackPointer > 1 And _
                        txtStack(intStackPointer - 1) <> "("
                    txtOutput.Text = txtOutput.Text + _
                        txtStack(intStackPointer - 1) + txtSeparator
                    intStackPointer = intStackPointer - 1
                Wend
                txtStack(intStackPointer) = txtCharacter            ' Push the low operator on the stack.
                intStackPointer = intStackPointer + 1
                
            Case "*", "/"                                           ' Medium operators
                txtOutput.Text = txtOutput.Text + txtSeparator      ' Append a token separator to the output string.
                
' If there are any operators on the stack and they're higher or equal valued than * and /, then
'   pop them off the stack and append them to the output string.
                While intStackPointer > 1 And _
                        (txtStack(intStackPointer - 1) = "*" Or _
                        txtStack(intStackPointer - 1) = "/" Or _
                        txtStack(intStackPointer - 1) = "^")
                    txtOutput.Text = txtOutput.Text + _
                        txtStack(intStackPointer - 1) + txtSeparator
                    intStackPointer = intStackPointer - 1
                Wend
                txtStack(intStackPointer) = txtCharacter        ' Push the medium operator on the stack.
                intStackPointer = intStackPointer + 1
                
            Case "^"                                            ' High operators
                txtOutput.Text = txtOutput.Text + txtSeparator  ' Append a token separator to the output string.
                While intStackPointer > 1 And _
                        txtStack(intStackPointer - 1) = "^"
                    txtOutput.Text = txtOutput.Text + _
                        txtStack(intStackPointer - 1) + txtSeparator
                    intStackPointer = intStackPointer - 1
                Wend
                txtStack(intStackPointer) = txtCharacter        ' Push the high operator on the stack.
                intStackPointer = intStackPointer + 1
                
' There is no default case included in this routine.  Any characters not explicitly taken care
'   of by the cases will be ignored as whitespace.  Some characters, such as x, #, and others
'   that are used to denote binary, hex, octal, etc. numbers could be taken care of in the first
'   case where numeric and alpha tokens are handled.  It's ok to ignore spaces, tabs and other
'   white space.
        
        End Select
    
        intPointer = intPointer + 1             ' Set the pointer to look at the next character
        
    Wend
    
'   All the input characters have been taken care of, now it's time to clear the stack.
'
    While intStackPointer > 1                               ' As long as there's still operators on the stack
        txtOutput.Text = txtOutput.Text + txtSeparator + _
            txtStack(intStackPointer - 1)                   ' Take one off and append it to the output string
        intStackPointer = intStackPointer - 1               ' look at the previous operator
    Wend

End Sub

Wednesday, May 5, 2010

A Replacement for the Bubble Sort

For reference, here's a listing of the Bubble Sort algorithm in Visual Basic:


' Bubble Sort

    Dim I, J, IntArray(), Temp as Integer ' Declare Variables

    For I = 1 To Length - 1 ' Supposed to be the smaller value
        For J = I To Length ' Supposed to be the larger value
            If IntArray(I) > IntArray(J) Then ' Compare Cells
                Temp = IntArray(I) ' \
                IntArray(I) = IntArray(J) ' | Swap Cells
                IntArray(J) = Temp ' /
            End If
        Next
    Next

There's not much good to say about the Bubble Sort except that it's small and compact size-wise. Performance-wise it's a whole other animal.

So what do you do when you need a sorting algorithm that's comparable to a Quick Sort in terms of performance but still only takes up about as much space as the Bubble Sort? Consider implementing the O-Sort. Here's a listing of the O-Sort in Visual Basic:


' O-Sort   (Oosterwal Sort)  Version 3

    SngPhi = 0.78     ' Define phi value
    SngFib = Length * SngPhi    ' Set initial real step size
    Step = Int(SngFib)     ' Set initial integer step size
    
    While (Step > 0)
        
        For I = 1 To Length - Step   ' Range of lower cell
            If IntArray(I) > IntArray(I + Step) Then ' Compare cells
                Temp = IntArray(I)   ' \
                IntArray(I) = IntArray(I + Step) ' | Swap Cells
                IntArray(I + Step) = Temp  ' /
            End If
        Next
            
        SngFib = SngFib * SngPhi   ' Decrease the Real step size
        Step = Int(SngFib)    ' Set the integer step value
        
    Wend

Here's how it works.

The O-Sort is a type of Comb Sort algorithm. Comb Sort algorithms use a step size or interval between the two values being compared that starts off large, compared to the number of values being sorted, and becomes smaller for each subsequent pass of the data. The idea behind the algorithm is to get chunks of the larger values to the end of the list and chunks of the smaller values to the beginning of the list quickly, then with each subsequent pass of the data look at values that are closer together and moving smaller chunks of data shorter distances until during the last pass neighboring data points are compared.

In practice, with 10,000 elements of completely randomized data the O-Sort takes about 1.6 times longer to sort the data than the Quick Sort. For comparison, the Shell Sort takes 36 times as long and the Bubble Sort takes over 200 times as long.

When you have data that is less randomized, such as what you would get when appending lists of sorted data, the Quick Sort is less quick. When 10% of the data is randomized the O-Sort requires 0.96 as much time to completely sort the data as the Quick Sort. When only 5% of the data is randomized, the O-Sort takes only 0.61 as much time to completely sort the data as the Quick Sort.

When you need a fast in-place sorting algorithm that doesn't require much code space consider using the O-Sort routine.