Maximum subarray problem

Figure: Visualization of how subarrays change based on start and end positions of a sample. Each possible contiguous subarray is represented by a point on a colored line. That point's y-coordinate represents the sum of the sample, its x-coordinate represents the end of the sample, and the leftmost point on that colored line represents the start of the sample. In this case, the array from which samples are taken is [2, 3, −1, −20, 5, 10].

In computer science, the maximum subarray problem is the task of finding the contiguous subarray within a one-dimensional array of numbers, a[1...n], which has the largest sum.

The list usually contains both positive and negative numbers, along with 0. For example, for the array of values −2, 1, −3, 4, −1, 2, 1, −5, 4, the contiguous subarray with the largest sum is 4, −1, 2, 1, with sum 6.

Some properties of this problem are:

  1. If the array contains only non-positive numbers, the solution is the single number with the smallest magnitude (the least negative element).
  2. If the array contains only non-negative numbers, the problem is trivial: the maximum sum is the sum of all the elements in the list.
  3. The empty subarray is not considered a valid solution.
  4. Several different subarrays may achieve the same maximum sum.
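The properties above can be checked directly with a short exhaustive search (a brute-force sketch in Python; the function name is illustrative):

```python
def max_subarray_exhaustive(a):
    # Check every contiguous, non-empty subarray and return the best sum.
    best = a[0]
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]          # total is the sum of a[i..j]
            best = max(best, total)
    return best

print(max_subarray_exhaustive([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6
print(max_subarray_exhaustive([-7, -3, -5]))                     # -3 (smallest magnitude)
print(max_subarray_exhaustive([1, 2, 3]))                        # 6 (sum of all elements)
```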

The sections below discuss several algorithmic techniques used to solve this problem (brute force, divide and conquer, and dynamic programming) and compare the time complexity of the different algorithms for the MSP.

History

The problem was first posed by Ulf Grenander of Brown University in 1977 as a simplified model for maximum likelihood estimation of patterns in digitized images. Grenander was looking for the maximum sum over an m × n rectangular region of an array of real numbers, and designed an algorithm that ran in O(n⁶) time. Because this algorithm was too slow, he simplified the problem to one dimension to gain insight into its structure; as a result, he was able to solve the original problem in a faster O(n³) time. Soon after, Jay Kadane of Carnegie Mellon University (Bentley 1984) designed a linear-time algorithm for the one-dimensional problem.

Algorithms to solve the two-dimensional version of the problem were later designed and run in O(n³) time, using Kadane's algorithm or a divide-and-conquer approach. Overall, the optimal running time for a one-dimensional array is O(n), while for two-dimensional arrays the running time has been improved to sub-cubic by the efforts of Tamaki and Tokuyama.[citation needed]

Using the Bird–Meertens formalism, an O(n) algorithm has been derived by algebraic manipulation of the brute-force algorithm in a functional programming language.[1]

Applications

Maximum subarray algorithms are used for data analysis in many fields, such as genomic sequence analysis, computer vision, and data mining.

Genomic sequence analysis employs maximum subarray algorithms to identify biologically significant segments of protein sequences. For example, scores assigned position-by-position along a protein sequence can be arranged as a linear array, and maximum-sum segments of that array can highlight regions relevant to the protein's structure and function. Biologists find this approach efficient and easy to apply to their data.

The score-based technique of finding the segment with the highest total sum is used in many problems similar to the MSP. In genomic sequence analysis, these problems include finding conserved segments, GC-rich regions, tandem repeats, low-complexity filters, DNA binding domains, and regions of high charge.[citation needed]

For computer vision, maximum subarray algorithms are used on bitmap images to detect the highest-scoring region, which corresponds to the brightest area in an image. The image is treated as a two-dimensional array of positive values corresponding to the brightness of each pixel. The algorithm is run after normalizing the array: the mean brightness is subtracted from every value.
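As a one-dimensional illustration of the normalization step described above (the function name and sample data are illustrative; the full application uses two-dimensional arrays), each pixel value is shifted by the mean brightness so that a maximum subarray corresponds to an above-average, i.e. bright, run of pixels:

```python
def brightest_segment(brightness):
    """Return the (start, end) indices of the contiguous run of pixels whose
    total brightness relative to the mean is largest (mean-centering + Kadane)."""
    mean = sum(brightness) / len(brightness)
    scores = [b - mean for b in brightness]   # normalize: subtract the mean brightness
    best = cur = scores[0]
    start = best_start = best_end = 0
    for i in range(1, len(scores)):
        if cur + scores[i] < scores[i]:       # better to start a new segment at i
            cur = scores[i]
            start = i
        else:
            cur += scores[i]
        if cur > best:                        # new brightest segment found
            best = cur
            best_start, best_end = start, i
    return best_start, best_end

print(brightest_segment([10, 12, 200, 220, 210, 15, 11]))  # (2, 4)
```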

Data mining is another application of maximum subarray algorithms, using numerical attributes. To understand the role of the maximum subarray problem in data mining, it is important to be familiar with the association rule and its parts. An association rule is an if/then statement that uncovers relationships between seemingly unrelated pieces of data. Its two parts are the antecedent (the "if" statement) and the consequent (the "then" statement). In business applications of data mining, association rules are important for examining and forecasting customers' behavior. Companies can anticipate common trends by representing such problems as arrays to be normalized and solved as maximum subarray problems; the highest-scoring region of the array tells a company what customers are responding well to and can influence the company's actions as a result.[citation needed]

Algorithm 1: Brute Force

Brute-force search is an exhaustive method that checks all possibilities for the solution to the problem. This algorithm is simple to implement and is guaranteed to find a solution. Brute force is slow and inefficient when the problem size is large, but is often used when the size is small and manageable. If simplicity is more important than speed, then brute force is a reasonable approach.

When using brute force to solve the maximum subarray problem, two nested for loops enumerate all pairs of positions i ≤ j in the array A, and the sum of A[i..j] is obtained incrementally as the sum of A[i..j−1] plus A[j].

int maxSumSoFar = A[0]; //start from A[0..0], so arrays of all-negative values are handled
int maxISoFar = 0;
int maxJSoFar = 0;
for (int i = 0; i < n; i++) {
    int sum = 0;
    for (int j = i; j < n; j++) {
        sum += A[j]; //sum is that of A[i..j]
        if (sum > maxSumSoFar) {
            maxSumSoFar = sum;
            maxISoFar = i;
            maxJSoFar = j;
        }
    }
}

The running time of this algorithm is O(n²).

Algorithm 2: Divide and Conquer

Divide and Conquer is a technique that uses recursion to break a problem into sub-problems until they can be solved directly. The solutions are then put together to solve the original problem.

This algorithm solves the MSP by dividing the initial array into halves until single elements remain, then combining the maximum sums of the sub-problems as the recursion unwinds. While searching for the maximum sum in the left half and the right half of the list, it is also important to consider the maximum-sum subarray that crosses the middle of the list, as shown below.

The following is pseudo code:

procedure MaxSum(f,t) begin
//Finds maximum sum in a[f..t]; sum[i] denotes the prefix sum a[1] + ... + a[i], with sum[0] = 0
if f = t then return (a[f], sum[f − 1], sum[f]) //One-element array
c ← (f + t − 1)/2 //Halves array. left is a[f..c], right is a[c+1..t]
(mLeft, minPLeft, maxPLeft) ← MaxSum(f,c)
(mRight, minPRight, maxPRight) ← MaxSum(c + 1,t)
minP ← MIN {minPLeft, minPRight} //Min prefix sum in sum[f−1..t−1]
maxP ← MAX {maxPLeft, maxPRight} //Max prefix sum in sum[f..t]
mCenter ← maxPRight − minPLeft //Solution for the center problem
M ← MAX {mLeft, mRight, mCenter}
return (M, minP, maxP)
end

The time complexity can be described as:

T(n)= 2T(n/2) + O(1), T(1) = O(1)

where 2T(n/2) represents the two recursive calls on the halves of the list, and O(1) represents the time for calculating mCenter at each recursion. Solving the recurrence gives T(n) = O(n).
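Unrolling the recurrence (writing c for the constant hidden in the O(1) terms) makes the linear bound explicit:

```latex
T(n) = 2T(n/2) + c
     = 4T(n/4) + 2c + c
     = \dots
     = n\,T(1) + (n-1)\,c
     = O(n)
```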

Thus the total running time of this algorithm is O(n).
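A runnable translation of the pseudocode above (a sketch in Python; the prefix-sum array s plays the role of the pseudocode's global sum array, with s[0] = 0):

```python
def max_sum_dc(a):
    # Prefix sums: s[i] = a[0] + ... + a[i-1], with s[0] = 0.
    s = [0]
    for x in a:
        s.append(s[-1] + x)

    def rec(f, t):
        # Returns (max subarray sum in a[f..t], min of s[f-1..t-1], max of s[f..t]),
        # using 1-based indices into a, mirroring MaxSum(f, t).
        if f == t:
            return a[f - 1], s[f - 1], s[f]
        c = (f + t - 1) // 2                  # left half a[f..c], right half a[c+1..t]
        m_left, minp_left, maxp_left = rec(f, c)
        m_right, minp_right, maxp_right = rec(c + 1, t)
        minp = min(minp_left, minp_right)
        maxp = max(maxp_left, maxp_right)
        m_center = maxp_right - minp_left     # best subarray crossing the midpoint
        return max(m_left, m_right, m_center), minp, maxp

    return rec(1, len(a))[0]

print(max_sum_dc([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # 6
```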

Kadane's algorithm (Algorithm 3: Dynamic Programming)

A bit of background: Kadane's algorithm is based on splitting up the set of possible solutions into mutually exclusive (disjoint) sets. It exploits the fact that any solution (i.e., any member of the set of solutions) always has a last element at some position i (this is what is meant by the "sum ending at position i"). Thus, we simply have to examine, one by one, the set of solutions whose last element's index is 1, the set of solutions whose last element's index is 2, then 3, and so forth up to n. It turns out that this process can be carried out in linear time.

Kadane's algorithm begins with a simple inductive question: if we know the maximum subarray sum ending at position i (call this B[i]), what is the maximum subarray sum ending at position i + 1 (equivalently, what is B[i + 1])? The answer turns out to be relatively straightforward: either the maximum subarray sum ending at position i + 1 includes the maximum subarray sum ending at position i as a prefix, or it does not (equivalently, B[i + 1] = max(A[i + 1], B[i] + A[i + 1]), where A[i + 1] is the element at index i + 1).

Thus, we can compute the maximum subarray sum ending at each position by iterating once over the array. As we go, we simply keep track of the maximum sum we have seen so far. The problem can then be solved with the following code, expressed here in Python:

def max_subarray(A):
    max_ending_here = max_so_far = A[0]
    for x in A[1:]:
        max_ending_here = max(x, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far

Note: with a bit of reasoning, you will see that max_so_far equals the maximum, over all positions i, of the maximum subarray sum ending at position i.

The algorithm can also be easily modified to keep track of the starting and ending indices of the maximum subarray (updated whenever max_so_far changes), as well as to handle the variant in which zero-length subarrays (with implicit sum 0) are allowed when all elements are negative.
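The index-tracking variant just described can be sketched as follows (a minimal extension of the code above; the tuple return convention is illustrative):

```python
def max_subarray_with_indices(A):
    # Kadane's algorithm, additionally recording where the best subarray
    # starts and ends. Indices are 0-based and inclusive.
    best = cur = A[0]
    cur_start = best_start = best_end = 0
    for i in range(1, len(A)):
        if A[i] > cur + A[i]:       # better to start a new subarray at i
            cur = A[i]
            cur_start = i
        else:
            cur += A[i]
        if cur > best:              # new overall maximum: record its bounds
            best = cur
            best_start, best_end = cur_start, i
    return best, best_start, best_end

print(max_subarray_with_indices([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (6, 3, 6)
```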

Because of the way this algorithm uses optimal substructure (the maximum subarray ending at each position is calculated in a simple way from a related, smaller, overlapping subproblem: the maximum subarray ending at the previous position), it can be viewed as a simple example of dynamic programming.

The runtime complexity of Kadane's algorithm is O(n).

Generalizations

Similar problems may be posed for higher-dimensional arrays, but their solutions are more complicated; see, e.g., Takaoka (2002). Brodal & Jørgensen (2007) showed how to find the k largest subarray sums in a one-dimensional array in the optimal time bound O(n + k).

The maximum-sum k-disjoint subarrays can also be computed in the optimal time bound O(n + k).[2]

K-Overlapping Maximum Subarray Problem

This problem asks for the K largest subarray sums, reported in sorted order, where the subarrays may overlap. Below, various techniques to speed up the computation are explored.

Definition: M = MAX(K, L), where MAX selects the K largest elements of the list L (here, the list of all subarray sums).

Example: If a = {3, 2, 5, −1}, then there are 10 subarrays in total (4(4+1)/2 = 10). Allowing overlapping subarrays, the first maximum subarray sum is 10 (a[1] + a[2] + a[3]), the second is 9 (a[1] + a[2] + a[3] + a[4]), and the third is 7 (a[2] + a[3]).
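The example above can be checked by enumerating all subarray sums directly (a brute-force sketch, not one of the efficient algorithms that follow):

```python
def k_largest_subarray_sums_naive(a, K):
    # Enumerate all n(n+1)/2 contiguous subarray sums, then take the K largest.
    sums = []
    for i in range(len(a)):
        total = 0
        for j in range(i, len(a)):
            total += a[j]          # total is the sum of a[i..j]
            sums.append(total)
    return sorted(sums, reverse=True)[:K]

print(k_largest_subarray_sums_naive([3, 2, 5, -1], 3))  # [10, 9, 7]
```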

O(Kn) Time Algorithm

The following is pseudo code:

C ← ∅
for k ← 1 to MIN {K, n} do
    min[k] ← ∞
end for
sum[0] ← 0, min[1] ← 0, M[1] ← 0
for i ← 1 to n do
    sum[i] ← sum[i − 1] + a[i]
    for k ← 1 to MIN {K, i} do append sum[i] − min[k] to C
    insert sum[i] into min
end for
M ← extract(K, C)
sort M

In this algorithm, C starts as an empty list, and each candidate sum[i] − min[k] is appended to C. The K largest candidates are then extracted from C and sorted; it is required that the solution be reported in sorted order. For K ≤ n, the time is O(Kn), though in the extreme case where K = n(n+1)/2, the total time is O(n² log n) due to sorting.
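A runnable sketch of the pseudocode above (Python; bisect.insort maintains the sorted list of prefix sums standing in for the pseudocode's min structure):

```python
from bisect import insort
import heapq

def k_max_sums(a, K):
    # K largest (possibly overlapping) subarray sums, following the O(Kn) scheme:
    # pair each prefix sum with the K smallest prefix sums seen before it.
    C = []
    mins = [0]                      # sorted prefix sums seen so far (sum[0] = 0)
    s = 0
    for x in a:
        s += x
        for m in mins[:K]:          # the K smallest earlier prefix sums
            C.append(s - m)         # candidate: a subarray ending here
        insort(mins, s)             # keep the prefix sums sorted
    return sorted(heapq.nlargest(K, C), reverse=True)

print(k_max_sums([3, 2, 5, -1], 3))  # [10, 9, 7]
```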

O(n log K + K²) Time Algorithm

This algorithm reduces the number of candidates before the final K elements are selected, by limiting the amount of computation. Since it is assumed that K ≤ n, the number of candidates decreases from Kn to K². The list of candidates is kept sorted in decreasing order.

The following is pseudo code:

//Initialization
for k ← 1 to K do min[k] ← ∞, M[k] ← −∞
sum[0] ← 0, min[1] ← 0, M[1] ← 0
//Pre-process: Sampling
for i ← 1 to n do
    sum[i] ← sum[i − 1] + a[i]
    //sample for initial K large values
    sample[i] ← sum[i] − min[1]
    if sum[i] < min[1] then min[1] ← sum[i]
end for
KthSample ← K-th max of sample[1..n]
//Candidate Generation
min[1] ← 0
for i ← 1 to n do
    if sum[i] − min[1] ≥ KthSample then
        //Part I
        for k ← 1 to K do cand[k] ← sum[i] − min[k]
        M ← extract(K, merge(M, cand))
    end if
    //Part II
    insert sum[i] into min
end for

In the pre-processing section of this algorithm, the array is traversed and the prefix sums are computed in O(n) time.

When developing the candidate list in Part I, a balanced tree is used to access the list of minimum prefix sums in O(log K) time; reading all of its leaves with a depth-first traversal then takes O(K) time. Part II finds the position for each new entry and performs the insertion, over n iterations. Part I has a total time of O(K²) for its K iterations, while Part II has a total time of O(n log K); together their time is O(n log K + K²).

This algorithm is also related to dynamic programming. The total time of the algorithm as a whole is O(n log K + K²).[citation needed]

O(n log K) Time Algorithm

This algorithm applies an efficient selection algorithm by Frederickson and Johnson to the K-maximum subarray problem. Their selection algorithm finds the k-th smallest element of an n × m array with sorted columns in O(m + p log(k/p)) time, where p = min{k, m}. The algorithm first discards unnecessary items until O(k log k) items are left, then reduces these further to O(k) items; lastly, the k-th smallest element is selected by a linear-time selection algorithm.

Frederickson and Johnson's algorithm works as follows: after the K-th largest element of the first row is selected, the columns are reorganized so that elements greater than the selected value are moved to its left and smaller elements to its right; the items on the right are then discarded as unnecessary. For the second row, the (K/2)-th largest element is selected, and the columns are reorganized to follow this process, with the row number doubling at each operation.

Below, array construction is incorporated into the selection algorithm.

The following is pseudo code:

//Initialization
for k ← 1 to K do min_0[k] ← ∞
for i ← 1 to n do u[i] ← K
sum[0] ← 0
//Pre-process
for i ← 1 to n do
    insert sum[i − 1] into min_{i−1}, which creates min_i
    delete min_i[K + 1] //deletes from min_i to keep its size at K
    sum[i] ← sum[i − 1] + a[i]
end for
//Sampling/re-indexing
q ← n, q′ ← K, p ← 0, idx ← [1, 2, ...n]
while 2^p ≤ K do
    //Compute row A_{2^p}[1..q] of the candidate matrix, stored in A
    for i ← 1 to q do A[i] ← sum[idx[i]] − min_{idx[i]}[2^p]
    l ← q′-th max of A[1..q]
    Partition A into (A1, A2, A3), where
    A1 = {x | x ∈ A, x > l}, A2 = {x | x ∈ A, x = l}, A3 = {x | x ∈ A, x < l}
    Copy prefix sum indices of elements in (A1, A2, A3) to idx[1..q]
    for i ← 1 to q do u[i] ← 2^p − 1
    q ← q′, p ← p + 1, q′ ← ⌈K/2^p⌉
end while
//Candidate Generation
C ← ∅
for i ← 1 to K do
    //u[i]: number of candidates to generate in col. i of A
    for k ← 1 to u[i] do append sum[idx[i]] − min_{idx[i]}[k] to C
end for
//Final Selection of K maxima
M ← extract(K, C)
sort M

The initialization of min_0[1..K] with K elements takes O(K) time.

The pre-process involves handling the prefix sums and storing the minimum prefix sums in a balanced tree. This takes O(n log K) time.

There are two cases in the sampling section: the first iteration (p = 0) and the subsequent iterations (p ≥ 1). Together, the total time is O(n log K) for K ≤ n.

In the candidate-generation section, the total time for generating the O(K log K) remaining elements is O(K log K).

Thus, the total time of the algorithm as a whole is O(n log K).[citation needed]

References

  1. ^ Richard S. Bird (1989). "Algebraic Identities for Program Calculation" (PDF). The Computer Journal. 32 (2): 122–126. Here: Sect. 8, p. 126.
  2. ^ Bengtsson, Fredrik; Chen, Jingsen (2007). "Computing maximum-scoring segments optimally" (PDF). Research report. Luleå University of Technology (3).
