Aspects of frequency-selection and stochastic optimization

Figure 3.8: Frequency-selection encoding results. Each grey spot at coordinates $(\omega_j, s_i)$ denotes an assignment of frequency $\omega_j$ to source $s_i$. Upper panels show the mappings accrued over 3 iterations, while lower panels show those accrued over 31 iterations. A bright spot indicates mapping points that coincide across iterations. Earlier mappings carry decayed charges, which explains the varying grey levels of the points.
\includegraphics[width=5in]{fwi_fig/freq2Src_assign}

In step 4, for the mfs, I propose a Quasi-Monte Carlo frequency-source encoding strategy, motivated as follows. It is desirable that every source, over the many iterations of fwimfs, samples the frequency components assigned to it as evenly as possible. For example, with 10 frequency components and 4 iteration steps, a particular source might be assigned the indices $9, 2, 5, 7$, one frequency index per iteration. An undesirable assignment is $3, 2, 2, 1$, which over-represents the low frequency index 2 and omits the medium- and high-frequency components. Moreover, nearby sources tend to illuminate overlapping regions of the subsurface, so if their frequency contents differ, together they cover a wider range of frequency components. Returning to the preferable example $9, 2, 5, 7$, a neighboring source emitting sine waves with frequency indices $2, 9, 5, 7$ would be less desirable than one emitting the wider, complementary set of indices $1, 10, 3, 8$.

To achieve this end, I introduce repulsive Coulomb forces between 2D point charges, each point $(\omega_j, s_i)$ denoting a mapping of frequency $\omega_j$ to source $s_i$. Simulated through greedy optimization, this electrostatic system settles into a low-energy configuration in which the charged points spread out as much as possible. Examples of this encoding strategy are shown in the left column of Figure 3.8; they appear more uniformly distributed than their counterparts from a standard random permutation, shown in the right column.
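To make the scheme concrete, the following is a minimal Python sketch of one greedy placement step. The unit grid spacing in the $(\omega, s)$ plane, the exponential charge-decay rate, and the softened $1/r$ potential are illustrative assumptions rather than details given in the text, and the hypothetical function name is mine.

\begin{verbatim}
import numpy as np

def qmc_frequency_assignment(n_freq, n_src, history, decay=0.9, eps=1e-3,
                             rng=None):
    """Greedily assign one frequency index to each source so that the new
    (frequency, source) point charges repel the existing ones; charges from
    earlier iterations are exponentially decayed."""
    rng = np.random.default_rng() if rng is None else rng
    # Collect past mappings as charges (freq_index, src_index, charge).
    charges = []
    for age, past in enumerate(reversed(history)):
        q = decay ** (age + 1)                  # older mappings carry less charge
        charges.extend((f, s, q) for s, f in enumerate(past))

    assignment = np.empty(n_src, dtype=int)
    for s in rng.permutation(n_src):            # greedy sweep over sources
        energy = np.zeros(n_freq)
        for f in range(n_freq):
            for fc, sc, q in charges:
                r = np.hypot(f - fc, s - sc)    # distance in the (freq, src) plane
                energy[f] += q / (r + eps)      # softened 1/r Coulomb repulsion
        f_best = int(np.argmin(energy))
        assignment[s] = f_best                  # lowest-energy frequency for source s
        charges.append((f_best, s, 1.0))        # newly placed unit charge
    return assignment
\end{verbatim}

Calling this routine once per iteration, with the list of previous assignments as the history, would accumulate point patterns of the kind shown in the left column of Figure 3.8.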

As the specific frequency-selection code changes over the iterations of FWI, this falls in the realm of stochastic optimization (Spall, 2003). Since the convergence of a line search in stochastic optimization is still an open research problem, I adopt a hybrid approach: run a gradient descent method with line search for the first $K_{0}$ iteration steps of FWI, then switch to a stochastic gradient descent method whose step size is $\propto 1/k$, with $k$ the iteration step index. This is a lightweight and robust algorithm that converges almost surely to a local minimum (Kiwiel, 2001). What remains is to choose an appropriate constant coefficient for this step-size formula. The recipe is as follows: first, identify the smallest $\eta$, denoted $\eta_{min}$, among the first $K_{0}$ steps of line search, where each step is expressed in the form

$\displaystyle \vert\vert\Delta {\bf {x}}\vert\vert = \eta \vert\vert\nabla J({\bf {x}})\vert\vert.$ (3.2)

Here, $ {\bf {x}}$ is the unknown parameter vector, such as the velocity model; $ \Delta {\bf {x}}$ is the update of $ {\bf {x}}$ at a step; and $ \nabla J$ represents the gradient of the objective function. Then, the constant coefficient of the step size formula can be fixed accordingly, as

$\displaystyle \Delta {\bf {x}}^k = - \eta_{min}\frac{K_{0}}{k} \nabla J({\bf {x}}^k), ~~~~\textrm{for}~~k = K_{0}+1, \ldots.$ (3.3)

A similar recipe for determining the step size of stochastic gradient descent is suggested by Bottou and Bousquet (2011).
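The hybrid schedule can be sketched as follows. This is a minimal illustration, assuming generic callables J and gradJ for the encoded misfit and its gradient (hypothetical interfaces that change with the frequency-selection code each iteration) and an illustrative line-search bracket; it is not the actual inversion code.

\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar

def hybrid_descent(x0, J, gradJ, K0, n_iter):
    """Line-search gradient descent for the first K0 steps, then the
    diminishing-step stochastic gradient descent of Eq. (3.3)."""
    x = x0.copy()
    eta_min = np.inf
    for k in range(1, n_iter + 1):
        g = gradJ(x, k)                      # gradient under the k-th encoding
        if k <= K0:
            # Brent line search along -g; the step alpha equals eta of Eq. (3.2)
            res = minimize_scalar(lambda a: J(x - a * g, k),
                                  bracket=(0.0, 1e-2))
            alpha = res.x
            eta_min = min(eta_min, alpha)    # smallest eta over the K0 steps
        else:
            alpha = eta_min * K0 / k         # diminishing step of Eq. (3.3)
        x = x - alpha * g
    return x
\end{verbatim}

Note that for a steepest-descent update $\Delta {\bf {x}} = -\alpha \nabla J$, the step length $\alpha$ coincides with $\eta$ in Equation (3.2), so recording the smallest $\alpha$ of the first $K_{0}$ line searches directly yields $\eta_{min}$.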

As mentioned at the end of Section 3.2, Brent's method for line search requires 5 function evaluations on average, whereas a stochastic gradient descent method requires none. Consequently, per iteration step of the latter, the computational cost is due only to the gradient computation, which requires, with the transient-reduction scheme, $5nt$ FDTD propagation steps. This compares with $8nt$ in the standard approach, whose overhead stems from the accurate line search needed to benefit from the CG method. This lightweight stochastic gradient descent approach thus allows more iterations and therefore more frequency-selection codes to be used.

To further reduce the stochasticity of the gradient, I empirically adopt averaging of two successive gradient calculations: perform the multisource encoding and gradient computation twice, then stack the gradients. This doubles the computational cost, so there is a tradeoff between the number of gradients in the average and the convergence speed of the stochastic gradient descent. This averaging (also known as mini-batch) scheme applies to each of the 8 supergathers, which are formed by dividing up the 496 shot gathers.
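As a sketch of this averaging, the routine below assumes hypothetical encode and compute_gradient callables standing in for the multisource encoding and the FDTD gradient computation; the interfaces are mine, not the thesis code.

\begin{verbatim}
import numpy as np

def averaged_gradient(supergather_shots, encode, compute_gradient, n_avg=2,
                      rng=None):
    """Stack n_avg gradients of the same supergather, each obtained with a
    fresh frequency-selection encoding, to reduce gradient stochasticity."""
    rng = np.random.default_rng() if rng is None else rng
    grad = None
    for _ in range(n_avg):                        # encode and migrate twice
        encoded = encode(supergather_shots, rng)  # new frequency-selection code
        g = compute_gradient(encoded)
        grad = g if grad is None else grad + g
    return grad / n_avg                           # averaged (mini-batch) gradient
\end{verbatim}

With the 496 shot gathers split into 8 supergathers of 62 shots each, such a routine would be invoked once per supergather per iteration.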

I start the inversion with the data bandpass filtered from 0 to 6 Hz. The initial velocity model is decimated to a grid of $nz \times nx = 51 \times 376$. At later iterations the band is widened to 15 Hz and the model is upsampled (with interpolation) to $nz \times nx = 101 \times 752$. Finally, the frequency band covers 0 to 25 Hz and the velocity model has grid size $nz \times nx = 201 \times 1504$. I use standard FWI only for the first two stages, because their computation is negligible compared to that of the third. For example, the second stage has a quarter of the model size, half the time samples (because of the doubled time sampling interval), and half the number of shots (due to downsampling). Therefore the computational load per iteration of the second stage is only $1/16$ of the third.
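For concreteness, these factors combine multiplicatively to give the quoted per-iteration cost ratio between the second and third stages:

$\displaystyle \frac{\textrm{cost}_{2}}{\textrm{cost}_{3}} = \underbrace{\frac{1}{4}}_{\textrm{model size}} \times \underbrace{\frac{1}{2}}_{\textrm{time samples}} \times \underbrace{\frac{1}{2}}_{\textrm{shots}} = \frac{1}{16}.$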
