This article was originally written as a Solar Orbiter science nugget to summarise my Laker et al. 2023 paper.

Space weather has the potential to significantly disrupt human technology both in space and on the ground, including loss of satellites, damage to power grids and communication blackouts^{1}. Fortunately, many of these effects can be mitigated with preventative measures, should the arrival time and severity of the storm be known in advance. Since the largest geomagnetic storms are driven by coronal mass ejections (CMEs), timely and accurate predictions of CME arrival times and internal structure are extremely important.

Unlike many natural hazards, it is relatively straightforward, through continuous monitoring of coronal images, to spot when a large CME has been launched from the Sun. However, simulating the CME’s journey through the solar wind is a more challenging task. Many sophisticated solar wind and CME models still suffer arrival time errors of $\pm 10$ hours, due to uncertain estimates of the CME’s initial parameters^{2} and the complex interaction of the CME with the ambient solar wind, which can lead to significant deflections, deformations and rotations^{3,4}.

Information regarding the orientation of the CME’s magnetic field, a primary indicator of geo-effectiveness, is often as valuable as the predictions of arrival time^{5}.
Knowing the potential impact of the CME, rather than just its arrival time, can limit the number of false positives and make predictions more useful for those commercial applications where there is a high cost of mitigation, e.g., putting a spacecraft into a safe mode.

Currently, a number of spacecraft (ACE, Wind and DSCOVR) monitor the real-time plasma conditions, providing us with a valuable early warning system. However, on a heliospheric scale, these spacecraft, located at the L1 Lagrange point, are too close to Earth, giving us less than 45 minutes of warning for potential space weather events.

Our understanding of CMEs and the solar wind is now being pushed forward by a new generation of spacecraft sent to investigate the inner heliosphere, including Solar Orbiter along with Parker Solar Probe and even BepiColombo as it makes its way to Mercury. Together, this constellation of spacecraft can give us insights that would not be possible with a single spacecraft^{6}. Several authors have used multi-point observations to better understand the complex morphology of CMEs^{7,8,9}, especially as they interact with the solar wind and other CMEs. In the context of space weather prediction, several case studies have shown how a space weather monitor far further upstream than L1 could be beneficial in predicting the arrival time and geo-effectiveness of CMEs^{10,11,12}. For instance, Amerstorfer et al. 2018^{12} made use of a 2010 CME to show how measurements from the MESSENGER spacecraft could reduce the arrival time error at STEREO-B (located around 1 au), which acted as a proxy for Earth.

Now, as recently reported in Laker et al. 2023^{13}, Solar Orbiter has been able to replicate these case studies in real time, acting as a space weather monitor in March 2022. Solar Orbiter’s low-latency data products, which provide real-time information about the magnetic field in the solar wind, combined with a fortuitous trajectory that crossed the Sun-Earth line, allowed the authors to temporarily transform Solar Orbiter into a real-time space weather monitor more than 0.5 au from the Earth.

Although originally intended to help point Solar Orbiter’s telescopic arsenal, the low-latency data product is of lower time resolution and quality than the scientific grade data typically released 90 days after download. To overcome these challenges, the Solar Orbiter MAG team^{14} created a pipeline to remove over 50 different heater signals and interference from other instruments aboard the spacecraft. The result was a cleaned product available only 12 minutes after the measurement was made in the solar wind, which includes the 4 minutes it takes light to travel such a large distance.

As Solar Orbiter crossed the Sun-Earth line, two CMEs were observed by the spacecraft, at the times depicted as red dots in Figure 1. Both case studies were tracked along their journey with the STEREO-A Heliospheric imager (HI) and subsequently measured in situ by Wind at L1.

*Figure 1: Trajectory of Solar Orbiter in the ecliptic plane of the Sun centred GSE frame between 1 February and 31 March 2022. The spacecraft crossed the Sun-Earth line on 6 March and then subsequently measured the two CME events on 7 March and 11 March, respectively. Each point represents the Solar Orbiter position at the start of each day. The Sun is represented by the orange filled circle, the Earth as a blue dot.*

Similar to the Amerstorfer et al. 2018^{12} study, the authors used data from the upstream spacecraft to constrain the arrival time predictions from the ELEvoHI model. Under normal operation, this model averages 210 ensemble members to estimate the arrival time of a CME at Earth. However, by only keeping those ensemble members that were within $\pm4$ hours of the observed arrival time at Solar Orbiter (red dots in Figure 2), they improved both the accuracy and precision of the model. Specifically, the mean absolute error was reduced from 10.4 to 2.5 hours and from 2.7 to 1.1 hours in the two case studies. Therefore, for the first time, a numerical model was constrained with data from 0.5 au to produce an updated prediction before the actual arrival of the CMEs at Earth.
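The constraining step itself is simple to sketch. The snippet below uses made-up ensemble numbers (not actual ELEvoHI output) to illustrate the idea: filter the ensemble on the upstream arrival error, then re-average.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical ensemble: arrival time errors (hours) at Solar Orbiter
# and at Earth for 210 members; the two errors are correlated
err_solo = rng.normal(0, 8, 210)
err_earth = err_solo + rng.normal(0, 3, 210)

# Keep only members arriving within +/-4 hours of the observed time upstream
keep = np.abs(err_solo) <= 4

mae_all = np.mean(np.abs(err_earth))
mae_constrained = np.mean(np.abs(err_earth[keep]))
print(f"{keep.sum()} of 210 members kept")
print(f"MAE at Earth: {mae_all:.1f} h -> {mae_constrained:.1f} h")
```

Because the upstream and Earth errors are correlated, discarding members that missed the Solar Orbiter arrival also removes the worst predictions at Earth.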

*Figure 2: Difference between simulated and true arrival time at Solar Orbiter (x axis) and Earth (y axis) for the 210 ensemble member runs from ELEvoHI. By constraining the model with the Solar Orbiter arrival time $\pm4$ hours (red dashed lines), only 99 and 85 ensemble members were kept for the two cases, respectively (red scatter points).*

Comparing measurements from Solar Orbiter and Wind, in Figure 3, revealed that the magnetic structure of the CME sheath and flux rope were remarkably similar for the second case study, despite being separated by 0.5 au. Crucially, the magnetic response at Earth (D$_{ST}$ and SYM/H indices) was consistent with the shape of the $B_z$ signature at Solar Orbiter, displaying three dips before the flux rope slowly rotated northward. So, at least for this event, the magnetic storm indicators at Earth were strongly correlated to the magnetic structure seen at 0.5 au around 40 hours prior. While the other case study was a more complex interaction between two CMEs, these results are still encouraging and suggest that a future upstream space weather monitor could not only predict the arrival of a CME, but the sub-structure of the flux rope and sheath region on an hourly time scale.

*Figure 3: In situ measurements for a CME case study in the GSE coordinate system, where vertical dashed lines show the start times of the event. Top panel shows the magnetic field components at Solar Orbiter, which have been shifted by 38.2 hours to match with the shock front at Wind in the panel below. The proton density, $Np$, at Solar Orbiter has been scaled by $1/r^{2}$. Both spacecraft observe three regions of negative $B_z$, followed by a flux rope with a positive $B_z$. This led to a geomagnetic storm at Earth, which also exhibited three dips in the D$_{ST}$ and SYM/H indices.*

In the future, more models can make use of this data, either to constrain the output or to initiate a CME simulation away from the complex environment near the Sun^{3,15}.
Fortunately, the opportunity to use Solar Orbiter for this purpose repeats once a year. When the next Carrington level event barrels towards Earth, maybe the Solar Orbiter MAG team will be the first to alert the world about its internal structure.

Timelapse in Google Earth is the largest video on the planet, of our planet. We hope that this perspective of the planet will ground debates, encourage discovery and shift perspectives about some of our most pressing global issues.

I am obviously aware that humans are irreversibly changing the planet, but the speed of the change is still striking - as this video of the Amazon rainforest demonstrates.

On a more light hearted note, it is fascinating to watch large scale infrastructure being built, such as Hong Kong’s airport

and the infamous World Islands project

Finally putting my Geography AS level to work, it is really interesting to watch this river in Peru meander through the flood plain leaving oxbow lakes in its wake. Obviously you see diagrams of this process in Geography lessons, but it is so cool to see it happening for real.

This video of the Massachusetts coastline also does a great job at showing the power of coastal erosion and spit formation.

Google Earth Engine is not just pretty visuals: it can be used to monitor crop health, air quality levels and algae blooms. Their aim is to streamline the data processing of large geospatial datasets, allowing scientists and policy makers to focus on the impactful research.

Much of this data is provided free by the United States Geological Survey, NASA and ESA through their fleet of Earth observation satellites, such as Sentinel.

In this video, they demonstrate how you can monitor crop health for individual fields over time!

As demonstrated in this talk, Hiplot is a tool that allows you to visualise high dimensional tabular data. Not only is this useful for exploring a new data set, but it can help you manually find cuts in the data to create a sort of dumb decision tree - useful as a target to beat with a fancier machine learning model later in the project.

The speaker (Vincent Warmerdam), who is also the creator of the amazing calmcode project, demonstrates that such a tool can be used to evaluate the best hyperparameters in a grid search.

Try exploring how the house price in London varies with postcode, property type and number of bedrooms. Narrow down the range of each column, by dragging a box vertically, to see how the price changes (as represented by the colour and rightmost column).

One of the benefits of spending large amounts of money on a Fujifilm camera is that there are enough images to create a travel photobook. We thought it would be fun to create a themed cover for the photobook, so that we can distinguish between all the locations and years on our bookshelf.

Here is our effort for the Lake District trip we enjoyed in December 2023, where we travelled from Penrith to Windermere, Ullswater and Blea Tarn.

I have created an interactive visualisation below to explain the underlying concepts and make tools like prophet seem less like magic… For example, try adjusting the coefficients *below* to model weekly seasonality with one of three linear models (dummy, radial basis function or Fourier). For further explanation, see the seasonality section

*This is actually running Python in your browser! Should take about 30 seconds to load, see this previous post or Shinylive*

Before researching for this blog post, I naively assumed that linear regression is restricted to a straight line fit, of the form:

\[y = m \times x + c\]where $y$ is our target variable, $x$ is the feature, $c$ is the intercept and $m$ is the coefficient for $x$.

However, it turns out that the “linear” in linear regression actually refers to the model being linear in its coefficients, i.e. each feature variable has a single constant coefficient describing its relationship to $y$ (the features themselves can be arbitrary transformations). This means that we can rewrite our model in terms of vectors and matrices, allowing us to extend the straight line to many dimensions:

\[\vec{y} = \mathbf{X} \vec{\beta}\]where $\vec{\beta}$ is a vector of $n$ coefficients ($n \times 1$):

\[\vec{\beta} = \begin{pmatrix} \beta_{0} \\ \vdots \\ \beta_{n-1} \end{pmatrix}\]and $\mathbf{X}$ is an ($i \times n$) matrix containing the feature variables:

\[\mathbf{X} = \begin{pmatrix} 1 & x^{(0)}_{1} & \cdots & x^{(0)}_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(i)}_{1} & \cdots & x^{(i)}_{n-1} \end{pmatrix}\]When we “fit” our model to the data, we are changing the values of the coefficients ($\vec{\beta}$) with the aim of minimising the error between the observed data and our predictions. This is normally represented by the mean squared error (MSE).

With a bit of linear algebra, we can solve this equation analytically, leading to $\hat{\beta}$ minimising the mean squared error:

\[\hat{\beta} = (\mathbf{X}^{T} \mathbf{X})^{-1} \mathbf{X}^{T} \vec{y}\]Later on, we will see how we can avoid overfitting by modifying the loss function to include some form of regularisation.
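As a quick sanity check of this closed-form solution, we can recover known coefficients from some fake data (in practice `np.linalg.lstsq` is the numerically safer choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data generated from y = 2x + 1 plus noise
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, x.size)

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])

# Closed-form solution: beta_hat = (X^T X)^-1 X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print(beta_hat)  # roughly [1, 2]
```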

After fitting our model, the coefficients can instantly tell us the effect of each of our feature variables. For example, we could say that for every $1^{\circ}$ temperature increase we expect the chance of rain to decrease by X amount. Such a simple statement is notoriously hard to make when using more complicated models, like neural networks.

But what if our data shows some structure?

For example, I have created some fake data that has a clear weekly seasonality. How can we possibly model this with a straight line? Well, with some clever feature engineering tricks, we can create a whole host of new features to model complex situations like this.

*Demonstration of dummy variables with the figure at the top of the page*

Using our intuition, we might think it is sensible to try and calculate the contribution from each day of the week. This can be represented by creating a dummy variable for each day, where $x_{\textrm{Monday}}$ is only equal to $1$ on a Monday and $0$ everywhere else. We can then scale this new feature with a coefficient, $\beta_{\textrm{Monday}}$, that represents **the average $y$ on a Monday**.

This new feature can be created with the following code:

```
import numpy as np

def dummy(x, start, width=1):
    # repeat every 7 days
    x_mod = x % 7
    # Create a boolean array, True for elements within the specified range
    condition = (x_mod >= start) & (x_mod < start + width)
    # Convert the boolean array to integers (True becomes 1, False becomes 0)
    return condition.astype(int)
```

While these dummy variables are useful for demonstrating that coefficients effectively represent the height of each feature, the final result does not look natural. The step-like shape means that we are expecting a significant change as soon as it goes 1 minute past midnight!

To create a smoother seasonality pattern, we can replace the step function with a repeating Gaussian distribution centered around each day of the week. This is referred to as a radial basis function:

\[y = e^{- (x - \textrm{center})^{2}/(2 \times \textrm{width})}\]This now lets the influence of a single day seep into adjacent days, leading to a much more pleasing final fit.

We can create the new features with the code below:

```
import numpy as np

def rbf(x, width, center):
    # repeat every 7 days
    x_mod = x % 7
    center_mod = center % 7
    # Original Gaussian
    gauss = np.exp(-((x_mod - center_mod)**2) / (2 * width))
    # Gaussian shifted by +7
    gauss_plus = np.exp(-((x_mod - (center_mod + 7))**2) / (2 * width))
    # Gaussian shifted by -7
    gauss_minus = np.exp(-((x_mod - (center_mod - 7))**2) / (2 * width))
    # Sum the contributions so the feature wraps around the week boundary
    return gauss + gauss_plus + gauss_minus
```

You may have noticed that we now have an extra variable, `width`, which controls the width of our individual Gaussian functions. This variable is not actually part of the linear regression, but is used to generate the features that the model learns from. It is therefore a hyperparameter that we can tune.

For example, try changing the width parameter in the top figure to see how it affects the MSE.
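As a sketch of that tuning loop, the snippet below uses fake weekly data and a compact periodic variant of the `rbf` function above (a circular distance within the week instead of three shifted Gaussians), then compares the MSE for a few widths:

```python
import numpy as np

def rbf(x, width, center):
    # circular distance to the centre within the 7-day week
    d = (x - center) % 7
    d = np.minimum(d, 7 - d)
    return np.exp(-d**2 / (2 * width))

rng = np.random.default_rng(1)
x = np.arange(0, 70, 0.5)
y = np.sin(2 * np.pi * x / 7) + rng.normal(0, 0.2, x.size)

mses = {}
for width in [0.1, 0.5, 2.0]:
    # one RBF feature centred on each day of the week
    X = np.column_stack([rbf(x, width, c) for c in range(7)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    mses[width] = np.mean((X @ beta - y) ** 2)
    print(f"width={width}: MSE={mses[width]:.3f}")
```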

We can also model seasonality with Fourier components, which are less prone to overfitting than the radial basis functions above.

This trick works because any periodic function can be represented as a sum of infinitely many sine and cosine waves, known as a Fourier series, as demonstrated below.

Therefore, we can create a new feature for each of these sine and cosine waves, with each coefficient representing the amplitude of that individual sine wave. This is equivalent to the coefficients that can be found by performing a Fourier transformation. The only difference is that we cannot have infinitely many sine waves in our linear regression, however, this is actually helpful to avoid overfitting!
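A minimal sketch of generating these Fourier features (the function name and defaults are my own, not from a library):

```python
import numpy as np

def fourier_features(x, period=7, order=2):
    # One sine and one cosine feature per harmonic, up to the given order
    features = {}
    for k in range(1, order + 1):
        features[f"sin_{k}"] = np.sin(2 * np.pi * k * x / period)
        features[f"cos_{k}"] = np.cos(2 * np.pi * k * x / period)
    return features

x = np.arange(14)
feats = fourier_features(x)
print(list(feats))  # ['sin_1', 'cos_1', 'sin_2', 'cos_2']
```

Each of these columns then becomes a feature in the linear regression, with the fitted coefficient acting as that wave’s amplitude.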

The figure below displays the first and second order Fourier components that are used in the interactive figure

What if I don’t trust old data as much as recently recorded data?

Many fitting libraries, such as scikit-learn, allow you to assign an importance (`sample_weight`) to each of the data points. Therefore, we can give exponentially decaying importance to older measurements as a way of ignoring potentially misleading historic data. The amount of “forgetfulness” then becomes another hyperparameter in our model.
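A sketch of this idea using scikit-learn’s `sample_weight` argument, with made-up data whose slope recently changed (the half-life value is just an illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
t = np.arange(100.0)

# Fake data whose slope recently changed from 0.5 to 2.0 at t = 80
y = np.where(t < 80, 0.5 * t, 40 + 2.0 * (t - 80)) + rng.normal(0, 1, t.size)

# Exponentially decaying importance, with a half-life of 10 time steps
half_life = 10
weights = 0.5 ** ((t.max() - t) / half_life)

model = LinearRegression().fit(t.reshape(-1, 1), y, sample_weight=weights)
print(model.coef_[0])  # pulled towards the recent slope of 2.0
```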

What if the seasonality changed over time?

In the figure below, the amplitude of the seasonality component is increasing each week. Not to worry, we can just create several extra features representing the interaction of day of the week with time. However, this kind of feature engineering can become tedious, hindering our flow. We now want to describe our models with a statistical language. For example, a straight line fit ($y = m \times x + c$) would be written as

```
y ~ 1 + x
```

with the `1` representing the intercept, and a coefficient being automatically generated for each variable, `x`. This might seem like overkill for such a simple model, but the real beauty becomes apparent with more complex problems…

Returning to the changing seasonality problem, we might start by saying there is an interaction between time, `t`, and the first Fourier cosine component, `cos_1`:

`y ~ 0 + t*cos_1`

The `*` operator will automatically generate coefficients for `t`, `cos_1` and the interaction term `t:cos_1`.

Note: In the notebook, the patsy package is used to convert these statistical formulas into a design matrix of features.
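To make the expansion concrete, here is what the `t*cos_1` formula produces as features, built by hand (without patsy) on some illustrative data:

```python
import numpy as np
import pandas as pd

t = np.arange(0, 28, 0.25)
df = pd.DataFrame({
    "t": t,
    "cos_1": np.cos(2 * np.pi * t / 7),
})

# The formula `y ~ 0 + t*cos_1` expands to the two base features
# plus their elementwise product as the interaction term
X = df.copy()
X["t:cos_1"] = X["t"] * X["cos_1"]
print(X.columns.tolist())  # ['t', 'cos_1', 't:cos_1']
```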

After fitting this formula, we get three coefficients that we can interpret as:

- For every 1 unit of `t`, $y$ will reduce by 0.01 (a straight line with negative slope)
- The base amplitude for the `cos_1` variable is -0.41
- For every unit of `t`, this amplitude will reduce by another -0.04 (making the amplitude larger)

Including the full interaction terms is as easy as amending the formula like so:

`y ~ t * (cos_1 + sin_1 + cos_2 + sin_2)`

which will autogenerate 9 features and coefficients.

However, there is no free lunch in modelling. Although we can now create hundreds of features using this statistical language, we now need to keep control of them…

One of the hardest parts of fitting any model, either simple or complex, is making sure we don’t overfit. We want to capture the general shape of the observed data, but do not want to perfectly predict each point, as this will just be fitting the noise rather than learning the underlying pattern.

One way to combat overfitting is with a technique called regularisation, which adds a penalty term to the regression loss function. Since the model is penalised for the size of its coefficients, only the most useful features keep a meaningful weight. This can be seen by comparing the figures above (regular fit) and below (with regularisation).

The exact form of the penalty can change, with the two main variants, LASSO and ridge regression, having a linear and quadratic penalty term, respectively.

An alpha parameter is often used to specify the trade-off between the model’s performance on the training set and its simplicity. So, increasing the alpha value simplifies the model by shrinking the coefficients.
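The shrinking effect of alpha can be sketched with scikit-learn’s `Ridge` (the data here is made up, with only two informative features out of twenty):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)

# 20 candidate features, but only the first two actually matter
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 100)

norms = {}
for alpha in [0.01, 1, 100]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms[alpha] = np.linalg.norm(model.coef_)
    print(f"alpha={alpha}: coefficient norm = {norms[alpha]:.2f}")
```

With `Lasso` instead of `Ridge`, the linear penalty would drive the irrelevant coefficients exactly to zero rather than merely shrinking them.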

At this point, you might be wondering what are the uncertainties on these coefficients?

While we can solve linear regression with matrix operations (as shown earlier), this is not the best way to understand the sensitivity in our coefficients. Instead, we can fit our model within a Bayesian framework. Not only will this automatically provide an uncertainty estimate in the form of a posterior distribution, but we can actually incorporate domain knowledge into our fitting (through the prior distribution).

To explain how, we first need a quick crash-course in Bayes’ theorem (or read this introduction).

Bayes’ theorem allows us to mathematically update the probability of an event happening, $P(E)$, based on some new data, D.

\[P(E|D) = \frac{P(D|E) \times P(E)}{P(D)}\]For example, we might be playing a game of heads or tails where I have to guess the coin face. I give you the benefit of the doubt and initially believe that there is a 50/50 chance of the coin being fair, $P(\textrm{Fair}) = 0.5$, which is called my `prior` assumption.

If you get three heads (3H) to start the game, do I still believe the coin is fair?

If the coin is fair, then on each coin flip the probability of getting heads is 50%. So the probability of $n$ heads in a row is equal to

\[P(H|\textrm{Fair}) = 0.5^{n}\]We will assume that if the coin is not fair the probability of $n$ heads is

\[P(H|\textrm{Not Fair}) = 0.75^{n}\]On each coin flip, I can update my beliefs to get a `posterior` probability, i.e. $P(\textrm{Event}|\textrm{Data})$. Doing the maths, we can say that:
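The update can be written out in a few lines, using the same numbers as above:

```python
# Posterior probability that the coin is fair after n heads in a row,
# starting from a 50/50 prior belief
def p_fair_given_heads(n, prior=0.5):
    likelihood_fair = 0.5 ** n       # P(n heads | fair)
    likelihood_biased = 0.75 ** n    # P(n heads | not fair)
    evidence = likelihood_fair * prior + likelihood_biased * (1 - prior)
    return likelihood_fair * prior / evidence

for n in range(1, 4):
    print(f"{n} heads: P(Fair) = {p_fair_given_heads(n):.2f}")
```

After three heads in a row, my belief that the coin is fair has dropped to about 23%.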

So, maybe we need to have a chat…

Using this analogy, we start off by assuming a prior distribution for each of our linear regression coefficients (usually just a Gaussian). We can then use Bayes’ theorem to update our beliefs about these coefficients, given the observed data. With each new data point, we get a better understanding of the posterior distribution for each coefficient!

This gets even more fun when you realise we don’t have to use the standard Gaussian priors. We can actually set up the priors to include knowledge we already have about the problem, i.e. we might already have some idea of what the seasonality should be, allowing a fit that might not be normally possible with limited data.
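To sketch how this updating works without a full probabilistic programming library, here is a grid approximation of the posterior for a single slope coefficient, with a Gaussian prior and made-up data (all the numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)

# Fake data from y = 2x + noise
x = rng.uniform(0, 1, 20)
y = 2 * x + rng.normal(0, 0.3, x.size)

# Grid of candidate slopes, with a Gaussian prior centred on 0 (sd = 3)
slopes = np.linspace(-5, 5, 1001)
prior = np.exp(-slopes**2 / (2 * 3**2))

# Gaussian likelihood of the data for every candidate slope
residuals = y[None, :] - slopes[:, None] * x[None, :]
log_like = -np.sum(residuals**2, axis=1) / (2 * 0.3**2)

# Posterior = prior x likelihood (normalised over the grid)
posterior = prior * np.exp(log_like - log_like.max())
posterior /= posterior.sum()

mean = np.sum(slopes * posterior)
print(f"posterior mean slope = {mean:.2f}")  # close to the true slope of 2
```

A library like PyMC does essentially this (with far better sampling machinery) for every coefficient at once.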

The above figure shows the final coefficients along with the 94% credible interval. This allows us to quote things as “we are 94% confident that Sunday gives an extra 0.746 to 1.071 units”. We can also make statements such as “we are the least sure about the 7th day”. Interestingly, the model is very confident about the amount of noise in the data (the `y_sigma` parameter), probably because I made this fake data with a single noise parameter…

The prophet model from Facebook provides many of these functionalities straight out of the box, and does such a good job of abstracting this complexity away that it kind of seems like magic. However, it is important to realise that prophet is just using some of the tricks from earlier:

- It is a linear model
- It models weekly seasonality with radial basis function dummy variables
- It models yearly seasonality with Fourier components (you can change the order with the `yearly_seasonality` parameter)
- It can fit in a Bayesian framework, giving uncertainties in the final predictions

To be fair, prophet also has a few tricks up its sleeve: it can model increases in sales on holidays in that region, handle logistic growth (along with changepoints), and deal with gaps in the data.

I would personally use prophet to rapidly experiment with seasonality, holidays and extra regressors. I would then re-create the model in PyMC to add more custom features, see this example of how to do this.

As a demonstration of how simple prophet is, we can condense most of the ideas in this blog post into the following code:

```
m = Prophet(
    weekly_seasonality=True,
    yearly_seasonality=5,
)
m.add_country_holidays(country_name='USA')
m.fit(df)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)
forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail()
```

We can also break down each of these components to extract more information

For a more detailed tutorial, see the official documentation

This post, and associated notebook, is heavily inspired by this fantastic talk by Vincent Warmerdam, along with this notebook investigating cycling patterns in Seattle.

As this article explains, a traditional recommendation algorithm would just say “you should watch these shows”. But Netflix go one step further and ask themselves “how can we convince you that this show is worth watching?”

A simple example of this concept is showing you film posters with actors you recognise:

Instead of using a traditional A/B test to trial a new algorithm such as this, the authors describe how they use a contextual bandit algorithm to balance data exploration against providing the best experience for the user.

As they explain, there is a lot of nuance to this problem:

- you need to understand the causal effect of suggesting an artwork, i.e. would a user have clicked on that show regardless of your proposed artwork?
- maintaining a recognisable image for the show. What if a user cannot find the artwork they saw last week?
- optimising across the whole screen. A poster of the main actor will not stand out if similar artwork is shown for all shows
- how to quickly learn the rules for a new title launch?

In this 2017 article, the authors say that a team of artists and designers create a wide range of images for each show. It would be interesting to learn how much of this work has now been taken over by generative AI.

Now, I could be disciplined and make sure the workspace is kept tidy, but frankly that requires a lot of effort (that I simply don’t have the patience for). Besides, the flurry of data exploration and creative problem solving is all part of the process that we should encourage.

As you have probably guessed, some smart *cookie* has already *made* tools to solve this problem… read on to understand just how terrible those puns are.

We will be using two tools to create our reproducible workflow: cookiecutter to stamp out our folder structure, and then make to automate our data processing and analysis. You can use my cookiecutter with the following (which is based on this original):

```
cookiecutter https://github.com/rlaker/cookiecutter-data-science
```

The first tool, cookiecutter, allows us to define the folder structure and include any boilerplate code that can streamline the process of creating the project. For example, with cookiecutter we can install and set up:

- a conda environment for this project with an `environment.yaml` file. Simply run `make environment` to install some common packages, e.g. pandas or matplotlib
- pre-commit hooks to be installed automatically, ensuring our formatting is consistent
- a python package for each project (which lives in the `src` folder). This means you can install the custom package into any notebook in the project without having to worry about paths to a local folder (see these blog posts here and here). This also makes the project more portable when it comes to productionising the model
- custom style files, such as the one from my previous post
- a place to store the raw `data`, `notebooks`, `models`, `figures` etc.

As well as helping us remember where all the necessary files are stored, such a consistent folder structure can also allow us to write all the necessary boilerplate in a `makefile`. Essentially, this file holds the terminal commands needed for make to reproduce the outputs of the project. Not only are makefiles **both human and machine readable documentation**, but they can specify the dependencies for each part of the project, i.e. the Latex report is dependent on the code to make the figures, which depends on the data processing script. By looking at the file timestamps, make can also infer which files need to be updated, only running the necessary parts of the pipeline.

While storing pre-processed versions of data files can save disk space, I learnt the hard way why this is a bad idea. **Do not** manually edit or change the raw data, you will forget what you did in a few weeks/months. Instead, create a pipeline and document the steps in the `makefile`. Even if you forget any of the steps to your implementation, you only need to run `make data`.

If you follow convention, then a completely new user could recreate any of your projects by just typing `make all`.

The power of make, combined with a consistent folder structure, can also enable you to automatically create documentation for your project with a single command. For example, running `make docs_html` in the newly created project will use pdoc to automatically generate a static website from the docstrings within the code!

If you want a more in depth review of this type of workflow:

Who knew that margarine consumption is correlated with the divorce rate in Maine? There is even a *very scientific* paper on the subject.

This is just one of thousands of spurious correlations from Tyler Vigen’s hilarious demonstration of data dredging (his figures are even included on the associated Wikipedia page). This is when you take many variables, say 25,237 like on his website, and blindly accept statistically significant correlations.

Turns out this is a major problem in the more statistical sciences, so much so that they now have a pre-registration format to describe what a study will investigate before any data is examined.

This project also provides a great example of generating realistic looking content, in the form of *scientific* papers, from LLMs. Each paper shows the sequence of prompts that were used to create it.

The author does point out that:

The silliness of the papers is an artifact of me (1) having fun and (2) acknowledging that realistic-looking AI-generated noise is a real concern for academic research (peer reviews in particular).

The papers could sound more realistic than they do, but I intentionally prompted the model to write papers that *look* real but *sound* silly.

Although, I’m sure you could convince some people that Anne Hathaway films are responsible for the number of votes for Republican senators…

For example, try adjusting the coefficients *below* to model weekly seasonality. For further explanation, see the main blog post

*This is actually running Python in your browser! Inspect the console for this page and you will see that Python packages have been installed.*

As demonstrated by Shiny’s website, Python packages could run interactive demos of each function in their documentation, without the user actually installing the package. Users could also see how they can modify the example by editing the documentation directly!

The only downside is that only a few select packages have been converted to work with Pyodide, but the data science big hitters are there: numpy, matplotlib and pandas.

After writing your shiny app, export it with the shinylive package

```
shinylive export app_folder site
```

You can then serve the site locally with

```
python -m http.server --directory site 8008
```

All we need to do now is upload this folder to our GitHub Pages site. I decided to put it in the files folder, since I do not need to show the `index.html` app file directly. Instead, I include the app in posts by using an `iframe`:
```
<iframe src="/files/shiny_linear_model/shiny_linear_model.html" width="100%" height="1000px" style="border:none;"></iframe>
```

Clearly, I want to still use Python to create my figures, but want to match the professional look of other’s plots. This also means that I won’t get any push back from my graphs being in a different format.

Matplotlib stylesheets provide a way to achieve a consistent styling to your figures, e.g.:

The different elements of the stylesheet are described below, along with some helpful snippets for legends and tick formatting.

This part contains the base seaborn style

```
# Seaborn common parameters
# .15 = dark_gray
# .8 = light_gray
figure.facecolor: white
text.color: .15
axes.labelcolor: .15
legend.frameon: False
legend.numpoints: 1
legend.scatterpoints: 1
xtick.direction: out
ytick.direction: out
xtick.color: .15
ytick.color: .15
axes.axisbelow: True
image.cmap: Greys
font.family: sans-serif
font.sans-serif: Arial, Liberation Sans, DejaVu Sans, Bitstream Vera Sans, sans-serif
grid.linestyle: -
lines.solid_capstyle: round
lines.linewidth : 2
lines.markersize : 10
# Seaborn whitegrid parameters
axes.grid: True
axes.facecolor: white
grid.color: .8
xtick.major.size: 4
ytick.major.size: 0
xtick.minor.size: 2
ytick.minor.size: 0
```

Cycle through the branded colours of my organisation (Trainline):

```
#cycler
axes.prop_cycle: cycler('color', ['00a88f','ff9da1','160078','ffc508','ff6120','004ff9','ac3200'])
```

Set the grid style and labels:

```
# grid
axes.grid.axis: y # which axis the grid should apply to
axes.grid.which: major # grid lines at {major, minor, both} ticks
#font size
font.size : 18
axes.titlesize : 24
figure.titlesize: 24
axes.labelsize : 20
xtick.labelsize : 16
ytick.labelsize : 16
legend.fontsize : 16
# label pad
axes.labelpad: 8.0 # space between label and axis
```

Following the Excel style, only show the bottom spine

```
# spines
axes.spines.left: False # display axis spines
axes.spines.bottom: True
axes.spines.top: False
axes.spines.right: False
```

Set the date format

```
# DATES
date.autoformatter.year: %y
date.autoformatter.month: %m/%y
date.autoformatter.day: %d/%m/%y
date.autoformatter.hour: %m-%d %H
date.autoformatter.minute: %d %H:%M
date.autoformatter.second: %H:%M:%S
date.autoformatter.microsecond: %M:%S.%f
```

Format the y axis to have comma-separated numbers, e.g. 100,000:

```
import matplotlib.ticker as ticker
# Just put a , between 000
axs.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:,.0f}'))
# % symbol
axs.yaxis.set_major_formatter(ticker.StrMethodFormatter('{x:.1f}%'))
# currency
axs.yaxis.set_major_formatter(ticker.StrMethodFormatter('£{x:,.2f}'))
```

or

```
from matplotlib.ticker import FuncFormatter

def millions(x, pos):
    # The two args are the value and tick position
    return f'£{x*1e-6:,.1f}m'

formatter = FuncFormatter(millions)
ax.yaxis.set_major_formatter(formatter)
```

from here

Legend on top of the plot (guide)

```
ax.legend(bbox_to_anchor=(0, 1, 1, 0), loc="lower left", mode="expand", ncol=2)
```

Legend on the right

```
ax.legend(bbox_to_anchor=(1, 1), loc="upper left")
```