Using Julia to generate a dataset with a given correlation
EE
September 8, 2022
This is going to be a short one. I recently saw a comment on Twitter about an interview question that asked someone to generate a dataset with variables X and Y that are correlated at r = .8, so I figured I’d write out some code that does this as a way to practice Julia a little more.
First, we load our packages.
using Statistics
using Distributions
using CairoMakie #for plotting
using Random #to set a seed
Random.seed!(0408)
TaskLocalRNG()
The approach here is to define a covariance (correlation) matrix and a vector of means, then define a multivariate normal distribution parameterized by the two. We’ll then use this distribution to generate our data.
First we’ll define \(\Sigma\), which is our covariance matrix. Since we’re generating a dataset with only 2 variables, this will be a 2x2 matrix with 1s on the diagonal and .8 on the off-diagonals. Because both variances are 1, the covariance matrix is also the correlation matrix, so .8 is exactly the correlation we want between X and Y.
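A minimal sketch of that matrix in Julia. The variable name `Σ` is an assumption, picked to match the notation above.

```julia
# 2x2 covariance matrix: unit variances on the diagonal,
# .8 covariance (and so .8 correlation) on the off-diagonals
Σ = [1.0 0.8;
     0.8 1.0]
```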
Then we’ll define a mean vector. This will be a 2-element vector (one for each variable), but we don’t actually care what the values are here, so let’s just make them 0.
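Something like the following would do it. The name `μ` is an assumption, and `zeros(2)` is just one of several ways to build a vector of two zeros.

```julia
# mean vector: one entry per variable, both set to 0
μ = zeros(2)
```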
2-element Vector{Float64}:
0.0
0.0
Now we can define a distribution given \(\Sigma\) and \(\mu\).
And then we can draw a sample of 200 observations from this distribution.
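One way to write this step, using the `MvNormal` constructor from Distributions.jl. The names `d`, `Σ`, and `μ` are assumptions, and the matrix and mean vector are restated so the snippet runs on its own.

```julia
using Distributions

Σ = [1.0 0.8; 0.8 1.0]  # covariance matrix described above
μ = zeros(2)            # mean vector described above

# bivariate normal distribution with mean μ and covariance Σ
d = MvNormal(μ, Σ)
```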
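A sketch of the sampling step, assuming a sample size of 200 (which is what the output dimensions suggest). The names `d` and `s` are assumptions, and the exact numbers drawn can differ across Julia versions even with the same seed.

```julia
using Distributions
using Random

Random.seed!(408)  # same seed value as the one set earlier

d = MvNormal(zeros(2), [1.0 0.8; 0.8 1.0])

# draw 200 observations; each column of s is one (X, Y) pair
s = rand(d, 200)
```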
2×200 Matrix{Float64}:
-1.40556 0.469524 -1.19092 -0.40408 … -0.244792 0.874835 -0.719764
-0.595655 1.01141 -1.84189 -0.550097 0.250661 1.72269 -0.862095
To confirm this works as expected, we can plot the sample.
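A rough sketch of a scatterplot with CairoMakie. The axis labels and the names `s`, `fig`, `ax`, and `plt` are assumptions, and the sample is regenerated here so the snippet stands alone.

```julia
using Distributions
using CairoMakie
using Random

Random.seed!(408)
s = rand(MvNormal(zeros(2), [1.0 0.8; 0.8 1.0]), 200)

# row 1 of s holds X, row 2 holds Y
fig, ax, plt = scatter(s[1, :], s[2, :]; axis = (xlabel = "X", ylabel = "Y"))
fig
```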
It looks like a .8 correlation to me. But to do a final check, we can get the correlation matrix of our sample.
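A sketch of that check using `cor` from the Statistics standard library. The names `s` and `r` are assumptions. Since the sample stores variables in rows and observations in columns, the correlation is taken along `dims = 2`.

```julia
using Statistics
using Distributions
using Random

Random.seed!(408)
s = rand(MvNormal(zeros(2), [1.0 0.8; 0.8 1.0]), 200)

# variables are in rows, observations in columns,
# so correlate across the second dimension
r = cor(s, dims = 2)
```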
2×2 Matrix{Float64}:
1.0 0.769654
0.769654 1.0
Close enough. Our correlation won’t be exactly equal to .8 using this approach since we’re sampling from a distribution, but there’s really no difference (imo) between a .77 correlation and a .80 correlation.
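To see that the gap is just sampling noise, here’s a rough sketch comparing a 200-observation sample to a much larger one. The helper `sample_cor` and the sample sizes are hypothetical, chosen only for illustration.

```julia
using Statistics
using Distributions
using Random

Random.seed!(408)
d = MvNormal(zeros(2), [1.0 0.8; 0.8 1.0])

# hypothetical helper: sample correlation between X and Y for n draws
sample_cor(n) = cor(rand(d, n), dims = 2)[1, 2]

r_small = sample_cor(200)      # varies noticeably around .8
r_large = sample_cor(200_000)  # should land much closer to .8
```

The larger the sample, the tighter the sample correlation sits around the .8 we put into \(\Sigma\).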
@online{ekholm2022,
author = {Ekholm, Eric},
title = {Generating {Data} with a {Given} {Correlation}},
date = {2022-09-08},
url = {https://www.ericekholm.com/posts/cor-generate-data},
langid = {en}
}