diff --git a/README.md b/README.md
index d63a6a1..187e68b 100644
--- a/README.md
+++ b/README.md
@@ -1,11 +1,64 @@
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture,
 Project 1 - Flocking**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Stephen Lee
+  * [LinkedIn](https://www.linkedin.com/in/stephen-lee-bb5a40163/)
+* Tested on: Windows 10, i7-9750H @2.60GHz, RTX 2060 6GB (personal laptop)
 
-### (TODO: Your README)
+# Project Goals
+This project was an investigation on the performance implications of several algorithms to simulate the movement of Boid particles with some slight adaptations from [Conrad Parker's](http://www.vergenet.net/~conrad/boids/pseudocode.html) notes. More information on this can be found [here](INSTRUCTION.md). The project encompassed implementing and testing three algorithms: a naive implenetation, an algorithmic optimizationt to reduce the search space, and finally code optimizations to reduce the number of memory accesses.
+ 
+# Boid Basics
+Boids are meant to imitate the flocking behaviors of migrating birds or schools of fish. As such, boid move around in this simulation according to these three rules:
 
-Include screenshots, analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+1) Boids try to fly towards the center of mass of neighboring voids.
+2) Boids try to keep a small distance away from other objects (this includes other boids)
+3) Boids try to match the velocity of nearby boids.
+
+The application of these rules lends itself to a high-degree of parallelization, thus they have been implemented across threads on GPU cores. The pseudocode for implementing these rules was provided in the [instructions](INSTRUCTION.md).
+
+Early on in execution the boids are scattered randomly throughout the space
+<img src="images/Beginning.PNG">
+
+As the simulation progresses, boids begin to travel in flocks with their neighbors forming groups
+<img src="images/End.PNG">
+
+This is how the boids look moving around in their flocks
+<img src="images/FlockingDemo.gif">
+
+## Naive Implementation
+The algorithm implemented for boid simulation involved naively checking all the other boids in the simulation to determine how each boid should change its velocity. This is highly inefficient since boids are only affect by boids that are within a close proximity, thus most boids in the simulation that are being checked using this implementation do not contribute anything.
+
+## Uniform Grids
+In this algorithm for boid simulation we fix the issue discussed in the Naive Implementation where we are checking many boids that have no contribution to velocity changes for every boid. This algorithm leverages this fact and shrinks the search space for boids down to a more localized area around the current boid so that we only need to check boids that are nearby.
+
+We accomplish this by establishing a uniform grid of cells across the simulation space, breaking it up into cubic regions. Now for each boid we must determine which grid cell contains the boid. Once this information is gathered, we can leverage it by performing a key value sort with the newly computed cells as keys and an array of indices of boids as values. As a final preprocessing step, we can walk through the newly sorted grid cell value array to compute starting and ending positions for each cell that is useful when performing the boid computations.
+
+Finally, we can use all this information that we have computed to shrink our search space down. This is accomplished by only checking boids who belong to grid cells neighboring our current cell rather than having to check every boid in all of the cells which we did in the naive implementatoin. We can accompish this by determining where in the grid cell the boid is relative to the center of the grid cell, and checking the neighboring grid cells in that direction. 
+
+## Memory Coherency
+While shrinking the search space down to only check boids that could reasonably affect the velocity of each boid results in a significant performance increase, one issue with how this was implemented was that we use a middleman array to keep track of where we are storing position and velocity data after performing our key value sort. This is problematic, because using this middleman array to access position and velocity requires many redirections and random accesses in memory. This final optimization accounts for this issue by cutting out this middleman array so that position and velocity data are arranged sequentially after performing the key value sort to minimize redirections and random accesses.
+
+# Performance Analysis
+Frames per second (FPS) was used as the primary metric to analyze the performance of each of these algorithms. Frame rate was uncapped and with v-sync off, meaning that in each case, the simulation was running at full speed.
+
+I added some code to [main.cpp](src/main.cpp) to easily log this information to the terminal in 1 second intervals. FPS values were then averaged over a 25 second execution.
+
+<img src="images/varyLoad.png">
+
+Simulations were ran with a few different parameters tweaked that are visible in the bar graph above. First, we account for the three primary algorithms implemented in this project: naive, uniform, and coherent. For uniform and coherent, further investigation was made based on whether the grid was defined to have grid cell widths that were twice the maximum possible distance a boid could be affected by a neighbor or simply this maximum possible distance. These are denoted with suffixes 2 and 1 respectively. The different colored bars represent the different testing loads applied to the simulations, with blue being a load of 30000 boids and orange ten times that value at 300000 boids.
+
+It is immediately apparent that the naive implementation performs very poorly. With the lighter 30000 boid load, it could only output a small fraction of the other two methods and the heavier load hardly even ran. The lighter load however failed to produce conclusive distinctions between the standard uniform grid approach and the coherent memory optimization in both grid cell widths tested.
+The lack of separation in perfomance under the lighter load is likely due to some bottleneck that is not reliant on the computation only performance of the algorithm. Probably some hardware limitation that data cannot travel any faster to be processed.
+
+With the heaver 300000 boid load however, we begin to see much more interesting and promising results. The coherent memory access approach is far superior to the standard uniform grid method, and we can also see that reducing the cell width to be one times the maximum distance a boid could affect its neighbor also markedly improved performance.
+
+<img src="images/varyBlock.png">
+
+Further investigation was done on this 300000 boid load case where the number of threads in a block (BlockSize) was doubled to be 256 rather than 128. It was found that this did not have much impact however.
+
+# Questions
+1) Increasing the number of boids decreased the performance of the simulation for all of the algorithms tested. Adding more boids to the simulation means more boids have to be checked, meaning more computation has to be done for each boid. This increased computation time is likely what causes this performacne dip.
+2) Chaning the block count and block size did not really affect performance at all. This can be attributed to the fact that changing these values only changes how the kernels get divided up across the SMs on the GPU. Each SM can only run a single warp at a time, so this does not actually change the number of operations being done at a time, it just distributes them differently.
+3) As discussed in performance analysis, the coherent uniform grid demonstrated great performance increases for heavier loads of boids. We likely did not see any change in performance for the lighter load because we are bottlenecked by something other than computational power, and experienced a great performance increase with a heavier load since we reduce costly redirections and random accesses in memory. This was the outcome that I expected for these exact reasons.
+4) Changing the cell width and checking 27 neighors vs 8 neighors improved performance. While on the surface it might seem that checking more cells would be more costly, but with a closer look we see that the volume represnted by those 27 cells is in fact smaller than that of the 8 cells. Since we are dealing with unit lengths in the 27 cell case, and double unit lengths in the 8 cell case, the side length of the cube we are checking is only 3 units vs 4 units for the 8 cell case. Since we are checking a smaller volume for boids, it is likely that we will run into fewer boids to check, thus decreasing the amount of computation that needs to be done sequentially for each boid.
\ No newline at end of file
diff --git a/images/Beginning.PNG b/images/Beginning.PNG
new file mode 100644
index 0000000..e6d6394
Binary files /dev/null and b/images/Beginning.PNG differ
diff --git a/images/Data.xlsx b/images/Data.xlsx
new file mode 100644
index 0000000..2e98f77
Binary files /dev/null and b/images/Data.xlsx differ
diff --git a/images/End.PNG b/images/End.PNG
new file mode 100644
index 0000000..d863b02
Binary files /dev/null and b/images/End.PNG differ
diff --git a/images/FlockingDemo.gif b/images/FlockingDemo.gif
new file mode 100644
index 0000000..2b15541
Binary files /dev/null and b/images/FlockingDemo.gif differ
diff --git a/images/FlockingDemo.mp4 b/images/FlockingDemo.mp4
new file mode 100644
index 0000000..5b2fe2f
Binary files /dev/null and b/images/FlockingDemo.mp4 differ
diff --git a/images/varyBlock.png b/images/varyBlock.png
new file mode 100644
index 0000000..905e010
Binary files /dev/null and b/images/varyBlock.png differ
diff --git a/images/varyLoad.png b/images/varyLoad.png
new file mode 100644
index 0000000..a4b13e3
Binary files /dev/null and b/images/varyLoad.png differ
diff --git a/src/kernel.cu b/src/kernel.cu
index 74dffcb..e46f8d8 100644
--- a/src/kernel.cu
+++ b/src/kernel.cu
@@ -17,6 +17,8 @@
 
 #define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)
 
+#define WIDTHMODE 1 // toggle between 1 and 2 for cell widths
+
 /**
 * Check for CUDA errors; print and exit if there was a problem.
 */
@@ -60,13 +62,15 @@ void checkCUDAError(const char *msg, int line = -1) {
 
 int numObjects;
 dim3 threadsPerBlock(blockSize);
+int currBuff = 1;
 
 // LOOK-1.2 - These buffers are here to hold all your boid information.
 // These get allocated for you in Boids::initSimulation.
 // Consider why you would need two velocity buffers in a simulation where each
 // boid cares about its neighbors' velocities.
 // These are called ping-pong buffers.
-glm::vec3 *dev_pos;
+glm::vec3 *dev_pos1;
+glm::vec3 *dev_pos2;
 glm::vec3 *dev_vel1;
 glm::vec3 *dev_vel2;
 
@@ -74,7 +78,7 @@ glm::vec3 *dev_vel2;
 // pointers on your own too.
 
 // For efficient sorting and the uniform grid. These should always be parallel.
-int *dev_particleArrayIndices; // What index in dev_pos and dev_velX represents this particle?
+int *dev_particleArrayIndices; // What index in dev_pos1 and dev_velX represents this particle?
 int *dev_particleGridIndices; // What grid cell is this particle in?
 // needed for use with thrust
 thrust::device_ptr<int> dev_thrust_particleArrayIndices;
@@ -83,9 +87,6 @@ thrust::device_ptr<int> dev_thrust_particleGridIndices;
 int *dev_gridCellStartIndices; // What part of dev_particleArrayIndices belongs
 int *dev_gridCellEndIndices;   // to this cell?
 
-// TODO-2.3 - consider what additional buffers you might need to reshuffle
-// the position and velocity data to be coherent within cells.
-
 // LOOK-2.1 - Grid parameters based on simulation parameters.
 // These are automatically computed for you in Boids::initSimulation
 int gridCellCount;
@@ -142,8 +143,8 @@ void Boids::initSimulation(int N) {
 
   // LOOK-1.2 - This is basic CUDA memory management and error checking.
   // Don't forget to cudaFree in  Boids::endSimulation.
-  cudaMalloc((void**)&dev_pos, N * sizeof(glm::vec3));
-  checkCUDAErrorWithLine("cudaMalloc dev_pos failed!");
+  cudaMalloc((void**)&dev_pos1, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos1 failed!");
 
   cudaMalloc((void**)&dev_vel1, N * sizeof(glm::vec3));
   checkCUDAErrorWithLine("cudaMalloc dev_vel1 failed!");
@@ -153,11 +154,17 @@ void Boids::initSimulation(int N) {
 
   // LOOK-1.2 - This is a typical CUDA kernel invocation.
   kernGenerateRandomPosArray<<<fullBlocksPerGrid, blockSize>>>(1, numObjects,
-    dev_pos, scene_scale);
+    dev_pos1, scene_scale);
   checkCUDAErrorWithLine("kernGenerateRandomPosArray failed!");
 
   // LOOK-2.1 computing grid params
-  gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  if (WIDTHMODE == 1) {
+      gridCellWidth = std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+
+  }
+  else if (WIDTHMODE == 2) {
+      gridCellWidth = 2.0f * std::max(std::max(rule1Distance, rule2Distance), rule3Distance);
+  }
   int halfSideCount = (int)(scene_scale / gridCellWidth) + 1;
   gridSideCount = 2 * halfSideCount;
 
@@ -168,7 +175,21 @@ void Boids::initSimulation(int N) {
   gridMinimum.y -= halfGridWidth;
   gridMinimum.z -= halfGridWidth;
 
-  // TODO-2.1 TODO-2.3 - Allocate additional buffers here.
+  cudaMalloc((void**)&dev_particleArrayIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleArrayIndices failed!");
+
+  cudaMalloc((void**)&dev_particleGridIndices, N * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_particleGridIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellStartIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellStartIndices failed!");
+
+  cudaMalloc((void**)&dev_gridCellEndIndices, gridCellCount * sizeof(int));
+  checkCUDAErrorWithLine("cudaMalloc dev_gridCellEndIndices failed!");
+
+  cudaMalloc((void**)&dev_pos2, N * sizeof(glm::vec3));
+  checkCUDAErrorWithLine("cudaMalloc dev_pos2 failed!");
+
   cudaDeviceSynchronize();
 }
 
@@ -210,8 +231,8 @@ __global__ void kernCopyVelocitiesToVBO(int N, glm::vec3 *vel, float *vbo, float
 void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities) {
   dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
 
-  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_pos, vbodptr_positions, scene_scale);
-  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> >(numObjects, dev_vel1, vbodptr_velocities, scene_scale);
+  kernCopyPositionsToVBO << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_pos1, vbodptr_positions, scene_scale);
+  kernCopyVelocitiesToVBO << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_vel1, vbodptr_velocities, scene_scale);
 
   checkCUDAErrorWithLine("copyBoidsToVBO failed!");
 
@@ -229,22 +250,50 @@ void Boids::copyBoidsToVBO(float *vbodptr_positions, float *vbodptr_velocities)
 * Compute the new velocity on the body with index `iSelf` due to the `N` boids
 * in the `pos` and `vel` arrays.
 */
-__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3 *pos, const glm::vec3 *vel) {
-  // Rule 1: boids fly towards their local perceived center of mass, which excludes themselves
-  // Rule 2: boids try to stay a distance d away from each other
-  // Rule 3: boids try to match the speed of surrounding boids
-  return glm::vec3(0.0f, 0.0f, 0.0f);
+__device__ glm::vec3 computeVelocityChange(int N, int iSelf, const glm::vec3* pos, const glm::vec3* vel) {
+
+    glm::vec3 perceived_center = glm::vec3(0.f, 0.f, 0.f);
+    glm::vec3 c = glm::vec3(0.f, 0.f, 0.f);
+    glm::vec3 perceived_velocity = glm::vec3(0.f, 0.f, 0.f);
+    glm::vec3 out = glm::vec3(0.f, 0.f, 0.f);
+    float neighborCountR1 = 0.f, neighborCountR3 = 0.f;
+    for (int i = 0; i < N; i++) {
+        float distance = glm::distance(pos[iSelf], pos[i]);
+        if (i != iSelf) {
+            if (distance < rule1Distance) {
+                perceived_center += pos[i];
+                neighborCountR1 += 1.0;
+            }
+            if (distance < rule2Distance) {
+                c -= (pos[i] - pos[iSelf]);
+            }
+            if (distance < rule3Distance) {
+                perceived_velocity += vel[i];
+                neighborCountR3 += 1.0;
+            }
+        }
+    }
+
+    if (neighborCountR1 > 0) out += (perceived_center / (float)neighborCountR1 - pos[iSelf]) * rule1Scale;
+    out += c * rule2Scale;
+    if (neighborCountR3 > 0) out += (perceived_velocity / (float)neighborCountR3) * rule3Scale;
+
+    return vel[iSelf] + out;
 }
 
 /**
-* TODO-1.2 implement basic flocking
-* For each of the `N` bodies, update its position based on its current velocity.
+* Update velocity of each boid in parallel by checking every other boid in the simulation
 */
 __global__ void kernUpdateVelocityBruteForce(int N, glm::vec3 *pos,
   glm::vec3 *vel1, glm::vec3 *vel2) {
   // Compute a new velocity based on pos and vel1
   // Clamp the speed
   // Record the new velocity into vel2. Question: why NOT vel1?
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) return;
+
+    glm::vec3 velocity = computeVelocityChange(N, index, pos, vel1);
+    vel2[index] = glm::length(velocity) > maxSpeed ? glm::normalize(velocity) * maxSpeed : velocity;
 }
 
 /**
@@ -285,10 +334,11 @@ __device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
 __global__ void kernComputeIndices(int N, int gridResolution,
   glm::vec3 gridMin, float inverseCellWidth,
   glm::vec3 *pos, int *indices, int *gridIndices) {
-    // TODO-2.1
-    // - Label each boid with the index of its grid cell.
-    // - Set up a parallel array of integer indices as pointers to the actual
-    //   boid data in pos and vel1/vel2
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+    glm::vec3 grid = glm::floor((pos[index] - gridMin) * inverseCellWidth);
+    gridIndices[index] = gridIndex3Dto1D(grid.x, grid.y, grid.z, gridResolution); // - Label each boid with the index of its grid cell.
+    indices[index] = index; // parallel array of integer indices as pointers to the actual boid data in pos and vel1/vel2
 }
 
 // LOOK-2.1 Consider how this could be useful for indicating that a cell
@@ -302,10 +352,20 @@ __global__ void kernResetIntBuffer(int N, int *intBuffer, int value) {
 
 __global__ void kernIdentifyCellStartEnd(int N, int *particleGridIndices,
   int *gridCellStartIndices, int *gridCellEndIndices) {
-  // TODO-2.1
-  // Identify the start point of each cell in the gridIndices array.
-  // This is basically a parallel unrolling of a loop that goes
-  // "this index doesn't match the one before it, must be a new cell!"
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+    gridCellStartIndices[index] = -1, gridCellEndIndices[index] = -1;
+    int pIndex = particleGridIndices[index];
+    if (index > 0) {
+        if (particleGridIndices[index - 1] != particleGridIndices[index]) {
+            gridCellStartIndices[particleGridIndices[index]] = index;
+        }
+    }
+    if (index < N - 1) {
+        if (particleGridIndices[index + 1] != particleGridIndices[index]) {
+            gridCellEndIndices[particleGridIndices[index]] = index;
+        }
+    }
 }
 
 __global__ void kernUpdateVelNeighborSearchScattered(
@@ -314,14 +374,77 @@ __global__ void kernUpdateVelNeighborSearchScattered(
   int *gridCellStartIndices, int *gridCellEndIndices,
   int *particleArrayIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.1 - Update a boid's velocity using the uniform grid to reduce
-  // the number of boids that need to be checked.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) return;
+
+    // Compute center to divide grid into octants
+    glm::vec3 positionSelf = pos[index];
+    glm::vec3 grid = glm::floor((positionSelf - gridMin) / cellWidth);
+    glm::vec3 center = ((grid * cellWidth) + gridMin) + glm::vec3(cellWidth / 2.f, cellWidth / 2.f, cellWidth / 2.f);
+    glm::vec3 norm = (positionSelf - center) / cellWidth; // distance from center to the boid normalized to be -0.5 to 0.5 on all axes
+
+    // Determine neighbors to check based on octants
+    glm::vec3 negBound = glm::vec3(0.f, 0.f, 0.f);
+    glm::vec3 posBound = glm::vec3(0.f, 0.f, 0.f);
+    negBound.x = (norm.x <= 0.0f && norm.x > -0.5f) ? 1.f : 0.f;
+    negBound.y = (norm.y <= 0.0f && norm.y > -0.5f) ? 1.f : 0.f;
+    negBound.z = (norm.z <= 0.0f && norm.z > -0.5f) ? 1.f : 0.f;
+    posBound.x = (norm.x > 0.0f && norm.x < 0.5f) ? 1.f : 0.f;
+    posBound.y = (norm.y > 0.0f && norm.y < 0.5f) ? 1.f : 0.f;
+    posBound.z = (norm.z > 0.0f && norm.z < 0.5f) ? 1.f : 0.f;
+
+    // Apply boid rules to compute new velocity using neighboring grids
+    glm::vec3 velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 perceived_center = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 c = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 perceived_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+    float neighborCountR1 = 0.f, neighborCountR3 = 0.f;
+    for (int z = grid.z - negBound.z; z <= grid.z + posBound.z; z++) {
+        for (int y = grid.y - negBound.y; y <= grid.y + posBound.y; y++) {
+            for (int x = grid.x - negBound.x; x <= grid.x + posBound.x; x++) {
+                int neighborCell = gridIndex3Dto1D(x, y, z, gridResolution);
+                if (gridCellStartIndices[neighborCell] == -1) continue;
+                for (int currBoid = gridCellStartIndices[neighborCell]; currBoid <= gridCellEndIndices[neighborCell]; currBoid++) {
+                    glm::vec3 positionNeighbor = pos[particleArrayIndices[currBoid]];
+                    if (particleArrayIndices[currBoid] != index) {
+                        float distance = glm::distance(positionNeighbor, positionSelf);
+                        if (distance < rule1Distance) {
+                            perceived_center += positionNeighbor;
+                            neighborCountR1++;
+                        }
+                        if (distance < rule2Distance) {
+                            c -= (positionNeighbor - positionSelf);
+                        }
+                        if (distance < rule3Distance) {
+                            perceived_velocity += vel1[particleArrayIndices[currBoid]];
+                            neighborCountR3++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    // Combine rule velocities
+    if (neighborCountR1 > 0) velocity += (perceived_center / (float)neighborCountR1 - positionSelf) * rule1Scale;
+    velocity += c * rule2Scale;
+    if (neighborCountR3 > 0) velocity += (perceived_velocity / (float)neighborCountR3) * rule3Scale;
+
+    // Combine, clamp, and set
+    glm::vec3 out = vel1[index] + velocity;
+    vel2[index] = glm::length(out) > maxSpeed ? maxSpeed * glm::normalize(out) : out;
+}
+
+__global__ void kernRearrangeIndices(int N, const int *particleArrayIndices, const glm::vec3 *pos1, const glm::vec3 *vel1,
+    glm::vec3 *pos2, glm::vec3 *vel2) {
+    // 1. find the particleArrayIndex for this thread
+    // 2. look up the pos + vel values for this index
+    // 3. set this thread's index in other pos/vel arrays to be these values
+    int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+    if (index >= N) return;
+    int pIndex = particleArrayIndices[index];
+    pos2[index] = pos1[pIndex];
+    vel2[index] = vel1[pIndex];
 }
 
 __global__ void kernUpdateVelNeighborSearchCoherent(
@@ -329,72 +452,140 @@ __global__ void kernUpdateVelNeighborSearchCoherent(
   float inverseCellWidth, float cellWidth,
   int *gridCellStartIndices, int *gridCellEndIndices,
   glm::vec3 *pos, glm::vec3 *vel1, glm::vec3 *vel2) {
-  // TODO-2.3 - This should be very similar to kernUpdateVelNeighborSearchScattered,
-  // except with one less level of indirection.
-  // This should expect gridCellStartIndices and gridCellEndIndices to refer
-  // directly to pos and vel1.
-  // - Identify the grid cell that this particle is in
-  // - Identify which cells may contain neighbors. This isn't always 8.
-  // - For each cell, read the start/end indices in the boid pointer array.
-  //   DIFFERENCE: For best results, consider what order the cells should be
-  //   checked in to maximize the memory benefits of reordering the boids data.
-  // - Access each boid in the cell and compute velocity change from
-  //   the boids rules, if this boid is within the neighborhood distance.
-  // - Clamp the speed change before putting the new speed in vel2
+
+    int index = threadIdx.x + (blockIdx.x * blockDim.x);
+    if (index >= N) return;
+
+    // Compute center to divide grid into octants
+    glm::vec3 positionSelf = pos[index];
+    glm::vec3 grid = glm::floor((positionSelf - gridMin) / cellWidth);
+    glm::vec3 center = ((grid * cellWidth) + gridMin) + glm::vec3(cellWidth / 2.f, cellWidth / 2.f, cellWidth / 2.f);
+    glm::vec3 norm = (positionSelf - center) / cellWidth; // distance from center to the boid normalized to be -0.5 to 0.5 on all axes
+
+    // Determine neighbors to check based on octants
+    glm::vec3 negBound = glm::vec3(0.f, 0.f, 0.f);
+    glm::vec3 posBound = glm::vec3(0.f, 0.f, 0.f);
+    negBound.x = (norm.x <= 0.0f && norm.x > -0.5f) ? 1.f : 0.f;
+    negBound.y = (norm.y <= 0.0f && norm.y > -0.5f) ? 1.f : 0.f;
+    negBound.z = (norm.z <= 0.0f && norm.z > -0.5f) ? 1.f : 0.f;
+    posBound.x = (norm.x > 0.0f && norm.x < 0.5f) ? 1.f : 0.f;
+    posBound.y = (norm.y > 0.0f && norm.y < 0.5f) ? 1.f : 0.f;
+    posBound.z = (norm.z > 0.0f && norm.z < 0.5f) ? 1.f : 0.f;
+
+    // Apply boid rules to compute new velocity using neighboring grids
+    glm::vec3 velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 perceived_center = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 c = glm::vec3(0.0f, 0.0f, 0.0f);
+    glm::vec3 perceived_velocity = glm::vec3(0.0f, 0.0f, 0.0f);
+    float neighborCountR1 = 0.f, neighborCountR3 = 0.f;
+    for (int z = grid.z - negBound.z; z <= grid.z + posBound.z; z++) {
+        for (int y = grid.y - negBound.y; y <= grid.y + posBound.y; y++) {
+            for (int x = grid.x - negBound.x; x <= grid.x + posBound.x; x++) {
+                int neighborCell = gridIndex3Dto1D(x, y, z, gridResolution);
+                if (gridCellStartIndices[neighborCell] == -1) continue;
+                for (int currBoid = gridCellStartIndices[neighborCell]; currBoid <= gridCellEndIndices[neighborCell]; currBoid++) {
+                    glm::vec3 positionNeighbor = pos[currBoid];
+                    if (currBoid != index) {
+                        float distance = glm::distance(positionNeighbor, positionSelf);
+                        if (distance < rule1Distance) {
+                            perceived_center += positionNeighbor;
+                            neighborCountR1++;
+                        }
+                        if (distance < rule2Distance) {
+                            c -= (positionNeighbor - positionSelf);
+                        }
+                        if (distance < rule3Distance) {
+                            perceived_velocity += vel1[currBoid];
+                            neighborCountR3++;
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    // Combine rule velocities
+    if (neighborCountR1 > 0) velocity += (perceived_center / (float)neighborCountR1 - positionSelf) * rule1Scale;
+    velocity += c * rule2Scale;
+    if (neighborCountR3 > 0) velocity += (perceived_velocity / (float)neighborCountR3) * rule3Scale;
+
+    // Combine, clamp, and set
+    glm::vec3 out = vel1[index] + velocity;
+    vel2[index] = glm::length(out) > maxSpeed ? maxSpeed * glm::normalize(out) : out;
 }
 
 /**
 * Step the entire N-body simulation by `dt` seconds.
 */
 void Boids::stepSimulationNaive(float dt) {
-  // TODO-1.2 - use the kernels you wrote to step the simulation forward in time.
-  // TODO-1.2 ping-pong the velocity buffers
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    if (currBuff == 1) {
+        kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos1, dev_vel1, dev_vel2);
+        kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos1, dev_vel1);
+        currBuff = 2;
+    }
+    else if (currBuff == 2) {
+        kernUpdateVelocityBruteForce <<<fullBlocksPerGrid, blockSize>>> (numObjects, dev_pos1, dev_vel2, dev_vel1);
+        kernUpdatePos <<<fullBlocksPerGrid, blockSize>>> (numObjects, dt, dev_pos1, dev_vel2);
+        currBuff = 1;
+    }
 }
 
 void Boids::stepSimulationScatteredGrid(float dt) {
-  // TODO-2.1
-  // Uniform Grid Neighbor search using Thrust sort.
-  // In Parallel:
-  // - label each particle with its array index as well as its grid index.
-  //   Use 2x width grids.
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+        gridInverseCellWidth, dev_pos1, dev_particleArrayIndices, dev_particleGridIndices);
+
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices,
+        dev_gridCellEndIndices);
+
+    kernUpdateVelNeighborSearchScattered << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+        gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_particleArrayIndices,
+        dev_pos1, dev_vel1, dev_vel2);
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos1, dev_vel1);
+
+    std::swap(dev_vel1, dev_vel2);
 }
 
+int num = 0;
+
 void Boids::stepSimulationCoherentGrid(float dt) {
-  // TODO-2.3 - start by copying Boids::stepSimulationNaiveGrid
-  // Uniform Grid Neighbor search using Thrust sort on cell-coherent data.
-  // In Parallel:
-  // - Label each particle with its array index as well as its grid index.
-  //   Use 2x width grids
-  // - Unstable key sort using Thrust. A stable sort isn't necessary, but you
-  //   are welcome to do a performance comparison.
-  // - Naively unroll the loop for finding the start and end indices of each
-  //   cell's data pointers in the array of boid indices
-  // - BIG DIFFERENCE: use the rearranged array index buffer to reshuffle all
-  //   the particle data in the simulation array.
-  //   CONSIDER WHAT ADDITIONAL BUFFERS YOU NEED
-  // - Perform velocity updates using neighbor search
-  // - Update positions
-  // - Ping-pong buffers as needed. THIS MAY BE DIFFERENT FROM BEFORE.
+    dim3 fullBlocksPerGrid((numObjects + blockSize - 1) / blockSize);
+    kernComputeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+        gridInverseCellWidth, dev_pos1, dev_particleArrayIndices, dev_particleGridIndices);
+
+    dev_thrust_particleArrayIndices = thrust::device_ptr<int>(dev_particleArrayIndices);
+    dev_thrust_particleGridIndices = thrust::device_ptr<int>(dev_particleGridIndices);
+    thrust::sort_by_key(dev_thrust_particleGridIndices, dev_thrust_particleGridIndices + numObjects, dev_thrust_particleArrayIndices);
+
+    kernIdentifyCellStartEnd << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleGridIndices, dev_gridCellStartIndices,
+        dev_gridCellEndIndices);
+
+    kernRearrangeIndices << <fullBlocksPerGrid, blockSize >> > (numObjects, dev_particleArrayIndices, dev_pos1, dev_vel1, dev_pos2, dev_vel2);
+    kernUpdateVelNeighborSearchCoherent << <fullBlocksPerGrid, blockSize >> > (numObjects, gridSideCount, gridMinimum,
+        gridInverseCellWidth, gridCellWidth, dev_gridCellStartIndices, dev_gridCellEndIndices, dev_pos2, dev_vel2, dev_vel1);
+    kernUpdatePos << <fullBlocksPerGrid, blockSize >> > (numObjects, dt, dev_pos2, dev_vel2);
+    
+    std::swap(dev_pos2, dev_pos1);
 }
 
 void Boids::endSimulation() {
   cudaFree(dev_vel1);
   cudaFree(dev_vel2);
-  cudaFree(dev_pos);
-
-  // TODO-2.1 TODO-2.3 - Free any additional buffers here.
+  cudaFree(dev_pos1);
+  cudaFree(dev_particleArrayIndices);
+  cudaFree(dev_particleGridIndices);
+  cudaFree(dev_gridCellStartIndices);
+  cudaFree(dev_gridCellEndIndices);
+  cudaFree(dev_pos2);
 }
 
 void Boids::unitTest() {
-  // LOOK-1.2 Feel free to write additional tests here.
-
   // test unstable sort
   int *dev_intKeys;
   int *dev_intValues;
diff --git a/src/main.cpp b/src/main.cpp
index b82c8c6..b82e524 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -14,11 +14,11 @@
 
 // LOOK-2.1 LOOK-2.3 - toggles for UNIFORM_GRID and COHERENT_GRID
 #define VISUALIZE 1
-#define UNIFORM_GRID 0
-#define COHERENT_GRID 0
+#define UNIFORM_GRID 1
+#define COHERENT_GRID 1
 
 // LOOK-1.2 - change this to adjust particle count in the simulation
-const int N_FOR_VIS = 5000;
+const int N_FOR_VIS = 300000;
 const float DT = 0.2f;
 
 /**
@@ -216,6 +216,7 @@ void initShaders(GLuint * program) {
     double fps = 0;
     double timebase = 0;
     int frame = 0;
+    int print = 0;
 
     Boids::unitTest(); // LOOK-1.2 We run some basic example code to make sure
                        // your CUDA development setup is ready to go.
@@ -241,6 +242,11 @@ void initShaders(GLuint * program) {
       ss << " fps] " << deviceName;
       glfwSetWindowTitle(window, ss.str().c_str());
 
+      if (time > print) {
+          std::cout << fps << std::endl;
+          print++;
+      }
+
       glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
 
       #if VISUALIZE