Projects — Dean Willis

Project 01

SEC EDGAR Financial Data Pipeline

Python · pandas · zipfile · requests · SQLAlchemy · MySQL

ETL pipeline that bulk downloads 10 years (1.82GB) of SEC EDGAR financial data across 40 quarterly ZIP files. Files are extracted and renamed locally, filtered for R&D, Net Income, and Revenue, then loaded into MySQL for downstream analysis.

Pipeline Steps

1

Bulk download 40 quarterly ZIP files from SEC EDGAR

2

Extract and rename num.txt and sub.txt from each zip

3

Filter for R&D, Net Income, and Revenue

4

Clean and type-convert data

5

Load into MySQL — financial_data and metadata tables

How to Run

1

Create a MySQL database called aggregated_stock_data

2

Create tables using schema in queries_edgar.sql

3

Fill in your MySQL credentials in the script

4

Run edgbatch_DL&EXT_SQL.py

See it in action →

Data Source

SEC EDGAR Financial Statement Data Sets (2014 Q1 – 2023 Q4)

sec.gov/data-research →

Tools Used

Python — pandas, zipfile, requests, SQLAlchemy
MySQL

Project 02

R&D Expense Spikes Predict Short-Term Price Declines

Python · yfinance · zipfile · requests · SQLAlchemy · MySQL

R&D expense spikes predict short-term price declines with 55–68% accuracy across 1,404 public companies. 1.82GB SEC EDGAR data, ETL from zip to MySQL to Python, enriched with yfinance price data.

Definitions

01

A spike is identified when exactly one quarter's R&D expense exceeds the average of all three quarters for that year.

02

Spike percent is defined by the percentage increase in spending relative to the three-quarter average.

03

Spikes are sorted into buckets: 0–10%, 10–20%, 20–50%, 50–100%, 100%+.

04

Price decreases are measured using stock prices taken 30 days before and after the spike date.

Findings

R&D Spike Magnitude	Companies	Drop Rate
0 – 10%	~280	55.6%
10 – 20%	~280	61.1%
20 – 50%	~280	67.3%
50 – 100%	~280	68.5%
100%+	~284	66.1%

Tracking post-spike short-term price behavior across all bands suggests a systematic and repeatable market inefficiency.

Analytical Queries

Ranking the first 10 companies by spike percent
Sorting all companies' average spike percentage into 4 buckets
Comparing the spike percentage to the prior filing date's spike percentage
Finding the running average of spike percentage per company
Finding the max spike percentage per company

Limitations

Stock price window of 30 days may predate 10-K filings
Only 3 quarters of data available — Q4 not filed in 10-K
Data range limited to 2014–2023, not live
Some data lost due to ticker symbol resolution issues

A version addressing all listed limitations is planned.

See it in action →

Data Source

SEC EDGAR Financial Statement Data Sets (2014 Q1 – 2023 Q4)

sec.gov/data-research →

Tools Used

Python — yfinance, zipfile, requests, SQLAlchemy
MySQL

Project 03

Personal Finance Tracker

Python · FastAPI · SQLAlchemy · SQLite · APScheduler

Showcasing a fulling working SQLite Database with a mobile app.

A mobile expense and subscription tracker with a FastAPI backend. Full CRUD for expenses and subscriptions, automatic recurring expense detection, and spending summaries broken down by day, month, year, category and yearly projections.

Actual App Running on Expo Go

Features

Full CRUD for expenses and subscriptions
Automatic subscription management
Spending summaries by day, month, year, and all-time
Category breakdowns across all transactions
Recurring expense detection with annualized projections

API Endpoints

1

GET /expenses — retrieve all recorded expenses

2

POST /expenses — add a new expense entry

3

PUT /expenses/{id} — update an existing expense

4

DELETE /expenses/{id} — remove an expense

5

GET /subscription · POST · PUT · DELETE — full subscription management

See it in action →

Tools Used

Python — FastAPI, SQLAlchemy, APScheduler
SQLite

Project 04

S&P 500 Anomaly Detection

Python · scikit-learn · FastAPI · React · Claude AI · yfinance

Real-time anomaly detection across all S&P 500 stocks using Z-Score and Isolation Forest, with AI-generated summaries explaining the likely cause of each anomaly using recent news. Deployed as a full-stack application with a FastAPI backend and React frontend.

How It Works

1

Pulls 90 days of price and volume data for all 500 S&P 500 constituents via yfinance

2

Flags anomalous days using Z-Score (|z| > 3) on both daily return and volume

3

Runs Isolation Forest as a second detection layer high confidence where both models agree

4

Fetches relevant news headlines for each anomaly via NewsAPI

5

Claude AI generates summaries explaining each anomaly

6

Results served via FastAPI and rendered in a filterable, searchable React dashboard

Detection Methods

01

Z-Score measures how many standard deviations a stock's return or volume deviates from its own 90-day history. Threshold set at |z| > 3.

02

Isolation Forest detects outliers by measuring how easily a data point can be isolated from the rest. Catches complex patterns missed by Z-Score alone.

03

High confidence anomalies are rows flagged by both models — approximately 1% of all observations across 500 stocks.

More Info →

Live App

The dashboard shows all high confidence anomalies with their return and volume Z-scores, signal type, and AI-generated summary. Click any row to expand the analysis.

Launch LIVE APP →

(Allow one minute to load)

Data Source

Yahoo Finance via yfinance — 90-day rolling window, all S&P 500 constituents

GitHub →

Tools Used

Python — pandas, yfinance, scikit-learn
FastAPI · Render
React · Vercel
Claude AI (Anthropic)
NewsAPI