About me

I am Zirui Huang, and I go by Ray. I am currently a Data Scientist at Cambridge Systematics, Inc., a premier transportation consulting company.

I specialize in bridging the gap between robust data engineering and analytical science. I have experience transforming complex, large-scale data challenges—primarily within traffic and transit systems—into high-performance, intuitive systems. Using a modern tech stack, I build scalable ELT pipelines that turn raw records into production-ready assets.

My focus is on architecting reliable infrastructure to power everything from BI dashboards to Machine Learning models. I am particularly interested in the intersection of data and AI, developing RAG-based chatbots and MCP servers for intelligent data interaction. I ensure complex backend workflows deliver clear, actionable strategies that bring data to life.

Skillset

Data Engineering & Infrastructure

  • Pipeline Orchestration

    Dagster, Airflow

  • Extraction

    dlt (data load tool), Fivetran

  • Transformation

    dbt

  • Storage

    DuckDB, PostgreSQL

  • Cloud Platforms

    GCP, AWS, MotherDuck, Snowflake


Data Science & Analytics

  • BI Dashboards

    Power BI, Tableau, Evidence

  • Web Interface

    Streamlit

  • Machine Learning

    Scikit-learn, PyTorch


Generative AI & Agents

  • LLM Integration

    MCP servers, LangChain, Hugging Face

  • Vector Databases

    Chroma, Pinecone


Development & Workflow

  • Version Control

    Git, GitHub, GitHub Actions (CI/CD)

  • DevOps

    Virtual Environments (uv), Docker


Clients

Resume

Experience

  1. Associate

    Cambridge Systematics, Inc. 2023 — Present New York, NY, USA

    Architecting end-to-end data systems and AI frameworks to optimize urban mobility:

    • Scalable Data Engineering: Developing robust ELT pipelines and CI/CD workflows for multi-source trip records, transit telemetry, simulation outputs, and high-resolution geospatial data.
    • Applied AI & LLM Systems: Deploying LLMs for pattern recognition and engineering RAG-based chatbots via MCP to streamline access to transportation domain knowledge.
    • Predictive Modeling: Designing machine learning models to forecast transportation demand and optimize infrastructure utilization.
    • Data Visualization & BI: Building interactive dashboards and Business Intelligence tools for real-time monitoring, and translating complex analytical insights into strategic business decisions.
  2. Research & Development Intern

    Metropia, Inc. 2022 — 2023 Houston, TX, USA

    Key contributor to high-impact projects funded by the USDOT, TxDOT, and Taiwan’s MOTC, specializing in AI-driven multimodal transportation systems:

    • Algorithmic Development: Engineered a cycling routing engine for ConnectSmart and intermodal trip planning algorithms for Taiwan’s MaaS Platform.
    • Predictive Modeling: Leveraged AI techniques to predict traffic states and incident impacts for improved system management.
    • System Optimization: Refined system-optimal algorithms for BART Perks 2 to mitigate transit crowding through data-driven incentives.

Education

  1. University of Arizona 🐻⬇️

    Ph.D. 2018 — 2023 Tucson, AZ, USA

    Major: Transportation (Advisors: Dr. Yi-Chang Chiu & Dr. Yao-Jan Wu)
    Minor: Computer Science (Advisor: Dr. Chicheng Zhang)
    Dissertation: "Investigating Incident-Induced Congestion for Personalized Travel Demand Management"

  2. Tongji University

    B.E. 2014 — 2018 Shanghai, China

    Major: Transportation Telematics
    Thesis: "Traffic Density Estimation based on Vehicles Speed Profile Data"
    🏆 Best Undergraduate Thesis Papers

NYC Taxi Insights

Project Overview

I built a production-grade data ecosystem to transform millions of raw NYC Taxi (TLC) records into a scalable analytics platform. This project implements a full-lifecycle ELT pipeline, from automated ingestion and Medallion-style transformations to executive BI insights and an AI-integrated semantic layer for natural language querying.
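
The Medallion-style flow described above can be illustrated with a minimal sketch. It is not the project's actual dbt code: it uses Python's built-in sqlite3 as a stand-in for DuckDB/MotherDuck, and the table names, columns, and sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical raw trip rows: (pickup_ts, dropoff_ts, distance_miles, fare_usd).
RAW_TRIPS = [
    ("2024-01-01 08:00:00", "2024-01-01 08:20:00", 3.2, 18.5),
    ("2024-01-01 09:10:00", "2024-01-01 09:25:00", 1.1, 9.0),
    ("2024-01-02 17:45:00", "2024-01-02 18:30:00", 8.4, 41.0),
    ("2024-01-02 18:00:00", "2024-01-02 18:05:00", -1.0, 5.0),  # invalid distance
    ("2024-01-02 19:00:00", "2024-01-02 19:15:00", 2.0, None),  # missing fare
]


def build_layers(conn):
    """Create the bronze, silver, and gold tables in order."""
    # Bronze: raw records landed as-is, with no cleaning.
    conn.execute(
        "CREATE TABLE bronze_trips (pickup_ts TEXT, dropoff_ts TEXT, "
        "distance_miles REAL, fare_usd REAL)"
    )
    conn.executemany("INSERT INTO bronze_trips VALUES (?, ?, ?, ?)", RAW_TRIPS)

    # Silver: cleaned records; rows with bad distances or missing fares are dropped.
    conn.execute(
        "CREATE TABLE silver_trips AS "
        "SELECT date(pickup_ts) AS trip_date, distance_miles, fare_usd "
        "FROM bronze_trips "
        "WHERE distance_miles > 0 AND fare_usd IS NOT NULL"
    )

    # Gold: business-level daily aggregates, ready for dashboards.
    conn.execute(
        "CREATE TABLE gold_daily_summary AS "
        "SELECT trip_date, COUNT(*) AS trips, "
        "ROUND(SUM(fare_usd), 2) AS total_fare_usd "
        "FROM silver_trips GROUP BY trip_date"
    )


conn = sqlite3.connect(":memory:")
build_layers(conn)
daily_summary = conn.execute(
    "SELECT trip_date, trips, total_fare_usd "
    "FROM gold_daily_summary ORDER BY trip_date"
).fetchall()
```

Keeping each layer as its own table means a bad cleaning rule can be fixed in the silver step and rebuilt without re-ingesting the raw data.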

The core mission was to bridge the gap between complex engineering and business operations. By implementing a semantic layer via an MCP server, I enabled non-technical stakeholders to "chat" with their data: they can ask complex questions and receive real-time answers without writing a line of SQL or code.
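
One possible shape for such a semantic layer is a registry of named metrics, each backed by a vetted query, exposed through a single function that an MCP tool could wrap. The sketch below is an assumption for illustration, not the project's implementation: it omits the MCP server itself, uses sqlite3 in place of the warehouse, and the metric names and `trips` table are hypothetical.

```python
import sqlite3

# Hypothetical semantic layer: business-friendly metric names mapped to vetted SQL.
METRICS = {
    "total_trips": "SELECT COUNT(*) FROM trips",
    "average_fare_usd": "SELECT ROUND(AVG(fare_usd), 2) FROM trips",
}


def query_metric(conn, metric_name):
    """Run the vetted query for a named metric and reject unknown names."""
    if metric_name not in METRICS:
        raise KeyError(f"Unknown metric: {metric_name!r}. Known: {sorted(METRICS)}")
    return conn.execute(METRICS[metric_name]).fetchone()[0]


# Tiny in-memory table so the example runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (fare_usd REAL)")
conn.executemany("INSERT INTO trips VALUES (?)", [(10.0,), (20.0,), (30.0,)])

total_trips = query_metric(conn, "total_trips")
average_fare = query_metric(conn, "average_fare_usd")
```

Restricting the model to named metrics trades flexibility for predictability; a server that lets the LLM write SQL directly would need its own guardrails instead.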

The Tech Stack

  • Orchestration

    Dagster for end-to-end pipeline visibility.

  • Ingestion

    dlt (data load tool) for "Pythonic" extraction and loading.

  • Transformation

    dbt + MotherDuck/DuckDB to power a Bronze-Silver-Gold Medallion architecture.

  • AI Integration

    MCP Server to enable LLMs to query the data warehouse directly.

  • Consumption

    Evidence for BI-as-code and Streamlit for interactive data apps.


Travel Demand Model Chatbot

Project Overview

Transportation planners often need to identify network bottlenecks or evaluate scenario performance quickly, but they are frequently held back by the need to write complex SQL queries against massive relational databases. I built the Travel Model Assistant to democratize this data, creating an Agentic Semantic Layer that allows analysts to "talk" to their models.

The core innovation is a specialized Context Layer designed to handle industry-specific jargon like "V/C Ratios" and "PCE Volumes." Unlike standard chatbots, this system uses Functional Tool-Calling: the AI agent autonomously invokes a custom spatial engine to generate thematic Folium maps, applying hex-code color scales to visualize network performance and scenario deltas in real time.
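
To make the hex-code color scale concrete, here is a minimal sketch of how a link's volume-to-capacity (V/C) ratio could be mapped to a color before it is passed to a map layer. The function names, the green-to-red endpoints, and the linear interpolation are assumptions for illustration; the project's spatial engine may work differently.

```python
def lerp_hex(start_hex, end_hex, t):
    """Linearly interpolate between two '#rrggbb' colors; t is clamped to [0, 1]."""
    t = max(0.0, min(1.0, t))
    start = [int(start_hex[i:i + 2], 16) for i in (1, 3, 5)]
    end = [int(end_hex[i:i + 2], 16) for i in (1, 3, 5)]
    mixed = [round(s + (e - s) * t) for s, e in zip(start, end)]
    return "#{:02x}{:02x}{:02x}".format(*mixed)


def vc_ratio_color(volume, capacity, low_hex="#00ff00", high_hex="#ff0000"):
    """Color a link by V/C ratio: green near free flow, red at or over capacity."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return lerp_hex(low_hex, high_hex, volume / capacity)


free_flow_color = vc_ratio_color(0, 1800)
at_capacity_color = vc_ratio_color(1800, 1800)
over_capacity_color = vc_ratio_color(2700, 1800)  # clamped to the red end
```

The returned string can be used anywhere a mapping library accepts a CSS-style hex color.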

Example Use Case

User Prompt:

"What are the top 10 TAZs based on the percentage difference in originating vehicle trips between the baseline scenario and the 2030 Horizon scenario (No build)? Report the positive and negative differences separately."

Screenshot of TAZ Delta Query
The Agent in Action: The assistant autonomously performed a self-join on the trip tables, calculated percent deltas while filtering for statistical significance, and invoked the Spatial Tool to render a diverging color-scale map (green for growth, red for decline) alongside a structured report.
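
The self-join and percent-delta step might look roughly like the sketch below. It is a simplified illustration, not the query the agent actually generated: sqlite3 stands in for DuckDB, the `origin_trips` table and scenario labels are hypothetical, and the statistical-significance filter is omitted.

```python
import sqlite3

# Hypothetical table: one row per (scenario, TAZ) with originating vehicle trips.
ROWS = [
    ("baseline", 1, 100.0), ("baseline", 2, 200.0),
    ("baseline", 3, 50.0), ("baseline", 4, 80.0),
    ("horizon_2030", 1, 150.0), ("horizon_2030", 2, 180.0),
    ("horizon_2030", 3, 50.0), ("horizon_2030", 4, 20.0),
]

# Self-join: pair each TAZ's baseline row with its 2030 row.
DELTA_SQL = """
SELECT b.taz,
       ROUND(100.0 * (h.trips - b.trips) / b.trips, 1) AS pct_diff
FROM origin_trips AS b
JOIN origin_trips AS h ON h.taz = b.taz
WHERE b.scenario = 'baseline'
  AND h.scenario = 'horizon_2030'
  AND b.trips > 0  -- avoid dividing by zero
"""


def top_deltas(conn, limit=10):
    """Return (largest gains, largest losses), each capped at `limit` TAZs."""
    rows = conn.execute(DELTA_SQL).fetchall()
    gains = sorted((r for r in rows if r[1] > 0), key=lambda r: r[1], reverse=True)
    losses = sorted((r for r in rows if r[1] < 0), key=lambda r: r[1])
    return gains[:limit], losses[:limit]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE origin_trips (scenario TEXT, taz INTEGER, trips REAL)")
conn.executemany("INSERT INTO origin_trips VALUES (?, ?, ?)", ROWS)
gains, losses = top_deltas(conn)
```

Reporting gains and losses as two separate lists mirrors the prompt's request to show positive and negative differences separately.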

The Tech Stack

  • LLM Orchestration

    LangChain + Gemini for agentic reasoning and complex SQL tool-calling.

  • Context Engineering

    Engineered a Domain-Specific Prompt Layer that resolves transportation anaphora and ambiguous intent into precise SQL operations across DuckDB.

  • Frontend & Visualization

    Streamlit for the UI and Folium for reactive geospatial mapping.


AI Sign Reader & CDS Integration

Project Overview

Managing the "curb" is one of the most data-intensive challenges in modern urban mobility. At my current firm, I engineered an automated Computer Vision Pipeline that transforms raw Street View imagery into machine-readable parking regulations, fully compliant with the Curb Data Specification (CDS).

By utilizing Gemini as the multimodal intelligence engine, I bypassed traditional OCR limitations to interpret complex, overlapping parking signs in varying environmental conditions. To bridge the gap between LLM creativity and database reliability, I implemented a Pydantic-enforced Semantic Layer, ensuring that every extracted regulation strictly adheres to the CDS nested JSON schema.
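
The idea of a schema-enforced layer can be sketched with standard-library dataclasses standing in for Pydantic. The field names and allowed values below are hypothetical simplifications chosen for illustration; they are not the real CDS schema or the project's models.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical allowed values; the real specification defines its own vocabulary.
ALLOWED_ACTIVITIES = {"parking", "no parking", "loading", "no stopping"}
ALLOWED_DAYS = {"mon", "tue", "wed", "thu", "fri", "sat", "sun"}


@dataclass
class TimeSpan:
    days_of_week: List[str]
    start: str  # "HH:MM"
    end: str    # "HH:MM"

    def __post_init__(self):
        bad_days = set(self.days_of_week) - ALLOWED_DAYS
        if bad_days:
            raise ValueError(f"unknown days: {sorted(bad_days)}")
        for value in (self.start, self.end):
            hours, _, minutes = value.partition(":")
            if not (hours.isdigit() and minutes.isdigit()
                    and 0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
                raise ValueError(f"invalid time of day: {value!r}")


@dataclass
class Regulation:
    activity: str
    time_spans: List[TimeSpan]

    def __post_init__(self):
        if self.activity not in ALLOWED_ACTIVITIES:
            raise ValueError(f"unknown activity: {self.activity!r}")


def parse_regulation(payload):
    """Validate a nested dict (for example, parsed model output) into typed objects."""
    spans = [TimeSpan(**span) for span in payload["time_spans"]]
    return Regulation(activity=payload["activity"], time_spans=spans)


regulation = parse_regulation({
    "activity": "no parking",
    "time_spans": [{"days_of_week": ["mon", "tue"], "start": "08:00", "end": "18:00"}],
})
```

Anything the model returns that falls outside the allowed vocabulary raises an error instead of reaching the database, which is the same guarantee a Pydantic model would provide with less hand-written validation.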

Multimodal Extraction Example

To ensure regulatory compliance, the pipeline must translate visual cues (colors, arrows, time windows) into strict Curb Data Specification (CDS) attributes. Below is an example of the agent identifying a multi-regulation sign:

Raw Street View Input
Raw parking sign image from street view
AI-Extracted Regulations
Screenshot of extracted CDS JSON format
From Pixels to Policy: Gemini interprets the sign's logic, and the Pydantic layer then maps it to the CDS format, ensuring the final output is ready for the urban management database.

The Tech Stack

  • Intelligence & Extraction

    Gemini for multimodal sign interpretation and Pydantic for enforcing strict Curb Data Specification (CDS) schemas.

  • Data Infrastructure

    Google Cloud Storage (GCS) for scalable raw image hosting and PostgreSQL for querying extracted regulatory metadata.

  • QAQC & Monitoring

    Streamlit internal dashboard for human-in-the-loop verification, accuracy scoring, and rapid prompt iteration.


Off Duty

Life in Motion

Interests & Lifestyle

I am a climbing psychopath. Climbing has entirely shaped my lifestyle; it drives me to stay disciplined with my diet, mobility training, weightlifting, and flexibility. It brings out the absolute best in me, both physically and mentally, and I have made a lot of amazing friends along the way.

Red Rock Climbing
Red Rocks, Nevada
RRG Rock Climbing
Red River Gorge, Kentucky

Beyond the crag, I am an enthusiast for anything that involves being outside, whether it is backpacking, hiking, kayaking, or camping. Exploring the wilderness is how I recharge away from work and screens.

Hiking
Waipi'o Valley, Hawaii
Kayaking
Kealakekua Bay, Hawaii

My favorite destination so far is Havasu Falls on the Havasupai Indian Reservation. Promise me that since you’ve read this far, you will add it to your bucket list. The turquoise water against the red canyon walls is a sight you will never regret seeing.

Havasu Falls
The Navajo Nation
Havasu Falls
Havasu Falls

The Real Supervisors

Finally, meet my two "supervisors," Coco (a brown tabby) and Cody (a tuxedo). They have been supporting my work since the pandemic, often providing "destructive" suggestions by jumping on my keyboard at critical moments. Luckily, I have version control to manage their unsolicited code reviews.

Coco the cat
Coco: Senior Keyboard Analyst
Cody the cat
Cody: Chief Purr Officer