  • $299

Machine Learning with Text in Python

Solve text-based data science problems using Machine Learning and Natural Language Processing!

You won't find a better course to learn about NLP and Machine Learning in Python anywhere else! Kevin has a way of making difficult topics very accessible and understandable. I was able to quickly apply much of the theory and code regarding NLP and Machine Learning from this course to my own job.

- Cliff Baker, Statistician

Is this course right for you?

Are you trying to master Machine Learning in Python, but tired of wasting your time on courses that don't move you towards your goal? Do you recognize the enormous value of text-based data, but don't know how to apply the right Machine Learning and Natural Language Processing techniques to extract that value?

In this Data School course, you'll gain hands-on experience using Machine Learning and Natural Language Processing to solve text-based data science problems. By the end of the course, you'll be able to confidently apply these techniques to your own data science problems.

How is it different from other online courses?

Most data science courses suffer from a host of problems: They're poorly taught, lack the necessary depth, and include unexplained or broken code. They don't teach you how to apply what you're learning, and when you do apply it, there's no way to know how well you're doing.

But in this course, we'll go deep into Machine Learning with text, focusing on application from day one. We'll spend most of our time writing Python code, and you'll understand how every single line relates to the problem we're solving. You'll practice what you're learning through carefully crafted lessons and assignments.

At the end of this course, you'll leave with valuable Machine Learning experience, high-quality code that you can reuse to solve future text-based problems, and a wealth of curated resources to help you deepen your understanding of each course topic.

The course was a perfect introduction to Machine Learning with text, and I was able to apply topics covered during the first week to my work. Kevin does a great job of breaking down complex topics and providing a practical, real-world context for them.

- Ryan Cranfill, Data Scientist

Course Description

In this self-paced course, you'll learn how to build effective Machine Learning models using text-based data to solve your own data science problems. The course includes:

  • 14 hours of high-quality instructional videos

  • Well-commented lesson notebooks in Jupyter format (also available as Python scripts)

  • Substantial homework assignments (with provided solutions) to help you practice everything you're learning

  • A list of readings and videos to help prepare you for each class

  • Links to 100+ carefully selected resources to deepen your understanding of course topics

  • Lifetime access to all course materials

  • Money-back guarantee (within 30 days of purchase)

I used to work as a software developer and your course helped me to move on. I now have a job in the NLP/Machine Learning field which I am more passionate about.

- Jose Navarro, Machine Learning Engineer

Course Outline

Each module includes 2 to 4 hours of instructional videos, 1 lesson notebook, 1 to 2 homework assignments, and 15 to 20 supplementary resources.

Module 1: Working with Text Data in scikit-learn

By the end of this module, you'll be able to confidently perform the basic workflow for Machine Learning with text: creating a dataset, extracting features from unstructured text, building and evaluating models, and inspecting models for further insight. You'll also gain an understanding of Unicode, enabling you to troubleshoot encoding-based errors.

  • Extracting features from unstructured text using CountVectorizer

  • Building a MultinomialNB model for text classification

  • Examining a model for further insight

  • Model evaluation:

    • accuracy_score

    • confusion_matrix

    • roc_auc_score

  • Comparing MultinomialNB with LogisticRegression

  • Building a new dataset from individual text files using pandas

  • Unicode basics

  • Handling Unicode errors
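
To give a rough sense of the basic workflow listed above, here is a minimal sketch. It is not taken from the course materials: the messages and labels are made up for illustration, and the variable names are my own assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Made-up toy messages for illustration only (1 = spam, 0 = not spam)
train_text = ["win cash now", "free prize win", "call me tonight", "see you at lunch"]
train_labels = [1, 1, 0, 0]
test_text = ["win a free prize", "lunch tonight"]
test_labels = [1, 0]

# Learn the vocabulary from the training text and build a document-term matrix
vect = CountVectorizer()
X_train = vect.fit_transform(train_text)

# Reuse the fitted vocabulary for the test text (transform only, no refitting)
X_test = vect.transform(test_text)

# Fit a Naive Bayes classifier and predict the test labels
nb = MultinomialNB()
nb.fit(X_train, train_labels)
predictions = nb.predict(X_test)

# Evaluate the predictions against the known test labels
print(accuracy_score(test_labels, predictions))
print(confusion_matrix(test_labels, predictions))
```

The key design point is that the vectorizer is fit on the training text only, so the test text is represented using the vocabulary the model was trained on.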

Module 2: Applying Natural Language Processing Techniques to Machine Learning

By the end of this module, you'll be able to apply a handful of Natural Language Processing techniques to Machine Learning problems in order to improve the effectiveness of your models. You'll also learn how to perform sentiment analysis and build a simple document summarization tool for your own corpus of text.

  • What is Natural Language Processing (NLP)?

  • NLP terminology and examples

  • Tuning CountVectorizer for better model performance:

    • n-grams

    • stop words

    • corpus-specific stop words

    • minimum document frequency

  • Term Frequency-Inverse Document Frequency (TF-IDF) using TfidfVectorizer

  • Text summarization

  • Sentiment analysis using TextBlob
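
As an illustration of the vectorizer tuning options listed above, here is a small sketch on made-up review snippets. The documents and the custom stop word list are assumptions chosen for the example, not course content, and TextBlob is left out to keep the sketch dependent on scikit-learn alone.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up documents for illustration only
docs = [
    "the food was great",
    "the service was slow",
    "great food and great service",
]

# Tuned CountVectorizer:
# - ngram_range=(1, 2) keeps single words and two-word phrases
# - stop_words is a hand-picked, corpus-specific list (an assumption for this toy corpus)
# - min_df=2 drops any term that appears in fewer than 2 documents
vect = CountVectorizer(ngram_range=(1, 2), stop_words=["the", "was", "and"], min_df=2)
dtm = vect.fit_transform(docs)
features = sorted(vect.vocabulary_)
print(features)

# TF-IDF down-weights terms that appear in many documents
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
idf_slow = tfidf.idf_[tfidf.vocabulary_["slow"]]    # appears in 1 document
idf_great = tfidf.idf_[tfidf.vocabulary_["great"]]  # appears in 2 documents
print(idf_slow > idf_great)
```

Note that stop words are removed before n-grams are built, so a bigram such as "food great" can appear even though those words were never adjacent in the original text.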

Module 3: Parsing Text Data Using Regular Expressions

By the end of this module, you'll be able to extract text features from messy data sources using regular expressions. You'll learn the basic rules and syntax that can be applied across programming languages, and you'll master the most important Python functions and options for working with regular expressions.

  • Basic rules and principles

  • Searching with re.search

  • Metacharacters

  • Greedy and lazy quantifiers

  • Match groups

  • Character classes

  • Alternatives

  • Substitution with re.sub

  • Anchors

  • Option flags

  • Efficiently searching for multiple matches with re.findall

  • Improving performance with re.compile

  • Writing readable regular expressions with re.VERBOSE
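
To make a few of these topics concrete, here is a short sketch using Python's built-in re module. The sample string and patterns are invented for illustration and are not drawn from the course notebooks.

```python
import re

# Made-up sample text for illustration only
text = "Order <b>A-1001</b> shipped 2016-03-14; order <b>B-2002</b> shipped 2016-04-02."

# Greedy quantifier: .* grabs as much as possible, spanning both tags
greedy = re.search(r"<b>.*</b>", text).group()

# Lazy quantifier: .*? stops at the first closing tag
lazy = re.search(r"<b>.*?</b>", text).group()

# re.compile plus re.VERBOSE: a reusable pattern with inline comments
date_pattern = re.compile(r"""
    (\d{4})   # year (match group 1)
    -
    (\d{2})   # month (match group 2)
    -
    (\d{2})   # day (match group 3)
""", re.VERBOSE)

# Match groups from the first date found
year, month, day = date_pattern.search(text).groups()

# re.findall returns one tuple of groups per match
all_dates = date_pattern.findall(text)

# Substitution with a character class: replace order IDs with a placeholder
redacted = re.sub(r"[A-Z]-\d{4}", "ID", text)

# Option flag: case-insensitive search for every "order"
orders = re.findall(r"order", text, flags=re.IGNORECASE)

# Anchor: ^ only matches at the start of the string
starts_with_order = re.search(r"^Order", text) is not None
```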

Module 4: Workflow for a Text-Based Data Science Problem

By the end of this module, you'll be able to create an end-to-end workflow for solving a text-based data science problem using scikit-learn and pandas. You'll gain experience with data exploration, feature engineering, proper model evaluation, model tuning, and generating predictions for new observations.

  • Data exploration and visualization

  • Feature engineering using pandas

  • Custom tokenization using regular expressions

  • Multi-class classification

  • Model evaluation:

    • train_test_split

    • cross_val_score

    • DummyClassifier

  • Searching for optimal tuning parameters using GridSearchCV

  • Chaining steps into a Pipeline

  • Making predictions for out-of-sample data
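
The sketch below shows one way these pieces can fit together. It is an assumption-laden toy example rather than the course's own workflow: the twelve documents, the three class labels, the token pattern, and the parameter grid are all made up, and the scores on such a tiny dataset are not meaningful.

```python
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up multi-class dataset for illustration only
docs = [
    "the team won the final match", "a late goal decided the game",
    "the coach praised the players", "fans cheered at the stadium",
    "the new phone has a fast processor", "the laptop battery lasts all day",
    "developers released a software update", "the app crashed after the update",
    "the soup needs more salt", "bake the bread until golden",
    "the sauce pairs well with pasta", "fresh herbs improve the salad",
]
labels = ["sports"] * 4 + ["tech"] * 4 + ["food"] * 4

# Custom tokenization: a regular expression that keeps alphabetic tokens of 2+ letters
vect = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{2,}\b")

# Chain feature extraction and the model so both are refit inside each CV fold
pipe = make_pipeline(vect, MultinomialNB())
scores = cross_val_score(pipe, docs, labels, cv=2, scoring="accuracy")

# Baseline for comparison: always predict the most frequent class
baseline_pipe = make_pipeline(CountVectorizer(), DummyClassifier(strategy="most_frequent"))
baseline = cross_val_score(baseline_pipe, docs, labels, cv=2, scoring="accuracy")

# Search over tuning parameters; make_pipeline names each step after its lowercased class
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.5, 1.0],
}
grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(docs, labels)

# The best pipeline is refit on all of the data and can score out-of-sample text
new_prediction = grid.predict(["the team celebrated the win"])
print(scores.mean(), baseline.mean(), grid.best_params_, new_prediction)
```

Putting the vectorizer inside the pipeline means the vocabulary is learned only from each training fold, which keeps the cross-validated scores honest.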

Module 5: Advanced Machine Learning Techniques

By the end of this module, you'll be able to apply advanced Machine Learning techniques to improve the accuracy of your models and the efficiency of your workflow. You'll learn how to build and tune a multi-step, multi-level Machine Learning pipeline, as well as how to ensemble and stack your models.

  • Using a Pipeline for proper cross-validation

  • Tuning a Pipeline with GridSearchCV

  • Efficiently searching for tuning parameters using RandomizedSearchCV

  • Stacking sparse and dense feature matrices using SciPy

  • Combining the results of multiple feature extraction processes using FeatureUnion

  • Building multi-level pipelines and feature unions

  • Building custom transformers using FunctionTransformer

  • Improving classifier performance through ensembling

  • Unsupervised document clustering using cosine similarity

  • Basic strategies for model stacking
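
As a hedged illustration of combining sparse and dense features, the sketch below builds the same feature matrix two ways: manually with SciPy, and with a FeatureUnion that wraps a custom function in FunctionTransformer. The documents, labels, and the text_length helper are hypothetical and exist only for this example.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Made-up documents and labels for illustration only
docs = [
    "great", "loved it", "really good value",
    "this was a long and disappointing experience overall",
    "the product stopped working after only two days of use",
    "slow delivery and the packaging was badly damaged",
]
labels = [1, 1, 1, 0, 0, 0]


def text_length(texts):
    """Hypothetical dense feature: character count of each document, as a 2D array."""
    return np.array([[len(t)] for t in texts])


# Manual approach: stack the sparse document-term matrix with the dense column
dtm = CountVectorizer().fit_transform(docs)
manual = sp.hstack([dtm, sp.csr_matrix(text_length(docs))]).tocsr()

# FeatureUnion approach: run both feature extraction steps and stack the results
union = FeatureUnion([
    ("words", CountVectorizer()),
    ("length", FunctionTransformer(text_length, validate=False)),
])
combined = union.fit_transform(docs)

# The union can then be one step of a larger, tunable pipeline
pipe = make_pipeline(union, LogisticRegression())
pipe.fit(docs, labels)
predictions = pipe.predict(docs)
```

Wrapping the custom feature in a transformer, instead of stacking by hand, lets the whole process be cross-validated and tuned as a single estimator.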

Practical and easy-to-follow course on advanced topics in Machine Learning. Videos are incredible, full of tips and resources. Outstanding teaching skills by Kevin and his team.

- Miguel Angel Regalado, Digital Analytics Consultant

Frequently Asked Questions

Is this a beginner course?

No. This is an intermediate course, with specific prerequisites:

  • You should be comfortable working in Python.
  • You should understand the basic principles of Machine Learning.
  • You should be comfortable using scikit-learn.
  • You should have at least limited experience with pandas.
  • No knowledge of advanced mathematics is required.

How do I know whether I'm ready for the course?

Review the content from my free scikit-learn course and my free pandas course. If you are comfortable with most of the content, you are ready for the course!

If you are new to Python, I recommend first enrolling in Python Essentials for Data Scientists. If you are unsure whether you meet the course requirements, please email me at kevin@dataschool.io.

What types of people have taken this course?

Here are the job titles of some of my past students:

  • Data professionals: Data Scientist, Director of Data Science, Statistician, Business Intelligence Developer, Analytics Manager, Quantitative Analyst, Data Analytics Architect, Data Journalist
  • Engineers: Senior Software Engineer, Data Engineer, Network Engineer, Back-End Developer, Director of Engineering, Air Pollution Engineer
  • Scientists: Artificial Intelligence Researcher, Chief Scientist, Research Associate, Computational Linguist, Applied Mathematician, Psychiatrist, Computational Chemist, Geophysicist, Cognitive Scientist
  • Other: Graduate Student, Product Manager, Security Consultant, Business Analytics Instructor, Creative Director, Corporate Strategist, Web Developer, Internet Marketer, Lawyer, Entrepreneur, System Administrator, Project Manager, Python Instructor, Kaggle Master

Why should I learn how to work with text?

Most knowledge created by humans is raw, unstructured text. By learning how to transform text into data that is usable by Machine Learning models, you drastically increase the amount of data that your models can learn from.

What will I be able to do by the end of this course?

  • Convert unstructured text into a format that is suitable for Machine Learning
  • Apply appropriate model building, model evaluation, and feature engineering techniques to text-based problems
  • Tune the feature extraction and model building pipeline for optimal performance
  • Build a custom sentiment analysis or document summarization tool for your own corpus of text
  • Extract features from messy data sources using regular expressions
  • Utilize a more efficient workflow with scikit-learn and pandas

Which version of Python do I need for the course?

Both Python 2 and Python 3 are perfectly acceptable.

What libraries will we be using?

The majority of the content will use scikit-learn, though we will also use pandas to help us load, prepare, and visualize data. We will use the built-in re module for regular expressions. We will also make limited use of NumPy, SciPy, Matplotlib, Seaborn, and TextBlob.

Why doesn't the course use NLTK or spaCy?

This course focuses on supervised Machine Learning. scikit-learn is by far the best Python library for solving most Machine Learning problems, including those problems in which the input data is text.

In contrast, NLTK and spaCy are Natural Language Processing (NLP) libraries which specialize in language-oriented tasks such as part-of-speech tagging, dependency parsing, named entity recognition, and so on.

Although basic NLP techniques are covered in this course, Machine Learning is still the primary focus, and thus scikit-learn is the best choice.

Is the course material up-to-date?

After creating the course in 2016, I spent hundreds of hours refining and updating it based on student feedback. In 2019, I updated the course code to use up-to-date libraries, including Python 3.7 and scikit-learn 0.21.

All of the processes I teach in the course are still the ones I recommend today, with the exception of one small section of Module 5. (However, I added an addendum notebook to demonstrate the current best practices for that specific section.)

How is this course different from other online courses?

  • The course is application-focused, providing you with skills that you can immediately apply to your own data science problems.

  • The course is taught by an experienced data science instructor.

  • The lesson notebooks are carefully crafted and will serve as excellent reference materials for years to come.

  • All of the code is thoroughly explained, well-written, and compatible with both Python 2 and 3.

  • The homework assignments enable you to immediately practice what you have learned, and the included solutions are fully commented.

  • The 100+ post-class resources build directly on the course material, and will help you to explore each topic in more depth.

  • You will have lifetime access to all course materials.

Who is the instructor?

The course instructor is Kevin Markham: Founder of Data School and former Lead Data Science Instructor for General Assembly in Washington, DC.

I have more than 400 hours of classroom experience teaching data science in Python, and 1,000+ hours of experience creating data science educational materials, mentoring data science students, and training other data science instructors.

Can I see a sample of the course content?

I previously presented a 3-hour tutorial based on a portion of this course. The tutorial recording should give you a good idea of my teaching style.

Do you offer any discounts?

Yes! I offer Purchasing Power Parity discounts (also known as location-based discounts) for all of my paid courses. If you're located in one of the 160+ qualifying countries, you should automatically see a discount code at the top of this page.

I also offer student discounts and hardship-based discounts, regardless of where you live. Please email me at kevin@dataschool.io and I'd be happy to send you the appropriate discount code.

What happens when I enroll in the course?

Shortly after enrolling in the course, you will be given access to all course materials. You can work through the course at your own pace.

How much time will the course take to complete?

Past students have said that they spent anywhere from 40 to 60 hours working through the course.

How long will I have access to the course?

You will have lifetime access to all course materials.

What will happen at the end of the course?

You'll leave the course with valuable Machine Learning experience, high-quality code that you can reuse to solve future text-based problems, and a wealth of curated resources to help you deepen your understanding of each course topic.

Will I receive a certificate of completion?

Yes, you will receive an official certificate of completion from Data School after completing the course.

Have students been happy with the course?

In an anonymous post-course survey, students were asked to rate the course on a scale from 1 (poor) to 5 (excellent). On average, students have rated the course 4.70 for "quality of content", 4.84 for "quality of instruction", and 4.60 for "overall value provided by the course". In addition, 93% of students reported that the course had "helped them to make progress towards their personal or professional goals."

What if I'm not happy with the course?

I offer a "Love it or leave it" guarantee: If you don't love the course, cancel within 30 days of purchase and I'll give you a full refund, no questions asked.

What if I need help with the course?

You can post a question below any lesson, and I'll do my best to respond!

I have more questions...

Please email me at kevin@dataschool.io. I'd be happy to answer all of your questions!

Join 650+ happy students...

Asif Mehedi (Research Associate)

Having known Kevin from his videos on scikit-learn and more recently pandas, I've long admired his ability to explain difficult concepts in clear language. This new course was no exception. I now feel prepared to use Machine Learning in my text analysis projects.

Jeff Weakley (Creative Director)

Kevin isn't just a programmer/data scientist, he's a great teacher. If I had paid a lot more for this class, it would still have been worth it. After taking a lot of other online courses, I feel like I'm finally getting valuable skills, tools and info I can use and financially benefit from.

Harvey Summers (Data Management Specialist)

I've taken a couple of courses with Kevin and recommend him highly. He does a great job of making the complex simple, and it's amazing how much you learn in just a few weeks. The best course on Machine Learning I've taken so far!

Dr. Kathleen Perez-Lopez (Senior Data Scientist)

In Machine Learning with Text, Kevin does a superb job walking you through this topic at an intermediate level. The classes were totally engrossing. His homework assignment for each class steps you through a process so that you don't get stuck at any stage. He points to a tremendous amount of carefully curated supporting material. I'm looking forward to his next offering.

Charles Franzen (Assistant Education Manager)

Kevin's courses are focused and coherent. Tools learned each week build upon and complement one another, and the classes culminate with a larger-scale project that shows how what you've learned can come together. I learned a great deal, and highly recommend Data School to those looking to explore Machine Learning tools in greater depth.

Chandler McCann (Data Science Practice Lead)

Kevin is an extremely knowledgeable teacher who cares about his students and puts a ton of preparation into his courses and materials. Kevin's approach to teaching data science is logical, well-structured and accessible. I highly recommend this course.

Leo Lillard (Senior Data Scientist)

If you like Kevin's material, you will love his classes. Full of information that will take your Python skills to the next level and will leave you wanting more. This will be the best $300 you spend on training.

Wolfgang Guba (Computational Chemist)

I now feel very confident to use pandas and scikit-learn for Machine Learning. I can highly recommend Kevin's course, he is a great teacher and can explain difficult concepts really well!

Dr. Jovian Lin (Senior Research Scientist)

Kevin is the teacher that everyone needs. He makes complicated concepts simple through a meticulously well-crafted course plan.

👋 Welcome to Data School!

My name is Kevin, and I've taught Data Science in Python to over a million students.

My courses explain data science topics in a clear, thorough, and step-by-step manner.

I'd love to teach you, regardless of your educational background or professional experience.

Thanks for joining me! 🙌