Ai2Go

Open Source and AI

Foundational Principles of Open Source

The term open source generally describes software that comes with essential freedoms, granted in advance, that allow anyone to use, improve, and redistribute it (modified or unmodified) for any purpose. First articulated by Richard Stallman of the Free Software Foundation (FSF), these permissions are often called the four freedoms:

  • Run the software for any purpose.
  • Study how the software works through access to source code and freely adapt it.
  • Redistribute copies of the software to anyone.
  • Improve the software and redistribute those improvements to anyone.

Subsequently, the Open Source Initiative (OSI) took a leading role in recognizing OSS licenses and expanded these four freedoms into ten criteria for determining whether a software license is an OSS license:

  • Free redistribution: Software must be made available for redistribution without payment.
  • Source code: Software must be distributed with source code or well-publicized access to it.
  • Derived works: The license must allow modification of the software and distribution of resulting derived works.
  • Integrity of the author’s source code: The license may restrict distribution of modified source code only if it permits distribution of “patch files” that modify the program at build time.
  • No discrimination against persons or groups: License terms must be the same for all licensees.
  • No discrimination against fields of endeavor: For example, limiting use to non-commercial purposes is not permitted.
  • Distribution of license: The rights attached to the software must apply to everyone to whom it is redistributed, with no need to execute an additional license.
  • The license must not be specific to a product: License rights must not depend on the software being distributed with other specified software.
  • License must not restrict other software: The license must not place restrictions on software distributed together with the licensed software.
  • The license must be technology neutral: The license terms must not vary according to the type of technology involved.

As such, a license that does not meet these criteria is not recognized as OSS by the OSI. A well-known example is the license change made by MongoDB, Inc. from the AGPL to the SSPL (Server-Side Public License): the SSPL restricted offering the software as a service, and the OSI did not recognize the SSPL as an OSS license. The discussions surrounding that change are worth reviewing.

Permissive vs Restrictive OSS

Consequent to the introduction of the four freedoms, many software licenses have been produced throughout the years, differing widely in clarity, length, and legal effect. They can be broadly grouped into two categories:

Permissive OSS licenses

Permissive licenses allow a licensee to freely amend and adapt OSS code, and to combine it with proprietary code, without placing restrictions (or significant restrictions) on the resulting amendments, adaptations, or combinations (usually called “derivative works”) or on how those derivative works can be licensed onwards. The MIT and BSD 3-Clause licenses fall within this permissive category.

Restrictive OSS licenses

Restrictive OSS licenses impose licensing restrictions or requirements where the OSS is amended, adapted, or combined with any other software (whether proprietary or OSS). When a derivative work is produced, the restrictive OSS license applies (to a certain extent) to both the original OSS and any derivative works based upon it. This can be a key concern for organizations that use restrictive OSS alongside their proprietary “closed source” software, as the proprietary software could unintentionally become subject to the OSS license.

Nevertheless, almost all traditional software licenses have focused on copyright protection of software (with some exceptions, such as the Creative Commons licenses for other creative works) and have built their freedoms and restrictions on copyright law.

Machine Learning and Traditional OSS licenses

With the rise of ML, several entrepreneurs, including Elon Musk and Sam Altman, launched an early effort to create AI/ML that would be open source and free to the public. OpenAI, which grew out of this endeavor, later released its product ChatGPT under a for-profit, closed-source model, slowing the momentum towards an open-source AI/ML era. However, that has not stopped individual developers from embracing the open-source model for AI/ML, as the sheer number of AI/ML models released on Hugging Face under OSS licenses shows. Hugging Face's model listings show that close to 100,000 AI models have been released under the Apache license, while another 50,000 models have been released under the MIT or BSD 3-Clause license. Against this backdrop, a discussion has started on whether traditional OSS licenses, which are based on copyright law, can regulate the important aspects of AI/ML, notably a model's weights, architecture, usage, and datasets.

EU AI Act and OSS

On another front, the European Union has kicked off a debate on OSS and AI/ML with the enactment of the EU AI Act. The Act, which regulates the use and distribution of AI in the European Union, gives considerable prominence to open-source AI as a means of promoting innovation in the EU. Recital 61, which relates to General Purpose AI, contains the following:

Software and data, including models released under a free and open-source license that allows them to be openly shared, and where users can freely access, use, modify, and redistribute them or modified versions thereof, can contribute to research and innovation in the market and can provide significant growth opportunities for the Union economy. General purpose AI models released under free and open-source licenses should be considered to ensure high levels of transparency and openness if their parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available. The license should be considered free and open source also when it allows users to run, copy, distribute, study, change, and improve software and data, including models, under the condition that the original provider of the model is credited and the identical or comparable terms of distribution are respected.

The recitals make it clear that the EU is embracing the four freedoms of OSS as a vehicle to promote AI in the EU, and that a provider's mere assertion that a model is covered by an OSS license is not sufficient to gain recognition in the EU. The Act goes further and gives open-source models a boost by exempting an AI model or platform from parts of the Act if it can be considered open source.

The first exception addresses General Purpose AI models (which would likely cover models such as Llama 3 if released under an OSS license). It states that the Regulation shall not apply to AI models that are made accessible to the public under a free and open-source license and whose parameters, including the weights, the information on the model architecture, and the information on model usage, are made publicly available, except for the obligations referred to in Articles C(1)(c) and (d) and Article D (which cover compliance with copyright law and the provision of sufficiently detailed summaries of the content used to train the models).

The second exception relates to AI systems (pre-trained systems built to perform a specific task, such as ChatGPT). It states that the Regulation shall not apply to AI systems provided under free and open-source licenses unless they are high-risk AI systems or AI systems that fall under Titles II and IV.

OSS Foundation Principles and AI/ML Components

As the push to adopt OSS models for AI/ML intensifies, it is important to understand which components of an AI/ML model need to be released and disclosed for that model to be considered OSS under the foundational principles and the EU AI Act, and whether the existing licenses sufficiently cover or protect these different elements. Because AI/ML models contain many different components, the classification below attempts to separate the components that are vital for exercising the OSS freedoms from those that, if released, greatly contribute to the effective use of OSS by downstream recipients.

Required AI/ML Components to Exercise OSS Rights under the Foundational Principles

Code Components

  • Data Pre-processing: This involves scripts and tools used to clean, transform, and prepare raw data for training. For instance, using Python scripts to handle missing values and normalize data before feeding it into a model.
  • Training, Validation, and Testing: These are the core scripts for training models, validating their performance, and testing them against unseen data. Examples include TensorFlow scripts for model training and separate scripts for validation and testing using datasets like CIFAR-10.
  • Inference Code: This component consists of code used to deploy the model and make predictions on new data. An example is a REST API implemented in Flask that serves a trained model for real-time predictions.
  • Supporting Libraries and Tools: These are additional libraries and tools required to run the model, such as NumPy for numerical operations or Scikit-learn for additional ML utilities.
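To make the data pre-processing item above more concrete, here is a minimal sketch of the kind of script it describes: filling missing values and normalizing a numeric column before training. The function names (fill_missing, min_max_normalize, preprocess) and the sample column are hypothetical and use only the Python standard library; a real project would more likely rely on tools such as NumPy or Scikit-learn.

```python
from statistics import mean


def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def preprocess(column):
    """Hypothetical pre-processing step: impute, then normalize."""
    return min_max_normalize(fill_missing(column))


# Illustrative raw column with one missing entry.
raw_ages = [20, None, 40, 30]
print(preprocess(raw_ages))
```

Without a script like this, a downstream recipient cannot reproduce the exact inputs the model was trained on, which is why the pre-processing code is grouped with the required components.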

Model Components

  • Model Architecture: This includes the design and structure of the model, such as the configuration of layers in a neural network. For example, the architecture details of a convolutional neural network used for image classification.
  • Model Parameters (including weights): These are the learned parameters during the training process. For instance, the weight matrices of a neural network after training on a dataset.

Optional Components That Help in Exercising the Foundational Principles of OSS

Code for Benchmark Tests and Evaluation

  • Code for Inference Benchmark Tests: Scripts designed to run the model on benchmark datasets to evaluate its performance. For instance, using a script to measure the performance of a trained NLP model on the SQuAD dataset.
  • Evaluation Code: Scripts used to assess model performance on various metrics. This could include code to compute precision, recall, F1-score, and plot ROC curves.
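As an illustration of the evaluation code described above, the following sketch computes precision, recall, and F1-score for a binary classifier from lists of true and predicted labels. The function name and the sample labels are hypothetical; libraries such as Scikit-learn provide equivalent, more complete utilities.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Illustrative labels: 2 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Publishing a script like this alongside the model lets recipients check reported metrics themselves rather than relying on the numbers in a model card.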

Data Components

  • Training Data Sets: These datasets are used to train the AI model. Examples include the ImageNet dataset for image recognition tasks and Wikipedia text dumps for language models.
  • Testing Data Sets: Datasets used to evaluate the final performance of the trained model, such as the MNIST dataset for handwritten digit recognition.
  • Validation Data Sets: Datasets used during training to tune model parameters and prevent overfitting. A common example is using a validation split from the COCO dataset for object detection models.
  • Benchmarking Data Sets: Standard datasets used to compare the performance of different models, like using the Stanford Question Answering Dataset (SQuAD) for benchmarking NLP models.
  • Data Cards: Documentation that provides detailed information about datasets, including their source, structure, and usage. For example, a data card detailing the characteristics and distribution of the CIFAR-10 dataset.
  • Evaluation Metrics and Results: Metrics and outcomes used to assess the model’s performance, such as accuracy, F1-score for classification models, or BLEU score for machine translation models.

Model Documentation

  • Model Card: A documentation file that provides essential details about the model, such as its intended use, performance, and limitations. For instance, a model card for a sentiment analysis model detailing its training data, performance metrics, and appropriate use cases.
  • Sample Model Outputs: Examples of the outputs produced by the model, like generated images from a GAN model or predicted sentiment labels and confidence scores for a set of text reviews.

Additional Documentation and Tools

  • Thorough Research Papers: Academic papers describing the research and methodologies used in developing the model. Examples include published papers in conferences like NeurIPS or ICML detailing the model's architecture and performance.
  • Usage Documentation: Instructions and guidelines on how to use the model and its associated tools, such as a user guide for setting up and running the model in a specific environment.
  • Technical Reports: Detailed reports providing in-depth technical information about the model, including optimization techniques and hyperparameters used in training.
  • Supporting Tools: Additional tools and utilities that assist in the development, deployment, or usage of the model, such as scripts for automating data preprocessing and model training, and utilities for visualizing model performance and debugging.

Understanding and managing the required and optional components of an AI model is vital for effective use and compliance with foundational principles of OSS. Proper documentation and tools not only ensure adherence to licensing terms but also facilitate easier adoption, reproducibility, and collaboration within the AI community. By focusing on both the necessary and beneficial components, organizations can streamline their OSS usage and contribute more effectively to the open-source ecosystem.
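The required/optional classification above can be turned into a simple release checklist. The sketch below is one possible encoding, not an established standard: the component identifiers are hypothetical names derived from the lists in this article, and the audit_release function merely reports which of them a given release includes.

```python
# Components this article classifies as required (identifiers are hypothetical).
REQUIRED = {
    "data_preprocessing_code",
    "training_validation_testing_code",
    "inference_code",
    "supporting_libraries",
    "model_architecture",
    "model_parameters",
}

# Components this article classifies as optional but good to have.
OPTIONAL = {
    "benchmark_code", "evaluation_code",
    "training_data", "testing_data", "validation_data", "benchmark_data",
    "data_cards", "evaluation_results",
    "model_card", "sample_outputs",
    "research_papers", "usage_documentation", "technical_reports",
    "supporting_tools",
}


def audit_release(released):
    """Compare a set of released components against the two lists."""
    released = set(released)
    return {
        "meets_required": REQUIRED <= released,
        "missing_required": sorted(REQUIRED - released),
        "missing_optional": sorted(OPTIONAL - released),
        "unrecognized": sorted(released - REQUIRED - OPTIONAL),
    }


# Example: a "weights-only" release with inference code and a model card.
report = audit_release({"model_parameters", "inference_code", "model_card"})
print(report["meets_required"])
print(report["missing_required"])
```

A release that ships only weights and inference code, as in the example, would fall short of the required list under this classification even if it carries an OSS license.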

ML/AI and Current OSS

Against this backdrop, we will review some of the existing software licenses that are widely used in the ML space to understand how they cover the above components and disclosure obligations. In this review, we will use Hugging Face (https://huggingface.co/) as the source for identifying trends in license adoption, as it hosts more than 350,000 models, 75,000 datasets, and 150,000 demo applications (Spaces) on its platform, making it the most extensive collection of models and datasets.

We will examine these licenses under two broad categories: the traditional OSS licenses that appear to be the most widely used for licensing ML models and associated material, and a newer category of licenses that expand the scope of the license to downstream use by limiting use cases, for example by capping the number of users or prohibiting use of the models for malware generation and the like.

Traditional OSS

Apache

The Apache License Version 2.0 is a permissive license that allows users to integrate the licensed software into their own projects, including commercial applications. Users are free to use, modify, and distribute the software provided they comply with the license terms. These terms include retaining the original copyright notice, providing a copy of the license with any distribution, and not using the contributors' names for promotional purposes without permission. Additionally, the Apache License grants an express patent license from contributors to users, making it suitable for large projects with multiple contributors.

In the context of machine learning, the following sections are pertinent to understanding the scope of the license:

  • Scope: Covers "Work and such Derivative Works in Source or Object form."
  • Work: Defined as "a work of authorship, whether in Source or Object form, made available under the License."
  • Derivative Works: Defined as "any work, whether in Source or Object form, that is based on (or derived from) the Work."
  • Source: Includes "software source code, documentation source, and configuration files."
  • Object: Any form resulting from mechanical transformation or translation of a Source form.

Given that the Apache License is primarily a copyright license, it is geared towards covering the creative elements of the model. Unlike the GPL, it does not contain broad language that would incorporate additional elements beyond the source code and object code. Consequently, non-creative components such as weights, parameters, and raw data are unlikely to be covered by the license.

MIT License

The MIT License is a permissive open-source software license. Users are permitted to integrate the software into their own projects, even for commercial purposes, provided that the original copyright notice and permission notice are included in all copies or substantial portions of the software.

BSD License

The BSD 3-Clause License, often referred to as the "New BSD License" or "Modified BSD License," is a permissive open-source license that allows for the free use, modification, and distribution of software.

The MIT and BSD licenses are minimalistic licenses that grant permission to distribute software with very few restrictions. Due to their simplistic language, these licenses do not explicitly address components such as weights and data associated with machine learning models. Consequently, they may not provide adequate protection or coverage for these specific elements within ML models. As a result, developers using these licenses should be aware that additional measures might be necessary to protect these critical components of their models.

AFL License

The Academic Free License (AFL) is an open-source software license that grants users the freedom to use, modify, and distribute the software, including for commercial purposes. It ensures proper credit to the original authors by requiring that the license and copyright notice be included in all copies or significant portions of the software. The AFL also provides a strong patent grant, offering legal protection for both users and developers against patent infringement claims. Additionally, it includes a disclaimer of warranties, meaning the software is provided "as is" without any guarantees.

The scope of the AFL is broader than most open-source software licenses, as it covers any original work of authorship whose owner has placed the specified licensing notice adjacent to the copyright notice for the original work. Despite the limitation imposed by the term "copyright holder," the AFL has the potential to encompass a wide range of elements within machine learning components (excluding raw data) as long as they are considered original works. This broad scope makes the AFL a flexible option for licensing in the evolving field of ML and AI technologies.

GNU GPL

The GNU General Public License (GPL) Version 2, released in June 1991, is a copyleft (often called "viral") software license that allows developers to use, modify, and integrate the software, provided that distributed modifications are released under the same GPL terms. Its central provision, Section 2(b), requires that any distributed work derived from or containing GPL-licensed code be licensed as a whole under the GPL, so that the four freedoms of free software propagate downstream. Together with the license's source-code provisions, this mechanism ensures that the source code of GPL-covered software remains available to users, preserving transparency and the freedom to modify and distribute the software under the same terms.

Of the OSS licenses cited above, the GNU GPL and the AFL have the broadest scope. The GPL, for example, is meant to cover “…any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License…”

While the reference to the copyright holder is a limiting factor, the GNU GPL does provide a more robust framework and scope for regulating the newer aspects of ML models.

New or Newer Licenses in ML Space

BigCode Open RAIL-M v1 License Agreement

The BigCode Open RAIL-M v1 License Agreement is a novel software license crafted to address additional considerations related to the downstream use of machine learning (ML) models. This license grants users the freedom to use, modify, and distribute the model, including for commercial purposes, as long as they comply with specified restrictions designed to prevent harmful applications such as generating malware or other malicious content. The license attempts to balance open access with ethical considerations to ensure technology is used responsibly and transparently.

Despite being a new license, it still uses the term "copyright license," which inherently limits its scope to models and elements covered by copyright. Nonetheless, the license makes a significant effort to incorporate ethical and legal use of software in a concrete manner, potentially setting a strong foundation for future software licensing models. This focus on responsible use and comprehensive ethical guidelines could serve as a valuable template for the evolution of software licenses in the ML domain.

Llama Licenses

The Meta Llama 3 Community License is a permissive license designed to offer users significant freedom while maintaining protective measures for Meta. Under this license, users can use, modify, and distribute the Llama materials, which include machine learning model code, trained model weights, and related software, for both personal and commercial purposes. Key stipulations include the requirement to include a copy of the license with any distribution and prominently feature the statement "Built with Meta Llama 3." Additionally, users are prohibited from using Llama 3 materials to enhance other large language models, and entities with over 700 million monthly active users must obtain a separate license from Meta for large-scale commercial use.

While the Llama licenses are free of charge, they do not qualify as open-source software (OSS) licenses, as their restrictions on use and benchmarking contradict foundational OSS principles. However, the license covers a broad scope, including "foundational large language models and software and algorithms, machine-learning model code, trained model weights, inference-enabling code, training-enabling code, fine-tuning enabling code, and other related elements." It extends beyond a simple copyright license to encompass all forms of intellectual property and includes contract-based language for enforcement. Thus, although it is not an OSS license, the Llama license provides a solid foundation for an OSS model in terms of scope and licensing terms.
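The scope assessments made in the license reviews above can be collected into a small lookup table. The sketch below is only a summary of this article's own characterizations, paraphrased into short labels; it is not legal advice, and the field names, label wording, and the licenses_where helper are hypothetical.

```python
# Paraphrased from this article's reviews; labels are illustrative, not legal findings.
LICENSE_SCOPE = {
    "Apache-2.0": {
        "group": "traditional",
        "weights": "unlikely to be covered",
        "use_restrictions": False,
    },
    "MIT": {
        "group": "traditional",
        "weights": "not explicitly addressed",
        "use_restrictions": False,
    },
    "BSD-3-Clause": {
        "group": "traditional",
        "weights": "not explicitly addressed",
        "use_restrictions": False,
    },
    "AFL": {
        "group": "traditional",
        "weights": "potentially covered if an original work",
        "use_restrictions": False,
    },
    "GPL": {
        "group": "traditional",
        "weights": "broad scope; may reach newer ML elements",
        "use_restrictions": False,
    },
    "BigCode OpenRAIL-M v1": {
        "group": "newer",
        "weights": "limited to copyrightable elements",
        "use_restrictions": True,
    },
    "Meta Llama 3 Community": {
        "group": "newer",
        "weights": "explicitly named",
        "use_restrictions": True,
    },
}


def licenses_where(**criteria):
    """Return license names whose row matches every given field value."""
    return [
        name
        for name, row in LICENSE_SCOPE.items()
        if all(row.get(key) == value for key, value in criteria.items())
    ]


print(licenses_where(use_restrictions=True))
print(licenses_where(weights="explicitly named"))
```

Laid out this way, the gap the article describes becomes visible: in this summary, the only license that names trained weights explicitly is also one that carries use restrictions.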

Time to Think Beyond Traditional Legal Basis of OSS and Traditional Scope of OSS

Traditional open-source software (OSS) licenses have primarily focused on licensing the copyrightable elements of software programs, offering varying degrees of freedom and additional covenants. However, applying these same OSS frameworks to cover all aspects of machine learning (ML) and artificial intelligence (AI) technologies may not be sufficient. The unique nature of ML and AI models, which often encompass a diverse range of components beyond just software code, necessitates the introduction of new versions of OSS licenses that address the specific nuances of these technologies. The existing OSS licenses, while valuable, may fall short in effectively regulating and promoting the downstream use and benefits of ML and AI models.

There is precedent for expanding the scope of OSS to include contract claims, as demonstrated in the case of Artifex Software Inc. v. Hancom Inc., No. 16-cv-06982-JSC (N.D. Cal. Dec. 5, 2016). Additionally, the Llama licenses exemplify how an expanded definition can incorporate different components of ML into a licensing model. It is now time to develop an updated OSS model, such as a successor to Apache 2.0 or a GPL 4.0, specifically tailored for ML and AI models. Such a robust OSS licensing framework is crucial for technological advancement, especially in the ML/AI sector, where developing a model can cost billions. OSS AI has the potential to enhance transparency by opening "black boxes" and, ultimately, to increase citizen trust in AI models. Proper OSS licensing would enable downstream recipients to leverage existing frameworks to achieve greater innovations without having to build ML models from scratch. This transformative moment for AI, akin to the impact of Linux, can only be realized through a well-defined licensing model grounded in OSS principles.